    High and permanent load with no task

    • F
      Foglalt @Tom Elliott
      last edited by

      @Tom-Elliott

      99421084-7d85-4958-b2d0-441b4de21497-kép.png

      It is not a virtual machine; it is an actual physical machine.

      Whether the network is private or not is not mine to decide. I made some noise today after seeing that even echo (ping) is allowed to that machine from the outside. (By the way, the netadmins said that other traffic is not allowed. I will need to push them some more to make sure it is…)

      What other information do you suggest to collect?

      • Tom ElliottT
        Tom Elliott @Foglalt
        last edited by

        @Foglalt I don’t know the generation of the CPU, but 4 cores w/ Hyperthreading = 8 available to Linux. (Some i7’s had hyperthreading, though I know they’re moving away from it.)
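As a rough way to read a load figure against the CPU count mentioned above, the sketch below divides a load average by the number of logical CPUs. The helper name `load_per_cpu` is made up for illustration; the figures 7 and 8 are the ones quoted in this thread.

```python
import os


def load_per_cpu(load_avg: float, logical_cpus: int) -> float:
    """Return the load average divided by the number of logical CPUs."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return load_avg / logical_cpus


if __name__ == "__main__":
    # Figures from this thread: load of about 7 on 8 logical CPUs.
    print(load_per_cpu(7.0, 8))
    # Live values from the machine this runs on (Unix only).
    cpus = os.cpu_count() or 1
    one_minute = os.getloadavg()[0]
    print(f"1-minute load per logical CPU: {load_per_cpu(one_minute, cpus):.2f}")
```

A value below 1.0 means the run queue is not saturating the CPUs on paper, though on Linux the load average also counts tasks blocked on I/O, so it says little on its own.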

        I understand you don’t own the network, so you cannot control it, but that can be a cause of the issue.

        Do the server and the client reside on the same network? Are they part of the same subnet?

        The kernels could be part of the cause of the slowness on the client. Which kernel versions are you using? Are you able to upgrade to 1.5.8 or 1.5.9-RC? Maybe a fix was pushed for that.

        About the “slowness” on the client system, what system is it to begin with? (Make, model, etc…)
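For the same-subnet question, a minimal sketch using Python's standard `ipaddress` module is shown here. The addresses come from the reserved documentation ranges and the /24 prefix is an assumption; substitute the real server and client addresses and the real netmask.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """Return True if both addresses fall in the same network for the given prefix length."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


if __name__ == "__main__":
    # Hypothetical addresses, not taken from the thread.
    print(same_subnet("192.0.2.10", "192.0.2.200", 24))
    print(same_subnet("192.0.2.10", "198.51.100.7", 24))
```

If the two hosts are not in the same subnet, the imaging traffic crosses at least one router or firewall, which is one more place where it can be slowed down.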

        To me, the slowness you’re seeing is not due to the FOG Server load (though I could be wrong). If the imaging part is moving as fast as it can, I highly doubt it’s the FOG Server load causing the slowness within the client at all. (Just my thoughts).

        Please help us build the FOG community with everyone involved. It's not just about coding - way more we need people to test things, update documentation and most importantly work on uniting the community of people enjoying and working on FOG! Get in contact with me (chat bubble in the top right corner) if you want to join in.

        Web GUI issue? Please check apache error (debian/ubuntu: /var/log/apache2/error.log, centos/fedora/rhel: /var/log/httpd/error_log) and php-fpm log (/var/log/php*-fpm.log)

        Please support FOG if you like it: https://wiki.fogproject.org/wiki/index.php/Support_FOG

        • F
          Foglalt @Tom Elliott
          last edited by

          @Tom-Elliott

          Our production network is a strange thing nowadays. Almost every computer has an IP that is considered a public address, but of course that is not true in practice. Our PCs can see the outside world and communicate with it. The outside world can communicate with us only if explicitly allowed. So the network is theoretically public, but in practice it is not. Traffic is routed and firewalled as needed. Barring anomalies, no traffic from the outside reaches normal machines or servers at all. The setup is like this because many of our projects need public services with little or no routing.

          Imaging has worked fine since version 1.5.7. We had this kind of issue with 1.4.x, when I had to change the client kernel because of a strange “delay during saving mbr” issue. And now it “sounds” the same. This is why I first swapped in a few kernels for testing, to see what happens. (I am still running tests with previously used kernels.)

          As for the load: you are right, it is not necessarily the reason for the issue. I just wanted to give as many details as possible. As we had zero tasks running, the load was really strange. The same OS version with an almost identical service setup (web, some PHP, a MySQL data backend, no high-throughput data, like on an idle FOG server) shows a load of 0.1 0.1 0.1 (even on a much older machine with less memory).

          And I am practically sure that the actual FOG machine had a lot less load previously. So something happened, or is happening, and I still need to discover what it is. 😞 During my tests I forgot to restart the stopped Apache, and the load fell from 7.0-something to 3.0-something. FOG consists of a finite number of services, but the true shock for me was that it does something but hides it 🙂 No running or stuck process, but there is load…

          One of my thoughts was a failing HDD or something similar, but SMART says it is OK. Not intact and pristine, but OK.

          So, Elliott, the slowness is not the result of the load (especially since the GUI and SSH are responsive and fast). I agree. And here comes the “but” part 🙂 Any suggestions? (I will try to upgrade to the current version, but there are fewer options in the COVID situation. I do not want to pull the rug out from under my colleague who has to be in the building if I do not have to 🙂)

          • Tom ElliottT
            Tom Elliott
            last edited by

            What kernel version are you running?

            You can see this under FOG Configuration -> Version -> your storage node -> expand -> bzImage and bzImage_32 version.

            Just for sanity: does the host you’re trying to image have a custom kernel attached to it?

            • F
              Foglalt @Tom Elliott
              last edited by

              @Tom-Elliott
              Ah, sorry, I forgot to answer this question of yours:

              bzImage Version: 4.19.118
              bzImage32 Version: 4.19.118

              And no, at the moment we have no special hosts, so there is no need for a special kernel. In the old days we had some, but at the moment no, only one.

              • S
                Sebastian Roth Moderator
                last edited by Sebastian Roth

                @Foglalt It’s interesting that you have a load average of nearly 7 while the CPUs seem pretty much idle. I’m not saying I have a solution, but you might find these helpful:

                • https://martincarstenbach.wordpress.com/2013/06/25/troubleshooting-high-load-average-on-linux/
                • https://www.tummy.com/articles/isolating-heavy-load/

                The slowness at certain points when imaging might be connected to the load, but it could also be unrelated. Possibly there is a network driver issue in the FOS kernels used that causes both the slowness and the “rpc-srv/tcp: nfsd: sent only 18600 when sending 32900 bytes - shutting down socket” messages?! You’d need to switch to different kernel versions to see if it makes any difference.
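On Linux the load average counts tasks in uninterruptible sleep (state `D`, usually waiting on disk or NFS I/O) as well as runnable ones, which is one way to get a high load with idle CPUs. The sketch below lists such tasks by reading `/proc/<pid>/stat`. The function names are made up for illustration, and it only looks at thread-group leaders, so individual threads under `/proc/<pid>/task` are not covered.

```python
from pathlib import Path


def parse_stat(stat_text: str):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or parentheses,
    # so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren].strip())
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root: str = "/proc"):
    """Return a sorted list of (pid, comm) for processes currently in state D."""
    root = Path(proc_root)
    if not root.is_dir():
        return []
    found = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, comm, state = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Tasks in state `D` come and go quickly on a healthy system, so running this several times in a row gives a better picture than a single snapshot.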

                • F
                  Foglalt
                  last edited by Foglalt

                  We did some more investigating and arrived at a solution that seems to work. Sorry for the long wait, but in the virus situation we have limited hardware access, and only on fixed days.

                  Part one of the case: the high load. This was the easier part. It was a disk issue. (SMART showed no valid error, but I insisted on a test with a new disk: faster, bigger, etc. It took time to buy it and get it running.) With the new disk the load went back to normal (zero or 0.1, as before).
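The fix described here was swapping the disk, not measuring it. As one possible way to spot a slow disk that SMART still reports as healthy, the sketch below computes average milliseconds per completed read and write from a line of `/proc/diskstats`. The function name and the sample line are made up for illustration, and the figures are averages since boot, not current latency.

```python
def parse_diskstats_line(line: str):
    """Return (device, averages) for one line of /proc/diskstats.

    Field layout after major, minor and device name:
    reads completed, reads merged, sectors read, ms spent reading,
    writes completed, writes merged, sectors written, ms spent writing, ...
    """
    fields = line.split()
    name = fields[2]
    reads, read_ms = int(fields[3]), int(fields[6])
    writes, write_ms = int(fields[7]), int(fields[10])
    return name, {
        "avg_read_ms": read_ms / reads if reads else 0.0,
        "avg_write_ms": write_ms / writes if writes else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the machine in this thread.
    sample = "8 0 sda 1000 10 80000 5000 200 5 1600 4000 0 6000 9000"
    print(parse_diskstats_line(sample))
    try:
        with open("/proc/diskstats") as handle:
            for raw in handle:
                print(parse_diskstats_line(raw))
    except OSError:
        pass  # not on Linux, or /proc is not mounted
```

Comparing these averages with those of a similar, known-good machine is more telling than any absolute threshold.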

                  Part two: the slowness. We had a large batch of new hardware to image. Most of the machines are nearly identical, but we found that the slow ones may have some undocumented hardware difference (meaning they should be identical, but actually are not). Solution: we disabled UEFI mode, and now imaging works properly (the drawback is that it needs some finishing work at the end… but that is doable). I do not know which hardware component actually causes this, but legacy mode seems OK at the moment.
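When comparing runs in both firmware modes, it helps to confirm which mode a Linux environment was actually booted in. A Linux system booted via UEFI exposes the directory `/sys/firmware/efi`; the sketch below checks for it. The function name is made up, and Python may not be available in the imaging environment itself, in which case testing for that directory by hand does the same job.

```python
from pathlib import Path


def boot_mode(sys_root: str = "/sys") -> str:
    """Return "uefi" if <sys_root>/firmware/efi exists, otherwise "legacy"."""
    efi_dir = Path(sys_root) / "firmware" / "efi"
    return "uefi" if efi_dir.is_dir() else "legacy"


if __name__ == "__main__":
    print(boot_mode())
```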

                  For future investigations, or for the record, here is the actual hardware that we found to be at fault in a few percent of cases:

                  c30e8a81-908a-4f3a-9300-edb5d350d75a-kép.png

                  It is an “HP EliteBook” laptop. They should all be the same, but somehow, somewhere, they are not. The problematic hardware is slow at many points. Sometimes even the bzImage download is “visible” (normally it is done at once), sometimes disk partitioning is stuck for a long time, etc.

                  I think and I hope the case is closed. How can I mark it “solved”?

                  (Oh, I forgot to mention: we did change kernels, and it made no difference.)

                  • george1421G
                    george1421 Moderator @Foglalt
                    last edited by

                    @Foglalt For the target system slowness, could you point to a specific bit of hardware that was causing it, or was it the entire chassis? I remember one post where, with a brand X NVMe drive installed, the iPXE stuff was slow just downloading the background image, but when the OP switched to a brand Y NVMe drive the system acted normally. So the question is: is the problem a replaceable component, or is it something with the mobo that a firmware update won’t fix?

                    • F
                      Foglalt @george1421
                      last edited by

                      @george1421

                      The actual hardware part was not identified; we did not have enough time, and the computer was not available to disassemble (it is a brand new laptop, and opening it up would cause warranty issues). The slowness showed up at various points in the process: sometimes during the iPXE boot (like downloading bzImage), sometimes while adjusting the disk (partition writing, MBR saving, etc.). Fun fact: the actual writing of the image to disk was not slowed down. Considering that deploying an image writes a massive amount of data, that is strange. Given how tiny bzImage is by comparison, it is not clear what caused the actual slowness.

                      It happened during disk I/O and possibly during network traffic. During the process we could not detect any error other than the slowness itself. Everything completed, but insanely slowly. When we switched back to legacy mode, it worked like a charm. Fast and easy.

                      • george1421G
                        george1421 Moderator @Foglalt
                        last edited by

                        @Foglalt It would be interesting to look, from a running Windows OS, at the installed NVMe drive. Were the computers that used one exact drive slow, while the same model with brand Y of NVMe drive was OK? We have seen this condition on a Dell computer where they intermixed NVMe drives on a single model.

                        Copyright © 2012-2024 FOG Project