High and permanent load with no task

Foglalt

[Attached image: 99421084-7d85-4958-b2d0-441b4de21497-kép.png]

It is not a virtual machine; it is an actual physical machine.

Whether the network is private or not is not my decision. I kicked some ass today after seeing that even echo is allowed to that machine from the outside. (By the way, the netadmins said other traffic is not allowed. I will need to lean on them some more to make sure it is…)

What other information do you suggest to collect?

Tom Elliott

@Foglalt I don’t know the generation of the CPU, but 4 cores w/ Hyperthreading = 8 available to Linux. (Some i7’s had hyperthreading, though I know they’re moving away from it.)

I understand you don’t own the network, so you cannot control it, but that can be a cause of the issue.

Do the server and the client reside on the same network? Are they part of the same subnet?
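A quick way to answer the subnet question, as a minimal sketch: it uses Python's standard `ipaddress` module, and the addresses and netmask shown are hypothetical placeholders from the documentation ranges, not values from this thread. Substitute the real FOG server and client addresses.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same network for the given mask."""
    # strict=False lets us pass a host address and get its containing network.
    net_a = ipaddress.ip_network(f"{ip_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/{netmask}", strict=False)
    return net_a == net_b


if __name__ == "__main__":
    # Hypothetical server and client addresses (assumptions, not from the thread).
    print(same_subnet("192.0.2.10", "192.0.2.200", "255.255.255.0"))
    print(same_subnet("192.0.2.10", "198.51.100.7", "255.255.255.0"))
```

If the two are not in the same subnet, the imaging traffic crosses a router or firewall, which is one more place where it can be slowed down.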

Kernels could be part of the cause of the slowness on the client. What version of the kernels are you using? Are you able to upgrade to 1.5.8 or 1.5.9-RC? Maybe something was pushed for that.

About the “slowness” on the client system: what system is it to begin with? (Make, model, etc.)

To me, the slowness you’re seeing is not due to the FOG Server load (though I could be wrong). If the imaging part is moving as fast as it can, I highly doubt it’s the FOG Server load causing the slowness within the client at all. (Just my thoughts).

Foglalt

@Tom-Elliott

Our production network is a strange thing nowadays. Almost every computer has an IP that is considered a public address, but of course that is not true in reality. First of all, our PCs can see the outside world and can communicate with it. The outside world can communicate with us only if explicitly allowed. So the network is theoretically public, but in practice it is not. Traffic is routed and firewalled where needed. Unless an anomaly is found, no traffic from the outside reaches normal machines or servers at all. The setup is like this because many of our projects need public addresses with little or no routing.

Imaging has worked fine since version 1.5.7. We had this kind of issue with 1.4.x, when I had to change the client kernel because of a strange “delay during saving MBR” issue. And now it “sounds” the same. This is why I first swapped in a few kernels for testing, to see what happens. (I am still testing with previously used kernels.)

As for the load: you are right, it is not necessarily the reason for the issue. I just wanted to give as many details as possible. Since we had zero tasks running, the load was really strange. The same OS version with an almost identical service setup (web, some PHP, MySQL data backend, zero high-throughput data, like on an idle FOG server) shows a load of 0.1 0.1 0.1, even on a much older machine with less memory.

And I am practically sure that this FOG machine had a lot less load previously. So something happened, or is happening, and I still need to discover what it is. During my tests I forgot to restart the stopped Apache, and the load fell from 7.0-something to 3.0-something. FOG consists of a finite number of services, but the true shock for me was that it does something yet hides it: no running or stuck process, but load…
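To put those load numbers in proportion to the 8 logical CPUs mentioned earlier, here is a minimal sketch that divides the three load averages by the logical CPU count. The helper name `load_per_cpu` is made up for illustration; `/proc/loadavg` is the standard Linux file that holds the 1, 5 and 15 minute averages.

```python
import os


def load_per_cpu(loadavg_text: str, cpu_count: int) -> tuple:
    """Divide the 1, 5 and 15 minute load averages by the logical CPU count."""
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    return tuple(round(v / cpu_count, 2) for v in (one, five, fifteen))


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/loadavg exists.
    if os.path.exists("/proc/loadavg"):
        with open("/proc/loadavg") as fh:
            print(load_per_cpu(fh.read(), os.cpu_count() or 1))
```

A value near or above 1.0 per CPU means that, on average, about one task per logical CPU was running or waiting. On an idle server the value should be close to zero.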

One of my thoughts was a failing HDD or similar, but SMART says it is OK. Not intact and pristine, but OK.

So, Elliott, the slowness is not the result of the load (especially since the GUI and SSH are responsive and fast). I agree. And here comes the “but” part: any suggestions? (I will try to upgrade to the current version, but there are fewer options in the COVID situation. I don’t want to pull the rug out from under my colleague who has to be in the building if I don’t have to.)

Tom Elliott

What kernel version are you running?

You can see this from FOG Configuration -> Version -> Your Storage node -> expand -> bzImage and bzImage_32 version?

Just for sanity: does the host you’re trying to image have a custom kernel attached to it?

Foglalt

@Tom-Elliott
Ah, sorry, I forgot to answer this question of yours:

bzImage Version: 4.19.118
bzImage32 Version: 4.19.118

And no, at the moment we have zero special hosts, so there is no need for a special kernel. In the old days we had some, but at the moment there is only the one kernel.

Sebastian Roth

@Foglalt It’s interesting you have a load average of nearly 7 but CPUs seem pretty much idle. Not saying I have a solution but you might find this helpful:

The slowness at certain points when imaging might be connected to the load but could also be unrelated. Possibly with the FOS kernels used there is a network driver issue causing the issue and “rpc-srv/tcp: nfsd: sent only 18600 when sending 32900 bytes - shutting down socket” messages as well?! You’d need to switch to different kernel versions to see if it makes any difference.
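On the load average of nearly 7 with idle CPUs: on Linux the load average counts tasks in uninterruptible sleep (state `D`, typically waiting on disk or NFS I/O) as well as runnable ones, so it can be high while the CPUs do nothing. A minimal sketch for listing such tasks, assuming the procps `ps` command is available; the helper name `uninterruptible` is made up for illustration.

```python
import subprocess


def uninterruptible(ps_output: str) -> list:
    """Return (pid, command) pairs for processes whose state starts with 'D'."""
    blocked = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].startswith("D"):
            blocked.append((int(parts[1]), parts[2].strip()))
    return blocked


if __name__ == "__main__":
    # procps 'ps': print state, PID and command name for every process.
    out = subprocess.run(
        ["ps", "-eo", "stat,pid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, comm in uninterruptible(out):
        print(pid, comm)
```

Tasks come and go, so running it several times gives a better picture. If the same processes (for example `nfsd` or kernel journal threads) keep showing up, the storage they wait on is worth a closer look.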

Foglalt

We did some more investigation and came to a seemingly working solution. Sorry for the long wait, but in the virus situation we have limited hardware access, and only on fixed days.

Part one of the case: the high CPU load. This was the easiest. It was a disk issue. (SMART showed no actual error, but I insisted on a test with a new disk: faster, bigger, etc. It took time to buy it and get it running.) So the load became OK again (back to zero or 0.1, as it normally was).
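For anyone who hits the same thing while SMART looks clean: one rough indicator is the share of CPU time spent in iowait. This is a minimal sketch, not what we actually ran. It compares two samples of the aggregate `cpu` line from `/proc/stat`; the helper name `iowait_share` and the one-second interval are arbitrary choices.

```python
import os
import time


def iowait_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two 'cpu' lines of /proc/stat."""
    a = [int(x) for x in before.split()[1:]]
    b = [int(x) for x in after.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Only the first eight are summed, because guest time is already part of user/nice.
    delta = [y - x for x, y in zip(a, b)][:8]
    total = sum(delta)
    return delta[4] / total if total else 0.0


def read_cpu_line() -> str:
    """Read the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open("/proc/stat") as fh:
        return fh.readline()


if __name__ == "__main__":
    if os.path.exists("/proc/stat"):
        first = read_cpu_line()
        time.sleep(1)
        print(f"iowait share: {iowait_share(first, read_cpu_line()):.1%}")
```

A persistently high value on a server with no tasks running points toward storage rather than CPU, though iowait is only a hint and not a precise measurement.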

Part two, the slowness. We had a large batch of new hardware to image. Most of the machines are almost the same hardware, but we found out that the slow ones may have some undocumented hardware difference (meaning they should be identical, but actually they are not). Solution: we disabled UEFI mode and now it works properly (the drawback is that it needs some finishing work at the end, but that is doable). I don’t know which hardware component actually causes this problem, but legacy mode seems OK at the moment.
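To confirm which mode a machine actually booted in, a Linux system exposes the directory `/sys/firmware/efi` only when it was started through UEFI. A minimal sketch of that check; the function name `boot_mode` is made up, and the path parameter exists only so the check can be exercised against another directory.

```python
import os


def boot_mode(efi_dir: str = "/sys/firmware/efi") -> str:
    """Return 'UEFI' if the kernel exposes the EFI directory, else 'legacy BIOS'."""
    return "UEFI" if os.path.isdir(efi_dir) else "legacy BIOS"


if __name__ == "__main__":
    print(boot_mode())
```

This has to run on the client itself, from a Linux environment booted the same way as the imaging task, to say anything about that client.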

For future investigations, or for the record, here is the actual hardware that we found guilty in a few percent of cases:

[Attached image: c30e8a81-908a-4f3a-9300-edb5d350d75a-kép.png]

It is an “HP EliteBook” laptop. They should all be the same, but somehow, somewhere, they are not. The problematic hardware is slow at many points. Sometimes even the bzImage download is “visible” (normally it reaches 100% at once), sometimes disk partitioning is stuck for a long time, etc.

I think and I hope the case is closed. How can I mark it “solved”?

(Oh, I forgot to mention: we did change kernels, and it made no difference with any of them.)

george1421

@Foglalt For the target system slowness, could you point to a specific bit of hardware that was causing it, or was it the entire chassis? I remember one post where, if they had a brand X NVMe drive installed, the iPXE stuff was slow even in just downloading the background image. But if the OP switched to a brand Y NVMe drive, the system acted normal. So the question is: is the problem a replaceable component, or is it something with the mobo that a firmware update won’t fix?

Foglalt

@george1421

The actual hardware part was not identified. We did not have enough time, and the computer was not available to disassemble (it is a brand new laptop, and opening it up would cause warranty issues). The slowness showed up at various points in the process: some in the iPXE boot process (like downloading bzImage), some while preparing the disk (partition writing, MBR saving, etc.). Fun fact: the actual writing of the image to disk was not affected by the slowness. Considering that the image deploy is a massive amount of data writing, that is strange. Given how tiny the bzImage is by comparison, it is not clear what caused the slowness.

It happened during disk I/O and may also happen during network traffic. During the process we could not detect any error other than the slowness itself. Everything completed, but insanely slowly. When we switched back to legacy mode, it worked like a charm. Fast and easy.

george1421

@Foglalt What would be interesting to know, from a running Windows OS, is which NVMe drive is installed. Were the computers with that exact drive slow, while the same model with a brand Y NVMe drive was OK? We have seen this condition on a Dell computer where they intermixed NVMe drives within a single model.
