High and permanent load with no task

Foglalt

Hello guys!

We have noticed an issue that we are trying to investigate, but I need some information and help as I am stuck at the moment. We found that some imaging (both upload and download) has become insanely slow. The deploy part seems (for now!) to depend on the target hardware, which is good, as that is not on our side. The problem became more interesting when it turned out that even the “master imaging” is affected. The master image is created on a regular basis on a virtual machine (the Windows default one, Hyper-V, not VirtualBox or the like). Uploads are slooow.

We have discovered that “the preparation stages are slow, too, not the actual imaging” (so not only the cloning process, but the scripted steps as well).

At one point we had an issue with handling the filesystem and partitioning. When we had that before, it was a kernel issue; now it is not, or it is a different one, as I have tried a few versions.

Officially, the server does practically nothing. Its sole job is FOG. As it shows strange issues even with no task, I started another investigation. I found that it has a HIGH load, and that load is practically permanent, with not much being done on the server.

b52cf205-c3a6-4967-8447-b04b29e5df61-kép.png

The load is at 7.0 almost all the time, with very little oscillation. Can we somehow locate the reason FOG-wise? (Another possibility is a hardware failure of course, but at the moment I don't really think I can detect that, as the virus broke my physical bond with the machines, so I am trying to detect software issues first.)

Can I have suggestions? My problem is that I am not a performance-tuning type; I don't have much experience with it. All I see is that the server responds with zero delay on the web or SSH side, but the imaging process has delays at points where previously it had none. I don't know whether the high load is the cause or a result. I can collect any data if needed, as I have remote access, but at the moment I am stuck.

We have 3 issues:

a permanent high load for no reason, with zero tasks on the server
during imaging, between the actual data transfers, there are steps where things get “stuck” for tens of minutes (partitioning, mounting filesystems, post-script entry points)
and I see permanent network traffic that was not there before, if I recall correctly.

I made a few screenshots in case they help.

00aeab23-0b16-41b7-a574-4b8cffd68180-kép.png
Here it gets stuck for a long time.

2ac0ce50-ea03-4e5a-a0c0-f1626864611f-kép.png
Here, the partition deployment is fast, then the start of the next partition is delayed a LOT (Win10, 3 small partitions and 1 big data partition, in the order it creates them). Normally the 3 little ones are only “blinks of an eye”; now we can even drink a coffee between them. Data throughput is fast.

df2ec345-6650-43f7-947d-d037da4a01a0-kép.png
Here it does something, again strangely slowly.

36669a4f-0842-46ac-9d95-e5bc0cfe80bb-kép.png
Here is the truly slow one. (This is the one I mentioned that was previously a kernel issue with other FOG versions, which I could solve with new kernels.)

We currently run FOG version 1.5.7 on Debian (stretch, 9.12). The machine has 8G of RAM, a 500G disk for active things, and a 1T disk for “prebackup” (mounted only when needed, otherwise inactive).

(UPDATE: I see things in dmesg that I haven't seen before:
[456483.853291] rpc-srv/tcp: nfsd: sent only 18600 when sending 32900 bytes - shutting down socket

And many of these… I found a bug that caused this years ago, but I don't think that is it (I have version: ii nfs-kernel-server 1:1.3.4-2.1).

Any suggestions as to what this poor machine does in its free time that needs to be killed? At the moment SMART shows no disk errors, as far as I know.)
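To get a feel for how often the nfsd message quoted above shows up, the kernel log lines can be filtered with a few lines of Python. This is only a rough sketch: the function name `find_short_sends` is made up for the example, and the pattern is based solely on the single dmesg line quoted in this post, so other kernel versions may word the message differently.

```python
import re
from typing import Iterable, List, Tuple

# Pattern taken from the one dmesg line quoted in the post (assumption:
# other occurrences use the same wording).
SHORT_SEND = re.compile(r"nfsd: sent only (\d+) when sending (\d+) bytes")


def find_short_sends(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (bytes sent, bytes intended) for every matching log line."""
    hits = []
    for line in lines:
        match = SHORT_SEND.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


if __name__ == "__main__":
    # Demo on the line from the post; in practice the lines would come
    # from the output of dmesg.
    sample = [
        "[456483.853291] rpc-srv/tcp: nfsd: sent only 18600 when sending "
        "32900 bytes - shutting down socket"
    ]
    for sent, wanted in find_short_sends(sample):
        print(f"short send: {sent} of {wanted} bytes")
```

A steadily growing count while no imaging task is running would suggest that some client keeps talking to the NFS server in the background.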

Tom Elliott

How many CPUs are available to the FOG Server?

Using Public ranged IP addresses is never fun, maybe this is a part of the problem?

What network is the FOG Server running on? (Is it also on 146.110.108.X?)

See, the fun part is,
https://search.arin.net/rdap/?query=146.110.108.59

Are you indeed the individual this is coming up as?

High load on a server is very subjective. What do I mean? If the load average reported by uptime/top is greater than the number of cores available to the machine, then the load is high. At or below the number of cores available, the machine is not under high load. Sure, some processes may spike slightly. If you had a 4 core machine with a load of 7, chances are you would not even have as much success as you are seeing.
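The rule of thumb above (load average compared with the number of cores) can be sketched in Python. This is only an illustration: the function names are made up for the example, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Return the 1-minute load average divided by the number of logical CPUs."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1min / cores


def is_high_load(load_1min: float, cores: int) -> bool:
    """True when the load average exceeds the number of logical CPUs."""
    return load_per_core(load_1min, cores) > 1.0


if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()  # same numbers that uptime and top show
    cores = os.cpu_count() or 1        # logical CPUs, hyperthreads included
    verdict = "high" if is_high_load(load_1min, cores) else "not high"
    print(f"load {load_1min:.2f} on {cores} CPUs: {verdict}")
```

By this measure a load of 7 is high on 4 logical CPUs but not on 8.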

Something leads me to think we need a lot more information.

Of note, I would highly recommend getting your machines and fog server to a private network and (as possible) the same private network for the client and the fog server.

Foglalt

@Tom-Elliott

99421084-7d85-4958-b2d0-441b4de21497-kép.png

It is not a virtual machine, it is an actual machine.

Private network or not, it is not my domain to decide. I kicked some ass today after seeing that even echo is allowed to that machine from the outside (btw the netadmins said other traffic is not allowed; I will need to kick some more ass to make sure it really is…)

What other information do you suggest to collect?

Tom Elliott

@Foglalt I don’t know the generation of the CPU, but 4 cores w/ Hyperthreading = 8 available to Linux. (Some i7’s had hyperthreading, though I know they’re moving away from it.)

I understand you don’t own the network, so you cannot control it, but that can be a cause of the issue.

Do the server and the client reside on the same network? Are they a part of the same subnet?

Kernels could be a part of the cause of the slowness on the client. What version of the kernels are you using? Are you able to upgrade to 1.5.8 or 1.5.9-RC? Maybe something was pushed for that.

About the “slowness” on the client system, what system is it to begin with? (Make model, etc…)

To me, the slowness you’re seeing is not due to the FOG Server load (though I could be wrong). If the imaging part is moving as fast as it can, I highly doubt it’s the FOG Server load causing the slowness within the client at all. (Just my thoughts).

Foglalt

@Tom-Elliott

Our production network is a strange thing nowadays. Almost every computer has an IP that is considered a public address, but of course that is not true in reality. First of all, our PCs can see the outside world and can communicate with it. The outside world can communicate with us only if explicitly allowed. So the network is theoretically public, but actually it is not. Traffic is routed and firewalled as needed. Unless there is an anomaly, no traffic from the outside reaches normal machines or servers at all. The setup is like this because many of our projects need public things with little or no routing.

Imaging has worked fine since version 1.5.7. We had this kind of issue with 1.4.x, when I had to change the client kernel because of a strange “delay during saving mbr” issue. And now it “sounds” the same. This is why I first swapped in a few kernels for testing purposes to see what happens. (I am still running tests with previously used kernels.)

As for the load: you are right, the load is not necessarily the reason for the issue. I just wanted to give as many details as possible. As we had zero tasks running, the load was really strange. The same OS version with almost the same service setup (web, some PHP, MySQL data backend, no high-throughput data, like an idle FOG server) runs at a load of 0.1 0.1 0.1 (even on a much older machine with less memory).

And I am practically sure that the actual FOG machine had a lot less load previously. So something happened, or is happening, and I still need to discover what it is. During my tests I forgot to restart the stopped Apache, and the load fell from 7.0-something to 3.0-something. Normally FOG does not consist of an infinite number of services, but the true shock for me was that it is doing something yet hides it: no running or stuck process, only load…

One of my thoughts was a failing HDD or something similar, but SMART says it is OK. Not intact and virgin, but OK.

So, Elliott, the slowness is not the result of the load (especially since the GUI and SSH are responsive and fast). I agree. And here comes the “but” part: any suggestions? (I will try to upgrade to the current version, but there are fewer options in the COVID situation. I don't want to kick the table out from under my colleague who has to be in the building if I don't have to.)

Tom Elliott

What kernel version are you running?

You can see this from FOG Configuration -> Version -> Your Storage node -> expand -> bzImage and bzImage_32 version?

Just for sanity: does the host you’re trying to image have a custom kernel attached to it?

Foglalt

@Tom-Elliott
Ah, sorry, I forgot to answer this question of yours:

bzImage Version: 4.19.118
bzImage32 Version: 4.19.118

And no, at the moment we have zero special hosts, so there is no need for a special kernel. In the old days we had some, but at the moment no, just the one kernel.

Sebastian Roth

@Foglalt It’s interesting you have a load average of nearly 7 but CPUs seem pretty much idle. Not saying I have a solution but you might find this helpful:

The slowness at certain points when imaging might be connected to the load but could also be unrelated. Possibly, with the FOS kernels used, there is a network driver issue causing both the slowness and the “rpc-srv/tcp: nfsd: sent only 18600 when sending 32900 bytes - shutting down socket” messages?! You’d need to switch to different kernel versions to see if it makes any difference.
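One thing worth checking when the load average is high while the CPUs look idle: on Linux the load average also counts tasks in uninterruptible sleep (state `D`), which are usually waiting on disk or NFS I/O. A minimal sketch for listing such processes on the server follows; the function names are made up for the example, and it only reads `/proc/<pid>/stat`, so it needs Linux.

```python
import glob
from typing import List, Tuple


def parse_stat_line(line: str) -> Tuple[int, str, str]:
    """Return (pid, command name, state) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx])
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes() -> List[Tuple[int, str]]:
    """List (pid, command name) of all processes currently in state D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(f"{pid}\t{comm}")
```

If the same processes (nfsd threads, for example) keep showing up across several runs, the storage or NFS side is a more likely suspect than the CPU.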

Foglalt

We did a few more investigations and came to a seemingly working solution. Sorry for the long wait, but in the virus situation we have limited hardware access, and only on fixed days.

Part one of the case: high CPU load. This was the easiest. It was a disk issue (SMART showed no valid error, but I insisted on a test with a new disk: faster, bigger, etc. It took time to buy it and get it running). So the load became OK (back to zero or 0.1, as it normally was).

Part two, the slowness. We had a large batch of new hardware to image, most of it almost the same hardware, but we found that the slow ones may have some undocumented hardware difference. (Meaning they should be identical, but actually they are not.) Solution: we disabled UEFI mode and now it works properly (the drawback is that it needs some finishing at the end… but doable). I don't know which actual hardware causes this error, but legacy mode seems OK at the moment.

For future investigations, or for the logs, here is the actual hardware that we found guilty in a few percent of cases:

c30e8a81-908a-4f3a-9300-edb5d350d75a-kép.png

It is an “HP EliteBook” laptop. All of them should be the same, but somehow, somewhere, they are not. The problematic hardware is slow in many places. Sometimes even the bzImage download is “visible” (normally it hits 100% at once), sometimes disk partitioning is stuck for a long time, etc.

I think and I hope the case is closed. How can I mark it “solved”?

(Oh, forgot to mention: we did change kernels; no difference with any of them.)

george1421

@Foglalt For the target system slowness, could you point to a specific bit of hardware that was causing the slowness, or was it the entire chassis that was causing the slowness? I remember one post where if they had brand X NVMe drive installed the iPXE stuff was slow in just downloading the background image. But if the OP switched to brand Y NVMe drive the system acted normal. So the question is: is the problem a replaceable component, or is it something with the mobo that is causing the slowness where a firmware update won’t fix it?

Foglalt

@george1421

The actual hardware part was not identified; we did not have enough time, and the computer was not available to disassemble (it is a brand-new laptop, which has warranty issues if opened up). The slowness was noticed at various points in the process: some in the iPXE boot process (like downloading bzImage), some while adjusting the disk (partition writing, MBR saving, etc.). Fun fact: the actual loading of the image onto the disk was not hindered by slowness. Considering that the image deploy is a massive amount of data writing, that is strange. Given the bzImage's tiny size by comparison, it is not clear what caused the actual slowness.

It happened during disk I/O and possibly during network traffic. During the process we could not detect any error other than the slowness itself. Everything got done, but insanely slowly. When we switched back to legacy mode, it worked like a charm. Fast and easy.

george1421

@Foglalt What would be interesting to know, from a running Windows OS, is which NVMe drive is installed. Were the computers that used that exact drive slow, while the same model with brand Y of NVMe drive was OK? We have seen this condition on a Dell computer where they intermixed NVMe drives on a single model.
