Hello guys!
We have noticed an issue that we are trying to investigate, but I need a bit of information and help because I am stuck at the moment. We found that some imaging (upload and download both) has become insanely slow. The deploy part seems (for now!) to be target-hardware dependent, which is good, as that part is not on our side. The problem became more interesting when even the "master imaging" turned out to be affected. The master image is created on a regular basis on a virtual machine (the Windows default one, i.e. Hyper-V, not VirtualBox or the like). Uploads are painfully slow.
We have discovered that the preparation stages are slow too, not just the actual imaging (so not only the cloning process, but the scripted steps as well).
At one point we had an issue with filesystem handling and partitioning. We had something similar once before and that turned out to be a kernel issue; this time it is not, or it is a different one, since I have tried a few kernel versions.
Officially the server does practically nothing; its sole job is FOG. Since it shows strange behaviour even with no tasks running, I started another investigation and found that it is under a HIGH load, and that load is practically permanent, even though very little is being done on the server.
The load sits at about 7.0 almost all the time, with very little oscillation. Can we somehow locate the reason FOG-wise? (The other possibility is of course a hardware failure, but at the moment I don't really think so; and since the virus has cut my physical access to the machines, I am trying to rule out software issues first.)
Can I get some suggestions? My problem is that I am not a performance-tuner type and don't have much experience with this. All I can see is that the server responds with zero delay on the web and SSH side, but the imaging process now has delays at points where it previously had none. I don't know whether the high load is the cause or just a symptom. I can collect any data needed, as I have remote access, but at the moment I am stuck.
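As a first triage step, here is a minimal sketch of what I can run (assuming standard Debian tools; iostat needs the sysstat package) to see whether the load is CPU, I/O wait, or stuck processes:

```bash
# Load averages and uptime
uptime

# Is the load CPU usage or I/O wait? Watch the %iowait column
# (needs the sysstat package)
iostat -x 1 5

# Top CPU consumers
ps aux --sort=-%cpu | head -n 15

# Processes stuck in uninterruptible sleep (D state); many of these
# would explain a high load average with low CPU usage
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
```

If lots of processes sit in the D state, that would point at disk or NFS waits rather than CPU, which would match the imaging stalls.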
We have 3 issues:
- a permanently high load with no apparent reason, with zero tasks on the server
- during imaging, between the actual data transfers, there are steps where things get "stuck" for tens of minutes (partitioning, mounting filesystems, post-script entry points)
- and I see network traffic on a permanent basis that, if I recall correctly, was not there before (see the sketch after this list for how I plan to pin it down)
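For the permanent traffic, this is what I intend to run; a sketch only, where eth0 is just a placeholder for the actual interface, and nethogs and iftop are extra Debian packages:

```bash
# Who is this box talking to right now? (ss ships with iproute2)
ss -tupn

# Bandwidth per process (needs the nethogs package)
nethogs eth0

# Bandwidth per connection (needs the iftop package)
iftop -i eth0
```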
I made a few screenshots, in case they help.
Here it gets stuck for a long time.
Here, the partition deployment is fast, but then the start of the next partition is delayed a LOT (Win10: 3 small partitions and 1 big data partition, in the order it creates them). Normally the 3 little ones take only the blink of an eye; now we can even drink a coffee between them. The data throughput itself is fast.
Here it does something, again strangely slowly.
Here is the truly slow one. (This is the step I mentioned that was previously a kernel issue with other FOG versions, which I could solve with newer kernels.)
We currently run FOG 1.5.7 on Debian (stretch, 9.12). The machine has 8 GB of RAM, a 500 GB disk for active data, and a 1 TB disk for "pre-backup" (mounted only when needed, otherwise inactive).
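Since FOG serves images over NFS, the next time a deploy stalls I plan to watch the disks and the NFS server side with something like this (a sketch, assuming sysstat and nfs-common are installed):

```bash
# Disk utilization and wait times, refreshed every 2 seconds
iostat -xd 2

# NFS server call counters (run twice during a stall and compare)
nfsstat -s

# Recent kernel messages with human-readable timestamps
dmesg -T | tail -n 50
```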
(UPDATE: I see things in dmesg that I have not seen before:
[456483.853291] rpc-srv/tcp: nfsd: sent only 18600 when sending 32900 bytes - shutting down socket
And many more of these… I found a bug that caused this years ago, but I don't think that is it (I have version: ii nfs-kernel-server 1:1.3.4-2.1).
Any suggestions as to what this poor machine is doing in its free time that needs to be killed? At the moment SMART shows no disk errors, as far as I can tell.)
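For completeness, this is roughly how I checked SMART (smartmontools package; /dev/sda is just a placeholder for the real device):

```bash
# Overall health verdict
smartctl -H /dev/sda

# Full attribute table; reallocated/pending sector counts are the
# interesting ones for a failing disk
smartctl -A /dev/sda
```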