storage node sha512sum at 100% CPU/HDD usage
-
@mp12 Based on what I’m seeing, the hash is working properly; it’s just not finding the right file size. Unless you have two nodes being replicated to?
The lines showing the file name with the numbers are the file sizes it’s seeing: the first is the master node’s and the second is the remote node’s. All seem to show the file size on the remote as being 0.
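(A quick manual cross-check of what each side reports, as a sketch; NODE_IP and the image path are placeholders, and it assumes SSH access from the master to the storage node.)
# size in bytes as the master sees it
stat -c %s /images/IMAGE_NAME/d1p2.img
# size in bytes as the remote node sees it
ssh root@NODE_IP "stat -c %s /images/IMAGE_NAME/d1p2.img"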
-
I guess I need to see/understand how your nodes are set up. Are they all regular storage nodes (installed just like a server/node would be), or are they NAS or another type of custom layout?
Knowing this information will certainly help me find the problem and possibly a solution if they are normal install nodes.
-
@tom-elliott Just a normal install. One master server and one storage node. Maybe this is the mistake? Some time ago we had the location plugin enabled, but the problem existed before that, I guess. We also had a second storage node some time ago, which I removed from the group. Both (all) running as VMs.
-
I did a bit of work on the fogservice with regard to getting the replication tested more appropriately. I just wanted to give you a heads up that I hadn’t forgotten about this, I’ve just been kind of busy.
If you would be so kind as to install the github working branch and see if this helps out? I did a bit of work toward getting the file size and hashes checked a bit more correctly, as it seemed things were being double-checked before (which would explain maxing out the sha512sum too).
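(Roughly, switching an existing git checkout of fogproject over to that branch might look like the sketch below; it assumes the repo was cloned to /root/fogproject and that the installer is re-run afterwards to apply the update.)
cd /root/fogproject
git fetch origin          # grab the latest branches from GitHub
git checkout working      # switch to the working branch
git pull                  # make sure it is up to date
cd bin
sudo ./installfog.sh      # re-run the installer to apply the new code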
-
Thanks @Tom-Elliott. I switched to working. Deleted all images on fog node to see if replication is working properly. I will report back tomorrow.
-
The load from the sha512sum process has decreased to 80% on the node, but now the master is also involved in checking hashes, with a high load of around 80%. Smaller files seem to be synced/checked properly. Bigger files obviously do not match?
Inside the fogreplication.log I am getting messages like this:
[05-17-18 3:38:59 pm] | Image Name: xxx_win7_26042018
[05-17-18 3:39:00 pm] | xxx_win7_26042018: No need to sync d1.fixed_size_partitions file to node
[05-17-18 3:39:00 pm] | xxx_win7_26042018: No need to sync d1.mbr file to node
[05-17-18 3:39:00 pm] | xxx_win7_26042018: No need to sync d1.minimum.partitions file to node
[05-17-18 3:39:00 pm] | xxx_win7_26042018: No need to sync d1.original.fstypes file to node
[05-17-18 3:39:01 pm] | xxx_win7_26042018: No need to sync d1.original.swapuuids file to node
[05-17-18 3:39:01 pm] | xxx_win7_26042018: No need to sync d1.partitions file to node
[05-17-18 3:39:01 pm] | xxx_win7_26042018: No need to sync d1p1.img file to node
[05-17-18 3:51:16 pm] | Files do not match on server: node
[05-17-18 3:51:16 pm] | Deleting remote file: /images/xxx_win7_26042018/d1p2.img
[05-17-18 3:51:16 pm] * Starting Sync Actions
[05-17-18 3:51:16 pm] | CMD: lftp -e 'set xfer:log 1; set xfer:log-file "/opt/fog/log/fogreplicator.xxx_win7_26042018.transfer.node.log";set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/xxx_win7_26042018" "/images/xxx_win7_26042018"; exit' -u fog,[Protected] x.x.x.x
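(For reference, the mirror the replicator attempts could be re-run by hand roughly like this, to see whether the transfer itself stalls; the trailing x.x.x.x is the node IP and FTP_PASSWORD stands in for the [Protected] fog FTP password from the logged command.)
lftp -e 'set ftp:list-options -a; set net:max-retries 10; set net:timeout 30; mirror -c -R --ignore-time -vvv "/images/xxx_win7_26042018" "/images/xxx_win7_26042018"; exit' -u fog,FTP_PASSWORD x.x.x.x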
The last entry in /opt/fog/log/fogreplicator.xxx_win7_26042018.transfer.node.log is in the past. I don’t think that the d1p2.img file was copied to the node again. I checked the bandwidth monitor for 30 minutes; nothing higher than 3Mbps.
2018-05-17 10:12:37 /images/xxx_win7_26042018/d1p2.img -> ftp://xxx@x.x.x.x/%2Fimages/xxx_win7_26042018/d1p2.img 0-101176680804 27.78 MiB/s
So I think that the sync/check process now only has problems with larger files.
The size of d1p2.img is around 95G.
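(A simple way to confirm whether anything is actually moving, as a sketch: tail the transfer log on the master and watch whether d1p2.img grows on the node. Paths are taken from the log above.)
tail -f /opt/fog/log/fogreplicator.xxx_win7_26042018.transfer.node.log
# and on the storage node:
watch -n 10 'ls -l /images/xxx_win7_26042018/d1p2.img'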
-
@mp12 I’m not seeing the same issue. I suppose it could be how large the files are, but that still seems unlikely. Is it possible the lftp command is stuck itself? On the transferring node you can see running lftp processes via ps -ef | grep lftp, and I’m going to guess there are none running. Maybe disk space on the receiving node has run out?
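(A quick way to check both of those, as a sketch; run the first on the transferring master and the second on the receiving node.)
ps -ef | grep [l]ftp    # any replicator transfers still running on the master?
df -h /images           # free space on the node where the images land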
-
@tom-elliott There is no lftp running. I have one default image that passes the check normally. The file d1p3.img is around 11G.
[05-18-18 7:29:19 am] | Image Name: xxx_Default_Win10
[05-18-18 7:29:19 am] | xxx_Default_Win10: No need to sync d1.fixed_size_partitions file to node
[05-18-18 7:29:19 am] | xxx_Default_Win10: No need to sync d1.mbr file to node
[05-18-18 7:29:20 am] | xxx_Default_Win10: No need to sync d1.minimum.partitions file to node
[05-18-18 7:29:20 am] | xxx_Default_Win10: No need to sync d1.original.fstypes file to node
[05-18-18 7:29:20 am] | xxx_Default_Win10: No need to sync d1.original.swapuuids file to node
[05-18-18 7:29:20 am] | xxx_Default_Win10: No need to sync d1.original.uuids file to node
[05-18-18 7:29:20 am] | xxx_Default_Win10: No need to sync d1.partitions file to node
[05-18-18 7:29:21 am] | xxx_Default_Win10: No need to sync d1p1.img file to node
[05-18-18 7:29:21 am] | xxx_Default_Win10: No need to sync d1p2.img file to node
[05-18-18 7:30:49 am] | xxx_Default_Win10: No need to sync d1p3.img file to node
[05-18-18 7:30:52 am] | xxx_Default_Win10: No need to sync d1p4.img file to node
[05-18-18 7:30:52 am] * All files synced for this item.
Free disk space is around 70G. I will move one image to free up some space and reboot both machines.
-
@mp12 Any word on this? It really seems there could be a disk-usage type of issue, though your free space does appear to fit the size needed for that image. Then again, if there are other images also trying to replicate at the same time, that could pose a bit of a problem as well.
-
@tom-elliott I will set up a new node and check whether something goes wrong in replication. If not, I will remove the old node.
-
Any word on this @mp12? I want to nail this one down, though you already admitted that it seems a little better. Hopefully a new node will help out a bit?
Thanks for the feedback.
-
@tom-elliott Just wanted to see if things improved over the weekend, but the new node has the same problem as described before. I will recheck the master and try a Zstd split.
-
@Tom-Elliott After replicating 492 files and waiting one night, I can say that nothing goes wrong with small files. The check of the 200M files takes 2-3 seconds. Large files are still not checked correctly. I don’t know exactly at what size the test fails: a file of 11G is checked correctly, but files of 85G and up lead to errors.
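(To narrow down where the check breaks, a sketch: hash and size the same large file on both machines and compare the output; on an 85G+ file the hash will take a while.)
# run on the master, then repeat on the storage node and compare
time sha512sum /images/xxx_win7_26042018/d1p2.img
stat -c %s /images/xxx_win7_26042018/d1p2.img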
-
@Tom-Elliott After I noticed that you can’t distribute split files via multicast, I discarded that possibility.
-
Should be solved in the latest version, as we have re-worked the replication stuff a fair bit.
-
@Sebastian-Roth Thanks for the reply! Did an update to 1.5.5.1 two weeks ago. Everything works fine so far!