Storage Node Re-Writing Images Daily and Crushing My Network

FallingWax

Running Version 1.3.0
SVN Revision: 6050

While I was troubleshooting some network speed issues, I came across this error in Wireshark:

[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

This was communication between my main FOG server and a storage node. I went and looked at the image files, and it looks like they are rewriting the same images. I haven’t created anything new, but the date on the images has changed every day. This is crushing my network speeds so badly that I had to stop the ImageReplicator service, which immediately fixed the problem. Any help would be appreciated!

Thanks

Tom Elliott

If your network is constantly erroring out, retransmissions would be expected. The FOG Replication stuff is aware of what’s replicating and what’s not, so if a transmission stops before the files are fully copied, it might be rewriting them because something is killing the connections.

FallingWax

@Tom-Elliott Is there any logging that I might look at for the Image Replicator to see if that is what is happening?

Tom Elliott

@FallingWax /var/log/fog/fogreplicator.log and/or /var/log/fog/fogsnapinrep.log

Then there’s the nodes getting the files:

/var/log/fog/fogreplicator.log.transfer.<nodename>.log
And/or
/var/log/fog/fogsnapinrep.log.transfer.<nodename>.log

(There’s not much in regards to replicating)

FallingWax

Looks like I found three images that don’t finish replicating, or don’t replicate properly, and are being written and deleted over and over. Those images work correctly on the main server, so I would hesitate to remove them.

This is what I see in the log:

| Image Name: Dell_7040_Win10_x64
[01-06-17 5:48:56 pm] | Dell_7040_Win10_x64: No need to sync d1.mbr file to 19$
[01-06-17 5:48:56 pm] | Dell_7040_Win10_x64: No need to sync d1.partitions fil$
[01-06-17 5:48:56 pm] | Dell_7040_Win10_x64: No need to sync d1p1.ebr file to $
[01-06-17 5:48:57 pm] | Dell_7040_Win10_x64: No need to sync d1p2.img file to $
[01-06-17 5:48:57 pm] | Files do not match.
[01-06-17 5:48:57 pm] * Deleting remote file: /images/Dell7040Win10x64/d1p3.img
[01-06-17 5:48:58 pm] | Files do not match.
[01-06-17 5:48:58 pm] * Deleting remote file: /images/Dell7040Win10x64/d1p4.img
[01-06-17 5:48:58 pm] | Dell_7040_Win10_x64: No need to sync d1p5.ebr file to $
[01-06-17 5:48:59 pm] | Dell_7040_Win10_x64: No need to sync d1p5.img file to $
[01-06-17 5:48:59 pm] * Starting Sync Actions
[01-06-17 5:48:59 pm] | CMD:
lftp -e 'set ftp:list-options -a;set net:max-retries 10$
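To see which files keep getting removed, the "Deleting remote file" entries can be tallied per path. This is only a rough sketch: the function name count_remote_deletes is made up for illustration, and the pattern assumes the log lines look exactly like the excerpt above.

```python
import re
from collections import Counter

# Assumed line shape, copied from the excerpt above:
# [01-06-17 5:48:57 pm] * Deleting remote file: /images/Dell7040Win10x64/d1p3.img
DELETE_RE = re.compile(r"\* Deleting remote file: (\S+)")


def count_remote_deletes(lines):
    """Return a Counter mapping remote file path -> number of delete entries."""
    counts = Counter()
    for line in lines:
        match = DELETE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the excerpt above; in practice, pass an open
# handle on the replicator log instead of this list.
sample = [
    "[01-06-17 5:48:57 pm] | Files do not match.",
    "[01-06-17 5:48:57 pm] * Deleting remote file: /images/Dell7040Win10x64/d1p3.img",
    "[01-06-17 5:48:58 pm] | Files do not match.",
    "[01-06-17 5:48:58 pm] * Deleting remote file: /images/Dell7040Win10x64/d1p4.img",
]

for path, n in count_remote_deletes(sample).most_common():
    print(n, path)
```

A path whose count keeps climbing across replication cycles is probably one of the images that never finishes copying.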

Tom Elliott

What about disk usage? Is it possible your nodes (or your main server) are maxed out on disk space?

Wayne Workman

First, check the disk usage on the remote node, as Tom said. Run df -h and look for partitions at 99% or 100% usage.
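If you would rather script the check, something like the following could be run on each node. It is a minimal sketch, not a FOG tool: the helper names and the 99% threshold are my own, and /images is assumed to be the image store only because that path appears in the log above. The percentage is used/total, so it can differ slightly from the Use% column in df, which accounts for reserved blocks.

```python
import os
import shutil


def percent_used(used, total):
    """Percentage of capacity in use; 0.0 when the filesystem reports no size."""
    return 100.0 * used / total if total else 0.0


def check_paths(paths, threshold=99.0):
    """Return (path, percent) pairs for existing paths at or above threshold."""
    flagged = []
    for path in paths:
        if not os.path.exists(path):
            continue  # skip paths that are not present on this machine
        usage = shutil.disk_usage(path)
        pct = percent_used(usage.used, usage.total)
        if pct >= threshold:
            flagged.append((path, round(pct, 1)))
    return flagged


# "/images" is an assumption based on the log paths in this thread.
for path, pct in check_paths(["/", "/images"]):
    print(f"{path} is {pct}% full")
```

No output means none of the listed paths is at or above the threshold.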

@FallingWax I remember having this problem although I can’t remember what I called the thread title… it’s here in the forums somewhere.

But, basically I figured out that very large images were not completing replication within the grace window and the fog image replicator would just kill the old replication task and start it again.

I brought this issue up to @Tom-Elliott at the time and he coded a fix - the fix made the image replicator aware of prior spawned lftp instances, and it would wait for those instances to complete before trying to restart them.

Maybe something in the code base is goofed, I’m not sure. But you need to look at this setting and write down what it is:
Web Interface -> FOG Configuration -> FOG Settings -> FOG Linux Service Sleep Times -> IMAGEREPSLEEPTIME

The value is in seconds. Next, go through your replication logs. Tom pointed out where they are in the filesystem, but they are also available via the web interface here: Web Interface -> FOG Configuration -> Log Viewer -> Image Replicator.

You need to figure out whether the interval at which the image replicator restarts the transfer is close to the image replication sleep time. If it is, that could mean there’s an issue with the image replicator keeping track of the lftp instances it created. There could of course be other issues that we don’t know about, so be extra observant when looking through all of this.
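The comparison against the sleep time can be roughed out in a few lines of Python. This is a sketch under assumptions: it treats each "Starting Sync Actions" line as the start of a transfer, reads the timestamp as month-day-year with a 12-hour clock (a guess from the excerpt earlier in the thread), and uses a made-up sleep time of 600 seconds and a made-up second log line purely for illustration. Substitute your own IMAGEREPSLEEPTIME value and real log lines.

```python
import re
from datetime import datetime

# Assumed timestamp shape: [01-06-17 5:48:59 pm], read as month-day-year.
STAMP_RE = re.compile(r"^\[(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2} [ap]m)\]")
STAMP_FORMAT = "%m-%d-%y %I:%M:%S %p"


def sync_start_times(lines):
    """Collect the timestamps of every 'Starting Sync Actions' line."""
    times = []
    for line in lines:
        if "Starting Sync Actions" not in line:
            continue
        match = STAMP_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), STAMP_FORMAT))
    return times


def restart_gaps(times):
    """Seconds elapsed between each pair of consecutive sync starts."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def gaps_near_sleep_time(gaps, sleep_seconds, tolerance=60):
    """Keep only the gaps within `tolerance` seconds of the sleep time."""
    return [g for g in gaps if abs(g - sleep_seconds) <= tolerance]


# First line is from the thread; the second is hypothetical.
sample = [
    "[01-06-17 5:48:59 pm] * Starting Sync Actions",
    "[01-06-17 5:59:01 pm] * Starting Sync Actions",
]
gaps = restart_gaps(sync_start_times(sample))
print(gaps_near_sleep_time(gaps, sleep_seconds=600))  # 600 is a placeholder
```

If most gaps cluster around the sleep time while large images are still copying, that points toward the replicator restarting transfers each cycle rather than waiting for them.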
