LFTP mirror copies files that are still being written
-
I’ve noticed that the FOGImageReplicator service calls lftp, which mirrors the /images folder from one node to another. However, it also tries to sync files that are still being written to. I’m not sure how smart lftp is, but it will waste traffic at best, or copy incorrectly at worst.
This is a minor issue, since there is the dev folder, and if the last step is always to move the image being worked on from dev to dev/.., it happens so quickly that lftp would (AFAIK) not start copying before it arrives.
Oh, wait… if for some reason /images/dev and /images are not on the same filesystem, mv is not atomic, and lftp will try to transfer incomplete files.
But maybe lftp is smart enough to account for this? I couldn’t find it online. I did, however, find a way to exclude files with a modification time of less than a few minutes ago: http://serverfault.com/a/693787/301389
-
Sure enough, when I restarted FOGImageReplicator, lftp decided that the file (the one it tried to mirror while I was writing to it) is no good, promptly deleting it on all the slaves and sending it again. This is not a bad outcome. But it means we sent unnecessary traffic before.
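To illustrate the "exclude recently modified files" idea mentioned above, here is a minimal Python sketch. It is not what FOGImageReplicator or lftp does, and it does not reproduce the command from the linked answer; the function name files_older_than and the 300-second threshold are my own assumptions for illustration.

```python
import os
import time


def files_older_than(root, min_age_seconds, now=None):
    """Return sorted paths under root whose mtime is at least min_age_seconds old.

    Hypothetical helper: files modified more recently than the threshold are
    treated as "possibly still being written" and left out of the result.
    """
    now = time.time() if now is None else now
    stable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            if now - mtime >= min_age_seconds:
                stable.append(path)
    return sorted(stable)
```

Something like files_older_than("/images", 300) would then list only files untouched for five minutes. A writer that stalls for longer than the threshold would still slip through, so this is a heuristic, not a guarantee.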
-
I’m just guessing here, but are you writing the files you need to a node other than the master? I agree we shouldn’t overwrite the data if it’s currently being written. The replication process only occurs from master nodes. In the case of a file being associated with multiple groups, the designated “primary group” master node will replicate to the remote groups’ master nodes.
To make sure a file doesn’t accidentally get deleted from where you’re placing the files, you should put the files on the primary group’s master node (or the master node in the case of only one storage group).
If this is not true, the next cycle compares the local and remote copies and, if they don’t match, will delete the files and begin the sync again (if the files already exist on the master/primary master).
This is all speculation, and I already have an idea of how to correct this. I don’t think relying on a five-minute modification-time threshold is an accurate enough check to warrant a delete. A better check would be whether the local file is currently being written to, skipping the replication delete if it is. Though I probably just sync the whole images directory…
Most likely I’m misreading too.
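As a rough illustration of "check whether the local file is currently being written to", one cheap heuristic is to sample the file's size and modification time twice and see whether anything changed. This is only a sketch under that assumption, not the fix that was actually implemented in FOG; the name looks_idle and the two-second wait are hypothetical.

```python
import os
import time


def looks_idle(path, wait_seconds=2.0, sleep=time.sleep):
    """Heuristic: True if size and mtime are unchanged across a short wait.

    The sleep parameter exists only so the wait can be replaced in tests.
    A writer that pauses for longer than wait_seconds will be missed.
    """
    before = os.stat(path)
    sleep(wait_seconds)
    after = os.stat(path)
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)
```

A replication loop could skip the delete-and-resend step for any file where looks_idle(path) returns False and retry on the next cycle.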
-
@Tom-Elliott I was writing on the master only. The problem is that lftp should wait until I’m done writing before it sends the file to the slaves.
-
Why would you set up /images/dev on a different disk? I would not recommend that.
-
@Wayne-Workman I didn’t, I’m just speculating. One possible use case might be if you have an SSD for dev to upload faster and a big slow HDD for the long term storage.
-
@dolf Capturing is a lot slower than deploying. If anything, use an SSD for both.
-
@Wayne-Workman That sounds like a feature request… Having a deploy folder on SSD. Not something I would need though. We’re drifting off topic.
Oh and by upload I meant capture. Stuck in the old terminology…
-
I should explain to the best of my abilities how replication happens, where relevant to this topic.
The FOG image replicator replicates only images that have a definition in the database. This means the images directory is not necessarily replicated in its entirety. If, for instance, you delete an image but do not delete the image data on the hard disk, then that image will no longer be replicated. All uploads go to the dev directory and are moved to the images directory once the upload is complete. I think the only case where replication might try to copy a file that is currently being written to is if the images directory and the dev directory are on two different disks. The dev directory is never replicated. So I don’t see how, in probably 99.9% of cases, lftp would try to replicate an image that is currently being written to.
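The "two different disks" case can be checked directly. A minimal sketch, assuming a POSIX system: two paths that report the same device ID are on the same filesystem, where a rename is normally atomic; across filesystems, mv falls back to copy-then-delete, which is the window discussed above. The helper name same_filesystem is made up for this example.

```python
import os


def same_filesystem(path_a, path_b):
    """True if both paths report the same device ID (st_dev).

    On POSIX, a rename within one filesystem is atomic; across devices
    the move becomes a copy followed by a delete.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For the layout in this thread, the call would be same_filesystem("/images", "/images/dev"). Unusual setups such as bind mounts may behave differently, so treat the result as a hint.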
-
When you say the file is being written to, are you manually uploading it yourself? Capture works out of dev, which is not replicated. However, moving from dev into /images would work. In most cases I think /images and dev are on the same disk, probably more often a spinner than an SSD. If it’s a spinner via SAN, I imagine that being the slowest form of all, as it is not only running on a spinner but also redirecting across the network.
As dev is not replicated and you’re seeing this issue, it would seem to me this problem is not related to upload tasking. If you’re manually uploading the files to the server, as Wayne suggested, either don’t create an image definition or, when creating the definition, disable replication by unchecking the box. Perform your manual steps and, once complete, re-enable replication.
-
@Tom-Elliott you genius
Yes, I was copying files which I captured manually. Your three suggestions, to
- temporarily disable replication,
- not create the image definition,
- or use the dev folder
all work, and I’m not sure why I didn’t think of that… #facepalm