LFTP mirror copies files that are still being written

dolf

I’ve noticed that the FOGImageReplicator service calls lftp, which mirrors the /images folder from one node to another. However, it also tries to sync files that are still being written to. I’m not sure how smart lftp is, but it will waste traffic at best, or copy incorrectly at worst.

This is a minor issue, since there is the dev folder, and if the last step is always to move the image being worked on from dev to dev/.., it happens so quickly that lftp would (AFAIK) not start copying before it arrives.

Oh, wait… if for some reason /images/dev and /images are not on the same filesystem, mv is not atomic, and lftp will try to transfer incomplete files.
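A quick way to check whether this concern applies to a given server is to compare the device IDs of the two directories. The sketch below is only a heuristic, and the paths in the usage comment are the ones discussed in this thread, not something read from FOG's configuration. A rename within one filesystem is atomic; across filesystems, mv falls back to copy-then-delete, so a partially written file is visible at the destination while the copy runs.

```python
import os


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same device ID.

    Heuristic only: equal st_dev values normally mean a rename between
    the two directories stays on one filesystem and is therefore atomic.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Example usage (paths assumed from this thread):
#   same_filesystem("/images", "/images/dev")
```

If this returns False, a move out of dev is a copy, and a replication pass that starts during it could see an incomplete file.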

But maybe lftp is smart enough to account for this? I couldn’t find it online. I did, however, find a way to exclude files with a modification time of less than a few minutes ago: http://serverfault.com/a/693787/301389
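To illustrate the modification-time idea, here is a minimal sketch that lists only files which have not been modified for a given number of seconds. It is not what FOGImageReplicator or the linked answer does; the function name `settled_files` and the 300-second default are assumptions made up for this example.

```python
import os
import time


def settled_files(root, min_age_seconds=300, now=None):
    """Yield files under root whose mtime is at least min_age_seconds old."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip anything modified too recently; it may still be growing.
            if now - os.stat(path).st_mtime >= min_age_seconds:
                yield path
```

The obvious weakness is that a writer which pauses for longer than the threshold would still be treated as finished.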

dolf

Sure enough, when I restarted FOGImageReplicator, lftp decided that the file (the one it tried to mirror while I was writing to it) is no good, promptly deleting it on all the slaves and sending it again. This is not a bad outcome. But it means we sent unnecessary traffic before.

Tom Elliott

I’m just guessing here, but are you writing the files you need to a node other than the master? I agree we shouldn’t overwrite the data if it’s currently being written. The replication process only runs from master nodes. When a file is associated with multiple groups, the designated “primary group” master node replicates to the remote groups’ master nodes.

To make sure a file doesn’t accidentally get deleted from where you’re placing it, you should put the files on the primary group’s master node (or the master node, if there is only one storage group).

If this is not the case, the next cycle compares the local and remote copies and, if they don’t match, will delete the files and begin the sync again (if the files already exist on the master/primary master).

This is all speculation, and I already have an idea of how to correct this. I don’t think relying on a five-minute modification-time threshold is an accurate enough check to warrant a delete. A better approach would be to check whether the local file is currently being written to and skip the replication delete if it is. Though I’d probably just sync the whole images directory…
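One possible way to approximate "is this file being written to right now" is to sample its size and modification time twice and see whether anything changed. This is a rough sketch of that idea, not how FOG implements it; `looks_stable` and its default wait are hypothetical names and values chosen for illustration.

```python
import os
import time


def looks_stable(path, wait_seconds=2.0):
    """Heuristic: True if size and mtime are unchanged after wait_seconds."""
    before = os.stat(path)
    time.sleep(wait_seconds)
    after = os.stat(path)
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)
```

A replication pass could skip the delete for any file where this returns False and retry on the next cycle. A writer that stalls for longer than the wait would still slip through, so this reduces the problem rather than eliminating it.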

Most likely I’m misreading too.

dolf

@Tom-Elliott I was writing on the master only. The problem is that lftp should wait until I’m done writing before it sends the file to the slaves.

Wayne Workman

Why would you set up /images/dev on a different disk? I would not recommend that.

dolf

@Wayne-Workman I didn’t, I’m just speculating. One possible use case might be if you have an SSD for dev to upload faster and a big slow HDD for the long term storage.

Wayne Workman

@dolf capturing is a lot slower than deploying. If anything, use an SSD for both.

dolf

@Wayne-Workman That sounds like a feature request… Having a deploy folder on SSD. Not something I would need, though. We’re drifting off topic.

Oh and by upload I meant capture. Stuck in the old terminology…

Wayne Workman

I should explain to the best of my abilities how replication happens, where relevant to this topic.

The FOG image replicator replicates only images that have a definition in the database. This means the images directory is not necessarily replicated in full. For instance, if you delete an image in FOG but do not delete the image data on the hard disk, that image will no longer be replicated. All uploads go to the dev directory and are moved to the images directory once the upload is complete. The dev directory is never replicated. I think the only case where replication might try to copy a file that is currently being written to is when the images directory and the dev directory are on two different disks. So in probably 99.9% of cases, I don’t see how lftp would try to replicate an image that is currently being written to.
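The selection rule described above can be sketched as a simple filter. This is an illustration only: `replication_candidates` is a hypothetical helper, and the list of defined image names stands in for whatever the real service reads from the database.

```python
import os


def replication_candidates(images_dir, defined_image_names):
    """Return entries in images_dir that have a definition, excluding dev."""
    defined = set(defined_image_names)
    return sorted(
        entry
        for entry in os.listdir(images_dir)
        # dev is never replicated, and undefined image data is ignored.
        if entry != "dev" and entry in defined
    )
```

Under this rule, image data left on disk without a definition is simply never picked up.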

Tom Elliott

When you say the file is being written to, are you manually uploading it yourself? Capture works out of dev, which is not replicated. However, moving from dev into /images would work. In most cases I think /images and dev are on the same disk, probably more often a spinner than an SSD. If it’s a spinner reached via SAN, I imagine that being the slowest setup of all, as it’s not only running on a spinner but also going across the network.

As dev is not replicated and you’re seeing this issue, it would seem to me this problem is not related to upload tasking. If you’re manually uploading the files to the server then, as Wayne suggested, either don’t create an image definition or, when creating the definition, disable replication by unchecking the box. Perform your manual steps and, once complete, re-enable replication.

dolf

@Tom-Elliott you genius

Yes, I was copying files which I captured manually. Your three suggestions, to

- temporarily disable replication,
- not create the image definition, or
- use the dev folder

all work, and I’m not sure why I didn’t think of that… #facepalm
