FOG storage node and data replication
-
Rebooting the storage node appears to have started the replication of /images/drivers, but so far only the first file has replicated.
Looking at /opt/fog/logs/fogreplicator.log on the master node I see this error.
[10-22-15 8:19:52 pm] * shvstorage - SubProcess -> mirror: Fatal error: 500 OOPS: priv_sock_get_cmd
[10-22-15 8:21:08 pm] * shvstorage - SubProcess -> Mirroring directory `drivers'
[10-22-15 8:21:08 pm] * shvstorage - SubProcess -> Making directory `drivers'
[10-22-15 8:21:08 pm] * shvstorage - SubProcess -> Transferring file `drivers/DN2820FYK.zip'
The zip file is the only thing in /images/drivers on the storage node.
-
You’re running FOG 1.2.0, or trunk?
-
It’s trunk build 5040.
Looking at the drivers folder, I have a combination of files and sub folders. Depending on how smart the replicator is, it may not handle or traverse the sub folders.
The structure of the drivers folder is as follows:
/images/drivers/OptiPlex7010.zip
/images/drivers/OptiPlex7010/audio
/images/drivers/OptiPlex7010/audio/<many files and sub folders>
/images/drivers/OptiPlex7010/video/<many files and sub folders>
<…>
I suspect that the replicator was only designed to copy the image folder and one level of files below.
-
@george1421 said:
I suspect that the replicator was only designed to copy the image folder and one level of files below.
That’s likely the case.
You could just put them in the web directory and grab them via wget on the hosts…
-
My preference would be to not do something out of band if possible. It does appear that creating a fake image with its path set to /images/drivers is choking the FOG replicator because of the sub folders, so I’m going to back out that change, since no replication is happening while that error persists.
I haven’t dug into the fog replicator code yet, but I’m wondering if rsync wouldn’t be a better method to replicate the images from the master node to the other storage nodes. Rsync would give us a few more advanced options than a normal file copy, like data compression and only syncing files that have changed.
-
Following up on this in the AM, I now see all of the ‘files’ that were in /images/drivers on the master node on the storage node as well. The FOG replicator appears to copy only the files directly under the faux drivers image. What is missing now are the sub folders and files under /images/drivers (the actual driver files we push to the target just after imaging). So the idea to create a fake image kind of worked, but not the way I needed it. As long as your files are located one level below the fake image folder, this is a workable solution.
Being very self centered, I would like to see FOG support something like rsync to sync/manage the images folder, especially if the storage node is located across a slow WAN connection. These image files tend to be very large, and we could benefit from the in-transit data compression and intelligent replication such a tool could provide.
-
As the images are already compressed I’m not sure if rsync with compression would be of any benefit or might actually make copy times worse.
-
@Joseph-Hales said:
As the images are already compressed I’m not sure if rsync with compression would be of any benefit or might actually make copy times worse.
Testing shows that data compression “does work”, but the actual amount of compression does not add any value in a WAN transfer. I took a 4.8GB captured image and gzip’d it. The resultant image was only 50MB smaller than the source image (not enough to really matter). But the other benefits of rsync would still be worth looking into for my project.
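For anyone who wants to repeat that check on their own images, here is a minimal sketch of the measurement. The function name `gzip_savings` is made up for this post, and it only reports sizes; it does not modify the file. On the numbers above (50MB saved out of 4.8GB) the saving works out to roughly 1%.

```shell
# Hypothetical helper: report how much gzip would shrink a file.
# Prints "<original_bytes> <gzipped_bytes> <percent_saved>".
gzip_savings() {
  [ -f "$1" ] || return 1
  # Arithmetic expansion strips the padding some wc implementations add.
  orig=$(( $(wc -c < "$1") ))
  [ "$orig" -gt 0 ] || return 1
  # Compress to stdout and count bytes; the file on disk is untouched.
  packed=$(( $(gzip -c "$1" | wc -c) ))
  echo "$orig $packed $(( (orig - packed) * 100 / orig ))"
}

# Example (path is a placeholder): gzip_savings /images/SOMEIMAGE/d1p2.img
```

A percentage near zero suggests the file is already compressed and that adding rsync's -z would mostly cost CPU time.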
-
@george1421 can rsync be used without additional compression?
-
Yes it can. The image files appear to be packed pretty well as they are.
I need to look a bit more into the fog replicator. I noticed that if I restart the fog replicator, it goes through and starts replicating all of the files again, with 100% CPU on the master node. It is almost as if the replicator service was in a tight loop until the file finished copying over. This is not very kind behavior, which is making me think that I should just go the rsync route and disable the fog replicator service altogether. Understand that at this point I don’t have documented evidence that the fog replicator is at fault, only what I saw just before calling it a week.
-
@george1421 I’ve only had the opportunity to mess with the fog image replicator a handful of times in remote sessions with others. I can’t say I have seen this behavior but I’m not denying it either.
If you are able to configure rsync to properly follow all the settings set in the DB for replication and are able to document how to set it up, I wouldn’t be surprised if it was adopted.
There are a lot of settings, by the way… @Tom-Elliott explains it best… but from what I gather:
- The master node replicates to all nodes in its storage group.
- An image can belong to several storage groups. When this is the case, the master that has the image replicates to the other master. From there, the above step applies.
- rsync must use the settings defined in /opt/fog/.fogsettings for the ftp credentials.
- rsync must use the replication bandwidth limitations set in the database.
- rsync must not re-compress images nor change files in any way.
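As a rough illustration of the bandwidth point, here is a small sketch that converts a limit into the value rsync's --bwlimit expects. It ASSUMES the stored limit is in kilobits per second, which should be checked against your FOG version; the function name `kbps_to_bwlimit` is made up. Without a suffix, rsync reads --bwlimit in units of 1024 bytes per second, and a value of 0 means no limit.

```shell
# Hypothetical helper: convert a limit in kilobits/s (ASSUMED unit of the
# FOG setting) to rsync's --bwlimit unit (1024 bytes per second).
kbps_to_bwlimit() {
  limit=$(( $1 * 1000 / 8 / 1024 ))
  # rsync treats --bwlimit=0 as "unlimited", so never round a real limit to 0.
  if [ "$1" -gt 0 ] && [ "$limit" -lt 1 ]; then
    limit=1
  fi
  echo "$limit"
}

# Example use (no -z, so nothing is compressed in transit):
#   rsync -a --bwlimit="$(kbps_to_bwlimit 8000)" /images/ fog@192.168.1.56:/images/
```

Leaving out -z covers the "must not re-compress" point for the transfer itself.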
-
-
Understand I’m not knocking the way it is today. (Actually from another recent thread it gave me an idea how to change this multi site POC by doing full installs at each site then pointing them back to the master node for the database information). The 1.x version of FOG that Tom revived has been great. For my deployment I see two issues/concerns.
- The current fog replicator only copies the files under the image name directory. This is perfect for image replication. But in my case I have something slightly different, in that I have a driver structure in the images directory that I need to replicate to all storage nodes. I need all storage nodes to have copies of all files from the master node.
- On the surface I see some strange behavior with the fog replicator consuming 100% CPU. This may be related to having that faux drivers image in the system, or something else going on. I also noted in the fog replicator log that when I restarted the fog replicator service (when debugging the 100% CPU issue) it started copying files that already existed on the storage node. For example:
[10-23-15 1:24:43 pm] * ncystorage - SubProcess -> Mirroring directory `WIN7PROSP1X86B03'
[10-23-15 1:24:43 pm] * nycstorage - SubProcess -> Removing old file `WIN7PROSP1X86B03/d1p2.img'
[10-23-15 1:24:43 pm] * nycstorage - SubProcess -> Transferring file `WIN7PROSP1X86B03/d1p2.img'
I will look into rsync to see what commands it supports directly. It may be possible to stitch it into the current deployment by replacing the lftp calls (just a guess) with rsync calls. But you did give me a few ideas to check into and places to look for settings, thank you.
-
-
@george1421 I’m very confused.
Currently the command we run for this is:
lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; [<bandwidth limits if any>] mirror -R --ignore-time [-i <image folders or files if group->group transfer>] -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first <what we are sending> <destination removing if existing>; exit' -u <username>,<password> <ip of node> 2>&1
Basically we are sending all files recursively. I wonder if it’s just timing out as it’s sending.
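One way to see what a run would send without actually sending it is lftp's mirror --dry-run, which prints the planned actions instead of performing them. The sketch below only prints a trimmed-down version of the command so it can be reviewed first. The function name and all four arguments are placeholders, and the password is deliberately left out of the printed command.

```shell
# Hypothetical helper: print (not run) an lftp dry-run mirror command.
# $1 = local source dir, $2 = remote destination dir, $3 = user, $4 = node address
lftp_mirror_preview() {
  printf "lftp -e 'set net:max-retries 10; set net:timeout 30; mirror -R --dry-run --ignore-time -vvv %s %s; exit' -u %s %s\n" \
    "$1" "$2" "$3" "$4"
}

# Example: lftp_mirror_preview /images /images fog 192.168.1.56
```

If the dry run lists files that already exist on the node, that would point at the comparison step rather than at a timeout.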
All of the options can be seen here:
-
@george1421 said:
(Actually from another recent thread it gave me an idea how to change this multi site POC by doing full installs at each site then pointing them back to the master node for the database information).
It was originally Tom’s idea. And it’s proven to work. I’ve just been spreading the word.
-
Here is the command I’m working with so far.
rsync -arvPh --bwlimit=5000 /images fog@192.168.1.56:/images/
Where:
-a == archive
-r == recursive
-v == verbose (gives file names as it transfers, for my benefit)
-P == show progress stats (for my interactive benefit)
-h == display numbers in a human-readable format (for my interactive benefit)
--bwlimit == bandwidth limitations in KB/s
I’m going to let it run overnight to see where I end up.
One interesting thing I found with rsync is that it runs in restartable mode. I stopped the transfer of an image mid-stream. When I restarted the command, it thought for a bit and then resumed the transfer from where I broke the connection.
It looks like from Tom’s post that I may need to include the --exclude switch to exclude files from being copied over.
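Two details of the command above are worth a second look. With rsync, a source written as /images (no trailing slash) copies the directory itself, so a destination of /images/ ends up with /images/images; writing the source as /images/ copies its contents instead. Also, -a already implies -r. Here is a sketch that prints an adjusted command, with excludes borrowed from the lftp command Tom posted (that these are the right excludes for rsync is an assumption). The function name is made up, and it only prints the command rather than running it.

```shell
# Hypothetical helper: print an rsync command with a trailing-slash source
# and the dev/, ssl/ and CA/ excludes taken from the lftp command.
# $1 = source dir, $2 = destination (user@host:/path/), $3 = --bwlimit value
fog_rsync_cmd() {
  src="${1%/}/"   # make sure the source ends in a slash: copy contents, not the dir
  printf 'rsync -avPh --bwlimit=%s --exclude=dev/ --exclude=ssl/ --exclude=CA/ %s %s\n' \
    "$3" "$src" "$2"
}

fog_rsync_cmd /images fog@192.168.1.56:/images/ 5000
```

Adding --dry-run to the printed command before running it for real is a cheap way to confirm the paths land where expected.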
Some caveats so far:
It appears that passing the password inline isn’t possible. It may be possible to get around this with ssh keys.
Rsync must be installed on both the master and storage nodes for it to work correctly.
You can use an ssh tunnel to encrypt the data in motion between the master node and storage node if you need additional data protection. This may be of little value if you are transferring data inside your organization.
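A sketch of the ssh key route, for illustration only. The key path and node address are placeholders, the function names are made up, and a key without a passphrase is a trade-off you would need to be comfortable with for unattended replication.

```shell
# Placeholders; override KEY and NODE for your environment.
KEY="${KEY:-$HOME/.ssh/fog_replication}"
NODE="${NODE:-fog@192.168.1.56}"

# One-time setup: create a passphrase-less key and install it on the node.
# ssh-copy-id asks for the node's password once.
setup_replication_key() {
  ssh-keygen -t ed25519 -N '' -f "$KEY" &&
  ssh-copy-id -i "$KEY.pub" "$NODE"
}

# Print the transport string to hand to rsync's -e option. BatchMode makes
# ssh fail instead of prompting, which suits a background service.
rsync_ssh_transport() {
  printf 'ssh -i %s -o BatchMode=yes\n' "$KEY"
}

# Example: rsync -a -e "$(rsync_ssh_transport)" /images/ "$NODE":/images/
```

Since rsync runs over ssh here, the data in motion is encrypted as a side effect.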
ref: http://www.tecmint.com/rsync-local-remote-file-synchronization-commands/
-
@Wayne-Workman said:
@george1421 said:
(Actually from another recent thread it gave me an idea how to change this multi site POC by doing full installs at each site then pointing them back to the master node for the database information).
It was originally Tom’s idea. And it’s proven to work. I’ve just been spreading the word.
Well, what I’m looking at is a multi site setup where each site would have a local storage node and a central master fog server at HQ. The idea is to start the deploy from the master node but have the clients deploy from the local storage nodes. But looking into a storage node a bit more, it looks like the PXE environment isn’t set up or the install isn’t complete (but I just did a few quick checks). But the idea of doing a full install at the remote sites and having them reference the master node’s database is brilliant. That way I have a full fog install at each site but only one database where everything exists.
If I can get the replication bits to work like I need, I think I’ll have a solid solution.
-
Well I guess I just need to set it up and go away for the weekend.
I just ran out of disk space on the storage node. Looking to see where the space went, I looked into the drivers folder, and the sub folders and driver files were there. So, circling back to Wayne’s first comment to create a faux drivers image: given enough time, the system as is will replicate the drivers folder and all sub files and folders over to the storage node. That still doesn’t explain the 100% CPU usage of the fog replicator service. But the system does work as is.
Do I think the rsync method is better than ftp? Yes. Do I think I can set up this POC system as is without much hassle? Yes.
-
@george1421 I’m curious how you’re making the clients get said drivers from the storage nodes ? It’s exported as read-only via NFS and the other available option without any changes is FTP.
You could use a secured Samba share for this… There is a script that will do it for you on Fedora/CentOS here: https://forums.fogproject.org/topic/5145/script-to-install-samba-with-settings-for-fog
-
@Wayne-Workman said:
@george1421 I’m curious how you’re making the clients get said drivers from the storage nodes ? It’s exported as read-only via NFS and the other available option without any changes is FTP.
Well that’s the bits I haven’t worked out yet. I needed to get the drivers to the storage node. On the master node today I’m running a post install script to copy the correct drivers to the target computer. It’s possible that I may not understand the concept of the storage node just yet. I may have to rethink my position. Without the files I can only guess.
But if I run a full install at the remote site that may address the driver deployment issue.
-
After about 12 hours of running, the FOG Replicator service is still at 100% CPU utilization. It appears to be working as it should, moving files from the master node to the storage node. So it IS working, just with high CPU usage. I tried to poke around in the code a bit and add 20 second sleep statements to see if I could find where it’s looping uncontrolled (just a guess). I suspect it’s somewhere after lftp is launched, when it enters a task wait function that should wait until the lftp file copy is done. But from there I lost the trace (and btw I’m not a programmer, only a good guesser).
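To illustrate the kind of wait I mean (this is NOT the FOG replicator code, only a shell sketch of the general pattern), a loop that checks on a background transfer should sleep between checks. The same loop without the sleep would pin a CPU core for as long as the transfer runs. The function name is made up.

```shell
# Hypothetical illustration: wait for a background process started by this
# shell without spinning. $1 = PID of the background process.
wait_for_transfer() {
  pid="$1"
  # kill -0 sends no signal; it only tests whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    sleep 1   # yield the CPU between checks instead of looping flat out
  done
  wait "$pid" # collect and return the transfer's exit status
}

# Example: some_transfer_command & wait_for_transfer $!
```

If the service really is polling without a pause, that would fit what I’m seeing, but that remains a guess until someone who knows the code confirms it.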
I think I’ll need to leave this to the developers to take a peek at.