Fog replicator process quits after remote image transfer

  • Hi,
    We have 3 Fogservers: 1 master node located locally, and 2 located remotely. They are connected over the WAN (via modems and firewalls) using IPSec. Our OS is Ubuntu 12.04, and FOG version 0.32.
    When uploading an image from the host, the replicator begins transferring to the first remote node shortly after the upload. It takes many hours as it only transfers at less than 1Mbit/sec. When it completes, the replicator process stops and does not restart. It also will not transfer to the second remote node. Both IPsec tunnels are set with the same parameters, and I have had them working in the past. It appears to be something that has changed since the 12.04 upgrade, but I can’t be certain of this.

    Log output below:

    Any help on this or similar experiences would be appreciated.
    Thanks, Mark

  • Moderator

    What is your normal data transfer rate through the IPSec tunnel using FTP, outside of FOG? If it’s slow doing a normal FTP, HTTP, or SMB file transfer from site to site, then you might try troubleshooting it from a VPN IPSec performance angle instead of FOG.
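    To put a number on the tunnel speed, one option is to time a plain transfer of a file of known size and convert the result to Mbit/s. The sketch below is only an illustration: `mbit_per_sec` and `timed_rate` are hypothetical helper names, and the transfer command is whatever you normally use outside of FOG.

```shell
# Convert BYTES transferred in SECONDS into Mbit/s (1 Mbit = 1,000,000 bits).
mbit_per_sec() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", (b * 8) / (s * 1000000) }'
}

# Run a transfer command for FILE, time it with one-second resolution,
# and print the achieved rate. Usage: timed_rate FILE COMMAND [ARGS...]
timed_rate() {
    file=$1
    shift
    start=$(date +%s)
    "$@" || return 1
    end=$(date +%s)
    elapsed=$((end - start))
    # Avoid dividing by zero when the transfer finishes within one second.
    [ "$elapsed" -gt 0 ] || elapsed=1
    size=$(wc -c < "$file")
    mbit_per_sec "$((size))" "$elapsed"
}

# Example: a 100 MiB file (104857600 bytes) that took 900 seconds.
mbit_per_sec 104857600 900   # prints 0.93
```

    For a real measurement you would pass your own command, for example `timed_rate /images/testfile100mb scp /images/testfile100mb user@remote-node:/tmp/`, where the user and host are placeholders. If the result is still around 1Mbit/s, the limit is in the tunnel or the WAN link rather than in FOG.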

  • That works well chad-bisd, thanks. It appears to transfer with a maximum combined bandwidth of about 1Mbit/s. So transferring one file at a time or two in parallel will most likely take a similar overall time. Not sure what regulates this rate, can’t see any config in my IPSec tunnel to adjust speed. It appears that the other issue of the replication process “quitting” had already gone away before I used your modified file. Perhaps it was an Ubuntu 12.04 update that fixed that. Thanks again.

  • That sounds great Chad-bisd! Just what is needed for remote sites. Sorry for not replying sooner, I’ve had other things happening. I’ll try this code next week and let you know the results.

  • Moderator

    Ok. I’ve done some mods. The files actually transfer to all storage nodes in the storage group at the same time. The tricky part is the output to the log file. I can easily start multiple file transfers to the storage nodes, but I can only monitor 1 at a time. So although you see in the log file that all the storage nodes started syncing, you won’t get the results of the rest of the storage nodes until the first one finishes all its transfers.

    If you had 5 nodes, you’ll see all of them start, then the progress for each file on node1 until node1 has finished all its files, then the status for node2 until it’s finished, and so on. The transfers are all happening at the same time, but are reported to the log file only after each node finishes.
    1. Back up /opt/fog/service/FOGImageReplicator/FOGImageReplicator
    2. Save the attached file
    3. Unzip it
    4. Copy the FOGImageReplicator file to /opt/fog/service/FOGImageReplicator/FOGImageReplicator, overwriting the existing file.
    5. Restart the FOGImageReplicator daemon using: sudo /etc/init.d/FOGImageReplicator restart
    6. Monitor the log file: sudo tail -f /opt/fog/log/fogreplicator.log
    7. Monitor the /images/ folder on your storage nodes to make sure the files are getting there.
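    The backup and overwrite steps can be sketched as a small helper. This is only an illustration, not part of FOG: `replace_with_backup` is a hypothetical name, and the demo runs in a temporary directory so it touches nothing under /opt/fog.

```shell
# Back up TARGET next to itself with a timestamp suffix, then overwrite it
# with NEW. Prints the backup path so the original can be restored later.
replace_with_backup() {
    target=$1
    new=$2
    backup="$target.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$target" "$backup" || return 1
    # Copying over an existing file keeps the target's permissions.
    cp "$new" "$target" || return 1
    echo "$backup"
}

# Demo with throwaway files standing in for the old and new script.
demo=$(mktemp -d)
echo "old script" > "$demo/FOGImageReplicator"
echo "new script" > "$demo/FOGImageReplicator.new"
backup=$(replace_with_backup "$demo/FOGImageReplicator" "$demo/FOGImageReplicator.new")
cat "$demo/FOGImageReplicator"   # prints: new script
cat "$backup"                    # prints: old script
```

    On a real server the same two copies would be run as root against /opt/fog/service/FOGImageReplicator/FOGImageReplicator, followed by the daemon restart. Copying the saved backup over the script again undoes the change.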
    If you want to create a test dummy file so you have something to replicate without replicating a huge image, you can use:
    cd /images
    sudo dd if=/dev/zero of=/images/testfile100mb count=1024 bs=102400
    This creates an approx 100MB file in the /images/ folder named testfile100mb.
    Restart the FOGImageReplicator daemon using step 5 from above and watch the log file using step 6.
    Have an SSH session open to your storage nodes and you can watch the files being created on each node in parallel. I just “ls -l” on the /images/ directory on the storage nodes to see if the files are being created. If you are quick and have each storage node open, you can watch them all get the file.
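    Instead of reading file sizes off “ls -l” by eye, the copy on each node can be checked against the size dd should have produced. A minimal sketch, assuming the helper names `expected_size` and `check_size` (both made up for this example); the demo uses a tiny file in a temporary directory so it runs anywhere.

```shell
# Size in bytes of a file written by dd with the given COUNT and BS.
expected_size() {
    echo $(( $1 * $2 ))
}

# Succeeds only if FILE exists and is exactly SIZE bytes long.
check_size() {
    [ -f "$1" ] && [ "$(( $(wc -c < "$1") ))" -eq "$2" ]
}

# Demo: a 4 KiB file instead of the 100MB one.
demo_dir=$(mktemp -d)
dd if=/dev/zero of="$demo_dir/testfile" count=4 bs=1024 2>/dev/null
if check_size "$demo_dir/testfile" "$(expected_size 4 1024)"; then
    echo "size ok"
else
    echo "size mismatch"
fi
```

    For the test file above, `expected_size 1024 102400` gives 104857600, so on a storage node the check would be `check_size /images/testfile100mb 104857600`. A smaller size usually just means the transfer is still running.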
    All the normal disclaimers apply. No warranties, assurances, or guarantees this will work or not work, help or harm your systems. I can reasonably say it works based on about 4 hours of coding and testing using 1 main server and 2 storage nodes in VirtualBox and syncing 2 files of 100MB and 1GB respectively.


  • Moderator

    Looks like we can convert $process into an array of resources, iterate through all the storage nodes in the group that are not the master storage node, and kick off the popen() for each one, storing the resulting resource in the $process array. Then iterate through the $process array to output the fgets() info for each FTP transfer call, and iterate through it again outside the for loop to call pclose() for each element of the $process array.
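    The same start-everything-then-read-in-order pattern can be tried out in plain shell. This is an analogy rather than the replicator's PHP: background jobs stand in for popen(), per-node log files for the fgets() output, and wait for pclose(). `fake_transfer` and `replicate_all` are hypothetical names, and the transfer is simulated with sleep.

```shell
# Stand-in for one node's FTP transfer: reports each file as it "finishes".
fake_transfer() {
    node=$1
    for file in image1 image2; do
        sleep 1
        echo "$node: finished $file"
    done
}

# Start a transfer to every node at once, then report node by node.
# Node names are assumed to be simple words without spaces or colons.
replicate_all() {
    logdir=$(mktemp -d)
    pids=""
    # 1) Kick off all transfers in the background and remember each PID.
    for node in "$@"; do
        echo "starting sync to $node"
        fake_transfer "$node" > "$logdir/$node.log" 2>&1 &
        pids="$pids $!:$node"
    done
    # 2) Wait for the nodes in order; each node's output is printed only
    #    after that node has finished, even though all ran in parallel.
    for entry in $pids; do
        pid=${entry%%:*}
        node=${entry#*:}
        wait "$pid"
        cat "$logdir/$node.log"
    done
    rm -rf "$logdir"
}

replicate_all node1 node2
```

    Both nodes are announced first, then all of node1's lines appear, then node2's. The run takes about as long as the slowest single node instead of the sum of all nodes.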

    It might be a week or two before I can test out the parallel version. In the meantime, it might be easier to get the node info into an array so we don’t rely on the $conn staying open for hours with no activity. Your images will sync very slowly, waiting on one to finish before beginning the next, but you’ll at least have them.

    Maybe an option to specify you have slow WAN links to your storage nodes and to use parallel syncing can be added to the UI.

  • Thanks chad-bisd.
    Parallel syncing would be great, even if the process still times out it will have replicated both sites first.

  • Moderator

    I’m betting the MySQL connection is timing out because it is taking so long to push the image files to the first storage node.

    Looking at /opt/fog/service/FOGImageReplicator/FOGImageReplicator script line 88, the variable $res is a resultset from the mysql_query() call above it. Right below it is the call to transfer the image file to the first node via ftp and wait for the transfer to complete before moving on. This is taking 6 hours for you, during which time the $conn used by $res is probably closed since it has not been actively used for a very long time.

    The solution is probably to either run the node FTP updates in parallel OR bring the result set of the query into a multidimensional array and then close the connection since it will not be needed anymore. Then you can take your time “while’ing” through the list of storage nodes and sending the images via FTP.
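    The second option, reading the whole node list up front so nothing depends on the database while the slow transfers run, looks roughly like this in shell. It is only an analogy for the PHP change: `list_nodes` stands in for the MySQL query, `send_images` for the FTP push, and the node names and addresses are made up.

```shell
# Hypothetical stand-in for the storage-node query: one "name host" per line.
list_nodes() {
    printf '%s\n' "node1 192.0.2.10" "node2 192.0.2.20"
}

# Hypothetical stand-in for the slow FTP transfer to one node.
send_images() {
    echo "syncing to $1 at $2"
}

# Loop over a snapshot of the node list that is already held in memory.
sync_from_snapshot() {
    while read -r name host; do
        [ -n "$name" ] || continue
        # stdin is redirected so a real transfer command cannot eat the list.
        send_images "$name" "$host" < /dev/null
    done <<EOF
$1
EOF
}

# Take the snapshot first; the "query" is finished before any transfer starts.
nodes=$(list_nodes)
sync_from_snapshot "$nodes"
```

    However long each transfer takes, the loop only reads from the saved copy, so there is no open connection left to time out.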

    If I can figure out the code I’ll post it with instructions. I also need to look at the dev code to see how it’s being handled in 0.33 to see if that code could be handled better.