Replication Bandwidth Limiter - not totally working
-
Using svn 4338 cloud 5323
We set up a storage group this morning with two nodes in two different physical locations, one set as master.
The replication started very quickly and consumed the entire 1 Gbps link. We set the replication bandwidth to
500000
which is half a gigabit (500000 Kbps = 500 Mbps). The transfer speed did drop according to the bandwidth chart, but only to 800 Mbps instead of 500 Mbps.
We then lowered the replication bandwidth to
300000
and there was no effect; the bandwidth stayed capped at 800 Mbps. Just letting the @Developers know about this.
-
Looking in the code this is what I see for the replicator. It uses lftp to move the files.
"lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; ".$limit[$i]." mirror -c -R --ignore-time ".$includeFile[$i]." -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first ".$myItem[$i].' '.$remItem[$i]."; exit' -u ".$username[$i].','.$password[$i].' '.$ip[$i]."
The proper limit parameter for lftp is this: net:limit-rate 0:512000 where the first number is the download rate and the second number is the upload rate (in bytes per second, per the lftp man page).
Looking a bit deeper into the code I see this:
if ($limitmain > 0) $limitset = "set net:limit-total-rate 0:$limitmain;";
if ($limitsend > 0) $limitset .= "set net:limit-rate 0:$limitsend;";
$limit[] = $limitset;
I’m questioning the extra period in the second line and the format just seems a bit off.
-
@george1421 The second line’s extra period is suspicious. Google searching
.=
yields nothing at all. Google searching
.= Linux
also yields nothing.
-
First, I’m not a programmer. But I know the dot ( . ) is string concatenation, and .= may be shorthand for “string = string + newstring”. If so, the statement is correct as written.
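For what it’s worth, here is a minimal sketch of how .= behaves in PHP. The function name buildLimitSettings and the sample numbers are hypothetical and not taken from the FOG source; the two if statements mirror the snippet quoted above, with $limitset initialised first so that .= never appends to an undefined variable.

```php
<?php
// Hypothetical helper: builds the lftp "set ..." fragment from two limits.
function buildLimitSettings(int $limitmain, int $limitsend): string
{
    $limitset = '';
    if ($limitmain > 0) {
        $limitset = "set net:limit-total-rate 0:$limitmain;";
    }
    if ($limitsend > 0) {
        // .= appends to the existing string; it is the same as
        // $limitset = $limitset . "...";
        $limitset .= "set net:limit-rate 0:$limitsend;";
    }
    return $limitset;
}

echo buildLimitSettings(0, 62500), PHP_EOL;
// prints: set net:limit-rate 0:62500;
```

So the period is not a typo; it just appends the second setting after the first.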
Right now I can’t think of a way to capture the environment that lftp is running in to see what is actually being set. It may be possible to hack the FOGImageReplicator to insert a few extra log statements to see what is actually being set by logging the variables.
-
-
@george1421 I’m all ears.
-
@Jbob Very nice, but this still does not explain why the bandwidth chart reported 800 Mbps when the limit was set to 500000 Kbps.
-
If you really want to go there, here is what to try on the current svn trunk. (If things go sideways, just rerun the FOG installer to restore the files. No harm done.)
Edit this file:
/var/www/html/fog/lib/service/FOGService.class.php
Search for the first occurrence of the word fragment: limit.
You should see something like this:
$limitsend = $this->byteconvert($StorageNodeToSend->get('bandwidth'));
if ($limitmain > 0) $limitset = "set net:limit-total-rate 0:$limitmain;";
if ($limitsend > 0) $limitset .= "set net:limit-rate 0:$limitsend;";
$limit[] = $limitset;
}
}
unset($StorageNodeToSend);
$this->outall(_(' * Starting Sync Actions'));
foreach ((array)$nodename AS $i => &$name) {
$process[$name] = popen("lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; ".$limit[$i]." mirror -c -R --ignore-time ".$includeFile[$i]." -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first ".$myItem[$i].' '.$remItem[$i]."; exit' -u ".$username[$i].','.$password[$i].' '.$ip[$i]." 2>&1","r");
Insert the following line after: $this->outall(_(' * Starting Sync Actions'));
$this->outall(_(" * Speed limiter settings: $limitset"));
(Note the double quotes: PHP does not expand variables inside single-quoted strings.)
That should make it like this:
unset($StorageNodeToSend);
$this->outall(_(' * Starting Sync Actions'));
$this->outall(_(" * Speed limiter settings: $limitset"));
foreach ((array)$nodename AS $i => &$name) {
Stop and restart the FOGImageReplicator. You may need to delete a file on the storage node for it to see a change. When the replication runs it should output the speed limits to /opt/fog/log/FOGImageReplicator
I’d do this in my test environment, but it’s only partially rebuilt.
-
I’m more interested in knowing if this is really not limiting, or if it’s because of multiple instances of the replication being started at the same time.
-
@Tom-Elliott said:
I’m more interested in knowing if this is really not limiting, or if it’s because of multiple instances of the replication being started at the same time.
We are using two full server installations, with the MySQL stuff pointed at the master. Could it be related to multiple instances of the FOGImageReplicator running on multiple servers at once?
-
What I mean by the info, is I need to see the commands as they’re being sent.
The replicator, as far as I knew, actually did replicate with the proper limiters in place, however, if you’re attempting to replicate multiple images, it does not replicate them sequentially.
There are reasons behind this.
First, if you do it sequentially, the replicator only starts its wait period after the last image completes replication. This, by itself, is not entirely bad, but just imagine a scenario where you have 15 images that need to replicate to 12 separate nodes (all in the same group).
I’m not going to do the math, but our replicator defaults to a 10 minute cycle, and all of the replication must finish before the cycle can wait out its time period. Now limiting is fine and understandable, but it could take hours, days, or even weeks, depending on the sizes of the images, to get one replication cycle complete.
I replicate them, now, asynchronously and read the completion state synchronously so we can get all of the nodes syncing.
Because of this, the bandwidth limiter is kind of a misnomer, I guess, because the limiting is on a per instance basis, not an overall basis. I have not figured out HOW to get it to limit it in whole and if I could, I most definitely would.
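To illustrate the per-instance point, here is a rough model. It is an assumption for illustration, not the replicator’s actual code: each lftp instance is assumed to use its full limit, and the total cannot exceed the physical link. The function name aggregateRateKbps is hypothetical.

```php
<?php
// Rough model only: N instances, each capped at the same rate,
// sharing one link of a fixed capacity.
function aggregateRateKbps(int $perInstanceKbps, int $instances, int $linkKbps): int
{
    return min($perInstanceKbps * $instances, $linkKbps);
}

echo aggregateRateKbps(500000, 1, 1000000), PHP_EOL; // 500000
echo aggregateRateKbps(500000, 2, 1000000), PHP_EOL; // 1000000
echo aggregateRateKbps(300000, 3, 1000000), PHP_EOL; // 900000
```

Under this model, two or more simultaneous transfers at a 500000 limit can fill a 1 Gbps link even though each one respects its own cap.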
-
FWIW: I have seen that if you stop the FOG image replicator while a transfer is underway, the lftp process will continue. If you stop and restart the FOG image replicator multiple times, you might end up with 3 or more lftp processes running at the same time, each with its own bandwidth limit and unaware of the other running processes.
-
@Wayne-Workman @george1421 You’re correct it starts its own instances of the items, and it does NOT kill the original started instances.
I’ve now corrected this and commonized the command starter functions so MulticastManager, Snapin and Image replication will use these methods as well.
This means, under the latest svn, the ImageReplicator, SnapinReplicator, and MulticastManager will now all close their opened commands when the services are stopped/restarted.
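As a sketch of the general idea only (this is not the code that went into svn), a service can keep a handle for every command it starts and terminate them all when it stops. The function names startCommands and stopCommands are hypothetical, and the array form of proc_open assumes PHP 7.4 or newer.

```php
<?php
// Start every command without waiting for it; keep the handles for later.
// $commands maps a name to an argv array, e.g. ['node1' => ['lftp', '-e', '...']].
function startCommands(array $commands): array
{
    $procs = [];
    foreach ($commands as $name => $argv) {
        $pipes = [];
        $descriptors = [1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
        $handle = proc_open($argv, $descriptors, $pipes);
        if (is_resource($handle)) {
            $procs[$name] = ['handle' => $handle, 'pipes' => $pipes];
        }
    }
    return $procs;
}

// Terminate and reap everything that startCommands() launched.
function stopCommands(array $procs): void
{
    foreach ($procs as $proc) {
        foreach ($proc['pipes'] as $pipe) {
            fclose($pipe);
        }
        proc_terminate($proc['handle']); // sends SIGTERM by default
        proc_close($proc['handle']);     // waits for the process to exit
    }
}
```

Calling stopCommands() from the service’s shutdown path is what keeps old transfers from lingering after a restart.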
Hopefully that should limit the clutter of multiple lftp commands using more and more bandwidth. It does not change the fact that multiple instances are started, for all purposes, asynchronously, and I think this is fine.
That is, unless you really want the items to replicate one at a time, which would do the limiting more properly, but would also require that much more time for the data to actually transfer to the nodes/groups receiving it.
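To put a rough number on the one-at-a-time case, here is a back-of-the-envelope estimate. The 20 GB image size is purely hypothetical, the limit is treated as kilobits per second, and protocol overhead is ignored; sequentialSeconds is an illustrative name.

```php
<?php
// Estimate how long strictly sequential replication would take.
function sequentialSeconds(int $images, int $nodes, float $imageGB, int $limitKbps): float
{
    // Seconds for one image to one node: bits to send / bits per second.
    $perTransfer = ($imageGB * 8e9) / ($limitKbps * 1000);
    return $images * $nodes * $perTransfer;
}

// 15 images to 12 nodes, 20 GB each (assumed), limited to 500000 Kbps.
echo sequentialSeconds(15, 12, 20.0, 500000) / 3600, " hours", PHP_EOL;
// prints: 16 hours
```

That is 180 transfers of 320 seconds each, so even at a generous limit a single sequential cycle runs far past the 10 minute default.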
-
@Tom-Elliott In probably 99% of all scenarios, only one image needs to be replicated at any given time. Who uploads two images simultaneously? That would be a rare occurrence, I think.
I’m glad to hear that the code has been cleaned up/improved in the replicator/multicast areas and is more manageable now.
-
I’ve decided to resolve the thread as the limiter IS in fact working. And most people, you’re right, will only be replicating one image or another.