Replication Bandwidth
-
I’m running 1.5.7 on a Fedora 30 VM with 1 main server and 10 nodes. I’ve been using FOG since trunk began. This is the first time I remember replication being throttled oddly. Replication bandwidth for the server and nodes is set to 1 Gb, but images are only transferring at 15-20 Mb. Any ideas on what could be causing this? Until now the replication bandwidth has always been 1 Gb on the server and 0 for the nodes (which I thought meant unlimited), and replication has been running at near-gigabit speeds.
-
I don’t think this is a FOG issue, specifically. BUT let’s look at it a bit more.
FOG uses FTP to move files between the master node and the storage nodes. Specifically, it uses the lftp utility. One of the parameters of the lftp program is bandwidth throttling (which one I don’t remember off the top of my head). Now I “think” the FOG replicator logs will show you the lftp command with its calling parameters. Let’s see if that bandwidth switch is set. If you can’t find it in the logs, you can use this command while the replicator is running to capture the lftp command and its parameters:
ps aux|grep lftp
An FTP process needs to be running at the exact time you run the ps command to get the parameters. So the first step is just to see if FOG is misbehaving. The next step is probably to test by manually FTPing a 100MB or larger file from the FOG server to a storage node. It’s possible that a change in the network infrastructure has slowed the FTP process.
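If the full ps output is too noisy, something along these lines might pull out just the rate-limit setting. This is only a sketch: the function name `show_limit_rate` is made up for illustration, and it assumes the replicator passes a `set net:limit-rate ...` option on the lftp command line.

```shell
# Hypothetical helper: reads process command lines on stdin and prints
# any "net:limit-rate" setting found in them.
show_limit_rate() {
    grep -o 'net:limit-rate [0-9:]*'
}

# Run this while a replication is in progress. The [l]ftp pattern keeps
# grep from matching its own process; "|| true" only keeps the exit
# status clean when no lftp process is running.
ps -eo args | grep '[l]ftp' | show_limit_rate || true
```

If nothing is printed, either no lftp transfer was active at that moment or no limit was passed on the command line.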
-
root 8682 0.1 0.1 227832 8956 ? S 08:04 0:02 lftp -e set xfer:log 1; set xfer:log-file /opt/fog/log/fogreplicator.Sysprep-Win10EDU-X64.transfer.Shermanhigh.log;set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-rate 0:128000000; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/SysprepImagex64" "/images/SysprepImagex64";exit -u fogproject,wXdfeo5CxUKy x.x.x.x
root 8734 0.1 0.1 227832 8892 ? S 08:04 0:02 lftp -e set xfer:log 1; set xfer:log-file /opt/fog/log/fogreplicator.SysprepWin10EDU.transfer.Shermanhigh.log;set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-rate 0:128000000; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/SysprepWin10EDU" "/images/SysprepWin10EDU";exit -u fogproject,wXdfeo5CxUKy x.x.x.x
root 9033 0.1 0.1 228060 9268 ? S 08:04 0:03 lftp -e set xfer:log 1; set xfer:log-file /opt/fog/log/fogreplicator.Win10-1903-UEFI.transfer.Shermanhigh.log;set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-rate 0:128000000; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/Win10-1903-UEFI" "/images/Win10-1903-UEFI";exit -u fogproject,wXdfeo5CxUKy x.x.x.x
bcs 39132 0.0 0.0 215744 896 pts/2 S+ 08:52 0:00 grep --color=auto lftp
This was captured while moving an image of about 8 GB to a single node.
[Mod note] Fixed the entry for readability -Geo
I’ll also note that this connection has been and still is gigabit, and there have been no changes to the infrastructure.
-
@Hanz said in Replication Bandwidth:
set net:limit-rate 0:128000000;
OK, from the call parameters we see this rate limit set. Now I just have to decode it. It looks like it’s 128MB/s, which is just above the 1GbE theoretical maximum (about 125MB/s).
From: https://www.toysdesk.com/2013/11/lftp-limit-bandwidth-upload-download/
set net:limit-rate 0:512000
The first value in net:limit-rate is the download limit and the second value (after the colon) is the upload limit.
So in your case there is no limit for download, and for upload it’s 128MB/s. Unless lftp is doing something strange, it should not be rate limiting the transfer. Since this lftp command runs from the perspective of the FOG master node, the upload rate limits the data speed coming out of the FOG master node.
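To double-check the arithmetic, here is a rough sketch that turns a `download:upload` value in bytes per second into megabits per second. The function name `decode_limit_rate` is hypothetical, and it assumes the two-value form with a colon, as in the command above.

```shell
# Hypothetical helper: decode an lftp "download:upload" limit given in
# bytes per second into Mbit/s. A value of 0 means unlimited.
# Assumes both values are present, separated by a colon.
decode_limit_rate() {
    echo "$1" | awk -F: '{
        n = split("download upload", name, " ")
        for (i = 1; i <= n; i++) {
            if ($i == 0) printf "%s: unlimited\n", name[i]
            else         printf "%s: %d Mbit/s\n", name[i], $i * 8 / 1000000
        }
    }'
}

decode_limit_rate "0:128000000"
```

For the value in this thread that works out to an unlimited download and a 1024 Mbit/s upload cap, so the configured limit sits slightly above what a 1GbE link can carry anyway.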
-
So the next question would be: how could we prove that it is or isn’t the FOG server at fault?
Maybe by stopping the FOG replicator and killing off all of the lftp processes, then manually copying a file from the FOG server to the remote storage node. Then repeat the process from a Windows computer on the same subnet as the FOG server to the same remote storage node. Basically it’s about creating a truth table of what works and what doesn’t. I can say that 3 ongoing FTP processes are probably filling up the 1GbE link on your FOG server with just replication traffic. I am a bit surprised to see 3 sessions running at once, since I thought the FOG replicator was serial in nature, not parallel.
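For filling in that truth table, a small sketch like the one below could turn each manual test into a comparable number. Everything here is illustrative: `throughput_mbit` is a made-up helper, and the test file path, USER, PASSWORD and NODE_IP in the commented commands are placeholders to replace with your own values.

```shell
# Hypothetical helper: convert bytes transferred and elapsed seconds
# into Mbit/s so results from different source/target pairs can be compared.
throughput_mbit() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / secs / 1000000 }'
}

# One possible manual test (placeholders, shown as comments on purpose):
#   dd if=/dev/zero of=/tmp/ftptest.bin bs=1M count=100
#   time lftp -u USER,PASSWORD -e "put /tmp/ftptest.bin -o /images/ftptest.bin; exit" NODE_IP

# Example: 100 MB moved in 50 seconds.
throughput_mbit 100000000 50
```

The example call prints 16.0, which is in the range reported at the start of this thread. Repeating the same measurement against a second node, or from a second machine, shows whether the slowdown follows the server, the node, or the path between them.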
-
@george1421 The weird and suspicious part is that 1.5.5 was the last known working version, 1.5.7… I’ll also note that the parallel behavior has always worked that way, at least for quite some versions now. I could actually transfer a single newly captured image to all nodes at the same time and they would all run at their respective top speeds (I have 4 schools that only have a 100Mb connection and the rest are all gigabit). This speed is definitely new, though.
Note: I just manually transferred the file via FTP and the speed was back to normal at another node. Then I tried the node in question, and it is indeed a port going bad or the cable itself only allowing a very minimal top speed of 10-15 Mb. This can be closed, but I appreciate the new command.