Fog Storage Nodes on Metered Connections Are Killing Us
When using the Location plugin, with one storage node per location, how does one turn off all file replication from the Main FOG Server to a Storage Node, but still allow the FOG Client to use the contents of the storage node at the location? We’re assuming that when one unchecks replication on Snapins and Images, the FOG system will no longer use the content on the Storage Node (PXE, Snapins and Images). Our content is up to date on all storage nodes (and unchanging), so we want the client to use the remote content, but we want the Main FOG server to stop talking to the Storage Nodes.
We’re using FOG 1.5.0 RC9 in production, with 8 storage nodes, all separated from the Main FOG server by VPN. We’re using the Location Plugin, so we have 9 locations total, one storage node per location. We have 40 or so Snapins (all of which are 15Kbytes) and 4 images, all configured in the FOG GUI to replicate from the Main FOG server to each storage node.
We’re seeing 1 GB/day of traffic between the Main FOG server and each Storage Node, with most traffic going from the Main FOG server to the storage node (70/30 split). 4 of our Storage Nodes are at sites with 4G LTE (metered) connections. This amount of data per day is unsustainable and is costing thousands of dollars a month.
We have set all Storage Nodes to Replicate at 5 kb/s, hoping this would solve the data use issue, but it has had no effect.
Our content (Images and Snapins) is very static. We would like to manually sync to storage nodes if we have any changes at the Main FOG Server and we’ll do this very infrequently. We already have scripts for this. We would like to stop replication from the Main FOG Server to the Storage Nodes but still use the Storage Nodes and Location Plugin as if replication were constantly running.
If we can’t do this, we’ll have to turn FOG off everywhere. We just can’t sustain the cost.
Any suggestions would be appreciated.
I’ve wanted to rewrite the replication stuff for a while. A brand-new daemon written in Python3 that strictly follows the existing behaviors of replication - but with extra features like operation times. I’d use all native Python3 calls, no subprocess calls, no shell commands.
Replication wasn’t responsible for any of the traffic we’re seeing. We disabled replication in the FOG Configuration area and still saw the same 2 GB per day per FOG Storage Node. Our bandwidth monitor showed the traffic to be both TCP and HTTP. We then stopped the FOGImageReplicator, FOGSnapinReplicator, and FOGSnapinHash services. Still no change. We saw a constant 160-190kb/s of traffic from the Main FOG server to the Storage Node. So, I captured packets - only 1000, but that took only 5 seconds. I saw what looked like the storage node replication log transmitted from the storage node to the main FOG Server. Most of the traffic was from the main FOG server to the storage node, but I couldn’t tell much from the capture.
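As a sanity check on those numbers: if the 160-190kb/s reading is in kilobits per second (an assumption, since the post doesn’t say which unit the monitor reports), a constant stream at that rate works out to roughly the 2 GB per day mentioned above. A small Python sketch of the arithmetic:

```python
# Convert a sustained link rate into a daily transfer volume.
# Assumption: "kb/s" means kilobits per second (1 kb = 1000 bits)
# and GB means decimal gigabytes (10**9 bytes).

SECONDS_PER_DAY = 24 * 60 * 60


def daily_gigabytes(kilobits_per_second: float) -> float:
    """Return decimal gigabytes moved in one day at a constant rate."""
    bytes_per_second = kilobits_per_second * 1000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e9


if __name__ == "__main__":
    for rate in (160, 190):
        # 160 kb/s -> 1.73 GB/day, 190 kb/s -> 2.05 GB/day
        print(f"{rate} kb/s -> {daily_gigabytes(rate):.2f} GB/day")
```

If the monitor were reporting kilobytes per second instead, the daily figure would be eight times higher, so the kilobit reading is the one that lines up with what was measured.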
In the end, I was able to stop almost all traffic by shutting down all FOG services, including the FOGScheduler.
So the questions now are:
Is there any documentation that explains what each service does and what we’re breaking by turning each off?
Are the negative effects mitigated by running any of the services daily?
Any assistance is appreciated.
At present, we’re working over VERY thin pipes, but for my future reference, is there any way to manage the frequency of image and snapin replication tasks?
The simplest method beyond the sleep times is to use cron to start and stop replication on a set interval. Or write your own replicator function using rsync on both ends. IMO rsync is a somewhat more elegant solution than the fog replicator, but by going with rsync you lose the tight integration with FOG. So there are drawbacks on both sides.
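To make the rsync idea a bit more concrete, here is a rough Python sketch that only builds (and prints) a bandwidth-limited rsync command for a manual sync. It is not how FOG’s replicator works. The `/images` path, the `fog@node1.example.com` host, and the 50 KB/s limit are all placeholder assumptions; substitute whatever your own install uses.

```python
import shlex
from typing import List


def build_rsync_command(src_dir: str, node_host: str, dest_dir: str,
                        bwlimit_kbps: int = 0,
                        dry_run: bool = True) -> List[str]:
    """Build an rsync argument list for a one-way sync to a storage node.

    bwlimit_kbps is passed to rsync's --bwlimit option; 0 means no limit.
    dry_run defaults to True so nothing is copied until you opt in.
    """
    cmd = ["rsync", "--archive", "--partial"]
    if bwlimit_kbps > 0:
        cmd.append(f"--bwlimit={bwlimit_kbps}")
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    cmd.append(src_dir.rstrip("/") + "/")
    cmd.append(f"{node_host}:{dest_dir.rstrip('/')}/")
    return cmd


if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    cmd = build_rsync_command("/images", "fog@node1.example.com", "/images",
                              bwlimit_kbps=50)
    print(shlex.join(cmd))
```

By default rsync decides what to send by comparing file size and modification time, so a node that is already in sync costs little more than the file listing. To actually run the command, pass the list to `subprocess.run(cmd, check=True)` once the dry run looks right.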
is there any way to manage the frequency of image and snapin replication tasks?
Yes, the service sleep times inside the fog configuration area.
Wow!!! I never would have expected that. I would assume an MD5 hash on each side would be the best way to go.
My design philosophy is always to go as thin as you can manage. It’s based on the notion that even though GB Ethernet is everywhere and my internet connection is 300 x 30, CPUs outperform disks, disks outperform networks, and WAN connections will remain a problem for the rest of my life.
At present, we’re working over VERY thin pipes, but for my future reference, is there any way to manage the frequency of image and snapin replication tasks? If I have a lot of images, I’d likely set replication for once a day, at most, especially with many locations.
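One way to get a once-a-day cadence along the lines of the cron suggestion in this thread is a small helper that cron runs every few minutes and that decides whether the replication services should be up or down. This is only a sketch: the service names are the ones mentioned in this thread, and whether they match the systemd unit names on your install is an assumption, as is the 01:00-03:00 window.

```python
from datetime import datetime, time
from typing import List, Sequence

# Service names as given in this thread; the matching systemd unit
# names on a particular install are an assumption.
REPLICATION_SERVICES = ["FOGImageReplicator", "FOGSnapinReplicator",
                        "FOGSnapinHash"]


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), allowing windows past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def service_commands(now: time, start: time, end: time,
                     services: Sequence[str] = REPLICATION_SERVICES
                     ) -> List[List[str]]:
    """Return systemctl commands that start services inside the window
    and stop them outside it."""
    action = "start" if in_window(now, start, end) else "stop"
    return [["systemctl", action, name] for name in services]


if __name__ == "__main__":
    # Hypothetical nightly window; the commands are printed, not executed.
    for cmd in service_commands(datetime.now().time(), time(1, 0), time(3, 0)):
        print(" ".join(cmd))
```

The script only prints the commands, so you can review them before wiring it into cron and running them for real.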
Thanks for all your help.
Preliminary info indicates we may have made FOG traffic a footnote compared to all inter-site traffic, instead of being the top user by a factor of 10.
Thanks - more later…
I can’t understand how or why the Main FOG server would transmit 1 GB in 12 hours to a server that is in sync
The nodes exchange the last 10MB of data from each file in each image to see if it’s the same or not. If it’s not the same, replication is initiated. It’s the same for snapins. I thought at one point @tom-elliott had it so the hashing was done locally so the 10MB wouldn’t need to be sent across the wire.
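To illustrate the "hash locally" idea, here is a minimal Python sketch in which each side hashes the last 10MB of a file itself, so only a short digest string would need to cross the wire. This is not FOG’s actual code; the function names are made up for the example, and SHA-256 is used here where the thread mentions MD5.

```python
import hashlib
import os

TAIL_BYTES = 10 * 1024 * 1024  # the "last 10MB" described above


def tail_digest(path: str, tail_bytes: int = TAIL_BYTES) -> str:
    """Return "<size>:<sha256 of the file's last tail_bytes bytes>".

    Files shorter than tail_bytes are hashed in full.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        handle.seek(max(0, size - tail_bytes))
        # Read in 1 MiB chunks so memory use stays small.
        while True:
            chunk = handle.read(1024 * 1024)
            if not chunk:
                break
            digest.update(chunk)
    return f"{size}:{digest.hexdigest()}"


def needs_replication(master_digest: str, node_digest: str) -> bool:
    """Files are treated as different when their size or tail hash differ."""
    return master_digest != node_digest
```

Each node would run `tail_digest` on its own copy and send back only the result, which is well under a hundred bytes per file instead of 10MB. Like the check described above, it only looks at the end of the file, so a change confined to the earlier part of a file of the same size would go unnoticed.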
@jim-graczyk I agree that keeping them identical might be a challenge. But depending on the cost, that challenge may be required.
As for the certificate, a simple scp command might make the certificates on the fog servers all the same.
I also agree that your mobile workforce might be where things fall down a bit.
Also, the replicator logs on the master node should be in /opt/fog/logs. If you need to dig deeper, they might give you an idea of what the master server’s replicators are doing.
Thanks for this fine detailed info. Our possible fallback position might be a full fog server at the remote site, but we have only a dozen or fewer PCs per site. I don’t look forward to trying to keep 5 FOG servers config’d identically.
Also, we are aware that each FOG server has its own cert, so moving PCs from server to server would be a challenge. Half of our PCs are laptops with users moving from location to location. With one FOG server, we can easily move a PC to a different location (wherever the user is that day), and re-image. Our approach to Snapins is automatically site aware (DFS and SAMBA), and our FOG Snapin is just a generic 15K stub. All our snapins are accessed over SMB. We rely on FOG to manage and execute Snapins, but not so much to store them.
Our bandwidth monitoring is showing very little traffic from the FOG client to the Main FOG server. Personally, I can’t understand how or why the Main FOG server would transmit 1 GB in 12 hours to a server that is in sync. I wondered about the Snapin Hash Global Enable, but our 40 snapins amount to less than 1 MB of data, so I don’t see it there.
We’ve used the FOG Linux Service Enable section of FOG Configuration to uncheck these parameters:
We’ll report back our results after a day.
During normal operations, the PXE-booting target computers reach out to the master fog server to get their booting instructions. This is typically small HTTP traffic until the remote computers find their local storage node. The storage nodes also reach out to the master fog server: they don’t have a local database, so they use the master fog server’s database. And lastly, the FOG Client communicates with the master fog server to update inventory and look for things to do. This may cause additional traffic over that metered connection.
For the metered sites, you might consider placing a full fog server at those locations to keep all of the traffic local. You can still have full fog servers configured as storage nodes to get the replicated images. You will, however, lose the ability to see all of your systems from one console. But this way (with a local fog server at the metered sites) your only WAN traffic will be image replication.
We’re assuming when one unchecks replication on Snapins and Image, the FOG system will no longer use the content on the Storage Node (PXE, Snapins and Images). Our content is up to date on all storage nodes (and unchanging), so we want the client to use the remote content, but we want the Main FOG server to stop talking to the Storage Nodes.
When you uncheck replication, that only impacts replication to the storage nodes. It should not impact the ability to use the already replicated image. If the image has not been replicated and replication is disabled, the replicator will skip sending that image/snapin to the remote storage nodes. That is the way it is supposed to work.
The other way to disable replication is the global replication flag (in the fog settings, I think). The last way is to actually stop/disable the replication service in Linux.