Storage Nodes Not Providing Images
-
Hi everyone, first off, as this is my first post on this forum, I want to thank the wonderful FOG team for the outstanding product they have made and all of the great documentation on it.
On to the good stuff:
I’m facing an issue with the use of FOG Storage Nodes in my setup. I have three servers, all of identical hardware and specifications, running CentOS 7 and FOG v1.4.4. I use FOG unicast imaging to deploy batches of computers with pre-made images that we ship to clients; however, the quantity I was imaging at a time was taking a toll on imaging speeds and throughput, slowing down imaging times substantially. Using the documentation on the Wiki and the YouTube video suggested there, I (believe I) successfully created two storage nodes on the same network to help the Master Node load balance. The replicator script did run, and the images appear to have all been copied over to the correct directories on the new servers.
All of that being said, every machine still takes its image from the Master Node, and thus the imaging speed problem persists. All images are pulled from 10.0.0.1 and never from the storage nodes (IPs below). I even set the max clients on each storage node under the Storage Management tab to 5 instead of the intended 10 to see if that would force the use of the other machines, but it did not change anything. Imaging between 15 and 30 machines at once, they all still pulled from the master node (10.0.0.1).
I would greatly appreciate any help or advice the community could give me! Thank you all in advance! Below is a list of information on each of the servers, as well as a network map showing the general setup of the servers, imaging stations, and switch configuration. If there are any other questions I can answer or config files that would assist in fixing the issue, please let me know and I will answer/upload them as soon as possible. Thank you!
IP Addr | Node | Max Clients | CentOS Ver. | Path | DHCP
10.0.0.1 | Master Node | 5 | 7.3.1611 | /images/ | yes
10.0.0.108 | Storage Node 1 | 5 | 7.4.1708 | /images/ | no
10.0.0.81 | Storage Node 2 | 5 | 7.4.1708 | /images/ | no -
@voison Nice information - but I’d like to see a screenshot from here please:
Web gui -> Storage Management
Please get all of the node info in that photo. -
Outside of your issue of FOG not picking the proper storage node for deployment, you have a different issue here.
Since you are imaging 15-30 machines at once, you are using the wrong technology. If they all use the same image, you should be using multicast rather than unicast. Each FOG server will fill its 1 GbE network uplink with 3-4 simultaneous unicast sessions, so to deploy 30 unicast streams you would need to add quite a few more storage nodes.
The same goes for the switch-to-switch links. So how can you mitigate this?
- Upgrade to a 10GbE network
- Add more links in your network (link aggregation groups / LAG)
- Use a multicast stream.
Just for a baseline number, for a single unicast stream, what does partclone indicate your transfer rates are in GB/min?
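To put that number in context, here’s a rough back-of-the-envelope conversion (approximate, and it ignores protocol overhead):
# 1 GbE line rate: 1000 Mb/s / 8 = 125 MB/s, which is roughly 7.3 GB/min of raw throughput.
echo "scale=2; 1000/8*60/1024" | bc   # ~7.32 GB/min ceiling for a single 1 GbE link
# At roughly 2.5-3 GB/min per partclone stream, three streams already sit at the wire limit.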
-
@george1421 said in Storage Nodes Not Providing Images:
Each fog server will fill its 1GbE network uplink connection with 3-4 simultaneous unicast sessions
In my experience, unicast will saturate a 1Gbps link with 3 simultaneous sessions. With 2 simultaneous sessions, it is almost saturated.
But yeah, you need to use multicast. Still, we should explore why the nodes aren’t working; it’s probably something simple.
-
@wayne-workman said in Storage Nodes Not Providing Images:
In my experience, unicast will saturate a 1Gbps link with 3 simultaneous sessions. With 2 simultaneous sessions, it is almost saturated.
That was my experience too. I was being a bit generous saying 3-4, since I included the performance hit of the disk subsystem as well. But it’s exactly at the third stream where the link gets saturated and retransmits shoot up, if all you are doing is moving data. A slower disk subsystem releases some of that pressure on the network.
ref: https://forums.fogproject.org/topic/10459/can-you-make-fog-imaging-go-fast/5
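If you want a raw baseline for the link itself, independent of the disks and partclone, an iperf3 run between the servers is a quick sanity check (this assumes iperf3 is installed on both ends):
# On the master node (10.0.0.1), start a listener:
iperf3 -s
# From a storage node or an imaging station, run a 30-second test against it:
iperf3 -c 10.0.0.1 -t 30
# A result well under ~940 Mb/s, or a pile of retransmits, points at the uplink/switch path.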
-
Thank you both for all the information. I have considered link aggregation but have not worked on implementing such a solution yet. I have not tried a multicast session yet, but I have just started reading more on it. In terms of imaging speeds, before I made the nodes I was usually getting about 2.9 to 5.3 GB/min according to Partclone, depending on the age of the hard drives in the machines being imaged. That was while imaging at least 5-7 machines; when only imaging 2-4, the speeds were often higher. Each of the servers has a 1TB SSD, so they are pretty quick when it comes to their own disk speeds. I’ll look at the link saturation tomorrow when I get in the office and follow up here; thank you for the helpful links! Here is the Storage Node webpage from the GUI:
Here are the screenshots of the specific details per node. I played around with the replication speeds, but that did not make a difference.
-Master Node-
-Storage Node 1-
-Storage Node 2-
-
@voison I’ve not fully looked through the photos, but I did notice the passwords for the storage nodes are extremely short. The FOG installer would not have set such short passwords; it sets really long ones. So my first thought is that the passwords are simply wrong, which means replication never happened, which means these nodes don’t have copies of the images, which means they won’t be chosen by the fog server. Have a look at this article: https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_FTP
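A quick way to test those credentials by hand from the master node is something like this (a sketch; it assumes the nodes use the standard ‘fog’ FTP account, that curl is available, and the password below is a placeholder for whatever the web UI shows for that node):
# A failed login here means the image replicator cannot push to that node.
curl -sS --user fog:PASSWORD_FROM_WEB_UI ftp://10.0.0.108/
curl -sS --user fog:PASSWORD_FROM_WEB_UI ftp://10.0.0.81/
# Adjust the path after the host if your FTP root differs from the fog user's home directory.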
AGAIN, just a hunch. Maybe the passwords are right, idk. -
Well, an interesting find. No, the passwords were not correct. I checked both of the nodes (/opt/fog/.fogsettings) and their passwords did differ from the passwords in the web GUI. I tried updating them in the web GUI; however, it has not fixed the problem, and all unicast images are still being pulled from the Master Node. I do remember watching the replicator log for a while when I first created the nodes. So, that being said, I made a little comparison of all three /images/ directories on each server. The master node does have more than the other two, but these are just old images that we deleted out of FOG but not off of the disk. Everything in the directories on the storage servers is the same as what is found in the web GUI. I would say that the replicator did run at some point, even with the differing passwords. Is there any kind of manual system refresh I need to do after updating the passwords to associate the storage nodes with the master node?
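(For anyone following along, a dry-run rsync from the master is one way to make that kind of comparison; this assumes root SSH access between the servers and copies nothing because of -n:)
# List anything that would still need to be transferred from the master to each storage node.
rsync -avn /images/ root@10.0.0.108:/images/
rsync -avn /images/ root@10.0.0.81:/images/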
-
@voison The fog installer has mechanisms built into it to correct an incorrect password for the local account used and for the credentials stored in the DB for the node the installer is running on. The easiest way to correct this stuff is to ensure you have your desired password for a node set inside of
/opt/fog/.fogsettings
and just rerun the installer on that node. These mechanisms won’t work, though, if there’s a DB connectivity problem between the nodes and the DB. There are troubleshooting steps for that here: https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_MySQL
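In practice that looks something like the following on each storage node (a rough sketch; the exact key names inside .fogsettings and the path where the FOG 1.4.4 sources were unpacked vary, so treat those as placeholders):
# Check what password the node currently has recorded, then edit it to the desired value.
grep -i password /opt/fog/.fogsettings
vi /opt/fog/.fogsettings
# Then rerun the installer from wherever the FOG sources live, for example:
cd /root/fogproject/bin && ./installfog.sh
-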
@wayne-workman Good call on the MySQL issue. The database on the Master Node would not allow a login for ‘fogstorage’ from anything except localhost. I fixed the database issues for both storage nodes, testing all three machines’ ability to access the MySQL server, with success. After this I looked over the .fogsettings file on both storage nodes and configured them accordingly, then ran the installer again. All was successful and nothing failed. I attempted another batch of images, and nothing changed. I then went to see what was happening with the replicator, just out of curiosity, and it seemed that the Master Node was trying to replicate something, but neither of the storage nodes seemed to recognize anything. While writing this, they have both since come back with “disabled replication” messages:
[01-30-18 11:20:31 pm] * Starting ImageReplicator Service
[01-30-18 11:20:31 pm] * Checking for new items every 600 seconds
[01-30-18 11:20:31 pm] * Starting service loop
[01-30-18 11:20:31 pm] * | This is not the master node
[01-30-18 11:30:31 pm] * | This is not the master node
[01-30-18 11:55:59 pm] * * Image replication is globally disabled
For a brief moment, both of the servers were showing up on the home page in the bandwidth graph as well as under web gui > Fog Configuration > Kernel Versions; however, they have now disappeared from both. I’m not sure if that helps with the problem, but I figured I would report it.
Do I need to re-run the installer on the master node as well?
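For reference, the database fix on the master boiled down to letting ‘fogstorage’ connect from the node IPs instead of only localhost, then testing from each node; roughly this (the real password is redacted, and the exact grant may differ):
# On the master node (10.0.0.1): allow fogstorage to connect from the 10.0.0.x subnet.
mysql -u root -p -e "GRANT ALL PRIVILEGES ON fog.* TO 'fogstorage'@'10.0.0.%' IDENTIFIED BY 'REDACTED'; FLUSH PRIVILEGES;"
# From each storage node: confirm the remote login now works.
mysql -u fogstorage -p -h 10.0.0.1 fog -e 'SELECT 1;'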
-
@voison Replication activities are only ‘conducted’ from master nodes. Meaning the non-master nodes don’t really do anything other than receive data that the master sends. So all of the logs that matter are on your master node. If it says it’s replicating, just give it some time.
Here’s a reference thread also where someone posted both master and non-master logs. In the non-master log, you can see the same message ‘Image replication is globally disabled’ but in the master one you can see it’s going.
https://forums.fogproject.org/topic/10891/image-replication-not-working/10
@voison said in Storage Nodes Not Providing Images:
For a brief moment, both of the servers were showing up on the home page in the bandwidth graph as well as under web gui > Fog Configuration > Kernel Versions; however, they have now disappeared from both.
When a lot of bandwidth is being taken by replication, the graphs sort of crap out. They are not the best or most resilient graphs really; they are only meant to give you ‘an idea’ of what’s happening. If you want to know exactly what the throughputs are, you should use a CLI tool like
iftop
or something.
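For example, something like this on the master, where the interface name is just a placeholder for whatever your NIC is called:
# Live per-connection throughput on the master's NIC while imaging/replication runs.
iftop -nNP -i eno1
-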
Thank you, Wayne, that is good to know about the replication scripts.
Sorry I disappeared for a bit; I had to go out of town for a little while. Unfortunately, now that I have returned, there has been no change in the status of this issue. I have confirmed that the replicator ran perfectly fine, and the /images directories on all three servers are identical. What would you all suggest I try next? At this point I am starting to consider exporting all of the images to a hard drive and starting over from scratch.
The server is still running well, even while supporting as many unicast streams as it does. However, I would still like to solve this issue. I went back over everything this evening and checked all of the passwords, thoroughly eliminating them as a possible culprit.
-
@voison If you have some time this weekend, I am willing to help you out via a screen share - PM me.