Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?

george1421

@p4cm4n OK, let’s rewind this to a simpler approach.

DR site: We may be able to leverage MySQL replication to replicate the database from the root FOG server to the DR FOG server. This would be an active/passive configuration, and it gets the database to the remote site. If you make a storage group on the root FOG server and list the DR FOG server in it as a storage node, the root FOG server will replicate the image files to the DR FOG server. This would be completely automatic if you use something like keepalived. The next bit is that you will need to do some DNS magic. When you set up your storage nodes, don’t point them to the IP address of the root FOG node, but to a DNS name that points to the root node. That way, when/if the root node goes offline, you update the DNS entries and after some time the storage nodes will find the DR FOG server. You will need to do the same thing when you install the FOG clients: they need to point to a DNS name and not an IP address. If the DR system were at the same site as the root server, then for external access (storage node, FOG client, web admin) you could use a floating VIP/CARP address that would stay with whatever node was master at the time.

Subnet group: This is where the FOG location plugin comes in. You create your locations, assign your storage nodes to a location, and then finally assign your computers to a location. That way when a computer PXE boots, it reaches out to the root node and asks for a local storage node. If a workstation is not associated with a location, then it will image from the root FOG server node.

This Highly Available root fog server is something that I find interesting and something I think I would look into.

Just thinking big picture here: if you have 14,000 hosts with the FOG client installed, all hitting the FOG server, you will have some severe bottlenecks because the FOG server will be so busy servicing client check-ins that the web UI will be very slow. We might (for performance reasons) want to split the FOG database off from the FOG server and then create a second FOG server just for client check-ins. This is more of an idea than a working model at the moment.

Sebastian Roth

@p4cm4n This is a huge project you are going to tackle. Wohooo.

Besides the things George already pointed out, I might add a little to that. FOG was not made for such huge installs to start with. Not saying you can’t use it, but I think it needs thorough consideration, testing and adjustments as you go.

One major thing I see is the fog-client software. Even 500-1000 clients can bomb out a decent server. Besides the load issue (which can be solved), you need to know that the current fog-client pins to one single FOG server via certificates. So George’s great advice on using DNS names for failover will only work if you clone all the certificates manually. I was hoping to remove this restriction as it causes way more problems than it adds security, especially when HTTPS is enabled on the FOG server. But we don’t seem to have enough people working on this project and so it’s still on a long list of todos.

Also, having dozens of nodes can make the FOG web UI slow. We tried to fix this issue some time ago, but I am not sure it is fully fixed yet.

I would definitely try to import host information directly to the database if I were you. While it might take some time to investigate and maybe come up with some scripts it will save you a lot of time and hours of sleep for such a huge amount of hosts.

Definitely look into moving your database to InnoDB right from the start!

p4cm4n

I think so far the adjustments you guys have mentioned will probably be made - but some may not be necessary. I don’t think enough of a requirement will be for the web portion to be ‘fast’ as we wrote a powershell GUI for the specific functionality that we needed with the fogAPI.

I’m still not really understanding some things about the master/storage/storage groups (with the added feature of ‘locations’)

Something I’m running into is that at specific places, I will need a new variation of an image perhaps. So lets say this is the case

HQ > FOGServer. Database host, Image host, PXE host.

HQ has 3 additional storage nodes.
HQ Has the ‘Image Master VM’ as I will call it.
Site 1> FOG Storage Node (Master Node of this Storage Group)
Site 2> FOG Storage Node (Master Node of this Storage Group)

Question 1 relates to only the HQ setup - If I use “DEPLOY IMAGE” without registering a host, is it supposed to only talk to the FOGServer, and not load balance in any way? In this functionality, I’ve got about 90 laptops imaging from the main FOGserver. It’s working incredibly fast, but just making sure this is by design. I guess the FOGServer doesn’t ‘know’ the clients, so it can’t load balance on something it doesn’t know? I’d expect the slots are taken regardless, but ?

Question 2 relates to the entire setup. If I create an image at HQ, but lets say I want to replicate that image to Site 1…how do I do that? As it is now, I’m just changing the storage group of Site 1’s storage node (disabling master) and putting it in the group of HQ so it replicates with the ones in HQ. Then moving it back to its own…but in this case, I will need to do this with all of everywhere…and as more sites come about, that may be a lottttt of images.

What I’d like to do is have that “FOGServer” master node. Capture an Image (Lets Say ImageAdobe) at HQ on this Master Node, then selectively replicate it to Site 2, for example. No need for Site 1 to get it. Likewise, Site 1 gets ImageChrome. No need for Site2 to get it. Is this something that can be done? If there is no ability for this to be done - if I just go into the storage node via SSH and delete the folder for the image I don’t need at a location, is this best practice? Will it cause any issues, if Site2 thought it had Image34, but the folder no longer existed?

So far, I’ve killed the crap out of the master fogserver based on 256MB PHP-FCGI limits (I run debian) but found a few threads @george1421 was a part of that explained how to resolve those issues. A little learning curve due to you guys using CentOS but all good. I’ve got 4000 Clients in there, which does take a while to load up the webpage, but since most of our work now is
image > approve pending hosts, add pending to group > rinse/repeat
we’ve done that whole workflow in Powershell GUI. Only I’ve had to go into the webserver

Sebastian Roth

@p4cm4n said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

I think so far the adjustments you guys have mentioned will probably be made - but some may not be necessary. I don’t think enough of a requirement will be for the web portion to be ‘fast’ as we wrote a powershell GUI for the specific functionality that we needed with the fogAPI.

If you use the fog-client software on all your 14,000 machines, this will probably flood your webserver/PHP-FPM and the fogAPI will stop working just as well.

I’m still not really understanding some things about the master/storage/storage groups (with the added feature of ‘locations’)
Something I’m running into is that at specific places, I will need a new variation of an image perhaps. So lets say this is the case
HQ > FOGServer. Database host, Image host, PXE host.

HQ has 3 additional storage nodes.
HQ Has the ‘Image Master VM’ as I will call it.
Site 1> FOG Storage Node (Master Node of this Storage Group)
Site 2> FOG Storage Node (Master Node of this Storage Group)

Question 1 relates to only the HQ setup - If I use “DEPLOY IMAGE” without registering a host, is it supposed to only talk to the FOGServer, and not load balance in any way? In this functionality, I’ve got about 90 laptops imaging from the main FOGserver. It’s working incredibly fast, but just making sure this is by design. I guess the FOGServer doesn’t ‘know’ the clients, so it can’t load balance on something it doesn’t know? I’d expect the slots are taken regardless, but ?

The FOG server doesn’t know the clients in this case but it does/needs to know the image that you want to deploy. Images are associated with one or more storage groups and therefore nodes depending on your setup (code reference).

Question 2 relates to the entire setup. If I create an image at HQ, but lets say I want to replicate that image to Site 1…how do I do that? As it is now, I’m just changing the storage group of Site 1’s storage node (disabling master) and putting it in the group of HQ so it replicates with the ones in HQ. Then moving it back to its own…but in this case, I will need to do this with all of everywhere…and as more sites come about, that may be a lottttt of images.
What I’d like to do is have that “FOGServer” master node. Capture an Image (Lets Say ImageAdobe) at HQ on this Master Node, then selectively replicate it to Site 2, for example. No need for Site 1 to get it. Likewise, Site 1 gets ImageChrome. No need for Site2 to get it. Is this something that can be done? If there is no ability for this to be done - if I just go into the storage node via SSH and delete the folder for the image I don’t need at a location, is this best practice? Will it cause any issues, if Site2 thought it had Image34, but the folder no longer existed?

Probably those five rules from the wiki will help you answer your question:

An image has one storage group as its primary group, but can be associated to many storage groups.
The image will always capture to the primary group’s master storage node.
Replication looks for images that belong to multiple groups - and replicates from the primary master to the other associated group’s master nodes.
Replication then replicates images from each group’s masters to other ‘regular’ storage nodes in the master’s group.
A storage node can belong to multiple storage groups - you just need a storage node entry for each. For example, a non-master in one group can be a master in another group.
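
As a rough illustration of the first three rules (this is not FOG’s replicator code), the bash sketch below lists which master nodes would receive an image, given the image’s primary group and its other associated groups. The group names, node names, and the functions master_of and replication_targets are all made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: master node of each storage group (names are made up).
master_of() {
    case "$1" in
        HQ)    echo "fogserver" ;;
        Site1) echo "site1-node" ;;
        Site2) echo "site2-node" ;;
    esac
}

# replication_targets PRIMARY_GROUP [OTHER_GROUP...]
# Prints one "source -> destination" line per associated group: the primary
# group's master copying to each other associated group's master.
replication_targets() {
    local primary="$1" group
    shift
    for group in "$@"; do
        [ "$group" = "$primary" ] && continue
        printf '%s -> %s\n' "$(master_of "$primary")" "$(master_of "$group")"
    done
}

# ImageAdobe: primary group HQ, also associated with Site2 only.
replication_targets HQ Site2
# ImageChrome: primary group HQ, also associated with Site1 only.
replication_targets HQ Site1
```

Under these rules, which sites receive an image depends on which storage groups the image is associated with, not on which group a node is temporarily moved into.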

So far, I’ve killed the crap out of the master fogserver based on 256MB PHP-FCGI limits (I run debian) but found a few threads @george1421 was a part of that explained how to resolve those issues. A little learning curve due to you guys using CentOS but all good. I’ve got 4000 Clients in there, which does take a while to load up the webpage, but since most of our work now is
image > approve pending hosts, add pending to group > rinse/repeat
we’ve done that whole workflow in Powershell GUI. Only I’ve had to go into the webserver

Well, if this is working for you, go ahead. I still suggest you increase the fog-client checkin time: FOG web UI -> FOG Configuration -> FOG Settings -> FOG Client -> CLIENT CHECKIN TIME…

george1421

@p4cm4n said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

I’m going to answer some of these in a slightly different way, but still in line with what Sebastian posted. There were a few things I had to look up in the code because I wanted to make sure the way I thought it worked was the way it was actually coded.

Question 1 relates to only the HQ setup - If I use “DEPLOY IMAGE” without registering a host, is it supposed to only talk to the FOGServer, and not load balance in any way? In this functionality, I’ve got about 90 laptops imaging from the main FOGserver. It’s working incredibly fast, but just making sure this is by design. I guess the FOGServer doesn’t ‘know’ the clients, so it can’t load balance on something it doesn’t know? I’d expect the slots are taken regardless, but ?

I had to look into the code. Each storage node (master or slave) has slots. The FOG imaging load balancer does not balance on true “utilization load”. There are storage nodes in a storage group, and all nodes in that group have the same images and snapins deployed. When a system requests an image, FOG looks at where the image is located (its storage group). Then the storage nodes (master and slaves) in that storage group are identified. The first node is checked 1) to be turned on and 2) to see whether it has reached the max clients (slots) it can service. If it passes and this is the first time through, this node is marked as the winning service node. Then the loop moves to the next storage node and tests whether it is online and has fewer than max clients. If yes, this (new) storage node is checked to see if its current client count is less than the winner’s client count. If yes, it’s the new winner. The loop continues through the storage nodes in the storage group. So according to the code, the storage node with the fewest active deployments in the storage group should get the next deployment job. ref: https://github.com/FOGProject/fogproject/blob/171d63724131c396029992730660497d48410842/packages/web/lib/fog/storagegroup.class.php#L259
That is how FOG imaging deployment does load balancing. I had been under the impression that it used overflow deployments, in that it filled up storage node 1 until max clients and then overflowed to storage node 2.
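
To make that loop concrete, here is a small bash sketch of the same selection logic. It is only a paraphrase of the description above, not the PHP from the linked file; the function name pick_node, the input format, and the node names are assumptions for the example.

```shell
#!/usr/bin/env bash
# pick_node: read "name enabled max_clients current_clients" lines on stdin
# and print the enabled node with the fewest active clients that still has
# a free slot. Prints nothing if no node qualifies.
pick_node() {
    local winner="" winner_count=0
    local name enabled max count
    while read -r name enabled max count; do
        [ "$enabled" = "1" ] || continue      # node must be turned on
        [ "$count" -lt "$max" ] || continue   # node must have a free slot
        # First candidate wins by default; later ones must be strictly less busy.
        if [ -z "$winner" ] || [ "$count" -lt "$winner_count" ]; then
            winner="$name"
            winner_count="$count"
        fi
    done
    if [ -n "$winner" ]; then
        printf '%s\n' "$winner"
    fi
}

# Made-up example: three nodes in one storage group.
pick_node <<'EOF'
fogserver 1 10 7
hq-node1 1 10 3
hq-node2 0 10 0
EOF
```

In the example, hq-node2 is skipped because it is disabled, and hq-node1 wins over fogserver because it has fewer active clients.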

Question 2 relates to the entire setup. If I create an image at HQ, but lets say I want to replicate that image to Site 1…how do I do that? As it is now, I’m just changing the storage group of Site 1’s storage node (disabling master) and putting it in the group of HQ so it replicates with the ones in HQ. Then moving it back to its own…but in this case, I will need to do this with all of everywhere…and as more sites come about, that may be a lottttt of images.

FOG is really not set up for selective deployments. It’s generally an all-or-nothing image deployment.

What I’d like to do is have that “FOGServer” master node. Capture an Image (Lets Say ImageAdobe) at HQ on this Master Node, then selectively replicate it to Site 2, for example. No need for Site 1 to get it. Likewise, Site 1 gets ImageChrome. No need for Site2 to get it. Is this something that can be done? If there is no ability for this to be done - if I just go into the storage node via SSH and delete the folder for the image I don’t need at a location, is this best practice? Will it cause any issues, if Site2 thought it had Image34, but the folder no longer existed?

When the replicator runs, it will see that you deleted the files and then just recopy them over from the master node. I realize your image names were just an example, but I wonder if you are doing something (maybe not wrong) a bit more complex than needed. I’ve worked on deployments where we had exactly 3 images (2 UEFI and 1 BIOS) for 5400 computers, with 14 different models in that 5400 workstation population. We did use PDQ Deploy for certain software deployments post imaging, but 90% of the software was already baked into the golden image.

So far, I’ve killed the crap out of the master fogserver based on 256MB PHP-FCGI limits (I run debian) but found a few threads @george1421 was a part of that explained how to resolve those issues. A little learning curve due to you guys using CentOS but all good. I’ve got 4000 Clients in there, which does take a while to load up the webpage, but since most of our work now is

This is where we need to do some tuning from default. FOG is really geared towards SMB deployments where you might have 500 or fewer computers with FOG clients (my opinion). One of the first things we need to do is make sure your MySQL database is using the InnoDB engine for the tables and not MyISAM. The quick answer is that MyISAM uses table-level locking on updates and InnoDB uses row-level locking. This becomes important when you have 800 nodes checking in per minute. You end up with quite a bit of resource contention on your MySQL database, and your database server’s CPU utilization will jump way up. Once you convert the tables over to InnoDB, load drops back to normal.
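
As a hedged sketch of what the conversion boils down to, the bash function below turns a list of table names into ALTER TABLE statements. It only prints the statements so they can be reviewed; nothing is run against a database. The database name fog and the two table names are assumptions for the example, and a backup should come before any real conversion.

```shell
#!/usr/bin/env bash
# innodb_statements DB: read table names (one per line) on stdin and print
# an ALTER TABLE statement for each. Nothing is executed here.
innodb_statements() {
    local db="$1" table
    while read -r table; do
        [ -n "$table" ] || continue   # skip blank lines
        printf 'ALTER TABLE `%s`.`%s` ENGINE=InnoDB;\n' "$db" "$table"
    done
}

# Example with two table names. On a real server the list would come from
# information_schema.tables, filtered on ENGINE = 'MyISAM'.
printf 'hosts\ntasks\n' | innodb_statements fog
```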

With as many client computers as you have, change your check-in time from 5 minutes to 10 or 15 minutes. It will slow down a few things on initial deployment, but you will release the back pressure on the FOG server. The one design I thought about was a 2 web server and 1 database server design, where you have 1 web server for deployment and system management and a second web server to service FOG client requests. Both web servers would be hitting the same MySQL database. I’m not sure if that would be any more performant than a single large FOG server.
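
For a feel of the numbers, here is a back-of-the-envelope bash calculation. It assumes every client checks in exactly once per interval and that check-ins are spread evenly, which is a simplification; the function name is made up.

```shell
#!/usr/bin/env bash
# checkins_per_minute CLIENTS INTERVAL_MINUTES
# Average check-ins per minute if every client checks in once per interval
# (integer division, so the result is rounded down).
checkins_per_minute() {
    echo $(( $1 / $2 ))
}

checkins_per_minute 4000 5     # 800, consistent with the figure mentioned above
checkins_per_minute 14000 5    # 2800
checkins_per_minute 14000 15   # 933
```

Going from 5 to 15 minutes cuts the average check-in rate to a third, which is why the interval is the first knob to turn before adding servers.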

Another thing we need to look into is the SQL database design. I think we can make some repetitive queries a bit more responsive by creating indexes on frequently queried values. I can tell you no research has been put into optimizing the database structure by using database indexes. The main issue is finding large installations that have the time to help test different configurations. Usually, once the larger installs get it working, focus shifts from FOG to something else, so we never get to close the loop on potential changes.

p4cm4n

@sebastian-roth @george1421

Client time has been adjusted but it’s being adjusted on-the-fly. Imaging workflow is mostly completed on new laptops, however NOW it will be pre-existing computers. Pre-Existing will expand that range from the 4000 as it stands now, to an additional 10k. I’m glad you guys mention that though, as I will need to definitely change that over ASAP before I roll out the FogClient deployment (via PDQ actually, george…however in this engagement we’ve noticed PDQ only scales sooooo much :))

It turns out that setting the images to replicate with the functionality you guys mention (Group > Group Image Replication) worked. I might have been a little impatient when testing it before, but it is what it is. I’ll have to monitor the Master Nodes in their respective groups - but I set one overnight to replicate and it worked as expected…prior to that it did not (Seemingly? perhaps firewall issues though…I watched the log as this one happened)

In most situations I won’t be making storage group slaves, luckily. However the infrastructure is there to “make” the images at the HQ site. Massive ESXi farm, lots of existing snapshots, and already open firewall to everywhere.

I now understand what you’re mentioning about the database design. Any guides you have to work on this and migrate it over?

I’m getting to a point in this project where I have to offload nearly everything as I’m leaving the country soon. But all the preparation I can do in advance, the better. Going to go build 70 storage nodes today and tomorrow

Sebastian Roth

@p4cm4n said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

I watched the log as this one happened

That’s definitely what I’d suggest in any case. Watch the log and know what it does instead of just guessing what might happen.

Going to go build 70 storage nodes today and tomorrow.

Whoooohaa!

george1421

@p4cm4n Here is the procedure for converting the database from the MyISAM to the InnoDB engine.
https://forums.fogproject.org/topic/16099/configure-fog-database-to-use-innodb-engine

Wayne Workman

@p4cm4n Is your org able to help FOG with resources, be they time or financial? FOG is running critically low on these things, and support is the one thing that comes to mind when you talk about 70 nodes and 14,000 clients.

p4cm4n

@george1421
Sweet. This will be my project tomorrow.

Today I ended up automating the server node installation, was pretty fun actually. Learned a bunch of the inner workings. Never written a bash script before, or modified the fog.man.reg in the way that I have. The man.reg only asks
Hostname:
LocID:
ImgID:
GroupID:
Then deploys.

This poses the question though, I still don’t understand the workflow enough…how DID you end up having the machine image without rebooting?

george1421

@p4cm4n said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

how DID you end up having the machine image without rebooting?

I looked for the code I posted in the forums a few years ago but could not locate it. At the end of fog.man.reg, after the inventory step, I checked whether the user answered yes to “image now”. If they did, I reloaded the kernel parameters and then (I think) called the fog.download script directly from fog.man.reg.

p4cm4n

@george1421 I called the fog.download script and got ‘no OS defined’ which must have been the kernel reload you mentioned.
How did you reload the kernel params like that? I don’t see code anywhere that issues a reboot command unless it’s in functions.sh, which seems to be the common dump.

p4cm4n

@george1421 yeah the error was something from funcs.sh, no OS ID passed, call determineOS.

george1421

@p4cm4n There is a master imaging script that gets called when FOS Linux starts up. It’s called simply fog: https://github.com/FOGProject/fos/blob/fda59eca648af1a38ed57c94f65558221e77534f/Buildroot/board/FOG/FOS/rootfs_overlay/bin/fog#L1

At the beginning of that script, on line 4, I added (a long time ago) the code for USB booting, so that the target computer calls back to the FOG server to get the kernel parameters. Normally iPXE would load the kernel parameters when FOG imaging was requested, but with USB booting into FOS Linux there are no kernel parameters per se. So this code was added to the beginning of the master bash script.

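# Only when booted from USB and the FOG web URL is known
if [[ $boottype == usb && ! -z $web ]]; then
    sysuuid=$(dmidecode -s system-uuid)
    sysuuid=${sysuuid,,}    # lowercase the UUID
    mac=$(getMACAddresses)
    # Ask the FOG server for this host's parameters and save the reply to a file
    curl -Lks -o /tmp/hinfo.txt --data "sysuuid=${sysuuid}&mac=$mac" "${web}service/hostinfo.php" -A ''
    # Source the file so the parameters become shell variables
    [[ -f /tmp/hinfo.txt ]] && . /tmp/hinfo.txt
fi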

So specifically, the curl call goes back to the FOG server to run a PHP file that returns the kernel parameters, which curl saves to /tmp/hinfo.txt. Then we source (run) that hinfo.txt file to load the kernel parameters into bash variables. That was the trick to “rebooting” without actually restarting FOS Linux or the computer. Then, after the variables were loaded, I think I called fog.download directly.

If you look down that fog script a bit you will see the variables $mode and $type. After you load the kernel parameters, you may need to unset $mode and set $type to “down”, then call fog.download. Those are the variables the script checks to know which FOG module to call on each boot.

The trick is to call fog.download and not let fog.man.reg exit, because when it does, control returns to the main fog script and the target system will reboot.
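
Put together, the steps above could be sketched roughly like this. This is not the code originally posted in the forums: the helper name prepare_deploy is made up, it assumes the curl call to hostinfo.php has already written the parameters file, and whether fog.download needs more than these variables is untested.

```shell
#!/usr/bin/env bash
# prepare_deploy HINFO_FILE (hypothetical helper name)
# Source the kernel parameters fetched from the FOG server, then set the
# variables the main fog script looks at so the next step is a deploy.
prepare_deploy() {
    local hinfo="$1"
    [ -f "$hinfo" ] || return 1   # nothing fetched, nothing to load
    . "$hinfo"                    # load the parameters as shell variables
    unset mode
    type="down"
}

# At the end of fog.man.reg this might be used as (not run here):
#   prepare_deploy /tmp/hinfo.txt && fog.download
```

Calling fog.download from inside fog.man.reg, rather than after it exits, is what keeps control from returning to the main fog script and triggering the reboot.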
