@brooksbrown I think the terms are getting mixed up here, so let me try to set this straight. When installing FOG you choose one of two modes: master node OR storage node (there is nothing called “master storage node”!). From your description it sounds as if you installed all your FOG servers as master nodes. That means each one has its own separate DB, and a host registered on one node won’t show up on another unless you register each and every client with every FOG server you have. In the usual setup, on the other hand, you’d have ONE master node (where the DB and web UI live) and several storage nodes.
The “failover setup” you intend to build is not that easy to handle, I am afraid. There is a lot involved in PXE booting a client: 1st DHCP handshake from the PXE ROM, TFTP to load the iPXE binary, 2nd DHCP handshake from the iPXE binary, TFTP and then HTTP request to load the iPXE config, HTTP request to load kernel and initrd, 3rd DHCP handshake from the Linux kernel. That might sound pretty straightforward. But if the FOG server for a particular bench fails (or is too busy), the client will get its first DHCP answer from another FOG server. For example, let’s say the 1st DHCP handshake is answered by NODE 2 (as the client is on bench 2). Then the client will download the iPXE binary from NODE 2 as well. But the 2nd DHCP handshake might be answered by NODE 3 (because NODE 2 is not fast enough this time). That is still fine if they all share the same DB (which would be on the one single master node) and the client gets a consistent iPXE config (e.g. FOG menu or task).
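To make the sequence above a bit more tangible, here is a toy model in Python. It is only a sketch under loud assumptions: the responder for each DHCP handshake is picked at random (in reality it depends on load and timing), and every non-DHCP stage is assumed to go to whichever node answered the preceding handshake. The names `BOOT_STAGES`, `simulate_boot` and `nodes_involved` are made up for illustration and are not part of FOG.

```python
import random

# Boot stages as listed above. The flag marks stages that start with a DHCP
# broadcast, which any DHCP server on the segment may answer.
BOOT_STAGES = [
    ("PXE ROM: 1st DHCP handshake", True),
    ("TFTP: load iPXE binary", False),
    ("iPXE: 2nd DHCP handshake", True),
    ("TFTP, then HTTP: load iPXE config", False),
    ("HTTP: load kernel and initrd", False),
    ("Linux kernel: 3rd DHCP handshake", True),
]


def simulate_boot(nodes, rng):
    """Return a list of (stage, node) pairs for one simulated client boot."""
    trace = []
    current = None
    for stage, is_dhcp in BOOT_STAGES:
        if is_dhcp:
            # Toy assumption: the node that answers first is effectively random.
            current = rng.choice(nodes)
        # Assumption: follow-up requests go to the node from the last DHCP answer.
        trace.append((stage, current))
    return trace


def nodes_involved(trace):
    """Return the set of nodes that took part in a single boot."""
    return {node for _, node in trace}


if __name__ == "__main__":
    nodes = [f"NODE {i}" for i in range(1, 9)]
    for stage, node in simulate_boot(nodes, random.Random()):
        print(f"{stage:36s} -> {node}")
```

Run it a few times and a single boot will usually touch more than one node, which is exactly why every node has to hand out the same files and the same iPXE config.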
In theory this all works, but all servers need to be kept in sync: TFTP files, kernels/initrds, FOG web UI. If you alter the kernel on one server, you might see clients from a different bench booting that kernel at random.
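One way to catch that kind of drift is to compare checksums of the boot files across nodes. The following is a minimal sketch, assuming you have each node’s TFTP directory (or kernel/initrd directory) mounted or copied to a local path. The function names and any paths you pass in are hypothetical, and this is not a FOG feature.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def find_drift(reference, others):
    """Compare node trees against a reference tree.

    reference: directory holding the files of the node you trust.
    others:    dict mapping a node name to that node's directory.
    Returns a dict mapping node name to the sorted list of relative paths
    that are missing, extra, or different on that node.
    """
    ref = tree_digests(reference)
    drift = {}
    for name, root in others.items():
        cur = tree_digests(root)
        changed = sorted(p for p in ref.keys() | cur.keys() if ref.get(p) != cur.get(p))
        if changed:
            drift[name] = changed
    return drift
```

An empty result means the compared trees are byte-for-byte identical. Anything else lists the files that would make a client behave differently depending on which node happens to serve it.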
All that said, there is another reason why I think this setup is not great. For every DHCP broadcast a client sends, it gets up to eight answers, one from each DHCP server. Tracking down an issue and keeping all of this set up properly will be a nightmare, I suppose! What if just one single setting is different on NODE 6? Some clients will boot properly but others will fail at random because of that.
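If you do go down this road, you would at least want some way to spot the one node that differs. Here is a small sketch, assuming you have already collected the relevant DHCP/boot settings of every node into plain dictionaries of strings. How you collect them is up to you, and `find_outliers` as well as the sample values in the usage note are purely illustrative.

```python
from collections import Counter


def find_outliers(settings_by_node):
    """Report settings whose value is not the same on every node.

    settings_by_node: dict mapping node name to a dict of setting -> value.
    Values must be hashable (plain strings work fine).
    Returns {setting: {"majority": value, "outliers": {node: value}}}.
    A setting missing on a node is reported with the value None.
    """
    all_keys = set()
    for settings in settings_by_node.values():
        all_keys.update(settings)

    report = {}
    for key in sorted(all_keys):
        values = {node: s.get(key) for node, s in settings_by_node.items()}
        counts = Counter(values.values())
        if len(counts) > 1:
            majority, _ = counts.most_common(1)[0]
            report[key] = {
                "majority": majority,
                "outliers": {n: v for n, v in values.items() if v != majority},
            }
    return report
```

For example, with eight nodes that all hand out the boot file `undionly.kpxe` except NODE 6, the report would contain a single entry naming NODE 6 and its deviating value.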
I reckon one could be keen enough to set this all up in one broadcast domain using two or three servers at the most. But definitely not eight. This will cause you so much headache, I suspect. Just don’t do it unless you and the rest of your team are real network wizards who love using tcpdump to analyse packet dumps and figure out what’s going on.
If you intend to use the fog-client, the whole idea of failover is dead and buried anyway. Sure, fog-clients not reaching their particular server is not as problematic as the other stuff can be. But failover is just not possible there.
I bet you’re better off taking some more time to think about the network setup now and having far fewer issues later on…