UNSOLVED Second "server" to solve BIOS boot order problem if FOG server is not available
I have a single working FOG server in a production environment. In most cases it is fine, but sometimes the PCs cannot reach it to boot from (the server itself is OK but the network is unreachable, or the server is offline for maintenance, etc.). Almost all of the PCs we use fail the same way when the boot order is network -> hard disk: if netboot fails, the machine hangs there, and the user has no option but to wait (a 10-15 minute timeout; nonsense, but it is about that long, depending on the BIOS, old or new, no matter).
This is why I want a bypass-like service at another location, strictly for bypassing netboot when FOG is not reachable. It should be the least problematic way to let the machines pass on to the hard disk, and it would be enough for 99% of situations.
How can I manage this? I was thinking of using keepalived to check whether the master FOG server is reachable; if it is not, a virtual interface would come up to serve as the target for network boot clients as a bypass. Up to this point it is not a big deal, but what should I have running on the bypass “server”? My first thought was to install a “next-next-finish” dummy FOG server, fully configured, but that can cause a lot of problems, so maybe there is a more sophisticated way to accomplish the goal.
Was I clear about what I want? If not, let me know what to explain in more detail. My only goal is to have the slave server produce a “boot from hard drive, nothing to do here” state, and when the master server comes back, keepalived would take down the virtual interface to let the master do its normal job. I hope that is clear enough to start finding a solution, guessing, etc.
I don’t think I am the only one who has noticed that some BIOSes can’t move on when the netboot server is not available; I hope you guys have tried to find a solution before.
Aha, thx! I will surely give it a try! Thx
@d4rk3 Great solution!
d4rk3 last edited by d4rk3
You could always use SolarWinds Free TFTP Server on two separate servers (assuming you have them); even two regular old computers would work.
Just configure both and copy the contents of your FOG server’s /tftpboot into each TFTP server’s root directory.
Of course this isn’t completely fail-safe…but this may do the trick:
Hostname: TFTP-Server01 (10.0.0.5)
Hostname: TFTP-Server02 (10.0.0.6)
Set option 66 to point at TFTP-Server01.
Add static DNS entries:
As you can see, TFTP-Server01 has two entries, the second being TFTP-Server02’s IP.
This works in a Windows environment as long as the second IP for the specific host is numerically higher (e.g. 10.0.0.6/10.0.0.7/10.0.0.8/etc.)
The only way hosts would be stuck waiting for their PXE bootfile is if both TFTP servers went down…you could even add a third server (TFTP-Server03 (10.0.0.7)) and just add another entry for TFTP-Server01 (10.0.0.7).
This is how I have FOG set up in one of my environments. This way I can freely reboot my FOG server and only need to worry about it being up when I want to image.
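To check that each TFTP server in a setup like the one above really answers for the boot file, a small probe can help. The following is only a minimal sketch, not part of FOG or SolarWinds: it sends a plain TFTP read request (RFC 1350) over UDP and classifies the first reply. The function names and the default timeout are hypothetical choices made for illustration.

```python
import socket
import struct

TFTP_PORT = 69
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def classify_reply(packet: bytes) -> str:
    """Return 'data', 'error' or 'unknown' based on the 2-byte opcode."""
    if len(packet) < 4:
        return "unknown"
    (opcode,) = struct.unpack("!H", packet[:2])
    return {OP_DATA: "data", OP_ERROR: "error"}.get(opcode, "unknown")


def probe(host: str, filename: str, timeout: float = 2.0) -> str:
    """Request `filename` from `host`; return 'data', 'error', 'unknown' or 'no_reply'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (host, TFTP_PORT))
            packet, addr = sock.recvfrom(1024)
            if classify_reply(packet) == "data":
                # Politely abort the transfer with a TFTP ERROR packet
                # (opcode 5, error code 0) sent to the server's transfer port.
                sock.sendto(struct.pack("!HH", OP_ERROR, 0) + b"probe done\0", addr)
        except OSError:
            # Covers timeouts as well as unreachable hosts and networks.
            return "no_reply"
    return classify_reply(packet)
```

Calling `probe("10.0.0.5", "undionly.kpxe")` and `probe("10.0.0.6", "undionly.kpxe")` from a monitoring job would then report whether each server hands out the boot file. The file name `undionly.kpxe` is an assumption here; use whatever your DHCP option 67 actually points at.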
A fully virtualized FOG server in a fail-safe HA environment is my final goal. My only problem is my boss, who won’t accept the need for this, right now at least. But of course it will be the final solution. Until then, I will experiment and maybe find a working solution. I will consider what you said and run tests when I have some free time. If I manage to get a “bare minimum” version working, I will post it. Maybe others can profit from it, or at least have fun with my struggle and get new ideas that are even better. ;)
@Foglalt If you could migrate FOG to Windows Server Hyper-V, you could set up two Windows servers with the Hyper-V role installed, and have the FOG server just replicate to the other. This is what I do at work. If for whatever reason my primary FOG server goes down, my secondary (which is always identical) automatically starts and takes the load.
As far as I am aware of how FOG works, I can add a storage node, but not a main server for task and boot purposes. Am I wrong?
I’m not sure about FOG 1.2.0 but in FOG Trunk, storage nodes can also serve as TFTP servers for network booting. I believe you have to use the location plugin to accomplish this.
I’ve thought of another solution - as you want “The bare minimum”. I think in this case, you do need two servers. And I think the first one should be a dummy server as you said. However, the “dummy” server can’t be so dumb… it has to be smart.
The dummy server’s IP address must be your DHCP option 066, and it must be able to ping the FOG server to see if it is alive. While it knows the FOG server is alive, it should chain-load to the FOG server. If the FOG server is not alive, it should instruct computers to boot from HDD.
All of these pieces that you need can easily be accomplished by having a 2nd FOG server, writing a custom script, and scheduling it as a cron job.
Basically, I think you’d simply set the iPXE menu to just chain load to the main FOG server for every computer. However, you need a script running to see if the main FOG server is alive or not. If it’s not, the script should change the iPXE menu to tell every computer to boot from HDD. Conversely, when the main FOG server comes back online, the script should set the file back the way it was.
I think that instead of manipulating files in FOG with a script, you could simply have two files that the script would just rename back and forth.
I think that doing this without a second FOG server would be extremely difficult and I wouldn’t know where to begin with that.
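The check-and-swap script described above could look roughly like this. It is only a sketch under several assumptions that are not confirmed anywhere in this thread: that the dummy server serves a `default.ipxe` from its TFTP root, that a TCP connection to port 80 is a good enough "alive" test for the main FOG server, and that `sanboot --no-describe --drive 0x80` boots the first local disk on your hardware (plain `exit` may suit some BIOSes better). The file names `chain.ipxe` and `local.ipxe` and all function names are hypothetical.

```python
import os
import shutil
import socket
from pathlib import Path

# Hypothetical iPXE scripts; FOG does not ship them in this form.
CHAIN_SCRIPT = "#!ipxe\nchain tftp://{fog_ip}/default.ipxe\n"
LOCAL_SCRIPT = "#!ipxe\nsanboot --no-describe --drive 0x80\n"


def fog_is_alive(host: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the main FOG server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def activate(source: Path, live: Path) -> None:
    """Replace `live` with a copy of `source` in one atomic step."""
    tmp = live.with_name(live.name + ".tmp")
    shutil.copyfile(source, tmp)
    os.replace(tmp, live)  # clients never see a half-written file


def update_menu(alive: bool, tftp_root: Path) -> str:
    """Point default.ipxe at the chain script or the local-boot script."""
    chosen = "chain.ipxe" if alive else "local.ipxe"
    activate(tftp_root / chosen, tftp_root / "default.ipxe")
    return chosen
```

A cron entry would then run something like `update_menu(fog_is_alive("<main FOG IP>"), Path("/tftpboot"))` every minute. Copying over `default.ipxe` instead of renaming files back and forth keeps both templates in place, so a failed run cannot leave the directory without one of them.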
You definitely want me to give up, don’t you? As for the second FOG server: if it is blank, and I only need to modify the DHCP next-server info, it is practically the same as what I am trying to do (the only difference being whether I can figure out what the bare minimum is).
My problem is this: if I have a server with 2 interfaces and one of them is configured to come up only when needed (a.k.a. a failover solution), then how should FOG be configured? If it is fully configured with the same network information (IP, NFS, etc.), I don’t know what malfunction it could cause while it is “online” but not active. (I mean through the other interface, which it can use for communication; I can’t tell FOG not to communicate until I say “now you can”. So what if it does something I don’t know about, for example if the slave writes info to the log on the master and makes it confusing, etc.?)
If the non-active slave is not configured with the “fog-ip” address, then after failover it needs to be, and that is not one click.
All I want is a fully, seamlessly replaceable dummy that tells clients: do nothing. I am not a master of the FOG 1.2 boot process on the client side; I need to learn a lot more to find a solution easily. But I still don’t think it is absolutely unnatural or impossible. Maybe I need to find the correct question for the community, for you, etc. I am sure this magnificent tool can have a minimal backup dummy for this kind of problem.
(P.S. about the minimum: I can’t afford a total backup of the server at the moment, not because of the server but mainly because of the image sizes. I am not sure, but as far as I am aware of how FOG works, I can add a storage node, but not a main server for task and boot purposes. Am I wrong?)
Some parts of the infrastructure are easily modified, but the main parts (network setup, storage allocation, etc.) I cannot alter too easily. You know, the “you have to make things work, but I don’t care how you do it, just make it work” style from the management side… And my colleagues have become addicted to FOG, as we have gained a lot from using your system; we don’t want to give up easily.
How about something much simpler as a solution?
Just have two FOG servers.
One will be your primary server and production server. The other will just be blank and always on.
When the primary goes down, quickly change your DHCP option 066 to point to the secondary. Problem solved. Fix your FOG server, then change it back.
You could even configure the secondary with all your images and hosts and such if you wanted… but that’s beyond the scope of this conversation (But don’t ever expect domain joining to work on the secondary).
I am aware that some have this issue and others don’t, of course. As you pointed out, any link in the chain can fail. Yes, I know that. I don’t need to be bulletproof; for me it is OK if I have an option to survive such cases until I figure out that there is a problem and can fix it. Yes, one solution would be to set every problematic device to boot from the hard disk first (we tested BIOS upgrades: they failed in almost all cases, or other issues came up, just for fun of course). Our main goal with network boot as the first boot device is to be able to manage our computer rooms from a remote location. Think of it a bit our way: dozens of labs with hundreds of computers. Do you really want to go on site to set network boot whenever a new image comes for those computers, boot them up, image them, then switch them all back to hard disk boot? Or, as we plan and operate: remotely wake up the PCs during off-duty hours, have them imaged, then power them off if no failure is logged. If nothing is problematic, it can be scheduled easily. But with “some failing BIOSes” we have to go and see how many there are and where, locate them, replace/upgrade them, etc., or totally disable the feature and go back to the stone age, doing everything manually.
So, the question is still there. Is there a way to build a bare-minimum dummy system that tells PCs to boot from the hard drive? Am I the only one from outer space who imagines this is somehow possible? I hope not.
I am sitting here thinking up even twisted solutions. For example, an older PXE-based version (e.g. 0.32) where the boot image is renamed to the new name (the undionly thing). That version was less complex; maybe only the TFTP boot server part should be installed, with the default menu option just chaining to local hard disk boot. Come on, is it such an unnatural thought?
Wayne Workman last edited by Wayne Workman
Not every computer just sits and stares at the wall for 15 minutes when it fails to retrieve the specified boot file.
However, since yours do, this presents a unique (yet minor) problem.
If you set every computer to boot to the network first, you’re always going to have some single points of failure.
It doesn’t matter how many dummy servers you set up that chain to another server… if the first one… or any in the chain fail (longer chain means greater chance of failure) then your computers will sit and stare at the wall for 15 minutes…
In this situation, I’d really recommend setting the primary boot device on these problematic models to the hard drive. When you need to image, you’d create your imaging task as normal in the FOG web UI, then just walk over to the computer, shut it down, and tell it to network boot via the built-in boot menu.
You might also want to look into any firmware updates that are available for these problematic models. A simple firmware update might just fix your issue.