FOG Boot Process
I’m working on putting together something to describe the FOG boot process without being overly technical about it. The intention of the tutorial and eventual video is to enable newcomers to identify where a problem is, which would then let them troubleshoot more effectively.
I’ve outlined these general points:
System powers on and network boots.
----What can go wrong here?
----System must support PXE booting in order to boot from the network.
----The boot order must include the network before the local HDD, OR you must select a temporary boot device by hitting a hot key during POST.

The system requests DHCP.
----What can go wrong here?
----No DHCP server means no lease.
----No network connection.
----Bad patch cable.
----A firmware issue might exist; try updating the system's firmware.

DHCP server responds with a lease. If a ProxyDHCP server is not in use, the DHCP server's response will also contain the next-server and boot file name. If a ProxyDHCP server is in use, the ProxyDHCP server will respond with the next-server and boot file name.
----What can go wrong here?
----If a next-server is not set on the DHCP server or a ProxyDHCP server, then no next-server will be given.
----If a filename is not set on the DHCP server or a ProxyDHCP server, then no filename will be given.
----If the DHCP server does not hear the request, no response will be given.
----If the DHCP server is out of leases, no lease will be given.

System requests the specified iPXE binary (the filename) from next-server via TFTP.
----What can go wrong here?
----Firewall on the FOG server can block this request.
----TFTP might not be running.
----Permissions could have been changed and might prevent access to the file.
----The file might not exist (check for typos in the filename, option 067).
----The system's firmware may not have good support for retrieving a file from another network (if the FOG server is on another network). This is more common in older computers.
----The FOG server might be turned off or disconnected.

Next-server sends the iPXE binary.
----What can go wrong here?
----Packet loss due to pre-existing network problems.

System loads the PXE binary.
----What can go wrong here?
----The pre-built iPXE binaries might not support ALL hardware, especially very uncommon NICs.
----The system can hang.
----There could be a reboot loop.

PXE binary requests default.ipxe from the next-server via TFTP. This file says where to try to get an iPXE boot script from.
----What can go wrong here?
----The file might not exist.
----Permissions could have been changed and might prevent access to the file.

Next-server sends default.ipxe.
----What can go wrong here?
----Packet loss due to pre-existing network problems.

PXE binary reads default.ipxe and then requests the boot.php file from the specified web address.
----What can go wrong here?
----The specified web address inside default.ipxe could be incorrect, resulting in a timeout or file not found for boot.php.
----The FOG server's firewall might block the web request.
----Apache might not be running.
----boot.php might not exist.
----MAC addresses present on this system that are still assigned to other hosts (USB Ethernet adapters, NICs, WiFi cards) can cause incorrect responses.

Web server queries the database to look for jobs and settings, then generates a host-specific iPXE boot script and gives it to the requesting system.
----What can go wrong here?
----A DB password can block access if one is set on the DB but not defined within FOG.
----MySQL/MariaDB might not be running. Try restarting it.
----Firewall might block this request.
----The DB may not have been set up due to an incomplete installation of FOG.
----MySQL may be having load issues.
----MySQL may have crashed; look for error logs and try restarting.

PXE binary reads this script from the web server. If there are no jobs waiting, it will by default show the FOG boot menu and then exit to the HDD (loads the local OS). If there is a job, it then downloads the specified kernel and init via HTTP from the specified web server.
----What can go wrong here?
----System hangs when trying to load the local OS. This means the Exit to HDD type is not suitable for the system and needs to be changed (host management).
----Permissions on the kernel and init may be incorrect, blocking the download.
----The specified kernel may not exist (clear the host kernel field).
----Load issues could cause this to hang or be very slow.
----Packet loss due to pre-existing network problems.

PXE binary passes control to the kernel with the specified options and arguments from the iPXE boot script.
----What can go wrong here?
----Kernel panic due to the kernel not being compiled with needed parameters for this particular system.
----System could hang due to the kernel not being compiled with needed parameters for this particular system.
----Garbage output on the screen due to the kernel not being compiled with needed parameters for this particular system.
----Wrong architecture kernel for this system (x86 on an x64 system and vice versa).
----Corrupt kernel causes weird undocumented errors.

Kernel performs task.
----What can go wrong here? Past this?
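For the TFTP step in the outline above, here is a minimal Python sketch of what the request for the iPXE binary looks like on the wire, and how the possible outcomes map back to the failure list. The packet layout follows the TFTP spec (RFC 1350). The function names are my own, and the server address and filename in the usage note are placeholders picked for illustration, not values from a real setup. Treat it as a rough probe under those assumptions, not a full TFTP client.

```python
import socket
import struct


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def classify_reply(packet: bytes) -> str:
    """Turn the first packet the server sends back into a readable verdict."""
    if len(packet) < 4:
        return "malformed reply"
    opcode, value = struct.unpack("!HH", packet[:4])
    if opcode == 3:  # DATA: the server found the file and started sending it
        return f"ok: received data block {value}"
    if opcode == 5:  # ERROR: value is the error code, then a NUL-terminated text
        message = packet[4:].split(b"\x00", 1)[0].decode("ascii", "replace")
        hints = {
            1: "file not found, check the filename for typos",
            2: "access violation, check file permissions",
        }
        hint = hints.get(value, "see the TFTP server log")
        return f"error {value}: {message} ({hint})"
    return f"unexpected opcode {opcode}"


def probe(server: str, filename: str, timeout: float = 3.0) -> str:
    """Ask the server for a file and report what came back.

    Only the first reply is read and it is never acknowledged, so the
    server will retry a few times and then give up on the transfer.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (server, 69))
            packet, _ = sock.recvfrom(1024)
        except OSError:  # includes timeouts
            return ("no reply: firewall, TFTP service not running, "
                    "or server unreachable")
    return classify_reply(packet)
```

Something like `print(probe("192.0.2.10", "undionly.kpxe"))` from another machine on the same network would then answer one of "file is being served", "server answered with an error", or "nothing answered at all", which are the three groups the bullet points fall into.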
Does anyone have any recommendations on how these points could be made better? Am I missing anything?
Absolutely! I just started to work on this. Long way still to go…
@Sebastian-Roth I looked it over and made typo corrections and added links and things.
There is quite the gap between full-registration and creating an image though… need to put kernel steps in there I think, and how to change the kernel.
Also - just how detailed do we want to get? I’d really like to be very, very detailed about this - and point out in a well-edited video (I can do that) where each and every little bitty thing happens.
@Wayne-Workman See in the footer of my messages… :-)
@Sebastian-Roth What article are you working on?
Has it really been a month since we were talking about this? I finally started to work on the mentioned article. I am sure this will be work in progress as we see issues in the forums…
@Sebastian-Roth The earlier one is very, very dated. Good material, but a good deal of what Jeff was talking about back then (2009) is automated now, and there are newer, better ways to make changes, even integrated ways.
I say we fix up this article, update it and expand it:
@Wayne-Workman Have you had time to look through the wiki articles? Should we start based on one of those? Which wiki article/URL should we choose? Let me know when you are into it again. I guess it would be good to discuss in a chat session again.
@Sebastian-Roth I revised the OP based on your suggestions (thank you). Can you look it over?
Also - I agree on making it upload-specific.
Uploads are what people stumble on when they are new to FOG, so it’s best we fully document every step of the upload process.
Just found this as well: https://wiki.fogproject.org/wiki/index.php?title=Booting_into_FOG_and_Uploading_your_first_Image
@Wayne-Workman Good that you are picking this up. I thought about that as well (as I have started to work on the wiki documentation) but kept putting it off.
You are naming the iPXE binary ROM (like in “Next-server sends the boot rom”) which I think can be confusing to people who don’t know much about that yet. I’d really only use the term ROM when talking about the “PXE ROM” aka the NIC firmware (burned in flash ROM) that does the initial DHCP request and (chain)loads the PXE binary. What you called ROM I’d call “(PXE) boot file” or “PXE binary”.
The point “System reads this script from the web server and then downloads the specified kernel and init via HTTP” is not telling the whole truth. At this point, quite often the FOG menu is displayed to the user. If boot from HD is selected, there is no Linux kernel involved. So I think we need to have a very good description of exactly this point, where a lot of things can go wrong (exit type setting, screwed iPXE config generated by the PHP script, missing/wrong kernel or kernel/initrd combination, missing kernel drivers and what not), and people should be able to figure out which part it is.
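To make "figure out which part it is" a bit more concrete, here is a rough Python sketch that looks at the text boot.php returned and guesses which branch the client is about to take. It only relies on standard iPXE script conventions (a script starts with `#!ipxe`; `menu`/`choose`, `kernel`, `initrd`/`imgfetch`, `sanboot` and `exit` are iPXE commands). The function name, the return labels and the sample scripts in the usage note are made up for illustration and are not actual FOG output.

```python
def classify_boot_script(script: str) -> str:
    """Guess what an iPXE script returned by the web server will do."""
    # A PHP error page or an empty reply will not start with the iPXE shebang.
    if not script.lstrip().startswith("#!ipxe"):
        return "not-ipxe"

    # Collect the first word of every line that is not blank, a comment
    # or a label. Lines like "cmd && other" count only as "cmd".
    commands = set()
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith(":"):
            continue
        commands.add(line.split()[0])

    # Check for a menu first: a menu script may also carry kernel lines
    # under its entries, while a queued task boots without asking.
    if "menu" in commands or "choose" in commands:
        return "menu"
    if "kernel" in commands:
        if not ({"initrd", "imgfetch"} & commands):
            return "task-missing-init"
        return "task"
    if "sanboot" in commands or "exit" in commands:
        return "local-boot"
    return "unknown"
```

For example, `classify_boot_script("#!ipxe\nkernel bzImage\nimgfetch init.xz\nboot\n")` gives `"task"`, while a reply that begins with HTML gives `"not-ipxe"`, which points at the PHP side rather than at the kernel or the exit type.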
I am wondering where to put this in the wiki or if we have a similar article already. This one is kind of along the lines of what you want to write down: https://wiki.fogproject.org/wiki/index.php?title=PXE
But I really think this is not a good title for this article (nor is it a good title for what is written there at the moment - initrd stuff…). What about “Client boot”, “Client boot process”, “Client netboot” or “Netbooting the clients”?
Edit: Hmmm, it would actually fit into our series of “Troubleshoot_XXX” articles. Maybe “Troubleshoot_Netboot” (there already is “Troubleshoot_Uploading” I found - which we might include then…)
DHCP responds with lease which includes next-server and boot file
I would suggest that you consider pointing out that this stage will include the response from the normal DHCP server and may possibly also include a response from a separate proxy DHCP server (if one is in use).
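A small Python sketch of that idea, in case it helps with the wording: the lease comes from the normal DHCP server, while next-server and filename come from the ProxyDHCP reply when there is one and otherwise from the DHCP reply. The `Reply` class, the `resolve` function and the addresses in the usage note are hypothetical names and placeholder values, and the fallback order is a simplification of what real PXE firmware does.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Reply:
    """The parts of a DHCP or ProxyDHCP reply that matter for netboot."""
    lease: Optional[str] = None
    next_server: Optional[str] = None
    filename: Optional[str] = None


def resolve(dhcp: Optional[Reply],
            proxy: Optional[Reply] = None) -> Tuple[Reply, List[str]]:
    """Combine both replies and list what is still missing."""
    problems = []

    # Only the normal DHCP server hands out an address.
    lease = dhcp.lease if dhcp else None
    if lease is None:
        problems.append("no lease")

    # Boot information: prefer the proxy reply, fall back to the DHCP reply.
    sources = [r for r in (proxy, dhcp) if r is not None]
    next_server = next((r.next_server for r in sources if r.next_server), None)
    filename = next((r.filename for r in sources if r.filename), None)
    if next_server is None:
        problems.append("no next-server")
    if filename is None:
        problems.append("no filename")

    return Reply(lease, next_server, filename), problems
```

With `resolve(Reply(lease="192.0.2.50"))` the problem list comes back as `["no next-server", "no filename"]`, which is the case where neither the DHCP server nor a ProxyDHCP server has been told about the FOG server.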