Chainloading failed / boot looping

george1421

I would also like to know what is going on in this condition, where iPXE is getting partial or no DHCP information. That is a bit crazy, since the PXE ROM was able to load the iPXE kernel from the boot server using the boot file value. This is not the first time I have heard of this situation. I don’t know whether it is caused by a DHCP proxy server in the environment, or by a slow network link keeping the port from reaching the forwarding state until some time after the iPXE kernel needs it. While this isn’t really a FOG issue, it does tend to color the perception that FOG is not ready for production use.

gwhitfield

@Sebastian-Roth - Current boot menu settings:

I did (and do) have the boot menu hidden but when I un-hide it I do get the menu after entering the FOG IP. Then it fails. I did make sure of the e1000 NIC and snponly.efi settings. This environment has a 2012 Standard server doing DHCP to approx 75 BIOS machines (no proxy). This UEFI VM is only used for testing in preparation for adding UEFI to the mix this Fall. Therefore I have the policies and options set to allow BIOS and UEFI machines to grab their own boot files which works very well for the BIOS machines. Seems like I’m almost there. I have other FOG servers doing the same thing but they’re 2008 boxes and I can’t set policies so I have to leave them alone or face the wrath of a lot of people not being able to boot their BIOS machines.
@george1421 - Having relied HEAVILY on FOG for many years I can say that my perception of FOG is rose colored! It’s all just a little bump in the road, probably of my own doing rather than FOG’s.

george1421

@gwhitfield Just for clarity, these two environments you mentioned (2008 dhcp and 2012 dhcp) are in different broadcast domains and subnets?

As Sebastian said, the next step is to get a pcap of the communication between the target and dhcp server to see what is going on with this second stage dhcp request. The first stage request is working since the ipxe kernel is making it to the target computer. It’s just that when the ipxe kernel issues its own dhcp request, the dhcp server is not issuing the option 66 value correctly.

The preferred way is to set up wireshark on a mirrored port. Since the dhcp communications are broadcasts you can pick up this information from any location in the same broadcast domain. If your fog server is on the same subnet as the target computer, you can install tcpdump on your fog server and pick up that traffic too. This would get all of the broadcast traffic plus any unicast communication between the target and the fog server.
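
For anyone reading such a capture by hand, here is a rough sketch of what to look for. It is not part of FOG, tcpdump or wireshark. It is a minimal Python illustration that assumes you already have the raw UDP payload of one DHCP packet (for example exported from wireshark), and the function names `parse_dhcp_options` and `pxe_summary` are made up for this example. It pulls out option 60 (vendor class), option 66 (TFTP server) and option 67 (boot file), plus the BOOTP next-server and file header fields, which some servers fill in as well.

```python
import ipaddress

BOOTP_FIXED_LEN = 236                    # fixed BOOTP header length
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options


def parse_dhcp_options(payload: bytes) -> dict:
    """Return {option_code: raw_bytes} for one BOOTP/DHCP UDP payload."""
    start = BOOTP_FIXED_LEN + len(DHCP_MAGIC_COOKIE)
    if payload[BOOTP_FIXED_LEN:start] != DHCP_MAGIC_COOKIE:
        raise ValueError("magic cookie missing, not a DHCP payload")
    options = {}
    i = start
    while i < len(payload):
        code = payload[i]
        if code == 255:      # End option
            break
        if code == 0:        # Pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break            # truncated option, stop quietly
        length = payload[i + 1]
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    return options


def _text(raw):
    """Decode a NUL-padded ASCII field; missing or empty becomes None."""
    if raw is None:
        return None
    value = raw.split(b"\x00", 1)[0].decode("ascii", "replace")
    return value or None


def pxe_summary(payload: bytes) -> dict:
    """Collect the fields that matter for PXE chainloading."""
    opts = parse_dhcp_options(payload)
    return {
        "vendor_class": _text(opts.get(60)),
        "tftp_server": _text(opts.get(66)),
        "boot_file": _text(opts.get(67)),
        # siaddr (bytes 20-23) and file (bytes 108-235) of the BOOTP header
        "next_server": str(ipaddress.IPv4Address(payload[20:24])),
        "bootp_file": _text(payload[108:236]),
    }
```

Comparing the summary of the reply sent to the PXE ROM with the reply sent to the ipxe kernel should show whether option 66/67 really go missing in the second stage.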

Sebastian Roth

@gwhitfield said:

I did (and do) have the boot menu hidden but when I un-hide it I do get the menu after entering the FOG IP. Then it fails.

Could you please be more specific on how things fail?? Which item do you select from the menu and what happens then? Do you try to boot from local disk? Maybe change the “Exit to Hard Drive Type (EFI)” (seen in your screenshot) and see if that works. Have you actually tried scheduling a task for this VM? What happens if you do so? Please let us know the exact errors you see (picture if possible)!

As well I am still happy to have a look at the PCAP file to see what’s causing the “enter tftp server” hiccup…

gwhitfield

@george1421 - The 2008 dhcp and 2012 dhcp servers are all at different locations with different subnets and broadcast domains. Exported tcpdump (filtered as suggested) from FOG server: 0_1456840985605_GBfogboot.csv
Never used tcpdump or wireshark, will need to bring in a buddy to assist with a wireshark capture if you still want one.
Did I say "THANK YOU" for your help?!

Sebastian Roth

@gwhitfield The CSV is a good start! I think I can see some weirdness already but unfortunately the CSV is missing the most important bits of information. Try tcpdump -w output.pcap port 67 or port 68 or port 69 or host 192.168.120.135 on your FOG server. Make sure your client is actually getting the IP 192.168.120.135 from your DHCP server. This way we can also see the client’s HTTP request. Might be helpful as well.

gwhitfield

@Sebastian-Roth here’s the output. IP 120.135 confirmed
0_1456847693267_output.pcap

Sebastian Roth

@gwhitfield Your DHCP server is actually offering different information depending on the request being sent by the client. The first DHCP DORA (discovery, offer, request, ack) sequence, issued by the VM’s PXE ROM, comes with all the PXE info included (next-server/option 66: 192.168.120.19 and filename/option 67: snponly.efi). Seems fine. Then the iPXE binary is loaded via TFTP and sends its own DHCP discovery request. That request looks a bit different from the first one (that’s normal for iPXE!) as it provides option 175 and some other things.
Hmmmmmmmm here I noticed something that might cause the issue. In the first request the client sends vendor class identifier “PXEClient:Arch:00007:UNDI:003016” but the iPXE binary sends “PXEClient:Arch:00009:UNDI:003010”. See the difference in arch. I guess you set up vendor classes to match ID 7 only? Those classes are still a mystery to me. Some UEFI firmwares send 7, others 9, and iPXE might do 7 or 9 as well. I guess that it somehow changed when you updated to the latest iPXE binaries.
So back to what happens next: The answer from your DHCP server comes without any PXE information whatsoever, most probably (I hope) caused by the class mismatch just mentioned. This is why iPXE does not find the next-server/TFTP server IP by itself.
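
To make the mismatch concrete, here is a small Python sketch that pulls the architecture code out of the two vendor class strings seen in the capture. It is only an illustration, not something FOG or iPXE ships: `client_arch` and `ARCH_NAMES` are names invented for this example, and the architecture labels follow the client architecture list in RFC 4578.

```python
import re
from typing import Optional

# Subset of the client system architecture types from RFC 4578.
ARCH_NAMES = {
    0: "Intel x86PC (BIOS)",
    2: "EFI Itanium",
    6: "EFI IA32",
    7: "EFI BC",
    8: "EFI Xscale",
    9: "EFI x86-64",
}

# "PXEClient:Arch:<5 digits>:UNDI:<6 digits>"
_VENDOR_CLASS = re.compile(r"^PXEClient:Arch:(\d{5}):UNDI:(\d{6})$")


def client_arch(vendor_class: str) -> Optional[int]:
    """Return the numeric arch code, or None if the string is not a PXE class."""
    match = _VENDOR_CLASS.match(vendor_class)
    return int(match.group(1)) if match else None


pxe_rom = "PXEClient:Arch:00007:UNDI:003016"  # sent by the VM's PXE ROM
ipxe = "PXEClient:Arch:00009:UNDI:003010"     # sent by the iPXE binary

for label, vendor_class in (("PXE ROM", pxe_rom), ("iPXE", ipxe)):
    arch = client_arch(vendor_class)
    print(f"{label}: arch {arch} ({ARCH_NAMES.get(arch, 'unknown')})")
```

A DHCP policy that only matches arch 7 would answer the PXE ROM with boot information but treat the iPXE request as an ordinary client.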

Sebastian Roth

Again: Have you ever tried registering this MAC address in the FOG web interface by hand and scheduling an upload task for it? What happens when you PXE boot the client then? Picture or video of an error would be great. Otherwise I can only guess what’s going on.

george1421

@Sebastian-Roth said:

Hmmmmmmmm here I noticed something that might cause the issue. In the first request the client sends vendor class identifier “PXEClient:Arch:00007:UNDI:003016” but the iPXE binary sends “PXEClient:Arch:00009:UNDI:003010”. See the difference in arch. I guess you set up vendor classes to match ID 7 only? Those classes are still a mystery to me. Some UEFI firmwares send 7, others 9, and iPXE might do 7 or 9 as well. I guess that it somehow changed when you updated to the latest iPXE binaries.

We may need to update the wiki to be sure to include all arch settings. I see it lists this for the Linux dhcp, but not for the windows 2012 setup (step 3). @Wayne-Workman

Wayne Workman

@george1421 said:

We may need to update the wiki to be sure to include all arch settings. I see it lists this for the Linux dhcp, but not for the windows 2012 setup (step 3). @Wayne-Workman

The steps are the same for all architecture types - you’d just change the number in step 3 and then maybe give the names something that is specific to the arch you have set up.

That said - I also understand that someone who doesn’t understand it already will be totally lost for how to set it up for additional architectures. So we do need more steps. Maybe even a video.

Also - in case anyone is wondering what the heck we are talking about, we are talking about this: https://wiki.fogproject.org/wiki/index.php?title=BIOS_and_UEFI_Co-Existence

george1421

@Wayne-Workman said:

That said - I also understand that someone who doesn’t understand it already will be totally lost for how to set it up for additional architectures. So we do need more steps.

Maybe just add some text under step three to rinse, wash and repeat for “PXEClient:Arch:00002”, “PXEClient:Arch:00006”, “PXEClient:Arch:00008” and “PXEClient:Arch:00009”. If they don’t read through the Linux section, the Windows admins would not know these values are also required. (And for full disclosure, I did not create them either. Will do now…)
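
A quick way to spot gaps like this, sketched in Python. Everything here is illustrative: the `policies` dict stands in for whatever vendor classes are defined on the DHCP server, the boot file names are simply the ones mentioned in this thread, and plain prefix matching is an assumption about how the server compares the vendor class string.

```python
# Vendor class prefixes discussed in this thread: BIOS (00000), the class
# the PXE ROM sent (00007), and the four additional values listed above.
EXPECTED_PREFIXES = [
    "PXEClient:Arch:00000",
    "PXEClient:Arch:00002",
    "PXEClient:Arch:00006",
    "PXEClient:Arch:00007",
    "PXEClient:Arch:00008",
    "PXEClient:Arch:00009",
]


def boot_file_for(vendor_class, policies):
    """Return the boot file of the first matching policy, or None."""
    for prefix, boot_file in policies.items():
        if vendor_class.startswith(prefix):
            return boot_file
    return None  # no policy matched: the client gets no PXE options


def uncovered(policies):
    """List the expected vendor class prefixes that have no policy."""
    return [prefix for prefix in EXPECTED_PREFIXES if prefix not in policies]


# Hypothetical server with only a BIOS and an arch 7 policy defined.
policies = {
    "PXEClient:Arch:00000": "undionly.kpxe",
    "PXEClient:Arch:00007": "snponly.efi",
}

print(boot_file_for("PXEClient:Arch:00007:UNDI:003016", policies))
print(boot_file_for("PXEClient:Arch:00009:UNDI:003010", policies))
print(uncovered(policies))
```

The sketch only reports which classes are missing. Which boot file each extra architecture should receive is a separate decision and is not settled here.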

gwhitfield

@Sebastian-Roth said:

Again: Have you ever tried registering this MAC address in the FOG web interface by hand and scheduling an upload task for it? What happens when you PXE boot the client then? Picture or video of an error would be great. Otherwise I can only guess what’s going on.

Yes, the MAC is registered and an image task runs if/when set. It appears that if there’s no task it doesn’t go to the HD as the next option.

george1421

@gwhitfield said:

Yes, the MAC is registered and an image task runs if/when set. It appears that if there’s no task it doesn’t go to the HD as the next option.

That should be your exit condition (e.g. sanboot, grub, exit, etc.). Some systems need different exit conditions, I’m sorry to say.

gwhitfield

@george1421 - For the time being I decided to simplify and set the Policies to:
1. Arch=00000 (BIOS) - undionly.kpxe (works great)
2. Arch<>00000 (all other) - snponly.efi
I was hoping that this would tell anything not reporting as a BIOS machine to get the same boot file regardless of architecture. Doesn’t seem to have worked.

Wayne Workman

@gwhitfield That won’t work for 32 bit UEFI systems.

george1421

@gwhitfield Yeah I agree with Wayne, plus I don’t think wild card matches are supported. You have to spell out each one to get a match. No Windows cheating here.

gwhitfield

Hold your horses. I just changed the boot order to look at some image details and the VM won’t boot to the HD. I think my test VM got reverted to a BIOS image on a UEFI disk. I’m guessing that’s the explanation for the chainloading failure. Reinstalling the OS now, should know shortly…

gwhitfield

@gwhitfield - Nope, not it. Correcting the image on the disk didn’t change anything. However, at some point recently I stopped being prompted for the IP address and I hope that’s a good thing. I’m pretty sure that happened when I changed the DHCP policies. I remember with some of our older FOG machines we had to correct the chainloading by adding/editing some files. Is this possibly as simple as that? I did try all the different exit conditions and they all respond exactly the same way.

gwhitfield

@gwhitfield - another output after reinstall OS and changing DHCP policy:
0_1456859580183_output2.pcap
