ipxe dhcp timeout

networkguy

I’m seeing behavior similar to what was discussed here:
https://forums.fogproject.org/topic/4973/default-ipxe-connection-timeout-on-dell-only/7

We have portfast enabled on our switch ports, but it seems iPXE just isn’t getting an IP address quickly enough. We have a NAC appliance that delays the VLAN assignment by a second or two, and that seems to be the culprit: when I remove the dot1x configuration from the switch port, the machine boots as expected. With the NAC (dot1x) configuration in place, if I press ‘s’ and then enter dhcp followed by the chain command at the iPXE shell, I am able to see the boot menu and the machine boots as expected.

The machine in question is a Dell OptiPlex 3010. Is there any way to extend the dhcp timeout?

Junkhacker

What version of FOG are you running?

george1421

Doesn’t the dot1x protocol require a supplicant running on the booting device? If this is the case then iPXE will not work.

networkguy

I’m running git version: 7659

Our nac is using mab (mac-auth-bypass) so no supplicant is needed to pxe boot.

george1421

@networkguy The other thing that happens during a FOS (FOG client OS) boot is that the network interface “winks” several times as the PXE ROM transitions to iPXE and then from iPXE to the FOS kernel. This plays havoc with NAC systems.

george1421

@networkguy said in ipxe dhcp timeout:

Our nac is using mab (mac-auth-bypass) so no supplicant is needed to pxe boot.

Fair enough.

If you place a dumb (unmanaged) switch between your building switch and the booting device does it mask the issue?

george1421

@networkguy said in ipxe dhcp timeout:

Dell OptiPlex 3010

Just for reference, the OptiPlex uses a Realtek NIC. Is it safe to assume your building switch is an advanced switch like a Catalyst? (This seems to be a common thread of late.)

networkguy

@george1421
I’m not going to be able to get back to test this today. For now I updated the DHCP settings to use pxelinux.0. I will test with an unmanaged switch tomorrow and report my findings.

The device is currently plugged into a Cisco Catalyst (WS-C3560-48PS).

Thanks!

george1421

@networkguy pxelinux.0? Oh wait, I guess I missed asking which version of FOG you are running. pxelinux.0 is not used with 1.1.0 or newer. If you are running one of those versions and using pxelinux.0, you have issues other than DHCP.

networkguy

@george1421
Sorry George. I should have mentioned that we have two FOG servers up and running: one at .29 that has been in use for many years and one at git version 7659. I updated the DHCP settings in our labs scope this morning to use the new server/undionly.kpxe and this issue cropped up. Since I can’t troubleshoot further today, I put the settings back to our old server’s IP/pxelinux.0.

george1421

@networkguy Just a comment: if you use DHCP reservations, you can define DHCP options on a per-client basis. So while you are testing with this single client, you can point it to the new FOG server and boot file without breaking your current deployment environment.

networkguy

@george1421
Great suggestion George, thank you. I will do that in the morning. We will also be attempting to improve upon the way the switches are connected. The computers in question are 6 switches down a stack which are daisy chained together…

Sebastian Roth

@networkguy Do you see “Configuring (net0 aa:bb:cc:dd:ee:ff) ... ok” (especially the “ok”) before the timeout?

Can you please install the tcpdump package on your FOG server and run sudo tcpdump -w timeout.pcap udp, then boot one of the clients until you see the timeout and stop tcpdump (ctrl+c)? Then upload the timeout.pcap file to the forum.

networkguy

@Sebastian-Roth
I have a very blurry screenshot that shows what I am seeing. I apologize for the quality. I also removed MAC/IP information from it.

I do not see the ok after configuring (net0 …)

Pressing ‘s’ to get into the shell followed by dhcp and then chain http://myfogserver/fog/service/ipxe/boot.php does allow me to boot.

Regarding running tcpdump on the FOG server, is that with the assumption that it is our DHCP server? If so then in my case I won’t be able to take that approach as our DHCP runs on our domain controller. As much as I really appreciate this assistance, I’m also slightly hesitant to upload a pcap from our domain controller.

http://pasteboard.co/191dt1Ib.png

george1421

@networkguy If your FOG server, target computer and DHCP server are in the same broadcast domain (subnet), then it’s OK to capture on the FOG server, since the DHCP traffic we care about is sent via broadcast messages.

george1421

@Sebastian-Roth Is there a way in the iPXE kernel script to either try X times then die or set a startup delay to give the NAC system a chance to reregister the device between each network wink? I know his troubles because I’ve worked at a company that used NAC. It was a bit of a pita for network booting.
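
For illustration, the kind of "try X times, with a delay" logic asked about here can be expressed in an iPXE script roughly as follows. This is only a sketch, not FOG's actual embedded script: the retry count, the 3-second pause and the label names are assumptions, and the chain URL is the placeholder used elsewhere in this thread. It would also have to be embedded into a rebuilt iPXE binary to run automatically at boot.

```
#!ipxe
# Hypothetical retry loop: attempt DHCP up to 5 times, pausing between
# attempts so the NAC has time to place the port in its VLAN.
set attempts:int32 0

:retry_dhcp
dhcp && goto boot ||
inc attempts
iseq ${attempts} 5 && goto failed ||
echo DHCP attempt ${attempts} failed, waiting 3 seconds before retrying...
sleep 3
goto retry_dhcp

:boot
# Placeholder URL; substitute the real FOG server address
chain http://myfogserver/fog/service/ipxe/boot.php || goto failed

:failed
echo Network boot failed, dropping to the iPXE shell
shell
```

The loop only automates the manual workaround of running dhcp and chain again from the shell; it does not stop the link from dropping between boot stages.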

networkguy

@george1421
The FOG server and DHCP server are on the same subnet; the client is on another. We have the DHCP server configured on our router using ip helper-address.

george1421

@networkguy Yeah, that’s not going to work (the standard way to get this info). If your FOG server were on the target computer’s side, you would capture the client’s broadcast messages, but not the DHCP server’s replies. Once the DHCP request hits the DHCP helper, it is converted from a broadcast message to a unicast message.

Sebastian Roth

@networkguy I know why I keep asking people to post a picture of what they see. I don’t want to sound arrogant, but we usually see more than most users (especially as there are more eyes in the forums!)… The picture you posted shows a different error than the one you initially described. A timeout on default.ipxe is totally different from a timeout on the preceding DHCP request.

@george1421 said:

Is there a way in the iPXE kernel script to either try X times then die or set a startup delay to give the NAC system a chance to reregister the device between each network wink?

This reminds me that the iPXE developers added some kind of spanning tree detection (and wait), probably about two years ago. So I am wondering if this should be addressed within the iPXE source as well. A quick web search for “ipxe 802.1x” revealed this post. While I haven’t tested it, to me this sounds like iPXE should in fact cope with basic EAPOL stuff. I will check the code when I have a bit more time.

On page 5 of this presentation it says: “PXE Boot -> Open access”. From this document it seems to me that you need to configure your PXE booting ports as “Open access”. Sorry if you’ve already done this and it’s still not working. While I have done a fair amount of networking stuff, I haven’t had a chance to look into 802.1x much yet. So this is just me flying “on sight” (meaning reading the manuals).

I’m also slightly hesitant to upload a pcap from our domain controller.

Perfectly fine. I do understand this. Less information simply means less professional help. Your choice.

networkguy

@Sebastian-Roth
I picked up on the difference after posting the picture and changed my description slightly. Thank you for pointing that out.

I appreciate you spending some time looking into this. One change I made, which seems to work at least with this one computer, is changing the authentication order for the switch port. We aren’t really doing dot1x at the moment, so it doesn’t make sense to have the order as it was:

Previous port config (I switched both to mab dot1x):
authentication order dot1x mab
authentication priority dot1x mab
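
After the change, the same two lines read roughly as follows. The interface line is a placeholder added for context, not copied from the actual switch:

```
! Hypothetical interface name; substitute the actual access port
interface FastEthernet0/1
 ! Attempt MAB before 802.1X, and give MAB the higher priority
 authentication order mab dot1x
 authentication priority mab dot1x
```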

All is well at the moment. I will be changing the rest of the port configs and then changing our DHCP scopes again to see if any other problematic devices are reported.
