A new (faster) model of machine made an existing timing issue in my environment worse, to the point that the FOG ipxe.efi kernel would no longer boot.
The symptom: the machine tries to network boot and succeeds in loading iPXE from the server. iPXE tries to configure the network and shows progress dots (“…”) but fails to get an address, resets the NIC port, and tries again, failing again. On older machines the second attempt would usually succeed.
The triggers (as far as I can tell): In my network the very first packet sent by iPXE (when the “dhcp” command is issued to auto-configure the network) is a DHCPDISCOVER packet. That packet gets assigned to our “guest” VLAN because our switch hasn’t yet learned which VLAN the packet should be in. The DHCP server in the guest network sends an offer, which the client sees. iPXE then sends a DHCPREQUEST for that address, but by now the switches have moved the port’s packets into the correct VLAN, and the DHCP server in that network refuses the request and answers with a DHCPNAK. iPXE doesn’t process the DHCPNAK and eventually times out. For its second try it shuts off the NIC (observed behavior, unexplained; I can see the link light on the port go out). This loss of link triggers our switches to throw away the VLAN info for the port, so the second loop fails in exactly the same way.
There’s a timing issue (race condition) here: slower machines or slower network ports (some are 100 Mbit, some 1000 Mbit) may work. I believe that’s because in those cases the VLAN security info gets processed in time.
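To make the race easier to picture, here is a toy model in C of the sequence described above. It is only a sketch of my understanding, not a description of how any particular switch works: the names (`vlan_at`, `dhcp_attempt`, `learned_ms`) and all timings are made up for illustration, and it assumes the switch flips the port from the guest VLAN to the correct VLAN at a single point in time.

```c
/* Toy model only: every name and number here is hypothetical. */

/* Which VLAN the switch places a packet into. */
enum vlan { VLAN_GUEST, VLAN_CORRECT };

/* Result of one DISCOVER/REQUEST exchange in this model. */
enum outcome { LEASE_OK, REQUEST_NAKED };

/* Until the switch has learned the port's VLAN, traffic lands in guest. */
enum vlan vlan_at(int sent_ms, int learned_ms)
{
    return (sent_ms < learned_ms) ? VLAN_GUEST : VLAN_CORRECT;
}

/*
 * The offer comes from the DHCP server in the VLAN that saw the DISCOVER.
 * The REQUEST is answered by the server in the VLAN it arrives in. If the
 * two VLANs differ, that server does not recognize the offered address
 * and replies with a NAK.
 */
enum outcome dhcp_attempt(int discover_ms, int request_ms, int learned_ms)
{
    enum vlan offer_vlan = vlan_at(discover_ms, learned_ms);
    enum vlan request_vlan = vlan_at(request_ms, learned_ms);

    return (offer_vlan == request_vlan) ? LEASE_OK : REQUEST_NAKED;
}
```

With the switch learning the VLAN at 200 ms, a fast client that sends its discover at 10 ms and its request at 500 ms gets `REQUEST_NAKED`, while a slower client that sends its discover at 300 ms gets `LEASE_OK`. The model ignores the case where both packets go out before the switch has learned the VLAN, which I have not observed.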
I found a proposed patch:
https://lists.ipxe.org/pipermail/ipxe-devel/2017-October/005873.html
which would let iPXE handle DHCPNAK packets by starting over with a new DHCPDISCOVER cycle. Using this guide:
https://forums.fogproject.org/topic/12121/compiling-ipxe-boot-kernels
I patched and recompiled ipxe. This seems to have worked.
I added the marked lines to: ./ipxe/src/net/udp/dhcp.c
— file dhcp.c changes —
/* (next line number was/is 557) */
/* Filter out unacceptable responses */
if ( peer->sin_port != htons ( BOOTPS_PORT ) )
return;
-> /* ADDED 1-2021 per online suggested commit */
-> /* Handle DHCPNAK */
-> if ( msgtype /* BOOTP */ && ( msgtype == DHCPNAK ) ) {
-> /* Go back to discover */
-> dhcp_set_state ( dhcp, &dhcp_state_discover );
-> return;
-> }
if ( msgtype /* BOOTP */ && ( msgtype != DHCPACK ) )
return;
if ( server_id.s_addr != dhcp->server.s_addr )
return;
if ( ip.s_addr != dhcp->offer.s_addr )
return;
— end changes —
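For anyone who wants to see the order of the checks in isolation, here is a minimal standalone sketch in C of the decision the patched filter makes. It is not iPXE code: `struct dhcp_reply`, `classify_reply` and the `REPLY_*` values are hypothetical stand-ins I made up, and the port is compared in host byte order for simplicity. The message type values (DHCPACK = 5, DHCPNAK = 6) and the server port (67) are the standard ones from the DHCP RFCs.

```c
#include <stdint.h>

/* Standard DHCP message types (option 53) and BOOTP server port. */
enum { DHCPACK = 5, DHCPNAK = 6 };
enum { BOOTPS_PORT = 67 };

/* Hypothetical result of inspecting one reply while in the request state. */
enum reply_action {
    REPLY_IGNORE,             /* drop the packet, keep waiting */
    REPLY_RESTART_DISCOVERY,  /* NAK seen: go back to DHCPDISCOVER */
    REPLY_ACCEPT              /* valid ACK for our offer */
};

/* Hypothetical, simplified view of a received reply. */
struct dhcp_reply {
    uint16_t src_port;   /* UDP source port, host byte order */
    uint8_t  msgtype;    /* 0 means plain BOOTP (no option 53) */
    uint32_t server_id;  /* server identifier from the reply */
    uint32_t yiaddr;     /* address the reply assigns */
};

enum reply_action classify_reply(const struct dhcp_reply *r,
                                 uint32_t expected_server,
                                 uint32_t offered_ip)
{
    if (r->src_port != BOOTPS_PORT)
        return REPLY_IGNORE;

    /* The NAK check comes before the server check, so a NAK from a
     * different server than the one that made the offer still counts. */
    if (r->msgtype && r->msgtype == DHCPNAK)
        return REPLY_RESTART_DISCOVERY;

    if (r->msgtype && r->msgtype != DHCPACK)
        return REPLY_IGNORE;
    if (r->server_id != expected_server)
        return REPLY_IGNORE;
    if (r->yiaddr != offered_ip)
        return REPLY_IGNORE;

    return REPLY_ACCEPT;
}
```

The ordering matters in my situation: the DHCPNAK comes from the DHCP server in the correct VLAN, not from the guest server that made the offer, so a NAK check placed after the server comparison would never fire.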
I also thought about getting iPXE to send some kind of packet 1-2 seconds before the DHCP discover process starts, to give the switches a moment to recognize the device properly, but couldn’t figure out an easy way to do that. There’s a ping command, but as far as I can tell it doesn’t work before an IP address is assigned to the interface, which is what the ifopen/dhcp commands handle.
Newer VLAN-capable switches will apparently sometimes just drop the first packet, but the switches at my location do not.
I realize this is primarily an iPXE issue, and I will comment in those forums as well, but I wanted to document it here in case others are also seeing odd behavior in a VLAN switch environment.