Matthew73

A new (faster) model of machine made an existing timing issue worse in my environment such that the fog ipxe.efi kernel would no longer boot.

The symptom: Machine tries to network boot, succeeds in loading ipxe from the server. Ipxe tries to configure the network and shows progression dots (“…”) but fails to get an address, resets the nic port and tries again, also failing. On older machines the second attempt would usually succeed.

The triggers (as far as I can tell): In my network the very first packet sent by ipxe (when the “dhcp” command is issued to auto configure the network) is a dhcpdiscover packet. That packet gets assigned to our “guest” vlan because our switch hasn’t yet learned which vlan the packet should be in. An answer is sent from the dhcp server in the guest network and seen by the client. Ipxe then tries to dhcprequest the offered address, but by now the switches have moved the packets into the correct vlan, and the dhcp server in that network refuses the ip address request and answers with a dhcpnak. Ipxe doesn’t process the dhcpnak and eventually times out. For its second try it shuts off the nic (observed behavior, unexplained; I can see the link light on the port go out). This loss of link triggers our switches to throw away the vlan info for the port, so the second loop fails in exactly the same way.

There’s a timing issue (race condition) present, as slower machines or slower network ports (some are 100, some 1000) may work. I believe that’s because in some cases the vlan security info gets processed faster.

I found a proposed patch:
https://lists.ipxe.org/pipermail/ipxe-devel/2017-October/005873.html
which would add the ability to ipxe to process dhcpnak packets by starting over with a new cycle of dhcpdiscover, etc. Using this guide:
https://forums.fogproject.org/topic/12121/compiling-ipxe-boot-kernels
I patched and recompiled ipxe. This seems to have worked.

I added the marked lines to: ./ipxe/src/net/udp/dhcp.c

— file dhcp.c changes —
    /* (next line number was/is 557) */
    /* Filter out unacceptable responses */
    if ( peer->sin_port != htons ( BOOTPS_PORT ) )
            return;

->  /* ADDED 1-2021 per online suggested commit */
->  /* Handle DHCPNAK */
->  if ( msgtype /* BOOTP */ && ( msgtype == DHCPNAK ) ) {
->          /* Go back to discover */
->          dhcp_set_state ( dhcp, &dhcp_state_discover );
->          return;
->  }

    if ( msgtype /* BOOTP */ && ( msgtype != DHCPACK ) )
            return;
    if ( server_id.s_addr != dhcp->server.s_addr )
            return;
    if ( ip.s_addr != dhcp->offer.s_addr )
            return;

— end changes —
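
To make the effect of the change easier to see, here is a toy model of the client transitions described above, written as a small shell function. It is only a sketch and not iPXE's real code: the function name `next_state`, the state labels (`discover`, `request`, `bound`) and the `yes`/`no` switch are all made up for illustration.

```shell
#!/bin/sh
# next_state STATE MSGTYPE HANDLE_NAK
# Toy model only (hypothetical names, not iPXE's implementation).
# Prints the state the client ends up in after receiving MSGTYPE.
next_state() {
    state=$1
    msg=$2
    handle_nak=$3
    case "$state:$msg" in
        discover:DHCPOFFER) echo request ;;
        request:DHCPACK)    echo bound ;;
        request:DHCPNAK)
            if [ "$handle_nak" = yes ]; then
                echo discover   # patched: start a new discover cycle
            else
                echo request    # unpatched: NAK ignored, client waits for timeout
            fi ;;
        *) echo "$state" ;;     # any other message is ignored
    esac
}
```

With `next_state request DHCPNAK no` the client stays stuck in `request`, which matches the timeout I saw; with `yes` it drops back to `discover`, which is what the added lines do.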

I also thought about getting ipxe to send some kind of packet out 1-2 seconds before the dhcp discover process started to give the switches a second to recognize the device properly but couldn’t figure out an easy way to do that. There’s a ping command but as far as I can tell it doesn’t work before an IP is assigned to the interface, which the ifopen/dhcp command handles.

Newer vlan capable switches will apparently sometimes just drop the first packet, but the switches at my location do not do this.

I realize this is primarily an ipxe issue and I will comment appropriately in those forums as well but I wanted to document the issue here in case others are also seeing odd behavior in a vlan switch environment.

Matthew73

I have a working fog in my environment via my mods. The biggest part of the problem is clearly our switched network behaving badly, so I don’t think there’s anything in fog that should be changed to cover that brokenness. I’ll be happier of course if our network folks can figure out why it takes so long.

I do wish udhcpc had an option to “try one more time” if it failed to get a lease, but it appears that it’s either one pass through or repeated passes until you succeed. I don’t really feel like it’s a good idea to default fog to “keep trying dhcp forever” mode. And even if fog could be set to attempt two or three dhcp leases it’s still a bad solution - it would make things “work” in my case but at the cost of several extra minutes on the boot time, and really the problem is that my network is broken. That piece is a bit out of my ability to fix, while hacking fog to work around the issue is possible - apparently. Learned a bunch figuring it out.

If the code is useful then by all means use it, that’s why I posted it.

Matthew73

One other comment to anyone else who has similar weird issues they have to work around in their local network - don’t forget the FOG install replaces both /tftpboot and /var/www/fog/service/ipxe during updates. Keep copies of locally modded stuff somewhere else. I ran two “installs” in a row a few days back and wiped out some things by accident. /tftpboot gets moved to /tftpboot.prev but I don’t think anything gets kept from ipxe.

Matthew73

Workman: It works when the client is plugged into a desktop switch which is then plugged into the building managed switch. But not when plugged directly into the building switch port. And yeah, it turns out it is driven by our particular network and possibly broken/quirky network.

I’m pretty convinced I understand my problem (and have a work around). In essence our switches are taking too long to figure out which vlan packets from a “new” machine belong in. When the network link is taken down (always at the beginning of the fog kernel boot) they forget the mapping between MAC addresses and vlans. It seems to take 4-6 seconds on our system to relearn. This is our brokenness. This delay also means that 1 or 2 packets get sent that are initially assigned the wrong vlan (“unregistered”). For FOG that means the first DHCP Discover packet udhcpc sends usually gets answered by the wrong dhcp server. But by the time udhcpc is trying to accept the IP offered, everything is up and running and the correct dhcp server denies the address as invalid.

Udhcpc however doesn’t believe NAK packets that come from a different server than the one that offered the initial address. This is probably “correct” behavior. So my clients sit through the three 20-second (-T 20) delays of requesting an address and then udhcpc fails because no lease was obtained.

There seem to be a couple of ways to fix the problem. One would be to let udhcpc retry the Discover phase if it fails (the -A option allows this), but I’d still see a 3*20 second delay before it would work. In addition, that means boxes that don’t get reasonable DHCP answers would continue sending DHCP requests forever. As far as I can tell there’s no way to tell udhcpc that it’s ok to retry the discover phase 2 or 3 times but not forever.
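
A bounded retry could in principle be approximated outside udhcpc with a small wrapper loop. The following is a minimal sketch under assumptions: `retry_limited` is a hypothetical helper name, not something FOG or busybox ships, and the interface name in the usage comment is a placeholder.

```shell
#!/bin/sh
# retry_limited MAX PAUSE CMD [ARGS...]
# Hypothetical helper: run CMD up to MAX times, sleeping PAUSE seconds
# between failed attempts. Returns 0 as soon as CMD succeeds and 1 if
# every attempt failed.
retry_limited() {
    max=$1
    pause=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        # Only pause when another attempt is still to come
        [ "$attempt" -lt "$max" ] && sleep "$pause"
        attempt=$((attempt + 1))
    done
    return 1
}

# Example use (eth0 is a placeholder; -n makes udhcpc exit when no lease
# is obtained, so the wrapper decides whether to try again):
#   retry_limited 3 2 udhcpc -i eth0 -n -t 3 -T 20
```

Each failed pass would still cost the full -t/-T timeout, so this does not remove the delay; it only caps how many times the discover phase is repeated.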

So instead I just forced a couple of manual udhcp runs and threw away the results. Specifically I added:

# Wait for switches to process first packet and get vlan info
echo Preload the network with some packets to trigger switch vlan assignment
for packet in {1..3}; do
        udhcpc -t 1 -T 1 -n
        sleep 2
done

to the S40network script just after the section that brings up the link on all interfaces. This sends a DHCP Discover, tries to get a lease, and quits after 1 try whether or not it works. On my network the first try always fails and the second try usually works. Then the script continues and calls udhcp again per the normal options later which should work since the network switches have had more time to stabilize. Clearly our switches are taking too long.

Oddly I also tried sending an arp packet and sleeping 8 seconds, which I thought was a cleaner solution, but that didn’t always work.

Anyhow I have a work around. Wiki directions on how to uncompress init.xz, mount it, and edit it were great.

I also switched out the “sleep 10” for link initialize with this code:

# Provide time for interfaces to detect their state
for iface in $ifaces; do
        # Check if each interface is up and if not wait up to 10 seconds
        echo -n "Waiting $iface linkstate:"
        for delay in $(seq 10); do
                linkstate=$(/bin/cat "/sys/class/net/$iface/carrier")
                if [ "x$linkstate" = "x1" ]; then
                        echo "  $iface up"
                        break
                fi
                echo -n .
                sleep 1
        done
done

which seems just as functional but faster if the link comes up sooner.

It’s still true that the S40network script as written in FOG gets called twice (once for “start” mode which then calls itself with “stop” mode) and so the code to setup /etc/network/interfaces gets run twice - which seems unnecessary. Is there any reason not to move the entire block of code to inside the case “start” statement?

Mod edited to use code boxes.

Matthew73

Yeah, I mistyped that. I’m on 3731. Tests over the weekend on more machines show the problem isn’t fixed by my change. Some of the machines come up, some don’t. It’s not consistent which ones work and which don’t. Which in a way is reassuring, since it didn’t make any sense that the change would affect anything anyway.

I’ll poke at it more later today.

Matthew73

I recently upgraded from 1.2.0 release to trunk 1371 (well 1370 and now 1371). Post upgrade I have a very odd problem with networking. My environment has “enterprise” switches and we run VLANs, authentication of packets by MAC address, and stuff like that. Symptoms are that when the bzImage/init.xz stuff loads dhcp fails. Machines are Dells: Optiplex 760, 790, 9020; all behave the same. The Intel netboot code gets an address and loads the IPXE code. IPXE gets an address and loads the FOG boot menu. Easiest way for me to test has been to select the “Sys Info” option. Then the bzImage/init.xz code loads.

If my machines are directly connected to a building switch (vlan managed, secured) then the kernel fails to get an address. It tries 3 times to DHCP Request a 172.16.x.x address, complains it can’t and gives up.

If the machine is connected to a “dumb” switch which is connected to the building switch it works fine. Throwing a netgear switch or hub between the client and the building (and thus the server) “fixes” the problem.

This finally got me thinking about timing issues and so I went poking around the init.xz image and looked at the scripts. It seems like I can bypass the problem by changing the /etc/init.d/S40network where the “sleep 10” line is to “sleep 3”. I can’t think of any framework where that makes sense to me. As a work around it seems to work with very limited testing so far.

In the process here are some observations: If I use “ip link set eth0 down” and try to run “S40network start” it fails. Every time “S40network start” is called I get an error from “ifdown” that the interfaces aren’t configured. As far as I can tell “S40network start” uses “ip” to enable the link and then calls itself again with the “stop” argument, which repeats the same set of code and then tries to call “ifdown -a”. Since at this point ifup has never been called the file /run/ifstate doesn’t have any mappings (“eth0=eth0” and “lo=lo”), which is why it fails. Ifup seems to generate these, ifdown removes them.

If I run ifup first manually, then run ifdown I don’t get any error messages - but “ifconfig -a” seems to indicate the interfaces are left up anyway.

I’m not sure why the start part of the script is calling itself to stop the interface first, but using ifdown to do so before ifup has ever been called is definitely broken. That doesn’t explain why I have problems with “ifup -a” failing in my environment. Nor do I understand why shortening the sleep should help the problem. (I initially increased the timeout thinking maybe my switches were taking too long to setup…)
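
One way the ifdown error could be avoided is to check the state file before calling ifdown. This is only a sketch based on what I observed: it assumes /run/ifstate holds one “name=name” line per configured interface, and `iface_is_configured` and `safe_ifdown` are hypothetical helper names that are not part of the FOG script.

```shell
#!/bin/sh
# iface_is_configured IFACE [STATEFILE]
# Hypothetical helper: succeeds if STATEFILE contains an "IFACE=..." line,
# i.e. ifup has recorded the interface. STATEFILE defaults to /run/ifstate.
iface_is_configured() {
    iface=$1
    statefile=${2:-/run/ifstate}
    [ -r "$statefile" ] && grep -q "^${iface}=" "$statefile"
}

# safe_ifdown IFACE [STATEFILE]
# Hypothetical helper: call ifdown only for interfaces that ifup has
# recorded, and silently do nothing otherwise.
safe_ifdown() {
    if iface_is_configured "$1" "${2:-/run/ifstate}"; then
        ifdown "$1"
    fi
}
```

Called as `safe_ifdown eth0` before ifup has ever run, this would skip ifdown instead of producing the “not configured” error.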

It seems like with the “sleep 10” in place the script sets up the interfaces file, calls itself to turn off the interface which sets it up again, and then tries to call ifup -a which produces a “RTNETLINK answers: File exists” error. After which udhcpc starts but just keeps asking for a 172.16.x.x number. (This happens to be the range we use for unknown devices but I think it’s a fall back number for udhcpc as well. I can’t figure out how to snoop the traffic since putting a hub inline “fixes” the problem.)

At a loss. Maybe this will mean something to someone else.
