SOLVED Failed dhcp, busybox network startup

  • I recently upgraded from the 1.2.0 release to trunk 1371 (well, 1370 and now 1371). Post upgrade I have a very odd problem with networking. My environment has “enterprise” switches and we run VLANs, authentication of packets by MAC address, and things like that. The symptom is that when the bzImage/init.xz stuff loads, DHCP fails. The machines are Dells (Optiplex 760, 790, 9020); all behave the same. The Intel netboot code gets an address and loads the iPXE code. iPXE gets an address and loads the FOG boot menu. The easiest way for me to test has been to select the “Sys Info” option; then the bzImage/init.xz code loads.

    If my machines are directly connected to a building switch (vlan managed, secured) then the kernel fails to get an address. It tries three times to DHCP Request a 172.16.x.x address, complains it can’t, and gives up.

    If the machine is connected to a “dumb” switch which is connected to the building switch, it works fine. Throwing a Netgear switch or hub between the client and the building switch (and thus the server) “fixes” the problem.

    This finally got me thinking about timing issues, so I went poking around the init.xz image and looked at the scripts. It seems I can bypass the problem by changing the “sleep 10” line in /etc/init.d/S40network to “sleep 3”. I can’t think of any framework where that makes sense to me, but as a workaround it seems to work, with very limited testing so far.

    In the process, here are some observations: if I use “ip link set eth0 down” and then try to run “S40network start”, it fails. Every time “S40network start” is called I get an error from “ifdown” that the interfaces aren’t configured. As far as I can tell, “S40network start” uses “ip” to enable the link and then calls itself again with the “stop” argument, which repeats the same setup code and then tries to call “ifdown -a”. Since at this point ifup has never been called, the file /run/ifstate doesn’t have any mappings (“eth0=eth0” and “lo=lo”), which is why it fails. Ifup seems to generate these; ifdown removes them.
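    The ifstate behavior above can be illustrated with a few lines of shell. This is not the real busybox ifupdown source, just a sketch of the check it effectively performs; the state-file path and the “logical=physical” mapping format are as observed in /run/ifstate:

```shell
#!/bin/sh
# Sketch of why "ifdown -a" errors before ifup has ever run: ifdown only
# acts on interfaces recorded in its state file, which ifup populates with
# "logical=physical" lines like "eth0=eth0". (Illustration only, not the
# real ifupdown code; IFSTATE path as seen in the post.)
IFSTATE=${IFSTATE:-/run/ifstate}

is_configured() {
        # Succeeds only if ifup has recorded a mapping for this interface
        grep -q "^$1=" "$IFSTATE" 2>/dev/null
}
```

    With an empty or missing /run/ifstate, is_configured fails for every interface, matching the “interfaces aren’t configured” error the script prints when “stop” runs before any “ifup”.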

    If I run ifup manually first, then run ifdown, I don’t get any error messages - but “ifconfig -a” seems to indicate the interfaces are left up anyway.

    I’m not sure why the start part of the script calls itself to stop the interfaces first, but using ifdown to do so before ifup has ever been called is definitely broken. That doesn’t explain why “ifup -a” fails in my environment, though, nor do I understand why shortening the sleep should help the problem. (I initially increased the timeout, thinking maybe my switches were taking too long to set up…)

    It seems that with the “sleep 10” in place the script sets up the interfaces file, calls itself to turn off the interfaces (which sets the file up again), and then tries to call “ifup -a”, which produces a “RTNETLINK answers: File exists” error. After that udhcpc starts but just keeps asking for a 172.16.x.x number. (That happens to be the range we use for unknown devices, but I think it’s a fallback number for udhcpc as well. I can’t figure out how to snoop the traffic, since putting a hub inline “fixes” the problem.)

    At a loss. Maybe this will mean something to someone else.

  • I have a working FOG in my environment via my mods. The biggest part of the problem is clearly our switched network behaving badly, so I don’t think there’s anything in FOG that should be changed to cover that brokenness. I’ll be happier, of course, if our network folks can figure out why it takes so long.

    I do wish udhcpc had an option to “try one more time” if it failed to get a lease, but it appears that it’s either one pass through or repeated passes until you succeed. I don’t really feel it’s a good idea to default FOG to “keep trying dhcp forever” mode. And even if FOG could be set to attempt DHCP for two or three leases, it’s still a bad solution - it would make things “work” in my case, but at the cost of several extra minutes of boot time, and really the problem is that my network is broken. That piece is just a bit out of my ability to fix, while hacking FOG to work around the issue is possible - apparently. Learned a bunch figuring it out. 🙂

    If the code is useful then by all means use it, that’s why I posted it.

  • @Matthew73 I’ve added your suggestions. While I’m not manually starting the scripts as you are, I am using your methods now to try to grab the link up states and the “speedier” detection of the link up.

    Hopefully you don’t mind, and I (as well as, I guess, the rest of this community) appreciate the findings and the reporting of this.

  • @Matthew73 So you know, Tom says he has added your code to the code-base. So the changes you posted will stick in future fog trunk versions.

    I’m having a little bit of a tough time deciding if your thread is solved or not… Have you resolved your issue?

  • One other comment to anyone else who has similar weird issues they have to work around in their local network - don’t forget the FOG install replaces both /tftpboot and /var/www/fog/service/ipxe during updates. Keep copies of locally modded stuff somewhere else. I ran two “installs” in a row a few days back and wiped out some things by accident. /tftpboot gets moved to /tftpboot.prev, but I don’t think anything gets kept from ipxe.
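    A habit that would have avoided the accident: snapshot both directories before re-running the installer. A minimal sketch (the destination path and helper name are my choices, not anything FOG provides):

```shell
#!/bin/sh
# Copy locally-modified FOG directories somewhere the installer won't
# touch, before running an upgrade. Paths are parameters so the same
# helper works for /tftpboot, /var/www/fog/service/ipxe, or anything else.
backup_dirs() {
        dest=$1; shift
        mkdir -p "$dest" || return 1
        for d in "$@"; do
                cp -a "$d" "$dest"/ || return 1
        done
}

# Typical call before an upgrade:
# backup_dirs /root/fog-backup-$(date +%Y%m%d) /tftpboot /var/www/fog/service/ipxe
```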

  • Workman: It works when the client is plugged into a desktop switch which is then plugged into the building managed switch, but not when plugged directly into the building switch port. And yeah, it turns out it is driven by our particular, possibly broken/quirky, network.

    I’m pretty convinced of (and have a workaround for) my problem. In essence, our switches are taking too long to figure out which vlan packets from a “new” machine belong in. When the network link is taken down (always at the beginning of the FOG kernel boot) they forget the mapping between MAC addresses and vlans. It seems to take 4-6 seconds on our system to relearn. This is our brokenness. This delay also means that one or two packets get sent that are initially assigned the wrong vlan (“unregistered”). For FOG that means the first DHCP Discover packet udhcpc sends usually gets answered by the wrong DHCP server. But by the time udhcpc is trying to accept the IP offered, everything is up and running, and the correct DHCP server denies the address as invalid.

    Udhcpc, however, doesn’t believe NAK packets that come from a different server than the one that offered the initial address. This is probably “correct” behavior. So my clients sit through the three 20-second (-T 20) delays of requesting an address, and then udhcpc fails because no lease was obtained.

    There seem to be a couple of ways to fix the problem. One would be to let udhcpc retry the Discover phase if it fails (the -A option allows this), but I’d still see a 3*20-second delay before it would work. In addition, that means boxes that don’t get reasonable DHCP answers would continue DHCPing forever. As far as I can tell, there’s no way to tell udhcpc that it’s OK to retry the discover phase two or three times but not forever.
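    Since udhcpc only offers “one pass” (-n) or “forever”, a bounded retry has to live in the calling script. A sketch of that loop, with the command as a parameter so the logic can be exercised without a network (the real payload would be something like udhcpc -t 3 -T 20 -n):

```shell
#!/bin/sh
# Run a command up to N times, stopping at the first success. With
# "udhcpc -t 3 -T 20 -n" as the command, this gives "retry discover a few
# times but not forever", which udhcpc can't express on its own.
retry_bounded() {
        attempts=$1; shift
        i=1
        while [ "$i" -le "$attempts" ]; do
                "$@" && return 0    # success: stop retrying
                i=$((i + 1))
        done
        return 1                    # all attempts failed
}

# e.g. retry_bounded 3 udhcpc -t 3 -T 20 -n
```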

    So instead I just forced a couple of manual udhcpc runs and threw away the results. Specifically I added:

    # Wait for switches to process first packets and get vlan info
    echo Preload the network with some packets to trigger switch vlan assignment
    # Note: busybox ash has no {1..3} brace expansion, so list the values
    for packet in 1 2 3; do
            udhcpc -t 1 -T 1 -n
            sleep 2
    done
    to the S40network script just after the section that brings up the link on all interfaces. This sends a DHCP Discover, tries to get a lease, and quits after one try whether or not it works. On my network the first try always fails and the second usually works. Then the script continues and later calls udhcpc again with the normal options, which should work since the network switches have had more time to stabilize. Clearly our switches are taking too long.

    Oddly, I also tried sending an arp packet and sleeping 8 seconds, which I thought was a cleaner solution, but that didn’t always work.

    Anyhow, I have a workaround. The wiki directions on how to uncompress init.xz, mount it, and edit it were great.
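    For anyone following along, that procedure boils down to roughly this. The exact location of init.xz varies by install and is an assumption here; the steps need root, and the uncompressed image is a loop-mountable filesystem:

```shell
#!/bin/sh
# Rough sketch of unpacking, editing, and repacking FOG's init.xz.
# The directory argument lets it point at wherever init.xz lives
# (commonly under /var/www/fog/service/ipxe, but that's an assumption).
edit_init_image() {
        dir=$1
        cd "$dir" || return 1
        cp init.xz init.xz.bak          # keep a pristine copy
        xz -d -k init.xz                # produces an uncompressed "init"
        mkdir -p /mnt/init
        mount -o loop init /mnt/init    # the image is a filesystem
        # ...edit /mnt/init/etc/init.d/S40network here...
        umount /mnt/init
        xz -z -f init                   # recompress back to init.xz
}
```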

    I also switched out the “sleep 10” for link initialization with this code:

    # Provide time for interfaces to detect their state
    for iface in $ifaces; do
            # Check if each interface is up and if not wait up to 10 seconds
            echo -n "Waiting $iface linkstate:"
            for delay in `seq 10`; do
                    linkstate=`/bin/cat /sys/class/net/$iface/carrier`
                    if [ "x$linkstate" = "x1" ]; then
                            echo ' ' $iface up
                            break
                    fi
                    echo -n .
                    sleep 1
            done
    done
    which seems just as functional but faster if the link comes up sooner.

    It’s still true that the S40network script as written in FOG gets called twice (once in “start” mode, which then calls itself in “stop” mode), and so the code that sets up /etc/network/interfaces gets run twice - which seems unnecessary. Is there any reason not to move that entire block of code inside the “start” case statement?
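    To make the question concrete, the shape I’m imagining is something like this. The function is a skeleton, not FOG’s actual code; the echoed strings stand in for the real blocks so the skeleton runs standalone:

```shell
#!/bin/sh
# Skeleton of an S40network where the interfaces-file setup lives inside
# "start" only, so "start" no longer re-enters the script via "stop".
# Stubs echo what the real code would do; names are placeholders.
net_script() {
        case "$1" in
                start)
                        echo "write /etc/network/interfaces"      # runs once
                        echo "bring links up, wait for carrier"
                        echo "ifup -a"
                        ;;
                stop)
                        echo "ifdown -a"
                        ;;
                *)
                        echo "usage: net_script {start|stop}" >&2
                        return 1
                        ;;
        esac
}
```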

    Mod edited to use code boxes.

  • @Matthew73 Well not to sound dumb or anything but, I’m just going to point out what you said above…

    You say… clients work just fine when they are plugged into an unmanaged switch, but when they get plugged directly into a managed switch they are inconsistent.

    So it’s clear that the problem has something to do with your network configuration. Additionally, my network is configured with vlans and is secured as well, and I have zero issues. It’s also not my job to take care of the network, so I couldn’t get you a sample configuration file.

    Do you have a spare managed switch that you can play around with? I suggest you do this first.

  • Yeah, I mistyped that. I’m on 3731. Tests over the weekend on more machines show the problem isn’t fixed by my change. Some of the machines come up, some don’t, and it’s not consistent which work or don’t. Which in a way is reassuring, since it never made sense that the change would affect anything anyway.

    I’ll poke at it more later today.

  • As a matter of fact, that revision is even before 1.2.0. SVN version of 1.2.0’s release was 2094.

  • @Matthew73 Are you sure you’ve upgraded to 1371? If so, that’s a very old revision, and it’s highly likely that any issues you’re seeing will be resolved by moving to a much later FOG Trunk revision. I recommend 3709; it’s working just fine for me.