
    Vlan, Ipxe, DHCPNAK

      Matthew73
      last edited by Matthew73

      A new (faster) model of machine made an existing timing issue worse in my environment, to the point that the FOG ipxe.efi boot kernel would no longer boot.

      The symptom: the machine tries to network boot and succeeds in loading iPXE from the server. iPXE tries to configure the network and shows progress dots (“…”) but fails to get an address, resets the NIC port and tries again, failing again. On older machines the second attempt would usually succeed.

      The triggers (as far as I can tell): in my network the very first packet sent by iPXE (when the “dhcp” command is issued to auto-configure the network) is a DHCPDISCOVER. That packet is placed in our “guest” VLAN because the switch hasn't yet learned which VLAN the port belongs in. The DHCP server in the guest network answers and the client sees the offer. iPXE then sends a DHCPREQUEST for that address, but by now the switches have moved the port's traffic into the correct VLAN, and the DHCP server in that network refuses the request and replies with a DHCPNAK. iPXE doesn't process the DHCPNAK and eventually times out. For its second try it shuts off the NIC (observed behavior, unexplained; I can see the link light on the port go out). This loss of link causes our switches to throw away the VLAN info for the port, so the second attempt fails in exactly the same way.

      There is a race condition here: slower machines or slower network ports (some are 100 Mbit, some 1000 Mbit) may work. I believe that's because in some cases the VLAN security info gets processed quickly enough.
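To make the race easier to reason about, here is a minimal C sketch of one DHCP attempt. It is a toy model, not iPXE or switch code. It assumes the switch keeps the port in the guest VLAN until a fixed `learn_ms` after link-up, and that restarting discovery after a DHCPNAK always lands in the correct VLAN. The names `vlan_at`, `dhcp_attempt` and all timing parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum vlan { VLAN_GUEST, VLAN_CORRECT };
enum outcome { GOT_CORRECT_LEASE, GOT_GUEST_LEASE, TIMED_OUT };

/* Assumption: the port sits in the guest VLAN until learn_ms after link-up. */
static enum vlan vlan_at(int t_ms, int learn_ms)
{
    return (t_ms < learn_ms) ? VLAN_GUEST : VLAN_CORRECT;
}

/*
 * Model one DHCP attempt after a link-up at t = 0.
 * discover_ms / request_ms: when the client sends DHCPDISCOVER / DHCPREQUEST.
 * restart_on_nak: true models the patched client, false the unpatched one.
 */
enum outcome dhcp_attempt(int learn_ms, int discover_ms, int request_ms,
                          bool restart_on_nak)
{
    assert(discover_ms <= request_ms);

    enum vlan offer_vlan = vlan_at(discover_ms, learn_ms);
    enum vlan request_vlan = vlan_at(request_ms, learn_ms);

    /* Both packets reach the same DHCP server: the request is ACKed. */
    if (offer_vlan == request_vlan)
        return (offer_vlan == VLAN_CORRECT) ? GOT_CORRECT_LEASE
                                            : GOT_GUEST_LEASE;

    /* Offer came from the guest server, but the request reached the
     * server in the correct VLAN, which answers with DHCPNAK. */
    if (restart_on_nak)
        return GOT_CORRECT_LEASE; /* new DISCOVER is sent after learn_ms */

    /* Unpatched client ignores the NAK and waits until it times out. */
    return TIMED_OUT;
}
```

With `learn_ms = 500`, a fast machine that discovers at 100 ms and requests at 900 ms times out when unpatched, while a slower machine that only starts at 800 ms gets a lease either way. That matches the observation that older hardware tends to work.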

      I found a proposed patch:
      https://lists.ipxe.org/pipermail/ipxe-devel/2017-October/005873.html
      which would let iPXE process DHCPNAK packets by starting over with a new cycle of DHCPDISCOVER, etc. Using this guide:
      https://forums.fogproject.org/topic/12121/compiling-ipxe-boot-kernels
      I patched and recompiled iPXE. This seems to have worked.

      I added the marked lines to: ./ipxe/src/net/udp/dhcp.c

      — file dhcp.c changes —
      /* (next line number was/is 557) */
      /* Filter out unacceptable responses */
      if ( peer->sin_port != htons ( BOOTPS_PORT ) )
              return;

      -> /* ADDED 1-2021 per online suggested commit */
      -> /* Handle DHCPNAK */
      -> if ( msgtype /* BOOTP */ && ( msgtype == DHCPNAK ) ) {
      ->         /* Go back to discover */
      ->         dhcp_set_state ( dhcp, &dhcp_state_discover );
      ->         return;
      -> }

      if ( msgtype /* BOOTP */ && ( msgtype != DHCPACK ) )
              return;
      if ( server_id.s_addr != dhcp->server.s_addr )
              return;
      if ( ip.s_addr != dhcp->offer.s_addr )
              return;
      

      — end changes —
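The effect of the added lines can be checked in isolation with a small stand-alone model of the filter. This is a sketch only: the structs, the `rx_action` enum and the `handle_nak` switch are hypothetical stand-ins for iPXE's internal state, not its real types. The message type values are those defined for DHCP option 53.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOTPS_PORT 67

/* DHCP message types (option 53); 0 here means plain BOOTP, no option. */
enum { MSG_BOOTP = 0, MSG_DHCPOFFER = 2, MSG_DHCPACK = 5, MSG_DHCPNAK = 6 };

/* Hypothetical result of filtering one reply while a request is pending. */
enum rx_action { RX_IGNORE, RX_ACCEPT, RX_RESTART_DISCOVER };

struct reply {
    uint16_t src_port;  /* host byte order, for simplicity */
    uint8_t msgtype;
    uint32_t server_id;
    uint32_t yiaddr;    /* address carried in the reply */
};

struct pending_request {
    uint32_t server;    /* server whose offer was chosen */
    uint32_t offer;     /* address that was offered */
};

/* handle_nak = 0 models the original filter, 1 the patched one. */
enum rx_action filter_request_reply(const struct pending_request *req,
                                    const struct reply *r, int handle_nak)
{
    assert(req != NULL && r != NULL);

    if (r->src_port != BOOTPS_PORT)
        return RX_IGNORE;

    /* Patched behaviour: any DHCPNAK sends the client back to discovery.
     * This check runs before the server ID comparison below. */
    if (handle_nak && r->msgtype == MSG_DHCPNAK)
        return RX_RESTART_DISCOVER;

    if (r->msgtype != MSG_BOOTP && r->msgtype != MSG_DHCPACK)
        return RX_IGNORE;
    if (r->server_id != req->server)
        return RX_IGNORE;
    if (r->yiaddr != req->offer)
        return RX_IGNORE;
    return RX_ACCEPT;
}
```

Because the DHCPNAK check sits ahead of the server ID comparison, a NAK from a server other than the one that made the offer still restarts discovery. That appears to be what this scenario needs, since the NAK comes from the DHCP server in the correct VLAN rather than the guest one.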

      I also thought about getting iPXE to send some kind of packet 1-2 seconds before the DHCP discover process starts, to give the switches a moment to recognize the device properly, but couldn't figure out an easy way to do that. There's a ping command, but as far as I can tell it doesn't work before an IP address is assigned to the interface, which is what the ifopen/dhcp commands handle.

      Newer VLAN-capable switches will apparently sometimes just drop the first packet, but the switches at my location do not.

      I realize this is primarily an ipxe issue and I will comment appropriately in those forums as well but I wanted to document the issue here in case others are also seeing odd behavior in a vlan switch environment.

        george1421 Moderator @Matthew73

        @matthew73 This is a unique condition. I can understand what is going on because we use NAC and VLAN switching on my campus. I can say that I have not seen this issue (anywhere) on my campus.

        I think I understand what needs to happen. Basically, iPXE needs to say something and then wait XX seconds for your NAC system to identify the hardware and switch it to the right VLAN. The network link light winks twice during a normal PXE boot: first when the PXE ROM hands control of the network adapter to iPXE, and again when iPXE hands control of the adapter to FOS Linux. We see a similar issue when network switches use standard spanning tree rather than one of the fast protocols (RSTP, MSTP, port-fast).

        The developers have created a specific set of iPXE boot loaders with an embedded 10-second delay before iPXE tries to request an IP address. This gives STP and power-saving functions on the switch a chance to react before iPXE starts to talk. These files are in the 10secdelay folder. To use them, change the DHCP boot file option from ipxe.efi to 10secdelay/ipxe.efi, which loads the 10-second-delay boot loader. See if that makes things better or not.


        Copyright © 2012-2024 FOG Project