Trouble with DHCP after loading undionly.kpxe (xcp-ng)

magikmw

Hi,

Some details first:

FOG v.1.5.7 (current GitHub master), DHCP not set up during installation, IP 10.0.20.32/24
DHCP served by router at 10.0.20.1/24, options 66/67 set to 10.0.20.32 / undionly.kpxe
both FOG and the client I’m testing with run as VMs on an xcp-ng hypervisor in the same VLAN
router and hypervisor are connected via a single switch with no STP or similar enabled
both VMs have no networking problems once booted, and the client receives an IP via DHCP with no problems

The problem:
PXE seems to pick up an IP from DHCP and loads undionly.kpxe correctly; however, the next step, DHCP from within iPXE, fails.
I’ve seen several threads with similar errors, but nothing I found worked (disabling STP or using fastport doesn’t seem to apply).

Troubleshooting I did already:

I’ve dumped DHCP/TFTP traffic passing through the hypervisor, available here: ~~removed link~~
Seems like the iPXE client doesn’t accept or receive the DHCP offer.
Dropping to the iPXE shell and trying to ping anything results in a connection failure.
From the shell I can set a static IP address, and that allows me to ping but not to boot (the autoboot and dhcp net0 commands just fail).
I’ve tried all .kpxe, .kkpxe and .pxe files available in /tftpboot with identical results.
The legacy bootfile (pxelinux.0.old) does boot into a FOG menu and even allows booting from the hard drive; however, the register option results in a kernel panic (which I think is to be expected).

I’m at a total loss. The only thing I haven’t tried yet is trying to boot a physical machine to see if it’s not a problem with the hypervisor networking. Even then, not sure what would cause this. There are no other network problems in the environment.

Hope anybody can help me. Let me know if you need any more information.

magikmw

I’ve figured it out! Turns out I had a NIC bond misconfigured: the switch had settings for a static LAG, while the host just used an active-active backup. I’ve reconfigured both to use proper LACP, effectively bonding the two ports on both devices, and now the broadcasts work and the FOG menu boots!

Thanks to both @george1421 and @Sebastian-Roth for your help. I don’t think I’d have found the energy to go digging for this stuff if you hadn’t pushed me in the right direction.
Turns out it wasn’t really about FOG or PXE, but it did help find an issue with my network I had no idea existed. Goes to show how interconnected technology is.

Here’s the article that helped me configure LACP, if anyone faces a similar problem: https://support.citrix.com/article/CTX135690 (xcp-ng is basically open source Citrix Hypervisor, formerly XenServer).

george1421

@magikmw Try using ipxe.kpxe instead of undionly.kpxe as the iPXE boot loader. It’s possible that the UNDI component of your hypervisor isn’t compatible with iPXE. One of the native drivers in ipxe.kpxe may work better than the UNDI driver.

Since I don’t know that hypervisor, what NIC emulation is your VM configured for?

magikmw

@george1421 ipxe.kpxe produced the exact same result (I’ve tried it before, too).

I have two options for NIC emulation: Realtek RTL8139 (default) and Intel e1000.

george1421

@magikmw So I would pick the e1000 emulation if I had a choice.

Now, is that network connection bridged or NAT’d? I see the target system is getting 10.0.20.206; is that the correct IP address for your network?

magikmw

@george1421 Alright. I’ve tried with e1000, no change.

The VM connection is bridged via virtual switch to a physical switch between the host and router.
Both virtual switch and physical switch trunk a VLAN to the router/DHCP (it’s transparent to the VM).
The host’s connection is a bonded 2 port NIC.
10.0.20.206 is correct (DHCP range is 10.0.20.200-240).

Sebastian Roth

@magikmw said in Trouble with DHCP after loading undionly.kpxe (xcp-ng):

I’ve dumped DHCP/TFTP traffic passing through the hypervisor,

That’s great and looks really interesting. First, I noticed that packets seem to be partly duplicated in the PCAP. I see 18 DHCP Discover packets from your client within one second (0.85 s really) before the DHCP server sends an Offer. That is a very slow response for a local network. It’s similar with the subsequent DHCP Request and DHCP ACK: 9 Requests (this time in a very short time) before the ACK is sent. Looks really strange to me.

To make a long story short, I just noticed that in the first round (BIOS PXE boot) there are two Offer and two ACK packets, one of each VLAN tagged and one of each without. In the second round (iPXE) I only see DHCP Offers with a VLAN tag (ID 20, by the way). So to me it seems like the DHCP server behaves differently depending on the DHCP Discover packet. Even stranger than the stuff before.
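For readers who want to check tagged versus untagged frames outside of Wireshark, here is a minimal Python sketch of how an 802.1Q tag is recognised in a raw Ethernet frame. The function name vlan_id and the two sample frames are made up for illustration; only the VLAN ID 20 and the router MAC come from this thread.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # 802.1Q tag protocol identifier
ETHERTYPE_IPV4 = 0x0800


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    # Bytes 12-13 hold the EtherType, or the TPID when a VLAN tag is present.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_VLAN:
        return None
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q tag")
    # Bytes 14-15 are the tag control information: 3 bits priority,
    # 1 bit drop-eligible, 12 bits VLAN ID.
    (tci,) = struct.unpack_from("!H", frame, 14)
    return tci & 0x0FFF


# Hypothetical sample frames: broadcast destination, router MAC as source.
dst = bytes.fromhex("ffffffffffff")
src = bytes.fromhex("788a20bdc76f")
payload = b"\x00" * 46

untagged = dst + src + struct.pack("!H", ETHERTYPE_IPV4) + payload
tagged = (dst + src + struct.pack("!HH", ETHERTYPE_VLAN, 20)
          + struct.pack("!H", ETHERTYPE_IPV4) + payload)

print(vlan_id(untagged))  # None
print(vlan_id(tagged))    # 20
```

Seeing the same packet once with and once without a tag is expected when a capture covers both the trunk side and the VM side of a bridge.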

VLAN is the key I suppose! Can’t you terminate the VLAN on the switch or hypervisor?

magikmw

@Sebastian-Roth
I’m sorry, I think that’s a dead end. I made the mistake of dumping packets from all interfaces on the host without thinking about it. Here’s the dump from just the bridge, showing packets as they appear outside of the VM’s interface: ~~removed link~~

The untagged packets were the same packets, just from the VM’s perspective. As I mentioned, the VLAN is transparent to the VM, and as far as the VM is concerned it’s plugged into any dumb switch.

I’m looking into dumping the packets from just the VM’s virtual interface, but it’s a bit tricky as the interface is only created after the VM is set to start, so I’ll have to juggle it a bit.

Anyway, the DISCOVER and OFFER packets do appear different before and after getting the *.kpxe file from FOG, with ‘after’ being a few bytes bigger. Here’s a diff of two of them:

< Frame 1: 441 bytes on wire (3528 bits), 441 bytes captured (3528 bits)
---
> Frame 2: 458 bytes on wire (3664 bits), 458 bytes captured (3664 bits)
14,17c14,17
<     Transaction ID: 0x512c7b59
<     Seconds elapsed: 8
<     Bootp flags: 0x0000 (Unicast)
<         0... .... .... .... = Broadcast flag: Unicast
---
>     Transaction ID: 0xaa16170a
>     Seconds elapsed: 4
>     Bootp flags: 0x8000, Broadcast flag (Broadcast)
>         1... .... .... .... = Broadcast flag: Broadcast
48c48
<         Length: 21
---
>         Length: 23
55a56
>         Parameter Request List Item: (26) Interface MTU
59a61
>         Parameter Request List Item: (119) Domain Search
71,72c73,74
<         Length: 45
<         Value: b105018086100e2201011901012101011801011101011301…
---
>         Length: 60
>         Value: b1050800000000eb03010000170101220101160101130101…
79c81
<         Client Identifier (UUID): f278572e-dd57-1b1a-2e8b-c3d9b21795b9
---
>         Client Identifier (UUID): 2e5778f2-57dd-1a1b-2e8b-c3d9b21795b9
82c84
< 
---
>

Apparently the new one is a broadcast instead of unicast? I’m not sure what the significance is.
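The "Bootp flags" value in the diff is a 16-bit field at byte offset 10 of the BOOTP/DHCP message, and its top bit (0x8000) is the client asking the server to broadcast its replies. A small Python sketch of decoding it is below. The transaction IDs, seconds and flags values are taken from the diff above; the helper names and the remaining header bytes (op, htype, hlen, hops) are assumptions made only to build a sample.

```python
import struct

BROADCAST_FLAG = 0x8000  # top bit of the 16-bit BOOTP "flags" field


def parse_bootp_start(payload: bytes) -> dict:
    """Decode the first 12 bytes of a BOOTP/DHCP message (the UDP payload)."""
    if len(payload) < 12:
        raise ValueError("need at least 12 bytes of BOOTP header")
    # op, htype, hlen, hops (1 byte each), xid (4), secs (2), flags (2)
    op, htype, hlen, hops, xid, secs, flags = struct.unpack_from(
        "!BBBBIHH", payload
    )
    return {
        "xid": f"0x{xid:08x}",
        "secs": secs,
        "broadcast": bool(flags & BROADCAST_FLAG),
    }


def sample_header(xid: int, secs: int, flags: int) -> bytes:
    # Assumed values: op=1 (BOOTREQUEST), htype=1 (Ethernet), hlen=6, hops=0.
    return struct.pack("!BBBBIHH", 1, 1, 6, 0, xid, secs, flags)


before = parse_bootp_start(sample_header(0x512C7B59, 8, 0x0000))  # BIOS PXE
after = parse_bootp_start(sample_header(0xAA16170A, 4, 0x8000))   # iPXE

print(before)
print(after)
```

With these inputs the first DISCOVER decodes with broadcast False and the second with broadcast True, matching what Wireshark shows.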

Same for offer:

< Frame 3: 346 bytes on wire (2768 bits), 346 bytes captured (2768 bits)
< Ethernet II, Src: Ubiquiti_bd:c7:6f (78:8a:20:bd:c7:6f), Dst: 6e:4e:8f:1e:0b:fb (6e:4e:8f:1e:0b:fb)
---
> Frame 7: 359 bytes on wire (2872 bits), 359 bytes captured (2872 bits)
> Ethernet II, Src: Ubiquiti_bd:c7:6f (78:8a:20:bd:c7:6f), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
7c7
< Internet Protocol Version 4, Src: 10.0.20.1, Dst: 10.0.20.202
---
> Internet Protocol Version 4, Src: 10.0.20.1, Dst: 255.255.255.255
14c14
<     Transaction ID: 0x512c7b59
---
>     Transaction ID: 0xaa16170a
16c16
<     Bootp flags: 0x0000 (Unicast)
---
>     Bootp flags: 0x8000, Broadcast flag (Broadcast)
33a34
>     Option: (119) Domain Search
35d35
<
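The destination change in the OFFER follows from that flag. As a rough sketch of the rule in RFC 2131 section 4.1 for DHCPOFFER and DHCPACK, the Python below picks the reply destination from the request fields. The function name and return shape are hypothetical; the addresses are the ones visible in the diff above.

```python
from ipaddress import IPv4Address

ZERO = IPv4Address("0.0.0.0")


def reply_destination(giaddr: str, ciaddr: str, yiaddr: str,
                      chaddr: str, broadcast_flag: bool):
    """Return (destination IP, destination MAC) for a DHCPOFFER or DHCPACK.

    A MAC of None means the frame goes to whatever the next hop is.
    """
    if IPv4Address(giaddr) != ZERO:
        # Request came through a relay agent: reply to the relay.
        return giaddr, None
    if IPv4Address(ciaddr) != ZERO:
        # Client already has a working address: unicast to it.
        return ciaddr, chaddr
    if broadcast_flag:
        # Client asked for broadcast replies.
        return "255.255.255.255", "ff:ff:ff:ff:ff:ff"
    # Otherwise unicast to the offered address and the client's MAC.
    return yiaddr, chaddr


client_mac = "6e:4e:8f:1e:0b:fb"

# First round (BIOS PXE): broadcast flag clear.
print(reply_destination("0.0.0.0", "0.0.0.0", "10.0.20.202", client_mac, False))
# Second round (iPXE): broadcast flag set.
print(reply_destination("0.0.0.0", "0.0.0.0", "10.0.20.202", client_mac, True))
```

So the first-round OFFER is a unicast frame to the client's MAC, while the second-round OFFER is a broadcast, which means the iPXE stage only succeeds if broadcasts from the router actually reach the VM.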

Sebastian Roth

@magikmw Glad to see you found and fixed this. It’s good to know you really know your way around that network stuff. It’s very hard for us to diagnose and help out with that kind of thing.
