PXE-E11: ARP timeout

Wayne Workman

I’m on r6371 with CentOS 7 minimal, fully updated in this particular case.

Error messages so search engines can find this:
PXE-E11: ARP timeout
PXE-E38: TFTP cannot open connection
PXE-M0F: Exiting Broadcom PXE ROM.

In one FOG setup in my environment, we have a master FOG server that hosts the web interface and is itself a storage node. We also have a remote storage node on another network, which is configured on the master FOG server.

On the main FOG server’s network, DHCP is handled by Windows Server. On the other network, the remote storage node runs ISC-DHCP.

Both DHCP setups give out the main FOG server’s IP as the next-server / option 066.

Things were good on both networks, everything was working.

I virtualized the main FOG server into VMware. I reused the same IP, but of course the old FOG server had its IP changed at that time and was also powered down.

After that, hosts on the remote network could no longer network boot. They fail with the PXE error messages listed at the top of this post.

On that remote network, of course I’ve tried pinging the gateway, the remote fog server, and every hop in between - all succeed. I’ve tried arpinging the remote server, the gateway, and every hop in between - those all succeed. I did all those things in reverse, from the Main fog server to the remote fog server and every hop in between and all succeed.

Then I looked at packet captures from both the remote fog server and the main fog server from when a client was trying to boot. There is a proper DHCP assignment, but I never see any traffic from the network-booting host on the main server.

There are also some ARP responses from what looks like a switch that says one of the switch management IPs is in duplicate use by two MAC addresses. I asked our network guys about that and they said the other MAC I gave them doesn’t exist on the network.

This is a capture taken on the remote FOG server, on the remote network, while a client was network booting. It shows the ARP issues.
0_1456247045868_ta-issue.pcap

Now, after all that testing, I decided to set a DHCP reservation for my desktop (in yet another remote network) to network boot to this particular Main FOG server. It works fine.

I tried to set DHCP on the FOG storage node to give out its own address as the next-server address, so hosts there could grab their boot files locally from its TFTP service, but the chainloading to the remote server failed.

What I eventually did, to get around both the ARP issue and the storage node’s boot.php file being somehow broken, was put a basic redirect inside the boot.php file. I changed it to this on the storage node, and it seems to be working:

<?php
// Redirect every boot.php request on this storage node to the main FOG server.
$forwardingIP = "10.51.1.53";
$variables = "";
// MAC parameters are only forwarded when mac0 is present.
// !empty() avoids undefined-index notices for missing parameters.
if (!empty($_REQUEST['mac0'])) {
        $variables = "?mac0=" . urlencode($_REQUEST['mac0']);
        foreach (array('mac1', 'mac2') as $key) {
                if (!empty($_REQUEST[$key])) {
                        $variables .= "&" . $key . "=" . urlencode($_REQUEST[$key]);
                }
        }
}
header('Location: http://' . $forwardingIP . '/fog/service/ipxe/boot.php' . $variables);
// Stop here so nothing else is sent after the redirect header.
exit;
?>

Basically, it grabs the info sent to the storage node, builds a link from that info, and then redirects to the main server’s boot.php file.
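For anyone who wants to reason about the redirect logic outside of PHP, here is a minimal Python sketch of the same URL-building step. The function name build_redirect_url and the dictionary-style request_params are illustrative assumptions, not part of FOG; only the forwarding IP and the boot.php path come from the post above.

```python
from urllib.parse import urlencode

FORWARDING_IP = "10.51.1.53"  # main FOG server address from the post above


def build_redirect_url(request_params, forwarding_ip=FORWARDING_IP):
    """Return the main server's boot.php URL, carrying over any MAC parameters."""
    forwarded = {}
    # As in the PHP version, MAC parameters are only forwarded when mac0 is present.
    if request_params.get("mac0"):
        for key in ("mac0", "mac1", "mac2"):
            value = request_params.get(key)
            if value:
                forwarded[key] = value
    # urlencode percent-encodes the colons in MAC addresses (":" becomes "%3A").
    query = "?" + urlencode(forwarded) if forwarded else ""
    return "http://" + forwarding_ip + "/fog/service/ipxe/boot.php" + query


if __name__ == "__main__":
    print(build_redirect_url({"mac0": "00:11:22:33:44:55"}))
```

A request without mac0 simply yields the bare boot.php URL, which matches what the redirect does when no MAC is supplied.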

While this works, I’d really appreciate any suggestions about the underlying problem. I know it’s some sort of network or VMware issue, but I don’t have access to the switch configurations and I don’t have direct access to a computer with the VMware client installed.

And I’d also like to maybe make this boot.php code I’ve written the standard for storage nodes since it is able to get around the ARP network issues.

george1421

Let’s take this one step at a time.

You virtualized your physical server. Did you p2v it or just spin up a new vm instance?

If you were to power off your vm, power up your old fog server at the correct address would everything be golden?

I assume your fog server is at HQ and your storage node is at a remote location. Does pxe booting a target computer at the HQ site work and the remote site fail?

And just to restate the obvious, you disabled the firewall on the new FOG server?

The error (PXE-E11: ARP timeout) translates to the PXE client asking, "Hey, who has the IP address <ip address of what is in option 66>? Send me your MAC address," and no one replying.
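To make that "who has" question concrete, here is a rough Python sketch that assembles the bytes of a broadcast ARP request. It only builds the frame and does not send anything. The sender MAC and sender IP used in the example call are made-up placeholders; the target IP is the option 66 value from this thread. Note that if the target sits on a different subnet, a client would normally ARP for its gateway rather than for the option 66 address itself.

```python
import socket
import struct


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build the raw bytes of a broadcast ARP "who-has" Ethernet frame."""
    src_mac = bytes.fromhex(sender_mac.replace(":", ""))
    broadcast = b"\xff" * 6
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    ethernet = broadcast + src_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + socket.inet_aton(sender_ip)
    # The target MAC is what the sender wants to learn, so it is zero-filled.
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)
    return ethernet + arp


if __name__ == "__main__":
    # Placeholder sender values; 10.51.1.53 is the next-server from the thread.
    frame = build_arp_request("52:54:00:12:34:56", "10.65.2.100", "10.51.1.53")
    print(len(frame), frame.hex())
```

PXE-E11 means the client sent a request like this and never received a reply carrying the MAC address it asked for.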

Wayne Workman

@george1421 said:

You virtualized your physical server. Did you p2v it or just spin up a new vm instance?

I made a brand-new VM from scratch and ported over the images, db data, and CA and certs manually.

If you were to power off your vm, power up your old fog server at the correct address would everything be golden?

Don’t know, maybe, maybe not. We’re only having issues at the remote site. The site where the newly built FOG server is works fine in every regard.

I assume your fog server is at HQ and your storage node is at a remote location. Does pxe booting a target computer at the HQ site work and the remote site fail?

Yes.

And just to restate the obvious, you disabled the firewall on the new FOG server?

I configured firewalld on the new VM, it works fine at that site. For troubleshooting purposes I have turned the firewall off on both the main server and the storage node.

The error (PXE-E11: ARP timeout) translates to the PXE client asking, "Hey, who has the IP address <ip address of what is in option 66>? Send me your MAC address," and no one replying.

Right… ideas?

Mind you - I can network boot from yet another remote location to the main fog server just fine, and before the rebuilding, everything everywhere worked just fine.

I’m 100% positive this is a network issue, but I don’t know what it could be.

Wayne Workman

Tom asked what http://10.65.2.20/fog/service/ipxe/boot.php shows. Because of the redirect I coded, it sends me to http://10.51.1.53/fog/service/ipxe/boot.php. Here is what that returns (which is obviously a good config):

#!ipxe
set fog-ip 10.51.1.53
set fog-webroot fog
set boot-url http://${fog-ip}/${fog-webroot}
cpuid --ext 29 && set arch x86_64 || set arch i386
goto get_console
:console_set
colour --rgb 0x00567a 1 ||
colour --rgb 0x00567a 2 ||
colour --rgb 0x00567a 4 ||
cpair --foreground 7 --background 2 2 ||
goto MENU
:alt_console
cpair --background 0 1 ||
cpair --background 1 2 ||
goto MENU
:get_console
console --picture http://10.51.1.53/fog/service/ipxe/bg.png --left 100 --right 80 && goto console_set || goto alt_console
:MENU
menu
colour --rgb 0xff0000 0 ||
cpair --foreground 1 1 ||
cpair --foreground 0 3 ||
cpair --foreground 4 4 ||
item --gap Host is NOT registered!
item --gap -- -------------------------------------
item fog.local Boot from hard disk
item fog.memtest Run Memtest86+
item fog.reginput Perform Full Host Registration and Inventory
item fog.reg Quick Registration and Inventory
item fog.quickimage Quick Image
item fog.multijoin Join Multicast Session
item fog.sysinfo Client System Information (Compatibility)
choose --default fog.local --timeout 10000 target && goto ${target}
:fog.local
sanboot --no-describe --drive 0x80 || goto MENU
:fog.memtest
kernel memdisk iso raw
initrd memtest.bin
boot || goto MENU
:fog.reginput
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=manreg
imgfetch init_32.xz
boot || goto MENU
:fog.reg
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=autoreg
imgfetch init_32.xz
boot || goto MENU
:fog.quickimage
login
params
param mac0 ${net0/mac}
param arch ${arch}
param username ${username}
param password ${password}
param qihost 1
isset ${net1/mac} && param mac1 ${net1/mac} || goto bootme
isset ${net2/mac} && param mac2 ${net2/mac} || goto bootme
:fog.multijoin
login
params
param mac0 ${net0/mac}
param arch ${arch}
param username ${username}
param password ${password}
param sessionJoin 1
isset ${net1/mac} && param mac1 ${net1/mac} || goto bootme
isset ${net2/mac} && param mac2 ${net2/mac} || goto bootme
:fog.sysinfo
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=sysinfo
imgfetch init_32.xz
boot || goto MENU
:bootme
chain -ar http://10.51.1.53/fog/service/ipxe/boot.php##params ||
goto MENU
autoboot

However, I did look at what http://10.65.2.20/fog/service/ipxe/boot.php returned before I set up the redirect, and it was significantly shorter than what 10.51.1.53 provides.

Wayne Workman

With the old boot.php file put in place on the storage node, this is what’s rendered:

#!ipxe
set fog-ip
set fog-webroot
set boot-url http://${fog-ip}/${fog-webroot}
cpuid --ext 29 && set arch x86_64 || set arch i386
goto get_console
:console_set
colour --rgb 0x00567a 1 ||
colour --rgb 0x00567a 2 ||
colour --rgb 0x00567a 4 ||
cpair --foreground 7 --background 2 2 ||
goto MENU
:alt_console
cpair --background 0 1 ||
cpair --background 1 2 ||
goto MENU
:get_console
console --picture http:///service/ipxe/bg.png --left 100 --right 80 && goto console_set || goto alt_console
:MENU
menu
colour --rgb 0xff0000 0 ||
cpair --foreground 1 1 ||
cpair --foreground 0 3 ||
cpair --foreground 4 4 ||
item --gap Host is NOT registered!
item --gap -- -------------------------------------
choose --default fog.local --timeout 0 target && goto ${target}
:bootme
chain -ar http:///service/ipxe/boot.php##params ||
goto MENU
autoboot

george1421

OK, it’s a remote device booting to a remote storage node (sorry about being intentionally slow, I’m trying to draw the picture here).

You updated the FOG server at HQ, and the client at the remote site is having an ARP issue. At the remote site, what should the client be seeing for option 66? (I would expect it to see the storage node IP address.)

I really don’t think it’s even getting far enough to worry about the boot.php file. It’s getting option 66 and trying to find the MAC address of the device that option 66 points to.

My initial reaction here is that it could be networking. I have seen routers with a really long ARP cache refresh time, where they may hold onto the old MAC-address-to-IP-address translation for 20 or 30 minutes, but eventually it will clear. Based on what you’ve done so far, I would assume it’s been more than 30 minutes.

Wayne Workman

@george1421 said:

My initial reaction here is that it could be networking. I have seen routers with a really long ARP cache refresh time, where they may hold onto the old MAC-address-to-IP-address translation for 20 or 30 minutes, but eventually it will clear. Based on what you’ve done so far, I would assume it’s been more than 30 minutes.

Try a week.

Also, the remote site’s next-server was previously set to the main FOG server’s IP, and that worked fine. Now I’ve got it set to the FOG node, and I’m redirecting requests for that node’s boot.php file to the main FOG server to get it working, because there are apparently MySQL issues breaking the node’s own boot.php.

Wayne Workman

Tom figured out that booting from the storage node wasn’t working because some PHP code was using old-style MySQL password mechanisms. He removed that, and booting from the node started working.

So, I don’t have to use my hack-ish redirect anymore (which was awesome).

But, the network issues with ARP remain.

Sebastian Roth

@Wayne-Workman Here you can find a nice explanation of ‘gratuitous ARP’: https://wiki.wireshark.org/Gratuitous_ARP

To me this looks like you have two switches (MACs 00:0f:23:4c:49:00 and 00:22:56:01:4e:44) in that VLAN using the same IP address (10.50.65.254). Both seem to be Cisco devices (judging from the MAC addresses), but I cannot be sure.
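The duplicate-address check itself is easy to automate once (IP, MAC) pairs have been pulled out of a capture. The following is a minimal Python sketch of that idea; the function name find_duplicate_ips is made up for illustration, and how the pairs get extracted from the pcap is left out. The two claims for 10.50.65.254 are the ones from this capture, while the third entry is an invented, non-conflicting host added for contrast.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Map each IP claimed by more than one MAC to the sorted list of those MACs."""
    claims = defaultdict(set)
    for ip, mac in observations:
        claims[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in claims.items() if len(macs) > 1}


# First two entries: the claims seen in the capture discussed above.
# Third entry: a made-up host that does not conflict with anything.
observed = [
    ("10.50.65.254", "00:0f:23:4c:49:00"),
    ("10.50.65.254", "00:22:56:01:4e:44"),
    ("10.50.65.10", "52:54:00:aa:bb:cc"),
]

if __name__ == "__main__":
    for ip, macs in find_duplicate_ips(observed).items():
        print(ip, "is claimed by", ", ".join(macs))
```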

Wayne Workman

@Sebastian-Roth said:

@Wayne-Workman Here you can find a nice explanation of ‘gratuitous ARP’: https://wiki.wireshark.org/Gratuitous_ARP

Interesting you mention this. I tried sending gratuitous ARP from both the storage node and the main fog server - multiple times. It didn’t make any difference.

To me this looks like you have two switches (MACs 00:0f:23:4c:49:00 and 00:22:56:01:4e:44) in that VLAN using the same IP address (10.50.65.254). Both seem to be Cisco devices (judging from the MAC addresses), but I cannot be sure.

I noticed that too. I reported both MAC addresses and the message to my network team and they told me that the second MAC address doesn’t exist on our network… so… not sure what to say about that. I’m sure the MAC exists somehow/somewhere and this error isn’t just sent out by a switch that is in a bad mood - it must be caused by… something.

george1421

Since you have two MAC addresses reporting, it would be interesting to see what each of these captures shows:
tcpdump with "ether host 00:0f:23:4c:49:00"
and
tcpdump with "ether host 00:22:56:01:4e:44"

What you are looking for is something distinguishable that helps you locate this device. From the MAC addresses, I can tell you that these should be two different devices (not a sub-interface on the same device).

From a logic standpoint, do you use Cisco gear for networking in general (switches and so on) or just for routers? I suspect the 00:0f:23 device is older than the 00:22:56 device.

If you have a device on that same subnet, it would be interesting if you did a
ping -b 10.66.15.255
(a broadcast ping to the subnet broadcast address), waited a few seconds, and then ran arp -a and redirected the output into a text file. At this point I don’t care whether you can find the MAC addresses in question. I would look for devices that have the same vendor code, 00:0f:23 or 00:22:56, with a relatively close device part. Once you find one, use the IP address returned to track down a known device and find out what it is (make and model). That may help you narrow down your ghost device (like an old configuration on an L3 router).
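Scanning that text file by eye gets tedious on a busy subnet, so here is a small Python sketch that filters saved arp -a output by vendor prefix. It assumes the Linux net-tools output format with zero-padded MAC addresses; other platforms print the table differently. The function name and the sample entries are made up for illustration and are not real hosts from this network.

```python
import re

# Matches lines such as "? (10.0.0.1) at 00:0f:23:4c:49:00 [ether] on eth0".
ARP_LINE = re.compile(
    r"\((?P<ip>[\d.]+)\) at (?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})"
)


def hosts_with_vendor(arp_output, prefixes):
    """Return (ip, mac) pairs whose MAC starts with one of the vendor prefixes."""
    wanted = tuple(prefix.lower() for prefix in prefixes)
    hits = []
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if match and match.group("mac").lower().startswith(wanted):
            hits.append((match.group("ip"), match.group("mac").lower()))
    return hits


# Made-up sample output; incomplete entries carry no MAC and are skipped.
SAMPLE = """\
? (10.66.15.1) at 00:0f:23:4c:49:10 [ether] on eth0
? (10.66.15.20) at 52:54:00:12:34:56 [ether] on eth0
? (10.66.15.30) at <incomplete> on eth0
"""

if __name__ == "__main__":
    for ip, mac in hosts_with_vendor(SAMPLE, ["00:0f:23", "00:22:56"]):
        print(ip, mac)
```

Any hit gives you an IP address on a known device with the same vendor code, which is the starting point for identifying the make and model.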

Sebastian Roth

Just found an interesting filter I didn’t know about yet: tcpdump -ee "ether[0:4] == 0x000f234c" (the bytes you are “grepping” for must be of length 1, 2 or 4 - so you can do “ether[0:1]” and “ether[2:2]” but you can’t do “ether[0:3]”)
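One detail worth knowing about that filter: bytes 0-5 of an Ethernet frame hold the destination MAC and bytes 6-11 hold the source MAC, so ether[0:4] only matches frames sent to that prefix. A small Python sketch (the helper name mac_prefix_filter is made up) that turns a four-octet prefix into an expression covering both directions:

```python
def mac_prefix_filter(prefix):
    """Return a tcpdump expression matching a 4-octet MAC prefix as source or destination."""
    octets = prefix.lower().split(":")
    if len(octets) != 4:
        # tcpdump only compares 1, 2 or 4 bytes at a time; this helper handles 4.
        raise ValueError("expected exactly four octets, e.g. 00:0f:23:4c")
    value = "0x" + "".join(octet.zfill(2) for octet in octets)
    # ether[0:4] looks at the destination MAC, ether[6:4] at the source MAC.
    return "ether[0:4] == {0} or ether[6:4] == {0}".format(value)


if __name__ == "__main__":
    print(mac_prefix_filter("00:0f:23:4c"))
```

The resulting string can be passed to tcpdump in quotes in place of the single-direction filter.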

Having a closer look at the MAC addresses I noticed the last byte being “00” on one of them. Then I remembered that every port usually has its own MAC on Cisco switches. “00” being the switch itself, “01” the first port and so on. You don’t usually see the switch MACs in IP communications as there will only be the MACs of source and destination in those packets (switches are transparent in that respect). But switches do send out traffic as well, like BPDU for spanning tree and stuff like that.

Using similar filters (eth.addr[0:4] contains 00:22:56:01) on that wireshark dump I found that 00:22:56:01:4e:44 has a “partner” with MAC 00:22:56:01:4e:02 (notice the change in the last byte) which sends out spanning tree messages (BPDUs) on a regular basis. Looking at those BPDUs I see “Bridge Identifier: 00:22:56:01:4e:00”. Does your network team know about this MAC/switch (“Root bridge: 00:0d:65:51:80:80” - if that’s of any help for them)???
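The "switch base MAC plus port number" reasoning can be checked mechanically. This Python sketch assumes, as described above, that ports on one switch share the first five octets and that the switch itself ends in 00. That is a rule of thumb rather than a guarantee, and the helper name base_mac is made up.

```python
def base_mac(mac):
    """Guess a switch's base MAC by replacing the last octet with 00."""
    octets = mac.lower().split(":")
    if len(octets) != 6:
        raise ValueError("expected six colon-separated octets")
    return ":".join(octets[:5] + ["00"])


if __name__ == "__main__":
    # MAC addresses taken from the capture discussed in this thread.
    for mac in ("00:22:56:01:4e:44", "00:22:56:01:4e:02", "00:0f:23:4c:49:00"):
        print(mac, "->", base_mac(mac))
```

Under that assumption, 00:22:56:01:4e:44 and 00:22:56:01:4e:02 both map to 00:22:56:01:4e:00, the Bridge Identifier seen in the BPDUs, while 00:0f:23:4c:49:00 maps to a different device.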

http://www.ciscozine.com/how-to-trace-mac-address/ seems interesting in case they wanna trace the MAC.

Wayne Workman

@Sebastian-Roth said:

Does your network team know about this MAC/switch (“Root bridge: 00:0d:65:51:80:80” - if that’s of any help for them)???

Well if they don’t, they will. Today is a snow day so I get to dive in again tomorrow.

Thank you both Sebastian and George for helping out - you guys are phenomenal.

george1421

@Wayne-Workman said:

Well if they don’t, they will. Today is a snow day so I get to dive in again tomorrow.

No worries, stay warm and safe.
