PXE-E11: ARP timeout

Wayne Workman

You virtualized your physical server. Did you p2v it or just spin up a new vm instance?

I made a brand-new VM from scratch and ported over the images, db data, and CA and certs manually.

If you were to power off your VM and power up your old FOG server at the correct address, would everything be golden?

Don’t know, maybe, maybe not. We’re only having issues at the remote site. The site where the newly built FOG server is works fine in every regard.

I assume your FOG server is at HQ and your storage node is at a remote location. Does PXE booting a target computer work at the HQ site and fail at the remote site?

Yes.

And just to restate the obvious, you disabled the firewall on the new FOG server?

I configured firewalld on the new VM, and it works fine at that site. For troubleshooting purposes I have turned the firewall off on both the main server and the storage node.

The error (PXE-E11: ARP timeout), translated, comes down to the PXE client saying "Hey, who has IP address <ip address of what is in option 66>? Send me your MAC address," and no one replying.
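To make that "who has" question concrete, here is a minimal Python sketch of the broadcast frame an ARP request consists of. The function name and the example addresses in the usage note are assumptions for illustration only; nothing here was captured from the network in this thread.

```python
import socket
import struct


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build an Ethernet frame carrying an ARP 'who-has' request."""
    src = bytes.fromhex(sender_mac.replace(":", ""))
    broadcast = b"\xff" * 6
    # Ethernet header: destination (broadcast), source, EtherType 0x0806 (ARP)
    ether_header = broadcast + src + struct.pack("!H", 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode 1 = request
        src,                          # sender MAC
        socket.inet_aton(sender_ip),  # sender IP
        b"\x00" * 6,                  # target MAC is unknown, so zeroed
        socket.inet_aton(target_ip),  # the IP being asked about (option 66)
    )
    return ether_header + arp_payload
```

For example, `build_arp_request("52:54:00:12:34:56", "10.65.2.100", "10.65.2.20")` (a made-up client asking for the storage node address) yields a 42-byte frame. PXE-E11 means the client sent frames like this and never got a reply with the opcode set to 2.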

Right… ideas?

Mind you - I can network boot from yet another remote location to the main fog server just fine, and before the rebuilding, everything everywhere worked just fine.

I’m 100% positive this is a network issue, but I don’t know what it could be.

Wayne Workman

Tom asked what http://10.65.2.20/fog/service/ipxe/boot.php shows. Because of the redirect I coded, it sends me to http://10.51.1.53/fog/service/ipxe/boot.php. Here is what that says (which is obviously good config):

#!ipxe
set fog-ip 10.51.1.53
set fog-webroot fog
set boot-url http://${fog-ip}/${fog-webroot}
cpuid --ext 29 && set arch x86_64 || set arch i386
goto get_console
:console_set
colour --rgb 0x00567a 1 ||
colour --rgb 0x00567a 2 ||
colour --rgb 0x00567a 4 ||
cpair --foreground 7 --background 2 2 ||
goto MENU
:alt_console
cpair --background 0 1 ||
cpair --background 1 2 ||
goto MENU
:get_console
console --picture http://10.51.1.53/fog/service/ipxe/bg.png --left 100 --right 80 && goto console_set || goto alt_console
:MENU
menu
colour --rgb 0xff0000 0 ||
cpair --foreground 1 1 ||
cpair --foreground 0 3 ||
cpair --foreground 4 4 ||
item --gap Host is NOT registered!
item --gap -- -------------------------------------
item fog.local Boot from hard disk
item fog.memtest Run Memtest86+
item fog.reginput Perform Full Host Registration and Inventory
item fog.reg Quick Registration and Inventory
item fog.quickimage Quick Image
item fog.multijoin Join Multicast Session
item fog.sysinfo Client System Information (Compatibility)
choose --default fog.local --timeout 10000 target && goto ${target}
:fog.local
sanboot --no-describe --drive 0x80 || goto MENU
:fog.memtest
kernel memdisk iso raw
initrd memtest.bin
boot || goto MENU
:fog.reginput
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=manreg
imgfetch init_32.xz
boot || goto MENU
:fog.reg
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=autoreg
imgfetch init_32.xz
boot || goto MENU
:fog.quickimage
login
params
param mac0 ${net0/mac}
param arch ${arch}
param username ${username}
param password ${password}
param qihost 1
isset ${net1/mac} && param mac1 ${net1/mac} || goto bootme
isset ${net2/mac} && param mac2 ${net2/mac} || goto bootme
:fog.multijoin
login
params
param mac0 ${net0/mac}
param arch ${arch}
param username ${username}
param password ${password}
param sessionJoin 1
isset ${net1/mac} && param mac1 ${net1/mac} || goto bootme
isset ${net2/mac} && param mac2 ${net2/mac} || goto bootme
:fog.sysinfo
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=sysinfo
imgfetch init_32.xz
boot || goto MENU
:bootme
chain -ar http://10.51.1.53/fog/service/ipxe/boot.php##params ||
goto MENU
autoboot

However, I did look at what http://10.65.2.20/fog/service/ipxe/boot.php returned before I set up the redirect, and it was significantly shorter than what 10.51.1.53 provides.

Wayne Workman

With the old boot.php file put in place on the storage node, this is what’s rendered:

#!ipxe
set fog-ip
set fog-webroot
set boot-url http://${fog-ip}/${fog-webroot}
cpuid --ext 29 && set arch x86_64 || set arch i386
goto get_console
:console_set
colour --rgb 0x00567a 1 ||
colour --rgb 0x00567a 2 ||
colour --rgb 0x00567a 4 ||
cpair --foreground 7 --background 2 2 ||
goto MENU
:alt_console
cpair --background 0 1 ||
cpair --background 1 2 ||
goto MENU
:get_console
console --picture http:///service/ipxe/bg.png --left 100 --right 80 && goto console_set || goto alt_console
:MENU
menu
colour --rgb 0xff0000 0 ||
cpair --foreground 1 1 ||
cpair --foreground 0 3 ||
cpair --foreground 4 4 ||
item --gap Host is NOT registered!
item --gap -- -------------------------------------
choose --default fog.local --timeout 0 target && goto ${target}
:bootme
chain -ar http:///service/ipxe/boot.php##params ||
goto MENU
autoboot
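The broken output above has two visible symptoms: `set` lines with no value and URLs with no host (`http:///`). A rough Python sketch for spotting both in a rendered boot.php follows. The function name is hypothetical and the checks cover only these two symptoms, not a full iPXE syntax validation.

```python
import re


def find_rendering_problems(script):
    """Return a list of human-readable problems found in rendered iPXE text."""
    problems = []
    for number, line in enumerate(script.splitlines(), start=1):
        text = line.strip()
        # "set name" with nothing after the name means the value rendered empty.
        if re.fullmatch(r"set\s+\S+", text):
            problems.append(f"line {number}: empty value in '{text}'")
        # "http:///" means the host part of a URL rendered empty.
        if "http:///" in text:
            problems.append(f"line {number}: URL with no host in '{text}'")
    return problems
```

Fed the output above, this would flag `set fog-ip`, `set fog-webroot`, and both `http:///service/ipxe/...` lines, which fits the node's PHP failing to read its settings from the database.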

george1421

OK, it’s a remote device booting to a remote storage node (sorry about being intentionally slow, I’m trying to draw the picture here).

You updated the FOG server at HQ and the client at the remote site is having an ARP issue. At the remote site, what should the client be seeing for option 66? (I would expect it to see the storage node IP address.)

I really don’t think it’s even getting far enough to worry about the boot.php file. It’s getting option 66 and trying to find the MAC address of the device pointed to by option 66.

My initial reaction here is that it could be networking. I have seen routers with a really long ARP cache refresh time, where they may hold onto the old MAC-to-IP address mapping for 20 or 30 minutes, but eventually it will clear. Based on what you’ve done so far I would assume it’s been more than 30 minutes.

Wayne Workman

@george1421 said:

My initial reaction here is that it could be networking. I have seen routers with a really long ARP cache refresh time, where they may hold onto the old MAC-to-IP address mapping for 20 or 30 minutes, but eventually it will clear. Based on what you’ve done so far I would assume it’s been more than 30 minutes.

Try a week.

Also - previously, the remote site’s next-server was set to the main FOG server’s IP and this worked fine. Now I’ve got it set to the FOG node, and I’m redirecting requests for that node’s boot.php file to the main FOG server to get it working, because there are apparently MySQL issues breaking it on the node.

Wayne Workman

Tom figured out that booting from the storage node wasn’t working because some PHP was using old-style MySQL password mechanisms - he removed that and it started working.

So, I don’t have to use my hack-ish redirect anymore (which was awesome).

But, the network issues with ARP remain.

Sebastian Roth

@Wayne-Workman Here you find a nice explanation on ‘gratuitous ARP’ https://wiki.wireshark.org/Gratuitous_ARP

To me this looks like you have two switches (MACs 00:0f:23:4c:49:00 and 00:22:56:01:4e:44) in that VLAN using the same IP address (10.50.65.254). Both seem to be Cisco devices (judging from the MAC addresses), but I cannot be sure.
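The same check can be done mechanically on any list of (IP, MAC) pairs pulled from ARP replies in a capture. This is only a sketch under the assumption that the pairs have already been extracted; the function name is made up, and how the pairs get exported from Wireshark is left open.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Map each IP claimed by more than one MAC to the sorted list of MACs."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

Given the two MACs above both answering for 10.50.65.254, the result would contain that one IP with both addresses listed, while any IP seen with a single MAC is left out.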

Wayne Workman

@Sebastian-Roth said:

@Wayne-Workman Here you find a nice explanation on ‘gratuitous ARP’ https://wiki.wireshark.org/Gratuitous_ARP

Interesting you mention this. I tried sending gratuitous ARP from both the storage node and the main fog server - multiple times. It didn’t make any difference.

To me this looks like you have two switches (MACs 00:0f:23:4c:49:00 and 00:22:56:01:4e:44) in that VLAN using the same IP address (10.50.65.254). Both seem to be Cisco devices (judging from the MAC addresses), but I cannot be sure.

I noticed that too. I reported both MAC addresses and the message to my network team and they told me that the second MAC address doesn’t exist on our network… so… not sure what to say about that. I’m sure the MAC exists somehow/somewhere and this error isn’t just sent out by a switch that is in a bad mood - it must be caused by… something.

george1421

Since you have two MAC addresses reporting, it would be interesting to know what something like
tcpdump with "ether host 00:0f:23:4c:49:00"
and
tcpdump with "ether host 00:22:56:01:4e:44"
would show.

What you are looking for is something distinguishable that helps you locate this device. From the MAC addresses I can tell you that these should be two different devices (not a sub-interface on the same device).

From a logic standpoint, do you use Cisco gear for networking (switches and stuff) or just routers? I suspect the 00:0f:23 device is older than the 00:22:56 device.

If you have a device on that same subnet, it would be interesting if you did a
ping -b 10.66.15.255
(broadcast ping to the subnet broadcast address), then waited a few seconds, then ran arp -a and redirected the output into a text file. At this point I don’t care whether you can find the MAC addresses in question. I would look for devices that have the same vendor code, 00:0f:23 or 00:22:56, with a relatively close device part. Once you find one, use the IP address returned to track down a known device and find out what it is (make and model). That may help you narrow down your ghost device (like an old configuration on an L3 router).
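Once the arp -a output is in a text file, the vendor-code search can be scripted. Below is a small Python sketch; the function name is an assumption, and the regular expressions simply pick the first IPv4 address and the first MAC (colon or dash separated) on each line, so entries without a resolved MAC are skipped.

```python
import re

IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5})\b")


def hosts_with_oui(arp_text, ouis):
    """Return (ip, mac) pairs whose MAC starts with one of the given OUIs."""
    wanted = {oui.lower().replace("-", ":") for oui in ouis}
    matches = []
    for line in arp_text.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if not (ip and mac):
            continue  # incomplete entries and headers carry no usable pair
        normalized = mac.group(1).lower().replace("-", ":")
        if normalized[:8] in wanted:  # first 8 characters are "xx:xx:xx"
            matches.append((ip.group(1), normalized))
    return matches
```

Called as `hosts_with_oui(open("arp.txt").read(), ["00:0f:23", "00:22:56"])` (the file name is just an example), it returns the IPs worth tracing back to a physical device.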

Sebastian Roth

Just found an interesting filter I didn’t know about yet: tcpdump -ee "ether[0:4] == 0x000f234c" (the bytes you are “grepping” for must be of length 1, 2 or 4 - so you can do “ether[0:1]” and “ether[2:2]” but you can’t do “ether[0:3]”)
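Because of that 1, 2 or 4 byte limit, matching a 3-byte vendor prefix takes two comparisons joined with "and". Here is a Python sketch that builds such a filter string for any MAC prefix. The helper name is made up; the offsets rely on the Ethernet header layout (destination MAC at byte 0, source MAC at byte 6).

```python
def mac_prefix_filter(prefix, field="dst"):
    """Build a tcpdump filter matching a MAC prefix of 1 to 6 bytes."""
    data = bytes.fromhex(prefix.replace(":", "").replace("-", ""))
    if not 1 <= len(data) <= 6:
        raise ValueError("prefix must be between 1 and 6 bytes")
    offset = {"dst": 0, "src": 6}[field]
    terms = []
    position = 0
    while position < len(data):
        remaining = len(data) - position
        # Use the largest chunk size the filter syntax allows: 4, 2, or 1.
        size = 4 if remaining >= 4 else 2 if remaining >= 2 else 1
        chunk = data[position:position + size]
        terms.append(f"ether[{offset + position}:{size}] == 0x{chunk.hex()}")
        position += size
    return " and ".join(terms)
```

`mac_prefix_filter("00:0f:23:4c")` reproduces the filter above, and `mac_prefix_filter("00:22:56", "src")` gives `ether[6:2] == 0x0022 and ether[8:1] == 0x56` for frames sent by any device with that vendor code.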

Having a closer look at the MAC addresses I noticed the last byte being “00” on one of them. Then I remembered that every port usually has its own MAC on Cisco switches. “00” being the switch itself, “01” the first port and so on. You don’t usually see the switch MACs in IP communications as there will only be the MACs of source and destination in those packets (switches are transparent in that respect). But switches do send out traffic as well, like BPDU for spanning tree and stuff like that.

Using similar filters (eth.addr[0:4] contains 00:22:56:01) on that Wireshark dump, I found that 00:22:56:01:4e:44 has a “partner” with MAC 00:22:56:01:4e:02 (notice the change in the last byte) which sends out spanning tree messages (BPDUs) on a regular basis. Looking at those BPDUs I see “Bridge Identifier: 00:22:56:01:4e:00”. Does your network team know about this MAC/switch (“Root bridge: 00:0d:65:51:80:80” - if that’s of any help for them)?

http://www.ciscozine.com/how-to-trace-mac-address/ seems interesting in case they wanna trace the MAC.
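Assuming the numbering described above holds (the bridge MAC ends in “00” and the other MACs on the same switch differ only in the last byte), a few lines of Python can tell whether a MAC plausibly belongs to a given bridge and how far it sits from the base address. The function name is hypothetical, and real switches may allocate MACs differently.

```python
def offset_from_bridge(bridge_mac, mac):
    """Return the last-byte distance from the bridge MAC, or None if unrelated."""
    bridge = bytes.fromhex(bridge_mac.replace(":", ""))
    other = bytes.fromhex(mac.replace(":", ""))
    if len(bridge) != 6 or len(other) != 6:
        raise ValueError("both arguments must be full 6-byte MAC addresses")
    if bridge[:5] != other[:5]:
        return None  # first five bytes differ, so not the same address block
    return other[5] - bridge[5]
```

With the bridge identifier 00:22:56:01:4e:00, the BPDU sender 00:22:56:01:4e:02 comes out at offset 2 and 00:22:56:01:4e:44 at offset 68 (0x44), while 00:0f:23:4c:49:00 returns None.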

Wayne Workman

@Sebastian-Roth said:

Does your network team know about this MAC/switch (“Root bridge: 00:0d:65:51:80:80” - if that’s of any help for them)???

Well if they don’t, they will. Today is a snow day so I get to dive in again tomorrow.

Thank you both Sebastian and George for helping out - you guys are phenomenal.

george1421

@Wayne-Workman said:

Well if they don’t, they will. Today is a snow day so I get to dive in again tomorrow.

No worries, stay warm and safe.
