PXE-E11: ARP timeout
-
Tom asked for what http://10.65.2.20/fog/service/ipxe/boot.php shows. Because of the redirect I coded, it sends me to http://10.51.1.53/fog/service/ipxe/boot.php. Here is what that says (which is obviously a good config):
#!ipxe
set fog-ip 10.51.1.53
set fog-webroot fog
set boot-url http://${fog-ip}/${fog-webroot}
cpuid --ext 29 && set arch x86_64 || set arch i386
goto get_console
:console_set
colour --rgb 0x00567a 1 ||
colour --rgb 0x00567a 2 ||
colour --rgb 0x00567a 4 ||
cpair --foreground 7 --background 2 2 ||
goto MENU
:alt_console
cpair --background 0 1 ||
cpair --background 1 2 ||
goto MENU
:get_console
console --picture http://10.51.1.53/fog/service/ipxe/bg.png --left 100 --right 80 && goto console_set || goto alt_console
:MENU
menu
colour --rgb 0xff0000 0 ||
cpair --foreground 1 1 ||
cpair --foreground 0 3 ||
cpair --foreground 4 4 ||
item --gap Host is NOT registered!
item --gap -- -------------------------------------
item fog.local Boot from hard disk
item fog.memtest Run Memtest86+
item fog.reginput Perform Full Host Registration and Inventory
item fog.reg Quick Registration and Inventory
item fog.quickimage Quick Image
item fog.multijoin Join Multicast Session
item fog.sysinfo Client System Information (Compatibility)
choose --default fog.local --timeout 10000 target && goto ${target}
:fog.local
sanboot --no-describe --drive 0x80 || goto MENU
:fog.memtest
kernel memdisk iso raw
initrd memtest.bin
boot || goto MENU
:fog.reginput
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=manreg
imgfetch init_32.xz
boot || goto MENU
:fog.reg
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=autoreg
imgfetch init_32.xz
boot || goto MENU
:fog.quickimage
login
params
param mac0 ${net0/mac}
param arch ${arch}
param username ${username}
param password ${password}
param qihost 1
isset ${net1/mac} && param mac1 ${net1/mac} || goto bootme
isset ${net2/mac} && param mac2 ${net2/mac} || goto bootme
:fog.multijoin
login
params
param mac0 ${net0/mac}
param arch ${arch}
param username ${username}
param password ${password}
param sessionJoin 1
isset ${net1/mac} && param mac1 ${net1/mac} || goto bootme
isset ${net2/mac} && param mac2 ${net2/mac} || goto bootme
:fog.sysinfo
kernel bzImage32 loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 keymap= web=10.51.1.53/fog/ conosoleblank=0 loglevel=4 mode=sysinfo
imgfetch init_32.xz
boot || goto MENU
:bootme
chain -ar http://10.51.1.53/fog/service/ipxe/boot.php##params ||
goto MENU
autoboot
However, I did look at what http://10.65.2.20/fog/service/ipxe/boot.php returned before I set up the redirect, and it was significantly shorter than what 10.51.1.53 provides.
-
With the old boot.php file put in place on the storage node, this is what’s rendered:
#!ipxe
set fog-ip
set fog-webroot
set boot-url http://${fog-ip}/${fog-webroot}
cpuid --ext 29 && set arch x86_64 || set arch i386
goto get_console
:console_set
colour --rgb 0x00567a 1 ||
colour --rgb 0x00567a 2 ||
colour --rgb 0x00567a 4 ||
cpair --foreground 7 --background 2 2 ||
goto MENU
:alt_console
cpair --background 0 1 ||
cpair --background 1 2 ||
goto MENU
:get_console
console --picture http:///service/ipxe/bg.png --left 100 --right 80 && goto console_set || goto alt_console
:MENU
menu
colour --rgb 0xff0000 0 ||
cpair --foreground 1 1 ||
cpair --foreground 0 3 ||
cpair --foreground 4 4 ||
item --gap Host is NOT registered!
item --gap -- -------------------------------------
choose --default fog.local --timeout 0 target && goto ${target}
:bootme
chain -ar http:///service/ipxe/boot.php##params ||
goto MENU
autoboot
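The difference between the two renderings comes down to fog-ip and fog-webroot being empty on the storage node, which is why the URLs collapse to http:///service/ipxe/... with no host. A minimal Python sketch of how one might spot that automatically is below. The function names and the trimmed sample strings are made up for illustration; this is not part of FOG.

```python
import re


def find_empty_settings(script):
    """Return the names of iPXE variables that are 'set' with no value."""
    empty = []
    for line in script.splitlines():
        # A healthy line looks like "set fog-ip 10.51.1.53";
        # a broken one is just "set fog-ip".
        match = re.fullmatch(r"set\s+(\S+)\s*", line.strip())
        if match:
            empty.append(match.group(1))
    return empty


def find_hostless_urls(script):
    """Return URLs of the form http:///path, i.e. with an empty host part."""
    return re.findall(r"http:///\S*", script)


# Trimmed, illustrative samples modelled on the two renderings above.
good = "#!ipxe\nset fog-ip 10.51.1.53\nset fog-webroot fog\n"
bad = (
    "#!ipxe\n"
    "set fog-ip\n"
    "set fog-webroot\n"
    "chain -ar http:///service/ipxe/boot.php##params ||\n"
)

print(find_empty_settings(bad))
print(find_hostless_urls(bad))
```

Running it against the saved output of each boot.php would flag the storage node's version and leave the main server's version clean.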
-
OK, it’s a remote device booting to a remote storage node (sorry about being intentionally slow, I’m trying to draw the picture here).
You updated the FOG server at HQ and the client at the remote site is having an ARP issue. At the remote site, what should the client be seeing for option 66? (I would expect it to see the storage node IP address.)
I really don’t think it’s even getting far enough to worry about the boot.php file. It’s getting option 66 and trying to find the MAC address of the device pointed to by option 66.
My initial reaction here is that it could be networking. I have seen routers with a really long ARP cache refresh time, where they may hold onto the old MAC-address-to-IP-address translation for 20 or 30 minutes, but eventually it will clear. Based on what you’ve done so far I would assume it’s been more than 30 minutes.
-
@george1421 said:
My initial reaction here is that it could be networking. I have seen routers with a really long ARP cache refresh time, where they may hold onto the old MAC-address-to-IP-address translation for 20 or 30 minutes, but eventually it will clear. Based on what you’ve done so far I would assume it’s been more than 30 minutes.
Try a week.
Also, previously the remote site’s next-server was set to the main FOG server’s IP and this worked fine. Now I’ve got it set to the FOG node, and I’m redirecting requests for that node’s boot.php file to the main FOG server to get it working, because there are apparently MySQL issues breaking that.
-
Tom figured out that booting from the storage node wasn’t working because some PHP was using old-style MySQL password mechanisms. He removed that, and it started working.
So, I don’t have to use my hack-ish redirect anymore (which was awesome).
But, the network issues with ARP remain.
-
@Wayne-Workman Here you find a nice explanation on ‘gratuitous ARP’ https://wiki.wireshark.org/Gratuitous_ARP
To me this looks like you have two switches (MACs 00:0f:23:4c:49:00 and 00:22:56:01:4e:44) in that VLAN using the same IP address (10.50.65.254). Both seem to be Cisco devices (judging from the MAC addresses), but I cannot be sure.
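For anyone wanting to check a capture for this kind of conflict, here is a rough Python sketch. It assumes you have already extracted (IP, MAC) pairs from ARP traffic by some other means; the function name and the third sample entry are hypothetical.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Map each IP seen with more than one MAC to the sorted list of MACs."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


# The first two pairs are the ones reported in this thread;
# the third is a made-up, well-behaved host for contrast.
observations = [
    ("10.50.65.254", "00:0f:23:4c:49:00"),
    ("10.50.65.254", "00:22:56:01:4e:44"),
    ("10.50.65.10", "aa:bb:cc:dd:ee:01"),
]

print(find_ip_conflicts(observations))
```

Any IP that shows up in the result is being claimed by more than one MAC, which is the situation described above.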
-
@Sebastian-Roth said:
@Wayne-Workman Here you find a nice explanation on ‘gratuitous ARP’ https://wiki.wireshark.org/Gratuitous_ARP
Interesting you mention this. I tried sending gratuitous ARP from both the storage node and the main fog server - multiple times. It didn’t make any difference.
To me this looks like you have two switches (MACs 00:0f:23:4c:49:00 and 00:22:56:01:4e:44) in that VLAN using the same IP address (10.50.65.254). Both seem to be Cisco devices (judging from the MAC addresses), but I cannot be sure.
I noticed that too. I reported both MAC addresses and the message to my network team and they told me that the second MAC address doesn’t exist on our network… so… not sure what to say about that. I’m sure the MAC exists somehow/somewhere and this error isn’t just sent out by a switch that is in a bad mood - it must be caused by… something.
-
Since you have two MAC addresses reporting, it would be interesting to see what tcpdump captures with each of these filters:
tcpdump with "ether host 00:0f:23:4c:49:00"
and
tcpdump with "ether host 00:22:56:01:4e:44"
What you are looking for is something distinguishable to help you locate this device. From the MAC addresses I can tell you that it should be two different devices (not a sub-interface on the same device).
From a logic standpoint, do you use Cisco gear for networking (switches and stuff) or just routers? I suspect the 00:0f:23 device is older than the 00:22:56 device.
If you have a device on that same subnet, it would be interesting if you did a
ping -b 10.66.15.255
(broadcast ping to the subnet broadcast address), then waited a few seconds, then did an
arp -a
and directed that into a text file. At this point I don’t care if you can find the MAC addresses in question. I would look for devices that have the same vendor code, 00:0f:23 or 00:22:56, with a relatively close device part. Once you do, use the IP address returned to track down a known device and find out what it is (make and model). That may help you narrow down your ghost device (like an old configuration on an L3 router).
-
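The vendor-code comparison suggested above can be scripted once the arp -a output is saved to a text file. The Python sketch below is only an illustration: the function name is hypothetical, the sample output is invented (Linux-style formatting, made-up device bytes), and real arp -a output differs between operating systems.

```python
import re
from collections import defaultdict

# Matches MACs written with ":" (Linux/macOS) or "-" (Windows) separators.
MAC_RE = re.compile(r"\b([0-9a-fA-F]{2}(?:[:-][0-9a-fA-F]{2}){5})\b")
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def group_by_oui(arp_output, ouis):
    """Group (ip, mac) pairs from arp output by the vendor codes of interest."""
    wanted = {oui.lower() for oui in ouis}
    groups = defaultdict(list)
    for line in arp_output.splitlines():
        mac_match = MAC_RE.search(line)
        ip_match = IP_RE.search(line)
        if not (mac_match and ip_match):
            continue  # header lines, incomplete entries, etc.
        mac = mac_match.group(1).lower().replace("-", ":")
        oui = mac[:8]  # first three octets, e.g. "00:0f:23"
        if oui in wanted:
            groups[oui].append((ip_match.group(1), mac))
    return dict(groups)


# Invented sample; substitute the contents of your own text file.
sample = """\
? (10.66.15.1) at 00:0f:23:aa:bb:01 [ether] on eth0
? (10.66.15.7) at 00:22:56:aa:bb:02 [ether] on eth0
? (10.66.15.9) at 3c:97:0e:11:22:33 [ether] on eth0
? (10.66.15.12) at <incomplete> on eth0
"""

print(group_by_oui(sample, ["00:0f:23", "00:22:56"]))
```

The IPs returned for each vendor code are the candidates to look up as known devices.
-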
Just found an interesting filter I didn’t know about yet:
tcpdump -ee "ether[0:4] == 0x000f234c"
(the bytes you are “grepping” for must be of length 1, 2 or 4 - so you can do “ether[0:1]” and “ether[2:2]” but you can’t do “ether[0:3]”)
Having a closer look at the MAC addresses I noticed the last byte being “00” on one of them. Then I remembered that every port usually has its own MAC on Cisco switches. “00” being the switch itself, “01” the first port and so on. You don’t usually see the switch MACs in IP communications as there will only be the MACs of source and destination in those packets (switches are transparent in that respect). But switches do send out traffic as well, like BPDU for spanning tree and stuff like that.
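As a side note on the tcpdump filter above: in an Ethernet frame the destination MAC occupies bytes 0-5 and the source MAC bytes 6-11, so ether[0:4] looks at the start of the destination address and ether[6:4] would look at the start of the source address. A small Python sketch that builds such a filter string and enforces the 1, 2 or 4 byte rule follows. The helper name is hypothetical and it only produces the text to hand to tcpdump.

```python
def mac_prefix_filter(prefix, field="dst"):
    """Build a BPF expression matching a 1, 2 or 4 byte MAC prefix.

    field="dst" compares against the destination MAC (frame offset 0),
    field="src" against the source MAC (frame offset 6).
    """
    octets = prefix.lower().split(":")
    if len(octets) not in (1, 2, 4):
        raise ValueError("prefix must be 1, 2 or 4 octets long")
    for octet in octets:
        if len(octet) != 2:
            raise ValueError("each octet must be two hex digits")
        int(octet, 16)  # raises ValueError on non-hex input
    offset = {"dst": 0, "src": 6}[field]
    return "ether[{}:{}] == 0x{}".format(offset, len(octets), "".join(octets))


print(mac_prefix_filter("00:0f:23:4c"))
print(mac_prefix_filter("00:0f:23:4c", field="src"))
```

The first call reproduces the filter quoted above; the second is the source-address variant.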
Using similar filters (eth.addr[0:4] contains 00:22:56:01) on that Wireshark dump, I found that 00:22:56:01:4e:44 has a “partner” with MAC 00:22:56:01:4e:02 (notice the change in the last byte) which sends out spanning tree messages (BPDUs) on a regular basis. Looking at those BPDUs I see “Bridge Identifier: 00:22:56:01:4e:00”. Does your network team know about this MAC/switch (“Root bridge: 00:0d:65:51:80:80” - if that’s of any help for them)???
http://www.ciscozine.com/how-to-trace-mac-address/ seems interesting in case they wanna trace the MAC.
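The “same switch, different port” reasoning above can be expressed as a quick check: if two MACs share their first five bytes, the difference in the last byte is the offset from the base address. This Python sketch is only a heuristic built on that observation, with hypothetical function names; it does not prove two MACs belong to the same chassis.

```python
def mac_to_int(mac):
    """Convert a MAC like 00:22:56:01:4e:44 to an integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def offset_from_base(mac, base):
    """Return mac - base if the first five octets match, else None."""
    mac_value, base_value = mac_to_int(mac), mac_to_int(base)
    if mac_value >> 8 != base_value >> 8:
        return None  # different first five octets: not the same block
    return mac_value - base_value


bridge_id = "00:22:56:01:4e:00"  # Bridge Identifier seen in the BPDUs
for mac in ("00:22:56:01:4e:44", "00:22:56:01:4e:02", "00:0f:23:4c:49:00"):
    print(mac, offset_from_base(mac, bridge_id))
```

With the addresses from this thread, the first two fall in the same block as the bridge identifier and the third does not.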
-
@Sebastian-Roth said:
Does your network team know about this MAC/switch (“Root bridge: 00:0d:65:51:80:80” - if that’s of any help for them)???
Well if they don’t, they will. Today is a snow day so I get to dive in again tomorrow.
Thank you both Sebastian and George for helping out - you guys are phenomenal.
-
@Wayne-Workman said:
Well if they don’t, they will. Today is a snow day so I get to dive in again tomorrow.
No worries, stay warm and safe.