Unable to PXE boot on from different subnet

Defcon

Server

FOG Version: 1.4.3
OS: 16.04 LTS

Client

Service Version:
OS: Windows 7, Dell Optiplex 390

Description

Hello,

I was wondering if anyone could help me resolve this. I currently have a FOG server running at the school (10.30.x.x), and it works perfectly on that subnet. On other VLANs, some devices can PXE boot and some can’t. On the 10.80.x.x subnet the Dell computer won’t PXE boot, but when I physically bring the computer onto this network it PXE boots just fine. I ran Wireshark between the client and server, and it seems the client just keeps reading the file undionly.kpxe over and over until it times out. The firewall on the FOG server is turned off, so it accepts all traffic.

Any ideas?

george1421

I’m still trying to digest the pcap file at this moment.

But I have to ask: there are no network firewalls between the two subnets?

I have to say, I know there is some flaky PXE boot firmware that will only PXE boot from the local subnet.

george1421

We’re missing the preamble on this TFTP dialog, but I find it strange that your block size is 558 bytes; I would expect something around 1,500 bytes. How is this subnet connected to your HQ subnet (where the FOG server is located)?

Defcon

Thank you for the reply! There is no firewall between the two subnets (that I know of) and it’s connected through a VLAN.

Here’s the filtered Wireshark capture file:
https://drive.google.com/file/d/0B1-vm6YfQDe1OFgzQUc5dEROTU0/view?usp=sharing

Defcon

George,

Don’t know why it didn’t show up earlier, but here’s the code below. I did snoop through the forums and found this and tried to change the bootfile to ipxe.kpxe instead of undionly.kpxe. Still no luck with connecting.

https://forums.fogproject.org/topic/8449/ipxe-initialising-devices/3

PXEEBG PXENV at  PXE at  No PXE stack found
 entry point at 
         UNDI code segment  data segment  kB
         UNDI device is PCI          Unable to determine UNDI physical device type  workaround enabled         kB free base memory after PXE unload
         UNDI API call  failed status code 
PSUtG
u
XP XPtXffP
iXPXfQgfYfQP1gXfYfPfUfhfhfh0    jfh    jfjjffffnff
ffFhfV  fVf ffXfQSbrffTfffffYjdPPdXfVfWfUf1f1fffffff1fQfWfVf1ffVfffffSBfYfffhsPh4hPh0fuffffffffffPfufff
fsffXfWf4f
fgffWff1ff42PhfWSf286fa1GfVfMz
Installation failed  cannot continue

george1421

@Defcon I don’t think that is your problem here, but by all means test it out. If it fixes it I will have learned something today.

Defcon

@george1421 Yep, you’re right. No luck haha! Any other thoughts?

george1421

@Defcon As I pointed out earlier, I find it strange that your packet size is 558. Is that consistent across your entire network or just the subnets in question?

In the case of 10.80.x.x, is that VLAN connected at GbE speeds, or is it at a remote location (off-campus)?

Is it correct to say every computer on that vlan can’t pxe boot or is it random?

Defcon

@george1421 It’s at a remote location, and the PXE booting is random for sure. The ones that work at the moment are the Dell Optiplex GX520 desktops. I tested PXE at the high school and the packet size is 1503. I am seeing the same 1503 packet size here as well, where the server is located. Yikes…

george1421

@Defcon So just to confirm at that remote location random pc’s will pxe boot?

How is that remote location connected to the main location? What technology is being used (MPLS, DSL, VPN over internet, etc.)?

Defcon

@george1421 Hey George, apologies for the slow response. I wanted to get more information regarding this, but the employee was away on vacation.

The school is actually a direct connect, no VPN etc. Apparently we own that fiber as well. Regarding the random PC’s that boot, so far it’s only the GX520 desktop that’s able to boot into PXE. All of the schools that are connected to the main district are direct connect (something new I learned). Do you think this may be a switch issue?

george1421

@Defcon OK, let’s focus on just a single computer on a single subnet that doesn’t PXE boot. Once that is identified, I would like to take a (dumb) unmanaged switch and place it between the building switch and the PXE booting computer to see if that ‘masks’ the issue.

Sebastian Roth

I feel like there are some parts of the picture still missing. So first I shall mention that the packet size is not playing a role here as far as I can see. The figure 558 is just the size of the full packet (TFTP plus headers for UDP, IP and ethernet/MAC). In the TFTP RFC (page 6) it says:

The data field is from zero to 512 bytes long.

So this is perfectly fine I reckon.
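To make the arithmetic behind the 558 figure explicit, here is a minimal sketch. It assumes an untagged Ethernet frame, an IPv4 header without options, and the default 512-byte TFTP block; the constant and function names are only illustrative.

```python
# Header sizes in bytes for a TFTP DATA packet carried over UDP/IPv4/Ethernet.
ETHERNET_HEADER = 14      # dst MAC + src MAC + EtherType (assumes no VLAN tag, FCS not counted)
IPV4_HEADER = 20          # assumes an IPv4 header without options
UDP_HEADER = 8
TFTP_DATA_HEADER = 4      # 2-byte opcode + 2-byte block number
TFTP_DEFAULT_BLOCK = 512  # maximum data field when no blksize option is negotiated


def tftp_frame_size(block_size: int = TFTP_DEFAULT_BLOCK) -> int:
    """Return the captured frame length of a TFTP DATA packet with the given payload."""
    return (
        ETHERNET_HEADER
        + IPV4_HEADER
        + UDP_HEADER
        + TFTP_DATA_HEADER
        + block_size
    )


print(tftp_frame_size())  # 558
```

So a full 512-byte data block shows up in the capture as a 558-byte frame, which matches the number in the screenshots.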

Second: The first transfer finishing and another one starting right after it, in the second picture of the initial post, looks strange at first. But there is a major time gap between those two, so I think this is because it’s trying again after a reboot.

@Defcon I am still missing the actual error here. The text you posted (ending with “Installation failed cannot continue”) seems to be copied from Wireshark and is probably just the readable strings within the transferred iPXE binary. Could you please describe exactly what you see when things go wrong? Take a picture or even a video of a failed PXE boot!

Defcon

@Sebastian-Roth Thank you for the reply! Sorry for the late response; people take a lot of vacation during the summer months of school.

I heard back from the technician over there, and she sent over screenshots, which are provided below.

HP 7800 Error Screen (screenshot)

Dell Optiplex 390 Error Screen (screenshot)

george1421

@Defcon I’m still of a mind to say that this is/could be a spanning tree issue. Can we assume both of these computers in the pictures above are on the same subnet in the same building?

Please test this idea by placing an unmanaged switch between the pxe booting computer and the building switch. Then pxe boot the computer. Confirm if you can get to the fog iPXE menu.

What you are experiencing is what we typically see when spanning tree (a good thing) is turned on but is not configured for one of the fast spanning tree protocols (fast-STP, RSTP, or whatever your switch manufacturer calls it). Placing the unmanaged switch between the PXE booting computer and the building switch will keep that building switch port from winking as the PXE booting computer starts up.
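A rough way to see why this matters is sketched below. With classic spanning tree, a port that has just come up spends the forward delay (15 seconds by default) in each of the listening and learning states before it forwards traffic, while a portfast/edge port forwards immediately. The client timeout here is a made-up illustrative value, not a measured figure for any particular PXE firmware, and the function names are hypothetical.

```python
FORWARD_DELAY_S = 15  # classic 802.1D default forward delay, spent once per state


def seconds_until_forwarding(edge_port: bool, forward_delay: int = FORWARD_DELAY_S) -> int:
    """Time a switch port stays blocked after link-up before it passes traffic."""
    if edge_port:
        # portfast / edge port: skip listening and learning entirely
        return 0
    # listening state + learning state
    return 2 * forward_delay


def boot_step_can_succeed(edge_port: bool, client_gives_up_after_s: int) -> bool:
    """True if the port is forwarding before the client stops retrying."""
    return seconds_until_forwarding(edge_port) < client_gives_up_after_s


# Hypothetical: assume the boot firmware gives up after about 20 seconds.
ASSUMED_CLIENT_TIMEOUT_S = 20

print(boot_step_can_succeed(edge_port=False, client_gives_up_after_s=ASSUMED_CLIENT_TIMEOUT_S))  # False
print(boot_step_can_succeed(edge_port=True, client_gives_up_after_s=ASSUMED_CLIENT_TIMEOUT_S))   # True
```

Each time the link drops and comes back during the boot handoff, the wait starts over on the building switch port. An unmanaged switch in between keeps the link to the building switch up, which is why it can mask the problem.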

JLE

@defcon said in Unable to PXE boot on from different subnet:

On the subnet that is 10.80.x.x the Dell computer won’t PXE boot, but when I bring the computer physically on this network it boots just fine in PXE.

When you move this computer are you hooking it up to an entirely different switch? I recently ran into both of those errors you have posted a picture of. Here’s a checklist I’ve found that works for us:

Ip-helper address or dhcp-relays set up on each VLAN, and on each switch.
Spanning-Tree set to rapid-pvst (because of the switch model that we have.)
Portfast enabled.

Specifically for that bottom error, I had to add all of the hosts to a new group (that I called Encryp Reset), go into that group’s general settings, and reset their encryption data. I deployed an image to the hosts again and got the same error (no configuration methods succeeded). Then I rebooted the computer, and on the next cycle it worked just fine. I’ve had to do this maybe 50-60 times so far. Random Dell Optiplex 990s just seem to do it.

Defcon

I just want to say thank you for all the help! I am going to see if there is a switch lying around that I can test this with. @JLE It’s on a completely different switch. I had a feeling it was some sort of issue with the switch, I just wasn’t sure what. Unfortunately I don’t have access to configure the switches (apparently a third party does this??). I’ll get back to you guys once I have found an unmanaged switch and tested it out.

lmioperations

Something JLE said is the first thing I thought about (ip helper-addresses). If you can get access to log in to the switch, find the relevant documentation online for your switch and verify whether the VLAN has an ip helper-address configured.

On our ProCurves, we could check like this:

show ip helper-address

or we could configure like this:

config
vlan <vlan_number>
ip helper-address <IP_of_your_DHCP_server>
write mem
end
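As a quick sanity check on which VLANs need a helper address at all, here is a small sketch using Python’s standard `ipaddress` module. The /16 masks and the server address 10.30.0.10 are assumptions for illustration only; substitute the real networks and the real DHCP server IP.

```python
import ipaddress


def needs_dhcp_relay(client_network: str, dhcp_server_ip: str) -> bool:
    """A client VLAN needs a relay (ip helper-address) when the DHCP server
    is not inside that VLAN's own network, because DHCP broadcasts are not
    routed between subnets."""
    network = ipaddress.ip_network(client_network)
    server = ipaddress.ip_address(dhcp_server_ip)
    return server not in network


# Hypothetical values: /16 masks and a DHCP server at 10.30.0.10 are assumed.
print(needs_dhcp_relay("10.80.0.0/16", "10.30.0.10"))  # True
print(needs_dhcp_relay("10.30.0.0/16", "10.30.0.10"))  # False
```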