Pxe-Boot gets hung up on TFTP
-
Hello,
We have been using FOG for deployment for a while with no issues until recently. While most machines boot perfectly, random machines will contact DHCP and then get stuck on TFTP. They eventually boot, but image deployment then takes hours and sometimes days. If we move the “problem” machine to a new VLAN, the machine boots just fine. We are primarily working with Dell Optiplex 3010, 3020, 3040, and 3050s doing a legacy boot. Any ideas? Thanks!
-
@zaccx32 said in Pxe-Boot gets hung up on TFTP:
get stuck on TFTP
- What does this mean, gets stuck??
- What error do you see?
- What mode is this computer in, BIOS or UEFI?
- Is the FOG server on the same subnet as the target computer?
- Is it predictable, and can you create the error on demand?
- Do all of the troubled systems have the latest available firmware (esp. the 3010s)?
- What device is your DHCP server (mfg and model)?
-
What does this mean, gets stuck?? - When the machine tries to contact TFTP, it normally takes a fraction of a second. Certain machines, however, will sit there and try to contact it for 15 minutes.
What error do you see? - There are no errors. It eventually boots up but takes over 30 minutes to boot into Windows.
What mode is this computer in bios or uefi? - Legacy.
Is the FOG server on the same subnet as the target computer? - All of the issues are in a different subnet.
Is it predictable and you can create the error on demand? - This is not predictable and happens at random. We will swap the machine out with one of the same make and model and it will boot fine. We have not been able to reproduce it on demand.
All of the troubled systems have the latest available firmware (esp the 3010s)? - This is a good question. Is there a good way to check the version of the FOG client on the machines?
What device is your dhcp server (mfg and model) - Windows Server 2012.
Thanks in advance! I am new to this world and have been learning on the go, so I apologize for any miscommunication!
-
@zaccx32 said in Pxe-Boot gets hung up on TFTP:
All of the troubled systems have the latest available firmware (esp the 3010s)? This is a good question. Is there a good way to check the version of the FOG client on the machines?
I should have said dell bios version.
So the same computer on the same network port will sometimes go fast and other times wait 15 minutes? If so, that sounds like network infrastructure. That iPXE boot loader is pretty small (< 100KB). It should go lightning quick. If you could predict the delay, it would be interesting to get a pcap (packet capture) of the pxe booting process from a mirrored port using wireshark.
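Since the boot loader is that small, one rough way to tell a slow network path from a slow client is to time the same TFTP download from a regular computer on the affected subnet. The following is only a minimal sketch of a TFTP read (RFC 1350, default 512-byte blocks, no option negotiation); the server address in the usage comment is a placeholder, and the function names are made up for this example.

```python
import socket
import struct
import time

TFTP_PORT = 69
OP_RRQ, OP_DATA, OP_ACK, OP_ERROR = 1, 3, 4, 5
BLOCK_SIZE = 512  # default TFTP block size (RFC 1350)


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ) + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def time_tftp_download(server: str, filename: str,
                       timeout: float = 5.0, max_timeouts: int = 10):
    """Fetch `filename` from `server`; return (bytes, seconds, timeouts)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    dest = (server, TFTP_PORT)
    last_packet = build_rrq(filename)
    expected_block, total, timeouts = 1, 0, 0
    start = time.monotonic()
    try:
        sock.sendto(last_packet, dest)
        while True:
            try:
                data, addr = sock.recvfrom(4 + BLOCK_SIZE)
            except socket.timeout:
                # Nothing arrived in time: count it and resend our last packet.
                timeouts += 1
                if timeouts > max_timeouts:
                    raise TimeoutError(f"gave up after {timeouts} timeouts")
                sock.sendto(last_packet, dest)
                continue
            if len(data) < 4:
                continue
            opcode, block = struct.unpack("!HH", data[:4])
            if opcode == OP_ERROR:
                raise RuntimeError(data[4:].rstrip(b"\0").decode("ascii", "replace"))
            if opcode != OP_DATA or block != expected_block:
                continue  # duplicate or unexpected packet, ignore it
            dest = addr  # the server answers from its own ephemeral port
            payload = data[4:]
            total += len(payload)
            last_packet = struct.pack("!HH", OP_ACK, block)
            sock.sendto(last_packet, dest)
            if len(payload) < BLOCK_SIZE:
                break  # a short block marks the end of the file
            expected_block = (expected_block + 1) & 0xFFFF
    finally:
        sock.close()
    return total, time.monotonic() - start, timeouts


# Example (192.0.2.10 is a placeholder for the FOG server address):
# size, seconds, timeouts = time_tftp_download("192.0.2.10", "undionly.kpxe")
# print(f"{size} bytes in {seconds:.2f}s with {timeouts} timeouts")
```

Running this once from a machine on the affected VLAN and once from a known-good VLAN gives a rough comparison. If a file this small needs many timeouts on one path only, that would be consistent with packet loss in the network rather than a problem on the client.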
I had an idea: if you had a rogue dhcp server, it might cause this issue without something timing out. Again, if you can predict the problem, you can capture at least the dhcp part using a witness computer on the same subnet as the failing computer. For wireshark you would use the capture filter of
port 67 or port 68
It would be interesting to see if you are getting more than one OFFER packet, and whether you can identify the host that sent each OFFER.
-
I just did a capture with wireshark and we are getting a single OFFER packet, and there is a successful 3-way handshake with the gateway, but then no other communication. The problem also showed up at random earlier on another machine that had been working fine this morning. I also updated the BIOS and that made no difference.
-
@zaccx32 Ok on a witness computer you get all 4 packets (discover, offer, request, ack) and they happen pretty quickly.
If you look at the offer packet, in the bootp header (above the dhcp options) there should be a {next-server} field, and that should be the IP address of the fog server. Down a little bit there should be a {boot-file} field; for bios computers it should be undionly.kpxe. If both are there, then scroll down a bit to dhcp options 66 and 67; those values should mirror the header exactly.
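The same check can be scripted if you would rather not click through every OFFER by hand. This is a minimal sketch that assumes you already have the raw UDP payload of each DHCP packet as bytes (getting them out of the pcap is not shown); the function names are made up for this example. It reads the next-server (siaddr) and boot-file fields from the fixed bootp header and compares them with options 66 and 67.

```python
import socket

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options
DHCP_OFFER = 2                      # value of option 53 for an OFFER


def parse_options(raw: bytes) -> dict:
    """Parse DHCP options (code, length, value) into {code: value bytes}.

    Simplification: option overload (option 52) is not handled.
    """
    options = {}
    i = 0
    while i < len(raw):
        code = raw[i]
        if code == 255:  # End option
            break
        if code == 0:    # Pad option, single byte
            i += 1
            continue
        if i + 1 >= len(raw):
            break        # truncated option
        length = raw[i + 1]
        options[code] = raw[i + 2:i + 2 + length]
        i += 2 + length
    return options


def _text(value: bytes) -> str:
    """Decode a NUL-terminated ASCII field."""
    return value.split(b"\0", 1)[0].decode("ascii", "replace")


def summarize_offer(payload: bytes):
    """Return the PXE-related fields of a DHCP OFFER, or None if not an OFFER."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    opts = parse_options(payload[240:])
    if opts.get(53) != bytes([DHCP_OFFER]):
        return None
    server_id = opts.get(54, b"")
    summary = {
        "dhcp_server": socket.inet_ntoa(server_id) if len(server_id) == 4 else None,
        "next_server": socket.inet_ntoa(payload[20:24]),  # siaddr field
        "boot_file": _text(payload[108:236]),             # file field
        "option_66": _text(opts[66]) if 66 in opts else None,
        "option_67": _text(opts[67]) if 67 in opts else None,
    }
    # Plain string comparison: a host name in option 66 will not match the
    # IP address in the header even if it resolves to the same machine.
    summary["consistent"] = (summary["next_server"] == summary["option_66"]
                             and summary["boot_file"] == summary["option_67"])
    return summary
```

If more than one dhcp server answers, comparing the summaries per `dhcp_server` shows quickly whether every server hands out the same next-server and boot file.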
Now if you can get a mirror port set up, what we should see right after the ACK is the computer attempting to reach out to the {next-server} and download the {boot-file} using the tftp protocol. To capture this on a mirror port you would need to use the capture filter of
port 67 or port 68 or port 69
to capture all of that in one pcap. I think somewhere between the ACK and the tftp download something must be falling down. If you start up tcpdump on the FOG server, you can capture the tftp request without needing a mirror port. So you will use wireshark on the witness computer and tcpdump on the fog server to get the entire picture. For tcpdump you can get the commands here: https://forums.fogproject.org/topic/9673/when-dhcp-pxe-booting-process-goes-bad-and-you-have-no-clue
There has to be something going on in that 15 minute pxe booting gap, like a lot of retransmissions or something.
-
@george1421
I have been running more captures on a machine that is booting very slowly and have noticed that it is getting multiple OFFER packets. Two are from a single server and another is from a second server. We do know what each of the servers is. We have two DHCP servers within the same scope that are for load balancing.
-
@zaccx32 said in Pxe-Boot gets hung up on TFTP:
We do have two DHCP servers within the same scope that are for load balancing
Make sure that both are set up to support pxe booting. We’ve seen situations where one is configured and the other is not, and the clients will get random pxe boots depending on which dhcp server responds first.
-
@george1421
Hey George,
I took a look at the two DHCP servers and both have the same server options, with the correct boot server host name and bootfile name. I have not noticed anything out of the ordinary with DNS and DHCP. I did find a way to identify which machines are being affected by looking for extremely slow copy speeds when deploying files with PDQ. 4MB files will take over an hour to copy on machines that are being affected. I have also noticed that the time to boot is inconsistent. Most times it will take 15-20 minutes, but a few times it takes 5 minutes and then takes 20 minutes on the next boot. When doing a pcap, I have noticed quite a few DNS query name errors as well.
-
@zaccx32 said in Pxe-Boot gets hung up on TFTP:
I did find a way to identify which machines are being affected by looking for extremely slow copy speeds when deploying files with PDQ. 4MB files will take over an hour to copy on machines that are being affected.
This here kind of tells me it's network infrastructure. What I would do is look at the network switch port (hopefully you have a managed switch) and look at the port counters. See if you are having a lot of CRC errors. Can you generalize and say all computers from area A have a problem but not from area B? If this is the case, then the troubles may be on an uplink port between the area A switch and the next switch in line. Again, the port counters might give you a clue to what is not right. IMO if you can duplicate the error with 2 different servers, then it's probably not the FOG server at fault.
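When reading port counters, what matters is usually whether the error counts keep growing, not their absolute value since the last reset. As a small illustration, assuming you have written down the CRC error counter per port at two points in time (the port names and numbers below are hypothetical, and how you read them depends on your switch), the growth per port could be computed like this:

```python
def crc_error_deltas(before: dict, after: dict, min_new_errors: int = 1) -> dict:
    """Return {port: new CRC errors} for ports whose counter grew between snapshots."""
    suspects = {}
    for port, count in after.items():
        delta = count - before.get(port, 0)
        if delta >= min_new_errors:
            suspects[port] = delta
    return suspects


# Hypothetical snapshots taken a few minutes apart while a slow copy is running.
before = {"Gi1/0/1": 10, "Gi1/0/2": 0, "Gi1/0/24": 1200}
after = {"Gi1/0/1": 10, "Gi1/0/2": 0, "Gi1/0/24": 1875}
print(crc_error_deltas(before, after))
```

A port whose count climbs during a slow transfer, especially an uplink, is a reasonable place to start checking cabling, duplex settings, or optics.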
I did notice something in your dhcp screen shot. It's not a problem with fog, but in your polycom scope you should probably remove the undionly.kpxe boot file name. It's not relevant to a voip phone and may cause problems.
The other thing: since your dhcp server supports policies, you might want to take a look at this wiki page to set up policies for both bios and uefi booting: https://wiki.fogproject.org/wiki/index.php/BIOS_and_UEFI_Co-Existence#Using_Windows_Server_2012_.28R1_and_later.29_DHCP_Policy