TFTP only working with certain VLANs.
-
Fog 1.5.4
Ubuntu 16.04 LTS
We began having issues this morning that prevented computers on a different VLAN than the fog server from finding the TFTP server; the PXE boot would prompt us to enter the TFTP server address. Going through a few threads, it seems that this is usually caused by multiple DHCP servers, but we only have one. Throughout the day, I tested a few more VLANs. It appears that only one VLAN is having issues.
I ran
sudo tcpdump -w output.pcap port 67 or port 68 or port 69 or port 4011
while PXE booting on both VLANs. VLAN 2 (and others) works; VLAN 3 doesn’t. I don’t really know what to look for, so I’ll attach the pcaps.
-
What are you using for a dhcp server? Do you have the option 66 and 67 defined globally or individually based on the vlan?
-
@drewklein22 Windows Server 2012 R2. Option 66 and 67 are global, but having them specific to the scopes doesn’t change the result unfortunately.
-
@zpoling OK, let’s start with another pcap. But this time we need to use wireshark on a computer connected to the vlan that doesn’t work. Use a capture filter of
port 67 or port 68
and post the results here. We can already see what communication is happening on the fog server side (good for comparison). What we still need to see is what the dhcp server is telling your remote vlan client.
-
@zpoling I’ve looked at your vlan 2 pcap and there is some strangeness going on there too. I see an offer and then a nak from the dhcp server in the same transaction. Basically that is the dhcp server saying “hey, I have this to offer” (I’m not seeing the client side), and then the dhcp server says nak (“never mind, I can’t help you”).
A functional dhcp/netboot process is simple:
(client) Discover
(dhcp server) Offer
(client) Request
(dhcp server) ACK
(client) -> tftp server: what size is file xxxxx
tftp server -> (client): it's yyy size
(client) -> tftp server: give me file xxxxx
tftp server -> (client): here's your first block
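For anyone reading a pcap by hand, the sequence above can be checked mechanically. This is only a rough sketch: it assumes you have already pulled the DHCP message type codes (option 53, as numbered in RFC 2132) for a single transaction out of the capture, and the function name `check_handshake` is made up for illustration.

```python
# DHCP message type codes as defined for option 53 in RFC 2132.
DHCP_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}

# The healthy exchange described above.
EXPECTED = ["DISCOVER", "OFFER", "REQUEST", "ACK"]


def check_handshake(type_codes):
    """Classify one DHCP transaction given its message type codes in capture order.

    Hypothetical helper: returns (ok, names, note).
    """
    names = [DHCP_TYPES.get(code, "UNKNOWN(%d)" % code) for code in type_codes]
    if "NAK" in names:
        return False, names, "server sent a NAK"
    # Collapse consecutive repeats (retransmissions, or offers from several servers).
    collapsed = [n for i, n in enumerate(names) if i == 0 or n != names[i - 1]]
    if collapsed == EXPECTED:
        return True, names, "complete handshake"
    return False, names, "incomplete or out-of-order handshake"
```

Calling `check_handshake([1, 2, 6])` reports the offer-then-nak pattern as a failure, while `check_handshake([1, 2, 3, 5])` passes. Keep in mind that a PXE ROM and iPXE each run their own exchange, so a full netboot normally shows two separate transactions.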
-
@george1421 The vlan 3 pcap is the one that doesn’t work. I’ll make another pcap later today. I can’t turn the option 66 and 67 back on right now, as it’ll prevent computers on the network from booting.
-
@zpoling said in TFTP only working with certain VLANs.:
I can’t turn the option 66 and 67 back on right now, as it’ll prevent computers on the network from booting
Will you explain this? Having these settings enabled only impacts systems that are pxe booting. Normal dhcp requests don’t use these two dhcp fields.
The pcap we need has to be captured from a computer connected to vlan 3. That way we can see what info the dhcp server is actually sending to the client computer.
-
@george1421 Notice in my picture that PXE boot stalls at “Please enter tftp server:” where I must enter the fog IP manually. On all the computers on vlan 3 that haven’t been booted up yet, they will stall at this section of the PXE boot. I don’t feel like getting more calls about this.
Around noon, every computer that will be turned on for the day should be turned on, so I’ll turn the options back on and begin testing.
-
@zpoling said in TFTP only working with certain VLANs.:
“Please enter tftp server:”
This is telling me that your dhcp server is not sending out the dhcp option 66 correctly -OR- you have 2 dhcp servers responding to the target computer (like in a primary and secondary configuration) where the second dhcp server doesn’t have the options set correctly.
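To tell those two cases apart from a capture, you can look at which server sent each offer (option 54, the server identifier) and whether options 66 and 67 are present in it. Below is a minimal sketch, assuming you have the raw UDP payload bytes of each DHCP offer available; the function names are hypothetical. It only reads the options area and ignores the `siaddr` and `file` header fields, which a server may also use to carry the boot server and file name.

```python
import socket

# Every DHCP payload has a 236-byte BOOTP header followed by this magic cookie.
MAGIC_COOKIE = b"\x63\x82\x53\x63"


def parse_dhcp_options(payload):
    """Return {option_code: raw_bytes} for a raw BOOTP/DHCP UDP payload."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP payload")
    options = {}
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option, has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option, stop parsing
        length = payload[i + 1]
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    return options


def summarize_offer(payload):
    """Pull out the fields that matter for PXE: server id, option 66, option 67."""
    opts = parse_dhcp_options(payload)
    server = opts.get(54)

    def text(code):
        if code not in opts:
            return None
        return opts[code].rstrip(b"\x00").decode("ascii", "replace")

    return {
        "server_id": socket.inet_ntoa(server) if server and len(server) == 4 else None,
        "tftp_server": text(66),
        "bootfile": text(67),
    }


def servers_missing_option_66(offers):
    """List the server identifiers whose offers carry no option 66."""
    summaries = [summarize_offer(p) for p in offers]
    return sorted({s["server_id"] for s in summaries
                   if s["server_id"] and s["tftp_server"] is None})
```

If more than one distinct `server_id` shows up for the same client, a second dhcp server is answering. If there is only one and `tftp_server` comes back as `None`, the server itself is not sending option 66 to that scope.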
FWIW: This issue is a networking infrastructure one and not specifically related to FOG.
-
@george1421 I don’t know how it couldn’t be sending the option out properly. If I have them set globally and every subnet works besides one of them, I don’t know what it could be. Unless someone happened to plug in a DHCP server enabled device to that subnet, but there doesn’t appear to be any evidence of that anywhere.
I’m sure this isn’t fog, but I’m still at a loss.
-
@zpoling First understand I’m only suggesting based on past experience. The system prompting for a tftp server in your screen shot is at an interesting point.
The boot server must have been detected correctly the first time, because the prompt for the tftp server address comes from INSIDE iPXE. So it worked the first time, and now, the second time a dhcp address is requested, it’s not detecting option 66 correctly. That (based on past experience) tells me there is more than one dhcp server responding.
Use wireshark on a computer connected to the failing vlan to grab the pcap while you pxe boot a target computer. That will tell us what is flying down the wire. Since it’s only one vlan I would focus on that vlan, because something unexpected is going on there.
-
@george1421 I would love to send you a pcap of the issue, but it’s no longer happening. I’ve tested several workstations a couple dozen times. They’ve all been fine.
So the VLAN 3 pcap was not enough to see what was going wrong?
-
@zpoling said in TFTP only working with certain VLANs.:
So the VLAN 3 pcap was not enough to see what was going wrong?
Was the pcap taken from a computer connected to vlan 3 while it was doing the asking for tftp server question? It looks like it was taken from the fog server.
-
@george1421 No, it was a tcpdump from the fog server.
-
@zpoling That is what I thought.
Without going too deep into this: DHCP uses broadcast messages to communicate. If there is a router between the system doing the pcap collection and the pxe booting computer, we will not see the client’s dhcp broadcasts. Those are the bits that are missing: what the client computer was being told.
With a pcap from vlan 3 we should have seen the complete dhcp sequence of discover, offer, request, ack.
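A quick way to sanity-check whether a capture machine can even see the client's broadcasts is to test whether both sit in the same subnet. This is a rough sketch only: the helper name is made up, and the prefix length is something you have to supply from your own network, since the real masks are not stated anywhere in this thread. Being in the same subnet is also only an approximation of being on the same vlan.

```python
import ipaddress


def same_broadcast_domain(capture_ip, client_ip, prefix_len):
    """Return True if both IPv4 addresses fall in the same subnet.

    Hypothetical helper: prefix_len is an assumption supplied by the caller.
    """
    capture_net = ipaddress.ip_interface(
        "%s/%d" % (capture_ip, prefix_len)
    ).network
    return ipaddress.ip_address(client_ip) in capture_net
```

With the addresses reported later in this thread and an assumed /24 prefix, `same_broadcast_domain("10.32.0.31", "10.32.10.118", 24)` returns `False`, which matches a capture on the fog server missing the client side of the exchange.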
-
@george1421 Here you go. Taken from my computer on VLAN 3, attempting a pxe boot on the computer that was having an issue. Though like I said, it seems to be working fine now.
0_1541174066153_vlan3computer.pcap
The computer in question is 10.32.10.118, DHCP server is 10.32.0.224, and the fog server is 10.32.0.31.
-
@george1421 Don’t spend too much time on it if at all since it’s working now. I’d love to know why it wasn’t working, but I assume those packet captures aren’t going to show jack since they weren’t captured during the issue’s occurrence.
-
@zpoling DHCP handshake looks good in the latest PCAP but as you said we won’t find out what was going on when it failed. More often than not those kinds of issues keep coming back until you really find and fix them. Make sure to capture the traffic when you see it happening next time.
Marking this as solved for now.