PXE-E78 Cannot locate boot server

mkstreet

I did some experimenting and noticed this anomaly.

The Fog server and the host I want to load are on two switches which are connected together.
When I attach the Ethernet cable to the outside LAN / Internet to one of those switches, then I get the behavior as noted. Meaning the DHCP gets answered but when I press F8 to network boot, I get the PXE-E78 error.

When I disconnect that cable to the outside, then the PC I want to load can no longer find the DHCP. So something external to the LAB must be helping get the initial requests to 10.0.253.24 ?

mkstreet

@george1421

compteach@iepcomlabsrv:~$ netstat -an|grep 4011
udp        0      0 0.0.0.0:4011            0.0.0.0:*
compteach@iepcomlabsrv:~$

george1421

@mkstreet said in PXE-E78 Cannot locate boot server:

@george1421

When I disconnect that cable to the outside, then the PC I want to load can no longer find the DHCP. So something external to the LAB must be helping get the initial requests to 10.0.253.24 ?

This I understand, since dnsmasq is not providing dhcp services for you; it’s only providing dhcpProxy services (filling in the gaps left by your main dhcp server). What is strange is that your main dhcp server is sending out itself as the next server. I simply can’t understand why it’s not working here. It SHOULD BE WORKING.

TBH right now I’m at a loss on where to turn next; everything should be working. If you can disable the dhcp relay in your router (10.0.253.1) for the 10.0.253.0 subnet, then you can have the fog server with isc dhcp enabled supply the IP address (then dnsmasq is not needed either). Or, on your router (10.0.253.1), add your dnsmasq server as the last dhcp server in its list. But this is starting to get messy.

The only other thing is to see if you can get your main dhcp server to NOT send out dhcp option 66 {next-server}. But it’s not clear if this will fix the issue either.

mkstreet

@george1421

Hmmm… If I understand this correctly, then I cannot disable dhcp relay within 10.0.253.1 as other hardware in this subnet but outside my lab would still need dhcp service for other purposes.
And, as you say, this path is getting messy.

As for changing the main dhcp option 66, I could try to request this. Would this affect only my subnet or our whole facility? My lab is about 90% of my subnet, but the main dhcp is servicing the whole campus which is comprised of several subnets… If the option 66 will affect others outside my area, then it is hard for me to do.

I am setting up the new version of FOG etc under VirtualBox. I think I will complete that, as it is a new clean install. I will see if this clean start resolves anything, as opposed to this – attempting to add DNSMASQ to an existing setup that (was) working.

Sebastian Roth

@mkstreet I understand that it is hard or maybe impossible to change the config of that 10.0.253.1 server. As you said, dnsmasq can be used for exactly this purpose. So let’s give it another go. I’d say dnsmasq is answering faster than the other server as it is located right within your subnet. The PCAP output kind of proves this: 10.0.253.24 answered 0.5 seconds before 10.0.253.1 did. So that’s good!
Then we only need to offer the correct PXE information to the client in one single DHCP answer. This is next-server and filename. Please modify the following line in your config and add the server IP as shown:

dhcp-boot=undionly.kpxe, 10.0.253.24, 10.0.253.24
pxe-service=X86PC, "Boot from network", undionly, 10.0.253.24

The latter one shouldn’t be used, but setting it correctly doesn’t hurt, I’d say. Please take another PCAP capture to see if the next-server info is now being sent by dnsmasq.

[edit] I just saw that the information in the wiki page does not set those addresses. I haven’t played with dnsmasq in a while so this is just a quick idea. It’s kind of strange that you get an answer from dnsmasq that does not have next-server set… [/edit]

Sebastian Roth

I just had a look at the dnsmasq code (version 2.68-1ubuntu0.1 used in Ubuntu 14.04) and found that, going by the initially posted log output, next-server (mess->siaddr.s_addr in the code) is actually not being set. Now I think I know what’s going on. If I remember correctly, dnsmasq in proxy mode does not have to send the next-server information in the first DHCP answer (the reply to the first DHCP discovery request). The client knows that there is a DHCP proxy server as it got a first quick message (only containing the filename) and should contact that server (port 4011) after finishing the normal DHCP handshake to set up an IP.

In your case the next-server information sent by 10.0.253.1 is most probably interfering and confusing the client. I guess I need to think a little more about this to find a good solution… Maybe George has an idea.
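
For anyone who wants to check the next-server field themselves instead of reading it off Wireshark, here is a minimal sketch in Python. It reads the fixed BOOTP header fields (yiaddr, siaddr, sname, file) from a raw DHCP payload, using the byte offsets from RFC 2131. The helper names `parse_bootp_header` and `build_offer`, the synthetic packets, and the client address 10.0.253.50 are assumptions made for illustration; they are not taken from the pcap files in this thread.

```python
import socket
import struct


def parse_bootp_header(payload: bytes) -> dict:
    """Extract boot-related fields from a raw BOOTP/DHCP payload (the UDP data)."""
    if len(payload) < 236:
        raise ValueError("fixed BOOTP header is 236 bytes; payload too short")
    # Offsets per RFC 2131: yiaddr 16-19, siaddr 20-23, sname 44-107, file 108-235.
    yiaddr = socket.inet_ntoa(payload[16:20])
    siaddr = socket.inet_ntoa(payload[20:24])
    sname = payload[44:108].split(b"\x00", 1)[0].decode("ascii", "replace")
    bootfile = payload[108:236].split(b"\x00", 1)[0].decode("ascii", "replace")
    return {"yiaddr": yiaddr, "next_server": siaddr, "sname": sname, "file": bootfile}


def build_offer(yiaddr: str, siaddr: str, bootfile: str) -> bytes:
    """Build a synthetic 236-byte reply header for testing (hypothetical helper)."""
    header = struct.pack("!BBBBIHH", 2, 1, 6, 0, 0x12345678, 0, 0)  # op=BOOTREPLY
    header += socket.inet_aton("0.0.0.0")   # ciaddr
    header += socket.inet_aton(yiaddr)      # yiaddr
    header += socket.inet_aton(siaddr)      # siaddr (next-server)
    header += socket.inet_aton("0.0.0.0")   # giaddr
    header += bytes(16)                     # chaddr (left empty here)
    header += bytes(64)                     # sname (left empty here)
    header += bootfile.encode("ascii").ljust(128, b"\x00")  # file
    return header


if __name__ == "__main__":
    # A proxy-style answer: filename present, next-server left at 0.0.0.0.
    print(parse_bootp_header(build_offer("0.0.0.0", "0.0.0.0", "undionly.kpxe")))
```

A next_server of 0.0.0.0 in the dnsmasq answer would match the "next-server not being set" observation above. Note that this only looks at the fixed header; boot information moved into DHCP options is not decoded here.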

george1421

@Sebastian-Roth I think I like your first suggestion, updating the config file with the additional IP references:

dhcp-boot=undionly.kpxe, 10.0.253.24, 10.0.253.24
pxe-service=X86PC, "Boot from network", undionly, 10.0.253.24

If that fails, get another pcap file of the booting process to let us see what changed in the conversation.

My second thought is: wow, Ubuntu ships dnsmasq 2.68, which was released 08-Dec-2013, whereas most of the distros are at 2.72. If this doesn’t work I can set up an Ubuntu VM and compile the latest version of dnsmasq to see if that helps. But before I go through that effort, let’s see if your edits work.

Tom Elliott

@mkstreet Comment the port=0 line of your ltsp.conf file and restart dnsmasq.

george1421

@Tom-Elliott said in PXE-E78 Cannot locate boot server:

@mkstreet Comment the port=0 line of your ltsp.conf file and restart dnsmasq.

I talked with Tom over IM and he said the port=0 command makes the dnsmasq server become a DNS server and does exactly what we are seeing with the resolv.conf file. While this has no impact on the next host being sent, it should fix name resolution on the FOG server.

Tom Elliott

@george1421 More accurately, commenting out port=0 allows the DNSMasq portion to pass the originating DNS information on to the new host. Leaving port=0 enabled essentially turns off DNS. If you’re planning to leave port=0 enabled, then you’ll likely need to change the next-server to point at an IP address rather than a hostname.
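
A quick way to see which state a config file is in: the sketch below, in Python, reports whether an uncommented port=0 line is present. The function name `dns_disabled` is made up for this example, and the comment handling is deliberately simplified (everything after a # is dropped), so treat it as a rough check rather than a real dnsmasq config parser.

```python
def dns_disabled(config_text: str) -> bool:
    """Return True if the config contains an active (uncommented) port=0 line."""
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplified: drop anything after '#'
        if not line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "port" and value.strip() == "0":
            return True
    return False


if __name__ == "__main__":
    # /etc/dnsmasq.d/ltsp.conf is the path used in this thread; adjust as needed.
    with open("/etc/dnsmasq.d/ltsp.conf") as handle:
        print("DNS disabled:", dns_disabled(handle.read()))
```

With port=0 active, dnsmasq does no DNS at all and only handles DHCP and/or TFTP; with the line commented out, it also listens as a DNS server.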

mkstreet

@george1421

OK. The problem with dns and resolv.conf seems ok now. I am able to do apt-get updates and ping external places such as google.com. Oddly, the resolv.conf just shows the loopback:

compteach@iepcomlabsrv:/etc$ cat resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
compteach@iepcomlabsrv:/etc$
compteach@iepcomlabsrv:/etc$ ping google.com
PING google.com (110.164.6.251) 56(84) bytes of data.
64 bytes from mx-ll-110.164.6-251.static.3bb.co.th (110.164.6.251): icmp_seq=1 ttl=55 time=2.59 ms
64 bytes from mx-ll-110.164.6-251.static.3bb.co.th (110.164.6.251): icmp_seq=2 ttl=55 time=2.58 ms
64 bytes from mx-ll-110.164.6-251.static.3bb.co.th (110.164.6.251): icmp_seq=3 ttl=55 time=2.75 ms
^C
--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 2.588/2.644/2.751/0.095 ms

I made the changes to the ltsp.conf
a) comment out the port=0
b) change the dhcp-boot
c) change pxe-service

However, there is no change in the behavior.

I captured a new pcap and posted it here:
http://s000.tinyupload.com/?file_id=29844646319354557181

And for completeness, here is my ltsp.conf:

compteach@iepcomlabsrv:/etc/dnsmasq.d$ cat ltsp.conf
# Don't function as a DNS server:
# MKS  06-Oct-2016
#port=0

# Log lots of extra information about DHCP transactions.
log-dhcp

# Dnsmasq can also function as a TFTP server. You may uninstall
# tftpd-hpa if you like, and uncomment the next line:
# enable-tftp

# Set the root directory for files available via FTP.
tftp-root=/tftpboot

# The boot filename, Server name, Server Ip Address
dhcp-boot=undionly.kpxe,10.0.253.24,10.0.253.24

# rootpath option, for NFS
#dhcp-option=17,/images

# kill multicast
#dhcp-option=vendor:PXEClient,6,2b

# Disable re-use of the DHCP servername and filename fields as extra
# option space. That's to avoid confusing some old or broken DHCP clients.
dhcp-no-override

# PXE menu.  The first part is the text displayed to the user.  The second is the timeout, in seconds.
pxe-prompt="Press F8 for boot menu", 3

# The known types are x86PC, PC98, IA64_EFI, Alpha, Arc_x86,
# Intel_Lean_Client, IA32_EFI, BC_EFI, Xscale_EFI and X86-64_EFI
# This option is first and will be the default if there is no input from the user.
pxe-service=X86PC, "Boot from network", undionly, 10.0.253.24

# A boot service type of 0 is special, and will abort the
# net boot procedure and continue booting from local media.
#pxe-service=X86PC, "Boot from local hard disk", 0

# If an integer boot service type, rather than a basename is given, then the
# PXE client will search for a suitable boot service for that type on the
# network. This search may be done by multicast or broadcast, or direct to a
# server if its IP address is provided.
# pxe-service=x86PC, "Install windows from RIS server", 1

# This range(s) is for the public interface, where dnsmasq functions
# as a proxy DHCP server providing boot information but no IP leases.
# Any ip in the subnet will do, so you may just put your server NIC ip here.
# Since dnsmasq is not providing true DHCP services, you do not want it
# handing out IP addresses.  Just put your servers IP address for the interface
# that is connected to the network on which the FOG clients exist.
# If this setting is incorrect, the dnsmasq may not start, rendering
# your proxyDHCP ineffective.
dhcp-range=10.0.253.24,proxy

# This range(s) is for the private network on 2-NIC servers,
# where dnsmasq functions as a normal DHCP server, providing IP leases.
# dhcp-range=192.168.0.20,192.168.0.250,8h

# For static client IPs, and only for the private subnets,
# you may put entries like this:
# dhcp-host=00:20:e0:3b:13:af,10.160.31.111,client111,infinite
#dhcp-host=f8:0f:41:a0:04:75,net:allow
#dhcp-ignore=#allow

mkstreet

@george1421

On another box, I installed Ubuntu 16.04 LTS and Fog 1.3.0-RC-11.

I shut down the dnsmasq on the 10.0.253.24 box and started dnsmasq on the new box with an ltsp.conf etc.

I get the same behavior.

The PC boots and finds the internal DHCP (171.xxxx) and gets to the new box whose IP is 10.0.253.23.

It prompts me for F8 to boot from the network. Then I get:

UD 10.0.253.23

Which times out with the PXE-E78 error.

I captured a new pcap for this in case this is helpful.
This pcap is at:
http://s000.tinyupload.com/?file_id=97921552308199994673

mkstreet

@george1421

To experiment, I used the new Fog 1.3 install to contact that same PC with a hardware inventory request from Fog.

I noticed that…

The WLON did not happen. In the past, WLON worked.
When I manually turned on that PC, I got the same results: it found the Fog server and prompted me to press F8. When I did, I got the same PXE-E78 error, and in Fog the active task showed the hardware inventory as still in progress.

I thought that this would be a way to attempt communication that did not involve TFTP boot.

I created a pcap file using the same command you gave me before
(sudo tcpdump -w output.pcap port 67 or port 68 or port 69 or port 4011)
but I don’t know if the port filters on this are suitable to capture needed info for the WLON/hardware inventory tasks…

This pcap file is at:
http://s000.tinyupload.com/?file_id=61578123182931059079
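
On the port filter question: a Wake-on-LAN magic packet is 6 bytes of 0xFF followed by the target MAC address repeated 16 times, and it is commonly sent as a UDP broadcast to port 9 (sometimes 7). Which port FOG uses is not stated in this thread, so that part is an assumption, but a filter limited to ports 67, 68, 69 and 4011 would most likely not capture it. A minimal Python sketch of the payload, with a placeholder MAC address and a made-up function name:

```python
def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet payload for the given MAC address."""
    hex_mac = mac.replace(":", "").replace("-", "")
    if len(hex_mac) != 12:
        raise ValueError("MAC address must contain 12 hex digits")
    mac_bytes = bytes.fromhex(hex_mac)
    # 6 x 0xFF synchronisation bytes, then the MAC repeated 16 times (102 bytes).
    return b"\xff" * 6 + mac_bytes * 16


if __name__ == "__main__":
    # 00:11:22:33:44:55 is a placeholder, not a host from this thread.
    payload = magic_packet("00:11:22:33:44:55")
    print(len(payload), payload[:12].hex())
```

Knowing the payload shape makes it easier to spot the packet in Wireshark if the capture filter is widened.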

Tom Elliott

@mkstreet What is WLON? I’m imagining it’s Wake on LAN? That will only work at a “Layer 2” level. To prove it, if you have a system on the same switch as the fog server and try to WOL it, it should turn on (unless it’s one of the systems like Apple that only allows WOL to work if the machine is sleeping, not powered off).

What if you commented “dhcp-no-override”? If I’m understanding this particular option – Per the man page:

--dhcp-no-override
(IPv4 only) Disable re-use of the DHCP servername and filename fields as extra option space. If it can, dnsmasq moves the boot server and filename information (from dhcp-boot) out of their dedicated fields into DHCP options. This make extra space available in the DHCP packet for options but can, rarely, confuse old or broken clients. This flag forces “simple and safe” behaviour to avoid problems in such a case.

If I’m to understand this particular item, it prevents the configuration (in proxy mode?) from overriding the information that’s sent in the “main” packet.

Sebastian Roth

@mkstreet Besides the issues we were talking about already, you seem to still have the FOG server configured via DHCP! This time I see dnsmasq answers from 10.0.253.23 in the PCAP files. This would cause problems even if everything else is fine. Make sure you set up your FOG server to have a static IP!!

george1421

@Sebastian-Roth said in PXE-E78 Cannot locate boot server:

@mkstreet you seem to still have the FOG server configured via DHCP!

I was just thinking about this on the drive in this morning. This fog server was on an isolated network, so it was the dhcp server then. I was wondering if the OP remembered to stop the isc dhcp server when he set up dnsmasq?? This might cause this exact issue, since dnsmasq would not be able to bind to the udp ports if they are already in use. But I also considered that dnsmasq should complain about not being able to bind to the ports, so I kind of pushed that idea to the back of the list of possible causes. As I stated before, this configuration should be working. Dnsmasq is not that hard to set up.
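
The "ports already in use" idea can be tested directly. Below is a minimal Python sketch that tries to bind a UDP port and reports whether another process already holds it. The function name `udp_port_in_use` is made up for this example. It assumes Linux behaviour, and binding low ports such as 67 or 69 needs root, otherwise the permission error is raised instead of a yes/no answer.

```python
import errno
import socket


def udp_port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding the UDP port fails because it is already bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # e.g. EACCES when binding port 67 without root
    finally:
        sock.close()
    return False


if __name__ == "__main__":
    # Ports discussed in this thread: DHCP server (67), TFTP (69), proxyDHCP (4011).
    # Ports 67 and 69 need root; run with sudo or expect a PermissionError.
    for port in (67, 69, 4011):
        print(port, "in use" if udp_port_in_use(port) else "free")
```

Run it with dnsmasq stopped: any port still reported as in use is held by something else, for example a leftover dhcp or tftp daemon.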

mkstreet

@george1421

It’s a good question about isc-dhcp. I haven’t ever been using it as far as I know.
But I checked this morning after booting the lab

For 10.0.253.24 (Ubun 14.04, Fog 3121), I get the following:

compteach@iepcomlabsrv:~$ sudo service isc-dhcp-server status
isc-dhcp-server stop/waiting
compteach@iepcomlabsrv:~$
compteach@iepcomlabsrv:~$
compteach@iepcomlabsrv:~$ sudo service isc-dhcp-server stop
stop: Unknown instance:
compteach@iepcomlabsrv:~$
compteach@iepcomlabsrv:~$ sudo service isc-dhcp-server status
isc-dhcp-server stop/waiting
compteach@iepcomlabsrv:~$

mkstreet

@Tom-Elliott

RE: WOL.
Yes, I meant Wake On LAN by WLON.
This used to work but doesn’t work now.
Yesterday, I offered this information and the pcap as additional information that might help uncover something.

RE: dhcp-no-override
For DHCP-NO-OVERRIDE, I commented this out in ltsp.conf and restarted dnsmasq … on 10.0.253.24 (Ubuntu 14.04, Fog 3121) and it had no effect.

I captured a pcap of this:
http://s000.tinyupload.com/?file_id=06163152893454480790

Should I put this ltsp.conf option back or leave commented out?

mkstreet

@Sebastian-Roth

Yesterday, I did send a pcap from 10.0.253.24 (Ubu 14.04/Fog 3121).
This is my original Fog setup. It is setup with static IP.
Here is the link to yesterday’s pcap from that box:
http://s000.tinyupload.com/?file_id=29844646319354557181

In addition, I created another pcap today for Tom Elliott from that box. Here is the link to that pcap:
http://s000.tinyupload.com/?file_id=06163152893454480790

I will now go check the settings on 10.0.253.23 (Ubu 16.xx / Fog 1.3 release candidate 11).
I suspect you are correct that this one is not static, but I will check that now and post information about that box (and pcap).

george1421

I’ll restate again, it should be working.

I want you to learn what I’m seeing (not that I really know what I’m looking at). If you install wireshark on your computer you can review these pcap files.

Below is the communication that is going on as viewed by your FOG server. In step 1 you see the client send a discover packet (basically “hello, I’m here, I need network info”). In step 2, two devices reply with an offer (“here’s your network info”). In step 3 the client replies, “great, this is a list of additional stuff I need”. In step 4 your main dhcp server says “ok, here is the additional stuff you requested”. Note that the dnsmasq box did not reply here because it couldn’t add anything to what had already been sent.

Now here is where the process falls down. When the client gets the ACK back, what it should do is contact the dhcpProxy on port 4011 and request the file name to download, then reach out to the tftp server (listed in the next server field) and download the boot loader file (undionly.kpxe). That is what is supposed to happen. Now let me show you a side by side of what a proper exchange should look like.
