Some hosts are unable to get an address through DHCP

mageta52

@Sebastian-Roth I could try analyzing the traffic coming into the fog server during an attempted PXE boot to see what’s going on. I’ll report back in a while with the findings.

Sebastian Roth

@mageta52 You are more than welcome to upload a pcap file here in the forums or send me a private message if you need help with finding the issue in the packet dump. As you can read in the forums, we’ve done this a couple of times and I feel this is one of the best ways to help people debug their network issues. When you start looking at the packets and understanding what’s going on, that is when you find the solution. No worries, we’ll help you with that.

mageta52

Once again, the same pattern, machine 1 gets through and can register, machine 2 fails and stalls out.

I’m not terribly familiar with tcpdump, but it’s built in, so this is what I got from the second machine…

16:52:11.886022 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:14.893389 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:14.893609 IP 192.168.235.52 > 192.168.235.17: ICMP echo request, id 35325, seq 0, length 28
16:52:15.887222 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:15.894726 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:16.889224 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:17.891224 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:18.902506 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:18.902730 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:19.903230 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:19.903360 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:20.905230 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:22.911623 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:22.911836 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:23.912965 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:23.913229 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:24.915229 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:26.920739 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:26.920963 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:27.922086 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:27.923228 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:28.925225 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:30.929858 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:30.930083 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:31.931204 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:31.931234 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:32.933228 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:34.938973 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:34.939191 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:35.939229 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:35.939363 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:36.941229 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:38.948090 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 548
16:52:38.948310 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:39.949434 IP 192.168.235.52.bootps > 255.255.255.255.bootpc: UDP, length 300
16:52:39.951227 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28
16:52:40.953226 ARP, Request who-has 192.168.235.17 tell 192.168.235.52, length 28

It looks like it wants to assign 192.168.235.17, but is unable to. My guess is that the arp is to make sure that the address is not in use already, but I’m not able to figure out if that’s coming from the PC, or from the fog server? Unfortunately I lost the successful exchange with the first machine. The output got blown away by the data transfer during the machine registration. If more info is needed I can provide it. If there are any switches to turn on with tcpdump for better results let me know and I’ll run the test again.
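
One way to approach the "PC or fog server?" question from the text output alone: in tcpdump's default output the sender is on the left of `>`, `bootps` (UDP 67) is the DHCP server port and `bootpc` (UDP 68) is the client port, and an ARP request names the asker after `tell`. A minimal sketch, assuming the dump above was saved to a text file (the file name `dump.txt` and the function name `summarize_dump` are hypothetical), tallies the three kinds of lines:

```shell
#!/bin/sh
# summarize_dump: read default tcpdump text output on stdin and count
# client->server DHCP packets, server->client DHCP packets and ARP requests.
# (Hypothetical helper, not part of tcpdump or FOG.)
summarize_dump() {
    awk '
        # Source port bootpc (68) to destination port bootps (67): sent by a client.
        $3 ~ /\.bootpc$/ && $5 ~ /\.bootps:$/ { client++ }
        # Source port bootps (67) to destination port bootpc (68): sent by a server.
        $3 ~ /\.bootps$/ && $5 ~ /\.bootpc:$/ { server++ }
        # "ARP, Request who-has X tell Y": Y is the host asking.
        $2 == "ARP," && $3 == "Request"       { arp++ }
        END {
            printf "client_to_server=%d server_to_client=%d arp_requests=%d\n",
                   client, server, arp
        }
    '
}
```

Run it as `summarize_dump < dump.txt`. In the excerpt above every ARP request says `tell 192.168.235.52`, which is also the address sending from the `bootps` port, so the ARP probes appear to come from the DHCP server rather than from the client.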

Wayne Workman

@mageta52 Those are just ARP broadcasts. We need to see everything. Look here: https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_TFTP#Troubleshooting
There are instructions in there for tcpdump.

Sebastian Roth

@mageta52 This is not looking bad. But we need the full “content” of all those packets. You just need to add the -w command line parameter to dump to a file. A filter is probably a good idea as well: tcpdump -w /tmp/dhcp_works_sometimes.pcap udp or arp. Leave the command sitting there and do your client bootups. After success and failure, stop tcpdump (ctrl + c) and upload that file to the forum.

mageta52

@Sebastian-Roth I’m completely swamped at work this week, and have to abandon this for a while, but I hope to capture the data on Monday.

Sebastian Roth

@mageta52 All good. Just get back to us when you have time and I am sure we can get you(r DHCP) sorted…

mageta52

@Sebastian-Roth 0_1463176833850_dhcp_fail.pcap

Never uploaded a file before, but there is my attempt, hopefully it can be salvaged.

This was 1) a successful connection from a machine, with full registration, then 2) unsuccessful connection attempt by second machine. Got hung up looking for a DHCP address.

Sebastian Roth

@mageta52 Thanks for the packet dump. I see a perfect first boot. Seems all fine as you said. The only thing that got my attention was the time between DHCP discovery request sent by the client and DHCP offer sent back by the server. There is a one second delay which I’ve never seen before I think. Let’s keep that in mind although I am not sure where this is coming from and if that might play a role here.
Then after the first successful boot I see a nice DHCP discovery request sent by a different client (MAC address). Again, one second delay followed by a DHCP offer. Although I am not exactly sure the offer is making it to the client, I guess it does (as all the other communication is fine). Then I see a couple more discovery/offer pairs but no request/ack to properly finish the DHCP talk.

Another odd thing I see is a lot of ARP requests from your server. It keeps asking “who is 192.168.235.1”. Either you have this IP configured as DNS server (cat /etc/resolv.conf) or as default gateway (route -n) plus maybe an external DNS server.

Just a wild guess. Your server is trying to resolve a reverse DNS entry before handing out the IP!!! Taking a second for the timeout which then confuses the client…

Ha, wait a second. I think I might have found something else. When the first client requests an IP the server asks “who has 192.168.235.17” to check if this IP is in use already. That’s perfectly fine. But then the next client comes and asks for an IP and I see the server offering the same IP to the different client and again sending an ARP broadcast “who has 192.168.235.17”. Possibly you are running out of leases??? Check the system logs while this is happening: tail -f /var/log/syslog | grep dhcp (or maybe /var/log/messages or /var/log/daemon.log)

Maybe a simple restart of the DHCP service can fix this? Sure you haven’t changed your DHCP config? The one you posted seems fine (range from .235.10 to .235.254).
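
The "same IP offered to a different client" observation can also be cross-checked from the server log. A rough sketch, assuming ISC dhcpd's usual log wording (`DHCPOFFER on <ip> to <mac> via <iface>`) and that the messages end up in one of the log files mentioned above; the function name `offers_by_ip` is made up for illustration:

```shell
#!/bin/sh
# offers_by_ip: read dhcpd log lines on stdin and print every IP address
# that was offered to more than one distinct MAC address.
# (Hypothetical helper; assumes "DHCPOFFER on <ip> to <mac> ..." log lines.)
offers_by_ip() {
    awk '
        {
            # Locate "DHCPOFFER on <ip> to <mac>" wherever it sits in the line.
            for (i = 1; i <= NF - 4; i++) {
                if ($i == "DHCPOFFER" && $(i + 1) == "on" && $(i + 3) == "to") {
                    ip = $(i + 2)
                    mac = $(i + 4)
                    # Count each (ip, mac) pair only once.
                    if (!((ip, mac) in seen)) {
                        seen[ip, mac] = 1
                        count[ip]++
                        macs[ip] = macs[ip] " " mac
                    }
                }
            }
        }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip ":" macs[ip]
        }
    '
}
```

Usage would be something like `grep dhcpd /var/log/syslog | offers_by_ip`. An address listed here was offered to more than one MAC within the log window. That alone is not proof of a fault, since leases can expire and be reused, but it is a quick way to see whether .17 keeps being handed to every client.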

mageta52

@Sebastian-Roth

Alright, so here’s the deal with the arps for 192.168.235.1; That subnet is actually our production network. I installed FOG on that subnet, and then when I want to image I just put it on an isolated, unmanaged switch with the other clients. The gateway is still there though, so it continues to look for it, even when it’s not connected to the production network.

Regarding the DHCP issue, earlier in the week, I allowed the client machines to boot to their old Windows install, and they were able to get addresses just fine. If it was an issue of exhausting the pool, I should have seen it there. On Monday I can try this again to confirm. I can look at the logs as well to see if there is anything.

Will the logs show how many addresses are leased? Is there some place I can check?

Wayne Workman

@mageta52 Why can’t you just leave the fog server on the production network? Maybe imaging will happen properly there?

mageta52

@Wayne-Workman There is another DHCP server on that network, and per security requirements it’s not going to be allowed in the future.

Wayne Workman

@mageta52 So turn DHCP off of the fog server and configure your main DHCP to support fog.
https://wiki.fogproject.org/wiki/index.php?title=Modifying_existing_DHCP_server_to_work_with_FOG
https://wiki.fogproject.org/wiki/index.php?title=BIOS_and_UEFI_Co-Existence

Sebastian Roth

@mageta52 said:

Will the logs show how many addresses are leased? Is there some place I can check?

DHCP leases should be in /var/lib/dhcp/dhcpd.leases. At least it is here on debian. My syslog is saying this when I restart the DHCP service:

...
... dhcpd: Wrote 22 leases to leases file.
...

Or use dnsmasq in proxy mode!? Although I have to admit that I don’t find dnsmasq’s proxy mode to be that good - it has limitations when it comes to serving BIOS and UEFI - it still might be a way to go for you.

But as Wayne already said, adding PXE booting options to the existing DHCP server is definitely the best way to go and shouldn’t conflict with anything in your network. Talk to your network guys.
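
For the "how many addresses are leased?" question, the leases file itself can be tallied. dhcpd appends to this file, so the same address can appear several times and the last entry for an address is the current one. A minimal sketch under that assumption; `count_active_leases` is a hypothetical helper and the default path is the Debian one mentioned above:

```shell
#!/bin/sh
# count_active_leases: report how many distinct addresses in a dhcpd.leases
# file are currently in "binding state active". Later entries for the same
# address override earlier ones. Pass "-" to read from stdin.
# (Hypothetical helper; default path is the Debian location, adjust as needed.)
count_active_leases() {
    awk '
        # "lease 192.168.235.17 {" opens a new entry for that address.
        $1 == "lease" && $3 == "{" {
            ip = $2
            if (!(ip in state)) total++
            state[ip] = "unknown"
        }
        # "binding state active;" (lines starting with "next" are skipped).
        $1 == "binding" && $2 == "state" && ip != "" {
            s = $3
            sub(/;$/, "", s)
            state[ip] = s
        }
        END {
            for (a in state)
                if (state[a] == "active") active++
            printf "active=%d total=%d\n", active, total
        }
    ' "${1:-/var/lib/dhcp/dhcpd.leases}"
}
```

Call it as `count_active_leases` or with an explicit path, for example `count_active_leases /var/lib/dhcp/dhcpd.leases`. Comparing the active count against the size of the configured range gives a rough idea of whether the pool is anywhere near exhausted.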

mageta52

@Sebastian-Roth

Our engineer does not want the server on the core network. Apparently the switches on the core network are not set up to handle multicast and it creates issues.

I looked at the logs and it said that it wrote 44 leases to the lease file, so there should still be more than enough addresses. Not sure why it’s attempting to hand out 192.168.235.17 with each attempt.

Wayne Workman

@mageta52 If that’s his only reason, you could just not use multicast. I used to, but unicast is so fast I just use that now.

I looked over the dhcpd.conf file and interface info you posted a few days ago trying to find a problem, but I didn’t see anything. I spent a good amount of time picking it over. The only thing that might even be an issue is the DNS update style, since there is no DNS server on your isolated network. But I doubt this is causing the issues.

Maybe we should try to re-approach the problem with more simple troubleshooting? I’d like to.

Look inside of /var/log for system errors. I’d look through OS errors, any journalctl errors, and I/O errors, and ensure again that the firewall is indeed off and SELinux is disabled.

Wherever you put the fog installation files, just use those to reinstall fog. You’ll need an Internet connection for fog trunk to run the installer but only temporarily.

Try a different switch. An un-managed dumb switch.

Is the cabling or server near high voltage equipment, electric motors, HVAC equipment, manufacturing equipment, a microwave, or very close to fluorescent lighting? These things will cause RF noise and can interfere with network communications and motherboards, RAM, power supplies, and so on.

Do a MemTest on your FOG Server. You can use a bootable CD or flash drive to do this.

Do you have a power supply tester you can use on the fog server’s power supply?

Unplug peripherals if any, all the fog server needs is a network cable and a power cable.

mageta52

@Sebastian-Roth I’m afraid I have not gotten to imaging multiple clients at once yet, is it possible to image a bunch of them at a time using unicast? If so, what is the purpose of even having the multicast feature?

Sebastian Roth

@mageta52 Unicast vs. multicast is like trying to explain the same topic to several people one at a time vs. giving a speech where the audience is more or less just listening. Sure you can unicast to a bunch of computers and I know people who use unicast for mass-deployment. So yes it works. But being kind of a network guy, for me this feels like a huge waste of resources as your switch(es) need to shuffle around a lot of extra packets just for the sake of it.

That said give unicast a try. Put a couple of your machines together in a group and start a unicast deploy for that group…

Wayne Workman

@mageta52 A tip, I limit my fog server’s maximum connections to 2.

So when I fire up an imaging task of 30 computers, only two run, the rest wait in line until a slot is open and then begin.

mageta52

@Wayne-Workman

Why do you limit it to only 2 at a time? Is it faster?
