Some hosts are unable to get an address through DHCP

  • The setup is five hosts connected to a switch, with the FOG server connected to the same switch. There are no other devices; this is an isolated network.

    All hosts are powered off, and I’m powering them on one at a time for registration. The first host was able to connect and register with the server. The second host stalled while waiting for a DHCP address. I thought it might be a physical connection issue, so I took the cable from the first host, which had connected successfully, and tried it on the second; it still failed. These are all identical systems, so I’m not sure why some of them fail to connect while others have no issues.

  • @Sebastian-Roth I agree. But I also don’t have access to my building’s switches to fix multicast.

  • Moderator

    @Wayne-Workman said:

    So, if you start 10 at once, each successive one that starts runs a bit slower.
    This slowness does not speed up if bandwidth is freed up by other hosts completing.

    Unicast is just the wrong “tool” to send out lots of identical data to many clients… Sorry, couldn’t hold it back.

  • @Wayne-Workman Thanks, guys. I’ll experiment with these options, but again, it will probably have to wait until next week. The schedule is insane right now.

  • @mageta52 Also, after doing what @Tom-Elliott said, you can set up UEFI support as well by following this:

  • @mageta52 Windows DHCP:

    DHCP Scope (global or individual) -> Option 66 = IP address/hostname of the server to get PXE information from (the FOG server’s IP address).

    DHCP Scope (global or individual) -> Option 67 = filename to request from the PXE server (undionly.kkpxe).
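
    If your DHCP server is not Windows, it can help to see what options 66 and 67 actually look like on the wire. Per RFC 2132, each DHCP option is one code byte, one length byte, then the data; 66 is the TFTP server name and 67 is the boot file name. This is only a sketch that builds those bytes, and 192.0.2.5 is a placeholder address, not anyone's real FOG server.

```python
def encode_option(code: int, value: str) -> bytes:
    """Encode one DHCP option as code byte, length byte, data (RFC 2132 layout)."""
    data = value.encode("ascii")
    if not 1 <= len(data) <= 255:
        raise ValueError("option data must be 1 to 255 bytes long")
    return bytes([code, len(data)]) + data


def pxe_boot_options(tftp_server: str, bootfile: str) -> bytes:
    """Option 66 (TFTP server name) followed by option 67 (bootfile name)."""
    return encode_option(66, tftp_server) + encode_option(67, bootfile)


# 192.0.2.5 is a placeholder; substitute your FOG server's address.
opts = pxe_boot_options("192.0.2.5", "undionly.kkpxe")
print(opts.hex(" "))
```

    Whatever DHCP server you use, the goal is the same: PXE clients should receive these two values in the OFFER.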

  • @Wayne-Workman We run open DHCP on our main network. What I’m not sure about is how to point it toward the FOG server when a client is PXE booting and looking to connect.

  • @mageta52 For me, it is. It might be different for you.

    So, if you start 10 at once, each successive one that starts runs a bit slower.

    This slowness does not speed up if bandwidth is freed up by other hosts completing.

    So, the slowest one is always the last to finish and it takes forever.

    By limiting it to two here at work, both run at full speed; the only limiting factor is how fast the target host can write to disk.

    My recommendation is to see how many you can run simultaneously before you detect any slowdown in the later-joining hosts. Subtract one from that number, and that’s the sweet spot.
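
    The "find the slowdown, then back off by one" rule can be written down as a small function. This is a sketch under assumptions: you have measured the per-host write rate at each concurrency level yourself, "that number" is read as the first level where a slowdown shows up, and a drop of more than 10% below the single-host rate counts as a slowdown. The tolerance and the sample numbers are made up for illustration.

```python
def sweet_spot(rates_mb_s: list[float], tolerance: float = 0.10) -> int:
    """Largest number of simultaneous hosts that showed no noticeable slowdown.

    rates_mb_s[i] is the per-host rate measured with i + 1 hosts imaging at once.
    A level counts as slowed down once its rate drops more than `tolerance`
    below the single-host baseline.
    """
    if not rates_mb_s:
        raise ValueError("need at least one measurement")
    baseline = rates_mb_s[0]
    for i, rate in enumerate(rates_mb_s):
        if rate < baseline * (1 - tolerance):
            hosts = i + 1       # first level where a slowdown was detected
            return hosts - 1    # "minus one from that number"
    return len(rates_mb_s)      # no slowdown seen at any level tested


# Made-up measurements for 1 to 5 simultaneous hosts.
print(sweet_spot([600.0, 590.0, 585.0, 420.0, 300.0]))  # prints 3
```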

  • @Wayne-Workman

    Why do you limit it to only 2 at a time? Is it faster?

  • @mageta52 A tip: I limit my FOG server’s maximum connections to 2.

    So when I start an imaging task for 30 computers, only two run; the rest wait in line until a slot opens and then begin.
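
    The "only two run, the rest wait in line" behavior can be modeled as a queue feeding a fixed number of slots. This is a toy model, not FOG's own scheduler. It assumes each host takes the same time at full speed no matter how many are queued, and the 10-minute figure is hypothetical.

```python
import heapq


def schedule(durations: list[float], max_slots: int = 2) -> list[tuple[float, float]]:
    """Return (start, finish) per host when at most max_slots images run at once.

    Hosts are taken in queue order, and each waiting host grabs whichever slot
    frees up first. Each host's duration is assumed not to change with load.
    """
    if max_slots < 1:
        raise ValueError("need at least one slot")
    free_at = [0.0] * max_slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    times = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        finish = start + duration
        heapq.heappush(free_at, finish)
        times.append((start, finish))
    return times


# Hypothetical numbers: 30 hosts, 10 minutes each at full speed, 2 slots.
plan = schedule([10.0] * 30, max_slots=2)
print(plan[2], plan[-1])  # the third host waits for the first free slot
```

    With these assumptions the last host finishes at minute 150, and no host ever runs slower than full speed, which is the point of capping the slots.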

  • Moderator

    @mageta52 Unicast vs. multicast is like explaining the same topic to several people one at a time vs. giving a speech where the audience is more or less just listening. Sure, you can unicast to a bunch of computers, and I know people who use unicast for mass deployment, so yes, it works. But being something of a network guy, to me this feels like a huge waste of resources, as your switch(es) need to shuffle around a lot of extra packets just for the sake of it.

    That said, give unicast a try. Put a couple of your machines together in a group and start a unicast deploy for that group…
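
    The analogy in rough numbers: unicast pushes one full copy of the image per client through the server's uplink, while multicast pushes one copy that the switches replicate. This sketch ignores protocol overhead and retransmissions, treats 1 GB as 8 Gbit, and uses a hypothetical 20 GB image, 30 clients and a 1 Gbit/s server link. The times are only a lower bound, since client disk speed can be the real limit, as noted elsewhere in this thread.

```python
def gb_on_uplink(image_gb: float, clients: int, multicast: bool) -> float:
    """Data the server pushes onto its uplink, in GB (overhead ignored)."""
    # Unicast sends one full copy per client; multicast sends one copy
    # that the switches replicate to every listening port.
    return image_gb if multicast else image_gb * clients


def minutes_at(gb: float, link_gbps: float) -> float:
    """Transfer time in minutes at a given line rate (1 GB = 8 Gbit)."""
    return gb * 8 / link_gbps / 60


# Hypothetical: a 20 GB image to 30 clients over a 1 Gbit/s server link.
for mode in (False, True):
    gb = gb_on_uplink(20.0, 30, multicast=mode)
    label = "multicast" if mode else "unicast"
    print(label, gb, "GB", round(minutes_at(gb, 1.0), 1), "min")
```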

  • @Sebastian-Roth I’m afraid I haven’t gotten to imaging multiple clients at once yet. Is it possible to image a bunch of them at a time using unicast? If so, what is the purpose of even having the multicast feature?

  • @mageta52 If that’s his only reason, you could just not use multicast. I used to, but unicast is so fast I just use that now.

    I looked over the dhcpd.conf file and interface info you posted a few days ago, trying to find a problem, and I didn’t see anything, even after spending a good amount of time picking it over. The only thing that might even be an issue is the DNS update style, since there is no DNS server on your isolated network. But I doubt this is causing the issues.

    Maybe we should re-approach the problem with simpler troubleshooting? I’d like to.

    Look inside /var/log for system errors. I’d look through OS errors, anything journalctl reports, and I/O errors, and confirm again that the firewall is indeed off and SELinux is disabled.

    Reinstall FOG using the installation files, wherever you put them. You’ll need an Internet connection for the FOG trunk installer to run, but only temporarily.

    Try a different switch: an unmanaged, dumb switch.

    Is the cabling or server near high-voltage equipment, electric motors, HVAC equipment, manufacturing equipment, a microwave, or very close fluorescent lighting? These can cause RF noise that interferes with network communications and with motherboards, RAM, power supplies, and so on.

    Run MemTest on your FOG server; you can use a bootable CD or flash drive to do this.

    Do you have a power supply tester you can use on the fog server’s power supply?

    Unplug any peripherals; all the FOG server needs is a network cable and a power cable.

  • @Sebastian-Roth

    Our engineer does not want the server on the core network. Apparently the switches on the core network are not set up to handle multicast and it creates issues.

    I looked at the logs and it said it wrote 44 leases to the lease file, so there should still be more than enough addresses. I’m not sure why it keeps trying to hand out the same address with each attempt.
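
    A quick way to sanity-check "more than enough addresses" is to count the pool. The thread only gives the last octets of the range (.10 to .254), so 192.0.2 below is a placeholder prefix; for a range inside one /24 the count does not depend on it. The "left over" figure assumes, pessimistically, that all 44 written leases are active.

```python
import ipaddress


def pool_size(first: str, last: str) -> int:
    """Number of addresses in an inclusive DHCP range such as 'range A B;'."""
    lo = ipaddress.IPv4Address(first)
    hi = ipaddress.IPv4Address(last)
    if hi < lo:
        raise ValueError("range end is before range start")
    return int(hi) - int(lo) + 1


# Only the last octets (.10 to .254) come from the thread; 192.0.2 is a placeholder.
size = pool_size("192.0.2.10", "192.0.2.254")
print(size, "addresses,", size - 44, "left even if all 44 leases are active")
```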

  • Moderator

    @mageta52 said:

    Will the logs show how many addresses are leased? Is there some place I can check?

    DHCP leases should be in /var/lib/dhcp/dhcpd.leases, at least here on Debian. My syslog shows this when I restart the DHCP service:

    ... dhcpd: Wrote 22 leases to leases file.
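
    To answer "how many addresses are leased" from the leases file itself, you can count the addresses whose latest entry is active. This sketch assumes the usual layout ISC dhcpd writes (see the dhcpd.leases(5) man page): one `lease <ip> { ... }` block per entry, with `binding state ...;` inside. The file is append-only, so a later entry for the same IP replaces earlier ones. The sample text is invented.

```python
import re

# One 'lease <ip> { ... }' block per entry; the closing brace sits at line start.
LEASE_RE = re.compile(r"^lease\s+(\S+)\s*\{(.*?)^\}", re.MULTILINE | re.DOTALL)
# Matches 'binding state X;' but not 'next binding state X;'.
STATE_RE = re.compile(r"^\s*binding state\s+(\w+);", re.MULTILINE)


def active_leases(text: str) -> set[str]:
    """IPs whose most recent entry in a dhcpd.leases file is 'binding state active'."""
    latest = {}
    for ip, body in LEASE_RE.findall(text):
        match = STATE_RE.search(body)
        # A later entry for the same IP replaces any earlier one.
        latest[ip] = match.group(1) if match else None
    return {ip for ip, state in latest.items() if state == "active"}


# Invented sample: .10 was freed then re-leased, .12 was leased then freed.
sample = (
    "lease 192.0.2.10 {\n  binding state free;\n}\n"
    "lease 192.0.2.10 {\n  binding state active;\n  next binding state free;\n}\n"
    "lease 192.0.2.11 {\n  binding state active;\n}\n"
    "lease 192.0.2.12 {\n  binding state active;\n}\n"
    "lease 192.0.2.12 {\n  binding state free;\n}\n"
)
print(sorted(active_leases(sample)))
```

    On the server you would pass it the contents of /var/lib/dhcp/dhcpd.leases (the Debian path above; other distributions may keep the file elsewhere).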

    Or use dnsmasq in proxy mode? I have to admit I don’t find dnsmasq’s proxy mode that good, as it has limitations when it comes to serving BIOS and UEFI clients, but it still might be a way to go for you.

    But as Wayne already said, adding PXE booting options to the existing DHCP server is definitely the best way to go and shouldn’t conflict with anything in your network. Talk to your network guys.

  • @Wayne-Workman There is another DHCP server on that network, and per security requirements it’s not going to be allowed in the future.

  • @mageta52 Why can’t you just leave the FOG server on the production network? Maybe imaging will work properly there?

  • @Sebastian-Roth

    Alright, so here’s the deal with the ARPs: that subnet is actually our production network. I installed FOG on that subnet, and when I want to image, I just put the server on an isolated, unmanaged switch with the other clients. The gateway is still configured, though, so the server keeps looking for it even when it’s not connected to the production network.

    Regarding the DHCP issue: earlier in the week, I let the client machines boot into their old Windows install, and they were able to get addresses just fine. If the pool were exhausted, I should have seen it there. On Monday I can try this again to confirm, and I can look at the logs as well to see if there is anything.

    Will the logs show how many addresses are leased? Is there some place I can check?

  • Moderator

    @mageta52 Thanks for the packet dump. I see a perfect first boot; all seems fine, as you said. The only thing that got my attention was the time between the DHCP discovery request sent by the client and the DHCP offer sent back by the server. There is a one-second delay, which I don’t think I’ve seen before. Let’s keep that in mind, although I am not sure where it comes from or whether it plays a role here.
    Then, after the first successful boot, I see a nice DHCP discovery request sent by a different client (MAC address). Again, there is a one-second delay followed by a DHCP offer. Although I am not sure the offer is making it to the client, I guess it does (as all the other communication is fine). Then I see a couple more discovery/offer pairs but no request/ack to properly finish the DHCP exchange.
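
    To read a capture like this systematically, you can group packets by DHCP transaction ID (xid); RFC 2131 has the client reuse the offer's xid in its REQUEST, so one exchange shares one xid. Exchanges that stop after the OFFER are then easy to flag, and the DISCOVER-to-OFFER delay falls out for free. This sketch assumes you have already exported (time, xid, message type) tuples from the capture with a tool of your choice; the sample events are invented to mirror the pattern described above.

```python
from collections import defaultdict

STEPS = ("DISCOVER", "OFFER", "REQUEST", "ACK")


def summarize(events):
    """Group DHCP messages by transaction ID (xid).

    events: iterable of (time_s, xid, msg_type) tuples, msg_type one of STEPS.
    Returns {xid: (offer_delay_s or None, finished_all_four_steps)}.
    """
    first_seen = defaultdict(dict)
    for time_s, xid, kind in sorted(events):
        first_seen[xid].setdefault(kind, time_s)  # keep the earliest of each type
    report = {}
    for xid, seen in first_seen.items():
        delay = None
        if "DISCOVER" in seen and "OFFER" in seen:
            delay = seen["OFFER"] - seen["DISCOVER"]
        report[xid] = (delay, all(step in seen for step in STEPS))
    return report


# Invented events: one complete exchange, one that stops after the OFFER.
events = [
    (0.00, 0x1A, "DISCOVER"), (1.00, 0x1A, "OFFER"),
    (1.01, 0x1A, "REQUEST"), (1.02, 0x1A, "ACK"),
    (5.00, 0x2B, "DISCOVER"), (6.00, 0x2B, "OFFER"),
]
for xid, (delay, done) in summarize(events).items():
    print(hex(xid), "offer after", delay, "s,", "complete" if done else "no REQUEST/ACK")
```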

    Another odd thing I see is a lot of ARP requests from your server. It keeps asking “who is”. Either you have this IP configured as a DNS server (cat /etc/resolv.conf) or as the default gateway (route -n), plus maybe an external DNS server.
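
    The two checks above (resolv.conf and route -n) can be combined into one small helper that reports which role, if any, the ARPed address plays. This is a sketch: it parses the text of /etc/resolv.conf and /proc/net/route (the kernel table behind route -n), and all addresses in the sample data are placeholders, not the IP from the capture.

```python
import socket
import struct


def nameservers(resolv_conf: str) -> set[str]:
    """Addresses on 'nameserver' lines of /etc/resolv.conf text."""
    found = set()
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            found.add(parts[1])
    return found


def default_gateways(proc_net_route: str) -> set[str]:
    """Gateways of default routes (destination 00000000) in /proc/net/route text."""
    gateways = set()
    for line in proc_net_route.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "00000000":
            # The kernel prints each address as a host-order hex integer, so
            # packing it back in native byte order restores the dotted quad.
            raw = struct.pack("=I", int(fields[2], 16))
            gateways.add(socket.inet_ntoa(raw))
    return gateways


def arp_target_roles(ip: str, resolv_conf: str, proc_net_route: str) -> list[str]:
    """List the reasons the server might keep ARPing for ip."""
    roles = []
    if ip in nameservers(resolv_conf):
        roles.append("DNS server")
    if ip in default_gateways(proc_net_route):
        roles.append("default gateway")
    return roles


# Placeholder data; the route line is as a little-endian (x86) host prints it.
resolv = "nameserver 192.0.2.1\n"
route = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eth0\t00000000\t010200C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
)
print(arp_target_roles("192.0.2.1", resolv, route))
```

    On the server itself you would read the real /etc/resolv.conf and /proc/net/route instead of the sample strings.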

    Just a wild guess: your server is trying to resolve a reverse DNS entry before handing out the IP, taking a second to time out, which then confuses the client…

    Ha, wait a second. I think I might have found something else. When the first client requests an IP, the server asks “who has” to check whether this IP is already in use. That’s perfectly fine. But then the next client asks for an IP, and I see the server offering the same IP to the different client and again sending an ARP broadcast “who has”. Possibly you are running out of leases? Check the system logs while this is happening: tail -f /var/log/syslog | grep dhcp (or maybe /var/log/messages or /var/log/daemon.log)

    Maybe a simple restart of the DHCP service can fix this? Are you sure you haven’t changed your DHCP config? The one you posted seems fine (range from .235.10 to .235.254).
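
    When following the log, two things are worth picking out: the same address being offered to more than one MAC, and dhcpd complaining that it has "no free leases". This sketch assumes ISC dhcpd's usual wording ("DHCPOFFER on <ip> to <mac> ... via <iface>"), so compare the regex against your own log lines first. The sample lines are invented. A repeated offer is not necessarily an error by itself, since an offer that was never accepted can be made again, but together with "no free leases" it would support the exhausted-pool theory.

```python
import re
from collections import defaultdict

# ISC dhcpd usually logs offers as "DHCPOFFER on <ip> to <mac> ... via <iface>".
OFFER_RE = re.compile(r"DHCPOFFER on (\S+) to ((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})")


def check_dhcp_log(lines):
    """Return ({ip: MACs} for IPs offered to several MACs, count of 'no free leases')."""
    offered_to = defaultdict(set)
    no_free = 0
    for line in lines:
        match = OFFER_RE.search(line)
        if match:
            offered_to[match.group(1)].add(match.group(2).lower())
        elif "no free leases" in line:
            no_free += 1
    shared = {ip: macs for ip, macs in offered_to.items() if len(macs) > 1}
    return shared, no_free


# Invented log lines for illustration.
sample = [
    "dhcpd: DHCPOFFER on 192.0.2.10 to 00:11:22:33:44:01 via eth0",
    "dhcpd: DHCPOFFER on 192.0.2.10 to 00:11:22:33:44:02 via eth0",
    "dhcpd: DHCPDISCOVER from 00:11:22:33:44:03 via eth0: network 192.0.2.0/24: no free leases",
]
shared, no_free = check_dhcp_log(sample)
print(shared, no_free)
```

    You could feed it the lines of /var/log/syslog (or /var/log/messages or /var/log/daemon.log) captured while a failing host boots.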