Request: Delayed, Asynchronous Sequential Wake-on-LAN Packets
We have Dell machines, and they have a BIOS feature that allows the computer to automatically PXE boot whenever the computer is turned on using wake-on-LAN. (When the computer is turned on using the power button, it boots from the hard drive instead.)
When we deploy using multicast, FOG automatically sends a wake-on-LAN packet to all machines in the group. This makes deployment very easy.
The problem is that when we do very large deployments, turning on all the computers at the exact same time overwhelms our DHCP server when the machines all try to PXE boot, so most of the machines do not receive a response from the DHCP server in time. So, most of the machines never get to FOG when we deploy many computers at once.
Could we get an option where it would have a 2-second delay between the wake-on-LAN packets? For example, if you had 3 computers, FOG would do this:
Send wake-on-LAN packet to computer 1
Wait 2 seconds
Send wake-on-LAN packet to computer 2
Wait 2 seconds
Send wake-on-LAN packet to computer 3
That would probably fix our issue completely.
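For anyone who wants to try the idea outside of FOG in the meantime, here is a minimal sketch of the sequence above. It is not FOG's code. It assumes the standard magic packet layout (6 bytes of 0xFF followed by the target MAC repeated 16 times) sent as a UDP broadcast; the port (9), the broadcast address, and the function names are assumptions for illustration and may need adjusting for your network.

```python
import socket
import time


def build_magic_packet(mac: str) -> bytes:
    """Return a WOL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def wake_staggered(macs, delay_seconds=2.0, broadcast="255.255.255.255",
                   port=9, sleep=time.sleep, sock=None):
    """Send one magic packet per MAC, pausing delay_seconds between packets.

    The sleep and sock parameters exist only so the function can be
    exercised without touching the network.
    """
    own_socket = sock is None
    if own_socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    try:
        for index, mac in enumerate(macs):
            if index > 0:
                sleep(delay_seconds)  # wait between packets, not after the last
            sock.sendto(build_magic_packet(mac), (broadcast, port))
    finally:
        if own_socket:
            sock.close()


# Example call (hypothetical MAC addresses):
# wake_staggered(["00:11:22:33:44:55", "00:11:22:33:44:56"], delay_seconds=2.0)
```

With 150 machines and a 2-second delay, the last packet goes out roughly five minutes after the first, so the delay would probably need to be configurable.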
I keep forgetting about hubs…
Good job, Uncle Frank.
Theoretically, if you had a 1gbps hub, you could slide that in between your DHCP server and switch… and monitor traffic that way too… I would definitely not try it if the hub was consumer grade… you need a NICE hub to do that, and it needs to be during down-time.
[quote=“loosus456, post: 44757, member: 26317”]So, is there anything client-wise or network-wise (maybe using something like WireShark) that I can do myself to help determine where the problem is, at least?[/quote]
I love wireshark/tcpdump and I’d say you definitely need to capture packets to see more clearly what’s going on!!
Start “from scratch”. Pick just one of your clients (maybe one that fails most of the time) and hook it up with a hub in between. Get your laptop with wireshark and capture while you WOL 150 clients.
You’ll see the WOL packets and DHCP requests from ALL the clients, as those are being sent to broadcast addresses. Depending on the DHCP server you will also see ALL the DHCP offers, as some DHCP servers send their answer to broadcast as well. But you will DEFINITELY see the answer for your particular client!
Probably stop capturing as soon as you see TFTP (turquoise) and HTTP (green) traffic.
Then get yourself a big cup of tea or coffee and look through the packet dump. Handy display filters are ‘bootp’, ‘tftp’ and ‘http’… Let us know if you need help looking through the dump file. Depending on the size you are more than welcome to upload the dump here or use one of those file upload services…
[quote=“loosus456, post: 44757, member: 26317”]
So, is there anything client-wise or network-wise (maybe using something like WireShark) that I can do myself to help determine where the problem is, at least?[/quote]
I want to help, but we need to get away from what you can’t change, get back to basic troubleshooting, and look for alternatives.
FYI, searching the web, I found one (1) page talking about PXE booting DHCP wait times:
And that says clients wait for 60 seconds…
Here’s a VERY interesting post from this site:
There’s a comment that says (hate for the resource to disappear):
[QUOTE]usually bypass the portfast requirement by making hosts do the extended memory check in the bios (not sure about Cisco but HP and IBM allow this setting) by the time the host boots to PXE everything is able to forward. UEFI systems also seem to slow up the boot process. I’m not saying the portfast option won’t work but sometimes it’s a tough sell to the “that’s the way we do things” networking types. This has worked everytime and everywhere I’ve tried it, without network intervention.[/QUOTE]
So, you may want to try to enable the LONG bios check… maybe it’ll work… who knows? According to that guy, it will. Apparently, it just gives the switches a little more time to register all the MACs, and maybe the DHCP server a little more time, too… ? MAYBE you could even stagger turning on the LONG check, so that half the systems boot up fast, and the other half not so fast… It’s not pretty, but it’d break the DHCP load into two heaps instead of one.
What version of FOG are you using?
HOW MANY clients can work at once when WOL’ing for imaging? You don’t have to image to test your specific problem. Send out a MemTest task to 50 hosts, 75, 100, and so on.
Try this test in different areas. Try several switches away from the DHCP server, and then try clients that are connected to the switch that DHCP is actually on…
If you’re in a computer lab, keep eyes on a row or two, watch which ones don’t work. Make note of them. Try those systems with MemTest individually. See what happens.
Ask your network guys if PortFast is enabled on the network, and ask them if DHCP helper addresses are configured (PortFast is likely to help the most).
As far as WireShark goes, there’s not much to do with it in this scenario honestly… It’d be beneficial to install it on the DHCP server and see what’s happening, but you don’t have access. And I’m going to guess that asking the Network guys to just turn off DHCP so you can configure DHCP on FOG isn’t an option either (you be the judge of that).
Please let us know your thoughts/findings. We’re here to help, even with the tough problems.
I’m not positive that it’s DHCP at all.
However, the message received on all of them that I have personally seen is that “A DHCP response was not received.” iPXE appears to never load. It’s a little difficult to make sure that’s the message they all display, though, because all the computers are doing this at the exact same time, and just a few seconds after that message appears, the computer continues booting, going to the hard drive next. So I can really only check one or two computers per test.
I asked for help today from the department that manages the network and DHCP, and they don’t want to deal with it at the moment. They said I should just wait until they upgrade the servers from Windows Server 2003 to Windows Server 2012 R2, but that’s going to be several months, at best.
So, is there anything client-wise or network-wise (maybe using something like WireShark) that I can do myself to help determine where the problem is, at least?
That’s a good question, Uncle Frank.
And yes, DHCP should easily handle 50, or 300 simultaneous requests. It’s literally only 5 packets per client, with the 5th being an ARP probe from the client at the end.
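Client broadcasts a DHCP Discover
Server answers with a DHCP Offer
Client sends a DHCP Request
Server confirms with a DHCP Ack
Client probes network for its new address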
Just as an example from my Win7 workstation on a gigabit network: if you monitor the network adapter status, mine is sending 400 to 1,000+ packets every 1 to 3 seconds and receiving about the same. And that’s with it just sitting here idle…
This just highlights how lightweight DHCP is…
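To put rough numbers on that, here is a quick back-of-the-envelope calculation. It only uses the 5-packets-per-client figure from this post and ignores retries and any TFTP/HTTP traffic that follows, so treat it as an estimate rather than a measurement.

```python
PACKETS_PER_CLIENT = 5  # figure quoted in the post above


def dhcp_packets(clients: int, packets_per_client: int = PACKETS_PER_CLIENT) -> int:
    """Estimate total DHCP-related packets for a number of clients."""
    return clients * packets_per_client


for clients in (50, 150, 300):
    print(clients, "clients ->", dhcp_packets(clients), "packets")
```

Even 300 clients come out at 1,500 packets, which is in the same range as what the idle workstation above pushes in a few seconds.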
Are you sure DHCP is the issue and not TFTP?? What error messages do you see on the screen when clients fail?
I don’t control DHCP, unfortunately. And I have a feeling that when/if I bring the matter up, I’m going to be told the reason that it doesn’t work is that I am trying to push too many clients to it simultaneously and to not do that.
The interface is gigabit, not sure if the server is otherwise bogged down (although I’ve never had this issue except this exact scenario), and it is on the same LAN, although through a couple of gigabit switches. Just for reference, this DHCP server is not dedicated to FOG. It’s used by the entire campus.
So, is it fair to say that DHCP should be able to handle, say, 50 [B]simultaneous[/B] requests? I bold “simultaneous” because it is quite literally, what appears to be, the [I]exact[/I] same time.
Maybe you should examine your DHCP server, since it seems to be the problem.
Does it have a fast enough interface? Is the system bogged down with other tasks? Does it have enough RAM for breathing room? Is it on the same LAN, or is it a remote server running through a constricted and traffic heavy WAN pipe?
You really shouldn’t be having any issues with DHCP keeping up… it’s such an incredibly lightweight service…
[quote=“Tom Elliott, post: 44657, member: 7271”]Not that it directly helps your situation, but most networks already limit the number of wake on lan packets automatically sent at a time. At least from my experience. For this, to get systems to boot properly, we added to the FOGScheduler service to have it check which tasks have checked in and which ones haven’t yet. If they haven’t checked in at the next cycle check, it will resend the wake on lan packets to those systems.[/quote]
Our network actually doesn’t limit it, it seems. If we push out to 150 systems, they all wake up simultaneously – which is normally a good thing, except here.
Any other ideas? I hate to make three or four groups for a group of machines that should really be in just one group, but I guess I will do what I have to do.
Not that it directly helps your situation, but most networks already limit the number of wake on lan packets automatically sent at a time. At least from my experience. For this, to get systems to boot properly, we added to the FOGScheduler service to have it check which tasks have checked in and which ones haven’t yet. If they haven’t checked in at the next cycle check, it will resend the wake on lan packets to those systems.
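The check-and-resend behaviour described above can be sketched roughly like this. This is not the actual FOGScheduler code. The function name, the `has_checked_in` and `send_wol` callbacks, the cycle length, and the number of cycles are all hypothetical and only illustrate the idea.

```python
import time


def resend_until_checked_in(hosts, has_checked_in, send_wol,
                            max_cycles=3, cycle_seconds=60.0,
                            sleep=time.sleep):
    """Resend WOL each cycle to hosts that have not checked in yet.

    has_checked_in(host) -> bool and send_wol(host) are supplied by the
    caller. Returns the hosts still missing after max_cycles.
    """
    pending = [h for h in hosts if not has_checked_in(h)]
    for _ in range(max_cycles):
        if not pending:
            break
        for host in pending:
            send_wol(host)          # wake only the stragglers
        sleep(cycle_seconds)        # wait for the next cycle check
        pending = [h for h in pending if not has_checked_in(h)]
    return pending
```

Because only the hosts that have not checked in get another packet, each cycle wakes a smaller batch than the one before.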