DHCP lease timeout issue
-
Hey guys,
I am having a problem where some machines do not receive a DHCP address within the allowed time once the kernel boots and tries to run a task.
PXE boot gets a lease and starts iPXE (gives DHCP enough time)
iPXE then gets a lease and loads the correct action (gives DHCP enough time)
The kernel starts, a bunch of info goes by, and then the following happens in exactly 10 seconds:
Starting network…
udhcpc (v1.22.1) started
Sending discover…
Sending discover…
Sending discover…
No lease, failing
ssh-keygen…etc, etc, etc
Doesn’t give the DHCP enough time!
Is there a way to increase the timeout on the DHCP requests? Our server runs the whole campus, wireless, etc… It always responds…just never within the 10 seconds that it is given in FOG on this step. If the timeout were 20 seconds, I’m pretty sure this issue would go away.
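For what it's worth, three "Sending discover…" lines in roughly 10 seconds is consistent with the BusyBox udhcpc defaults: up to 3 discover packets (`-t`) with a 3-second pause after each (`-T`). Whether the FOG init relies on those defaults is an assumption, since the actual command line isn't shown in this thread. A rough Python sketch of the arithmetic (the helper name is made up for illustration):

```python
# Rough estimate of how long udhcpc keeps trying before "No lease, failing".
# Assumption: BusyBox defaults of 3 discover packets (-t 3) and a 3-second
# pause after each one (-T 3); the real FOG init may pass different values.

def discover_window(retries: int = 3, pause_seconds: int = 3) -> int:
    """Approximate seconds spent sending discovers before giving up."""
    return retries * pause_seconds


if __name__ == "__main__":
    print(discover_window())       # window with the assumed defaults
    print(discover_window(20, 3))  # e.g. what `-t 20 -T 3` would allow
```

With the assumed defaults the window is about 9 seconds, which lines up with the "exactly 10 seconds" observed here once startup overhead is counted.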
Any thoughts?
Adam
-
Well…I found the issue.
When I updated FOG from the SVN to fix the resize partition issue, it installed a new kernel as well. This kernel seems to be the source of the issue here. I reverted to the original 1.2.0 package kernel and now it is able to boot correctly.
What was changed between the 1.2.0 default kernel and the current SVN kernel?
Thanks,
Adam
-
We found the main issue. The kernel switch alleviated it somewhat, but the issue remains.
Spanning-tree on the network here is interfering with DHCP in FOG. It’s not bringing the port up fast enough for the latest kernel/tools, so it dies at discovery.
Can the timeout please be lengthened for the DHCP discover requests in the latest tools? Is there any way that I can set it…can you point me where to go?
Thanks,
Adam
-
DHCP timeouts aren’t really something FOG controls in any form. If you’re getting iPXE menu screens, then it seems (to me) that you’re booting systems using USB NICs?
Is this issue occurring on all of your systems or specific systems?
-
I am getting iPXE screens (but had to tweak spanning-tree even for that) because iPXE was not waiting long enough for the DHCP request to come through (iPXE waits 15 seconds…with full spanning-tree, it takes 13-20 seconds to respond). We enabled fast learning for spanning-tree and it now responds in about 9-12 seconds, which makes iPXE happy. But once it boots the kernel and starts (for a task or a registration), it does the above with the discover statements and then gives up. We timed it: it gives that process 10 seconds at most, then the task just hangs with a blinking cursor that slowly moves everything off the screen.
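To put those numbers side by side: a stage only gets a lease reliably if its DHCP wait is at least as long as the slowest time the port takes to start forwarding. A small Python sketch using the timings from this post (the function and variable names are only for illustration, and it ignores the DHCP server's own response time):

```python
# Compare each boot stage's DHCP wait against how long the switch port takes
# to start forwarding. All numbers are the ones measured in this thread.

def stage_outcome(wait_seconds: float, port_delay_range: tuple[float, float]) -> str:
    """Classify a stage as 'always', 'sometimes', or 'never' getting a lease."""
    fastest, slowest = port_delay_range
    if wait_seconds >= slowest:
        return "always"
    if wait_seconds >= fastest:
        return "sometimes"
    return "never"


FULL_STP = (13, 20)      # seconds, full spanning tree
FAST_LEARNING = (9, 12)  # seconds, fast learning enabled
STAGES = {"iPXE": 15, "kernel udhcpc": 10}

if __name__ == "__main__":
    for name, wait in STAGES.items():
        print(name, "full STP:", stage_outcome(wait, FULL_STP))
        print(name, "fast learning:", stage_outcome(wait, FAST_LEARNING))
```

Under this simple model, fast learning is enough for the 15-second iPXE wait, but the 10-second kernel stage only succeeds when the port happens to come up at the fast end of the range.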
I am the network person on our campus so I can test, but we cannot turn off spanning tree (nor would we want to…it keeps people from incorrectly connecting network cables and taking down the entire network). I tried turning it straight off for a test and it was happy and all worked. That last “Sending discover” process, however, just will not give the network enough time to “turn on”.
Any enterprise-type network would have this issue, and I can’t see me being the only one here.
On .32 though, the PXE part would wait as long as needed, and when the kernel booted and did its thing, it also waited as long as needed. It just seems to be something with this kernel/tools combo that refuses to wait more than 10 seconds…
Also, no, the NICs are PCIe x1 based, not USB, and are standard Intel-branded 1G chips.
As far as type, Dell Optiplex 755, 760, 780, 790, 990, 9020, and 9030 all show this problem. The fewer “switches” between the computer and the network core, the better the chance of success…the further away, the more likely spanning-tree won’t start fast enough for it, because there are more switches on which spanning-tree must calculate before allowing traffic on that port.
Does that help…lol
-
If you need more time, I can put a delay before DHCP starts but after the kernel loads. Maybe that’ll help?
-
I’m willing to give that a try.
-
I’m specifying a 60-second timeout value. Really it shouldn’t take any more than 30 seconds, but I’ve seen a few times where it may take 45 seconds. I’ll inform you when the inits are built.
-
SVN 2485 released.
It should set the timeout value to 60 seconds. It will continue on when it receives a DHCP lease or times out.
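The "continue on lease or timeout" behaviour can be sketched roughly as below. This is not the actual FOG init code (which isn't shown in this thread), just a minimal Python illustration; `has_lease` is a hypothetical callback standing in for whatever check the init really does, and the `clock`/`sleep` parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_for_lease(has_lease: Callable[[], bool],
                   timeout: float = 60.0,
                   interval: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll has_lease() until it returns True or `timeout` seconds pass.

    Returns True as soon as a lease is seen, False if the deadline expires.
    """
    deadline = clock() + timeout
    while True:
        if has_lease():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated port that starts passing DHCP on the third check.
    checks = iter([False, False, True])
    print(wait_for_lease(lambda: next(checks), timeout=5, interval=0.1))
```

The point of the design is that a longer timeout costs nothing on a fast network: the loop returns as soon as the lease shows up, and only slow ports use the full allowance.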
-
This fixed it. It now gives our network enough time to actually grab a DHCP address.
I really appreciate your help!
-
Speaking of switches - I’ve noticed that the delay will also lengthen due to managed switches. We have Dell PowerConnect switches at my school … the floor switches are unmanaged (i.e. the plug-and-go variety), and some of the bigger switches are managed, which means they have a static IP address that you can change and can be configured with any sort of LAGs or VLANs. I’ve noticed that the more managed switches there are between the FOG client and the DHCP server, the more dramatically the waiting time increases.
-
That is spanning tree making sure there are no loops before allowing data to pass on the port. Most managed switches have it turned on by default (which is a good thing). All our switches on campus have it, so if you have 3-4 switches in a row that have to learn the new path, it takes longer than usual to come up compared with unmanaged switches, hence the issue we were having.