PXE / TFTP booting is slow



  • I have a FOG server, version 1.4.4 (SVN revision 6077), running on Ubuntu. I updated it about 3 or 4 months ago, and this issue was happening both before and after the update.

    When PXE booting a working computer, I boot it using the “integrated NIC” option and it begins to boot into the FOG menu. Normally this takes 2-3 seconds. Once at the FOG menu, I deploy an image, and that takes around 20 minutes.

    On a computer that does not work, I follow the same steps, except that when I select “integrated NIC” the computer starts the same way, then pauses on “TFTP…” for about 10 seconds before it finishes booting into the FOG menu. I then deploy an image, but this time imaging takes anywhere from 2 hours to 300 hours (I just sat here and watched a computer get to 1.5% after 3 hours).

    There is no correlation between which computers work and which don’t. This has happened with 4 brand new computers straight out of the box: same model and everything, 3 worked fine and 1 did not. I could keep going, but I think the point is made that neither model nor age matters.

    I have tried lots of steps. Not sure if I should list every single one or not. Any ideas would help; I am really at a loss.

    Also, let me know if you need more information or pictures.

    Thank you.


  • Developer

    @jjsplitter said in PXE / TFTP booting is slow:

    I find it hard to believe that it is a network issue. Even after I know a computer is imaging slowly, I can start another one simultaneously and it will be imaging at a normal speed.

    The issue is probably not the network itself but a combination of network driver on the client and your setup. The only way we can figure this out is by looking at the network packets. You’d first have to figure out which network interface is used in the FOG server (sudo ip addr show).

    On the other hand it might be a very hard thing to nail down and fix anyway. So I do understand if you don’t want to spend more time on this. I’ll mark this thread solved for now. Feel free to open a new one and post a reference here if you want to keep debugging this.



  • @sebastian-roth No good news. I have given up on attempting to fix the issue. I’ve spent way too many hours without any true fixes. It is random and only happens on about 1 out of every 10 computers that are imaged. I figured I would just deal with it for now. I plan to redo my entire FOG setup this holiday season and move it to an actual server machine instead of the PC it is currently on, hopefully fixing the issue for good.

    Thank you for following up.
    -Justin


  • Developer

    @jjsplitter Any news on this?


  • Developer

    @jjsplitter Please run ip addr show and post the output here. Usually tcpdump is pretty good at selecting a decent network interface but possibly went wrong on your system.
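    To narrow down which interface to capture on, the address list can be condensed to one line per interface. This is only a rough sketch: it assumes the iproute2 one-line output format (`ip -o -4 addr show`), and `list_ipv4_ifaces` is a made-up helper name, not part of FOG.

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and
# prints "interface address/prefix" for every non-loopback interface.
list_ipv4_ifaces() {
    awk '$3 == "inet" && $2 != "lo" { print $2, $4 }'
}

# Typical use on the FOG server:
#   ip -o -4 addr show | list_ipv4_ifaces
```

    The interface that carries the FOG server's address can then be passed to tcpdump explicitly with -i, for example `sudo tcpdump -i eth0 -w /tmp/slow.pcap`, where eth0 is just a placeholder for whatever name the list shows.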



  • Thank you for the response. And I apologize about the wait for my response. Probably should not put this in on a Friday.

    @KnightRaven Basically all of the computers should be on the exact same level. They are all brand new, and I am currently working with Dell Precision 7510s (it happens with Latitude 5570s and 5580s as well).

    I have a switch at my desk with 2 ports that I use for basically all of my deployments. I can run 2 brand new computers simultaneously and they’ll work, then put two more on when those are done and one of them will be slow (again, this is just an example; it could be that I grab one out of the box and it does not work). But to answer the question: I have moved the computer to different switches.

    Have you tried rebooting when you notice it being slow?
    Yeah, I have done about everything under the sun to the hardware. Replaced the HDD or SSD, the NIC, and the battery.

    @george1421
    It seems that once a computer goes slow, no matter how many times I image it (different ports or not), it will never change. As I mentioned above, I use the same two ports and patch cables for all of my deployments. It seems to be entirely random to the computer and not the ports.

    For example’s sake: I imaged 8 Dell Vostro 260s last week, all using the same port and patch cable. These computers were not new out of the box and were using the same image. 7 of those computers completed the deployment in under 30 minutes. One of those Vostros took overnight (3 hours at work got it to about 50%).

    @Sebastian-Roth

    I find it hard to believe that it is a network issue. Even after I know a computer is imaging slowly, I can start another one simultaneously and it will be imaging at a normal speed.

    I did the tcpdump
    0_1508158321788_42c93f45-4250-49f8-82ed-fe0ccd853e95-image.png

    That should be an image showing there were no packets lost. I left it running for about a minute while I had one of my slow-imaging computers running.


  • Developer

    @jjsplitter Just adding a few ideas as well. The headline is a bit confusing as from your description this has nothing to do with PXE or TFTP. It’s just the deployment - so either NFS on single PC or multicast!

    My first thought, as this is happening on brand new hardware (which rules the disk out), is that it could be a massive network issue. If you have a lot of packet loss and TCP retransmissions, that would cause a major slowdown like you describe, and it could happen kind of randomly (not specific to a client!).

    Please do the following as soon as you see this happening again. Open a terminal on your FOG server and run the following commands:

    sudo apt-get install tcpdump
    sudo tcpdump -w /tmp/slow.pcap
    

    It’ll print one or two lines of output and then sit there waiting. Give it about 10 seconds (roughly) and then stop it with Ctrl+C. Grab the file /tmp/slow.pcap, upload it to Dropbox/Google Drive/whatever and send me a private message with the link.

    I am fairly sure that we’ll see if it’s a network issue even from that very short packet dump. If yes we can investigate further.
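    For a quick first look before uploading, repeated TCP segments can be counted from tcpdump's own text output. This is a rough heuristic sketch, not a proper analysis: it assumes the deployment traffic runs over TCP, it treats any data segment whose source, destination and sequence range appear more than once as a retransmission, and `count_repeated_segments` is a made-up helper name.

```shell
# Hypothetical helper: reads `tcpdump -nn -r FILE tcp` text on stdin and
# prints how many data segments repeat an already seen
# (source, destination, sequence range) combination.
count_repeated_segments() {
    awk '
        / seq [0-9]+:[0-9]+/ {
            for (i = 1; i < NF; i++) {
                if ($i == "seq") {
                    range = $(i + 1)
                    sub(/,$/, "", range)          # drop the trailing comma
                    key = $3 " " $5 " " range     # src.port, dst.port, range
                    if (seen[key]++) repeats++
                    break
                }
            }
        }
        END { print repeats + 0 }
    '
}

# Typical use (reading the file may need sudo, depending on its permissions):
#   tcpdump -nn -r /tmp/slow.pcap tcp | count_repeated_segments
```

    A count near zero over a 10 second capture would point away from heavy packet loss; a count in the hundreds or more would suggest the capture is worth a closer look in a full packet analyzer.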


  • Moderator

    In addition to what Jason posted, we need to narrow down the scope a bit. I would start by finding two systems of the same model, one that works and one that does not, and imaging them from the same network jack. Do they act consistently over 5 imaging cycles? (Don’t wait the 2 hrs for the slow one; you should be able to know right away if it’s fast or slow.) You need to deploy the same image to the same model using the same network cable.

    Right now it could be anything from the computer, patch cable, network infrastructure, fog server, solar flare…



  • Well, since you didn’t specify, I’ll take a stab at the obvious first…
    BIOS? Is it updated? Same across all PCs?
    Have you tried moving slow PC to a known good network connection? Make sure cabling/switches etc aren’t the problem.

    Have you tried rebooting when you notice it being slow?
    I just had a Dell 780 not pick up DHCP at PXE boot.
    I had to remove the power cable and clear the power (hit the power button until nothing happened). After that it worked.
    Maybe try clearing the BIOS, either through the BIOS menu or using jumpers.

    Are you sure you have exactly the same NICs? We have refurbs. While we haven’t had an issue with NICs, it’s plausible that a refurbished machine could have a different NIC.

    Beyond that, perhaps some physical encouragement? With a hammer? Maybe some greased lightning? :-D

    Just shots in the dark. Hope it helps or at least sparks an idea.

    Jason

