DHCP Lease Failing on 1.5.5 after upgrade - again



  • Hi,

    I couldn’t find how to mark a topic as unsolved again, so I am opening a new one.

    To summarize: I upgraded from 1.4.4 to 1.5.5, and most of the time I get a DHCP lease failure (see the screenshots below).

    The problem is back again. We ran several tests with only the client and the FOG server on the same dumb switch, to be sure there is no other DHCP server. We changed the switch and the cable too. Sometimes it works, and I don’t know why.
    It works some 3 to 10 times and then stops working on the HP Prodesk 400 G4, HP Prodesk 400 G3, an old Asus tower and the Acer Travelmate P2, but so far there is no problem with Dell machines.

    I suspect something went wrong during the upgrade, but I don’t know where to look for it yet. I will do a fresh install of 1.5.5 to check.

    Here are my screenshots:
    compressed_IMG_20190213_081140.jpg
    and
    compressed_IMG_20190219_114308.jpg



  • @Sebastian-Roth Sometimes we have a problem before the PXE menu even appears. The PXE boot waits for an IP and never receives one, or sometimes PXE says to check the cable… I have no other idea than a BIOS or hardware problem. We tried changing the switch and putting the FOG server and the client directly on the same switch, with no change… Perhaps we have a problem with some hardware series, but it is strange that it only shows up with PXE.


  • Developer

    @totoro Thanks for letting me know. But are you sure it’s random even with the exact same kernel booted every time? I am just asking because the kernel builds I provided might have alternating outcomes. Maybe 1_... works, 2_... fails and 3_... works again. Just an idea…



  • @Sebastian-Roth Hi again, this is driving me crazy… Sometimes it does not work, and at other times it works again (on the same PC that had the problem every time before). And sometimes it fails right at PXE boot too. So I looked on the web for a BIOS or hardware issue, and it seems there is such a problem, so I don’t think the problem comes from FOG but from a random bug somewhere in the BIOS. Thanks for your help, and sorry for taking up your time.



  • @Sebastian-Roth Sorry, lots of work here. I’m running tests today.


  • Developer

    @totoro I would appreciate you letting us know if you are still interested in debugging this issue any further.


  • Developer

    @totoro Any news? Did you get to test some of the binaries yet?


  • Developer

    @totoro I just updated the binaries again to have the commit hash included in the name just to make sure we don’t mix up anything in the later analyses. Some were compiled again as well. Hope you have not started testing yet.

    Please start from bzImage_r8169_01_... and go through to bzImage_r8169_51...


  • Developer

    @totoro Ok, I think it’s best to compile binaries for each and every commit that seems related to that network driver. In the same download location you will find new binaries named like bzImage_r8169_... - numbered from 1 to 64 (I might not have compiled and uploaded all of them yet, but will do so soon). Please test those one by one and see where exactly the problem starts.


  • Developer

    @totoro Sorry for the delay. Turns out there have been major changes in that part of the code between those versions:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
    cd linux-stable/
    git diff --stat v4.18.20 v4.19
    ...
     drivers/net/ethernet/realtek/Kconfig                                                              |     3 +-
     drivers/net/ethernet/realtek/r8169.c                                                              |  1120 +++++--------
    ...
    

    I am still working my way through those to see how we can properly test which one of those 1120 lines of changed code is causing the issue…
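
The numbered per-commit builds mentioned in this thread need an ordered list of the commits that touched the driver. As a rough sketch, such a list could be produced from the same clone with `git log`, restricted to the driver file. The helper name `list_driver_commits` is made up for illustration, and this is not necessarily how the binaries in this thread were selected.

```shell
#!/bin/sh
# Sketch: list the commits that touched the r8169 driver between two tags,
# oldest first, so that each one can be built and tested in order.
# list_driver_commits is a hypothetical helper name, not part of git or FOG.
list_driver_commits() {
    repo=$1   # path to a linux-stable clone
    old=$2    # last known good tag, e.g. v4.18.20
    new=$3    # first known bad tag, e.g. v4.19
    git -C "$repo" log --oneline --reverse "$old..$new" -- \
        drivers/net/ethernet/realtek/r8169.c
}

# Example (assumes the clone from the post above exists in ./linux-stable):
# list_driver_commits linux-stable v4.18.20 v4.19
```

Each output line is an abbreviated commit hash followed by the commit subject, which makes it easy to match a test build to the change it contains.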



  • @Sebastian-Roth No more tests?


  • Developer

    @totoro Great, thanks for testing. I’ll look into the code changes between 4.18.20 and 4.19. Probably will compile some more kernels to test for you soon.



  • @Sebastian-Roth I tested almost all of the kernels. The problem comes back when I go from the 4.18.20.64 kernel to the 4.19.64 kernel, so the problem is in the latter.
    I hope this helps.


  • Developer

    @totoro Ok, here we go, I started to build all the kernel versions between 4.18.3 and 4.19.1 to see if we can pinpoint it to a specific kernel version introducing the issue. Find the kernels here: https://fogproject.org/kernels/r8169/ (only 64 bit as your machines seem to be that arch)

    Please start testing upwards starting from 4.18.4. Download the kernels manually, put them in /var/www/html/fog/service/ipxe/ and then set Host Kernel in the host settings of your test machine. Schedule a task for that machine, boot it up and see if it is able to get an IP from your DHCP. If it fails on 4.18.4 already, then please go back to 4.18.3 again and make sure it works with that. Just to make sure it’s not something else causing the issue at that point.

    Be aware, this is only the very first stage. To be able to send in a proper bug report to the kernel developers we’ll probably need to test different commits between the official kernel releases as well. Stay tuned!
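
Since the failure described in this thread is intermittent (the same PC may boot fine several times and then fail), a single boot per kernel may not be conclusive. Below is a minimal sketch of a plain-text test log that counts results per kernel. The function names and the `RESULTS_FILE` default are assumptions made for illustration; nothing here is part of FOG.

```shell
#!/bin/sh
# Hypothetical helpers for keeping a test log; neither is part of FOG.
# RESULTS_FILE is an assumed file name and can be overridden.
RESULTS_FILE=${RESULTS_FILE:-kernel-tests.txt}

# record_result KERNEL ok|fail : append one boot attempt to the log.
record_result() {
    case $2 in
        ok|fail) printf '%s %s\n' "$1" "$2" >> "$RESULTS_FILE" ;;
        *) echo "usage: record_result KERNEL ok|fail" >&2; return 1 ;;
    esac
}

# summarize_results : print "KERNEL ok=N fail=M" for each kernel,
# in the order the kernels were first recorded.
summarize_results() {
    awk '
        !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
        $2 == "ok"    { ok[$1]++ }
        $2 == "fail"  { fail[$1]++ }
        END {
            for (i = 1; i <= n; i++)
                printf "%s ok=%d fail=%d\n", order[i], ok[order[i]], fail[order[i]]
        }
    ' "$RESULTS_FILE"
}
```

For example, calling `record_result 4.18.4 ok` after each successful DHCP lease and `record_result 4.18.4 fail` after each failure, then running `summarize_results`, gives a per-kernel tally to post back to the thread.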



  • @Sebastian-Roth Yes, I will have access to a few machines for testing.


  • Developer

    @totoro said in DHCP Lease Failing on 1.5.5 after upgrade - again:

    Can we file a bug report for this problem? With the Linux kernel team? Or the FOG kernel team?

    You surely can and you partly have already by posting this here in the forums. But what it needs to actually get this solved in the Linux kernel is definitely some work down the road. I am more than happy to guide this process but need your help in trying out different kernels over and over until we find out exactly where the issue was introduced. This can take some time. Will you have access to at least one of these machines as well as time for testing over the next weeks? There is no point in starting this if you can’t do the testing reliably - I don’t have the hardware and therefore can’t do it myself!



  • @Sebastian-Roth Thanks for your answer. I think we are going to stay on 4.18.3, because if we have to register each client every time we want to work on it, we are going to lose a lot of time…

    Can we file a bug report for this problem? With the Linux kernel team? Or the FOG kernel team?

    Thanks again for your help.


  • Developer

    @totoro I’d advise you to use the newest kernel as default. Then for those devices where you have issues with the network you can manually download an older kernel. Run the following commands as root:

    cd /var/www/html/fog/service/ipxe/
    wget https://fogproject.org/kernels/Kernel.TomElliott.4.18.3.64
    wget https://fogproject.org/kernels/Kernel.TomElliott.4.18.3.32
    chmod 644 Kernel.TomElliott.*
    

    Now go to the host settings in the web UI and set Host Kernel to Kernel.TomElliott.4.18.3.64 (I guess those machines are 64 bit architecture).

    If you get “Kernel is too old” errors then you also need to download other init files:

    cd /var/www/html/fog/service/ipxe/
    wget https://fogproject.org/inits/init_compat.xz
    wget https://fogproject.org/inits/init_32_compat.xz
    chmod 644 init_*compat.xz
    

    Now set Host Init in the host settings of that machine to init_compat.xz.


  • Developer

    @totoro said in DHCP Lease Failing on 1.5.5 after upgrade - again:

    We tried some older kernels, and it looks like we don’t have any problem with the 4.18.3 one,

    Ah, that’s great to hear. I did not expect an older kernel would help here. Sure you can use the newer kernel as default and add the older one just for some particular machines. I will give instructions when I have a bit more time later on today.



  • I had the same problem with an older version of FOG. I solved it by decreasing the network speed: I went down to 100 Mb, and now the network card can negotiate the speed.

