DHCP Lease Failing on 1.5.5 after upgrade - again
-
@totoro Sorry, I got sidetracked. Back to my questions.
- At the Linux command prompt, key in: ip addr show (I’m interested in what it lists for eth0).
Your picture shows that eth0 is found. Can we assume that the MAC address in your picture matches the actual physical network adapter in this computer? If so, then the kernel network driver is fine. Sebastian’s suggestion to use a dumb (unmanaged) switch should have fixed the problem we were thinking of. Will you reboot the computer back into the FOS debug console? I want to try a command from the FOS Linux command prompt.
/sbin/udhcpc -i eth0 --now
Wait until the command completes, then run
ip addr show
to see if it picks up an IP address. If a network address is assigned, then I want you to run this command
ping 192.168.0.10
(which should be your FOG server’s IP address). Make sure you get a response.
-
Hi, here is the result:
-
@totoro Well, that is disappointing… but now we know it’s not “time” that solves your issue (you mentioned that it randomly works).
Here is what I want you to do next. Take a second computer and install Wireshark on it. Plug it into the same subnet/switch as the one in the picture. Use the Wireshark capture filter
port 67 and port 68
Start the Wireshark capture, then issue that same udhcpc command. What I’m hoping to see in the capture is a DISCOVER, OFFER, REQUEST, ACK sequence from DHCP. Since DHCP is actually failing on this computer, I’m guessing it’s failing on one of those steps. If we see the DISCOVER packet, then we know the target computer is alive and on the network. Post the captured pcap here and I will take a look at it in detail.
-
@totoro You could also try booting in debug mode, setting a static IP, and seeing if you can ping then:
ip addr add 192.168.0.222/24 dev eth0
ping 192.168.0.10
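If it helps, the static IP check can be wrapped in a small script so the result is easier to report back. This is only a sketch: the function names ipv4_of and static_ip_test are made up for illustration, the "ip link set ... up" line is an assumption (the interface may already be up in the debug console), and the addresses are the ones used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `ip addr show dev IFACE` on stdin
# and print the first IPv4 address/prefix found. Prints nothing if there is none.
ipv4_of() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "inet") { print $(i + 1); exit } }'
}

# Hypothetical wrapper: assign a static address, confirm it is set, then ping.
# Usage (as root): static_ip_test eth0 192.168.0.222/24 192.168.0.10
static_ip_test() {
    iface=$1
    addr=$2
    server=$3
    ip link set "$iface" up            # assumption: harmless if already up
    ip addr add "$addr" dev "$iface"
    if [ -z "$(ip addr show dev "$iface" | ipv4_of)" ]; then
        echo "no IPv4 address on $iface" >&2
        return 1
    fi
    ping -c 3 "$server"                # three echo requests, then stop
}
```

If the ping succeeds with a static address while udhcpc still gets no lease, that would point at the DHCP exchange rather than at basic connectivity.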
-
During the “udhcp: sending discover” process we don’t see anything:
We tried some older kernels, and it looks like we don’t have any problem with 4.18.3. We are going to continue testing to see if that solves it. Is there a way to set up FOG to use a specific kernel by default without erasing the latest one?
That could help for the latest one.
Thanks for your help.
-
I had the same problem with an older version of FOG. I solved it by decreasing the network speed: I went down to 100 Mb/s, and then the network card could negotiate the speed.
-
@totoro said in DHCP Lease Failing on 1.5.5 after upgrade - again:
We tried some older kernels, and it looks like we don’t have any problem with 4.18.3,
Ah, that’s great to hear. I did not expect an older kernel would help here. Sure, you can use the newer kernel as default and add the older one just for some particular machines. I will give instructions when I have a bit more time later on today.
-
@totoro I’d advise you to use the newest kernel as default. Then, for those devices where you have issues with the network, you can manually download an older kernel. Run the following commands as root:
cd /var/www/html/fog/service/ipxe/
wget https://fogproject.org/kernels/Kernel.TomElliott.4.18.3.64
wget https://fogproject.org/kernels/Kernel.TomElliott.4.18.3.32
chmod 644 Kernel.TomElliott.*
Now go to the host settings in the web UI and set Host Kernel to
Kernel.TomElliott.4.18.3.64
(I guess those machines are 64-bit architecture.)
If you get “Kernel is too old” errors, then you also need to download other init files:
cd /var/www/html/fog/service/ipxe/
wget https://fogproject.org/inits/init_compat.xz
wget https://fogproject.org/inits/init_32_compat.xz
chmod 644 init_*compat.xz
Now set Host Init in the host settings of that machine to
init_compat.xz
-
@Sebastian-Roth Thanks for your answer. I think we are going to stay on 4.18.3, because if we have to register each client before we can work on it, we are going to lose a lot of time…
Can we file a bug report for this problem? With the Linux kernel team, or the FOG kernel team?
Thanks again for your help.
-
@totoro said in DHCP Lease Failing on 1.5.5 after upgrade - again:
Can we file a bug report for this problem? With the Linux kernel team, or the FOG kernel team?
You surely can, and you partly have already by posting here in the forums. But actually getting this fixed in the Linux kernel will definitely take some work. I am more than happy to guide this process, but I need your help trying out different kernels over and over until we find out exactly where the issue was introduced. This can take some time. Will you have access to at least one of these machines, as well as time for testing, over the next few weeks? There is no point in starting this if you can’t do the testing reliably: I don’t have the hardware, so I can’t do it myself.
-
@Sebastian-Roth Yes, I will have access to a few machines for testing.
-
@totoro Ok, here we go. I started to build all the kernel versions between 4.18.3 and 4.19.1 to see if we can pinpoint the specific kernel version that introduced the issue. Find the kernels here: https://fogproject.org/kernels/r8169/ (only 64 bit, as your machines seem to be that arch)
Please start testing upwards, starting from 4.18.4. Download the kernels manually, put them in
/var/www/html/fog/service/ipxe/
and then set Host Kernel in the host settings of your test machine. Schedule a task for that machine, boot it up and see if it is able to get an IP from your DHCP. If it fails on 4.18.4 already, then please go back to 4.18.3 again and make sure it works with that, just to make sure it’s not something else causing the issue at that point.
Be aware, this is only the very first stage. To be able to send in a proper bug report to the kernel developers, we’ll probably need to test different commits between the official kernel releases as well. Stay tuned!
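To keep track while working upwards through the versions, something like the following could help. It is just a sketch, assuming you note the results by hand in a text file with one line per tested kernel in the form "VERSION ok" or "VERSION fail", oldest kernel first. The function name first_regression and the file name results.txt are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "VERSION RESULT" lines on stdin (RESULT is "ok"
# or "fail"), oldest kernel first, and print "LAST_GOOD FIRST_BAD" for the
# first ok -> fail transition. Returns non-zero if there is no such transition.
first_regression() {
    awk '
        $2 == "ok"   { good = $1 }
        $2 == "fail" { if (good != "") { print good, $1; found = 1; exit } }
        END          { if (!found) exit 1 }
    '
}

# Usage: first_regression < results.txt
```

A failure that shows up before any working version is ignored on purpose, which matches the advice to go back and re-check 4.18.3 first in that case.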
-
@Sebastian-Roth I tested almost all of the kernels. The problem comes back when I go from the 4.18.20.64 kernel to the 4.19.64 kernel, so the problem is in the latter.
I hope that helps.
-
@totoro Great, thanks for testing. I’ll look into the code changes between 4.18.20 and 4.19. I will probably compile some more kernels for you to test soon.
-
@Sebastian-Roth No more tests?
-
@totoro Sorry for the delay. Turns out there have been major changes in that part of the code between those versions:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable/
git diff --stat v4.18.20 v4.19
...
drivers/net/ethernet/realtek/Kconfig | 3 +-
drivers/net/ethernet/realtek/r8169.c | 1120 +++++-------
...
I am still working my way through those to see how we can properly test which one of those 1120 lines of changed code is causing the issue…
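One way to get a list of candidate changes, sketched here under some assumptions, is to ask git for the individual commits that touched the driver directory between the two tags. The function name driver_commits is made up for illustration. Note that v4.18.20 sits on the stable branch, so the range below lists mainline commits that are not reachable from that tag, and some of them may already have been backported to 4.18.y as separate commits.

```shell
#!/bin/sh
# Hypothetical helper: list the commits that touched PATH between two refs,
# oldest first, one line per commit (abbreviated hash and subject).
# Usage (inside the linux-stable checkout): driver_commits v4.18.20 v4.19
driver_commits() {
    good=$1
    bad=$2
    path=${3:-drivers/net/ethernet/realtek/}
    git log --oneline --reverse "$good..$bad" -- "$path"
}
```

Each commit on that list could then be built and tested in order, which narrows the search from changed lines down to a single change.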
-
@totoro Ok, I think it’s best to compile binaries for each and every commit that seems related to that network driver. In the same download location you find new binaries named like
bzImage_r8169_...
numbered from 1 to 64 (I might not have compiled and uploaded all of them yet, but will do so soon). Please test those one by one and see where exactly the problem starts.
-
@totoro I just updated the binaries again to include the commit hash in the name, just to make sure we don’t mix anything up in the later analysis. Some were compiled again as well. I hope you have not started testing yet.
Please start from
bzImage_r8169_01_...
and go through to
bzImage_r8169_51...
-
@totoro Any news? Did you get to test some of the binaries yet?
-
@totoro I would appreciate you letting us know if you are still interested in debugging this issue any further.