Server
- FOG Version: 1.4.4
- OS: 16.04.3 LTS
Client
- Service Version:
- OS: Win 10 + Ubuntu 16.04.3
Description
Hello,
In order to better organize ideas and to separate unrelated issues, I am creating a new post as suggested by Sebastian Roth. You’re right, thanks. This one will focus on the tg3_stop_block timed out problem.
My first post is here (sorry, I was describing the two problems in one place).
https://forums.fogproject.org/topic/10711/could-not-open-inode-xxxxx-through-the-library-hp-elitedesk-705-g3-mini/4
The problem:
I am seeing a timeout error during the cloning process. I believe it is related to the tg3 kernel module, which is responsible for handling the Tigon3 wired Ethernet device.
The observed behavior is as follows. I start a deploy; the machine sometimes begins the deploy process and, after a while, gets stuck. Then, a few minutes later, the kernel crashes with a timeout error.
This happens both with a crossover cable and over wired Ethernet through a switch. It is an intermittent issue. Last Friday I managed to clone about five machines with the crossover cable, plus one that failed.
Today, two failed using the crossover cable. The deploy starts, but at some point during partition writing it crashes, after 8 minutes or so, with an NTFS partition partially deployed. I tested only one machine at a time, due to the limitation of the crossover cable.
All tests I did through the network switch also failed, but in a somewhat different way: right after writing the GPT data, but before starting to write data inside the partition. I tested with a small group of four, then two, and finally with a single machine. All tests failed the same way, both with the UDPCAST method (multicast deploy) and the NFS method (unicast, if I remember correctly).
Possible causes:
- My first guess was that the crossover cable was too loose. I no longer think this is the root cause, since I replaced the cable with a new one. With the new cable I observed both successful image captures and deploys, but failed captures and deploys happened too. So, I don’t think it’s the cable anymore.
- Failure in the tg3 kernel module.
Current investigation:
After some reading, I’ve found a few references.
This (old) message suggests that the problem happens in one kernel version, but not in the previous one.
https://askubuntu.com/questions/88319/server-getting-error-after-doing-distro-upgrade-tg3-stop-block-timed-out
This (also old) message points out that:
“When using TSO property of the TG3 driver to transmit a packet with a large header, such as over 80 bytes, an error message similar to the following appears in the Kernel log when using the TG3 3.66d version of the driver with GA3 firmware…”
https://www.ibm.com/support/home/docdisplay?lndocid=migr-5071755
It also suggests a workaround:
“Turn off the TSO functionality of the driver using the following command from Linux:
ethtool -K eth0 tso off”
I started a “deploy (debug)” task and tried that once, manually, but the problem is still there in the very same way. If it had worked, I would have worked around the problem with a postinit script.
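For reference, the postinit idea could look roughly like the sketch below. It is only an illustration: turning TSO off did not help in my single manual test, and the function name `disable_tso` and the `SYSFS_NET` / `ETHTOOL` overrides are made up here so the loop can be dry-run without touching a real card. The only real command involved is the `ethtool -K <iface> tso off` call quoted above.

```shell
#!/bin/sh
# Sketch: turn off TCP segmentation offload (TSO) on every non-loopback interface.
# SYSFS_NET and ETHTOOL are hypothetical overrides used only for dry runs.
disable_tso() {
    sysfs_net="${SYSFS_NET:-/sys/class/net}"
    ethtool_cmd="${ETHTOOL:-ethtool}"
    for path in "$sysfs_net"/*; do
        # Skip the unexpanded glob if the directory is empty.
        [ -e "$path" ] || continue
        iface="${path##*/}"
        # The loopback device has no hardware offload to change.
        [ "$iface" = "lo" ] && continue
        # Left unquoted on purpose so ETHTOOL may hold a command plus arguments.
        $ethtool_cmd -K "$iface" tso off \
            || echo "could not change TSO on $iface" >&2
    done
}
```

Calling `disable_tso` from the postinit script would apply the setting before imaging starts. Running `ETHTOOL="echo ethtool" disable_tso` first only prints the commands it would issue.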
I also tried to limit bandwidth with wondershaper a few times, but could not see much difference: same error. The idea was based on a possible concurrency issue. If the tg3 problem is due to some subtle race condition in the buffer handling for the network card, slowing it down could (possibly) make the issue less likely.
Finally, I started playing with different kernels.
Kernel.TomElliott.4.1.0.64 was “too old” and refused to work.
With the following kernel versions, the issue is still there.
4.12.3.64,
Kernel.TomElliott.4.10.1.64 and
Kernel.TomElliott.4.9.0.64
By the way, other than showing the same issue, the last version (4.9.0.64) also complained about an APIC issue. It reads: “Firmware bug”, and also “APIC ID mismatch”. Here is the screenshot.
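In case the messages are more useful as text than as a screenshot, here is a minimal sketch of how they could be pulled out of the kernel log from a debug task. It assumes `dmesg` and `grep` are available in the debug shell; the function name `filter_nic_messages` and the search patterns are my own guesses at what is relevant, not anything FOG provides.

```shell
#!/bin/sh
# Keep only kernel log lines that mention the tg3 driver or the
# APIC / firmware complaints. Reads the log on stdin, so it works on
# saved text as well as on live dmesg output.
filter_nic_messages() {
    grep -iE 'tg3|apic|firmware bug'
}
```

Typical use would be `dmesg | filter_nic_messages`, with the output redirected to a file if it needs to be attached to the post.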
And that’s it: I’m running out of ideas, other than trying the other kernels.
Any suggestions? Is there anything I can do under a debug deploy, even manually, to work around this? Is there a wireless option? Anything?
Thank you very much,
Paulo
P.S.: Tomorrow I will try this search:
linux kernel tg3 tso
And read this to see what happens.
https://blog.sleeplessbeastie.eu/2017/04/17/how-to-install-missing-firmware-for-tg3-module/