Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2



  • Server
    • FOG Version: 1.4.4
    • OS: 16.04.3 LTS
    Client
    • Service Version:
    • OS: Win 10 + Ubuntu 16.04.3
    Description

    Hello,

    In order to better organize ideas and to separate unrelated issues, I am creating a new post as suggested by Sebastian Roth. You’re right, thanks. This one will focus on the tg3_stop_block timed out problem.

    My first post is here (sorry, was describing the two problems in one place).
    https://forums.fogproject.org/topic/10711/could-not-open-inode-xxxxx-through-the-library-hp-elitedesk-705-g3-mini/4

    The problem:

    I am seeing a timeout error during the cloning process. I believe it is related to the tg3 kernel module, which is responsible for handling the Tigon3 wired Ethernet device.
    0_1503972985241_tg3_stop_block_timed_out.png

    The observed behavior is as follows. I start a deploy, the machine sometimes starts the deploy process and after a while, it gets stuck. Then after some time (a few minutes), the kernel crashes with a timeout error.

    This happens both with a crossover cable and over wired Ethernet through a switch. It is an intermittent issue. Last Friday I managed to clone about five machines with the crossover cable, plus one that failed.

    Today, two failed using the crossover cable. The deploy starts, but at some point during the partition writing it crashes, after 8 minutes or so, with an NTFS partition partially deployed. I tested only one machine at a time, due to the limitation of the crossover cable.

    All tests I did through the network switch also failed, but in a somewhat different way: right after writing the GPT data, but before starting to write data inside the partition. I tested with a small group of four, then two and finally with a single machine. All tests failed the same way, both with the UDPCAST method (multicast deploy) and the NFS method (unicast, if I remember correctly).

    Possible causes:

    1. My first guess was that the crossover cable was too loose. Now I don’t think this is the root cause, since I replaced the cable with a new one. With the new cable I observed both successful image captures and deploys, but failed captures and deploys happened too. So, I don’t think it’s the cable anymore.

    2. Failure on the tg3 kernel module.


    Current investigation:

    After some reading, I’ve found a few references.

    This (old) message suggests that the problem happens in one kernel version, but not in the previous one.

    https://askubuntu.com/questions/88319/server-getting-error-after-doing-distro-upgrade-tg3-stop-block-timed-out


    This (also old) message points out that:
    “When using TSO property of the TG3 driver to transmit a packet with a large header, such as over 80 bytes, an error message similar to the following appears in the Kernel log when using the TG3 3.66d version of the driver with GA3 firmware…”

    https://www.ibm.com/support/home/docdisplay?lndocid=migr-5071755

    It also suggests a workaround:

    “Turn off the TSO functionality of the driver using the following command from Linux:

    ethtool -K eth0 tso off”

    I started a “deploy (debug)” task and tried to do that once, manually. But the problem is still there in the very same way. If it had worked, I would have worked around it by using a postinit script.
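
    For reference, a minimal sketch of what such a postinit script could contain, assuming the imaging environment ships ethtool and exposes its interfaces under /sys/class/net. The function name disable_tso and the ETHTOOL and SYSFS_NET variables are made up for illustration, and as noted above, turning TSO off did not cure the timeout in my manual test.

```shell
#!/bin/sh
# Sketch only: turn off TCP segmentation offload (TSO) on every
# non-loopback interface. ETHTOOL and SYSFS_NET are hypothetical
# overrides so the loop can be dry-run without touching real hardware.
ETHTOOL="${ETHTOOL:-ethtool}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

disable_tso() {
    for dev in "$SYSFS_NET"/*; do
        # Skip the unexpanded glob when the directory is empty
        [ -e "$dev" ] || continue
        iface="${dev##*/}"
        # Loopback has no hardware offload to disable
        [ "$iface" = "lo" ] && continue
        "$ETHTOOL" -K "$iface" tso off \
            || echo "could not disable TSO on $iface" >&2
    done
}
```

    Calling disable_tso at the end of the script applies the setting; running it with ETHTOOL=echo only prints the commands it would issue.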

    I also tried to limit bandwidth with wondershaper a few times, but could not see much difference: same error. The idea was based on a possible concurrency issue. If the tg3 problem is due to some subtle race condition in the buffer handling for the network card, slowing it down could (possibly) reduce the likelihood of the issue.

    Finally, I started playing with different kernels.

    With Kernel.TomElliott.4.1.0.64, it was “too old” and refused to work.
    With the following kernel versions, the issue is still there.
    4.12.3.64,
    Kernel.TomElliott.4.10.1.64 and
    Kernel.TomElliott.4.9.0.64

    By the way, other than showing the same issue, the last version (4.9.0.64) also complained about an APIC issue. It reads: “Firmware bug”, and also “APIC ID mismatch”. Here is the screenshot.
    0_1503972858770_Problema_APIC.png
    And that’s it: I’m running out of ideas, other than trying the other kernels.

    Any suggestions? Is there anything I can do under a debug deploy, even manually, to work around this? Is there a wireless option? Anything?

    Thank you very much,
    Paulo

    p.s.: tomorrow I will try this search:
    linux kernel tg3 tso

    And read this to see what happens.
    https://blog.sleeplessbeastie.eu/2017/04/17/how-to-install-missing-firmware-for-tg3-module/


  • Developer

    @Paulo-Guedes Oh man, this sounds very unfortunate that you need to sail this with one broken arm, crippled eyes and let’s hope there is no storm coming up on the last leg of the turn. I really hope and keep my fingers crossed that you can get this done in time. After that I am more than happy to get into finding and fixing this with you. Maybe I can even get a piece of hardware myself to test.

    So yes, finish that ugly job and let me know when you have time again. Wish you all the best!



  • @sebastian-roth
    Hello Sebastian,

    well, yes and no. I have a lot of information on this, but I’m still working to bring up my labs. Please allow me a few days to work this out. I have a few hundred students and teachers that are eagerly expecting our labs to be ready, so it’s a big issue for us.

    Currently I am working with a very small team, cloning about 100+ machines, one by one. Yes, you’ve read it right: it’s currently impossible to multicast images with this bug and our infrastructure (gigabit ethernet mixed with 10/100 switches).

    We set up three fog servers and are using them with crossover cables. I’ve also got two external hard drives (USB 3.0), and hacked my way out through them by using a few shell scripts and a lot of tinkering.

    We are able to clone about five machines in parallel with this scheme. However, the cloning process is very very unstable. About 30% to 50% of the cloning operations are failing (roughly).

    Of these, only about half are due to the tg3 problem and related to fog. Yes, that’s right: with a pair of distinct machines and a single crossover cable between them, the “tg3 timeout” issue is still happening. Both machines (in each pair) have gigabit cards, but they are different. The bug is way less frequent, and we managed to finish many cloning operations successfully. But it’s still happening.

    This means the 10/100 switch makes the bug more reproducible, but it’s not the root cause. It still happens, even without any 10/100 network interface in the middle.

    The other half of the failures are due to crashes and freezes from a couple of live memory sticks running ubuntu and pumping about 200GB over USB3.0 (about 45 min to 1h to finish).

    I could not dig deeper into this since we need to finish the work. I hope to have it done by next Friday, maybe before that.

    About iommu=soft, I also tried it a few times, without any success. Both in a 64 and 32bit kernel, and also with the latest “vanilla + firmware repo” kernel. I also tried many other things, such as noapic, nolapic, both of them, turning off autonegotiation, raising the log level to look for more messages and the like. Oh, and I also updated the HP BIOS firmware, turned on traffic shaping and tried other things (isolated and combined).

    Nothing solved the problem. It’s clearly a regression somewhere between the HW, the firmware and the kernel driver.

    With all due respect to Broadcom, this is something that they should have caught reasonably easily. Since they give explicit support for the kernel module, a good testbench should have exposed the problem. Most probably their test setup has only gigabit cards, otherwise the bug would have been exposed more easily.

    I really believe that a testbench with Fog, a set of machines (with many distinct cards) and a set of images (with many distinct sizes, partition layouts and the like) would be great to catch this kind of thing. Oh my, I would love to help them set up something like that…

    Aham. Well, let me see how things are moving. Will get back to you in a few days.

    Thank you all for your support,
    Paulo


  • Developer

    @Paulo-Guedes Is there any news on this? Did you get Gigabit on both ends? Reading the other post I am fairly sure this would work for you as well.

    Did you try the kernel parameter suggested in the Ubuntu bug report? Edit the host’s settings in the web UI and add iommu=soft as kernel parameter. Then try again.
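
    To confirm that the parameter actually reached the client, the running kernel’s command line can be checked from a debug task. A minimal sketch; the helper name has_kernel_arg is hypothetical, and it only assumes that /proc/cmdline is readable:

```shell
# Succeed if the given argument appears as a whole word on the kernel
# command line. The optional second argument is a file to read instead
# of /proc/cmdline, which makes the check easy to try on saved output.
has_kernel_arg() {
    arg="$1"
    cmdline_file="${2:-/proc/cmdline}"
    cmdline=""
    # read fails on a missing trailing newline but still fills the variable
    read -r cmdline < "$cmdline_file" || [ -n "$cmdline" ] || return 1
    case " $cmdline " in
        *" $arg "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

    For example: has_kernel_arg iommu=soft && echo "iommu=soft is active"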

    Other than that I don’t have any other ideas as of right now.



  • @sebastian-roth said in Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2:

    PCI

    Hello, here is the output for lspci.

    00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1576
    00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Carrizo (rev e4)
    00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kabini HDMI/DP Audio
    00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 157b
    00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 157c
    00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 157c
    00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 157b
    00:08.0 Encryption controller: Advanced Micro Devices, Inc. [AMD] Device 1578
    00:09.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 157d
    00:09.2 Audio device: Advanced Micro Devices, Inc. [AMD] Device 157a
    00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 20)
    00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49)
    00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 49)
    00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 4a)
    00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
    00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1570
    00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1571
    00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1572
    00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1573
    00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1574
    00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1575
    01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10)
    02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)

    The interesting line is the following:
    01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10)

    This is a Broadcom BCM5762 device.
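
    The listing above shows device names only. The numeric PCI ID (vendor:device) is printed when lspci is run with -nn, and a small filter can pull it out for the Ethernet controller. This is a sketch; the function name ethernet_pci_ids is made up, and it assumes the usual lspci -nn layout where the ID appears in square brackets near the end of the line:

```shell
# Read `lspci -nn` output on stdin and print the vendor:device pair of
# every line that describes an Ethernet controller. The class code in
# brackets (four hex digits, no colon) is not matched by the pattern.
ethernet_pci_ids() {
    sed -n 's/.*Ethernet controller.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}
```

    Usage: lspci -nn | ethernet_pci_ids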

    Unfortunately, I still don’t know how to work around this issue. I have systematically tried all sorts of kernel parameters, ethtool parameters and other things with no luck. And yes, I tried to turn off autoneg, with no luck.

    In the meantime, I gathered a lot of information. It’s a bit messy, so I’ll have to organize it.
    I don’t have gigabit on both ends, other than on a couple of machines. That is making my cloning process a real pain, and is also the main reason I’ve not answered before :(.

    I also downloaded and rebuilt the latest kernel (linux-4.12.10) based on your .config files and instructions in here.
    https://wiki.fogproject.org/wiki/index.php?title=Build_TomElliott_Kernel#Build_TomElliott_Kernel_for_FOG_0.33b_and_newer

    This still did not solve the issue. And yes, I’m sure my kernel is running because I added a few messages in Portuguese to make sure it was the one coming up, instead of the previous one.

    Currently I am trying more things, including tinkering with this module myself. It seems that there is something wrong with ACPI. But tg3 is a quite complex module (at least for me). Looks more like a ton of modules merged together, with dozens of special cases, switches and paths to accommodate a large family of devices. Ouch!

    Will try to better organize my ideas in order to share the details with you.
    Talk to you soon. Thank you for helping me out with this crazy bug!

    Regards,
    Paulo

    By the way, there are others looking at it right now. Check out this:
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664


  • Developer

    @Paulo-Guedes Again I think you are doing great on testing things and trying to figure this out. Using a crossover cable and different cables to verify that it must be a kernel issue is a great step. From what I see you’ve done the best to make sure we are not on the wrong path with this assumption.

    So I went ahead and tried to find out one very important detail that you have left out so far: what NIC exactly is this? The PCI ID of it, I mean. Luckily, while searching for this piece of information I stumbled upon postings in our forums that tell me that others had this issue in April already (I missed that as I was out of business for a couple of weeks back then). Read through this: https://forums.fogproject.org/topic/9976/hp-elitedesk-705-g2-mini

    The issue seems to be the auto negotiation. Make sure you have Gigabit on both ends of the client connection (switch and client or FOG server and client if crossover) and you should be imaging fine!
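
    One way to see what the client actually negotiated, from a debug task, is to read the link speed from sysfs. A minimal sketch: the function names link_speed and is_gigabit and the SYSFS_NET override are hypothetical, and the interface name eth0 in the usage line is an assumption.

```shell
# Print the negotiated speed in Mb/s as reported by the kernel.
# Prints nothing if the interface is missing or the link is down.
link_speed() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/speed" 2>/dev/null
}

# Succeed only if the link came up at 1000 Mb/s or faster.
# An unknown speed (empty or -1) counts as "not gigabit".
is_gigabit() {
    speed="$(link_speed "$1")"
    [ -n "$speed" ] && [ "$speed" -ge 1000 ] 2>/dev/null
}
```

    For example: is_gigabit eth0 || echo "eth0 is not running at Gigabit speed"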

    And read this to see what happens.
    https://blog.sleeplessbeastie.eu/2017/04/17/how-to-install-missing-firmware-for-tg3-module/

    Hmm, good finding. Possibly we need to add a firmware blob for this NIC to get around the issue. But please try the above forced Gigabit workaround first.

