Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2
-
@tom-elliott
Hello Tom, I have added a few dmesg logs in the messages below. I think it’s not related to the firmware, since the kernel builds OK, but the module crashes.
Hello all, it’s a real pleasure to finally say that IT WORKED!!! Wow, it finally worked! I almost can’t believe it. Thank you so much for all your help.
Aham. The solution was found by Sebastian (thanks Sebastian!!!). Here I just describe the process.
This is the message thread that contains the solution and a patch. It describes the failure scenario precisely: the same NIC, boot over the network, a 10/100 switch, and the tg3 kernel module breaking with a timeout.
https://www.mail-archive.com/netdev@vger.kernel.org/msg189347.html
The kernel version: 4.13.3
https://www.kernel.org/pub/linux/kernel/v4.x/
https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.13.3.tar.xz
Basically I followed the instructions to rebuild a static image.
Download the kernel and the patch; extract the kernel, apply the patch. Build an image (mine was a 64 bit one).
https://wiki.fogproject.org/wiki/index.php?title=Build_TomElliott_Kernel
Install the build inside fog, then try to image something over ethernet with the regular procedure: using pxe to boot.
Without a patch, the deploy will fail with a timeout crash inside tg3. Now it should work flawlessly.
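For anyone following along, the download, extract, patch and build steps above can be sketched as a small Python helper that only prints the commands rather than running them. This is a rough sketch, not the exact procedure from the wiki: the function name, the `jobs` default and the `bzImage` target are my assumptions, the local patch filename is hypothetical, and the kernel `.config` step is left out (follow the wiki page linked above for that).

```python
import os
import shlex


def patch_and_build_steps(kernel_version, patch_file, jobs=4):
    """Return, in order, the commands to fetch, patch and build a kernel.

    Nothing is executed here; the caller decides whether to run the steps.
    The .config step is intentionally omitted (assumption: it is handled
    separately, as described on the FOG wiki).
    """
    major = kernel_version.split(".")[0]
    tarball = f"linux-{kernel_version}.tar.xz"
    src_dir = f"linux-{kernel_version}"
    url = f"https://www.kernel.org/pub/linux/kernel/v{major}.x/{tarball}"
    # patch -d changes directory first, so hand it an absolute patch path.
    patch_path = os.path.abspath(patch_file)
    return [
        ["wget", url],
        ["tar", "-xf", tarball],
        # Dry run first: reports rejects or offsets without touching files.
        ["patch", "-d", src_dir, "-p1", "--dry-run", "-i", patch_path],
        ["patch", "-d", src_dir, "-p1", "-i", patch_path],
        ["make", "-C", src_dir, f"-j{jobs}", "bzImage"],
    ]


if __name__ == "__main__":
    # "tg3-clock-override.patch" is a hypothetical local filename.
    for step in patch_and_build_steps("4.13.3", "tg3-clock-override.patch"):
        print(shlex.join(step))
```

The dry run is there so that a patch that no longer applies is noticed before any source file is modified.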
If you wish, I’ve built a 64-bit image, ready to be used inside fog. Here it is.
https://goo.gl/n1qBES
Regards,
Paulo
p.s.: I really hope nothing has changed inside the firmware repository, and the fix is not due to a new firmware. Maybe it’s worth trying the same kernel with the same firmware repository, but without the patch (to see if it breaks). Anyway, it works, and this is what matters :)
-
@Paulo-Guedes Oh that’s really great to hear that we have figured out this at least! Probably a real pleasure to see it image nicely now!!!
We are more than happy to add a patch to the FOG kernel but we also should look into if it will make it into the official kernel as well. Last comment on the mailing list was:
Good. We will work on required changes and upstream proper patch after
sanity test with multiple speeds.
Can anyone figure out if and where this patch made it into the upstream kernel? If not, we ought to push the developers to do so.
-
@sebastian-roth
As far as I can tell, the patch for tg3 was not inside the release candidates for the current kernel. I’ve tested 4.15-RC8 and it was not working. Then RC9 was released (no idea about it). Two days ago a brand new stable version was released. Will try it and see what happens.
https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.15.tar.xz
I just checked the changelog and it mentions nothing related to tg3, tigon, timeout or broadcom. I would bet this patch is not in here yet. Here it is.
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.14.15
I will try to run more tests today. One with 4.13.3 without the patch, to see if it breaks (and hence, the patch is the real fix). And another with 4.15 (with and without the patch), to see if it is fixed and, in case it’s not, if the patch applies cleanly and works. Meanwhile, yesterday I wrote in another thread (with the same bug), asking people there to double-check our findings. Maybe they can take a look too, and see what happens.
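The changelog check described above (looking for tg3, tigon, timeout or broadcom) could be repeated for other kernel versions with something like the following. It is only a sketch: the function name is made up, it expects the changelog text to be downloaded already, and the sample text in the usage part is invented for illustration, not taken from a real changelog.

```python
import re

# Keywords taken from the check described in the post.
KEYWORDS = ("tg3", "tigon", "timeout", "broadcom")


def matching_lines(changelog_text, keywords=KEYWORDS):
    """Return (line_number, line) pairs that mention any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(changelog_text.splitlines(), start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    # Invented sample text; a real run would read the downloaded ChangeLog file.
    sample = "net: some unrelated fix\ntg3: example entry\n"
    for number, line in matching_lines(sample):
        print(number, line)
```

An empty result only suggests the patch is missing. A fix can land under a commit title that uses none of these words, so testing the kernel is still the real proof.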
-
@Paulo-Guedes Yeah right, seems like the patch didn’t make it into the kernel yet. Probably a good idea to get in contact with the guy posting the patch. You can find his e-mail address in the patch file! Definitely send him a short message to see what the current state is and tell him that the fix is working great to fix your issue.
-
@sebastian-roth
Hello Sebastian, all,
-
Stable kernels 4.13.3 and 4.15 crash without the patch. Patch is not merged yet in the main branch.
-
Stable kernels 4.13.3 and 4.15 work great with the patch: no timeouts on tg3. Fast transfers on gigabit links and 10/100 links.
-
Wrote to the patch author as Sebastian suggested, with my results and asking when it will be merged. Waiting for his answers. Patch has a slight offset for 4.15 (2 lines, probably new comments or code) but works anyway. Will keep you updated on this.
-
Deploy for single machines (in parallel without multicast) is finally checked. Tested overnight with a bunch of machines and it’s ok.
-
If you wish, I can upload the patched 4.15 kernel tomorrow, just in case someone wants to use it.
-
Multicast deploy for groups of machines is working too, but much slower (about 10x) than my 10/100 network could transfer. Same network, same machines, no cable touched, nothing reset and… the deploy already starts at a slow speed (between 100 and 200 MB/min). Just reporting. Will start reading about it, to try to understand the problem. If anyone can point me in the right direction, please answer this message.
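To put the MB/min figures next to a link speed, here is a rough unit conversion. It is a sketch under assumptions that are mine, not from the post: 1 MB is taken as 10^6 bytes, protocol overhead is ignored, and the displayed rate is assumed to count the bytes actually sent over the wire (it may not, if the image is compressed).

```python
def mb_per_min_to_mbit_per_s(mb_per_min):
    """Convert megabytes per minute to megabits per second (1 MB = 8 Mbit)."""
    return mb_per_min * 8 / 60


def mbit_per_s_to_mb_per_min(mbit_per_s):
    """Convert megabits per second to megabytes per minute."""
    return mbit_per_s * 60 / 8


if __name__ == "__main__":
    for rate in (100, 200):
        print(f"{rate} MB/min is about {mb_per_min_to_mbit_per_s(rate):.1f} Mbit/s")
    print(f"100 Mbit/s is about {mbit_per_s_to_mb_per_min(100):.0f} MB/min")
```

Under those assumptions the reported range is well below what a 100 Mbit/s link can carry in theory, which fits the observation that the limit is somewhere other than the raw link speed.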
-
-
@Paulo-Guedes Great stuff! Keep it up and I am sure we’ll have you up and running soon.
About multicast… First, please open a new thread on this topic. I don’t like to mix things up all in one thread. And then keep in mind that it’s always the slowest part of the chain that limits the speed. So if there is just one single client with a crappy hard drive it will slow down all the other hosts. So I’d start by testing multicast in groups of maybe 3 to 5 machines each and see if those are all going at the same slow pace or if some groups are faster than others.
-
@Tom-Elliott Paulo told me that he’s sent a message to the guy at Broadcom to ask if the fix would be included in the main line kernel at some point but he hasn’t got an answer from him. So I am wondering if you are happy adding the patch to our kernel for now? Paulo has had huge trouble and the patch solved the network issues for him. Take a look here: https://www.mail-archive.com/netdev@vger.kernel.org/msg189923/0001-tg3-Add-clock-override-support-for-5762.patch
-
Hello, just updating.
-
No answer so far from Broadcom. Tom, adding the patch would be good.
-
Added a link to this discussion in another thread. I think it’s the same problem.
Maybe they can also report on the problem.
https://forums.fogproject.org/topic/9976/hp-elitedesk-705-g2-mini
-
Mentioned the patch and test results in another forum. Hope this helps the patch to enter the main kernel faster.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664
-
-
Somehow I have lost track of this. Luckily I somehow came across this again and added the patch now as it seems like it still hasn’t made it into the main line kernel. Also added the patch information to our wiki article on kernel compiling. Just in case anyone reads this thread and wonders where it all went.
@Paulo-Guedes Have you ever heard back from that broadcom guy?
-
@Paulo-Guedes Ahh, I just saw that a fix was actually added upstream in July this year: https://lkml.org/lkml/2018/7/23/671 (just didn’t notice it a little further down the code)
Can you confirm this is fixing your issue? Have you used one of the official FOG kernels since then? Which versions?