Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

Paulo.Guedes

@sebastian-roth
As far as I can tell, the patch for tg3 was not inside the release candidates for the current kernel. I’ve tested 4.15-RC8 and it was not working. Then RC9 was released (no idea about it). Two days ago a brand new stable version was released. Will try it and see what happens.
https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.15.tar.xz

I just checked the changelog and it mentions nothing related to tg3, tigon, timeout or broadcom. I would bet this patch is not in here yet. Here it is.
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.14.15
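
For anyone who wants to repeat the changelog check, here is a minimal Python sketch of the keyword search. It assumes the changelog has already been downloaded and its text is available as a string. The helper name `find_keyword_lines` and the sample text are made up for illustration and are not taken from a real changelog.

```python
import re

# Keywords from the post above; the helper name is hypothetical.
KEYWORDS = ("tg3", "tigon", "timeout", "broadcom")


def find_keyword_lines(changelog_text, keywords=KEYWORDS):
    """Return (line_number, line) pairs that mention any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(changelog_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Made-up sample text, only to show the shape of the result.
sample = """commit 0000000
    net: example driver: fix a typo
commit 1111111
    tg3: example entry mentioning the driver
"""

hits = find_keyword_lines(sample)
for number, line in hits:
    print(number, line)
```

An empty result only says that none of the keywords appear in the text. It does not prove the fix is absent, since a commit could describe it in other words.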

I will try to run more tests today. One with 4.13.3 without the patch, to see if it breaks (which would confirm that the patch is the real fix). And another with 4.15 (with and without the patch), to see if it is fixed and, in case it’s not, if the patch applies cleanly and works. Meanwhile, yesterday I wrote in another thread (about the same bug), asking people there to double-check our findings. Maybe they can take a look too, and see what happens.

Sebastian Roth

@Paulo-Guedes Yeah right, seems like the patch didn’t make it into the kernel yet. Probably a good idea to get in contact with the guy posting the patch. You can find his e-mail address in the patch file! Definitely send him a short message to see what the current state is and tell him that the fix is working great to fix your issue.

Paulo.Guedes

@sebastian-roth
Hello Sebastian, all,

Stable kernels 4.13.3 and 4.15 crash without the patch. Patch is not merged yet in the main branch.
Stable kernels 4.13.3 and 4.15 work great with the patch: no timeouts on tg3. Fast transfers on gigabit links and 10/100 links.
Wrote to the patch author as Sebastian suggested, with my results and asking when it will be merged. Waiting for his answers. Patch has a slight offset for 4.15 (2 lines, probably new comments or code) but works anyway. Will keep you updated on this.
Deploy for single machines (in parallel without multicast) is finally checked. Tested overnight with a bunch of machines and it’s ok.
If you wish, I can upload the patched 4.15 kernel tomorrow, just in case someone wants to use it.
Multicast deploy for groups of machines is working too, but much slower (about 10x) than my 10/100 network could transfer. Same network, same machines, no cable touched, nothing reset and… the deploy already starts at a slow speed (between 100 and 200 MB/min). Just reporting. Will start reading about it, to try to understand the problem. If anyone can point me in the right direction, please answer this message.

Sebastian Roth

@Paulo-Guedes Great stuff! Keep it up and I am sure we’ll have you up and running soon.

About multicast… First, please open a new thread on this topic. I don’t like to mix things up all in one thread. And then keep in mind that it’s always the slowest part of the chain which limits the speed. So if there is just one single client with a crappy hard drive, it will slow down all the other hosts. So I’d start by testing multicast in groups of maybe 3 to 5 machines each and see if those are all going at the same slow pace or if some groups are faster than others.
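
To illustrate the "slowest part of the chain" idea, here is a rough Python sketch under a simplified assumption: a multicast session runs at the rate of its slowest receiver. The host names, the per-host rates and both helper functions are hypothetical and are not measurements from this thread.

```python
def multicast_rate(client_rates):
    """Simplified model: the session runs at the pace of its slowest client."""
    return min(client_rates)


def split_into_groups(hosts, size):
    """Split a list of hosts into consecutive groups of at most `size` hosts."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# Hypothetical per-host write rates in MB/min; pc03 stands in for a slow disk.
rates = {"pc01": 700, "pc02": 680, "pc03": 150,
         "pc04": 710, "pc05": 690, "pc06": 705}

groups = split_into_groups(sorted(rates), 3)
group_rates = [multicast_rate([rates[h] for h in g]) for g in groups]

for group, rate in zip(groups, group_rates):
    print(group, rate)
```

In this toy model only the group containing the slow host drops to its rate. That is the pattern to look for when testing in small groups: if every group is equally slow, a single bad client is probably not the cause.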

Sebastian Roth

@Tom-Elliott Paulo told me that he’s sent a message to the guy at Broadcom to ask if the fix would be included in the main line kernel at some point but he hasn’t got an answer from him. So I am wondering if you are happy adding the patch to our kernel for now? Paulo has had huge trouble and the patch solved the network issues for him. Take a look here: https://www.mail-archive.com/netdev@vger.kernel.org/msg189923/0001-tg3-Add-clock-override-support-for-5762.patch

Paulo.Guedes

Hello, just updating.

No answer so far from Broadcom. Tom, adding the patch would be good.
Added a link to this discussion in another thread. I think it’s the same problem.
Maybe they can also report on the problem.
https://forums.fogproject.org/topic/9976/hp-elitedesk-705-g2-mini
Mentioned the patch and test results in another forum. Hope this helps the patch to enter the main kernel faster.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664

Sebastian Roth

Somehow I have lost track of this. Luckily I somehow came across this again and added the patch now as it seems like it still hasn’t made it into the main line kernel. Also added the patch information to our wiki article on kernel compiling. Just in case anyone reads this thread and wonders where it all went.

@Paulo-Guedes Have you ever heard back from that broadcom guy?

Sebastian Roth

@Paulo-Guedes Ahh, I just saw that a fix was actually added upstream in July this year: https://lkml.org/lkml/2018/7/23/671 (I just hadn’t noticed it a little further down in the code)

Can you confirm this is fixing your issue? Have you used one of the official FOG kernels since then? Which versions?
