rcu_sched stall OR kernel panic on PowerEdge R640



  • Using the latest versions of kernels and inits I get the following repeating indefinitely:

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:    0-....: (20999 ticks this GP) idle=042/1/0x4000000000000002 softirq=8/8 fqs=5248 
    rcu:     (t=21000 jiffies g=-1179 q=18)
    

    I tried rolling back just inits to 1.5.2 as suggested here, as well as rolling back kernels AND inits but both result in this kernel panic:

    Kernel BUG at         (ptrval) [verbose debug info unavailable]
    invalid opcode: 0000 [#1] SMP PTI
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.2 #5
    Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
    RIP: 0010:0xffffffff810252dd
    RSP: 0000:ffffffff82803ed8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 00000000002191c0 RCX: 00000000000001ac
    RDX: 0000007abb318fee RSI: 0000000000000002 RDI: 0000000000000020
    RBP: 0000007abb318fee R08: 0000000000000000 R09: ffffffff82d93854
    R10: 0000000000000000 R11: 0000000000000048 R12: 0000000000000000
    R13: ffffffff82d1a0a0 R14: 0000000000000000 R15: 0000000000000000
    FS:  0000000000000000(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff88183ffff000 CR3: 0000000002812001 CR4: 00000000000606b0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     0xffffffff82c99749
     0xffffffff82c8fed4
     0xffffffff82c89c9a
     0xffffffff810000d5
    Code: 45 85 e4 74 10 59 5b 5d 41 5c 41 5d 41 5e 41 5f e9 6a 1f 00 00 e8 d8 e1 fd ff 48 8b 05 2d 32 60 01 ff 90 b0 00 00 00 85 c0 75 02 <0f> 0b 48 8b 05 1a 32 60 01 ff 90 c0 00 00 00 48 8b 05 0d 32 60 
    RIP: 0xffffffff810252dd RSP: ffffffff82803ed8
    ---[ end trace 4f4168bda6c10f2c ]---
    Kernel panic - not syncing: Attempted to kill the idle task!
    ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
    random: crng init done
    

    This is on a Dell PowerEdge R640 running BIOS 2.2.11. I confirmed the NIC is set to boot in ‘BIOS’ mode (not UEFI). Also tried another R640 with the same result.
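
    For anyone comparing stall reports across machines, the summary line of the rcu_sched message above can be pulled apart with a short script. This is only a sketch: the function name and the `hz` parameter are made up for illustration, and the tick rate of 1000 is an assumption (the log does not state the kernel's HZ value), so the seconds figure is only as good as that guess.

```python
import re

# Matches the "(t=... jiffies g=... q=...)" summary of an RCU stall report.
STALL_RE = re.compile(r"\(t=(\d+) jiffies g=(-?\d+) q=(\d+)\)")

def parse_stall_line(line, hz=1000):
    """Extract the t/g/q fields from a stall summary line.

    hz is an ASSUMED tick rate used only to convert jiffies to seconds.
    Returns None when the line is not a stall summary.
    """
    match = STALL_RE.search(line)
    if match is None:
        return None
    jiffies, g, q = (int(value) for value in match.groups())
    return {"jiffies": jiffies, "g": g, "q": q, "seconds": jiffies / hz}

# Line copied from the report above.
line = "rcu:     (t=21000 jiffies g=-1179 q=18)"
print(parse_stall_line(line))
```

    Lines that do not contain the summary (for example the "self-detected stall on CPU" header) simply return None, so the function can be run over a whole boot log.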


  • Moderator

    @Sebastian-Roth I think it’s fine to leave x86 at 8, since I don’t think they make huge multicore 32-bit CPUs (it wouldn’t really make sense to me anyway). It wouldn’t necessarily hurt to change it, but why bother if it’s not needed? I also believe 8 is the default for x86 anyway.

    As for ARM: https://www.phoronix.com/scan.php?page=news_item&px=ARM64-256-Default-NR_CPUS

    The default for ARM64 in Linux 5.1 is now 256 (the current default being 64).


  • Developer

    @george1421 @Quazz I just added the two kernel options as mentioned below to the x64 kernel config (not pushed the change yet).

    While CONFIG_INTEL_IDLE is available in the x86 (32-bit) config as well, CONFIG_MAXSMP is not (depends on X86_64 [=y]). Should I leave CONFIG_NR_CPUS set to 8 (the default, I think) or increase it to 16, 32, or 64?

    For the ARM kernel config we have neither CONFIG_INTEL_IDLE nor CONFIG_MAXSMP, but we can adjust CONFIG_NR_CPUS there too. 16, 32, or 64?


  • Developer

    @Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:

    I can’t find anything googling about stalls/problems with MAXSMP either. Only some people on embedded systems who want to reduce the size of their kernel, but that’s a targeted compile anyway.

    Ok, you and George have convinced me this is most probably not going to cause us much trouble, so I will add the options as mentioned below.


  • Moderator

    @Sebastian-Roth

    There are kernel flags to disable SMP if necessary, so I think it’s pretty safe to compile with MAXSMP. Just my opinion of course; without a diverse test fleet it’s hard to say for sure since kernels can always have bugs or unforeseen interactions. But that would be true for any change we make.

    I can’t find anything googling about stalls/problems with MAXSMP either. Only some people on embedded systems who want to reduce the size of their kernel, but that’s a targeted compile anyway.

    There will be more and more systems entering the floor with more than 8 cores (our current CONFIG_NR_CPUS value) given the recent CPU releases as well, so at the very least that number could use a bump.


  • Developer

    @george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:

    And on the other side we really don’t know what the impact could be.

    Yes, because of that I don’t like switching it on.

    What can it hurt?

    I don’t know. Stalls on older CPUs?


  • Moderator

    @Sebastian-Roth I’m still on the fence about this. I would say turn it on, because the core count continues to rise on these processors. What can it hurt? And on the other side we really don’t know what the impact could be.


  • Developer

    @george1421 Looking through a stack of other rcu_sched stall topics in the forums, I can’t seem to find any thread where I’d think that people had CPUs with more than 8 cores. Sure, sooner or later this will be state of the art, but I don’t reckon we should step ahead of this. We know the current kernel works pretty well on most CPUs, and I’d rather point people to this topic and provide compile instructions than set CONFIG_MAXSMP as the default. Hmm?


  • Moderator

    @Sebastian-Roth said in rcu_sched stall OR kernel panic on PowerEdge R640:

    Did you get to test this kernel on your fleet of Dell hardware to see if it might cause any other harm?

    TBH, no I did not test it. I haven’t found any other system that the max-cpu value fixed either. We had one dual core with the rcu_sched stall, but that was fixed with the current kernel and changing the acpi clock source that Quazz posted. I think the max-cpu will only impact CPUs with more than 8 cores.


  • Developer

    @Quazz @george1421 Ok, back from travels… what shall we do with this pending topic? I do understand that adding CONFIG_MAXSMP does fix the rcu_sched stall issue on PowerEdge R640. But do we know if this fixes rcu_sched stalls on other platforms as well? Could we get at least two more people to test this before we add it to the official kernel?

    @george1421 Did you get to test this kernel on your fleet of Dell hardware to see if it might cause any other harm?


  • Moderator

    @Sebastian-Roth Hmm, I may have been misremembering, though their CONFIG_NR_CPUS is going to be much higher than 8 at the very least (at least 512 afaik).

    The only difference I can find is that CONFIG_MAXSMP enables CPUMASK_OFFSTACK, which it requires to function correctly, I believe (or any high CONFIG_NR_CPUS would, at least).


  • Developer

    @Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:

    as far as I know CONFIG_MAXSMP is in fact enabled by default on Kernel 4.4+ or so on all major distributions without issues

    That’s valuable information! Any reference for this?


  • Moderator

    @Sebastian-Roth As far as I understand it, Xenomai implements a patch to the kernel that does all kinds of stuff. Potentially it’s not compatible with their patches, but as far as I know CONFIG_MAXSMP is in fact enabled by default on Kernel 4.4+ or so on all major distributions without issues.

    That said, I don’t mind testing it.


  • Developer

    @george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:

    We can also hold this “test” kernel in reserve in case this issue comes up again if you don’t want to release it as general availability.

    Don’t get me wrong on this. I am more than happy to make this the default kernel for everyone. It comes at low cost. But I’d like to see this tested on several different machines (PC as well as notebooks and even servers if possible) before we make it the new default kernel.


  • Moderator

    @Sebastian-Roth I can test it here, but I don’t have a system that is causing this rcu_sched issue. But I can surely test it against our current fleet of Dell systems to see if it does any harm.

    We can also hold this “test” kernel in reserve in case this issue comes up again if you don’t want to release it as general availability. What I would not like to see is having a special kernel for this, and a different special kernel for that.


  • Developer

    @george1421 @Quazz I found a bit of time to look into this. Adding CONFIG_INTEL_IDLE should be just fine I think. But I am not exactly sure about adding CONFIG_MAXSMP (Enable Maximum number of SMP Processors and NUMA Nodes). Found this topic: https://www.xenomai.org/pipermail/xenomai/2018-July/039297.html

    Though I am not convinced this will actually cause trouble, it’s still a bit risky. @Testers @Moderators, would you be able to run a test kernel on several different client machines so we get a feeling for whether this is troublesome or not?


  • Developer

    @george1421 @Quazz @djgalloway Great work!!! Thanks to you all. I will add this in the next days!


  • Moderator

    @developers Here’s the final update on this issue.

    I reset my kernel build environment and then created two new kernel builds. The first removed the imposed CPU limit on the Linux kernel; this kernel was called bzImageMaxCPU. For the second, I reset the kernel build environment again and went through the ACPI settings, turning on what I had turned on in the debug kernel. This kernel was called bzImageACPI.

    The OP tested both and the bzImageMaxCPU was the only kernel that booted on those Dell servers. So in the end @Quazz was right about the CPU not liking some of its cores disabled.

    So I would recommend that we add the following settings to the official kernel build

    CONFIG_INTEL_IDLE
    and
    Processor type and features —>

    We have seen a recent uptick in reports of rcu_sched stalls with kernel panics. Maybe we are running into this issue more often as the core counts go up on these processors.
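
    To confirm whether a given kernel build actually carries the two options discussed in this thread (CONFIG_INTEL_IDLE and CONFIG_MAXSMP), the kernel `.config` text can be checked with a few lines of Python. This is a rough sketch, not part of the FOG build scripts: the function names and the sample config are hypothetical, and it relies only on the standard `.config` conventions (`CONFIG_X=value` for set options, `# CONFIG_X is not set` for disabled ones).

```python
import re

def read_kconfig(text):
    """Parse kernel .config text into {option: value}; disabled options map to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        set_match = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if set_match:
            options[set_match.group(1)] = set_match.group(2).strip('"')
            continue
        unset_match = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if unset_match:
            options[unset_match.group(1)] = "n"
    return options

def missing_options(options, wanted=("CONFIG_INTEL_IDLE", "CONFIG_MAXSMP")):
    """Return the wanted options that are not built in (value other than 'y')."""
    return [name for name in wanted if options.get(name, "n") != "y"]

# Hypothetical sample resembling a kernel capped at 8 CPUs.
sample = """\
CONFIG_SMP=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=8
CONFIG_INTEL_IDLE=y
"""

opts = read_kconfig(sample)
print("missing:", missing_options(opts))
print("CONFIG_NR_CPUS:", opts.get("CONFIG_NR_CPUS"))
```

    In practice the text would come from the `.config` file in the kernel build tree rather than from the inline sample.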


  • Moderator

    @Junkhacker @Sebastian-Roth

    I was able to get the OP going by doing this and that.

    We are not sure if it was this or that that got the kernel to boot. What I did was unlock the max CPUs (which was capped at 8) in the kernel, and I also enabled almost all of the ACPI modules in the kernel. We also tried the acpi_osi=Linux kernel parameter.

    We ruled out the acpi_osi=Linux kernel parameter fixing the issue so it must be something I enabled in the kernel. Tomorrow AM I’m going to reset the kernel environment and only unlock the max CPUs. The OP is going to test that new kernel to see if it was unlocking the max cpu or it was the acpi modules I enabled.

    Either way I’ll report where we ended up and which kernel change fixed the issue. I have also seen other recent CPU stalls like this that were fixed by setting acpi=off, so we may need to move whatever fixed the issue into the main kernel build because new hardware/CPUs may require it.
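
    When testing parameters such as acpi_osi=Linux or acpi=off, it helps to verify that the booted kernel really received them. The sketch below splits a kernel command line into a dictionary; the function name and the sample string are illustrative only, and the simple whitespace split does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value}.

    Bare flags (no '=') map to None. Quoted values containing spaces
    are NOT handled by this simple whitespace split.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

# Hypothetical command line using the parameters mentioned in this thread.
sample = "acpi_osi=Linux acpi=off quiet"
params = parse_cmdline(sample)
print(params.get("acpi_osi"), params.get("acpi"))
```

    On a booted Linux client the same function can be fed the contents of /proc/cmdline.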

