rcu_sched stall OR kernel panic on PowerEdge R640

  • D
    djgalloway
    last edited by Sep 19, 2019, 7:36 PM

    Here’s the latest output using the debug kernel:

    console [ttyS1] enabled
    bootconsole [earlyvga0] disabled
    ACPI: Core revision 20180810
    clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
    APIC: Switch to symmetric I/O mode setup
    x2apic: IRQ remapping doesn't support X2APIC mode
    x2apic disabled
    Switched APIC routing to flat.
    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb633008a4, max_idle_ns: 440795292230 ns
    Calibrating delay loop (skipped), value calculated using timer frequency.. 4400.00 BogoMIPS (lpj=2200000)
    pid_max: default: 32768 minimum: 301
    Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes)
    Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes)
    ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
    ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
    process: using mwait in idle threads
    Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
    Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
    Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Spectre V2 : Mitigation: Full generic retpoline
    Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
    Spectre V2 : Enabling Restricted Speculation for firmware calls
    Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
    Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl
    Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
    MDS: Mitigation: Clear CPU buffers
    Freeing SMP alternatives memory: 52K
    smpboot: CPU0: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4)
    Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
    ... version:                4
    ... bit width:              48
    ... generic registers:      4
    ... value mask:             0000ffffffffffff
    ... max period:             00007fffffffffff
    ... fixed-purpose events:   3
    ... event mask:             000000070000000f
    rcu: Hierarchical SRCU implementation.
    smp: Bringing up secondary CPUs ...
    x86: Booting SMP configuration:
    .... node  #0, CPUs:      #1 #2 #3 #4 #5 #6 #7
    smp: Brought up 1 node, 8 CPUs
    smpboot: Max logical packages: 10
    smpboot: Total of 8 processors activated (35220.85 BogoMIPS)
    devtmpfs: initialized
    clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
    futex hash table entries: 2048 (order: 5, 131072 bytes)
    xor: automatically using best checksumming function   avx       
    pinctrl core: initialized pinctrl subsystem
    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:    0-....: (20999 ticks this GP) idle=04a/1/0x4000000000000002 softirq=10/10 fqs=5241 
    rcu:     (t=21000 jiffies g=-1175 q=19)
    NMI backtrace for cpu 0
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.65 #12
    Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
    Call Trace:
     <IRQ>
     0xffffffff81d6ecad
     0xffffffff81d7222f
     ? 0xffffffff8102b073
     0xffffffff81d7228a
     0xffffffff8107ce90
     0xffffffff8107c41d
     0xffffffff810806b4
     0xffffffff8108a34e
     0xffffffff81e017d5
     0xffffffff81e013af
     </IRQ>
    RIP: 0010:0xffffffff8108fa1d
    Code: 36 48 89 de 89 c7 e8 ca ef cd 00 3b 05 c0 13 86 01 73 24 48 63 f0 49 8b 16 48 03 14 f5 30 83 61 82 8b 72 18 40 80 e6 01 74 04 <f3> 90 eb f3 eb d1 0f 0b e9 72 fe ff ff 48 83 c4 10 5b 5d 41 5c 41
    RSP: 0000:ffffc9000007fae0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000001 RBX: ffff8897e101fac8 RCX: 0000000000000001
    RDX: ffff8897e10621c0 RSI: 0000000000000001 RDI: ffff8897e101fac8
    RBP: 000000000001fa80 R08: 0000000000000000 R09: 00000000016daed4
    R10: ffffc9000007fb58 R11: 000fffffffe00000 R12: 0000000000000001
    R13: 0000000000000008 R14: ffff8897e101fac0 R15: 0000000000000000
     ? 0xffffffff81039a
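For anyone decoding the stall report above: the `(t=21000 jiffies ...)` line can be converted to wall-clock time once the tick rate is known. Below is a minimal sketch, assuming the usual kernel relation BogoMIPS = lpj * HZ / 500000 applies to the `Calibrating delay loop` line in this log. The function names are made up for illustration and are not part of any FOG tooling.

```python
import re


def hz_from_bogomips(bogomips: float, lpj: int) -> int:
    """Estimate the kernel tick rate (HZ) from the boot-time calibration line.

    Assumption: the printed BogoMIPS value equals lpj * HZ / 500000.
    """
    return round(bogomips * 500_000 / lpj)


def stall_seconds(log: str) -> float:
    """Return the length of the reported RCU stall in seconds."""
    calib = re.search(r"([\d.]+) BogoMIPS \(lpj=(\d+)\)", log)
    stall = re.search(r"\(t=(\d+) jiffies", log)
    if calib is None or stall is None:
        raise ValueError("log lacks the calibration line or the stall line")
    hz = hz_from_bogomips(float(calib.group(1)), int(calib.group(2)))
    return int(stall.group(1)) / hz


# Two lines copied from the log above.
sample = """\
Calibrating delay loop (skipped), value calculated using timer frequency.. 4400.00 BogoMIPS (lpj=2200000)
rcu:     (t=21000 jiffies g=-1175 q=19)
"""

print(stall_seconds(sample))
```

For this log the estimate is HZ = 1000, so the stall lasted about 21 seconds, which as far as I know matches the usual default RCU stall-warning timeout.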
    
    • G
      george1421 Moderator @djgalloway
      last edited by george1421 Sep 19, 2019, 2:24 PM Sep 19, 2019, 7:58 PM

      Just for grins I had the OP boot a 486 kernel that I built for another poster, for a specific dedicated machine to image with FOG. That kernel gave a bit more detail than the full system kernel.

      Checking if this processor honours the WP bit even in supervisor mode...Ok.
      SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
      rcu: Hierarchical RCU implementation.
      NR_IRQS: 2304, nr_irqs: 1848, preallocated irqs: 16
      Console: colour VGA+ 80x25
      console [tty0] enabled
      console [ttyS1] enabled
      ACPI: Core revision 20180810
      clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
      APIC: Switch to symmetric I/O mode setup
      Enabling APIC mode:  Flat.  Using 9 I/O APICs
      ------------[ cut here ]------------
      Kernel BUG at 0xc1028128 [verbose debug info unavailable]
      invalid opcode: 0000 [#1] SMP
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.65 #2
      Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
      EIP: 0xc1028128
      

      It looks like the kernel is crashing while enabling APIC mode or during the I/O APIC setup. The hpet clock source line also rings a bell for some reason.

      So the kernel is crashing at the same point. For reference, the 486-compatible kernel is also “Linux version 4.19.65”.

      acpi=ht acpi=oldboot acpi_osi=Linux

      noapic

      Please help us build the FOG community with everyone involved. It's not just about coding - way more we need people to test things, update documentation and most importantly work on uniting the community of people enjoying and working on FOG!

      • J
        Junkhacker Developer @george1421
        last edited by Sep 19, 2019, 9:02 PM

        i was googling the problem a bit and i was curious, will it boot if you remove the raid card?
        just trying to understand the source of the panic.

        signature:
        Junkhacker
        We are here to help you. If you are unresponsive to our questions, don't expect us to be responsive to yours.

        • G
          george1421 Moderator @Junkhacker
          last edited by george1421 Sep 19, 2019, 3:16 PM Sep 19, 2019, 9:12 PM

          @Junkhacker @Sebastian-Roth

          I was able to get the OP going by doing this and that.

          We are not sure which of the changes got the kernel to boot. I unlocked the max CPUs (that was capped at 8) in the kernel, and I also enabled almost all of the ACPI modules in the kernel. We also tried the acpi_osi=Linux kernel parameter.

          We ruled out the acpi_osi=Linux kernel parameter fixing the issue so it must be something I enabled in the kernel. Tomorrow AM I’m going to reset the kernel environment and only unlock the max CPUs. The OP is going to test that new kernel to see if it was unlocking the max cpu or it was the acpi modules I enabled.

          Either way I’ll report where we ended up and which kernel change fixed the issue. I have also seen other recent CPU stalls like this that were fixed by setting acpi=off, so we may need to move whatever fixed the issue into the main kernel build because new hardware/CPUs may require it.

          • G
            george1421 Moderator @george1421
            last edited by Sebastian Roth Sep 26, 2019, 4:44 AM Sep 20, 2019, 5:37 PM

            @developers Here’s the final update on this issue.

            I reset my kernel build environment and then created 2 new kernel builds. The first removed the imposed CPU limit on the Linux kernel; this kernel was called bzImageMaxCPU. For the second, I reset the kernel build environment again and went through the ACPI settings, turning on what I had turned on in the debug kernel. This kernel was called bzImageACPI.

            The OP tested both and the bzImageMaxCPU was the only kernel that booted on those Dell servers. So in the end @Quazz was right about the CPU not liking some of its cores disabled.

            So I would recommend that we add the following settings to the official kernel build

            CONFIG_INTEL_IDLE
            and
            Processor type and features —> Enable Maximum number of SMP Processors and NUMA Nodes

            We have seen a recent uptick in reports of rcu_sched stalls with kernel panics. Maybe we are running into this issue more often as the core counts go up on these processors.
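To check whether a given kernel build already carries the two settings recommended above, something like the following could be used. It is a minimal sketch, not part of the FOG build scripts: the helper names and the example config text are hypothetical, and the menu entry "Enable Maximum number of SMP Processors and NUMA Nodes" is taken to be CONFIG_MAXSMP.

```python
import re

# Options recommended in this thread.
WANTED = ("CONFIG_INTEL_IDLE", "CONFIG_MAXSMP")


def parse_kconfig(text: str) -> dict:
    """Parse kernel .config text into {option: value}.

    Lines such as '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_options(text: str) -> list:
    """Return the recommended options that are not built in (=y)."""
    opts = parse_kconfig(text)
    return [name for name in WANTED if opts.get(name) != "y"]


# Hypothetical config fragment, for illustration only.
example = """\
CONFIG_SMP=y
CONFIG_NR_CPUS=8
# CONFIG_MAXSMP is not set
CONFIG_INTEL_IDLE=y
"""

print(missing_options(example))
```

Against a real build, the text would come from the `.config` file in the kernel source tree, e.g. `missing_options(open(".config").read())`.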

            • S
              Sebastian Roth Moderator
              last edited by Sep 20, 2019, 6:39 PM

              @george1421 @Quazz @djgalloway Great work!!! Thanks to you all. I will add this in the next days!

              Web GUI issue? Please check apache error (debian/ubuntu: /var/log/apache2/error.log, centos/fedora/rhel: /var/log/httpd/error_log) and php-fpm log (/var/log/php*-fpm.log)

              Please support FOG if you like it: https://wiki.fogproject.org/wiki/index.php/Support_FOG

              • S
                Sebastian Roth Moderator
                last edited by Sep 26, 2019, 10:46 AM

                @george1421 @Quazz I found a bit of time to look into this. Adding CONFIG_INTEL_IDLE should be just fine I think. But I am not exactly sure about adding CONFIG_MAXSMP (Enable Maximum number of SMP Processors and NUMA Nodes). Found this topic: https://www.xenomai.org/pipermail/xenomai/2018-July/039297.html

                Though I am not convinced this will actually cause trouble it’s still a bit risky. @Testers @Moderators. Would you be able to run a test kernel on several different client machines so we get a feeling of this being troublesome or not?

                • G
                  george1421 Moderator @Sebastian Roth
                  last edited by Sep 26, 2019, 12:31 PM

                  @Sebastian-Roth I can test it here, but I don’t have a system that is causing this rcu_sched issue. But I can surely test it against our current fleet of Dell systems to see if it does any harm.

                  We can also hold this “test” kernel in reserve in case this issue comes up again if you don’t want to release it as general availability. What I would not like to see is having a special kernel for this, and a different special kernel for that.

                  • S
                    Sebastian Roth Moderator
                    last edited by Sep 26, 2019, 12:45 PM

                    @george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:

                    We can also hold this “test” kernel in reserve in case this issue comes up again if you don’t want to release it as general availability.

                    Don’t get me wrong on this. I am more than happy to make this the default kernel for everyone. It comes at low cost. But I’d like to see this tested on several different machines (PC as well as notebooks and even servers if possible) before we make it the new default kernel.

                    • Q
                      Quazz Moderator @Sebastian Roth
                      last edited by Sep 26, 2019, 1:25 PM

                      @Sebastian-Roth As far as I understand it, Xenomai implements a patch to the kernel that does all kinds of stuff. Potentially CONFIG_MAXSMP is just not compatible with their patches, but as far as I know CONFIG_MAXSMP is in fact enabled by default on Kernel 4.4+ or so on all major distributions without issues.

                      That said, I don’t mind testing it.

                      • S
                        Sebastian Roth Moderator
                        last edited by Sep 26, 2019, 1:28 PM

                        @Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:

                        as far as I know CONFIG_MAXSMP is in fact enabled by default on Kernel 4.4+ or so on all major distributions without issues

                        That’s valuable information! Any reference for this?

                        • Q
                          Quazz Moderator @Sebastian Roth
                          last edited by Quazz Oct 2, 2019, 6:38 AM Oct 2, 2019, 12:37 PM

                          @Sebastian-Roth Hmm, I may have been misremembering, though their CONFIG_NR_CPUS is going to be much higher than 8 at the very least (at least 512 afaik).

                          The only difference I can find is that CONFIG_MAXSMP enables CPUMASK_OFFSTACK, which it requires to function correctly I believe (or any high CONFIG_NR_CPUS would at least).
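A rough illustration of why a high CONFIG_NR_CPUS goes together with CPUMASK_OFFSTACK: a CPU mask is a bitmap with one bit per possible CPU, so its size grows with NR_CPUS, and with CPUMASK_OFFSTACK such masks are allocated dynamically instead of being placed on the kernel stack. The sketch below only does the size arithmetic. The function name is made up, and the 8192 entry is an assumption about the NR_CPUS value CONFIG_MAXSMP selects on x86_64 that should be verified against the kernel's Kconfig.

```python
def cpumask_bytes(nr_cpus: int, bits_per_long: int = 64) -> int:
    """Size of a bitmap with one bit per possible CPU, rounded up to whole longs."""
    longs = -(-nr_cpus // bits_per_long)  # ceiling division
    return longs * (bits_per_long // 8)


# 8 is the current FOG value mentioned in this thread; 8192 is an assumed MAXSMP value.
for nr_cpus in (8, 64, 512, 8192):
    print(f"NR_CPUS={nr_cpus:5d} -> one cpumask is {cpumask_bytes(nr_cpus)} bytes")
```

With 64-bit longs, a mask for 8 CPUs fits in 8 bytes, while a mask for 8192 CPUs takes 1024 bytes, which is a lot to keep in every stack frame that uses one.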

                          • S
                            Sebastian Roth Moderator
                            last edited by Oct 21, 2019, 8:46 PM

                            @Quazz @george1421 Ok, back from travels… what shall we do with this pending topic? I do understand that adding CONFIG_MAXSMP does fix the rcu_sched stall issue on PowerEdge R640. But do we know if this fixes rcu_sched stalls on other platforms as well? Could we get at least two more people to test this before we add it to the official kernel?

                            @george1421 Did you get to test this kernel on your fleet of Dell hardware to see if it might cause any other harm?

                            • G
                              george1421 Moderator @Sebastian Roth
                              last edited by Oct 21, 2019, 8:50 PM

                              @Sebastian-Roth said in rcu_sched stall OR kernel panic on PowerEdge R640:

                              Did you get to test this kernel on your fleet of Dell hardware to see if it might cause any other harm?

                              TBH, no I did not test it. I haven’t found any other system that the max-cpu value fixed either. We had one dual core with the rcu_sched stall, but that was fixed with the current kernel and changing the acpi clock source that Quazz posted. I think the max-cpu will only impact CPUs with more than 8 cores.

                              • S
                                Sebastian Roth Moderator
                                last edited by Oct 21, 2019, 9:05 PM

                                @george1421 Looking through a stack of other rcu_sched stall topics in the forums I can’t seem to find any thread where I’d think that people had CPUs with more than 8 cores. Sure sooner or later this will be state of the art but I don’t reckon we should step ahead of this. We know the current kernel works pretty good on most CPUs and I’d rather point people to this topic and provide compile instructions than setting CONFIG_MAXSMP as default. Hmm?

                                • G
                                  george1421 Moderator @Sebastian Roth
                                  last edited by Oct 21, 2019, 9:30 PM

                                  @Sebastian-Roth I’m still on the fence about this, I would say turn it on because the core count continues to rise on these processors. What can it hurt? And on the other side we really don’t know what the impact could be.

                                  • S
                                    Sebastian Roth Moderator
                                    last edited by Oct 21, 2019, 9:42 PM

                                    @george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:

                                    And on the other side we really don’t know what the impact could be.

                                    Yes, because of that I don’t like switching it on.

                                    What can it hurt?

                                    I don’t know. Stalls on older CPUs?

                                    • Q
                                      Quazz Moderator @Sebastian Roth
                                      last edited by Quazz Oct 22, 2019, 1:54 AM Oct 22, 2019, 7:47 AM

                                      @Sebastian-Roth

                                      There are kernel flags to disable SMP if necessary, so I think it’s pretty safe to compile with MAXSMP. Just my opinion of course; without a diverse test fleet it’s hard to say for sure since kernels can always have bugs or unforeseen interactions. But that would be true for any change we make.

                                      I can’t find anything googling about stalls/problems with MAXSMP either. Only some people on embedded systems who want to reduce the size of their kernel, but that’s a targeted compile anyway.

                                      There will be more and more systems entering the floor with more than 8 cores (our current NR_CPUS value) given the recent CPU releases as well, so at the very least that number could use a bump.

                                      • S
                                        Sebastian Roth Moderator
                                        last edited by Oct 22, 2019, 10:16 AM

                                        @Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:

                                        I can’t find anything googling about stalls/problems with MAXSMP either. Only some people on embedded systems who want to reduce the size of their kernel, but that’s a targeted compile anyway.

                                        Ok, you and George have convinced me this is most probably not going to cause us much trouble, so I will add the options as mentioned below.

                                        • S
                                          Sebastian Roth Moderator
                                          last edited by Oct 22, 2019, 9:03 PM

                                          @george1421 @Quazz I just added the two kernel options as mentioned below to the x64 kernel config (not pushed the change yet).

                                          While CONFIG_INTEL_IDLE is available in the x86 (32 bit) config as well, CONFIG_MAXSMP is not (depends on X86_64 [=y]). Should I leave CONFIG_NR_CPUS set to 8 (default I think) or increase it to 16, 32, 64?

                                          For the ARM kernel config we have neither CONFIG_INTEL_IDLE nor CONFIG_MAXSMP, but we can adjust CONFIG_NR_CPUS too. 16, 32, 64?

                                          Copyright © 2012-2024 FOG Project