    rcu_sched stall OR kernel panic on PowerEdge R640

      Sebastian Roth Moderator

      @george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:

      I still have my kernel dev environment setup. What do we need to enable in the kernel for debugging?

      First enable CONFIG_EARLY_PRINTK and CONFIG_EARLY_PRINTK_EFI in the kernel config, then edit arch/x86/boot/compressed/eboot.c and search for the function called efi_main. Add print statements like efi_printk(sys_table, "Text output\n"); at various places in that function to find out exactly where it locks up.

      Then, when booting that kernel, the OP needs to add earlyprintk=efi (or earlyprintk=vga) to the kernel arguments.
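
Before handing a debug build to the OP, it may be worth confirming that both options really ended up in the kernel's .config. A minimal sketch of such a check follows; `missing_options` is a made-up helper name (not part of any kernel tooling), the sample config text is invented, and the check only recognises options set to `=y`.

```python
import re

def missing_options(config_text, options):
    """Return the entries of `options` that are not set to 'y' in a kernel .config."""
    enabled = set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))
    return [opt for opt in options if opt not in enabled]

# Invented .config excerpt, for illustration only.
sample = """\
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_EFI is not set
"""

print(missing_options(sample, ["CONFIG_EARLY_PRINTK", "CONFIG_EARLY_PRINTK_EFI"]))
# prints: ['CONFIG_EARLY_PRINTK_EFI']
```

An empty list would mean both options are enabled in the config that was checked.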

      Web GUI issue? Please check apache error (debian/ubuntu: /var/log/apache2/error.log, centos/fedora/rhel: /var/log/httpd/error_log) and php-fpm log (/var/log/php*-fpm.log)

      Please support FOG if you like it: https://wiki.fogproject.org/wiki/index.php/Support_FOG

        george1421 Moderator @djgalloway

        @djgalloway This is going to be a bit of a hunt and peck game here.

        Remove the acpi=off parameter and let's have it use just one CPU by adding nosmp as a kernel parameter. It looks like it crashes just after it brings up SMP.
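
However the arguments are set for the host, the change amounts to dropping one parameter and appending another. A small sketch of that edit is below; `adjust_kernel_args` is a hypothetical helper, and the starting argument string is made up (only acpi=off and nosmp come from this thread).

```python
def adjust_kernel_args(cmdline, remove=(), add=()):
    """Drop the parameters listed in `remove` and append those in `add` if absent."""
    args = [a for a in cmdline.split() if a not in remove]
    for extra in add:
        if extra not in args:
            args.append(extra)
    return " ".join(args)

# Hypothetical starting arguments, for illustration only.
before = "loglevel=7 consoleblank=0 acpi=off"
print(adjust_kernel_args(before, remove=["acpi=off"], add=["nosmp"]))
# prints: loglevel=7 consoleblank=0 nosmp
```

Comparing the result with the "Kernel command line:" entry in the boot log is a quick way to confirm the change took effect.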

        Also just for clarity, from the printout this is the processor that is currently in use: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4)

        Please help us build the FOG community with everyone involved. It's not just about coding - way more we need people to test things, update documentation and most importantly work on uniting the community of people enjoying and working on FOG!

          djgalloway @george1421

          Kernel command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 nosmp mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=1p
          Misrouted IRQ fixup and polling support enabled
          This may significantly impact system performance
          Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
          Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
          Memory: 97288196K/99055148K available (16392K kernel code, 992K rwdata, 4548K rodata, 1056K init, 2416K bss, 1766952K reserved, 0K cma-reserved)
          SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
          Kernel/User page tables isolation: enabled
          rcu: Hierarchical RCU implementation.
          rcu:    RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=1.
          rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
          NR_IRQS: 4352, nr_irqs: 32, preallocated irqs: 16
          Console: colour VGA+ 80x25
          console [tty0] enabled
          console [ttyS1] enabled
          ACPI: Core revision 20180810
          ACPI: setting ELCR to 0200 (from 0820)
          clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
          APIC: SMP mode deactivated
          APIC: Switch to symmetric I/O mode setup in no SMP routine
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
          PGD 0 P4D 0 
          Oops: 0002 [#1] SMP PTI
          CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.64 #1
          Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
          RIP: 0010:0xffffffff8102d1e6
          Code: c2 48 8b 14 d5 00 53 62 82 4a 8b 1c 22 48 85 db 74 d7 3b 2b 75 d3 eb 14 48 8b 1d 25 60 d8 01 48 c7 05 1a 60 d8 01 00 00 00 00 <89> 2b 65 48 89 1d 98 7c fe 7e 65 8b 05 39 1f fe 7e 89 c0 f0 48 0f
          RSP: 0000:ffffffff82803e98 EFLAGS: 00010202
          RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000040
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff828f36f0
          RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff81408bd7
          R10: 0000000000000000 R11: 000000000000005c R12: 0000000000014e88
          R13: ffffffff82d460a0 R14: 0000000000000000 R15: 0000000000000000
          FS:  0000000000000000(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000000 CR3: 0000000002812001 CR4: 00000000000606b0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           0xffffffff81028ad4
           0xffffffff82cc1652
           0xffffffff82cb691a
           0xffffffff82cafd33
           0xffffffff810000d4
          Modules linked in:
          CR2: 0000000000000000
          ---[ end trace f19259880c7c4bbb ]---
          RIP: 0010:0xffffffff8102d1e6
          Code: c2 48 8b 14 d5 00 53 62 82 4a 8b 1c 22 48 85 db 74 d7 3b 2b 75 d3 eb 14 48 8b 1d 25 60 d8 01 48 c7 05 1a 60 d8 01 00 00 00 00 <89> 2b 65 48 89 1d 98 7c fe 7e 65 8b 05 39 1f fe 7e 89 c0 f0 48 0f
          RSP: 0000:ffffffff82803e98 EFLAGS: 00010202
          RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000040
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff828f36f0
          RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff81408bd7
          R10: 0000000000000000 R11: 000000000000005c R12: 0000000000014e88
          R13: ffffffff82d460a0 R14: 0000000000000000 R15: 0000000000000000
          FS:  0000000000000000(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000000 CR3: 0000000002812001 CR4: 00000000000606b0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Kernel panic - not syncing: Attempted to kill the idle task!
          ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
          
            Quazz Moderator

            We may have to enable kernel config option CONFIG_INTEL_IDLE to improve support for certain Intel CPUs.

            We may also want to bump up CONFIG_NR_CPUS from the default of 8 to 512 (a common value on modern kernels), at least on the x64 config, though this one shouldn’t cause a crash.

            That said, I am doubtful that would resolve this issue.

              george1421 Moderator @Quazz

              @Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:

              We may also want to bump up CONFIG_NR_CPUS from the default of 8 to 512

              I’ve seen this setting in the kernel. I considered requesting the value set to 0 so it uses all available processors, but then I had to think: this is for imaging and not general-purpose use, so having 28 cores available for imaging doesn’t really help, because at most 4 threads (a guess) would be used during imaging since most of the process is single-threaded.

                Quazz Moderator @george1421

                @george1421 Yes, I think that’s why it was left at 8 in the config, though perhaps some CPUs don’t handle a majority of their cores being ignored very well?

                  george1421 Moderator @Quazz

                  @Quazz That is surely something we can test.

                    djgalloway

                    @george1421 are you working on building a kernel with @Quazz’s suggestions or should I? I don’t have experience building a kernel from scratch but I can probably figure it out.

                      george1421 Moderator @djgalloway

                      @djgalloway Sorry, I got sidetracked this AM. I almost had it built. Give me a few and I’ll send you a link to the kernel via IM chat.

                        djgalloway

                        Here’s the latest output using the debug kernel:

                        console [ttyS1] enabled
                        bootconsole [earlyvga0] disabled
                        ACPI: Core revision 20180810
                        clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
                        APIC: Switch to symmetric I/O mode setup
                        x2apic: IRQ remapping doesn't support X2APIC mode
                        x2apic disabled
                        Switched APIC routing to flat.
                        ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
                        clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb633008a4, max_idle_ns: 440795292230 ns
                        Calibrating delay loop (skipped), value calculated using timer frequency.. 4400.00 BogoMIPS (lpj=2200000)
                        pid_max: default: 32768 minimum: 301
                        Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes)
                        Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes)
                        ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
                        ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
                        process: using mwait in idle threads
                        Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
                        Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
                        Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
                        Spectre V2 : Mitigation: Full generic retpoline
                        Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
                        Spectre V2 : Enabling Restricted Speculation for firmware calls
                        Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
                        Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl
                        Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
                        MDS: Mitigation: Clear CPU buffers
                        Freeing SMP alternatives memory: 52K
                        smpboot: CPU0: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4)
                        Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
                        ... version:                4
                        ... bit width:              48
                        ... generic registers:      4
                        ... value mask:             0000ffffffffffff
                        ... max period:             00007fffffffffff
                        ... fixed-purpose events:   3
                        ... event mask:             000000070000000f
                        rcu: Hierarchical SRCU implementation.
                        smp: Bringing up secondary CPUs ...
                        x86: Booting SMP configuration:
                        .... node  #0, CPUs:      #1 #2 #3 #4 #5 #6 #7
                        smp: Brought up 1 node, 8 CPUs
                        smpboot: Max logical packages: 10
                        smpboot: Total of 8 processors activated (35220.85 BogoMIPS)
                        devtmpfs: initialized
                        clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
                        futex hash table entries: 2048 (order: 5, 131072 bytes)
                        xor: automatically using best checksumming function   avx       
                        pinctrl core: initialized pinctrl subsystem
                        rcu: INFO: rcu_sched self-detected stall on CPU
                        rcu:    0-....: (20999 ticks this GP) idle=04a/1/0x4000000000000002 softirq=10/10 fqs=5241 
                        rcu:     (t=21000 jiffies g=-1175 q=19)
                        NMI backtrace for cpu 0
                        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.65 #12
                        Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
                        Call Trace:
                         <IRQ>
                         0xffffffff81d6ecad
                         0xffffffff81d7222f
                         ? 0xffffffff8102b073
                         0xffffffff81d7228a
                         0xffffffff8107ce90
                         0xffffffff8107c41d
                         0xffffffff810806b4
                         0xffffffff8108a34e
                         0xffffffff81e017d5
                         0xffffffff81e013af
                         </IRQ>
                        RIP: 0010:0xffffffff8108fa1d
                        Code: 36 48 89 de 89 c7 e8 ca ef cd 00 3b 05 c0 13 86 01 73 24 48 63 f0 49 8b 16 48 03 14 f5 30 83 61 82 8b 72 18 40 80 e6 01 74 04 <f3> 90 eb f3 eb d1 0f 0b e9 72 fe ff ff 48 83 c4 10 5b 5d 41 5c 41
                        RSP: 0000:ffffc9000007fae0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
                        RAX: 0000000000000001 RBX: ffff8897e101fac8 RCX: 0000000000000001
                        RDX: ffff8897e10621c0 RSI: 0000000000000001 RDI: ffff8897e101fac8
                        RBP: 000000000001fa80 R08: 0000000000000000 R09: 00000000016daed4
                        R10: ffffc9000007fb58 R11: 000fffffffe00000 R12: 0000000000000001
                        R13: 0000000000000008 R14: ffff8897e101fac0 R15: 0000000000000000
                         ? 0xffffffff81039a
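
A side note on reading the stall report above: it says t=21000 jiffies, and the tick rate can be worked out from the "Calibrating delay loop" line, assuming the kernel prints BogoMIPS as lpj / (500000 / HZ). A small sketch of that arithmetic follows (the function names are made up for illustration):

```python
def hz_from_bogomips(bogomips, lpj):
    """Assumes BogoMIPS = lpj / (500000 / HZ), so HZ = BogoMIPS * 500000 / lpj."""
    return round(bogomips * 500000 / lpj)

def jiffies_to_seconds(jiffies, hz):
    """Convert a jiffies count to seconds for a given tick rate."""
    return jiffies / hz

# Values taken from the log: "4400.00 BogoMIPS (lpj=2200000)" and "t=21000 jiffies".
hz = hz_from_bogomips(4400.00, 2200000)
print(hz)                              # 1000
print(jiffies_to_seconds(21000, hz))   # 21.0
```

So CPU 0 had been stuck for about 21 seconds when the warning fired, which as far as I know is the default RCU stall timeout. The <f3> 90 marked in the Code: line is the x86 pause instruction, so CPU 0 appears to be spinning in a busy-wait loop rather than crashing outright.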
                        
                          george1421 Moderator @djgalloway

                          Just for grins I had the OP boot a 486 kernel I built for another poster for a specific dedicated machine to image with FOG. That kernel gave a bit more detail than the full system kernel.

                          Checking if this processor honours the WP bit even in supervisor mode...Ok.
                          SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
                          rcu: Hierarchical RCU implementation.
                          NR_IRQS: 2304, nr_irqs: 1848, preallocated irqs: 16
                          Console: colour VGA+ 80x25
                          console [tty0] enabled
                          console [ttyS1] enabled
                          ACPI: Core revision 20180810
                          clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
                          APIC: Switch to symmetric I/O mode setup
                          Enabling APIC mode:  Flat.  Using 9 I/O APICs
                          ------------[ cut here ]------------
                          Kernel BUG at 0xc1028128 [verbose debug info unavailable]
                          invalid opcode: 0000 [#1] SMP
                          CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.65 #2
                          Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
                          EIP: 0xc1028128
                          

                          It looks like the kernel is crashing while enabling APIC mode or during the I/O APIC setup. The hpet clocksource line also rings a bell for some reason.

                          So the kernel is crashing at the same point. For reference, the 486-compatible kernel is also “Linux version 4.19.65”.

                          acpi=ht acpi=oldboot acpi_osi=Linux

                          noapic

                            Junkhacker Developer @george1421

                            i was googling the problem a bit and i was curious, will it boot if you remove the raid card?
                            just trying to understand the source of the panic.

                            signature:
                            Junkhacker
                            We are here to help you. If you are unresponsive to our questions, don't expect us to be responsive to yours.

                              george1421 Moderator @Junkhacker

                              @Junkhacker @Sebastian-Roth

                              I was able to get the OP going by doing this and that.

                              We are not sure if it was this or that that got the kernel to boot. What I did was unlock the max CPUs (that was capped at 8) in the kernel, and I also enabled almost all of the ACPI modules in the kernel. We also tried the acpi_osi=Linux kernel parameter.

                              We ruled out the acpi_osi=Linux kernel parameter fixing the issue so it must be something I enabled in the kernel. Tomorrow AM I’m going to reset the kernel environment and only unlock the max CPUs. The OP is going to test that new kernel to see if it was unlocking the max cpu or it was the acpi modules I enabled.

                              Either way I’ll report where we ended up and which kernel change fixed the issue. I have also seen other recent CPU stalls like this that were fixed by setting acpi=off, so we may need to move whatever fixed the issue into the main kernel build because new hardware/CPUs may require it.

                                george1421 Moderator @george1421

                                @developers Here’s the final update on this issue.

                                I reset my kernel build environment and then created 2 new kernel builds. The first removed the imposed CPU limit on the Linux kernel; this kernel was called bzImageMaxCPU. For the second, I reset the kernel build environment again and then went through the ACPI settings, turning on what I had turned on in the debug kernel. This kernel was called bzImageACPI.

                                The OP tested both and the bzImageMaxCPU was the only kernel that booted on those Dell servers. So in the end @Quazz was right about the CPU not liking some of its cores disabled.

                                So I would recommend that we add the following settings to the official kernel build:

                                CONFIG_INTEL_IDLE
                                and
                                Processor type and features —> Enable Maximum number of SMP Processors and NUMA Nodes
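
In .config terms the two settings would presumably be CONFIG_INTEL_IDLE=y and CONFIG_MAXSMP=y (the symbol behind the "Enable Maximum number of SMP Processors and NUMA Nodes" menu entry). A rough sketch of flipping them in a config file is below; `enable_options` is a hypothetical helper and the sample config is invented. It only rewrites the text, so the kernel's own tooling (for example make olddefconfig) would still need to run afterwards so that dependent values such as CONFIG_NR_CPUS get recomputed.

```python
import re

def enable_options(config_text, options):
    """Return config_text with each option in `options` set to 'y'."""
    lines = config_text.splitlines()
    for opt in options:
        name = re.escape(opt)
        pattern = re.compile(rf"^(# {name} is not set|{name}=.*)$")
        for i, line in enumerate(lines):
            if pattern.match(line):
                lines[i] = f"{opt}=y"
                break
        else:
            # Option not mentioned at all: append it.
            lines.append(f"{opt}=y")
    return "\n".join(lines) + "\n"

# Invented .config excerpt, for illustration only.
sample = """\
CONFIG_NR_CPUS=8
# CONFIG_MAXSMP is not set
# CONFIG_INTEL_IDLE is not set
"""
print(enable_options(sample, ["CONFIG_INTEL_IDLE", "CONFIG_MAXSMP"]), end="")
```

The CONFIG_NR_CPUS=8 line is deliberately left alone here; changing it by hand is not part of the recommendation above.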

                                We have seen a recent uptick in reports of rcu_sched stalls with kernel panics. Maybe we are running into this issue more often as the core counts go up on these processors.

                                  Sebastian Roth Moderator

                                  @george1421 @Quazz @djgalloway Great work!!! Thanks to you all. I will add this in the next days!

                                    Sebastian Roth Moderator

                                    @george1421 @Quazz I found a bit of time to look into this. Adding CONFIG_INTEL_IDLE should be just fine I think. But I am not exactly sure about adding CONFIG_MAXSMP (Enable Maximum number of SMP Processors and NUMA Nodes). Found this topic: https://www.xenomai.org/pipermail/xenomai/2018-July/039297.html

                                    Though I am not convinced this will actually cause trouble, it’s still a bit risky. @Testers @Moderators, would you be able to run a test kernel on several different client machines so we get a feeling for whether this is troublesome or not?

                                      george1421 Moderator @Sebastian Roth

                                      @Sebastian-Roth I can test it here, but I don’t have a system that is causing this rcu_sched issue. But I can surely test it against our current fleet of Dell systems to see if it does any harm.

                                      We can also hold this “test” kernel in reserve in case this issue comes up again if you don’t want to release it as general availability. What I would not like to see is having a special kernel for this, and a different special kernel for that.

                                        Sebastian Roth Moderator

                                        @george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:

                                        We can also hold this “test” kernel in reserve in case this issue comes up again if you don’t want to release it as general availability.

                                        Don’t get me wrong on this. I am more than happy to make this the default kernel for everyone. It comes at low cost. But I’d like to see this tested on several different machines (PC as well as notebooks and even servers if possible) before we make it the new default kernel.

                                          Quazz Moderator @Sebastian Roth

                                          @Sebastian-Roth As far as I understand it, Xenomai implements a patch to the kernel that does all kinds of stuff, so potentially CONFIG_MAXSMP is just not compatible with their patches. But as far as I know CONFIG_MAXSMP is in fact enabled by default on Kernel 4.4+ or so on all major distributions without issues.

                                          That said, I don’t mind testing it.

                                            Sebastian Roth Moderator

                                            @Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:

                                            as far as I know CONFIG_MAXSMP is in fact enabled by default on Kernel 4.4+ or so on all major distributions without issues

                                            That’s valuable information! Any reference for this?

                                            Copyright © 2012-2024 FOG Project