rcu_sched stall OR kernel panic on PowerEdge R640
-
Using the latest versions of kernels and inits I get the following repeating indefinitely:
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 0-....: (20999 ticks this GP) idle=042/1/0x4000000000000002 softirq=8/8 fqs=5248
rcu: (t=21000 jiffies g=-1179 q=18)
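To put the `t=21000 jiffies` figure in wall-clock terms, here is a minimal Python sketch. It assumes the kernel prints BogoMIPS as roughly `lpj / (500000 / HZ)`, so the tick rate can be estimated from the `4400.00 BogoMIPS (lpj=2200000)` line in the boot log posted later in this thread. The function names are made up for illustration.

```python
import re


def hz_from_bogomips(bogomips: float, lpj: int) -> int:
    """Estimate the kernel tick rate (HZ) from a 'BogoMIPS (lpj=...)' boot line.

    Assumption: BogoMIPS is printed as approximately lpj / (500000 / HZ).
    """
    return round(500000 * bogomips / lpj)


def stall_seconds(stall_line: str, hz: int) -> float:
    """Convert the 't=<n> jiffies' field of an RCU stall report to seconds."""
    match = re.search(r"t=(\d+) jiffies", stall_line)
    if match is None:
        raise ValueError("no 't=<n> jiffies' field found")
    return int(match.group(1)) / hz


line = "rcu: (t=21000 jiffies g=-1179 q=18)"
hz = hz_from_bogomips(4400.00, 2200000)
print(hz, stall_seconds(line, hz))
```

Under that assumption the stall report fires after about 21 seconds of CPU 0 making no RCU progress, which says the boot is stuck rather than merely slow.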
I tried rolling back just inits to 1.5.2 as suggested here, as well as rolling back kernels AND inits but both result in this kernel panic:
Kernel BUG at (ptrval) [verbose debug info unavailable] invalid opcode: 0000 [#1] SMP PTI Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.2 #5 Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019 RIP: 0010:0xffffffff810252dd RSP: 0000:ffffffff82803ed8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002191c0 RCX: 00000000000001ac RDX: 0000007abb318fee RSI: 0000000000000002 RDI: 0000000000000020 RBP: 0000007abb318fee R08: 0000000000000000 R09: ffffffff82d93854 R10: 0000000000000000 R11: 0000000000000048 R12: 0000000000000000 R13: ffffffff82d1a0a0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88183ffff000 CR3: 0000000002812001 CR4: 00000000000606b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: 0xffffffff82c99749 0xffffffff82c8fed4 0xffffffff82c89c9a 0xffffffff810000d5 Code: 45 85 e4 74 10 59 5b 5d 41 5c 41 5d 41 5e 41 5f e9 6a 1f 00 00 e8 d8 e1 fd ff 48 8b 05 2d 32 60 01 ff 90 b0 00 00 00 85 c0 75 02 <0f> 0b 48 8b 05 1a 32 60 01 ff 90 c0 00 00 00 48 8b 05 0d 32 60 RIP: 0xffffffff810252dd RSP: ffffffff82803ed8 ---[ end trace 4f4168bda6c10f2c ]--- Kernel panic - not syncing: Attempted to kill the idle task! ---[ end Kernel panic - not syncing: Attempted to kill the idle task! random: crng init done
This is on a Dell PowerEdge R640 running BIOS 2.2.11. I confirmed the NIC is set to boot in ‘BIOS’ mode (not UEFI). Also tried another R640 with the same result.
-
First, let’s return the kernel and inits back to the way they were installed by FOG. You can upgrade the kernel to the latest, but the inits need to be what was installed by FOG.
Second, let’s try a kernel parameter of acpi=off. If this system hasn’t been registered yet, you can set this in the global parameters on the FOG Settings->FOG Configuration page. Just be aware this is a global parameter and will apply to all hosts when they are PXE booted. If the host has been registered, then you can go to the host definition and set the kernel parameters there; then it will only apply to this host.
I don’t know if this is a fix for your case, but we’ve been tracking the CPU stall messages and this has fixed the issue on other hardware.
-
@george1421 OK, I have the latest kernel but put the original inits back in place.
# sha256sum init.xz
b690ba1f6a0888401e53bd680a86eaa8231d32649add60a4fe9e94d3972e2bc3 init.xz
# sha256sum init_32.xz
147619f3b1a5af1362c3e66d927ef1281ba04976a487f67cdb75003b03e1190a init_32.xz
# file bzImage
bzImage: Linux kernel x86 boot executable bzImage, version 4.19.64 (jenkins-agent@Tollana) #1 SMP Mon Aug 5 11:08:49 CDT 2, RO-rootFS, swap_dev 0x8, Normal VGA
I added acpi=off and booting just hangs now at:
tftp://10.8.128.2/default.ipxe... ok
http://10.8ok28.2/fog/service/ipxe/boot.php... ok
init.xz...
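The checksum comparison above can also be scripted. This is a small sketch using Python's standard `hashlib`; the helper names are hypothetical, and the expected digest has to come from a source you trust (the thread does not say where the reference values are published).

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large init images need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("init.xz", "b690ba1f6a0888401e53bd680a86eaa8231d32649add60a4fe9e94d3972e2bc3")` would confirm that the file on disk is the same init that was checksummed in this post.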
-
@djgalloway In the fog settings -> fog configuration page there is a field called “log level” or something close to that. The default is 4, set it to 7 to see if we can get some of the prestartup error logs. It should not just hang.
-
@george1421 There is no additional output. Here is the machine’s boot.php:
#!ipxe
set fog-ip 10.8.128.2
set fog-webroot fog
set boot-url http://${fog-ip}/${fog-webroot}
kernel bzImage32 loglevel=7 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS0,115200 acpi=off mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=10.8.128.2 osid=50 irqpoll hostname=plena001 chkdsk=0 img=plena_rhel_7.7 imgType=n imgPartitionType=all imgid=46 imgFormat=0 PIGZ_COMP=-6 fdrive=/dev/sda hostearly=1 pct=5 ignorepg=1 type=up console=tty0 console=ttyS0,115200 acpi=off
imgfetch init_32.xz
boot
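The kernel line in this boot.php carries some parameters twice (`console=tty0`, `console=ttyS0,115200` and `acpi=off`). A quick way to spot that is sketched below in Python; the function name is made up, and the sample line is abbreviated from the one posted. Whether the repeats matter here is not established in the thread.

```python
from collections import Counter


def duplicated_params(kernel_line: str) -> dict:
    """Return kernel parameters that appear more than once, with their counts."""
    tokens = kernel_line.split()
    # Skip the iPXE 'kernel' verb and the image name when present.
    if tokens and tokens[0] == "kernel":
        tokens = tokens[2:]
    counts = Counter(tokens)
    return {token: n for token, n in counts.items() if n > 1}


# Abbreviated from the boot.php output above.
line = (
    "kernel bzImage32 loglevel=7 initrd=init_32.xz root=/dev/ram0 rw "
    "console=tty0 console=ttyS0,115200 acpi=off mac=e4:43:4b:7d:a9:ba "
    "type=up console=tty0 console=ttyS0,115200 acpi=off"
)
print(duplicated_params(line))
```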
-
@djgalloway said in rcu_sched stall OR kernel panic on PowerEdge R640:
console=tty0 console=ttyS0,115200
Why is this in the kernel parameters? It’s switching the console over to the serial port and not the display.
I would also wonder if this will be a problem in the future: fdrive=/dev/sda
Does the RAID array present itself as a SATA attached device (/dev/sda) or something else?
-
@george1421 console=tty0 so I get output via VGA or the iDRAC console, and console=ttyS0,115200 so I also get output via the BMC’s Serial-Over-LAN interface. I just got rid of both console= parameters and it’s still stuck with no additional output.
I was first able to install this machine using Cobbler and the root drive is /dev/sda. Even if there was a disk problem, I would still expect to see some FOG output of an attempted Capture/Deploy task.
I did just confirm that I can Deploy an OS on a different machine type just to make sure the kernels/inits were okay too.
Thank you for your help so far!
-
@djgalloway Seems like this kernel version hangs on this particular hardware. We’d need to compile a debug-enabled kernel to figure out where exactly it hangs. As I don’t have much time these days, I’d ask you to look into compiling the kernel yourself. We have instructions on this. Are you willing to go down this road?
-
@Sebastian-Roth Sure, I can follow docs. The wiki seems to be down at the moment though
-
@Sebastian-Roth I still have my kernel dev environment set up. What do we need to enable in the kernel for debugging?
-
@djgalloway Is this system in UEFI or BIOS (legacy) mode?
So the only difference between the kernel starting and not is the acpi=off being used?
-
@george1421 Yes, BIOS mode.
-
@djgalloway Just for clarity
So the only difference between the kernel starting and not is the acpi=off being used?
Is this still accurate?
-
@george1421 Right. So, it turns out I had the wrong serial TTY set. I changed it to console=ttyS1,115200 without acpi=off
and got the following:Linux version 4.19.64 (jenkins-agent@Tollana) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP Mon Aug 5 11:08:49 CDT 2019 Command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=10.8.128.2 osi0 KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers' x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR' x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask' x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256' x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256' x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers' x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64 x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64 x86/fpu: xstate_offset[5]: 960, xstate_sizes[5]: 64 x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]: 512 x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024 x86/fpu: xstate_offset[9]: 2560, xstate_sizes[9]: 8 x86/fpu: Enabled xstate features 0x2ff, context size is 2568 bytes, using 'compacted' format. 
BIOS-provided physical RAM map: BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved BIOS-e820: [mem 0x0000000000100000-0x000000005ddfefff] usable BIOS-e820: [mem 0x000000005ddff000-0x000000006cffefff] reserved BIOS-e820: [mem 0x000000006cfff000-0x000000006effefff] ACPI NVS BIOS-e820: [mem 0x000000006efff000-0x000000006f7fefff] ACPI data BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved BIOS-e820: [mem 0x0000000100000000-0x000000183fffffff] usable NX (Execute Disable) protection: active SMBIOS 3.2 present. DMI: Dell Inc. 
PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019 tsc: Detected 2200.000 MHz processor last_pfn = 0x1840000 max_arch_pfn = 0x400000000 x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT x2apic: enabled by BIOS, switching to x2apic ops last_pfn = 0x6f800 max_arch_pfn = 0x400000000 Using GB pages for direct mapping RAMDISK: [mem 0x5ca97000-0x5dd50fff] ACPI: Early table checksum verification disabled ACPI: RSDP 0x00000000000FE320 000024 (v02 DELL ) ACPI: XSDT 0x000000006F41B188 0000F4 (v01 DELL PE_SC3 00000000 01000013) ACPI: FACP 0x000000006F7F9000 000114 (v06 DELL PE_SC3 00000000 DELL 00000001) ACPI: DSDT 0x000000006F507000 2E2494 (v02 DELL PE_SC3 00000003 DELL 00000001) ACPI: FACS 0x000000006EA6E000 000040 ACPI: SSDT 0x000000006F7FC000 00046C (v02 INTEL ADDRXLAT 00000001 INTL 20180508) ACPI: WDAT 0x000000006F7FB000 000134 (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: SLIC 0x000000006F7FA000 000024 (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: HPET 0x000000006F7F8000 000038 (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: APIC 0x000000006F7F6000 0016DE (v04 DELL PE_SC3 00000000 DELL 00000001) ACPI: MCFG 0x000000006F7F5000 00003C (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: MIGT 0x000000006F7F4000 000040 (v01 DELL PE_SC3 00000000 DELL 00000001) ACPI: MSCT 0x000000006F7F3000 000090 (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: PCAT 0x000000006F7F2000 000088 (v02 DELL PE_SC3 00000002 DELL 00000001) ACPI: PCCT 0x000000006F7F1000 00006E (v01 DELL PE_SC3 00000002 DELL 00000001) ACPI: RASF 0x000000006F7F0000 000030 (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: SLIT 0x000000006F7EF000 00042C (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: SRAT 0x000000006F7EC000 002D30 (v03 DELL PE_SC3 00000002 DELL 00000001) ACPI: SVOS 0x000000006F7EB000 000032 (v01 DELL PE_SC3 00000000 DELL 00000001) ACPI: WSMT 0x000000006F7EA000 000028 (v01 DELL PE_SC3 00000000 DELL 00000001) ACPI: OEM4 0x000000006F459000 0AD1C1 (v02 INTEL CPU CST 00003000 INTL 20180508) ACPI: SSDT 0x000000006F421000 
037465 (v02 INTEL SSDT PM 00004000 INTL 20180508) ACPI: SSDT 0x000000006F407000 000A1F (v02 DELL PE_SC3 00000000 DELL 00000001) ACPI: SSDT 0x000000006F41D000 00357F (v02 INTEL SpsNm 00000002 INTL 20180508) ACPI: SPCR 0x000000006F41C000 000050 (v02 00000000 00000000) ACPI: DMAR 0x000000006F7FD000 000260 (v01 DELL PE_SC3 00000001 DELL 00000001) ACPI: HEST 0x000000006F3F6000 00017C (v01 DELL PE_SC3 00000002 DELL 00000001) ACPI: BERT 0x000000006F3F5000 000030 (v01 DELL PE_SC3 00000002 DELL 00000001) ACPI: ERST 0x000000006F3F4000 000230 (v01 DELL PE_SC3 00000002 DELL 00000001) ACPI: EINJ 0x000000006F3F3000 000150 (v01 DELL PE_SC3 00000002 DELL 00000001) Setting APIC routing to cluster x2apic. Zone ranges: DMA [mem 0x0000000000001000-0x0000000000ffffff] DMA32 [mem 0x0000000001000000-0x00000000ffffffff] Normal [mem 0x0000000100000000-0x000000183fffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000000001000-0x000000000008bfff] node 0: [mem 0x0000000000100000-0x000000005ddfefff] node 0: [mem 0x000000006f7ff000-0x000000006f7fffff] node 0: [mem 0x0000000100000000-0x000000183fffffff] Reserved but unavailable: 117 pages Initmem setup node 0 [mem 0x0000000000001000-0x000000183fffffff] ACPI: PM-Timer IO Port: 0x508 APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 8/0x4 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 9/0x24 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 10/0x18 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 11/0x38 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 12/0x10 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 13/0x30 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 14/0x16 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 15/0x36 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 16/0x12 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. 
Processor 17/0x32 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 18/0x14 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 19/0x34 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 20/0x1 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 21/0x21 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 22/0x9 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 23/0x29 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 24/0x3 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 25/0x23 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 26/0x7 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 27/0x27 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 28/0x5 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 29/0x25 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 30/0x19 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 31/0x39 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 32/0x11 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 33/0x31 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 34/0x17 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 35/0x37 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 36/0x13 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 37/0x33 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 38/0x15 ignored. APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 39/0x35 ignored. 
ACPI: X2APIC_NMI (uid[0xffffffff] high level lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1]) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23 IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31 IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39 IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47 IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55 IOAPIC[5]: apic_id 15, version 32, address 0xfec20000, GSI 72-79 IOAPIC[6]: apic_id 16, version 32, address 0xfec28000, GSI 80-87 IOAPIC[7]: apic_id 17, version 32, address 0xfec30000, GSI 88-95 IOAPIC[8]: apic_id 18, version 32, address 0xfec38000, GSI 96-103 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) Using ACPI (MADT) for SMP configuration information ACPI: HPET id: 0x8086a701 base: 0xfed00000 ACPI: SPCR: console: uart,io,0x2f8,115200 smpboot: 40 Processors exceeds NR_CPUS limit of 8 smpboot: Allowing 8 CPUs, 0 hotplug CPUs [mem 0x90000000-0xfcffffff] available for PCI devices Booting paravirtualized kernel on bare hardware clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns random: get_random_bytes called from 0xffffffff82cafa32 with crng_init=0 setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1 percpu: Embedded 41 pages/cpu s130840 r8192 d28904 u262144 Built 1 zonelists, mobility grouping on. 
Total pages: 24376830 Kernel command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=10.8.120 Misrouted IRQ fixup and polling support enabled This may significantly impact system performance Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes) Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes) Memory: 97287048K/99055148K available (16392K kernel code, 992K rwdata, 4548K rodata, 1056K init, 2416K bss, 1768100K reserved, 0K cma-reserved) SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1 Kernel/User page tables isolation: enabled rcu: Hierarchical RCU implementation. NR_IRQS: 4352, nr_irqs: 1848, preallocated irqs: 16 Console: colour VGA+ 80x25 console [tty0] enabled console [ttyS1] enabled ACPI: Core revision 20180810 clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns APIC: Switch to symmetric I/O mode setup x2apic: IRQ remapping doesn't support X2APIC mode x2apic disabled Switched APIC routing to flat. ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb633008a4, max_idle_ns: 440795292230 ns Calibrating delay loop (skipped), value calculated using timer frequency.. 
4400.00 BogoMIPS (lpj=2200000) pid_max: default: 32768 minimum: 301 Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes) Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes) ENERGY_PERF_BIAS: Set to 'normal', was 'performance' ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8) process: using mwait in idle threads Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8 Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4 Spectre V2 : Mitigation: Full generic retpoline Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch Spectre V2 : Enabling Restricted Speculation for firmware calls Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp MDS: Mitigation: Clear CPU buffers Freeing SMP alternatives memory: 52K smpboot: CPU0: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4) Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver. ... version: 4 ... bit width: 48 ... generic registers: 4 ... value mask: 0000ffffffffffff ... max period: 00007fffffffffff ... fixed-purpose events: 3 ... event mask: 000000070000000f rcu: Hierarchical SRCU implementation. smp: Bringing up secondary CPUs ... x86: Booting SMP configuration: .... 
node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 smp: Brought up 1 node, 8 CPUs smpboot: Max logical packages: 10 smpboot: Total of 8 processors activated (35221.20 BogoMIPS) devtmpfs: initialized clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns futex hash table entries: 2048 (order: 5, 131072 bytes) xor: automatically using best checksumming function avx pinctrl core: initialized pinctrl subsystem rcu: INFO: rcu_sched self-detected stall on CPU rcu: 0-....: (20999 ticks this GP) idle=03e/1/0x4000000000000002 softirq=10/10 fqs=5247 rcu: (t=21000 jiffies g=-1175 q=18) NMI backtrace for cpu 0 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.64 #1 Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019 Call Trace: <IRQ> 0xffffffff81d4c3d5 0xffffffff81d4f95f ? 0xffffffff8102aa32 0xffffffff81d4f9b8 0xffffffff8107aafa 0xffffffff8107a08b 0xffffffff8107e1e6 0xffffffff81087ecc 0xffffffff81e01794 0xffffffff81e0139f </IRQ> RIP: 0010:0xffffffff8108d4db Code: ee 89 c7 e8 40 ec cb 00 3b 05 45 63 86 01 73 1e 48 63 f0 49 8b 55 00 48 03 14 f5 00 53 62 82 8b 72 18 40 80 e6 01 74 04 f3 90 <eb> f3 eb d0 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 31 c9 85 RSP: 0000:ffffc9000007fae8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001 RDX: ffff8897e1063000 RSI: 0000000000000001 RDI: ffff8897e101fc48 RBP: ffff8897e101fc48 R08: 00000000000000ff R09: ffff888000000000 R10: ffffc9000007fb60 R11: 0000000000000001 R12: 000000000001fc00 R13: ffff8897e101fc40 R14: 0000000000000000 R15: ffffffff82625300 ? 0xffffffff8108d4b9 ? 0xffffffff8103857d ? 0xffffffff8103857d 0xffffffff8108d507 0xffffffff8108d51b 0xffffffff81035988 0xffffffff81035a8e ? 0xffffffff810d853b ? 0xffffffff81cb3e73 ? 0xffffffff81d4c10f ? 0xffffffff810cbe08 0xffffffff81035cb8 0xffffffff810366f2 0xffffffff81095dbd 0xffffffff81cb432c ? 0xffffffff82caf70b 0xffffffff81cb43ea ? 
0xffffffff82cee8ce 0xffffffff82cef5c7 0xffffffff82cee950 0xffffffff8100040e ? 0xffffffff82caf70b 0xffffffff82cafeed ? 0xffffffff81d5c631 0xffffffff81d5c636 0xffffffff81e00215
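The boot log above contains `ACPI: SPCR: console: uart,io,0x2f8,115200`, which is consistent with ttyS1 being the right serial console: on legacy PC hardware the I/O port 0x2f8 conventionally belongs to the second UART. A rough Python sketch of that lookup follows; the function name is made up, and the port table only covers the four conventional legacy addresses.

```python
import re

# Conventional legacy PC I/O port bases for the first four 8250/16550 UARTs.
LEGACY_UART_PORTS = {0x3F8: "ttyS0", 0x2F8: "ttyS1", 0x3E8: "ttyS2", 0x2E8: "ttyS3"}


def console_from_spcr(dmesg_text: str):
    """Suggest a console= parameter from an 'ACPI: SPCR: console:' line, or None."""
    match = re.search(
        r"ACPI: SPCR: console: uart,io,(0x[0-9a-fA-F]+),(\d+)", dmesg_text
    )
    if match is None:
        return None
    name = LEGACY_UART_PORTS.get(int(match.group(1), 16))
    if name is None:
        # Non-standard port: do not guess a device name.
        return None
    return f"console={name},{match.group(2)}"


print(console_from_spcr("ACPI: SPCR: console: uart,io,0x2f8,115200"))
```

This only helps once some kernel output is already visible somewhere (here, on tty0), since it reads the SPCR line from the log itself.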
WITH acpi=off and using ttyS1, it still hangs with no output to tty0 or ttyS1.
-
@george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:
I still have my kernel dev environment setup. What do we need to enable in the kernel for debugging?
First enable CONFIG_EARLY_PRINTK and CONFIG_EARLY_PRINTK_EFI in the kernel config, then edit arch/x86/boot/compressed/eboot.c and search for the function called efi_main. Add print statements like efi_printk(sys_table, "Text output\n"); at various places in that function to find out where exactly it locks up.
Then, when using the kernel, the OP needs to add earlyprintk=efi (or earlyprintk=vga) to the kernel arguments.
-
@djgalloway This is going to be a bit of a hunt and peck game here.
Remove the acpi=off parameter and let’s have it use just one CPU by adding nosmp as a kernel parameter. It looks like it crashes just after it brings up SMP.
Also, just for clarity, from the printout this is the processor that is currently in use:
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4)
-
Kernel command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 nosmp mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=1p Misrouted IRQ fixup and polling support enabled This may significantly impact system performance Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes) Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes) Memory: 97288196K/99055148K available (16392K kernel code, 992K rwdata, 4548K rodata, 1056K init, 2416K bss, 1766952K reserved, 0K cma-reserved) SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 Kernel/User page tables isolation: enabled rcu: Hierarchical RCU implementation. rcu: RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=1. rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1 NR_IRQS: 4352, nr_irqs: 32, preallocated irqs: 16 Console: colour VGA+ 80x25 console [tty0] enabled console [ttyS1] enabled ACPI: Core revision 20180810 ACPI: setting ELCR to 0200 (from 0820) clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns APIC: SMP mode deactivated APIC: Switch to symmetric I/O mode setup in no SMP routine BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.64 #1 Hardware name: Dell Inc. 
PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019 RIP: 0010:0xffffffff8102d1e6 Code: c2 48 8b 14 d5 00 53 62 82 4a 8b 1c 22 48 85 db 74 d7 3b 2b 75 d3 eb 14 48 8b 1d 25 60 d8 01 48 c7 05 1a 60 d8 01 00 00 00 00 <89> 2b 65 48 89 1d 98 7c fe 7e 65 8b 05 39 1f fe 7e 89 c0 f0 48 0f RSP: 0000:ffffffff82803e98 EFLAGS: 00010202 RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000040 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff828f36f0 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff81408bd7 R10: 0000000000000000 R11: 000000000000005c R12: 0000000000014e88 R13: ffffffff82d460a0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002812001 CR4: 00000000000606b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: 0xffffffff81028ad4 0xffffffff82cc1652 0xffffffff82cb691a 0xffffffff82cafd33 0xffffffff810000d4 Modules linked in: CR2: 0000000000000000 ---[ end trace f19259880c7c4bbb ]--- RIP: 0010:0xffffffff8102d1e6 Code: c2 48 8b 14 d5 00 53 62 82 4a 8b 1c 22 48 85 db 74 d7 3b 2b 75 d3 eb 14 48 8b 1d 25 60 d8 01 48 c7 05 1a 60 d8 01 00 00 00 00 <89> 2b 65 48 89 1d 98 7c fe 7e 65 8b 05 39 1f fe 7e 89 c0 f0 48 0f RSP: 0000:ffffffff82803e98 EFLAGS: 00010202 RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000040 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff828f36f0 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff81408bd7 R10: 0000000000000000 R11: 000000000000005c R12: 0000000000014e88 R13: ffffffff82d460a0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002812001 CR4: 00000000000606b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Kernel panic - not syncing: Attempted to kill the idle task! ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
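The `Oops: 0002` value in this log is the x86 page-fault error code printed in hex, and its low bits say what kind of access faulted. Below is a minimal Python sketch of decoding it; the function name is made up for illustration, and only the five low-order bits are handled.

```python
def decode_page_fault_code(code: int) -> list:
    """Describe the low bits of an x86 page-fault error code."""
    flags = []
    # Bit 0: 0 = page not present, 1 = protection violation.
    flags.append("protection violation" if code & 0x1 else "page not present")
    # Bit 1: 0 = read, 1 = write.
    flags.append("write access" if code & 0x2 else "read access")
    # Bit 2: 0 = kernel mode, 1 = user mode.
    flags.append("user mode" if code & 0x4 else "kernel mode")
    if code & 0x8:
        flags.append("reserved bit set in page table")
    if code & 0x10:
        flags.append("instruction fetch")
    return flags


print(decode_page_fault_code(int("0002", 16)))
```

Read this way, the nosmp run died on a kernel-mode write to an unmapped page, which fits the reported NULL pointer dereference with CR2 at 0000000000000000.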
-
We may have to enable the kernel config option CONFIG_INTEL_IDLE to improve support for certain Intel CPUs.
We may also want to bump up CONFIG_NR_CPUS from the default of 8 to 512 (a common value on modern kernels), at least on the x64 config, though this one shouldn’t cause a crash.
That said, I am doubtful that would resolve this issue.
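Before rebuilding, it can help to check what the current kernel config actually sets. The Python sketch below reads a `.config`-style text; the sample content is hypothetical (only the NR_CPUS value of 8 is confirmed by the boot log in this thread), and the function name is made up.

```python
def read_kconfig(text: str) -> dict:
    """Parse kernel .config text into {option: value}; unset options map to 'n'."""
    options = {}
    suffix = " is not set"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Lines like '# CONFIG_FOO is not set' mark a disabled option.
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# Hypothetical sample, not the real FOG kernel config.
sample = """\
CONFIG_NR_CPUS=8
# CONFIG_INTEL_IDLE is not set
CONFIG_EARLY_PRINTK=y
"""
opts = read_kconfig(sample)
print(opts.get("CONFIG_NR_CPUS"), opts.get("CONFIG_INTEL_IDLE"))
```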
-
@Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:
We may also want to bump up CONFIG_NR_CPUS from the default of 8 to 512
I’ve seen this setting in the kernel. I considered requesting the value be set to 0 so it uses all available processors, but then I had to think: this is for imaging and not general-purpose use, so having 28 cores available for imaging doesn’t really help, because at most 4 threads (a guess) would be used during imaging since most of the process is single threaded.
-
@george1421 Yes, I think that’s why it was left at 8 in the config, though perhaps some CPUs don’t handle a majority of their cores being ignored very well?
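To see how many cores a given boot actually left unused, the relevant dmesg lines can be tallied. This Python sketch uses regular expressions written against the message formats seen in the log earlier in this thread; the function name is made up, and the sample is a two-entry excerpt rather than the full log.

```python
import re


def cpu_limit_summary(dmesg_text: str) -> dict:
    """Summarise NR_CPUS-related boot messages: limit, ignored count, detected total."""
    ignored = re.findall(r"Processor (\d+)/0x[0-9a-fA-F]+ ignored", dmesg_text)
    limit = re.search(r"NR_CPUS/possible_cpus limit of (\d+) reached", dmesg_text)
    total = re.search(r"smpboot: (\d+) Processors exceeds NR_CPUS limit", dmesg_text)
    return {
        "limit": int(limit.group(1)) if limit else None,
        "ignored": len(ignored),
        "detected": int(total.group(1)) if total else None,
    }


# Short excerpt in the same format as the boot log posted above.
sample = (
    "APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 8/0x4 ignored. "
    "APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 9/0x24 ignored. "
    "smpboot: 40 Processors exceeds NR_CPUS limit of 8 "
    "smpboot: Allowing 8 CPUs, 0 hotplug CPUs"
)
print(cpu_limit_summary(sample))
```

On the full log from the R640, processors 8 through 39 are reported as ignored, so 32 of the 40 logical CPUs are left out with the limit at 8.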