rcu_sched stall OR kernel panic on PowerEdge R640
-
@george1421 OK, I have the latest kernel but put the original inits back in place.
# sha256sum init.xz
b690ba1f6a0888401e53bd680a86eaa8231d32649add60a4fe9e94d3972e2bc3 init.xz
# sha256sum init_32.xz
147619f3b1a5af1362c3e66d927ef1281ba04976a487f67cdb75003b03e1190a init_32.xz
# file bzImage
bzImage: Linux kernel x86 boot executable bzImage, version 4.19.64 (jenkins-agent@Tollana) #1 SMP Mon Aug 5 11:08:49 CDT 2, RO-rootFS, swap_dev 0x8, Normal VGA
I added acpi=off and booting just hangs now at
tftp://10.8.128.2/default.ipxe... ok
http://10.8ok28.2/fog/service/ipxe/boot.php... ok
init.xz...
-
@djgalloway In the fog settings -> fog configuration page there is a field called “log level” or something close to that. The default is 4, set it to 7 to see if we can get some of the prestartup error logs. It should not just hang.
-
@george1421 There is no additional output. Here is the machine’s boot.php:
#!ipxe
set fog-ip 10.8.128.2
set fog-webroot fog
set boot-url http://${fog-ip}/${fog-webroot}
kernel bzImage32 loglevel=7 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS0,115200 acpi=off mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=10.8.128.2 osid=50 irqpoll hostname=plena001 chkdsk=0 img=plena_rhel_7.7 imgType=n imgPartitionType=all imgid=46 imgFormat=0 PIGZ_COMP=-6 fdrive=/dev/sda hostearly=1 pct=5 ignorepg=1 type=up console=tty0 console=ttyS0,115200 acpi=off
imgfetch init_32.xz
boot
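A side observation on the kernel line above: console=tty0, console=ttyS0,115200 and acpi=off each appear twice. Why FOG emits them twice is not stated in the thread, so treat any explanation as a guess. A minimal Python sketch for spotting repeated tokens on a command line follows; the helper name duplicated_params and the shortened CMDLINE string are made up for illustration.

```python
from collections import Counter


def duplicated_params(cmdline):
    """Return every whitespace-separated token that occurs more than once."""
    counts = Counter(cmdline.split())
    return {token: n for token, n in counts.items() if n > 1}


# Shortened copy of the kernel line from the boot.php output above
# (most single-occurrence parameters dropped for brevity).
CMDLINE = (
    "loglevel=7 initrd=init_32.xz root=/dev/ram0 rw console=tty0 "
    "console=ttyS0,115200 acpi=off irqpoll type=up console=tty0 "
    "console=ttyS0,115200 acpi=off"
)

if __name__ == "__main__":
    for token, n in duplicated_params(CMDLINE).items():
        print(f"{token} appears {n} times")
```

The check only compares whole tokens, so console=tty0 and console=ttyS0,115200 count as different parameters, which is what you want when two consoles are configured on purpose.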
-
@djgalloway said in rcu_sched stall OR kernel panic on PowerEdge R640:
console=tty0 console=ttyS0,115200
Why is this in the kernel parameters? It's switching the console over to the serial port and not the display.
I would also wonder if this will be a problem in the future
fdrive=/dev/sda
Does the RAID array present itself as a SATA attached device (/dev/sda) or something else?
-
@george1421 console=tty0 so I get output via VGA or the iDRAC console and console=ttyS0,115200 so I also get output via the BMC’s Serial-Over-LAN interface. I just got rid of both console= parameters and it’s still stuck with no additional output.
I was first able to install this machine using Cobbler and the root drive is /dev/sda. Even if there was a disk problem, I would still expect to see some FOG output of an attempted Capture/Deploy task.
I did just confirm that I can Deploy an OS on a different machine type just to make sure the kernels/inits were okay too.
Thank you for your help so far!
-
@djgalloway Seems like this kernel version hangs on this particular hardware. We’d need to compile a debug-enabled kernel to figure out where exactly it hangs. Though as I don’t have much time these days I’d ask you to look into compiling the kernel yourself. We have instructions on this. Are you willing to go down this road?
-
@Sebastian-Roth Sure, I can follow docs. The wiki seems to be down at the moment though
-
@Sebastian-Roth I still have my kernel dev environment setup. What do we need to enable in the kernel for debugging?
-
@djgalloway Is this system in uefi or bios (legacy) mode?
So the only difference between the kernel starting and not is the acpi=off being used?
-
@george1421 Yes, BIOS mode.
-
@djgalloway Just for clarity
So the only difference between the kernel starting and not is the acpi=off being used?
Is this still accurate?
-
@george1421 Right. So, it turns out I had the wrong serial TTY set. I changed it to console=ttyS1,115200 without acpi=off and got the following:
Linux version 4.19.64 (jenkins-agent@Tollana) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP Mon Aug 5 11:08:49 CDT 2019
Command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=10.8.128.2 osi0
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
x86/fpu: xstate_offset[5]: 960, xstate_sizes[5]: 64
x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]: 512
x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
x86/fpu: xstate_offset[9]: 2560, xstate_sizes[9]: 8
x86/fpu: Enabled xstate features 0x2ff, context size is 2568 bytes, using 'compacted' format.
BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x000000005ddfefff] usable
BIOS-e820: [mem 0x000000005ddff000-0x000000006cffefff] reserved
BIOS-e820: [mem 0x000000006cfff000-0x000000006effefff] ACPI NVS
BIOS-e820: [mem 0x000000006efff000-0x000000006f7fefff] ACPI data
BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved
BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved
BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000183fffffff] usable
NX (Execute Disable) protection: active
SMBIOS 3.2 present.
DMI: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
tsc: Detected 2200.000 MHz processor
last_pfn = 0x1840000 max_arch_pfn = 0x400000000
x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
x2apic: enabled by BIOS, switching to x2apic ops
last_pfn = 0x6f800 max_arch_pfn = 0x400000000
Using GB pages for direct mapping
RAMDISK: [mem 0x5ca97000-0x5dd50fff]
ACPI: Early table checksum verification disabled
ACPI: RSDP 0x00000000000FE320 000024 (v02 DELL )
ACPI: XSDT 0x000000006F41B188 0000F4 (v01 DELL PE_SC3 00000000 01000013)
ACPI: FACP 0x000000006F7F9000 000114 (v06 DELL PE_SC3 00000000 DELL 00000001)
ACPI: DSDT 0x000000006F507000 2E2494 (v02 DELL PE_SC3 00000003 DELL 00000001)
ACPI: FACS 0x000000006EA6E000 000040
ACPI: SSDT 0x000000006F7FC000 00046C (v02 INTEL ADDRXLAT 00000001 INTL 20180508)
ACPI: WDAT 0x000000006F7FB000 000134 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: SLIC 0x000000006F7FA000 000024 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: HPET 0x000000006F7F8000 000038 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: APIC 0x000000006F7F6000 0016DE (v04 DELL PE_SC3 00000000 DELL 00000001)
ACPI: MCFG 0x000000006F7F5000 00003C (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: MIGT 0x000000006F7F4000 000040 (v01 DELL PE_SC3 00000000 DELL 00000001)
ACPI: MSCT 0x000000006F7F3000 000090 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: PCAT 0x000000006F7F2000 000088 (v02 DELL PE_SC3 00000002 DELL 00000001)
ACPI: PCCT 0x000000006F7F1000 00006E (v01 DELL PE_SC3 00000002 DELL 00000001)
ACPI: RASF 0x000000006F7F0000 000030 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: SLIT 0x000000006F7EF000 00042C (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: SRAT 0x000000006F7EC000 002D30 (v03 DELL PE_SC3 00000002 DELL 00000001)
ACPI: SVOS 0x000000006F7EB000 000032 (v01 DELL PE_SC3 00000000 DELL 00000001)
ACPI: WSMT 0x000000006F7EA000 000028 (v01 DELL PE_SC3 00000000 DELL 00000001)
ACPI: OEM4 0x000000006F459000 0AD1C1 (v02 INTEL CPU CST 00003000 INTL 20180508)
ACPI: SSDT 0x000000006F421000 037465 (v02 INTEL SSDT PM 00004000 INTL 20180508)
ACPI: SSDT 0x000000006F407000 000A1F (v02 DELL PE_SC3 00000000 DELL 00000001)
ACPI: SSDT 0x000000006F41D000 00357F (v02 INTEL SpsNm 00000002 INTL 20180508)
ACPI: SPCR 0x000000006F41C000 000050 (v02 00000000 00000000)
ACPI: DMAR 0x000000006F7FD000 000260 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: HEST 0x000000006F3F6000 00017C (v01 DELL PE_SC3 00000002 DELL 00000001)
ACPI: BERT 0x000000006F3F5000 000030 (v01 DELL PE_SC3 00000002 DELL 00000001)
ACPI: ERST 0x000000006F3F4000 000230 (v01 DELL PE_SC3 00000002 DELL 00000001)
ACPI: EINJ 0x000000006F3F3000 000150 (v01 DELL PE_SC3 00000002 DELL 00000001)
Setting APIC routing to cluster x2apic.
Zone ranges:
  DMA [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
  Normal [mem 0x0000000100000000-0x000000183fffffff]
Movable zone start for each node
Early memory node ranges
  node 0: [mem 0x0000000000001000-0x000000000008bfff]
  node 0: [mem 0x0000000000100000-0x000000005ddfefff]
  node 0: [mem 0x000000006f7ff000-0x000000006f7fffff]
  node 0: [mem 0x0000000100000000-0x000000183fffffff]
Reserved but unavailable: 117 pages
Initmem setup node 0 [mem 0x0000000000001000-0x000000183fffffff]
ACPI: PM-Timer IO Port: 0x508
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 8/0x4 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 9/0x24 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 10/0x18 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 11/0x38 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 12/0x10 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 13/0x30 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 14/0x16 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 15/0x36 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 16/0x12 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 17/0x32 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 18/0x14 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 19/0x34 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 20/0x1 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 21/0x21 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 22/0x9 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 23/0x29 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 24/0x3 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 25/0x23 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 26/0x7 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 27/0x27 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 28/0x5 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 29/0x25 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 30/0x19 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 31/0x39 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 32/0x11 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 33/0x31 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 34/0x17 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 35/0x37 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 36/0x13 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 37/0x33 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 38/0x15 ignored.
APIC: NR_CPUS/possible_cpus limit of 8 reached. Processor 39/0x35 ignored.
ACPI: X2APIC_NMI (uid[0xffffffff] high level lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31
IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39
IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47
IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55
IOAPIC[5]: apic_id 15, version 32, address 0xfec20000, GSI 72-79
IOAPIC[6]: apic_id 16, version 32, address 0xfec28000, GSI 80-87
IOAPIC[7]: apic_id 17, version 32, address 0xfec30000, GSI 88-95
IOAPIC[8]: apic_id 18, version 32, address 0xfec38000, GSI 96-103
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Using ACPI (MADT) for SMP configuration information
ACPI: HPET id: 0x8086a701 base: 0xfed00000
ACPI: SPCR: console: uart,io,0x2f8,115200
smpboot: 40 Processors exceeds NR_CPUS limit of 8
smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[mem 0x90000000-0xfcffffff] available for PCI devices
Booting paravirtualized kernel on bare hardware
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
random: get_random_bytes called from 0xffffffff82cafa32 with crng_init=0
setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
percpu: Embedded 41 pages/cpu s130840 r8192 d28904 u262144
Built 1 zonelists, mobility grouping on. Total pages: 24376830
Kernel command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=10.8.120
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
Memory: 97287048K/99055148K available (16392K kernel code, 992K rwdata, 4548K rodata, 1056K init, 2416K bss, 1768100K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
Kernel/User page tables isolation: enabled
rcu: Hierarchical RCU implementation.
NR_IRQS: 4352, nr_irqs: 1848, preallocated irqs: 16
Console: colour VGA+ 80x25
console [tty0] enabled
console [ttyS1] enabled
ACPI: Core revision 20180810
clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
APIC: Switch to symmetric I/O mode setup
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled
Switched APIC routing to flat.
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb633008a4, max_idle_ns: 440795292230 ns
Calibrating delay loop (skipped), value calculated using timer frequency.. 4400.00 BogoMIPS (lpj=2200000)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes)
ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
process: using mwait in idle threads
Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
Spectre V2 : Mitigation: Full generic retpoline
Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Spectre V2 : Enabling Restricted Speculation for firmware calls
Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl
Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
MDS: Mitigation: Clear CPU buffers
Freeing SMP alternatives memory: 52K
smpboot: CPU0: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4)
Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
... version: 4
... bit width: 48
... generic registers: 4
... value mask: 0000ffffffffffff
... max period: 00007fffffffffff
... fixed-purpose events: 3
... event mask: 000000070000000f
rcu: Hierarchical SRCU implementation.
smp: Bringing up secondary CPUs ...
x86: Booting SMP configuration:
.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
smp: Brought up 1 node, 8 CPUs
smpboot: Max logical packages: 10
smpboot: Total of 8 processors activated (35221.20 BogoMIPS)
devtmpfs: initialized
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
futex hash table entries: 2048 (order: 5, 131072 bytes)
xor: automatically using best checksumming function avx
pinctrl core: initialized pinctrl subsystem
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 0-....: (20999 ticks this GP) idle=03e/1/0x4000000000000002 softirq=10/10 fqs=5247
rcu: (t=21000 jiffies g=-1175 q=18)
NMI backtrace for cpu 0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.64 #1
Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
Call Trace:
 <IRQ>
 0xffffffff81d4c3d5
 0xffffffff81d4f95f
 ? 0xffffffff8102aa32
 0xffffffff81d4f9b8
 0xffffffff8107aafa
 0xffffffff8107a08b
 0xffffffff8107e1e6
 0xffffffff81087ecc
 0xffffffff81e01794
 0xffffffff81e0139f
 </IRQ>
RIP: 0010:0xffffffff8108d4db
Code: ee 89 c7 e8 40 ec cb 00 3b 05 45 63 86 01 73 1e 48 63 f0 49 8b 55 00 48 03 14 f5 00 53 62 82 8b 72 18 40 80 e6 01 74 04 f3 90 <eb> f3 eb d0 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 31 c9 85
RSP: 0000:ffffc9000007fae8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8897e1063000 RSI: 0000000000000001 RDI: ffff8897e101fc48
RBP: ffff8897e101fc48 R08: 00000000000000ff R09: ffff888000000000
R10: ffffc9000007fb60 R11: 0000000000000001 R12: 000000000001fc00
R13: ffff8897e101fc40 R14: 0000000000000000 R15: ffffffff82625300
 ? 0xffffffff8108d4b9
 ? 0xffffffff8103857d
 ? 0xffffffff8103857d
 0xffffffff8108d507
 0xffffffff8108d51b
 0xffffffff81035988
 0xffffffff81035a8e
 ? 0xffffffff810d853b
 ? 0xffffffff81cb3e73
 ? 0xffffffff81d4c10f
 ? 0xffffffff810cbe08
 0xffffffff81035cb8
 0xffffffff810366f2
 0xffffffff81095dbd
 0xffffffff81cb432c
 ? 0xffffffff82caf70b
 0xffffffff81cb43ea
 ? 0xffffffff82cee8ce
 0xffffffff82cef5c7
 0xffffffff82cee950
 0xffffffff8100040e
 ? 0xffffffff82caf70b
 0xffffffff82cafeed
 ? 0xffffffff81d5c631
 0xffffffff81d5c636
 0xffffffff81e00215
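A rough sanity check on the stall report in the log above: the kernel prints lpj=2200000 (derived from the timer frequency, since the delay-loop calibration was skipped) and a TSC of 2200.000 MHz. Assuming lpj is computed as tsc_khz * 1000 / HZ, which is my understanding of that code path and not something stated in the thread, the tick rate works out to 1000 Hz, so t=21000 jiffies means CPU 0 had been stuck for roughly 21 seconds when the warning fired. The constant and function names in this Python sketch are made up for illustration.

```python
# Values copied from the boot log above.
TSC_KHZ = 2_200_000      # "tsc: Detected 2200.000 MHz processor"
LPJ = 2_200_000          # "4400.00 BogoMIPS (lpj=2200000)"
STALL_JIFFIES = 21_000   # "rcu: (t=21000 jiffies g=-1175 q=18)"


def tick_rate_hz(tsc_khz, lpj):
    """Infer HZ, ASSUMING lpj was derived as tsc_khz * 1000 / HZ."""
    return tsc_khz * 1000 // lpj


def stall_seconds(jiffies, hz):
    """Convert a jiffies count into seconds for a given tick rate."""
    return jiffies / hz


if __name__ == "__main__":
    hz = tick_rate_hz(TSC_KHZ, LPJ)
    print(f"inferred HZ: {hz}")
    print(f"stall duration: {stall_seconds(STALL_JIFFIES, hz)} s")
```

If that assumption holds, the machine did not freeze instantly: PID 1 (swapper/0) was spinning in kernel init code on CPU 0 for the whole stall interval right after "pinctrl core: initialized pinctrl subsystem".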
WITH acpi=off and using ttyS1, it still hangs with no output to tty0 or ttyS1.
-
@george1421 said in rcu_sched stall OR kernel panic on PowerEdge R640:
I still have my kernel dev environment setup. What do we need to enable in the kernel for debugging?
First enable CONFIG_EARLY_PRINTK and CONFIG_EARLY_PRINTK_EFI in the kernel config and edit arch/x86/boot/compressed/eboot.c and search for the function called efi_main. Add print statements like efi_printk(sys_table, "Text output\n"); at various places in that function to find out where exactly it locks up.
Then when using the kernel the OP needs to add earlyprintk=efi (or earlyprintk=vga) to the kernel arguments.
-
@djgalloway This is going to be a bit of a hunt and peck game here.
Remove the acpi=off parameter and let's have it use just one CPU by adding in nosmp as a kernel parameter. It looks like it crashes just after it brings up SMP.
Also just for clarity, from the printout this is the processor that is currently in use:
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (family: 0x6, model: 0x55, stepping: 0x4)
-
Kernel command line: loglevel=7 initrd=init.xz root=/dev/ram0 rw ramdisk_size=127000 web=http://10.8.128.2/fog/ consoleblank=0 rootfstype=ext4 console=tty0 console=ttyS1,115200 nosmp mac=e4:43:4b:7d:a9:ba ftp=10.8.128.2 storage=10.8.128.2:/opt/fog/images/dev/ storageip=1p
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
Memory: 97288196K/99055148K available (16392K kernel code, 992K rwdata, 4548K rodata, 1056K init, 2416K bss, 1766952K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Kernel/User page tables isolation: enabled
rcu: Hierarchical RCU implementation.
rcu: RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=1.
rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
NR_IRQS: 4352, nr_irqs: 32, preallocated irqs: 16
Console: colour VGA+ 80x25
console [tty0] enabled
console [ttyS1] enabled
ACPI: Core revision 20180810
ACPI: setting ELCR to 0200 (from 0820)
clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
APIC: SMP mode deactivated
APIC: Switch to symmetric I/O mode setup in no SMP routine
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.64 #1
Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.2.11 06/13/2019
RIP: 0010:0xffffffff8102d1e6
Code: c2 48 8b 14 d5 00 53 62 82 4a 8b 1c 22 48 85 db 74 d7 3b 2b 75 d3 eb 14 48 8b 1d 25 60 d8 01 48 c7 05 1a 60 d8 01 00 00 00 00 <89> 2b 65 48 89 1d 98 7c fe 7e 65 8b 05 39 1f fe 7e 89 c0 f0 48 0f
RSP: 0000:ffffffff82803e98 EFLAGS: 00010202
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000040
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff828f36f0
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff81408bd7
R10: 0000000000000000 R11: 000000000000005c R12: 0000000000014e88
R13: ffffffff82d460a0 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000002812001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 0xffffffff81028ad4
 0xffffffff82cc1652
 0xffffffff82cb691a
 0xffffffff82cafd33
 0xffffffff810000d4
Modules linked in:
CR2: 0000000000000000
---[ end trace f19259880c7c4bbb ]---
RIP: 0010:0xffffffff8102d1e6
Code: c2 48 8b 14 d5 00 53 62 82 4a 8b 1c 22 48 85 db 74 d7 3b 2b 75 d3 eb 14 48 8b 1d 25 60 d8 01 48 c7 05 1a 60 d8 01 00 00 00 00 <89> 2b 65 48 89 1d 98 7c fe 7e 65 8b 05 39 1f fe 7e 89 c0 f0 48 0f
RSP: 0000:ffffffff82803e98 EFLAGS: 00010202
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000040
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff828f36f0
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff81408bd7
R10: 0000000000000000 R11: 000000000000005c R12: 0000000000014e88
R13: ffffffff82d460a0 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000002812001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
-
We may have to enable kernel config option CONFIG_INTEL_IDLE to improve support for certain Intel CPUs.
We may also want to bump up CONFIG_NR_CPUS from the default of 8 to 512 (common value on modern kernels) at least on the x64 config, though this one shouldn’t cause a crash.
That said, I am doubtful that would resolve this issue.
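Before rebuilding, it may help to confirm what the current kernel .config actually contains for these two options. The Python sketch below parses .config text into a dictionary; read_kconfig is a hypothetical helper and SAMPLE is an invented fragment, not the real FOG kernel config. Only NR_CPUS=8 is backed by the boot log, and the CONFIG_INTEL_IDLE line is assumed to be unset purely for the sake of the example.

```python
def read_kconfig(text):
    """Map option name -> value for kernel .config text.

    Lines of the form '# CONFIG_X is not set' are recorded as 'n'.
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# Invented fragment for illustration only; NOT the real FOG kernel config.
SAMPLE = """\
CONFIG_SMP=y
CONFIG_NR_CPUS=8
# CONFIG_INTEL_IDLE is not set
"""

if __name__ == "__main__":
    opts = read_kconfig(SAMPLE)
    for name in ("CONFIG_INTEL_IDLE", "CONFIG_NR_CPUS"):
        print(name, "=", opts.get(name, "<missing>"))
```

To check a real build tree, read the .config file from the kernel source directory and pass its contents to read_kconfig in place of SAMPLE.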
-
@Quazz said in rcu_sched stall OR kernel panic on PowerEdge R640:
We may also want to bump up CONFIG_NR_CPUS from the default of 8 to 512
I’ve seen this setting in the kernel. I considered requesting the value be set to 0 so it uses all available processors, but then I had to think: this is for imaging and not general-purpose use, so having 28 cores available for imaging doesn’t really help, because at most 4 threads (a guess) would be used during imaging since most of the process is single-threaded.
-
@george1421 Yes, I think that’s why it was left at 8 in the config, though perhaps some CPUs don’t handle a majority of their cores being ignored very well?
-
@Quazz That is surely something we can test.
-
@george1421 are you working on building a kernel with @Quazz’s suggestions or should I? I don’t have experience building a kernel from scratch but I can probably figure it out.