rcu_sched self detected stall on CPU when Deploying
-
For this specific host, manually register it, and then in the host definition add the following to the kernel args field:
acpi=off
Let's see where that gets us.
-
The original problem in that thread was solved afaik.
That said, rcu_sched stalls can be caused by many different things, and your case seems different.
Since it’s directly preceded by ACPI errors, disabling ACPI as George suggested is a good start.
-
FYI: Kernel 4.16.6 64-bit – works
- Reset kernels back to v4.19.64
- Deleted host from Fog database
- Manual registration
- Set Host Kernel Arguments with acpi=off
- Set to Deploy
- Kernel panic
-
@Wolfbane8653 Can you share a picture of the result with acpi=off?
Also can you share the specifications of the system?
-
Kernel 4.17.0 64-bit – does not work.
- Kernel 4.19.64 with acpi=off
-
@Quazz –
HP Compaq dc7800p Small Form Factor
Intel Core2 Duo CPU E6750 @ 2.66GHz
6GB RAM
160 GB HDD – WDC WD1600AAJS-00B4A0
BIOS Version 786F1 v01.04
Motherboard 0AA8h
-
@Wolfbane8653 Also make sure the firmware is updated on this target computer.
We just ran through debugging this issue with a new Intel Platinum processor. That processor had an issue with the number of cores being capped at 8. In this case the processor is old, so we should not be hitting that bug here.
-
@Wolfbane8653 A BIOS update may be required.
Other than that, trying out this kernel here might be interesting: https://drive.google.com/open?id=1ZiRWrrN3dv26bLwW8GAEdLtzGw5xkyQI
-
BIOS updated to v1.35 (Latest)
Kernel 4.19.64 still does not work.
Custom kernel still does not work.
-
@Wolfbane8653 Please try kernel argument
tsc=unstable
Then try kernel argument
clocksource=hpet
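To see which clocksource the kernel actually picked after booting with one of these arguments, the standard Linux sysfs entries can be read from a shell on the target machine (for example from a debug task). This is only a minimal sketch: the `CS_DIR` variable and the `show_clocksource` function name are made up for illustration; the sysfs paths are the usual Linux ones.

```shell
#!/bin/sh
# Standard sysfs location for clocksource information on Linux.
# CS_DIR is an illustrative override, not something FOG defines.
CS_DIR="${CS_DIR:-/sys/devices/system/clocksource/clocksource0}"

show_clocksource() {
    # Clocksource the kernel is using right now (e.g. tsc, hpet, acpi_pm).
    echo "current:   $(cat "$CS_DIR/current_clocksource")"
    # Clocksources the kernel has registered and could switch to.
    echo "available: $(cat "$CS_DIR/available_clocksource")"
}

# Only print if the sysfs entries are readable on this system.
if [ -r "$CS_DIR/current_clocksource" ]; then
    show_clocksource
else
    echo "no clocksource information found under $CS_DIR"
fi
```

If the machine boots with clocksource=hpet, the "current" line should report hpet rather than tsc.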
-
@Wolfbane8653 I’m currently building a FOS Linux kernel without ACPI support to see if we can get past the rcu_sched issue. I can say tracking down this type of issue does take time because it’s hardware/model specific. Getting one of the kernel parameters that Quazz mentioned to work is the preferable route. I’ll post a link to the noacpi kernel when it’s done building.
-
Here is a test kernel with no acpi functions supported: https://drive.google.com/open?id=1siERUC9h8MfQIXbqrQShKOHc55h5xK3q
Download it as bzImageNoACPI and move it to the /var/www/html/fog/service/ipxe directory on your FOG server. Then go into the host definition for this specific host with the rcu_sched error, enter bzImageNoACPI (watch the case) into the kernel field, and save the host configuration. Then PXE boot the target computer into imaging to see if we can get past the CPU stall.
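The copy step described above could look roughly like the following on the FOG server. This is a sketch under assumptions: the `install_kernel` helper, the `IPXE_DIR` variable, and the download location in the example are hypothetical, and the read permission is set on the assumption that the web server hands the kernel file to the booting client.

```shell
#!/bin/sh
# Destination directory named in the post; adjust if your FOG web root differs.
# IPXE_DIR is an illustrative variable, not a FOG setting.
IPXE_DIR="${IPXE_DIR:-/var/www/html/fog/service/ipxe}"

install_kernel() {
    # $1: path to the downloaded kernel file
    # $2: file name to install it under (case matters in the host's kernel field)
    src="$1"
    name="$2"
    if [ ! -f "$src" ]; then
        echo "kernel file not found: $src" >&2
        return 1
    fi
    cp "$src" "$IPXE_DIR/$name" || return 1
    # Make the file world-readable so the web server can serve it.
    chmod 644 "$IPXE_DIR/$name"
}

# Example, run on the FOG server (typically as root); the source path is hypothetical:
# install_kernel ~/Downloads/bzImageNoACPI bzImageNoACPI
```

Installing the test kernel under its own name leaves the stock bzImage untouched, so only hosts whose kernel field names bzImageNoACPI are affected.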
-
- tsc=unstable – works with Kernel 4.19.64
- clocksource=hpet – works with Kernel 4.19.64
So both of these commands work!
@george1421 – bzImageNoACPI creates a kernel panic. IDE is turned on in the BIOS. I do not use the RAID function for these machines.
-
I’m glad those commands worked.
So this is a problem that I think was introduced in the Spectre/Meltdown patches and only affects Core 2 CPUs.
I thought it was supposed to be fixed in Kernel 4.19, but apparently not.
-
@Wolfbane8653 said in rcu_sched self detected stall on CPU when Deploying:
tsc=unstable – works with Kernel 4.19.64
clocksource=hpet – works with Kernel 4.19.64
Great on fixing it with the timing source. That is the solution. As for my noacpi kernel, I figured that would happen because I also removed the ACPI boot device drivers. It was a risk, but the right answer is the kernel parameters with the stock kernel. Well done!
-
So I’m guessing I’m going to need to edit all 100 of my units to have this argument? Or are you working on having a new bzImage for me to test?
-
@Wolfbane8653 You can safely set this globally, unless you have even older CPUs.
-
Current solution: set the kernel to 4.19.64 and set the global option in FOG Configuration --> FOG Settings --> General Settings --> Kernel ARGS to
tsc=unstable
Luckily this is the last year for these machines.
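To confirm that the global setting really reaches a booting client, the kernel command line can be checked from a shell on that client. A minimal sketch, assuming a shell is available there (for example in a debug task); the `has_kernel_arg` helper is a made-up name, while /proc/cmdline is the standard Linux location.

```shell
#!/bin/sh
has_kernel_arg() {
    # $1: argument to look for, e.g. tsc=unstable
    # $2: file holding the command line (defaults to /proc/cmdline)
    # The command line is split on whitespace and each word is compared exactly.
    for arg in $(cat "${2:-/proc/cmdline}" 2>/dev/null); do
        [ "$arg" = "$1" ] && return 0
    done
    return 1
}

if has_kernel_arg "tsc=unstable"; then
    echo "tsc=unstable is on the kernel command line"
else
    echo "tsc=unstable was not found on the kernel command line"
fi
```

The same check works for clocksource=hpet or acpi=off by changing the argument passed to the function.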