rcu_sched self detected stall on CPU when Deploying

Wolfbane8653

FOG v1.5.4
Kernel 4.19.64
Device: HP DC7800 (YES OLD!)

When deploying an image, “rcu_sched self detected stall on CPU” is displayed and some CPU statistics are shown. I figured it was a bad motherboard or CPU, so I replaced the entire board, and the same issue still occurred. I tried 3 other machines of the same model with no luck. The thread “rcu_sched self detected stall on CPU when capture” gave me the info I needed, but this apparently has not been fixed.

Is this due to a conflict between the age of the hardware and the new kernel updates?

This does not affect my other hardware at all; they all imaged perfectly with Kernel 4.19.64. Only the dc7800s are the issue.

FYI:
Kernel 4.19.64 64 – not working
Kernel 4.19.48 64 – not working
Kernel 4.19.36 64 – not working
Kernel 4.19.6 64 – need to test
Kernel 4.19.1 64 – need to test
…
Kernel 4.17.0 64 – not working
Kernel 4.16.6 64 – works
Kernel 4.15.2 64 – works

HP Compaq dc7800p Small Form Factor
Intel Core2 Duo CPU E6750 @ 2.66GHz
6GB RAM
160 GB HDD – WDC WD1600AAJS-00B4A0

BIOS Version 786F1 ~~v01.04~~ v01.35
Motherboard 0AA8h

george1421

For this specific host, manually register it, and then in the host definition add acpi=off to the kernel args field. Let's see where that gets us.
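To confirm that an argument such as acpi=off actually reached the booted kernel, the kernel command line can be inspected. Below is a minimal Python sketch, assuming a Linux system with Python is available on the target hardware (the FOG imaging environment itself may not ship Python, in which case reading /proc/cmdline by hand does the same job). The function names `parse_cmdline` and `read_cmdline` are hypothetical helpers for illustration, and the parser deliberately ignores quoted values that contain spaces.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {parameter: value}.

    Bare flags (no "=") map to None. Quoted values containing
    spaces are not handled; this is a simplification.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the running kernel (Linux only)."""
    return parse_cmdline(Path(path).read_text())
```

Calling `read_cmdline().get("acpi")` on the booted machine should return `"off"` if the host's kernel argument was applied, and `None` if it was not.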

Quazz

The original problem in that thread was solved afaik.

That said, an rcu_sched stall can be caused by many different things, and your case seems different.

Since it’s directly preceded by ACPI errors, disabling ACPI as George suggested is a good start.

Wolfbane8653

FYI 4.16.6 64 – works

Reset kernels back to v4.19.64
Deleted host from FOG database
Manual registration
Set Host Kernel Arguments with acpi=off
Set to Deploy
Kernel panic

Quazz

@Wolfbane8653 Can you share a picture of the result with acpi=off?

Also can you share the specifications of the system?

Wolfbane8653

Kernel 4.17.0 64 – does not work.

Kernel 4.19.64 with acpi=off

Wolfbane8653

@Quazz –

HP Compaq dc7800p Small Form Factor
Intel Core2 Duo CPU E6750 @ 2.66GHz
6GB RAM
160 GB HDD – WDC WD1600AAJS-00B4A0

BIOS Version 786F1 v01.04
Motherboard 0AA8h

george1421

@Wolfbane8653 Also make sure the firmware is updated on this target computer.

We just ran through debugging this issue with a new Intel Platinum processor. That processor had an issue with the number of cores being capped at 8. In this case the processor is old, so we should not be hitting that bug here.

Quazz

@Wolfbane8653 A BIOS update may be required.

Other than that, trying out this kernel here might be interesting: https://drive.google.com/open?id=1ZiRWrrN3dv26bLwW8GAEdLtzGw5xkyQI

Wolfbane8653

BIOS updated to v1.35 (Latest)
Kernel 4.19.64 still does not work.

Custom kernel still does not work.

Quazz

@Wolfbane8653 Please try kernel argument tsc=unstable

Then try kernel argument clocksource=hpet
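Both arguments steer the kernel away from the TSC as its timing source, so it can help to check which clocksource is actually in use before and after the change. Here is a minimal Python sketch, assuming a Linux system with Python booted on the affected machine; it reads the standard sysfs clocksource files. The function name `clocksource_status` is a hypothetical helper, and the `base` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path

# Standard sysfs location of the kernel's clocksource information.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

If the system has no Python, reading the same two files by hand gives the same information.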

george1421

@Wolfbane8653 I’m currently building a FOS Linux kernel without ACPI support to see if we can get past the rcu_sched issue. I can say tracking down this type of issue does take time because it’s hardware/model specific. Getting one of the kernel parameters that Quazz mentioned to work is the preferable route. I’ll post a link to the noacpi kernel when it’s done building.

george1421

Here is a test kernel with no ACPI functions supported: https://drive.google.com/open?id=1siERUC9h8MfQIXbqrQShKOHc55h5xK3q

Download it as bzImageNoACPI and move it to the /var/www/html/fog/service/ipxe directory on your FOG server. Then go into the host definition for the specific host with the rcu_sched error, enter bzImageNoACPI (watch the case) into the kernel field, and save the host configuration. Then PXE boot the target computer into imaging to see if we can get past the CPU stall.

Wolfbane8653

@Quazz

tsc=unstable – works with Kernel 4.19.64
clocksource=hpet – works with Kernel 4.19.64

So both of these commands work!

@george1421 – bzImageNoACPI creates a kernel panic. IDE is turned on in the BIOS. I do not use the RAID function for these machines.

Quazz

I’m glad those commands worked.

So this is a problem that I think was introduced in the Spectre/Meltdown patches and only affects Core 2 CPUs.

I thought it was supposed to be fixed in Kernel 4.19, but apparently not.

george1421

@Wolfbane8653 said in rcu_sched self detected stall on CPU when Deploying:

tsc=unstable – works with Kernel 4.19.64
clocksource=hpet – works with Kernel 4.19.64

Great work fixing it with the timing source. That is the solution. As for my noacpi kernel, I figured that would happen because I also removed the ACPI boot device drivers. It was a risk, but the right answer is the kernel parameters with the stock kernel. Well done!

Wolfbane8653

So I’m guessing I’m going to need to edit all 100 of my units to have this argument? Or are you working on having a new bzImage for me to test?

Quazz

@Wolfbane8653 You can safely set this globally, unless you have even older CPUs.

Wolfbane8653

Current solution: set the kernel to 4.19.64 and set the global option in FOG Configuration --> FOG Settings --> General Settings --> Kernel ARGS to tsc=unstable.

Luckily this is the last year for these machines.
