@fenix_team said in Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job:
My original FOS image is the latest available on the Kernel Update GUI page, which currently is 4.19.6 (both bzImage and bzImage32).
I just finished one of the two machines with systems as described in the OP. The capture job succeeded without so much as a single warning! The system with the American Megatrends BIOS also made it smoothly past the point at which the issue was happening.
Hey, thanks for reporting so many details about this! I have started looking into this and reading all the messages posted. That one really caught my attention. Are you saying that it does work “sometimes” without an issue? Is that on the same kernel version 4.19.6 that caused the error initially posted? That would make it even harder for us to nail this issue down.
And a quick comment on the kernel/init versions: there is no strict rule that kernels are compiled against exactly one init version or vice versa. But looking into this more closely, I just figured something out that I wasn’t aware of until now: there is an option within buildroot (the toolstack we use for the inits) that is used to optimize the glibc compilation. The more recent the kernel version you choose, the less compatibility code needs to be built into glibc, and therefore the smaller the binaries are. Sounds pretty straightforward, and if I had known this before (there are hundreds of buildroot options and I really don’t know exactly what they all do), I would have built with more compatibility!
I will compile a new set of inits with more compatibility now and see if it makes much of a difference in size. I guess it won’t, as the inits are huge (just under 20 MB) anyway. We’ll see. I will let you all know.
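For anyone who wants to follow along: as far as I understand it, buildroot passes the chosen kernel headers series down to glibc’s configure step (glibc’s --enable-kernel option), and that is what decides how much runtime compatibility code ends up in the binaries. Just as a sketch, assuming the option names from a recent buildroot release (the exact symbols and the version string here are illustrative, not our actual config):

    # buildroot defconfig fragment (sketch; option names assumed from
    # a recent buildroot release - adjust to the tree you build from)
    BR2_KERNEL_HEADERS_VERSION=y          # manually specified headers series
    BR2_DEFAULT_KERNEL_VERSION="4.19.6"   # kernel the toolchain targets
    # Picking an older series here makes glibc keep more runtime
    # compatibility code: bigger binaries, but they run on older kernels.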
Ok, back to the initially posted issue: trying to figure out what might be causing this on your hardware, I started by reading the kernel docs on this topic. Essentially they say that this can be caused by many different things (see the detailed list in the linked document), and we might need to turn on CONFIG_RCU_TRACE in the kernel to get an idea of where things go wrong. But as a start we need a clear picture of the exact error messages on screen.
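If we do need to go down that road, it’s only a small kernel config change plus a rebuild of bzImage/bzImage32. A minimal sketch of the .config fragment (both symbols exist in mainline Kconfig; 21 seconds is just the mainline default stall timeout, shown here for completeness):

    # kernel .config fragment (sketch): turn on RCU diagnostics
    CONFIG_RCU_TRACE=y
    # stall detector timeout in seconds (21 is the mainline default);
    # lowering it makes the warning fire sooner while reproducing
    CONFIG_RCU_CPU_STALL_TIMEOUT=21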
@fenix_team said:
I also noticed another change: all of these machines sometimes got stuck in iPXE boot while loading “/default.ipxe” at 0%, forcing me to reboot lots of times until it randomly booted correctly. After changing the kernel and init versions, that problem vanished (I don’t know if the two things are related, though).
From my point of view those two things can’t be related, as the Linux kernel is not even running yet when default.ipxe is being loaded! It’s interesting that changing the Linux kernel and inits seems to have fixed it for you, though; I suspect that is just a coincidence. Usually when things don’t load properly at that stage it’s a network driver problem within the iPXE code. That is another thing that is very hard to debug, as it is hardware specific and needs to be reproduced to find and fix. But for iPXE there might be a different solution for you. We provide a set of different binaries, which you will all find in /tftpboot on your FOG server. The default for legacy BIOS machines is undionly.kkpxe. You can try undionly.*pxe (UNDI network stack only), ipxe.*pxe (native driver stack, all drivers included), intel.*pxe (native drivers, but Intel NICs only) and realtek.*pxe (native drivers, but Realtek NICs only). See the DHCP sketch below for how to point your machines at one of these.
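Which binary a machine boots is decided by your DHCP server via the boot filename (DHCP option 067). Just as a sketch for ISC dhcpd, assuming the FOG server is also your TFTP server and using a placeholder address (swap in your own values):

    # dhcpd.conf fragment (sketch; 10.0.0.10 is a placeholder address)
    next-server 10.0.0.10;    # TFTP server = the FOG server
    filename "ipxe.kpxe";     # try instead of the default undionly.kkpxe

After changing it, renew the client’s DHCP lease or simply reboot the machine and watch whether it gets past loading default.ipxe.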