SOLVED FOG Kernel panic at startup with some systems

  • We have a small cluster at our company.
    We are looking at using FOG to deploy Linux to the compute nodes of the cluster.
    We have had success (so far) with all systems we’ve tried that contain a “conventional” Xeon CPU.
    However, we also have some more unusual CPU architectures we need to cover.
    For the computers we’ve tried with ATOM CPUs, and systems with Xeon Phi Knights Landing (KNL) CPUs, we get a kernel panic just as the FOG Kernel is starting. PXE and file transfers are all working correctly as near as we can tell.
    Here is a screen shot of the panic screen (ATOM CPU, but the KNL nodes do the same thing):
    We have already tried updating the Kernel to the newest version.
    We have also tried several different kernel arguments. Only “nolapic” had any effect, and it merely changed the panic to “BIOS has enabled x2apic but kernel doesn’t support x2apic, please disable x2apic in BIOS” (which isn’t an option available on these systems).

    FOG Version: 1.5.4
    Default Kernel: 4.16.6
    Target system is a SuperMicro A2SDi-16C-HLN4F (In the case of the ATOM computer).

    Thank you in advance for any help.

  • Moderator

    @mckayj Okay, I think this one should have it enabled

  • Senior Developer

    @mckayj @Quazz Great, this turned out to fix the issues on those CPUs. Thanks heaps!!

    Will add x2apic to the official FOG kernel build. Let’s hope this is not going to cause any trouble for other devices. 😉

    Edit: Added the option, will be in the official binaries soon:

  • Moderator

    @mckayj That’s great news! I’m not sure why x2apic support wasn’t enabled. As far as I know, it’s typically only used on multi-CPU systems, which I suppose aren’t the typical use case for FOG. It seems very strange to me that it would be enabled on a single-CPU board, but there you go.

    I don’t think it will harm anything if we enable x2apic support going forward; it can always be disabled with a kernel parameter on boot anyway.
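
    As a rough way to see which nodes this affects, a small script can look for the x2apic CPU flag and for a disabling boot parameter. This is only a sketch: it assumes the text comes from /proc/cpuinfo and /proc/cmdline on a booted node, and that the parameter meant above is the kernel's "nox2apic". The helper names are made up for illustration. Note that the CPU flag only shows that the processor supports x2apic, not that the BIOS has switched it on.

```python
def cpu_has_flag(cpuinfo_text, flag):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


def cmdline_has_param(cmdline_text, param):
    """Return True if the boot command line contains the exact parameter."""
    return param in cmdline_text.split()


if __name__ == "__main__":
    # On a real node these would be read from /proc/cpuinfo and /proc/cmdline.
    with open("/proc/cpuinfo") as f:
        print("CPU supports x2apic:", cpu_has_flag(f.read(), "x2apic"))
    with open("/proc/cmdline") as f:
        print("nox2apic on cmdline:", cmdline_has_param(f.read(), "nox2apic"))
```

    Running it from the FOG debug task on each node type would give a quick list of which machines advertise the feature.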

  • @Quazz
    IT WORKED! Thank you!
    The ATOM-based node started and went into its debug task, and it is able to see the network (I was concerned about the network as well, as some slightly older Linux distributions couldn’t see the network cards).
    We also tested it on a KNL based node, and two Xeon based nodes, and they can all use the Kernel fine.

    So, I am curious to know: why isn’t that Kernel feature enabled all the time?

    Also, do you guys have any merchandise or swag? We’d like to support you, but a donation may be difficult from within our company’s various structures.

    Again, thank you so much!

    (Screen shot of ATOM based node in “debug” task)

  • @Sebastian-Roth
    I have downloaded and tried Ubuntu 18.04.1 on Live CD as requested.
    It works completely OK!
    The system starts, goes into the GUI, and the network cards all work.

    I’m hoping the Kernel with x2apic support will help, as I have seen messages about that with some kernels, and some options.
    The computers in question have options in their BIOS menu that appear to be for that, but the options are not available to be changed.

  • Senior Developer

    @mckayj Interesting to see that all the old 2.6.x kernels are having issues on this CPU. Enabling debug did not do what I was expecting it to do. Although I have compiled a fair number of kernels and even wrote a little bit of driver code for the fun of it, I am not a real wizard when it comes to reading those kernel oops messages and debugging the kernel altogether.

    It’s good to know that newer Linux distros seem to work better than older ones. So we’ll hopefully find out what’s causing this and fix it in our latest kernel. Can you please try booting the Ubuntu 18.04.1 CD to see if a recent 4.x kernel boots fine?

  • Moderator

    @mckayj I’m not a kernel guru, but I’d be interested in seeing what happens if you boot using a kernel that has x2apic support. Some light googling suggests that a kernel panic is normal behavior if x2apic is enabled in the BIOS (even if you can’t see the option) but the support isn’t in the kernel.

    Trying to compile said kernel now.
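
    For anyone wanting to check whether a given kernel was built with that support, here is a minimal sketch. It assumes you have the kernel's config as text, for example from /boot/config-<version> or a decompressed /proc/config.gz where the kernel exposes one, and that the relevant option is CONFIG_X86_X2APIC. The function name is hypothetical.

```python
import re


def kconfig_option_value(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...), or None.

    None covers both '# OPTION is not set' comment lines and options
    that do not appear in the config text at all.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option))
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Illustrative config fragment, not taken from an actual FOG kernel.
    sample = "CONFIG_X86_LOCAL_APIC=y\n# CONFIG_X86_X2APIC is not set\n"
    print("x2apic support:", kconfig_option_value(sample, "CONFIG_X86_X2APIC"))
```

    The same helper works for any other option name, such as a debug setting, since it only matches on the text of the config.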

  • @Quazz
    I tried a number of self boot Linux CDs:

    • UBCD (Parted Magic): OK, no network
    • Ubuntu 16.04: OK, no network
    • CentOS 7.5: OK (This is currently installed on the computer’s own HDD, and will be reinstalled once FOG is running)
    • Puppy 5.3.3: Failed (Can’t load drivers for USB, so not the same issue), but not a Kernel Panic
    • Damn Small Linux: Non responsive Black Screen
    • Ubuntu 8.04: Kernel Panic: Attempted to kill init!
    • CentOS 6.7 Live: Kernel Panic
    • CentOS 7.5 Installer: OK (After all, I installed it from this on this computer a few months ago)
    • Scientific Linux 48 (Very old): Kernel Panic: Oops

    (Screen shots below)

    SHIFT+PAGEUP scrolls a working terminal, but once the kernel panics, that doesn’t work.

    To be honest, I have not tried this on every node, as this is a running cluster, so I can’t take too many nodes offline for this testing. Here are the nodes I have tried:

    • ATOM Node (which I’ve mostly been talking about): Kernel Panic
    • KNL Node: Kernel Panic
    • Xeon CPU (HP brand computer): OK
    • Xeon CPU (Supermicro brand computer): OK
    • Various desktops, laptops, virtual computers: OK

    Ubuntu 8.04:

    CentOS 6.7 Live:

    Scientific Linux 48 (Not sure about that version number, that’s just what was written on the CD):

  • Moderator

    Would be interesting to see if they can boot a live USB of say Ubuntu 18.04.

    Additionally, I think you should be able to scroll up a bit using shift + pageup iirc, there might be some useful info up there.

    Also, does this happen on all the nodes?

  • Thank you for the quick response!
    I set the ATOM based node in question to use the kernel you linked to me.
    While it did make some difference, the difference is limited to some of the HEX numbers being a different value, and the order of things printed is changed:
    Certainly nothing I can read.

  • Senior Developer

    @mckayj Mind trying this freshly built kernel image?

    I enabled CONFIG_DEBUG_INFO in the hope of getting some more readable “call trace” information.