I work in a testing lab where we use FOG to deploy operating systems. We have a very wide range of server platforms, from a few generations ago up to current and even a few pre-release platforms.
Generally FOG works well on all of our platforms and PXE boots with no issue. We are currently running FOG 1.5.9 to be compatible with our in-house automation, but we have the latest FOS kernel release loaded:
file /var/www/html/fog/service/ipxe/bzImage*
/var/www/html/fog/service/ipxe/bzImage: Linux kernel x86 boot executable bzImage, version 6.1.22 (runner@fv-az565-7) #1 SMP PREEMPT_DYNAMIC Fri Mar 31 00:29:42 UTC 2023, RO-rootFS, swap_dev 0x9, Normal VGA
/var/www/html/fog/service/ipxe/bzImage32: Linux kernel x86 boot executable bzImage, version 6.1.22 (runner@fv-az576-383) #1 SMP PREEMPT_DYNAMIC Fri Mar 31 00:26:56 UTC 2023, RO-rootFS, swap_dev 0x8, Normal VGA
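For anyone comparing kernel builds across FOG servers, the same check can be reduced to just the version numbers. This is only a rough sketch: `kernel_version` is a hypothetical helper name, and it assumes `file` prints a "version X.Y.Z" field as in the output above.

```shell
#!/bin/sh
# Print "<image>: <kernel version>" for each FOS kernel image.

kernel_version() {
    # Reads `file` output on stdin and prints only the version number.
    # Assumes the line contains "version X.Y.Z", as in the output above.
    sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p'
}

for img in /var/www/html/fog/service/ipxe/bzImage*; do
    # Skip the unexpanded glob if no kernel images are present.
    [ -e "$img" ] || continue
    printf '%s: %s\n' "$img" "$(file "$img" | kernel_version)"
done
```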
The issue occurs on some of our newest platforms, listed here:
Platform #1: Dell R760 server, showed issues out of the box
Platform #2: Supermicro 4U server with X13DEG-OA motherboard, worked out of the box, stopped working with latest BIOS loaded.
Platform #3: Pre-release Gen4 Xeon based system, worked out of the box, stopped working with latest BIOS loaded.
They all happen to use 4th generation Xeon Scalable processors, but I’m not sure that is the specific problem. When we attempt to boot them via PXE, the system hangs at the hand-off to the FOS kernel and stays hung until manually rebooted. The last visible message before the hang is “EFI stub: Loaded initrd from command line option”.
This seems to be an issue with the FOS kernel rather than the iPXE chain leading up to it, since creating a USB boot disk and chainloading into the FOS kernel from GRUB locks up the system in the same way. I’m a newbie where this sort of thing is concerned, but judging by the little diagnostic output I was able to get from GRUB, the failure happens as soon as FOS takes over and not at any point while GRUB (or iPXE) is still running.
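For anyone who wants to reproduce the GRUB test with more output at the hand-off, a menu entry along the following lines could be generated. This is a sketch and not the exact entry we used: the function name `fos_debug_entry` and the paths `/bzImage` and `/init.xz` are placeholders. The added parameters (`earlycon=efifb`, `keep_bootcon`, `ignore_loglevel`, `efi=debug`) are standard Linux kernel options, but it is an assumption that the FOS kernel is built with the support they need (for example `CONFIG_EFI_EARLYCON`), so they may print nothing.

```shell
#!/bin/sh
# Emit a GRUB menu entry that boots a kernel with verbose early-boot logging.
# Paths and the chosen debug arguments are assumptions for illustration.

fos_debug_entry() {
    kernel=$1   # kernel path on the USB stick, e.g. /bzImage
    initrd=$2   # init image path on the USB stick, e.g. /init.xz
    # earlycon=efifb   early console on the EFI framebuffer
    # keep_bootcon     keep the early console once the normal one registers
    # ignore_loglevel  print all kernel messages regardless of loglevel
    # efi=debug        extra EFI output such as the firmware memory map
    debug_args="earlycon=efifb keep_bootcon ignore_loglevel efi=debug"
    cat <<EOF
menuentry "FOS (early debug)" {
    linux $kernel $debug_args
    initrd $initrd
}
EOF
}

# Print an entry that can be pasted into grub.cfg on the USB stick.
fos_debug_entry /bzImage /init.xz
```

If the hang happens before any console is available, a serial console (`console=ttyS0,115200` plus a matching `earlycon=` setting) may be the only way to capture output, provided the platform exposes a serial port or serial-over-LAN.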
The issue also appears to be related to the BIOS somehow: two of our three platforms initially worked fine with FOS and FOG deployment, but started exhibiting the issue after an update to the latest BIOS. Secure Boot is disabled on all platforms.
At this point we’ve been trying to root-cause and fix this issue for a couple of weeks and haven’t made any progress. Any suggestions for what we could try to resolve this, or ways we could generate additional diagnostic information for you folks, would be much appreciated. We have some quite capable engineers over here, but nobody has much experience with the low-level firmware or kernel issues we seem to be dealing with.