FOG menu boot loop after image deployment

AllFoggedOut

Server

FOG Version: 1.3.1 RC7
OS: CentOS Linux 7.2.1511

Client

Service Version: ?
OS: Win7x64

Description

I have an Intel NUC NUC6i5SYH, with a 256GB M.2 NVMe drive. I appear to be 1 version behind the latest BIOS update (v55) but according to the ReadMe, the only change is “Updated RaidDriver”.

For work reasons, I have Win7x64 installed and running on the device. I will be also installing/imaging Windows 10 at some point.

I can acquire and deploy the Win7x64 image w/o issue. The problem comes after task completion and reboot.

The FOG Boot Menu is displayed with “Boot from hard disk” selected. After countdown, the screen turns black and I get a text version of the FOG Boot Menu. At this point, the boot menu enters an infinite loop, repeatedly counting down from 3.

I’ve tried changing “Exit to Hard Drive Type” between all of the available options - none result in a successful boot to Windows. I also tried updating Syslinux to the latest version (as described in a Wiki document), but that also didn’t help.

The BIOS is set to LEGACY boot, and Secure Boot is DISABLED (triple checked). Boot order is LAN, NVMe drive.

I recall seeing an error message flash up on the screen similar to “Boot from SAN device 0X80 failed: Operation cancelled (http://ipxe.org/0b8080a0)”. Not sure if this has any bearing.

Using the “Exit to Hard Drive Type” of “Exit”, I get an error about chainloading failed, then an automatic reboot after 10s.

If I change the BIOS boot order to boot from the NVMe drive, Windows boots w/o issue.

Thanks for any help!

Edit: solution was BIOS v57 and a cold-boot of the NUC.

AllFoggedOut

Just a quick update: Intel are still investigating. They did release v57 BIOS asking me to test it ~~but it hasn’t resolved the issue~~.

Will update if any/more progress is made.

Edit: see my update below.

Wayne Workman

Asking @george1421 to look at this.

george1421

Well, it sounds like you have been pretty diligent in testing this hardware. We have the NUC5i on my campus and they do work, with the exception that I don’t boot through the FOG server on an operational basis. (We require the techs to press F12 to get the boot menu for imaging, to avoid an accidental imaging event.)

Just as a point to note, if you did something with syslinux then you have a problem. FOG has not used syslinux since the early days; iPXE is now used as the boot loader. What are you setting DHCP option 67 to?

This sure does sound like an exit mode issue. Are you changing the bios exit mode in the host definition or the global setting?

Also does that NUC have a personality selector? Some allow you to define Win7,Win8,W10, Linux and we found that it IS important for that personality to be set correctly. This is set in the firmware and not the OS or FOG.

AllFoggedOut

@george1421 said in FOG menu boot loop after image deployment:

Just as a point to note, if you did something with syslinux then you have a problem. FOG has not used syslinux since the early days; iPXE is now used as the boot loader. What are you setting DHCP option 67 to?

Perhaps a big red banner across the top of “https://wiki.fogproject.org/wiki/index.php?title=Boot_looping_and_Chainloading” to indicate it’s no longer relevant might be a good idea

DHCP hasn’t been fiddled with since install, so it’s whatever comes out of the box. Checking dhcpd.conf, it doesn’t appear to be defined. Actually, reading dhcpd.conf again, there’s a “Class Legacy” definition with “filename undionly.kkpxe”. I assume this is being applied.
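A quick way to see which boot file names a DHCP config hands out is to list its `filename` directives. The sketch below is only illustrative: the function name `show_boot_filenames` is made up, and the default path `/etc/dhcp/dhcpd.conf` is an assumption about where the file lives on this server.

```shell
#!/bin/sh
# List every boot "filename" directive in a dhcpd.conf-style file,
# prefixed with its line number.
# Note: this shows what is configured, not which class a client matched.
show_boot_filenames() {
    conf=${1:-/etc/dhcp/dhcpd.conf}   # assumed default path, adjust as needed
    grep -nE '^[[:space:]]*filename[[:space:]]' "$conf"
}

# Usage (not run automatically):
#   show_boot_filenames /etc/dhcp/dhcpd.conf
```

If the only legacy-BIOS entry it prints is undionly.kkpxe, the client is being handed iPXE rather than pxelinux.0.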

This sure does sound like an exit mode issue. Are you changing the bios exit mode in the host definition or the global setting?

I’ve fiddled with both. I understand host definition takes priority? Last round of testing was with the host definition.

Also does that NUC have a personality selector? Some allow you to define Win7,Win8,W10, Linux and we found that it IS important for that personality to be set correctly. This is set in the firmware and not the OS or FOG.

Just had another look - not that I can see.

Wayne Workman

@AllFoggedOut said in FOG menu boot loop after image deployment:

Perhaps a big red banner across the top of “https://wiki.fogproject.org/wiki/index.php?title=Boot_looping_and_Chainloading” to indicate it’s no longer relevant might be a good idea

Done.

george1421

@AllFoggedOut said in FOG menu boot loop after image deployment:

DHCP hasn’t been fiddled with since install, so it’s whatever comes out of the box. Checking dhcpd.conf, it doesn’t appear to be defined. Actually, reading dhcpd.conf again, there’s a “Class Legacy” definition with “filename undionly.kkpxe”. I assume this is being applied.

I just wanted to ensure you weren’t using pxelinux.0 to boot your system. That would cause other issues.

I’ve fiddled with both. I understand host definition takes priority? Last round of testing was with the host definition.

Host definitions do take priority over global ones. So the host is the right spot.

I can’t think of any reason off the top of my head why sanboot wouldn’t work correctly for these systems. The M.2 disk should function just like a sata attached disk. I can go grab one of the NUC5i systems on Monday and see if they exit properly with sanboot. I know that won’t be an apples-to-apples test: the NUC5i systems we have use mSATA drives, which are not the same as M.2, but it’s the best I have to compare with.

The only other thought is if rEFInd or maybe grub can be configured to find these disks properly. You may have to tweak the refind.conf on the fog server.

AllFoggedOut

@george1421 said in FOG menu boot loop after image deployment:

The M.2 disk should function just like a sata attached disk.

So I’ve discovered something a little…odd.

Firstly, I had SATA completely disabled in BIOS, which I thought might be the cause of my problems (the NUC functions perfectly w/o it, save for this issue we’re discussing). Turns out, it’s not. Even with SATA enabled and set to AHCI, I’m still unable to boot via SANBOOT.

However, (with SATA completely disabled) if I hit F10 during boot (Boot Menu), it displays:

LAN : IBA CL Slot 00F3 v0104
INTEL SSDPEKKW256G7 : PART 0 : Boot Drive

If I then select LAN, once the graphical Fog Menu counts down to 0, the NUC boots to Windows! If I don’t hit F10 during boot, the Fog Menu goes back to its infinite loop.

Wth? It’s almost as if the act of hitting F10 is populating something necessary that is otherwise empty/blank/skipped…? Is this a BIOS bug?

george1421

@AllFoggedOut When you don’t hit F10 you have the default boot set to the LAN?

While this isn’t an answer to the issue, do you need unattended imaging with these NUCs? If not, why not just change the boot order to boot to the local hard drive and then press F10 if you need to reimage them?

AllFoggedOut

@george1421 said in FOG menu boot loop after image deployment:

@AllFoggedOut When you don’t hit F10 you have the default boot set to the LAN?

While this isn’t an answer to the issue, do you need unattended imaging with these NUCs? If not, why not just change the boot order to boot to the local hard drive and then press F10 if you need to reimage them?

Yes, the default boot option is LAN. Yes, unfortunately, I do need unattended imaging.

george1421

@AllFoggedOut Did you try an exit mode using rEFInd (yes, I know it’s intended for EFI systems, but the docs also say it supports BIOS-based systems too)? It’s a long shot, but it might find the boot disk. You need to change this in the host definition and for the BIOS exit mode.

The sanboot should use the first hard drive reported to the BIOS, which is why I find it a bit surprising that it doesn’t work. But I can’t say it’s consistent for M.2 disks.

AllFoggedOut

@george1421 said in FOG menu boot loop after image deployment:

@AllFoggedOut Did you try an exit mode using rEFInd (yes, I know it’s intended for EFI systems, but the docs also say it supports BIOS-based systems too)? It’s a long shot, but it might find the boot disk. You need to change this in the host definition and for the BIOS exit mode.

I haven’t looked at rEFInd yet - I’ll take a look and see what’s involved.

The sanboot should use the first hard drive reported to the BIOS, which is why I find it a bit surprising that it doesn’t work. But I can’t say it’s consistent for M.2 disks.

Yeah, it’s odd to say the least.

AllFoggedOut

I’ve set “Host Bios Exit Type” to “REFIND_EFI” under the host definition. I’ve edited “packages/web/service/ipxe/refind.conf” to have “scanfor” set to “hdbios”. All I seem to be getting is a blank screen with a flashing cursor in the top left.

I also set “scanfor” to “manual”, and defined a manual stanza at the bottom of refind.conf. Also bumped timeout to 10. Same behaviour.

Same behaviour with SATA enabled/disabled, pressing/not pressing F10 (as described previously).

Do I need to restart something for changes to take effect?

Tom Elliott

@AllFoggedOut This sounds an awful lot like one of two possibilities is causing the issue you’re seeing.

1. The image was pushed up and the disk was detected as GPT even though you have it set up as MBR in Windows. This, by itself, shouldn’t cause any major issues, but the deploy back to disk might. Though it would ultimately be better to fix the partition layout for the capture, you can fix this for deploy by using postdownloadscripts. Essentially, if you can boot the system into FOS Debug mode and run fixparts <devicename>, I suspect you’ll see it asking to fix the partition table. If it does, confirm and save, cancel the tasking you created, and reboot the system.
2. (Least likely if the image was captured relatively recently.) The MBR does not have the boot partition marked as “bootable”.
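Both possibilities can be checked by reading the four primary entries of the MBR partition table directly. The following is a minimal sketch, not part of FOG: the function name `mbr_summary` is hypothetical, and it assumes an `od` that supports `-j` and `-N`. In a standard MBR, a boot flag of 0x80 marks the active partition, and a type byte of 0xee indicates a protective MBR, meaning the disk carries a GPT.

```shell
#!/bin/sh
# Print the boot flag and type byte of each primary MBR partition entry.
# Works on a block device (needs root) or on a saved copy of sector 0.
mbr_summary() {
    src=$1
    # The table starts at byte 446: four 16-byte entries, then 0x55 0xaa.
    # -v stops od from collapsing repeated all-zero rows into "*".
    set -- $(od -An -v -tx1 -j446 -N66 "$src")
    if [ "$#" -ne 66 ]; then
        echo "could not read 66 bytes from $src" >&2
        return 1
    fi
    n=1
    while [ "$n" -le 4 ]; do
        # Byte 0 of an entry is the boot flag, byte 4 is the partition type.
        echo "entry $n: bootflag=0x$1 type=0x$5"
        shift 16
        n=$((n + 1))
    done
    # Two bytes remain: the boot signature.
    if [ "$1" != "55" ] || [ "$2" != "aa" ]; then
        echo "missing 0x55aa boot signature" >&2
        return 1
    fi
    return 0
}

# Usage (not run automatically), with the device name from this thread:
#   mbr_summary /dev/nvme0n1
```

A healthy MBR-style Windows disk would be expected to show exactly one entry with bootflag=0x80 and no entry with type=0xee.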

Just my thoughts.

Tom Elliott

In a mild attempt to check if the GPT partition thing I’m suspecting is the issue, I may have found a way to properly fix the “hung disk” issue that once was. This same fix should work for upload/capture etc…

The reason things weren’t getting hung is because we were piping yes into the gdisk command. This meant imaging would continue going and all would be more or less fine. (The data is actually copied or placed back on the disk.) But it could also leave the system in a strange state (for example the system being unable to boot after being deployed to.)

I’ve updated the inits to try out a method that verifies the partition table first. If the partition table is invalid, it will try to run fixparts on the disk.

The source has been updated within the working-1.3.2 branch, and the development inits have been updated to contain this new test.

If the debug->fixparts method works to make the system bootable, would you mind seeing if the development inits are working for you too?
If so, please try downloading the dev inits on your system and seeing if your systems are working after a deploy. I don’t know IF they will work and I have no means to replicate the problem currently.

wget --no-check-certificate -O /var/www/fog/service/ipxe/init.xz https://fogproject.org/inits/init.xz
wget --no-check-certificate -O /var/www/fog/service/ipxe/init_32.xz https://fogproject.org/inits/init_32.xz

AllFoggedOut

@Tom-Elliott said in FOG menu boot loop after image deployment:

The image was pushed up and the disk was detected as GPT even though you have it setup as MBR in windows. This, by itself, shouldn’t cause any major issues but the deploy back to disk might. Though it would be ultimately better to fix the partition layout for the capture, you can fix this for deploy by using postdownloadscripts. Essentially, if you can boot the system into a FOS Debug mode and run fixparts <devicename> I suspect you’ll see it asking to fix the partition table. If it is, confirm and save, cancel your created tasking and reboot the system that booted up.

In FOG Debug, I run ‘fixparts /dev/nvme0n1’, and I get ‘MBR command (? for help)’. No sign of an error state or request to fix the partition table.

(Least likely if image was captured relatively recently) The MBR is not setting the partition boot partition in a “bootable” state.

If I print out the existing MBR Partition Table via ‘p’ command, I get 2 partitions, the first is set with the boot flag; all looks healthy.

Just my thoughts.

I appreciate your thoughts!

AllFoggedOut

@Tom-Elliott said in FOG menu boot loop after image deployment:

If the debug->fixparts method works to make the system bootable, would you mind seeing if the development inits are working for you too?
If so, please try downloading the dev inits on your system and seeing if your systems are working after a deploy. I don’t know IF they will work and I have no means to replicate the problem currently.

FWIW, these work. I was able to capture and deploy my Win7x64 image w/o any obvious errors. Windows boots when I select the NVMe drive in F10 Boot Menu. Unfortunately, my original problem persists.

I’m going to install/clone/deploy Win10 in EFI mode and see how that goes.

AllFoggedOut

Win10x64 unattended capture/deployment works fine using rEFInd (SANBOOT does not work - same issue as with Win7x64 above).

Initially I had rEFInd complaining about my scanfor line containing legacy BIOS options which were incompatible since my BIOS lacked the necessary Compatibility Support Module. Removing hdbios from the list made the error message go away. Had a minor bit of confusion initially because I was editing the wrong refind.conf - correct path is ‘/var/www/html/fog/service/ipxe/refind.conf’.
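Since more than one refind.conf can exist on the server, it can help to list every copy along with its scanfor setting before editing. This is a rough sketch only: the function name `list_refind_confs` and the default starting directory `/var/www` are assumptions, not something FOG provides.

```shell
#!/bin/sh
# List every refind.conf below a starting directory, together with its
# first active (uncommented) "scanfor" line, or "none" if there is none.
list_refind_confs() {
    root=${1:-/var/www}   # assumed starting point, adjust as needed
    find "$root" -type f -name refind.conf 2>/dev/null |
    while IFS= read -r f; do
        line=$(grep -E '^[[:space:]]*scanfor[[:space:]]' "$f" | head -n 1)
        printf '%s: %s\n' "$f" "${line:-none}"
    done
}

# Usage (not run automatically):
#   list_refind_confs /var/www
```

The copy that matters is the one under the web root the FOG server actually serves, which in this case was /var/www/html/fog/service/ipxe/refind.conf.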

I might try rEFInd again for my Win7x64 issue (now that I know which file to edit), but I don’t think it’s going to help - the fact that I got a blank screen and flashing cursor vs a rEFInd menu + error message under EFI suggests it’s not happy.

During the course of testing I had to move my image storage directory owing to a lack of space. I moved all files under /images to another LVM volume group, then updated the path in Storage Management. I then ran into an issue after image capture (repeated “Database Update failed” messages); the Apache error_log showed the FTP rename operation failing. This was resolved by changing the ownership of the new folder and its sub-folders to fog:fog and the mode to 775. Not sure if this is correct, but it worked. I then had a second issue during image deployment, where “Checking Mounted File System” failed. It seems the script “src/buildroot/package/fog/scripts/bin/fog.checkmount” still contained the old storage path (possible bug?). I also must have missed the “.mntcheck” file when moving files around, and had to recreate it.
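The relocation steps described above (ownership, mode, and the .mntcheck marker) could be bundled roughly as follows. This is a sketch of what worked in this case, not an official FOG procedure: the function name `prepare_image_store` is hypothetical, and fog:fog and 775 are simply the values reported above.

```shell
#!/bin/sh
# Prepare a relocated image store: marker file, ownership, and mode.
prepare_image_store() {
    dir=$1
    owner=${2:-}   # e.g. fog:fog; if left empty, ownership is not changed
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # Empty marker file that the mount check looks for.
    touch "$dir/.mntcheck" || return 1
    if [ -n "$owner" ]; then
        chown -R "$owner" "$dir" || return 1   # normally needs root
    fi
    chmod -R 775 "$dir"
}

# Usage (not run automatically; the path is a placeholder):
#   prepare_image_store /path/to/new/images fog:fog
```

The marker is created before the recursive chown and chmod so that it ends up with the same owner and mode as the rest of the tree.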

Anyway, that aside, Win10x64 clone/deploy is now working seamlessly.

Tom Elliott

@AllFoggedOut As for the ‘possible bug’: you are not experiencing a bug there. If the .mntcheck file is missing, the check will fail because it has no other way to know whether the system mounted or not. We use the .mntcheck file to determine this.

george1421

@AllFoggedOut OK, I’m a bit confused now.

It was my understanding that imaging worked perfectly, and that the issue was when you exited from the FOG iPXE menu to boot the local host OS. The target computer would not boot the local OS, but if you changed the BIOS boot order to the disk first (instead of PXE) the system would boot properly.

Shouldn’t this be an iPXE kernel exit mode issue?

Tom Elliott

@george1421 It is; the OP is just stating what was tested and appears to be working. During those tests they ran into a full disk and added more space. When adding more space they forgot to set up the permissions and the .mntcheck file.
