FOG menu boot loop after image deployment



  • Server
    • FOG Version: 1.3.1 RC7
    • OS: CentOS Linux 7.2.1511
    Client
    • Service Version: ?
    • OS: Win7x64
    Description

    I have an Intel NUC NUC6i5SYH, with a 256GB M.2 NVMe drive. I appear to be 1 version behind the latest BIOS update but according to the ReadMe, the only change is “Updated RaidDriver”.

    For work reasons, I have Win7x64 installed and running on the device. I will be also installing/imaging Windows 10 at some point.

    I can acquire and deploy the Win7x64 image w/o issue. The problem comes after task completion and reboot.

    The FOG Boot Menu is displayed with “Boot from hard disk” selected. After countdown, the screen turns black and I get a text version of the FOG Boot Menu. At this point, the boot menu enters an infinite loop, repeatedly counting down from 3.

    I’ve tried changing “Exit to Hard Drive Type” between all of the available options - none result in a successful boot to Windows. I also tried updating Syslinux to the latest version (as described in a Wiki document), but that also didn’t help.

    The BIOS is set to LEGACY boot, and Secure Boot is DISABLED (triple checked). Boot order is LAN, NVMe drive.

    I recall seeing an error message flash up on the screen similar to “Boot from SAN device 0X80 failed: Operation cancelled (http://ipxe.org/0b8080a0)”. Not sure if this has any bearing.

    Using the “Exit to Hard Drive Type” of “Exit”, I get an error about chainloading failed, then an automatic reboot after 10s.

    If I change the BIOS boot order to boot from the NVMe drive, Windows boots w/o issue.

    Thanks for any help!


  • Moderator

    AFAIK, aside from updating the Storage Node, you also need to update the /opt/fog/.fogsettings file with the correct path and then rerun the installer. (Not addressed at anyone in particular; more for people who might run into similar problems.)
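    For anyone following along, here is a minimal sketch of that settings change. It assumes the image path is stored under a key named storageLocation in .fogsettings (please check your own file, the key name is my assumption), and the sample file contents are hypothetical. It only prints the edited result so you can review it before touching the real file.

```shell
#!/bin/sh
# Point storageLocation in a .fogsettings-style file at a new directory.
# ASSUMPTION: the key is called storageLocation; verify in your own file.
# usage: set_storage_location FILE NEW_PATH   (prints the edited file)
set_storage_location() {
    file=$1 new=$2
    sed "s|^storageLocation=.*|storageLocation='${new}'|" "$file"
}

# Dry run against a sample file (hypothetical contents)
sample=$(mktemp)
printf "%s\n" "ipaddress='192.0.2.10'" "storageLocation='/images'" > "$sample"
set_storage_location "$sample" /stor/fog
rm -f "$sample"

# On the real server (as root), after checking the output:
#   cp /opt/fog/.fogsettings /opt/fog/.fogsettings.bak
#   set_storage_location /opt/fog/.fogsettings.bak /stor/fog > /opt/fog/.fogsettings
#   then rerun the FOG installer from the directory you installed from
```

    Keeping the .bak copy means the installer rerun can be undone if the path turns out to be wrong.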



  • Thanks to @TomElliott for clearing things up via Chat. Seems I dug myself a hole when changing storage paths.

    I’ve posted about the F10 boot weirdness on the Intel Support forums. Will update based on what comes back.


  • Senior Developer

    Pinging you on chat so as to keep the arguing in forums to a minimum.

    Ultimately, you should not have the problem any more now that the system is set up properly. If you want to test this, you can send the client to deploy again. It will pick up the updated storage information and should work properly.

    If you can post screen shots of the error (if it is indeed occurring) it would certainly go a lot further to verifying and fixing the issue. This, however, I have not seen.

    This, of course, doesn’t fix the boot loop mechanism. I’m just trying to help clarify points as best I can.


  • Senior Developer

    @AllFoggedOut The client (FOS), in debug mode or otherwise, will only pick up the updated elements when the client is rebooted.

    I’m going to guess you were in debug mode on the client you were telling to re-deploy the image?



  • Ok, I’ll try and explain this from a different angle.

    I have an issue with iPXE Exit Mode loop that is affecting Win7x64/Legacy BIOS.

    In the process of trying to diagnose this issue, I decided to install Windows 10/EFI to see if this also suffered from the same problem.

    During the cloning of Windows 10 I realised I was running out of disk space in “/images” on the FOG server.

    I aborted the clone, cancelled the imaging task and set about creating a folder for use by FOG to store images.

    I moved everything under “/images” to “/stor/fog”. Via the FOG web interface, I went into Storage Management -> DefaultMember, and changed “Image Path” and “FTP Path” to the new folder.

    I then recreated the imaging task which failed because I had the wrong permissions on my folder (my fault, not a bug).

    When I then attempted to deploy the same image, it failed because the FOG script “fog.checkmount” still contained the OLD storage path. Why did this script still contain the old path? Shouldn’t it have been updated to the new path when I updated DefaultMember?

    I also failed to copy “.mntcheck” from /images (my fault, not a bug).

    I also deleted “/images” from the server. If I had left this folder in place, I’m guessing the “Checking Mounted File System” would have just continued to work, despite pointing to a defunct/no longer used location.

    Now that I’m thinking about it, the NFS exports were also not updated. I had to manually edit /etc/exports, put the new folder in and run exportfs -a.
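    For reference, the manual exports change described above could be sketched roughly like this. The sample export lines and their options are hypothetical (your /etc/exports may differ), and rewrite_exports is just an illustrative helper name. It prints the rewritten file rather than editing in place.

```shell
#!/bin/sh
# Rewrite the export paths in an exports-style file.
# usage: rewrite_exports OLD_PATH NEW_PATH FILE   (prints the edited file)
rewrite_exports() {
    old=$1 new=$2 file=$3
    # Handle the /dev sub-export first, then the top-level path
    sed -e "s|^${old}/dev\([[:space:]]\)|${new}/dev\1|" \
        -e "s|^${old}\([[:space:]]\)|${new}\1|" "$file"
}

# Dry run on a sample file (hypothetical export options)
sample=$(mktemp)
cat > "$sample" <<'EOF'
/images *(ro,sync,no_subtree_check,no_root_squash,insecure,fsid=0)
/images/dev *(rw,async,no_subtree_check,no_root_squash,insecure,fsid=1)
EOF
rewrite_exports /images /stor/fog "$sample"
rm -f "$sample"

# On the real server (as root), after reviewing the output:
#   rewrite_exports /images /stor/fog /etc/exports > /etc/exports.new
#   mv /etc/exports.new /etc/exports && exportfs -a
```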

    My assumption was that updating DefaultMember would mean:

    a) NFS exports is updated with the new folder
    b) fog.checkmount is updated with the new folder

    Perhaps the preferred/supported approach is to simply add new storage areas, rather than editing the existing ones.

    None of the above has any relation to my iPXE Boot Menu issue with Win7x64/Legacy. It is a coincidental observation that I thought I’d bring to your attention in case it’s a bug with updating Storage Management paths.


  • Senior Developer

    @AllFoggedOut said in FOG menu boot loop after image deployment:

    During the course of testing I had to move my image storage directory owing to a lack of space. I moved all files under /images to another LVM volume group. I then updated the path in Storage Management. I then ran into an issue post-image capture (repeated “Database Update failed” messages) which was resolved by changing permissions on the new folder (and sub-folders to fog:fog and mode 775). Not sure if this is correct, but it worked. Apache error_log showed the FTP rename operation failing. I then had a 2nd issue during image deployment where “Checking Mounted File System” failed. Seems the script “src/buildroot/package/fog/scripts/bin/fog.checkmount” still contained the old storage path (possible bug?). I also must have missed the “.mntcheck” file when moving files around - had to recreate it.

    This is what I’m referring to.

    Your first issue was caused by permissions. Once that was fixed, the next issue was that the .mntcheck file didn’t exist. Now that those issues are corrected (these are both within the init’s btw), the bugs you thought you were seeing will no longer be there.


  • Senior Developer

    @AllFoggedOut You’re starting to confuse me.

    The “bugs” you were referring to were in direct relation to the imaging things. What I’m trying to state is the “bugs” you saw were in regards to the FOS system. This has nothing to do with what you’re seeing in regards to the boot menu loop.

    The only reason you saw issues in imaging earlier is because you made a change somewhere. These issues should no longer be present since you’ve made the corrective actions.



  • @Tom-Elliott said in FOG menu boot loop after image deployment:

    It’s simpler just to reboot the machine (or machines as the case may be) to pick up the new data.

    The script I’m referring to is/was on the server.



  • @Omar.rodriguez said in FOG menu boot loop after image deployment:

    I understand I might be late to this thread… but we’ve run into this issue here at the office I work at. Do you happen to have 2 hard drives on your computer?

    The NUC contains a single M.2 SSD

    When you get the FOG menu on the PC and you press the “Esc” key does it boot into the OS?

    No, I get an error message about chainloading failed, and an auto reboot in 10 seconds. This is for Win7x64 using SANBOOT.




  • Senior Developer

    @AllFoggedOut Unfortunately, once the client is loaded into FOS, the only way to change the kernel based variables is to reboot the client. You could make a call to get new data, but that would/could put tremendous load back on the server (imagine 20 hosts booting to perform an imaging task, all calling out to the main server to update their scripts).

    It’s simpler just to reboot the machine (or machines as the case may be) to pick up the new data.



  • My “possible bug” comment related to the fact that, after changing “Image Path” and “FTP Path” in Storage Management -> DefaultMember, to a new path, the FOG script file “src/buildroot/package/fog/scripts/bin/fog.checkmount” still contained the old path, which caused image deployment to fail at the “Checking Mounted File System” stage. The reason it failed is because I deleted the old, defunct path from disk.

    @george1421, yes, this is still an iPXE exit mode issue.

    In summary, Win7x64 Legacy == iPXE Exit Mode loop. Win10x64 EFI == no issues.


  • Senior Developer

    @george1421 it is, the OP is just stating what was tested that appears to be working. During those tests they ran into a full disk and added more space. When adding more space they forgot to setup the permissions and mntcheck files.


  • Moderator

    @AllFoggedOut OK, I’m a bit confused now.

    It was my understanding that imaging worked perfectly, and that the issue occurred when you exited from the FOG iPXE menu to boot the local host OS. The target computer would not boot the local OS, but if you changed the BIOS boot order to the disk first (instead of PXE) the system would boot properly.

    Shouldn’t this be an iPXE kernel exit mode issue?


  • Senior Developer

    @AllFoggedOut as for the ‘possible bug’, you are not experiencing a bug there. If the .mntcheck file is missing, the check will fail because it has no other way to know whether the system mounted or not. We use the .mntcheck file to determine whether this is the case.
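    As a rough sketch for anyone who moves their image store, the marker files and permissions could be recreated along these lines. The fog:fog owner and mode 775 are the values reported earlier in this thread. The second marker under dev/ is an assumption about a typical layout, and prepare_store is an illustrative helper name, not part of FOG.

```shell
#!/bin/sh
# Create the marker files and set permissions on a new image store.
# usage: prepare_store DIR
prepare_store() {
    dir=$1
    mkdir -p "$dir/dev" || return 1
    # Empty marker files; ASSUMED to be what the mount check looks for
    touch "$dir/.mntcheck" "$dir/dev/.mntcheck" || return 1
    chmod -R 775 "$dir" || return 1
    # Ownership change needs root and an existing fog user
    if [ "$(id -u)" -eq 0 ] && id fog >/dev/null 2>&1; then
        chown -R fog:fog "$dir"
    fi
}

# Dry run in a scratch directory; on the server this would be e.g. /stor/fog
scratch=$(mktemp -d)
prepare_store "$scratch/fog" && ls -la "$scratch/fog"
rm -rf "$scratch"
```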



  • Win10x64 unattended capture/deployment works fine using rEFInd (SANBOOT does not work - same issue as with Win7x64 above).

    Initially I had rEFInd complaining about my scanfor line containing legacy BIOS options which were incompatible since my BIOS lacked the necessary Compatibility Support Module. Removing hdbios from the list made the error message go away. Had a minor bit of confusion initially because I was editing the wrong refind.conf - correct path is ‘/var/www/html/fog/service/ipxe/refind.conf’.
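    In case it helps anyone else, the scanfor change can be sketched as below. The sample config lines are hypothetical (your scanfor list may contain other entries), and strip_hdbios is only an illustrative helper. It prints the edited file so the result can be checked before replacing the real refind.conf.

```shell
#!/bin/sh
# Drop the legacy "hdbios" entry from the scanfor line of a refind.conf-style file.
# usage: strip_hdbios FILE   (prints the edited file)
strip_hdbios() {
    # First pattern covers a middle/last entry, second covers a leading entry
    sed -e '/^scanfor/ s/,hdbios//' -e '/^scanfor/ s/hdbios,//' "$1"
}

# Dry run on a sample file (hypothetical contents)
sample=$(mktemp)
printf '%s\n' 'timeout 0' 'scanfor internal,hdbios,external' > "$sample"
strip_hdbios "$sample"
rm -f "$sample"

# On the real server, after reviewing the output:
#   conf=/var/www/html/fog/service/ipxe/refind.conf
#   cp "$conf" "$conf.bak" && strip_hdbios "$conf.bak" > "$conf"
```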

    I might try rEFInd again for my Win7x64 issue (now that I know which file to edit), but I don’t think it’s going to help - the fact that I got a blank screen and flashing cursor vs a rEFInd menu + error message under EFI suggests it’s not happy.

    During the course of testing I had to move my image storage directory owing to a lack of space. I moved all files under /images to another LVM volume group. I then updated the path in Storage Management. I then ran into an issue post-image capture (repeated “Database Update failed” messages) which was resolved by changing permissions on the new folder (and sub-folders to fog:fog and mode 775). Not sure if this is correct, but it worked. Apache error_log showed the FTP rename operation failing. I then had a 2nd issue during image deployment where “Checking Mounted File System” failed. Seems the script “src/buildroot/package/fog/scripts/bin/fog.checkmount” still contained the old storage path (possible bug?). I also must have missed the “.mntcheck” file when moving files around - had to recreate it.

    Anyway, that aside, Win10x64 clone/deploy is now working seamlessly.



  • @Tom-Elliott said in FOG menu boot loop after image deployment:

    If the debug->fixparts method works to make the system bootable would you mind seeing if the development init’s are working for you too?
    If so, please try downloading the dev init’s on your system and seeing if your systems are working after a deploy. I don’t know IF they will work and I have no means to replicate the problem currently.

    FWIW, these work. I was able to capture and deploy my Win7x64 image w/o any obvious errors. Windows boots when I select the NVMe drive in F10 Boot Menu. Unfortunately, my original problem persists.

    I’m going to install/clone/deploy Win10 in EFI mode and see how that goes.



  • @Tom-Elliott said in FOG menu boot loop after image deployment:

    1. The image was pushed up and the disk was detected as GPT even though you have it setup as MBR in windows. This, by itself, shouldn’t cause any major issues but the deploy back to disk might. Though it would be ultimately better to fix the partition layout for the capture, you can fix this for deploy by using postdownloadscripts. Essentially, if you can boot the system into a FOS Debug mode and run fixparts <devicename> I suspect you’ll see it asking to fix the partition table. If it is, confirm and save, cancel your created tasking and reboot the system that booted up.

    In FOG Debug, I run ‘fixparts /dev/nvme0n1’, and I get ‘MBR command (? for help)’. No sign of an error state or request to fix the partition table.

    2. (Least likely if image was captured relatively recently) The MBR is not marking the boot partition as “bootable”.

    If I print out the existing MBR Partition Table via ‘p’ command, I get 2 partitions, the first is set with the boot flag; all looks healthy.

    [Attached photo: DSC_0554.JPG]

    Just my thoughts.

    I appreciate your thoughts! :)


  • Senior Developer

    In a mild attempt to check if the GPT partition thing I’m suspecting is the issue, I may have found a way to properly fix the “hung disk” issue that once was. This same fix should work for upload/capture etc…

    The reason things weren’t getting hung is that we were piping yes into the gdisk command. This meant imaging would continue and all would be more or less fine. (The data is actually copied or placed back on the disk.) But it could also leave the system in a strange state (for example, unable to boot after being deployed to).

    I’ve updated the init’s in hopes to try out a method to verify the partition table first. If the partition table is invalid, try to run fixparts on the disk.

    The source has been updated within the working-1.3.2 branch, and the development init’s have been updated to contain this new test.

    If the debug->fixparts method works to make the system bootable would you mind seeing if the development init’s are working for you too?
    If so, please try downloading the dev init’s on your system and seeing if your systems are working after a deploy. I don’t know IF they will work and I have no means to replicate the problem currently.

    wget --no-check-certificate -O /var/www/fog/service/ipxe/init.xz https://fogproject.org/inits/init.xz
    wget --no-check-certificate -O /var/www/fog/service/ipxe/init_32.xz https://fogproject.org/inits/init_32.xz
    
