EFI_exit failed error when leaving FOG Menu on Dell Optiplex 7010 and Lenovo M83
-
@hancocza Ok, we still have backups of the iPXE binaries. Please make sure you don’t re-run the FOG installer (for whatever reason) before moving the single backup copy we have out of the way:
mv /tftpboot.prev /home/fog_tftpboot_1.5.7
That’s just to make sure we won’t lose those when re-running the FOG installer at some point.
The other directory listing is not of much help as I made a mistake when typing this off the top of my head. Please run
ls -al /home | grep fog_web
and post the results here. I suppose you will see a directory named /home/fog_web_1.5.9.BACKUP. In that case you might run
file /home/fog_web_1.5.9.BACKUP/service/ipxe/bzImage
to find out the kernel used earlier.
Now let’s start with testing if it’s the kernel or iPXE causing the error.
cp /home/fog_web_1.5.9.BACKUP/service/ipxe/bzImage /var/www/html/fog/service/ipxe/bzImage_1.5.7
cp /home/fog_web_1.5.9.BACKUP/service/ipxe/bzImage32 /var/www/html/fog/service/ipxe/bzImage32_1.5.7
chown www-data:www-data /var/www/html/fog/service/ipxe/bzImage*1.5.7
Now choose one of your Dell Optiplex 7010 or Lenovo M83 devices you have seen the error on before and edit its host settings in the FOG web UI. Set Host Kernel to bzImage_1.5.7 (assuming this is a 64-bit CPU, or bzImage32_1.5.7 if it’s 32-bit) and then boot it up to do whatever you did before that ended up with that error.
Just found a few interesting posts on this on the web:
https://support.lenovo.com/us/en/solutions/ht510865-system-fails-to-boot-into-rhel81-with-error-message-exit_boot-failed-efi_main-failed-lenovo-thinksystem
https://bugzilla.redhat.com/show_bug.cgi?id=1824005 (anyone with access to the redhat bugzilla portal?)
https://github.com/clearlinux/distribution/issues/1885 (though this one is about kernel 5.6 and a kernel setting which we do not use - not sure if this is related or not?!?!)
-
@sebastian-roth May be an issue with the tftpboot.prev folder… When I first updated to 1.5.9, I thought that the issues I was seeing were from some file not installing correctly… so I ran the installer again after I ran into the issue. So I might not have the actual previous tftpboot folder.
Here’s the output from ls -al /home | grep fog_web:
drwxr-xr-x 10 root root 4096 Feb 26 2020 fog_web_1.5.8.BACKUP
drwxr-xr-x 10 root root 4096 Nov 20 12:12 fog_web_1.5.9.BACKUP
drwxr-xr-x 11 root root 4096 Jun 22 10:22 fog_web_1.5.9-RC2.9.BACKUP
I’ll have to test the older bzimage kernel next Wednesday, I’m on vacation until then. I will get back to you with the results that morning.
-
@sebastian-roth I was able to test out the kernel this morning. I swapped in the ones from the 1.5.9.BACKUP folder as well as the 1.5.8.BACKUP folder, both of them had the same error. Thinking this might then be an iPXE error?
-
@hancocza said in EFI_exit failed error when leaving FOG Menu:
I swapped in the ones from the 1.5.9.BACKUP folder as well as the 1.5.8.BACKUP folder, both of them had the same error. Thinking this might then be an iPXE error?
Sure seems to be on the iPXE side. So we will need to step back to an earlier version. As a first reference we have the date of the backup directories. Looks like you upgraded from 1.5.8 to 1.5.9-RC2 on 26th of Feb 2020. We will try using iPXE from around that time.
When you have the FOG installer in /root/fogproject the following commands should work for you. Just change according to your situation:
cd /root/ipxe-efi/ipxe/
git clean -fd
git reset --hard
git checkout e3ca2110712f6472465c70f2e83b745ff8a25fcc
cd src/
cp ../../fogproject/src/ipxe/src-efi/ipxescript .
cp ../../fogproject/src/ipxe/src-efi/ipxescript10sec .
cp ../../fogproject/src/ipxe/src-efi/config/general.h config/
cp ../../fogproject/src/ipxe/src-efi/config/settings.h config/
cp ../../fogproject/src/ipxe/src-efi/config/console.h config/
make EMBED=ipxescript bin-x86_64-efi/ipxe.efi CERT=/opt/fog/snapins/ssl/CA/.fogCA.pem TRUST=/opt/fog/snapins/ssl/CA/.fogCA.pem
You should end up with a 64-bit UEFI binary in /root/ipxe-efi/ipxe/src/bin-x86_64-efi/ipxe.efi to use. Copy that over to /tftpboot (rename the current one) and give it a try.
-
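A minimal sketch of the "rename the current one, then copy" step from the post above. The function name install_ipxe and the timestamped .bak suffix are made up for illustration and are not part of FOG; the sketch assumes the binary in the TFTP directory is called ipxe.efi.

```shell
#!/bin/sh
# install_ipxe is a hypothetical helper, not part of FOG.
# Usage: install_ipxe <new-binary> <tftp-directory>
install_ipxe() {
    new_bin=$1
    tftp_dir=$2
    target="$tftp_dir/ipxe.efi"

    # Refuse to touch anything if the freshly built binary is missing.
    [ -f "$new_bin" ] || { echo "missing $new_bin" >&2; return 1; }

    # Rename the current binary first so it can be restored later.
    if [ -f "$target" ]; then
        mv "$target" "$target.$(date +%Y%m%d-%H%M%S).bak" || return 1
    fi

    cp "$new_bin" "$target"
}
```

With the paths used in this thread the call would be install_ipxe /root/ipxe-efi/ipxe/src/bin-x86_64-efi/ipxe.efi /tftpboot.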
@sebastian-roth That seems to have worked. Was able to boot to a task on an Optiplex 7010. Will test on a few others tomorrow to be sure.
-
@hancocza That sounds like we hit an error that was introduced in iPXE some time within the last year. Would you be keen to find out which change exactly it was, so we can hopefully report it to the iPXE devs and get it fixed?
That would mean you’d compile a fair amount of binaries and test all of them. Would you be willing? Half a day of work should be enough I suppose.
It’s not too complicated and you will learn using an awesome feature of git called “bisect”…
-
@sebastian-roth Yeah I can do that. Give me till next week, gotta get a few imaged while it’s up and running.
-
@hancocza Just wondering if you are still keen to dig into this?
-
@sebastian-roth Can do. Feel free to post the instructions and I’ll get working.
-
@hancocza Essentially what we want to do is find the commit that introduced the issue. For that we can either compile binaries from each and every commit that was added since then (lots of unneeded work) or we can use the well-known strategy called divide and conquer. For example, say we have 13 commits, with number 1 being the latest, which we know has the issue, and number 13 being an older one we tested and found not to show the issue. Now divide the list in half and test number 7. Let’s assume it shows the same issue as 1. Then we know 2-6 have the bug as well without needing to test those. Next we compile and test number 10 and so forth. This method of divide and conquer is pretty much all handled by git (the bisect subcommand).
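To make the 13-commit example concrete, here is a small simulation of the halving strategy. It is only a sketch of the idea, not what git runs internally: FIRST_BAD is a made-up value standing in for the real build-and-boot test, and the function names are invented for illustration.

```shell
#!/bin/sh
# Simulated bisect over 13 commits: 1 is the newest (known bad),
# 13 is the oldest (known good). FIRST_BAD is a made-up value that
# stands in for the real build-and-boot test on the hardware.
FIRST_BAD=8

# Succeeds if the commit shows the error. Commits are numbered
# newest-first, so everything from 1 up to FIRST_BAD is bad.
is_bad() {
    [ "$1" -le "$FIRST_BAD" ]
}

# Prints each commit tested, then the commit that introduced the error.
bisect() {
    bad=$1    # newest commit, known bad
    good=$2   # oldest commit, known good
    while [ $((good - bad)) -gt 1 ]; do
        mid=$(( (bad + good) / 2 ))
        if is_bad "$mid"; then
            echo "test $mid: bad"
            bad=$mid
        else
            echo "test $mid: good"
            good=$mid
        fi
    done
    echo "first bad commit: $bad"
}

bisect 1 13
```

With FIRST_BAD=8 the loop tests commits 7, 10, 8 and 9, so four builds instead of eleven.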
Now to add a bit of complexity, git does not number commits consecutively but uses what look like random hash values. So it might be helpful for you to keep an eye on the list of commits of the iPXE code repo. On the other hand you should not need to keep track of the commit ids yourself, but I just thought I’d let you know about this list so you can take a look in case you find yourself lost somehow.
Now let’s get into the details of this:
cd /root/ipxe-efi/ipxe/
git clean -fd
git reset --hard
git checkout master
git bisect start
git bisect bad
git bisect good e3ca2110712f6472465c70f2e83b745ff8a25fcc
Essentially you go to the ipxe code used earlier, prepare it (cleanup), checkout the latest commit (in master branch), start bisecting and mark this latest commit as “bad” and the one back from Feb 16th, 2020 as “good”. Now git will tell you:
Bisecting: 77 revisions left to test after this (roughly 6 steps)
[02201417104c751545dda261eb33f0012703d1ff] [efi] Fix reporting of USB supported languages array
Now the repository is checked out with the code of the commit halfway between the known bad and the known good commit. We just need to compile iPXE with that code base and test it:
cd src/
cp ../../fogproject/src/ipxe/src-efi/ipxescript .
cp ../../fogproject/src/ipxe/src-efi/config/general.h config/
cp ../../fogproject/src/ipxe/src-efi/config/settings.h config/
cp ../../fogproject/src/ipxe/src-efi/config/console.h config/
make EMBED=ipxescript bin-x86_64-efi/ipxe.efi CERT=/opt/fog/snapins/ssl/CA/.fogCA.pem TRUST=/opt/fog/snapins/ssl/CA/.fogCA.pem
cp bin-x86_64-efi/ipxe.efi /tftpboot
cd ..
git clean -fd
git reset --hard
Now you will either see the issue happen or not. If the issue is seen you just enter
git bisect bad
and otherwise (no issue) run
git bisect good
Now you can do the next test loop using the exact same commands as seen above, starting with cd src/.
Keep going until you get a message
...is the first bad commit
after telling bisect good or bad. Should take about 6-7 rounds of testing in this case.
-
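Since every round repeats the same build commands, they could be wrapped in a small function. This is only a sketch: bisect_round, run, DRY_RUN, IPXE_DIR, FOG_EFI_DIR and FOG_CA are made-up names, and the default paths are simply the ones quoted in this thread, so they are assumptions for any other setup.

```shell
#!/bin/sh
# Hypothetical wrapper around one build-and-stage round of the bisect.
# Defaults are the paths used in this thread; override them as needed.
IPXE_DIR=${IPXE_DIR:-/root/ipxe-efi/ipxe}
FOG_EFI_DIR=${FOG_EFI_DIR:-../../fogproject/src/ipxe/src-efi}  # relative to src/
FOG_CA=${FOG_CA:-/opt/fog/snapins/ssl/CA/.fogCA.pem}

# With DRY_RUN=1 the command is only printed, otherwise it is executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Build iPXE from the commit git bisect checked out, stage the binary in
# /tftpboot and clean the tree again. Stops at the first failing command.
bisect_round() {
    run cd "$IPXE_DIR/src" &&
    run cp "$FOG_EFI_DIR/ipxescript" . &&
    run cp "$FOG_EFI_DIR/config/general.h" config/ &&
    run cp "$FOG_EFI_DIR/config/settings.h" config/ &&
    run cp "$FOG_EFI_DIR/config/console.h" config/ &&
    run make EMBED=ipxescript bin-x86_64-efi/ipxe.efi "CERT=$FOG_CA" "TRUST=$FOG_CA" &&
    run cp bin-x86_64-efi/ipxe.efi /tftpboot &&
    run cd .. &&
    run git clean -fd &&
    run git reset --hard
}
```

Setting DRY_RUN=1 before calling bisect_round previews the commands without touching anything. After a real round, boot the test machine and then answer with git bisect bad or git bisect good as described above; that part stays manual because only the hardware can show the error.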
@sebastian-roth Finished this. Ended up with “c70b3e04e86cefca335e36f883829d89583a6921 is the first bad commit”.
-
@hancocza Nice work! Looking at this commit I see that several other people reported an issue with this particular commit: https://github.com/ipxe/ipxe/commit/c70b3e04e86cefca335e36f883829d89583a6921 (scroll down to the comments)
A huge discussion sparked off in issue #164 about specific EFI drivers that need to be blocked in the code on specific hardware with faulty drivers. This is on a Dell OptiPlex 9020, Dell OptiPlex 3020M and HP t620.
EDIT: Ok, reading all the way to the bottom of this issue report I see that iPXE developer Michael Brown got himself a machine to reproduce the issue and just pushed out a new commit a few hours ago!! Wow. So please go ahead, pull the very latest source from the iPXE github repo, compile a binary and test.
Let us know if this works on your Dell Optiplex 7010 and Lenovo M83 devices as well.
-
@sebastian-roth said in EFI_exit failed error when leaving FOG Menu on Dell Optiplex 7010 and Lenovo M83:
Let us know if this works on your Dell Optiplex 7010
I realize you are talking about EFI exit mode but the 7010s have an issue when ipxe initializes on the 7010. It will hang at initializing… in uefi mode and never proceed. Unless this fix addresses that part too it might not be possible to test the efi exit function.
-
@george1421 As far as I get it this is not about EFI exit mode but iPXE not being able to chainload to the Linux kernel for a task. It’s interesting you mention the hang on “Initializing devices…” because that is mentioned for the 7010 in this particular issue report as well: https://github.com/ipxe/ipxe/issues/164#issuecomment-726391430 (though never addressed)
I am not sure if @hancocza is able to boot in the 7010’s by using snponly.efi or plain ipxe.efi?! Or maybe this issue has been fixed some time ago already.
-
@sebastian-roth As long as I can remember, I haven’t had an issue with iPXE on an EFI usb drive on an Optiplex 7010. I have had the issue where it gets stuck on “Initializing iPXE…” on 790s when using the EFI usb drive. I can’t speak to the built in iPXE booting on those machines since our network structure doesn’t allow for direct pxe booting.
-
@sebastian-roth Just reran the test with the latest ipxe version. Still has the same issue on a 7010.
-
@hancocza said in EFI_exit failed error when leaving FOG Menu on Dell Optiplex 7010 and Lenovo M83:
I can’t speak to the built in iPXE booting on those machines since our network structure doesn’t allow for direct pxe booting.
Are you saying you don’t PXE boot at all but using iPXE on USB all the time?
Just reran the test with the latest ipxe version. Still has the same issue on a 7010.
Make sure you see the latest commit id in the header when iPXE loads (just the first 4-5 characters).
-
-
Yep, use the iPXE on USB to boot to the iPXE on the fogserver, then use that to chainload.
-
I think it’s the latest: g47098 matches the 47098 start of the commit id
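The comparison described here (short id from the iPXE banner against the commit id of the source that was built) can be sketched as a prefix check. banner_matches is a hypothetical helper name; the only assumption taken from this thread is that the banner shows the id with a leading "g", as in g47098.

```shell
#!/bin/sh
# banner_matches is a hypothetical helper. It checks whether the short id
# shown in the iPXE banner (for example "g47098") is a prefix of a full
# commit id. The leading "g" is not part of the hash.
# Usage: banner_matches <banner-id> <full-commit-id>
banner_matches() {
    short=${1#g}                  # drop the leading "g" if present
    full=$2
    [ -n "$short" ] || return 1   # an empty id would match anything
    case "$full" in
        "$short"*) return 0 ;;
        *)         return 1 ;;
    esac
}
```

Run from the iPXE source directory, banner_matches g47098 "$(git rev-parse HEAD)" succeeds only if the booted binary was built from the commit that is currently checked out.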
-
-
@hancocza Well then the next step would probably be to boot into UEFI shell and get some specific output from your machine (read 1, 2 and 3) and update that topic on github. Maybe you can even provide hardware to Michael for debugging.
If there is a chance you can get this fixed then now is the time to engage with the iPXE dev. Make sure you tell them you are booting off a USB key not via PXE. Not sure if that makes a difference in this context.
Did you try on the Lenovo M83? Is that fixed or still showing the same issue?
-
@sebastian-roth Back tracking on this a bit…
I was able to get it to work by following George’s instructions at the start of this thread, with making a USB the easy way (using the latest iPXE version directly on the USB drive). My other test this morning was using an older version of iPXE on the USB drive that then boots to the iPXE version on the server. That does mean though that their latest version fixed the issue, since I tried the easy method USB drive at the start of this thread with the same error. My thought is that it exits fine on the server, but then when it tries to exit the older iPXE version on the flash drive that it encounters the error.