refind not working properly
-
Hey guys. I seem to be having a new problem with FOG. When I have my hosts configured to boot from NIC at first priority, they will boot to PXE and then when no input happens they will skip that and attempt to load Windows. This is where the problem starts. Instead of loading Windows, they will either just freeze solid on a black screen or go into a boot loop and keep booting PXE, attempting to load Windows and then rebooting. When I disable PXE booting, the machines boot into Windows without a problem.
I have posted about this on reddit here. One user there has suggested something about non-signed kernels having issues chainloading to a signed Windows kernel.
This problem does not happen at my primary mine location where the FOG server is running an older version of Ubuntu and an older version of FOG. That same user on reddit suggested the possibility of changing the kernel at the secondary location to the same one used at the primary location. Is this a feasible solution and if so, how do I go about doing that? If not, is there some other way to fix this issue? I need to be able to image these machines remotely and cannot be on-site to manually reconfigure BIOS back and forth between PXE and local disk boot whenever I need to image them.
Thanks.
-
@Huecuva Are all your machines set to boot in UEFI mode? Just asking because only UEFI will be able to use rEFInd at all.
In general, chainloading the OS from disk depends a lot on the hardware you have. Some machines play nicely with the different exit types we provide (have you switched to others and tried again yet?) but there are machines that don’t.
This problem does not happen at my primary mine location where the FOG server is running an older version of Ubuntu and an older version of FOG.
Are these all set to UEFI as well or will we be comparing apples and pears here?
Yes you can switch the rEFInd binary used to chainload but I would really want to know if all your machines are UEFI in the first place and if you have tried the other exit types yet.
-
@Sebastian-Roth I’m not sure exactly that it is rEFInd that isn’t working. That’s just what one of the guys on the reddit thread suggested. He also asked about the hardware.
I have basically two different hardware configurations for my mining rigs. They’re either using a certain model of MSI motherboard or they’re using a certain model of Biostar motherboard. Other than that, the hardware is all the same.
Also, yes, they are all booting UEFI. As far as I can tell, the exit types at both locations are just left at default (which is to say there isn’t one selected). I can try different exit types.
Both locations have a combination of both hardware configurations. At the primary location with the older Ubuntu and FOG software, it’s all working just dandy. At the secondary location with Ubuntu 20.04 and FOG 1.5.9, the MSI hosts seem to be having this problem.
It’s also strange because when I first registered the hosts on FOG at the secondary location, they booted into Windows just fine. It was only after I captured an image and subsequently tried to deploy it to a few of the hosts that this problem started happening.
-
We have a subreddit!?
But anyway, assuming you have all UEFI images and machines and no custom exit types on the hosts, there are some known problems with newer versions of rEFInd that affect select hardware.
You can find version 0.11.0, which predates them, here: https://sourceforge.net/projects/refind/files/0.11.0/
There are various HP computers that have issues with versions after 0.11.0, and reports of various other machines in the forums. I contacted the rEFInd developer about this back in like 2017 or 2018 when the problem first started, but his response ultimately ended up being that he didn’t think it was a widespread enough issue to isolate and fix. In his defense, he’s the only dev on the project as far as I know. But since then I just revert to 0.11.0 since it does the job.
I usually download the flashdrive one and extract it with 7zip (the zip and then the file extracted from the zip). Then you can use your favorite ftp/scp client such as winscp to copy the refind binary to /var/www/fog/html/service/ipxe/refind.efi. You may also need to copy over refind_x64.efi. I would suggest making .old versions of the existing binaries. You can also mess with the refind.conf file in that same folder.
Note that these binaries and that conf file will be overwritten with every update of the fog server.
Hope that helps.
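A minimal sketch of the ".old" backup step, in case it helps. The function name backup_refind is made up for illustration, and the directory is passed in as an argument because the exact path differs between installs (the path in the commented example is an assumption, not a confirmed location).

```shell
#!/usr/bin/env bash
# Sketch: keep ".old" copies of the rEFInd files FOG serves before replacing them.
# backup_refind is a hypothetical helper name, not part of FOG or rEFInd.

backup_refind() {
    local ipxe_dir="$1"
    local name
    for name in refind.efi refind_x64.efi refind.conf; do
        # Only back up files that exist, and never overwrite an earlier backup.
        if [ -f "$ipxe_dir/$name" ] && [ ! -e "$ipxe_dir/$name.old" ]; then
            cp -p "$ipxe_dir/$name" "$ipxe_dir/$name.old"
        fi
    done
}

# Example call (adjust the path to wherever your install keeps these files):
# backup_refind /var/www/fog/service/ipxe
```

Because an existing .old file is left alone, running this again after swapping binaries does not replace the original backup with the new file.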
-
@JJ-Fullmer That does indeed help. I think I will probably end up switching to rEFInd 0.11.0, if that seems to be what’s working for you for the majority of machines. How do I tell what version of rEFInd is currently being used? Just for my own curiosity I’d like to check the primary location.
It should be pretty easy to get this set up and test it from my office at the primary location tomorrow.
-
@Huecuva That is a good question that I don’t think I have a great answer to, sadly. There’s probably a way, but I haven’t actually found one other than comparing file sizes of the binary. There’s probably metadata in the file somewhere to access the version.
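The file-size comparison can be made a bit stricter with a byte-for-byte compare or a checksum. This is only a sketch: same_binary and fingerprint are made-up helper names, and it only tells you whether the served file matches a binary you already know the version of (for example one extracted from the 0.11.0 download). It does not read a version number out of the file.

```shell
#!/usr/bin/env bash
# Sketch: check whether the rEFInd binary a server hands out matches a known one.
# Both function names are hypothetical helpers for illustration.

same_binary() {
    # cmp -s prints nothing and exits 0 only when both files have identical content.
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

fingerprint() {
    # Print size in bytes and SHA-256, so two servers can be compared by hand.
    printf '%s bytes  ' "$(wc -c < "$1" | tr -d ' ')"
    sha256sum "$1" | cut -d ' ' -f 1
}

# Example calls (paths are assumptions):
# same_binary /var/www/fog/service/ipxe/refind.efi ~/refind-0.11.0/refind_x64.efi
# fingerprint /var/www/fog/service/ipxe/refind.efi
```

Running fingerprint on both FOG servers and comparing the two lines avoids having to move files between sites.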
-
Hi folks, been having problems with refind myself today. I can confirm that 0.11.0 seems to have fixed my boot problems, so thank you very much!
-
@Goll420 Awesome. That’s good to know. Thanks. I will be getting 0.11.0 set up today. I am optimistic that it will solve my issues.
-
@JJ-Fullmer Well, that didn’t work. I am at a loss.
I downloaded the bin-0.11.0 zip file and copied the refind_x64.efi file to the /var/www/fog/service/ipxe/ directory (there was no /var/www/fog/html/service/ipxe) and that made no difference. There was no refind.efi file in the zip. When that failed, I copied all of the relevant refind files (refind.conf, refind.efi, refind_x64.efi and refind_ia32.efi) from the FOG server at the primary location into the /var/www/fog/service/ipxe directory at the secondary location. I actually had to put them all on a USB stick and drive to the secondary location to do this because gmail would not let me email them to myself. That also made no difference. I then created a /var/www/fog/html/service/ipxe/ directory to put everything in, in the hopes that would solve the problems. It did not.
Then I downloaded the .deb file and ran that. It also didn’t fix anything.
When all of that failed to accomplish anything useful, I then tried to mess around with a few different UEFI exit types (I tried FIRST_FOUND_HDD and FIRST_FOUND_WINDOWS) for a couple of the rigs. All that resulted in was the inability to boot past the FOG menu. The countdown would simply keep restarting. Even after setting the UEFI exit type back to default, the only way I could get it to boot past that menu was to delete and re-register the host.
I feel like I’m beating my head against a wall. I wonder if the gnuefi download might help?
-
@Goll420 How did you go about switching to 0.11.0? I’m concerned that perhaps I did something wrong.
-
@Huecuva said in refind not working properly:
How did you go about switching to 0.11.0?
By overwriting the files in the /var/www/html/fog/service/ipxe directory with the ones from the zip file. For refind.efi, take refind_x64.efi and copy that to refind.efi. The refind.efi was recently changed to refind_x64.efi, but to keep backwards compatibility it was left in the build.
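As a sketch of that copy step: take the refind_x64.efi from the extracted zip and place it in the ipxe directory under both names, then confirm the two served files match. The name install_refind is hypothetical, and both arguments are supplied by you, since the extracted location and the web directory vary.

```shell
#!/usr/bin/env bash
# Sketch: install one x64 rEFInd binary under both file names.
# install_refind is a made-up helper name for illustration.

install_refind() {
    local src="$1"       # refind_x64.efi taken from the extracted download
    local ipxe_dir="$2"  # directory the FOG web server serves the binaries from

    # Stop early if the source file is not where we expect it.
    [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }

    cp "$src" "$ipxe_dir/refind_x64.efi" || return 1
    cp "$src" "$ipxe_dir/refind.efi" || return 1

    # Exit status 0 only if both served files now have identical content.
    cmp -s "$ipxe_dir/refind.efi" "$ipxe_dir/refind_x64.efi"
}

# Example call (both paths are assumptions):
# install_refind ~/refind-bin-0.11.0/refind/refind_x64.efi /var/www/html/fog/service/ipxe
```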
-
@george1421 So refind.efi is just refind_x64.efi renamed? Interesting. I’m not sure it will make a difference though, as I have already copied the refind.efi file from the primary location to the secondary and it didn’t fix anything but I will try it anyway. Thanks.
-
@Huecuva So what exactly is refind not doing? Are you seeing the refind menu?
-
@Huecuva said in refind not working properly:
So refind.efi is just refind_x64.efi renamed?
Yes. Several versions back it was discovered that refind.efi caused an issue when the target computer was a 32-bit UEFI computer, and the ia86 version of EFI would not boot on ARM processors. So they were split out as delivered from the rEFInd project.
-
@george1421 I’m not really sure what you mean. I don’t know what the refind menu is.
What’s happening is that when I try to boot these particular mining rigs with MSI motherboards (which work fine at my primary location with the same hardware configuration but the FOG server is running older software) from PXE, it detects media and properly boots into the FOG menu where it gives the option to boot from the local drive by default or to register, delete or image the host. When I select the first option or let the timer run down, it then begins to load Windows, as I can see the circling Windows loading indicator, but then the screen goes black and it eventually reboots and does the whole thing over again. And then it keeps doing that.
When I remove the network card from the list of available boot options (or for that matter even set it to second priority after the local drive) it will then boot into Windows just fine. The worst that happens is it sometimes requires a few minutes to auto-repair.
When I tried changing the UEFI exit type for one of the misbehaving hosts to either HDD or WINDOWS, all that happened was that the timer in the FOG menu would just keep resetting and it would never boot past that until I either changed BIOS to not boot from the NIC or deleted and re-registered the host in FOG.
So far replacing the refind files with older ones has not worked, but I will try swapping the refind.efi for the refind_x64.efi and see if that helps. I have to admit though, that I have my doubts, since the refind.efi file from the primary location did not solve the problem.
-
@Huecuva said in refind not working properly:
What’s happening is that when I try to boot these particular mining rigs with MSI motherboards (which work fine at my primary location with the same hardware configuration but the FOG server is running older software) from PXE, it detects media and properly boots into the FOG menu where it gives the option to boot from the local drive by default or to register, delete or image the host. When I select the first option or let the timer run down, it then begins to load Windows, as I can see the circling Windows loading indicator, but then the screen goes black and it eventually reboots and does the whole thing over again. And then it keeps doing that.
OK. Going by what I learned in another thread where refind was thought to be at fault, let’s make sure we understand what your configuration is. I’m not going to assume anything here.
So these msi computers at the remote site, are they in bios or uefi mode?
What device is your dhcp server? (mfg and model)
What specifically do you have listed for dhcp options 66 and 67 at this remote site?
-
@george1421 They are all in UEFI mode and are basically identical in every respect, from the hardware to the OS image they’re running, to the rigs at the primary location. Completely interchangeable.
I also have rigs with Biostar motherboards that otherwise have exactly the same hardware as the MSI machines. These rigs are not experiencing these issues. I’m not sure if that renders your DHCP questions moot, but I can get you that information tomorrow if it’s still necessary.
-
@Huecuva said in refind not working properly:
They are all in UEFI mode and are basically identical in every respect from hardware to the OS image
When I tried changing the UEFI exit type for one of the misbehaving hosts to either HDD or WINDOWS, all that happened was that the timer in the FOG menu would just keep resetting and it would never boot past that until I…
If refind cannot locate the boot partition, it will display a refind menu, not reload the iPXE menu. To me this is an indication of a BIOS computer being told to boot a UEFI loader, or a UEFI computer being told to boot a BIOS loader. It fails to start, so it just falls back to the FOG iPXE menu.
I want you to try this hack that Sebastian came up with to help debug.
On the Master FOG server, there is a directory called /tftpboot. In there is a text file called default.ipxe. Let’s rename that file to default.ipxe.sav and then create a new default.ipxe file. In that file enter this text and save it.
#!ipxe
show platform
Now pxe boot the computer. The iPXE menu will not be displayed, but text will be displayed. Tell me what that text says.
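The rename-and-replace step could be scripted roughly like this. It is only a sketch: the two function names are made up, the restore helper is an addition so the normal menu can be put back afterwards, and it assumes the debug script should have #!ipxe on its own first line with show platform on the next. As far as I know, the platform setting in iPXE reports either efi or pcbios.

```shell
#!/usr/bin/env bash
# Sketch: temporarily swap default.ipxe for a two-line debug script.
# enable_platform_debug / restore_default_ipxe are hypothetical helper names.

enable_platform_debug() {
    local tftp_dir="$1"
    # Keep the original script so it can be restored afterwards.
    mv "$tftp_dir/default.ipxe" "$tftp_dir/default.ipxe.sav" || return 1
    # Write the debug script: shebang line first, then the single command.
    printf '#!ipxe\nshow platform\n' > "$tftp_dir/default.ipxe"
}

restore_default_ipxe() {
    local tftp_dir="$1"
    # Put the saved script back, replacing the debug version.
    mv -f "$tftp_dir/default.ipxe.sav" "$tftp_dir/default.ipxe"
}

# Example calls (run as a user allowed to write to the TFTP directory):
# enable_platform_debug /tftpboot
# ...PXE boot one client and note the text it prints...
# restore_default_ipxe /tftpboot
```

While the debug script is in place, no client will reach the FOG menu, so it is worth restoring the original as soon as the output has been noted.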
Also for the remote sites it is important to know the values of the dhcp options. They “should” point to the local storage node at that location. The storage node then should redirect the client to load boot.php from the master node.
Edit: I just looked over the thread and it doesn’t look like you are using storage nodes in your environment, so you must have a full fog server at each remote location. So if that is the case then you will need to edit the default.ipxe file at the remote fog server.
Let’s also understand what version of FOG is at the main location and what version of FOG is at the remote locations.
-
@george1421 There are only two locations. The primary location and the secondary location. The secondary location is running FOG 1.5.9 on Ubuntu Server 20.04, however several of the refind files have been replaced with those from the primary location. The primary location is running Ubuntu Server 18.04 but I don’t know precisely what version of FOG is running. Is there a way to tell inside the FOG dashboard or is there some other way to tell? I can get you the FOG version at the primary location tomorrow.
The FOG server at the Primary location has FOG running on a 250GB SSD I think, with the default /images on the same drive for images. There are only three images stored and there is plenty of room. The secondary location has a 1TB HDD mounted to /images which doesn’t even have 200GB of data on it including images for both the MSI and Biostar machines and more than half of that is because the MSI image isn’t resized. The images are only about 30GB when the partitions are resized.
I will try your hack tomorrow.
-
@Huecuva said in refind not working properly:
Is there a way to tell inside the FOG dashboard
Yes, at the bottom of the web GUI it should tell you what version.
Do you just have 2 independent FOG servers, or is one a storage node?