refind not working properly
-
@JJ-Fullmer That does indeed help. I think I will probably end up switching to rEFInd 0.11.0, if that seems to be what’s working for you on the majority of machines. How do I tell what version of rEFInd is currently being used? Just for my own curiosity I’d like to check the primary location.
It should be pretty easy to get this set up and test it from my office at the primary location tomorrow.
-
@Huecuva That is a good question, and sadly I don’t think I have a great answer to it. There’s probably a way, but I haven’t actually found one other than comparing file sizes of the binary. There’s probably metadata somewhere in the file that exposes the version.
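One way to make that comparison more reliable than eyeballing file sizes is to print the size and SHA-256 digest of the binary on each server and compare the output with the same fingerprint taken from the refind_x64.efi in the 0.11.0 zip. This does not read a version number out of the file; it only tells you whether two binaries are identical. A minimal sketch, assuming GNU coreutils (sha256sum, wc) are installed; the function name refind_fingerprint is made up for illustration:

```shell
#!/bin/sh
# refind_fingerprint: print "<size in bytes> <sha256>" for one file.
# Hypothetical helper, not part of FOG or rEFInd.
refind_fingerprint() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    size=$(wc -c < "$1" | tr -d ' ')
    sum=$(sha256sum "$1" | cut -d ' ' -f 1)
    printf '%s %s\n' "$size" "$sum"
}

# Example (paths are assumptions, adjust to your servers):
#   refind_fingerprint /var/www/html/fog/service/ipxe/refind.efi
#   refind_fingerprint ~/refind-bin-0.11.0/refind/refind_x64.efi
```

If the two lines match, the server is running the same binary as the one in the zip.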
-
Hi folks, I’ve been having problems with refind myself today. I can confirm that 0.11.0 seems to have fixed my boot problems, so thank you very much!
-
@Goll420 Awesome. That’s good to know. Thanks. I will be getting 0.11.0 set up today. I am optimistic that it will solve my issues.
-
@JJ-Fullmer Well, that didn’t work. I am at a loss.
I downloaded the bin-0.11.0 zip file and copied the refind_x64.efi file to the /var/www/fog/service/ipxe/ directory (there was no /var/www/fog/html/service/ipxe) and that made no difference. There was no refind.efi file in the zip. When that failed, I copied all of the relevant refind files (refind.conf, refind.efi, refind_x64.efi and refind_ia32.efi) from the FOG server at the primary location into the /var/www/fog/service/ipxe directory at the secondary location. I actually had to put them all on a USB stick and drive to the secondary location to do this because gmail would not let me email them to myself. That also made no difference. I then created a /var/www/fog/html/service/ipxe/ directory to put everything in, in the hopes that would solve the problems. It did not. Then I downloaded the .deb file and ran that. It also didn’t fix anything.
When all of that failed to accomplish anything useful, I then tried to mess around with a few different UEFI exit types (I tried FIRST_FOUND_HDD and FIRST_FOUND_WINDOWS) for a couple of the rigs. All that resulted in was the inability to boot past the FOG menu. The countdown would simply keep restarting. Even after setting the UEFI exit type back to default, the only way I could get it to boot past that menu was to delete and re-register the host.
I feel like I’m beating my head against a wall. I wonder if the gnuefi download might help?
-
@Goll420 How did you go about switching to 0.11.0? I’m concerned that perhaps I did something wrong.
-
@Huecuva said in refind not working properly:
How did you go about switching to 0.11.0?
By overwriting the files in the /var/www/html/fog/service/ipxe directory with the ones from the zip file. For refind.efi, take refind_x64.efi and copy that to refind.efi. The file refind.efi was recently renamed to refind_x64.efi, but refind.efi was left in the build to keep backwards compatibility.
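Spelled out as commands, that would look roughly like the sketch below. The function name install_refind and the .bak backups are additions for illustration (the backups are only there so the change can be undone); the source directory is wherever the zip was unpacked.

```shell
#!/bin/sh
# install_refind SRC_DIR DEST_DIR
# Copies refind_x64.efi from SRC_DIR into DEST_DIR under both names,
# refind_x64.efi and refind.efi. Hypothetical helper, run as root.
install_refind() {
    src_dir=$1    # directory holding the unpacked rEFInd binaries
    dest_dir=$2   # e.g. /var/www/html/fog/service/ipxe
    if [ ! -f "$src_dir/refind_x64.efi" ]; then
        echo "refind_x64.efi not found in $src_dir" >&2
        return 1
    fi
    if [ ! -d "$dest_dir" ]; then
        echo "$dest_dir does not exist" >&2
        return 1
    fi
    # Assumption: keep a copy of the current files before overwriting them.
    for f in refind.efi refind_x64.efi; do
        if [ -f "$dest_dir/$f" ]; then
            cp -p "$dest_dir/$f" "$dest_dir/$f.bak" || return 1
        fi
    done
    cp "$src_dir/refind_x64.efi" "$dest_dir/refind_x64.efi" &&
    cp "$src_dir/refind_x64.efi" "$dest_dir/refind.efi"
}

# Example (source path is an assumption):
#   install_refind ~/refind-bin-0.11.0/refind /var/www/html/fog/service/ipxe
```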
-
@george1421 So refind.efi is just refind_x64.efi renamed? Interesting. I’m not sure it will make a difference though, as I have already copied the refind.efi file from the primary location to the secondary and it didn’t fix anything but I will try it anyway. Thanks.
-
@Huecuva So what exactly is refind not doing? Are you seeing the refind menu?
-
@Huecuva said in refind not working properly:
So refind.efi is just refind_x64.efi renamed?
Yes. Several versions back it was discovered that refind.efi caused an issue when the target computer was a 32-bit UEFI computer, and the ia32 version of the efi file would not boot on ARM processors. So the binaries were split out, the same way they are delivered from the refind project.
-
@george1421 I’m not really sure what you mean. I don’t know what the refind menu is.
What’s happening is that when I try to boot these particular mining rigs with MSI motherboards from PXE (they work fine at my primary location with the same hardware configuration, but the FOG server there is running older software), it detects media and properly boots into the FOG menu, where it gives the option to boot from the local drive by default or to register, delete or image the host. When I select the first option or let the timer run down, it begins to load Windows, as I can see the circling Windows loading indicator, but then the screen goes black and it eventually reboots and does the whole thing over again. And then keeps doing that.
When I remove the network card from the list of available boot options (or for that matter even set it to second priority after the local drive) it will then boot into Windows just fine. The worst that happens is it sometimes requires a few minutes to auto-repair.
When I tried changing the UEFI exit type for one of the misbehaving hosts to either HDD or WINDOWS, all that happened was that the timer in the FOG menu would just keep resetting and it would never boot past that until I either changed BIOS to not boot from the NIC or deleted and re-registered the host in FOG.
So far replacing the refind files with older ones has not worked, but I will try swapping the refind.efi for the refind_x64.efi and see if that helps. I have to admit though, that I have my doubts, since the refind.efi file from the primary location did not solve the problem.
-
@Huecuva said in refind not working properly:
What’s happening is that when I try to boot these particular mining rigs with MSI motherboards (which work fine at my primary location with the same hardware configuration but the FOG server is running older software) from PXE, it detects media and properly boots into the FOG menu where it gives the option to boot from the local drive by default or to register, delete or image the host. When I select the first option or let the timer run down, it then begins to load Windows, as I can see the circling Windows loading indicator, but then the screen goes black and it eventually reboots and does the whole thing over again. And then keeps doing that.
OK. Given what I learned in another thread where we assumed refind was at fault, let’s make sure we understand what your configuration is. I’m not going to assume anything here.
So these msi computers at the remote site, are they in bios or uefi mode?
What device is your dhcp server? (mfg and model)
What specifically do you have listed for dhcp options 66 and 67 at this remote site?
-
@george1421 They are all in UEFI mode and are basically identical in every respect from hardware to the OS image that it’s running to the rigs at the primary location. Completely interchangeable.
I also have rigs with Biostar motherboards that otherwise have exactly the same hardware as the MSI machines. These rigs are not experiencing these issues. I’m not sure if that renders your DHCP questions moot, but I can get you that information tomorrow if it’s still necessary.
-
@Huecuva said in refind not working properly:
They are all in UEFI mode and are basically identical in every respect from hardware to the OS image
When I tried changing the UEFI exit type for one of the misbehaving hosts to either HDD or WINDOWS, all that happened was that the timer in the FOG menu would just keep resetting and it would never boot past that until I…
If refind cannot locate the boot partition, it will display a refind menu, not reload the iPXE menu. To me this is an indication of a bios computer being told to boot a uefi loader, or a uefi computer being told to boot a bios loader. The loader fails to start, so the client just falls back to the FOG iPXE menu.
I want you to try this hack that Sebastian came up with to help debug.
On the Master FOG server, there is a directory called /tftpboot. In it there is a text file called default.ipxe. Let’s rename that file to default.ipxe.sav and then create a new default.ipxe file. In that file enter this text and save it.
#!ipxe
show platform
Now pxe boot the computer. The iPXE menu will not be displayed, but text will be displayed. Tell me what that text says.
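For anyone following along, the rename-and-replace step and the later restore can be sketched as two small shell functions. The function names are made up for illustration, and the check that refuses to overwrite an existing default.ipxe.sav is an added safety assumption, not part of the original instructions.

```shell
#!/bin/sh
# swap_default_ipxe TFTP_DIR
# Renames default.ipxe to default.ipxe.sav and writes the two-line debug
# script in its place. TFTP_DIR is /tftpboot in this thread. Run as root.
swap_default_ipxe() {
    tftp_dir=$1
    if [ ! -f "$tftp_dir/default.ipxe" ]; then
        echo "no default.ipxe in $tftp_dir" >&2
        return 1
    fi
    # Assumption: never clobber an earlier backup.
    if [ -e "$tftp_dir/default.ipxe.sav" ]; then
        echo "default.ipxe.sav already exists, not overwriting" >&2
        return 1
    fi
    mv "$tftp_dir/default.ipxe" "$tftp_dir/default.ipxe.sav" || return 1
    printf '#!ipxe\nshow platform\n' > "$tftp_dir/default.ipxe"
}

# restore_default_ipxe TFTP_DIR
# Puts the saved default.ipxe back once the test is done.
restore_default_ipxe() {
    tftp_dir=$1
    if [ ! -f "$tftp_dir/default.ipxe.sav" ]; then
        echo "no default.ipxe.sav in $tftp_dir" >&2
        return 1
    fi
    mv -f "$tftp_dir/default.ipxe.sav" "$tftp_dir/default.ipxe"
}

# Example:
#   swap_default_ipxe /tftpboot      # then pxe boot the client and note the text
#   restore_default_ipxe /tftpboot   # afterwards, to get the FOG menu back
```

Remember to restore the original file afterwards, otherwise no client will reach the FOG menu.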
Also for the remote sites it is important to know the values of the dhcp options. They “should” point to the local storage node at that location. The storage node then should redirect the client to load boot.php from the master node.
Edit: I just looked over the thread and it doesn’t look like you are using storage nodes in your environment, so you must have a full fog server at each remote location. So if that is the case then you will need to edit the default.ipxe file at the remote fog server.
Also, let’s understand what version of FOG is at the main location and what version of FOG is at the remote locations.
-
@george1421 There are only two locations. The primary location and the secondary location. The secondary location is running FOG 1.5.9 on Ubuntu Server 20.04, however several of the refind files have been replaced with those from the primary location. The primary location is running Ubuntu Server 18.04 but I don’t know precisely what version of FOG is running. Is there a way to tell inside the FOG dashboard or is there some other way to tell? I can get you the FOG version at the primary location tomorrow.
The FOG server at the Primary location has FOG running on a 250GB SSD I think, with the default /images on the same drive for images. There are only three images stored and there is plenty of room. The secondary location has a 1TB HDD mounted to /images which doesn’t even have 200GB of data on it including images for both the MSI and Biostar machines and more than half of that is because the MSI image isn’t resized. The images are only about 30GB when the partitions are resized.
I will try your hack tomorrow.
-
@Huecuva said in refind not working properly:
Is there a way to tell inside the FOG dashboard
Yes, at the bottom of the web gui it should tell you what version.
Do you just have 2 independent FOG servers, or is one a storage node?
-
@george1421 They are two completely independent FOG servers.
-
@Huecuva said in refind not working properly:
there was no
/var/www/fog/html/service/ipxe
I guess there is a typo in this. It should really be
/var/www/html/fog/service/ipxe
…Please run the following commands on your FOG server and post output here:
ls -al /var/www/
ls -al /var/www/fog/
ls -al /var/www/html/
-
@Sebastian-Roth Oh My goodness, how embarrassing. Yes it should have been
/var/www/html/fog/service/ipxe
@Huecuva The show platform thing should give you a better idea for sure on what the problem is.
And we do still want to know what the dhcp options are set to. That tells us how you’re getting your computers to boot to the fog server. One points the client to your fog server as a tftp server and the other tells us which pxe bootfile you are using. Sometimes a different pxe bootfile can make a difference in the boot behavior, which is why a few options are provided. Most of the time the default ipxe.efi option does the trick for uefi machines.
Another thing you could try is to create a bootable usb with refind. I suggest using rufus (https://rufus.ie/) to get the file on the usb, but there are many ways. Here’s a link to how to get the different refind versions: http://www.rodsbooks.com/refind/getting.html. Usually if you can boot to a version of refind from usb, then it will work the same when booting from the network. I say usually as I have seen it work on a usb boot and then not via network, but if that happens it still helps to narrow down where the problem is. I would suggest trying the latest version (which I assume is what is included with fog 1.5.9) and see if it boots. If it doesn’t, then go back to 0.11.0 and see if that helps. If none of them work, then it would be wise to contact the refind developer with your hardware info to let him know it’s not working.
As another workaround option (hopefully we find a full solution though), you could see if your uefi firmware supports a wake on lan boot option, i.e. you set the machines to boot to network if they get a wake on lan packet, but the boot order for normal startups stays as the hard drive. Then when you image a computer you shut it down, set the wake on lan checkbox when deploying the image from fog, and let the wake on lan do the trick. On some computers this works, and some give you a popup asking if you want ipv4 or ipv6 pxe. If you get that popup, then you’d need the firmware to have an option to disable ipv6 so it just goes from WOL to ipv4 pxe boot. It’s for sure easier to just have network boot as the first option, but this is a workaround I employed before finding my refind solution.
-
@Huecuva Let’s keep it simple for the moment. Let’s make sure we fully understand how this second fog server is set up (since it is acting differently than the main site). Knowing they are 2 independent servers eliminates many of the potential issues, because now we know the “problem” is localized to this new FOG server and its environment. Also, what iPXE thinks about the target computer is important. I don’t want to chase something for several hours and have it be the CSM issue again. So knowing exactly what is configured for dhcp options 66 and 67 is important, as well as what device is the dhcp server. I may ask you to capture some network packets so we can see exactly what the target computer is telling the dhcp server. If you know how to use wireshark we can get this answer in about 5 minutes. I don’t want to go this route until we fully understand the environment.
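If it does come to a capture, one possible way to collect it without a desktop is tcpdump on the FOG server, with the resulting file opened in wireshark afterwards. This is only a sketch under assumptions: tcpdump is installed, it is run as root, and the interface name and output path are placeholders. The helper names are made up for illustration. The filter covers the ports involved in a PXE boot: DHCP (67, 68), TFTP (69) and ProxyDHCP (4011).

```shell
#!/bin/sh
# pxe_capture_filter: print a capture filter for the PXE-related ports.
pxe_capture_filter() {
    printf '%s\n' 'port 67 or port 68 or port 69 or port 4011'
}

# capture_pxe IFACE OUTFILE
# Hypothetical wrapper around tcpdump; stop it with Ctrl-C once the
# client has attempted to pxe boot. Needs root.
capture_pxe() {
    # The filter is intentionally unquoted so it is passed as separate words.
    tcpdump -i "$1" -w "$2" $(pxe_capture_filter)
}

# Example (interface name and file name are assumptions):
#   capture_pxe eth0 /tmp/pxe-boot.pcap
```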
These are very contemporary mobos so they may be doing something we don’t expect in firmware simply because we don’t see them in a typical enterprise environment.