refind not working properly
-
@Huecuva So what exactly is refind not doing? Are you seeing the refind menu?
-
@Huecuva said in refind not working properly:
So refind.efi is just refind_x64.efi renamed?
Yes. Several versions back it was discovered that refind.efi caused an issue when the target computer was a 32-bit UEFI computer, and the ia32 version of the EFI binary would not boot on ARM processors. So they were split out as delivered from the refind project.
-
@george1421 I’m not really sure what you mean. I don’t know what the refind menu is.
What’s happening is that when I try to boot these particular mining rigs with MSI motherboards (which work fine at my primary location with the same hardware configuration but the FOG server is running older software) from PXE, it detects media and properly boots into the FOG menu where it gives the option to boot from the local drive by default or to register, delete or image the host. When I select the first option or let the timer run down, it then begins to load Windows, as I can see the circling Windows loading indicator, but then the screen goes black and it eventually reboots and does the whole thing over again. And then keeps doing that.
When I remove the network card from the list of available boot options (or for that matter even set it to second priority after the local drive) it will then boot into Windows just fine. The worst that happens is it sometimes requires a few minutes to auto-repair.
When I tried changing the UEFI exit type for one of the misbehaving hosts to either HDD or WINDOWS, all that happened was that the timer in the FOG menu would just keep resetting and it would never boot past that until I either changed BIOS to not boot from the NIC or deleted and re-registered the host in FOG.
So far replacing the refind files with older ones has not worked, but I will try swapping the refind.efi for the refind_x64.efi and see if that helps. I have to admit though, that I have my doubts, since the refind.efi file from the primary location did not solve the problem.
-
@Huecuva said in refind not working properly:
What’s happening is that when I try to boot these particular mining rigs with MSI motherboards (which work fine at my primary location with the same hardware configuration but the FOG server is running older software) from PXE, it detects media and properly boots into the FOG menu where it gives the option to boot from the local drive by default or to register, delete or image the host. When I select the first option or let the timer run down, it then begins to load Windows, as I can see the circling Windows loading indicator, but then the screen goes black and it eventually reboots and does the whole thing over again. And then keeps doing that.
OK, given what I learned in another thread where refind was assumed to be at fault, let’s make sure we understand what your configuration is. I’m not going to assume anything here.
So these msi computers at the remote site, are they in bios or uefi mode?
What device is your dhcp server? (mfg and model)
What specifically do you have listed for dhcp options 66 and 67 at this remote site?
-
@george1421 They are all in UEFI mode and are basically identical in every respect from hardware to the OS image that it’s running to the rigs at the primary location. Completely interchangeable.
I also have rigs with Biostar motherboards that otherwise have exactly the same hardware as the MSI machines. These rigs are not experiencing these issues. I’m not sure if that renders your DHCP questions moot, but I can get you that information tomorrow if it’s still necessary.
-
@Huecuva said in refind not working properly:
They are all in UEFI mode and are basically identical in every respect from hardware to the OS image
When I tried changing the UEFI exit type for one of the misbehaving hosts to either HDD or WINDOWS, all that happened was that the timer in the FOG menu would just keep resetting and it would never boot past that until I…
If refind cannot locate the boot partition, it will display a refind menu, not reload the iPXE menu. To me this is an indication of a bios computer being told to boot a uefi loader, or a uefi computer being told to boot a bios loader. It fails to start, so it just falls back to the FOG iPXE menu.
I want you to try this hack that Sebastian came up with to help debug.
On the Master FOG server, there is a directory called /tftpboot. In it there is a text file called default.ipxe. Let’s rename that file to default.ipxe.sav and then create a new default.ipxe file. In that file enter this text and save it.
#!ipxe
show platform
Now pxe boot the computer. The iPXE menu will not be displayed, but text will be displayed. Tell me what that text says.
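The rename-and-replace step above can be sketched as a pair of shell helpers. This is only a sketch: write_debug_ipxe and restore_ipxe are hypothetical names made up for illustration, and the directory argument is assumed to be /tftpboot as described in this thread.

```shell
#!/bin/sh
# Sketch only: swap default.ipxe for a two-line debug script.
# write_debug_ipxe is a hypothetical helper; pass it the TFTP directory
# (assumed to be /tftpboot on the FOG server discussed here).
write_debug_ipxe() {
    dir="$1"
    # Keep the original so it can be restored afterwards.
    mv "$dir/default.ipxe" "$dir/default.ipxe.sav" || return 1
    # The debug script: the shebang line, then the single iPXE command.
    printf '%s\n' '#!ipxe' 'show platform' > "$dir/default.ipxe"
}

# restore_ipxe (also hypothetical) puts the saved file back after testing.
restore_ipxe() {
    dir="$1"
    mv "$dir/default.ipxe.sav" "$dir/default.ipxe"
}
```

Run as root, this would be write_debug_ipxe /tftpboot before the PXE boot and restore_ipxe /tftpboot once the text on screen has been noted. Until the file is restored, no client will reach the normal FOG menu.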
Also for the remote sites it is important to know the values of the dhcp options. They “should” point to the local storage node at that location. The storage node then should redirect the client to load boot.php from the master node.
Edit: I just looked over the thread and it doesn’t look like you are using storage nodes in your environment, so you must have a full fog server at each remote location. So if that is the case then you will need to edit the default.ipxe file at the remote fog server.
Also, let’s understand what version of FOG is at the main location and what version of FOG is at the remote locations.
-
@george1421 There are only two locations. The primary location and the secondary location. The secondary location is running FOG 1.5.9 on Ubuntu Server 20.04, however several of the refind files have been replaced with those from the primary location. The primary location is running Ubuntu Server 18.04 but I don’t know precisely what version of FOG is running. Is there a way to tell inside the FOG dashboard or is there some other way to tell? I can get you the FOG version at the primary location tomorrow.
The FOG server at the Primary location has FOG running on a 250GB SSD I think, with the default /images on the same drive for images. There are only three images stored and there is plenty of room. The secondary location has a 1TB HDD mounted to /images which doesn’t even have 200GB of data on it including images for both the MSI and Biostar machines and more than half of that is because the MSI image isn’t resized. The images are only about 30GB when the partitions are resized.
I will try your hack tomorrow.
-
@Huecuva said in refind not working properly:
Is there a way to tell inside the FOG dashboard
yes at the bottom of the web gui it should tell you what version.
Do you just have 2 independent FOG servers, or is one a storage node?
-
@george1421 They are two completely independent FOG servers.
-
@Huecuva said in refind not working properly:
there was no
/var/www/fog/html/service/ipxe
I guess there is a typo in this. It should really be
/var/www/html/fog/service/ipxe
…Please run the following commands on your FOG server and post output here:
ls -al /var/www/
ls -al /var/www/fog/
ls -al /var/www/html/
-
@Sebastian-Roth Oh My goodness, how embarrassing. Yes it should have been
/var/www/html/fog/service/ipxe
@Huecuva The show platform thing should give you a better idea for sure on what the problem is.
And we do still want to know what the dhcp options are set to. That tells us how you’re getting your computers to boot to the fog server. Option 66 points the client to your fog server as a tftp server, and option 67 tells us which pxe bootfile you are using. Sometimes a different pxe bootfile can make a difference in the boot behavior, which is why a few options are provided. Most of the time the default ipxe.efi option does the trick for uefi computers.
Another thing you could try is to create a bootable usb with refind. I suggest using rufus (https://rufus.ie/) to get the file on the usb, but there are many ways. Here’s a link to how to get the different refind versions: http://www.rodsbooks.com/refind/getting.html. Usually if you can boot to a version of refind from usb, then it will work the same when booting from the network. I say usually as I have seen it work on a usb boot and then not via network, but if that happens it still helps to narrow down where the problem is.
I would suggest trying the latest version first (which I assume is what is included with fog 1.5.9) and seeing if it boots. If it doesn’t, then go back to 0.11.0 and see if that helps. If none of them work, then perhaps contacting the refind developer with your hardware info would be wise to let him know it’s not working.
As another workaround option (hopefully we find a full solution though) you could see if your uefi firmwares support a wake on lan boot option, i.e. you set them to boot to network if they get a wake on lan packet, but the boot order for normal startups stays as the hard drive. Then when you image a computer you shut it down, set the wake on lan checkbox when deploying the image from fog, and let the wake on lan do the trick. On some computers this works, and some give you a popup asking if you want ipv4 or ipv6 pxe. If you get that popup, then you’d need an option to disable ipv6 so it just goes from WOL to ipv4 pxe boot. It’s for sure easier to just have network boot as the first option, but this is a workaround I employed before finding my refind solution.
-
@Huecuva Let’s keep it simple for the moment. Let’s make sure we fully understand how this second fog server is set up (since it is acting differently than the main site). Knowing they are 2 independent servers eliminates many of the potential issues because now we know the “problem” is localized to this new FOG server and its environment. Also, what iPXE thinks about the target computer is important. I don’t want to chase something for several hours and have it be the CSM issue again. So knowing exactly what is configured for dhcp options 66 and 67 is important, as well as what device is the dhcp server. I may ask you to capture some network packets so we can see exactly what the target computer is telling the dhcp server. If you know how to use wireshark we can get this answer in about 5 minutes. I don’t want to go this route until we fully understand the environment.
These are very contemporary mobos so they may be doing something we don’t expect in firmware simply because we don’t see them in a typical enterprise environment.
-
@Sebastian-Roth
$ ls -al /var/www/
total 20
drwxr-xr-x 4 root root 4096 Oct 14 19:58 .
drwxr-xr-x 14 root root 4096 Oct 14 19:53 ..
drwxr-xr-x 11 www-data www-data 4096 Oct 22 19:49 fog
drwxr-xr-x 4 root root 4096 Oct 19 18:17 html
-rw-r--r-- 1 www-data www-data 52 Oct 14 19:58 index.php
$ ls -al /var/www/fog/
total 412
drwxr-xr-x 11 www-data www-data 4096 Oct 22 19:49 .
drwxr-xr-x 4 root root 4096 Oct 14 19:58 ..
drwxr-xr-x 2 www-data www-data 4096 Oct 14 19:58 api
drwxr-xr-x 2 www-data www-data 4096 Oct 14 19:58 client
drwxr-xr-x 2 www-data www-data 4096 Oct 14 19:58 commons
-rw-r--r-- 1 www-data www-data 370070 Oct 14 19:58 favicon.ico
lrwxrwxrwx 1 www-data www-data 13 Oct 14 19:58 fog -> /var/www/fog/
drwxr-xr-x 2 www-data www-data 4096 Oct 14 19:58 fogdoc
drwxr-xr-x 3 root root 4096 Oct 22 19:50 html
-rw-r--r-- 1 www-data www-data 572 Oct 14 19:58 index.php
drwxr-xr-x 13 www-data www-data 4096 Oct 14 19:58 lib
drwxr-xr-x 10 www-data www-data 4096 Oct 14 19:58 management
drwxr-xr-x 3 www-data www-data 4096 Oct 14 19:58 service
drwxr-xr-x 2 www-data www-data 4096 Oct 14 19:58 status
$ ls -al /var/www/html/
total 28
drwxr-xr-x 4 root root 4096 Oct 19 18:17 .
drwxr-xr-x 4 root root 4096 Oct 14 19:58 ..
drwxr-xr-x 7 root root 4096 Oct 19 18:17 admin
lrwxrwxrwx 1 root root 13 Oct 14 19:58 fog -> /var/www/fog/
-rw-r--r-- 1 root root 10918 Oct 14 19:53 index.html
drwxr-xr-x 2 root root 4096 Oct 19 18:17 pihole
It appears that I have both /var/www/fog/service/ipxe and /var/www/html/fog/service/ipxe directories and their contents appear to be identical. Is there a symlink or something? I didn’t even know there was a /var/www/html/fog/service/ipxe directory, but the date of the files I changed matches that in /var/www/fog/service/ipxe.
-
@Huecuva
Yes, /var/www/fog is a symlink to /var/www/html/fog. This is created by the fog installer for backwards compatibility. The default path for httpd/apache sites in all linux distros used to be /var/www but it changed to /var/www/html a few years ago. The symlink is maintained so that any code (internal to fog or customized by users) doesn’t get broken if it’s still pointing to a path that starts with /var/www/fog.
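For anyone wanting to confirm that the two ipxe directories really are the same place on disk, a small check along these lines should do it. This is a sketch assuming GNU readlink (as shipped with Ubuntu); same_dir is a hypothetical helper name, not part of FOG.

```shell
#!/bin/sh
# Sketch only: same_dir is a hypothetical helper that succeeds when both
# paths resolve to the same real location after following symlinks.
same_dir() {
    a=$(readlink -f "$1") || return 1
    b=$(readlink -f "$2") || return 1
    [ "$a" = "$b" ]
}
```

With the paths from this thread, that would be same_dir /var/www/fog/service/ipxe /var/www/html/fog/service/ipxe && echo "same directory". Running ls -ld on /var/www/fog and /var/www/html/fog shows which of the two is the link and which is the real directory.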
-
@george1421 I thought the FOG server at the primary location was running at least FOG 1.5.7 but it appears to be running 1.5.6.
I don’t know how to find what DHCP options 66 and 67 are. Honestly, I don’t even know what that means.
I’m at the primary location right now. Unfortunately, the weather took a serious turn for the worse (as in we went from zero snow to like 4 inches overnight and it’s still snowing) so I’m not sure if I will make it down to the secondary location today. If that is the case I won’t be able to try your hack or tell you what the DHCP server is. Though if I do make it down there, I will try to post what I can. Otherwise that will have to wait until next week.
I’m not particularly familiar with wireshark, I’m afraid.
-
@Huecuva Ok no worries, do you have remote access to the other location? If you do there are still some things you can test. Do you have a tech at the remote location, or at least someone who knows how to pxe boot one of these computers? That is all we need to collect the rest of the data.
-
@george1421 I can remote to the secondary FOG server via SSH through the RDP into the mining manager there, but unfortunately there is no one on-site there to do any PXE booting of the rigs. I am the only one administering this mine at either location.
A strange new development, however: out of the blue, for no discernible reason whatsoever, a couple of the MSI rigs at the primary location randomly started having this issue. I guess they decided to reboot for some reason and when they wouldn’t come back online I plugged a cart into one of them and it was boot looping like the ones down at the secondary location. On a whim, I reset BIOS to defaults and reconfigured it and it worked. The same for the other one. I guess that’s another thing I can try at the secondary location. If that fixes the problem…
EDIT: I think I’m going to head down to the secondary location shortly here. There’s nothing else I can do from here.
-
@george1421 Alright. I am at the secondary location now. I’m going to try resetting and reconfiguring BIOS on one of these rigs first and see how that goes.
-
@Huecuva Ok, if that doesn’t get it we can do a deep dive into the settings. From your end you will probably just need to install tcpdump on the fog server, run a command, then pxe boot the target computer. You can post the pcap to a file share site and post the link here. Let’s first see the outcome of the bios reset.
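The exact capture command isn’t given in this post, so the following is only a guess at what such a capture could look like. It assumes the usual PXE-related ports (67 and 68 for DHCP, 69 for TFTP, 4011 for ProxyDHCP); pxe_capture_cmd is a hypothetical helper that just prints the tcpdump command, and the interface name and output file are placeholders.

```shell
#!/bin/sh
# Sketch only: pxe_capture_cmd is a hypothetical helper that prints a
# tcpdump command line for capturing PXE boot traffic. It does not run it.
# $1 = network interface (placeholder, e.g. eth0)
# $2 = output file (defaults to pxe.pcap)
pxe_capture_cmd() {
    iface="$1"
    out="${2:-pxe.pcap}"
    # -i selects the interface, -w writes raw packets to a pcap file.
    printf 'tcpdump -i %s -w %s port 67 or port 68 or port 69 or port 4011\n' "$iface" "$out"
}
```

The printed command would be run with sudo on the fog server, left running while the target computer pxe boots, then stopped with Ctrl-C. The resulting pcap file can be opened in wireshark.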
-
@george1421 Unfortunately the BIOS reset did not behave. At first it appeared as if it was going to work. The machine booted into Windows after the FOG menu but when it was rebooted again, it once more started goofing off. Also, it seems these motherboards have an annoying habit of automatically changing their first boot priority back to the local Windows boot manager randomly.
I tried your hack just now. I tried adding that line to the beginning of the default.ipxe file and nothing changed, so I made a backup of that file and included only those two lines. The result was:
tftp://192.168.9.1/default.ipxe... ok
builtin/platformstring = efi
Chainloading failed, hit 's' for the iPXE shell; reboot in 10 seconds
192.168.9.1 is the IP of the FOG server. I will now replace the default.ipxe file with the backup.
EDIT: To answer another of your questions, it appears that my DHCP server is just a CISCO 1900 series router. CISCO1941/K9 I think.