refind not working properly
Hey guys. I seem to be having a new problem with FOG. When I have my hosts configured to boot from NIC at first priority, they will boot to PXE and then when no input happens they will skip that and attempt to load Windows. This is where the problem starts. Instead of loading Windows, they will either just freeze solid on a black screen or go into a boot loop and keep booting PXE, attempting to load Windows and then rebooting. When I disable PXE booting, the machines boot into Windows without a problem.
I have posted about this on reddit here. One user there has suggested something about non-signed kernels having issues chainloading to a signed Windows kernel.
This problem does not happen at my primary mine location where the FOG server is running an older version of Ubuntu and an older version of FOG. That same user on reddit suggested the possibility of changing the kernel at the secondary location to the same one used at the primary location. Is this a feasible solution and if so, how do I go about doing that? If not, is there some other way to fix this issue? I need to be able to image these machines remotely and cannot be on-site to manually reconfigure BIOS back and forth between PXE and local disk boot whenever I need to image them.
@huecuva That is an interesting update for sure. I believe graphics cards have a bios of some sort and sometimes there’s some option rom boot option related to the gpu in the computer’s bios setting. It would make some sense for the graphics drivers to also update the bios or boot option roms on gpus.
What GPUs do you have? GPUs of the GTX 10xx series and newer have a ‘studio’ driver option that’s supposed to be the sort of ‘stable’ branch option. If your cards can use that, maybe using that as a standard could help if it’s not something you’re already doing?
@Huecuva Thanks for the interesting update! I have read it a couple of times and given it some thought, but I still have no idea how the nVidia driver could cause such an issue. There is a slight chance the driver changes how the card is initialized and thus causes a problem, but it’s really strange.
Great to hear you figured this out, though it’s sad the issue comes with the newer driver. There is 457.09 now. Maybe give that a try.
@Sebastian-Roth Here’s a strange new development that I happened upon today. I’m not 100% sure if this actually has anything to do with the issue, but at my primary location there is a particular rig with the troublesome MSI motherboard that was not having any issues booting past the iPXE menu…until I tried to update the nVidia graphics card driver. Once I updated the driver, it started having the same boot-looping issue the rigs at the secondary location were having. The rigs at the secondary location all have newer drivers than some of the ones at the primary, since I made sure the image I took down there was up to date. I wouldn’t think the graphics card driver would have anything to do with it, but there it is. It’s really…weird.
EDIT: Confirmed. This particular rig, at least, will not boot past the iPXE menu when the newest driver I have downloaded (456.71) is installed. Despite a BSoD and hard crash during installation, the driver seemed to have installed correctly: the machine was mining and the manager was reporting the correct driver. Even so, it would not boot past the iPXE menu and into Windows. When I uninstalled that driver and reinstalled an older one (452.06), it had no problem booting past iPXE and into Windows. To confirm further, I reinstalled both drivers a couple of times just to make sure. The result was the same. I dunno.
@Sebastian-Roth Updating BIOS did not solve the problem. I don’t know what’s going on but I do know that I, personally, will never buy an MSI product for this and several other reasons.
@Sebastian-Roth It does certainly seem like buggy BIOS/UEFI, doesn’t it? It seems that some random rigs at the primary location have also started having this problem now, with one difference: resetting and reconfiguring BIOS does not solve the problem at the secondary location, but it does seem to help at the primary.
This is so totally effed up I can feel my hair turning gray. I’m going to see if there is a newer BIOS for these boards and if there is maybe that will fix this? I don’t know. At this point I’m almost ready to just burn both buildings down.
@Huecuva Sounds like a BIOS/UEFI firmware behaving really badly. PXE booting is probably not used much on gaming motherboards, so people hardly ever report these kinds of things to the manufacturer.
On the other hand it’s very interesting you never seem to get this in your primary location. Some things are not adding up and I can understand that you are saying it’s not worth the time for the secondary location.
@JJ-Fullmer The site is definitely not worth the work of setting up a local copy of ipxe.efi, but I guess I can try other efi files from the /tftpboot directory.
What’s happening is not always consistent. There are certain things it does most of the time, but occasionally it seems to decide it wants to switch things up a bit.
When I first registered (not imaged) the hosts on the FOG server, they booted into Windows just fine. And if that’s all I do, they continue to boot into Windows just fine. However, as soon as I either image them or capture an image from one of them, they will no longer boot to Windows. What @george1421 was calling the ipxe menu (which I could have sworn was more closely FOG related) would time out and then the circling Windows loading indicator would appear briefly before the screen would go black, and then shortly the whole thing would reboot and do that all over again. Sometimes the screen would just go black and the monitor would either go to sleep or not (that just seemed to be completely random) and it would hang that way until I rebooted it.
Resetting BIOS to defaults and then reconfiguring allowed it to boot past the ipxe menu and into Windows, but only once. After rebooting the machine, it would begin to boot loop again.
@Huecuva Well poo
I think a solution exists, we just gotta find it.
So just want to do a quick review of where we’re at.
The problem is that after imaging it does boot into windows from the fog pxe menu and then it never works again?
Or is the problem that you image and then it doesn’t boot to windows at all? Sounds like you said it seems to work once, then the windows setup changes the boot order (which is something it does with `bcdedit`, and you can actually do this manually as well; try `bcdedit /enum all` to see all the boot options windows can set from command line. You can make it put the network boot first, it’s just a bit of a complicated task).
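For anyone who wants to try the manual route, here is a minimal Python sketch that only builds the bcdedit command lines and does not run them. It assumes a UEFI machine where the network boot entry is listed by `bcdedit /enum firmware`. The helper names and the all-zero GUID are made up for illustration; the real identifier has to be read from the enum output on the machine in question.

```python
# Sketch only: builds bcdedit command lines; it does not execute anything.
# Assumption: the machine boots via UEFI, so firmware entries are visible to bcdedit.

def enum_firmware_command():
    """Command that lists the firmware boot entries, including their identifiers."""
    return ["bcdedit", "/enum", "firmware"]


def netboot_first_command(entry_guid):
    """Command that moves one firmware entry to the front of the firmware boot order.

    entry_guid is the identifier of the network boot entry as printed by
    `bcdedit /enum firmware`, braces included.
    """
    if not (entry_guid.startswith("{") and entry_guid.endswith("}")):
        raise ValueError("expected an identifier wrapped in braces")
    return ["bcdedit", "/set", "{fwbootmgr}", "displayorder", entry_guid, "/addfirst"]


# Placeholder GUID for illustration only; read the real one from the enum output.
EXAMPLE_GUID = "{00000000-0000-0000-0000-000000000000}"
print(" ".join(netboot_first_command(EXAMPLE_GUID)))
```

On the Windows side the resulting command would need an elevated prompt. Whether the firmware on these MSI boards honors the changed order is a separate question this sketch cannot answer.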
Anyway, since refind did work from usb that means it should work. I have seen it not work from pxe when it didn’t work from usb.
Some things worth trying are as @george1421 mentioned different ipxe.efi files. If the one that is working at the other site didn’t do the trick one of the other included ones (realtek.efi, intel.efi, etc) might make a difference.
Another thing would be to take the refind.conf from the working usb and put that on the fog server. I’d adjust it to not use the gui (can’t remember the exact setting in the file, but I remember it being pretty clear) and see if it works via pxe.
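A rough sketch of that adjustment in Python, assuming the setting meant here is rEFInd's `textonly` flag (that is an assumption; the exact setting is not named above). The helper `enable_flag` is hypothetical and works on the text of a refind.conf, and the sample config content is invented for illustration.

```python
def enable_flag(conf_text, flag="textonly"):
    """Return conf_text with `flag` present as an active (uncommented) line."""
    lines = conf_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == flag:
            found = True  # flag is already active
        elif stripped.startswith("#") and stripped.lstrip("#").strip() == flag:
            if not found:
                lines[i] = flag  # uncomment the existing line
                found = True
    if not found:
        lines.append(flag)  # flag was not mentioned at all, so add it
    return "\n".join(lines) + "\n"


# Invented sample content, not the refind.conf that ships with FOG.
sample = "timeout 0\n#textonly\nscanfor internal\n"
print(enable_flag(sample))
```

Working on a copy of the USB stick's refind.conf and comparing it with the server's file before swapping it in keeps the change easy to undo.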
You could also consider using a bootloader such as grub2win and putting a local copy of the ipxe.efi file on each computer’s efi partition. Then when you need it to boot to fog you have a script that changes the grub.conf boot order. But that’s a super complicated approach. I actually do this because I used to have random issues with the windows bootloader and I also like having the menu at each startup to go to fog, uefi firmware settings, or windows. I can walk you through this but I make it part of my image, so you have to get it to boot to network pxe at least once.
@JJ-Fullmer Yeah there is nothing like that in the BIOS of these MSI gaming boards. I assume it might have something to do with the LAN Option ROM, but other than enabling or disabling that I can’t do anything with that either. I have no idea how to configure it and the mobo manual is no help at all.
I’ve asked the guys on the MSI_Gaming subreddit about it but to be honest I’m not expecting much.
I think this whole endeavor is pretty much a wash.
@Huecuva On HP computers I believe it’s called `Wake Up Boot Source`, or at least it was once. I’m not next to one I can test at the moment. It might not exist on all boards. I’ve seen it on most business-oriented machines.
@JJ-Fullmer So I have confirmed that I am able to shut down the machines and have them wake on lan when I schedule a task in FOG, but I can’t make them boot from network if they have received a magic packet WOL request. They just boot into Windows and the task is not executed. I’m not sure how to change that. I know it would be in BIOS but I can’t seem to find anything related to that in the BIOS.
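To test the wake-up half separately from FOG's scheduler, a magic packet can be sent by hand. This is a minimal Python sketch, not how FOG itself sends the packet: the MAC address is a placeholder, and UDP port 9 with a broadcast address is a common convention assumed here. It only helps confirm that the machine wakes; it does not change which device the firmware boots from afterwards.

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("MAC address must contain 12 hex digits")
    mac_bytes = bytes.fromhex(hex_digits)
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Send the magic packet as a UDP broadcast (address and port are assumptions)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


# Placeholder MAC address for illustration only.
packet = build_magic_packet("00:11:22:33:44:55")
print(len(packet))  # 6 + 16 * 6 = 102 bytes
```

If a rig wakes from this packet but still goes straight to Windows, that points at the firmware's boot source after wake rather than at FOG's task scheduling.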
@george1421 Well, I don’t know. I’m also out of ideas. I’m pretty certain that any hardware moved from the secondary location to the primary location will not suffer this issue because, again, I’ve moved motherboards from the secondary to the primary before and they’ve worked. That’s not to say it’s 100% guaranteed to work but I’d say it’s 99.99% certain to work. The ipxe.efi file and all the refind files have not made any difference at all. As much as I would like to get this FOG server working properly at the secondary location, it’s not a priority compared to other stuff I need to do so I’m inclined to just call this a defeat. I generally shouldn’t need to image the secondary rigs that often anyway so, as inconvenient as it is, I think I will just have to physically go to the secondary site to do any imaging.
Thanks for all the help mate, but this site just keeps having strange issues that take too much time to solve. If you do come up with any other ideas, I’m happy to give them a try but I really don’t think moving any hardware is going to solve anything at the site. It’s just a lot of work that I don’t think will be worth doing, especially since the site is slowly being decommissioned anyway.
@george1421 Yeah, I didn’t think it would make a difference at all. I just figured it might be worth noting. I’ve managed to get the ipxe.efi file transferred to the secondary location without leaving the primary. I’m about to reboot one of the problematic rigs now to see if it will boot properly. Cross your fingers.
EDIT: Well, upon a simple reboot it managed to make it into Windows. However, it was one that I had not previously imaged with FOG. Once I had deployed the image to the rig, it will no longer boot into Windows. This is bizarre.
@Huecuva FOG server itself cannot distinguish between bare metal and VM. So I’m almost 100% positive it’s not a factor here.
@george1421 A new discovery today: Apparently FOG at the primary location is running in a Proxmox VM. I was not aware of that. I don’t think it should make much of a difference, but there it is. The FOG server at the secondary location is running on bare metal.
@george1421 I can almost guarantee that the problem would not move with the hardware. I have moved rigs from the secondary location to the primary before, as we are slowly downsizing this secondary site and moving stuff to the primary as we sell off video cards. I’ve used motherboards from here to replace boards at the primary location without any problems. I do have an empty rack at the primary location where I can move a couple of the rigs from secondary though and make sure it will work. I will have to do that at some point next week. I can’t have a rig dismantled over the weekend.
@Huecuva Well, I’m fresh out of ideas. The only other test would be to take a failing system from the remote site back to the main site to see if the problem moves with the hardware. I understand that may not be practical for your situation. But that would tell us whether it’s FOG related or hardware related.
Safe travels back to your home site with the weather and everything.
@george1421 Unfortunately, I cannot access the primary FOG server from the secondary location and that particular step will have to wait until I’m back at the primary location next week. Then I will likely have to put it on a flash drive and manually drive it down here.
I just made a backup of the 1.5.9 ipxe.efi file though, so that’s already done.
If that’s all that can be done for now, I guess I might as well head home. There is nothing else I can do here.
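When the ipxe.efi from the primary site does get carried over, it is worth confirming that the copy on the secondary server is byte-for-byte identical to the original before drawing conclusions from a test boot. A minimal Python sketch, assuming both files are reachable as local paths (for example the flash drive copy and the one in /tftpboot); the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have identical contents according to SHA-256."""
    return sha256_of(path_a) == sha256_of(path_b)
```

The same check also shows whether the backed-up 1.5.9 ipxe.efi actually differs from the one in use at the primary site.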
@george1421 Okay, I just replaced the refind.conf file with the original.
The hardware at the main office is the same. MSI Z170A Gaming M7 motherboard with the same version of BIOS. The only differences between some of these rigs with the same motherboards (or even the ones with Biostar boards, for that matter) is that some are running Pentium G4400s and some are running Pentium G4650s. They’re all running a bunch of GTX 1070 or GTX 1070 Ti cards or a combination of the two.