refind not working properly

george1421

@Huecuva Ok no worries, do you have remote access to the other location? If you do there are still some things you can test. Do you have a tech at the remote location, or at least someone who knows how to pxe boot one of these computers? That is all we need to collect the rest of the data.

Huecuva

@george1421 I can remote to the secondary FOG server via SSH through the RDP into the mining manager there, but unfortunately there is no one on-site there to do any PXE booting of the rigs. I am the only one administering this mine at either location.

A strange new development, however: out of the blue, for no discernible reason whatsoever, a couple of the MSI rigs at the primary location randomly started having this issue. I guess they decided to reboot for some reason and when they wouldn’t come back online I plugged a cart into one of them and it was boot looping like the ones down at the secondary location. On a whim, I reset BIOS to defaults and reconfigured it and it worked. The same for the other one. I guess that’s another thing I can try at the secondary location. If that fixes the problem…

EDIT: I think I’m going to head down to the secondary location shortly here. There’s nothing else I can do from here.

Huecuva

@george1421 Alright. I am at the secondary location now. I’m going to try resetting and reconfiguring BIOS on one of these rigs first and see how that goes.

george1421

@Huecuva Ok, if that doesn’t get it we can do a deep dive into the settings. On your end you will probably just need to install tcpdump on the FOG server, run a command, then PXE boot the target computer. You can upload the pcap to a file share site and post the link here. Let’s first see the outcome of the BIOS reset.
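
For reference, the capture step could look roughly like the sketch below. It is not the exact command from this thread. It assumes the usual PXE-related ports (67/68 for DHCP, 69 for TFTP, 4011 for ProxyDHCP), and the function name, the interface name eth0, and the output path are placeholders to replace with your own values.

```shell
#!/bin/sh
# Sketch only: capture the traffic involved in a PXE boot on the FOG server.
# Ports: 67/68 = DHCP, 69 = TFTP, 4011 = ProxyDHCP.
PXE_FILTER="port 67 or port 68 or port 69 or port 4011"

# capture_pxe IFACE [OUTFILE]
# IFACE and OUTFILE are placeholders (assumptions); use the FOG server's
# real network interface and any writable path.
capture_pxe() {
    iface="$1"
    outfile="${2:-output.pcap}"
    # -i selects the interface, -w writes the raw packets to a pcap file.
    sudo tcpdump -i "$iface" -w "$outfile" "$PXE_FILTER"
}

# Example (nothing is captured until you call the function yourself):
#   capture_pxe eth0 /tmp/pxe-boot.pcap
# Start it, PXE boot the target computer, then press Ctrl-C once the
# client reaches the failure point.
```

The filter keeps the pcap small, so it is easy to upload and only shows the DHCP and TFTP exchange between the client and the server.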

Huecuva

@george1421 Unfortunately the BIOS reset did not behave. At first it appeared as if it was going to work. The machine booted into Windows after the FOG menu but when it was rebooted again, it once more started goofing off. Also, it seems these motherboards have an annoying habit of automatically changing their first boot priority back to the local Windows boot manager randomly.

I tried your hack just now. I tried adding that line to the beginning of the default.ipxe file and nothing changed, so I made a backup of that file and included only those two lines. The result was:

tftp://192.168.9.1/default.ipxe... ok
builtin/platform:string = efi
Chainloading failed, hit 's' for the iPXE shell; reboot in 10 seconds

192.168.9.1 is the IP of the FOG server. I will now replace the default.ipxe file with the backup.

EDIT: To answer another of your questions, it appears that my DHCP server is just a Cisco 1900 series router. CISCO1941/K9 I think.

Huecuva

@JJ-Fullmer Creating a bootable USB with rEFInd 0.12.0 (the latest listed here) and trying to boot from that resulted in immediate rejection and attempting to load Windows and another black screen.

Creating a bootable USB with rEFInd 0.11.0 and booting from that immediately brought up the rEFInd menu without any issues.

george1421

@Huecuva said in refind not working properly:

builtin/platform:string = efi

Great now we know a bunch of things that are right with your setup.

PXE booting is working as it should
Your computer is in UEFI mode
Your DHCP server is set up correctly
ipxe.efi is being sent to the client
The exit mode we should be working with is EFI Exit

So now on the remote FOG server did you swap out the refind.efi and the other three with the downloaded version 0.11.0? Don’t forget about resetting the default.ipxe file back to the original version.

I see from the usb booting side you get the refind menu. If your fog server is configured with the global efi exit mode of REFIND then when the iPXE menu times out you should at least get the refind menu. Is this not happening?

Huecuva

@george1421 I had already copied the 0.11.0 refind files to the FOG server but I did it again just for sanity’s sake. When that again failed to make a difference I decided to try resetting BIOS again. Again, the first time it worked, but when I rebooted it again I noticed it booted straight into Windows so I went into BIOS and changed the NIC back to first priority because it had reversed the priorities by itself. Then it counted down in the FOG menu, began to load Windows and then the screen went black and the monitor went to sleep. It’s still sitting there like that.

How do I set the global efi exit mode? Is that under FOG Configuration -> iPXE General Configuration -> Boot Exit Settings -> Exit To Hard Drive Type (EFI)? If that is where it is, it is already set to REFIND_EFI. If that’s what it’s supposed to be then no, I am not seeing the REFIND menu.

george1421

@Huecuva Yes, that is the right location for it. Is the refind.conf file the same one that was set up by FOG, or did you copy over the one from the zip file? The right answer should be the one delivered by FOG.

first priority because it had reversed the priorities by itself.

The windows installer will do this for you, even if you don’t want it to.

Huecuva

@george1421 I did not copy the one from the zip file as that one says it’s a sample. I had, however, previously copied the refind.conf file from the FOG server at the primary location and brought it down to the secondary location so that’s the refind.conf file it is using. I still have a .old version of the secondary FOG server’s original refind.conf file, if you think I should put it back.

EDIT: Ugh. Why does Windows have to suck so bad?

george1421

@Huecuva said in refind not working properly:

I should put it back

I would put it back only for the sake of us understanding what the configuration is. I don’t think it’s going to help in this case, but we know the one that is shipped with 1.5.9 works.

So you have this same target hardware at the main office? Same BIOS version and such, or is this hardware only at the remote site? I’m trying to understand why USB booting into rEFInd 0.11.0 works and transferring it via iPXE is failing for us.

george1421

@Huecuva Ok, the other variable here between FOG 1.5.6 and 1.5.9 is the version of iPXE that is being used. (Again, I’m grasping at straws to explain why the main site acts one way and the remote site acts differently, assuming the target hardware is exactly the same.) If you have access, copy ipxe.efi from the 1.5.6 site to the remote site; it’s in the /tftpboot directory. Make sure you save the 1.5.9 version of ipxe.efi just in case. With that file in place the two servers should be operationally equivalent, at least in regards to PXE booting and exiting to disk.
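
A minimal sketch of the "save the old one first" step, plus a way to confirm that both servers ended up with the same file. The helper names backup_file and file_hash are made up for illustration, and the backup suffix is arbitrary; only the /tftpboot/ipxe.efi path comes from the post above.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for swapping ipxe.efi safely.

# backup_file FILE SUFFIX
# Copy FILE to FILE.SUFFIX and refuse to overwrite an existing backup.
backup_file() {
    src="$1"
    dest="$1.$2"
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi
    # -p preserves mode and timestamps on the copy.
    cp -p "$src" "$dest"
}

# file_hash FILE
# Print only the SHA-256 digest of FILE.
file_hash() {
    if [ ! -f "$1" ]; then
        return 1
    fi
    sha256sum "$1" | cut -d' ' -f1
}

# Example (not run here):
#   backup_file /tftpboot/ipxe.efi 1.5.9
#   file_hash /tftpboot/ipxe.efi   # run on both servers, compare by eye
```

If the digests printed on the two servers match, the iPXE binary can be ruled out as a difference between the sites.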

Huecuva

@george1421 Okay, I just replaced the refind.conf file with the original.

The hardware at the main office is the same. MSI Z170A Gaming M7 motherboard with the same version of BIOS. The only differences between some of these rigs with the same motherboards (or even the ones with Biostar boards, for that matter) is that some are running Pentium G4400s and some are running Pentium G4650s. They’re all running a bunch of GTX 1070 or GTX 1070 Ti cards or a combination of the two.

Huecuva

@george1421 Unfortunately, I cannot access the primary FOG server from the secondary location and that particular step will have to wait until I’m back at the primary location next week. Then I will likely have to put it on a flash drive and manually drive it down here.

I just made a backup of the 1.5.9 ipxe.efi file though, so that’s already done.

If that’s all that can be done for now, I guess I might as well head home. There is nothing else I can do here.

george1421

@Huecuva Well I’m fresh out of ideas. The only other test would be to take a failing system from the remote site back to the main site to see if the problem moves with the hardware. I understand that may not be practical for your situation. But that would tell us if it’s FOG related or hardware related.

Safe travels back to your home site with the weather and everything.

Huecuva

@george1421 I can almost guarantee that the problem would not move with the hardware. I have moved rigs from the secondary location to the primary before, as we are slowly downsizing this secondary site and moving stuff to the primary as we sell off video cards. I’ve used motherboards from here to replace boards at the primary location without any problems. I do have an empty rack at the primary location where I can move a couple of the rigs from secondary though and make sure it will work. I will have to do that at some point next week. I can’t have a rig dismantled over the weekend.

Huecuva

@george1421 A new discovery today: Apparently FOG at the primary location is running in a Proxmox VM. I was not aware of that. I don’t think it should make much of a difference, but there it is. The FOG server at the secondary location is running on bare metal.

george1421

@Huecuva The FOG server itself cannot distinguish between bare metal and a VM, so I’m almost 100% positive it’s not a factor here.

Huecuva

@george1421 Yeah, I didn’t think it would make a difference at all. I just figured it might be worth noting. I’ve managed to get the ipxe.efi file transferred to the secondary location without leaving the primary. I’m about to reboot one of the problematic rigs now to see if it will boot properly. Cross your fingers.

EDIT: Well, upon a simple reboot it managed to make it into Windows. However, it was one that I had not previously imaged with FOG. Once I deployed the image to the rig, it would no longer boot into Windows. This is bizarre.

Huecuva

@george1421 Well, I don’t know. I’m also out of ideas. I’m pretty certain that any hardware moved from the secondary location to the primary location will not suffer this issue because, again, I’ve moved motherboards from the secondary to the primary before and they’ve worked. That’s not to say it’s 100% guaranteed to work but I’d say it’s 99.99% certain to work. The ipxe.efi file and all the refind files have not made any difference at all. As much as I would like to get this FOG server working properly at the secondary location, it’s not a priority compared to other stuff I need to do so I’m inclined to just call this a defeat. I generally shouldn’t need to image the secondary rigs that often anyway so, as inconvenient as it is, I think I will just have to physically go to the secondary site to do any imaging.

Thanks for all the help mate, but this site just keeps having strange issues that take too much time to solve. If you do come up with any other ideas, I’m happy to give them a try but I really don’t think moving any hardware is going to solve anything at the site. It’s just a lot of work that I don’t think will be worth doing, especially since the site is slowly being decommissioned anyway.
