Hyper-V 2016 Gen2 VM (UEFI) fails to complete network boot
-
@sudburr I’ve found what I think is the problem. I’m not sure why it would be a problem, but I need confirmation so I can provide the background for a proper bug report.
I’ve built a set of ipxe binaries that I hope will work. Please try:
git fetch --all
git checkout hyperv-ipxetest1
git pull
cd bin
./installfog.sh -y
Does this work? Essentially it’s the most current pull with the minor reversion from:
https://github.com/ipxe/ipxe/commit/a0f6e75532c68f49b3e1c73ca88151d9663f5269
All it does is revert this, as I’m fairly sure the message is not the part breaking it:
- return -EBUSY;
-
Of note: this shows all the changes. Notice the three elements dealing with hyper v?
I haven’t tried the second in the change set, as I’m really hoping not to have to mess with it. This, anyway, is not a means to say I believe it is or isn’t the problem, but just trying to figure out more specifically where the error is. In particular:
@mcb30 [hyperv] Do not fail if guest OS ID MSR is already set … a0f6e75
@mcb30 [hyperv] Remove redundant return status code from mapping functions … 276d618
@mcb30 [hyperv] Cope with Windows Server 2016 enlightenments … b91cc98
These above items appear to be the only things related to hyper v.
I’m thinking the problem occurs, mainly, from the first 2 issues. I’m almost certain it’s got to be 276d618 as the call in a0f6e75 appears to be looking for an int to test against. If the ints are removed, would it still fail in the same fashion? (I don’t really know, just going on what I’m seeing).
-
So take 2, I’ve applied what I “suspect” is the problem. Apparently there were other changes, and the notable portion was how it’s handling the vmbus_probe call. This caused a problem when I originally posted (I wasn’t noticing the error messages, sorry).
In the past, vmbus_probe only called hv_unmap_synic ( hv ) if the check failed.
If hv_map_hypercall or hv_map_synic failed, it would call hv_unmap_hypercall ( hv ) and hv_free_message ( hv ) (respectively).
In the new code, if vmbus_probe fails, the err_vmbus_probe path calls all three. The reason for the change, as far as I can gather, is that hv_map_hypercall and hv_map_synic never returned a failure status, while vmbus_probe could potentially fail.
My changes are just to comment the hv_unmap_hypercall and hv_free_message as they wouldn’t have been called in the past. These are still a guess.
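To make the difference concrete, here is a minimal C sketch of the two unwind paths described above. It is not iPXE source: the demo_* functions are hypothetical stand-ins that only record the order in which they are called, and the full_unwind flag is invented purely for illustration. It assumes the newer code unwinds all three steps when the probe fails, while the test patch leaves only the synic unmap in place.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the calls named in the thread.
 * Each one only appends its name to a log; none of this is iPXE code. */
static char call_log[256];

static void record ( const char *name ) {
	/* Keep the demo log within its buffer */
	assert ( strlen ( call_log ) + strlen ( name ) + 2 < sizeof ( call_log ) );
	strcat ( call_log, name );
	strcat ( call_log, ";" );
}

static void demo_map_hypercall ( void )   { record ( "map_hypercall" ); }
static void demo_map_synic ( void )       { record ( "map_synic" ); }
static void demo_unmap_synic ( void )     { record ( "unmap_synic" ); }
static void demo_unmap_hypercall ( void ) { record ( "unmap_hypercall" ); }
static void demo_free_message ( void )    { record ( "free_message" ); }

/* Simulated probe: returns a negative value when asked to fail */
static int demo_vmbus_probe ( int fail ) {
	record ( "vmbus_probe" );
	return ( fail ? -1 : 0 );
}

/* full_unwind = 1: all three cleanup calls run on probe failure
 *                  (the newer behaviour as described in the thread).
 * full_unwind = 0: only the synic unmap runs
 *                  (the effect of commenting out the other two calls). */
int demo_probe ( int vmbus_fails, int full_unwind ) {
	int rc;

	call_log[0] = '\0';

	/* The mapping steps return nothing, so they cannot be tested for failure */
	demo_map_hypercall();
	demo_map_synic();

	if ( ( rc = demo_vmbus_probe ( vmbus_fails ) ) != 0 )
		goto err_vmbus_probe;

	return 0;

 err_vmbus_probe:
	demo_unmap_synic();
	if ( full_unwind ) {
		demo_unmap_hypercall();
		demo_free_message();
	}
	return rc;
}
```

Calling demo_probe ( 1, 1 ) and then demo_probe ( 1, 0 ) and comparing call_log after each shows which cleanup calls the test patch skips. Whether skipping them is actually what fixes the boot loop is still the guess stated above.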
-
@Tom-Elliott Yes, from quickly looking over the code changes I would guess that b91cc98 is probably the main cause. It seems like a big change. Let’s see if you get a positive test from @sudburr…
-
Strangely:
git checkout hyperv-ipxetest1
error: Your local changes to the following files would be overwritten by checkout:
packages/tftp/ipxe.efi
Please, commit your changes or stash them before you can switch branches.
Aborting
So I ran:
git checkout -- .
Which returned nothing but then I:
git checkout hyperv-ipxetest1
Branch hyperv-ipxetest1 set up to track remote branch hyperv-ipxetest1 from origin.
Switched to a new branch 'hyperv-ipxetest1'
and
git pull
Already up-to-date.
So I proceeded with installation and it reported:
Version: 9220986 Installer/Updater
iPXE.efi is now 356f … and it fails into a restart loop the same way.
-
@sudburr Mind retrying installer? I’m installing before the “big” change occurred.
Also, to fix the issue you saw with the original checkout, please try:
git reset --hard
git checkout hyperv-ipxetest1
git pull
-
ipxe.efi 276d6 also fails the same way.
-
While working this out through chat, I’ve pushed patched and working ipxe binaries that address this particular problem with booting. I should note, however, that this is still not a “FOG” specific problem; rather, something went wonky in the ipxe binaries. I’ve made a posting on their forums and hope to hear back soon.
-
Many thanks for the live help Tom!
-
Any love from the folks at ipxe.org yet?
-
@sudburr Nope, but 1.4.2 should have the patched binaries anyway.
-
Based on the timeframe in which this bug was submitted (May 2017), I’m guessing this is only a problem on Hyper-V build 1703 (as available in Win 10 Ent x64 1703, Hyper-V Server 2016 build 1703, and Win Server 2016 w/ Hyper-V build 1703). Most likely the issues are being caused by the ARP problem found only in Gen 2 VMs in the PXE stack of Hyper-V build 1703. See this Microsoft forum post for more details on how ARP was broken in this build. Still no resolution to my knowledge, but this post is most likely what will drive the fix from Microsoft’s side.
-
iPXE Developers finally got to look and hopefully have fixed this.
I’ve updated the ipxe binaries directly after pushing 1.5.0-RC-6. Both working and dev-branch have the updates though.
Please re-test and let us know if things are “still” working or if it breaks anything again.
Thank you.
I have been keeping up to date, but leaving my “reversion” code that seemed to fix the problem for users so hyperv gen 2 could still work. Hopefully it now works using iPXE’s own native source code.
See here:
https://git.ipxe.org/ipxe.git/commit/936657832f2262ad04bdf16b9229ce0b1d1c174f
-
Good to know. I won’t be able to test right away though. I’m busy doing summer stuff right now. I’ll get back to you as soon as possible.