Hyper-V 2016 Gen2 VM (UEFI) fails to complete network boot
- FOG Version: 1.4.0
- OS: CentOS 7.3.1611
- Service Version: n/a
- OS: n/a
I am unable to complete a network boot of a Hyper-V 2016 Generation 2 (UEFI) virtual machine. I have duplicated the problem with separate Hyper-V 2016 hosts and VMs.
I have tried kernels
The text of what happens.
PXE Network Boot using IPv4 ( ESC to cancel )
Performing DHCP Negotiation...
Station IP address is 10.12.40.120
Server IP address is 10.12.40.14
NBP filename is ipxe.efi
NBP filesize is 994048 Bytes
Downloading NBP file...
Successfully downloaded NBP file.
iPXE initialising devices...ok
iPXE 1.0.0+ (a19ac) -- Open Source Network Boot Firmware -- http://ipxe.org
Features: DNS FTP HTTP HTTPS ISCSI NFS TFTP SRP VLAN AoE EFI Menu
… then it immediately restarts the VM to begin again at:
PXE Network Boot using IPv4 ( ESC to cancel )
UEFI booting works on physical machines, and legacy booting continues to work on everything.
VM UEFI booting was working on May 5th with 1.4.0-RC4.
Separately, updating the kernels generates the following errors in /var/log/httpd/error_log:
PHP Warning: ftp_mkdir(): Create directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 492, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC42NA==&arch=64
PHP Warning: ftp_rename(): RNFR command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 769, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC42NA==&arch=64
PHP Warning: ftp_mkdir(): Create directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 492, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC4zMg==&arch=32
PHP Warning: ftp_rename(): RNFR command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 769, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC4zMg==&arch=32
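For anyone reading those warnings: the file= value in the referer is base64, so it can be decoded to see which kernel download triggered the FTP errors. A minimal sketch, assuming GNU base64 (as on CentOS); the function name decode_file_param is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: print the decoded "file" query parameter from a FOG kernel-update URL.
# decode_file_param is a hypothetical helper name, not part of FOG.
decode_file_param() {
    local url="$1" encoded
    # Keep only the text between "file=" and the next "&".
    encoded=$(printf '%s\n' "$url" | sed -n 's/.*[?&]file=\([^&]*\).*/\1/p')
    # The value is plain base64 (GNU coreutils: base64 -d).
    printf '%s' "$encoded" | base64 -d
    echo
}

decode_file_param 'http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC42NA==&arch=64'
```

For the first warning this gives the 64-bit TomElliott 4.11.0 kernel URL on fogproject.org; the arch=32 entries decode to the matching 32-bit kernel.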
Good to know. I won’t be able to test right away though. I’m busy doing summer stuff right now. I’ll get back to you as soon as possible.
The iPXE developers finally got a chance to look at this and have hopefully fixed it.
I updated the iPXE binaries directly after pushing 1.5.0-RC-6. Both working and dev-branch have the updates, though.
Please re-test and let us know if things are “still” working or if it breaks anything again.
I have been keeping up to date, but I left in my “reversion” code that seemed to fix the problem, so Hyper-V Gen 2 could still work for users. Hopefully it now works using iPXE’s own native source code.
Based on when this bug was submitted (May 2017), I’m guessing this is only a problem on Hyper-V build 1703 (as shipped in Windows 10 Enterprise x64 1703, Hyper-V Server 2016 build 1703, and Windows Server 2016 with Hyper-V build 1703). The issue is most likely caused by the ARP problem found only in Gen 2 VMs in the PXE stack of Hyper-V build 1703. See this Microsoft forum post for more details on how ARP was broken in this build. There is still no resolution to my knowledge, but this post is most likely what will drive the fix from M$'s perspective.
@sudburr Nope, but 1.4.2 should have the patched binaries anyway.
Any love from the folks at ipxe.org yet?
Many thanks for the live help Tom!
While working this out through chat, I’ve pushed patched, working iPXE binaries that address this particular boot problem. I should note, however, that this is still not a “FOG”-specific problem; rather, something went wonky in the iPXE binaries. I’ve posted on their forums and hope to hear back soon.
ipxe.efi 276d6 also fails the same way.
@sudburr Mind retrying installer? I’m installing before the “big” change occurred.
Also, to fix the issue you saw with the original checkout, please try:
git reset --hard
git checkout hyperv-ipxetest1
git pull

git checkout hyperv-ipxetest1
error: Your local changes to the following files would be overwritten by checkout:
packages/tftp/ipxe.efi
Please, commit your changes or stash them before you can switch branches.
Aborting
So I ran:
git checkout -- .
Which returned nothing, but then I ran:
git checkout hyperv-ipxetest1
Branch hyperv-ipxetest1 set up to track remote branch hyperv-ipxetest1 from origin.
Switched to a new branch 'hyperv-ipxetest1'
git pull
Already up-to-date.
So I proceeded with installation and it reported:
Version: 9220986 Installer/Updater
iPXE.efi is now 356f … and it fails into a restart loop the same way.
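For others hitting the same “local changes would be overwritten” error when switching to a test branch, the steps can be wrapped up roughly like this. It is only a sketch: the function name is made up, and it throws away any uncommitted edits to tracked files (such as a hand-copied packages/tftp/ipxe.efi), so only use it if that is what you want.

```shell
#!/usr/bin/env bash
# Sketch: switch branches in the FOG checkout, discarding uncommitted changes
# to tracked files first. switch_discarding_local_changes is a hypothetical name.
switch_discarding_local_changes() {
    local branch="$1"
    # --porcelain prints one line per changed path; no output means a clean tree.
    if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
        echo "Discarding local changes to:" >&2
        git status --porcelain --untracked-files=no >&2
        # Same effect as the "git reset --hard" suggested in the thread.
        git reset -q --hard
    fi
    git checkout "$branch"
}

# Usage, from inside the checkout:
#   git fetch --all
#   switch_discarding_local_changes hyperv-ipxetest1
#   git pull
```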
So take 2, I’ve applied what I “suspect” is the problem. Apparently there were other changes, and the notable portion was how it’s handling the vmbus_probe call. This caused a problem when I originally posted (I hadn’t noticed the error messages, sorry).
In the past, vmbus_probe only called hv_unmap_synic ( hv ) if the check failed.
If the hv_map_hypercall or hv_map_synic failed it would call hv_unmap_hypercall ( hv ) and hv_free_message ( hv ) (respectively).
In the new code, if vmbus_probe fails (the err_vmbus_probe path) it calls all three. The reason for the change, as far as I can gather, is that hv_map_hypercall and hv_map_synic never returned a failure, while vmbus_probe could potentially fail.
My change is just to comment out the hv_unmap_hypercall and hv_free_message calls, as they wouldn’t have been called in the past. This is still a guess.
Of note: this shows all the changes. Notice the three elements dealing with hyper v?
I haven’t tried the second in the change set, as I’m really hoping not to have to mess with it. This, anyway, is not a means to say I believe it is or isn’t the problem, but just trying to figure out more specifically where the error is. In particular:
a0f6e75 [hyperv] Do not fail if guest OS ID MSR is already set (mcb30)
276d618 [hyperv] Remove redundant return status code from mapping functions (mcb30)
b91cc98 [hyperv] Cope with Windows Server 2016 enlightenments (mcb30)
The items above appear to be the only things related to Hyper-V.
I’m thinking the problem occurs, mainly, from the first 2 issues. I’m almost certain it’s got to be 276d618 as the call in a0f6e75 appears to be looking for an int to test against. If the ints are removed, would it still fail in the same fashion? (I don’t really know, just going on what I’m seeing).
@sudburr I’ve found what I think is the problem. I’m not sure why this would be a problem, but I need confirmation so I can provide the background for a proper bug report.
I’ve built a set of ipxe binaries that I hope will work. Please try:
git fetch --all
git checkout hyperv-ipxetest1
git pull
cd bin
./installfog.sh -y
Does this work? Essentially it’s the most current pull with the minor reversion from:
All it does is revert this as I’m fairly sure the message is not the part breaking it:
- return -EBUSY;
Dropping ipxe.efi 2d79 onto a 1.4.0 install works happily.
That’ll do the job. I was hoping there would be a way to just download only the ipxe.efi files without the need to re-install the entire beasty. I’m sure there’s a way, but I don’t use GIT professionally.
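One way to get just the binary without a re-install is to let git print the file as it existed at a given tag. A minimal sketch, assuming you run it inside the FOG git checkout; packages/tftp/ipxe.efi is the path shown in the checkout error in this thread, while the function name and the /tftpboot destination are assumptions to adjust for your setup:

```shell
#!/usr/bin/env bash
# Sketch: copy ipxe.efi out of a given tag or commit without switching branches
# or re-running the installer. extract_ipxe_efi is a hypothetical helper name.
extract_ipxe_efi() {
    local ref="$1"   # e.g. tags/1.4.0-RC-4
    local out="$2"   # where to write the extracted binary
    # "git show <ref>:<path>" writes the file's contents as stored at <ref>.
    git show "${ref}:packages/tftp/ipxe.efi" > "$out"
}

# Example (the TFTP root /tftpboot is an assumption; check your install):
#   extract_ipxe_efi tags/1.4.0-RC-4 /tmp/ipxe.efi
#   cp /tmp/ipxe.efi /tftpboot/ipxe.efi
```

The working tree is left untouched, so the currently installed FOG version stays as it is and only the boot file gets swapped.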
RC-4 with fd6d1 works
RC-5 with 84d4 works
RC-6 to RC-9.2 with 2d79 works
… and there we go, the problem starts with RC-9.3 and ipxe.efi 17887.
@sudburr What do you mean?
I build at seemingly random intervals. So if you know RC-4 of 1.4.0 was the last known good, just install RC-4 from git:
git checkout tags/1.4.0-RC-4
cd bin
./installfog.sh -y
Figure out what version is in the string after confirming it works.
Then re-install the dev-branch (or whatever version you’re installing) with:
git checkout dev-branch
git pull
cd bin
./installfog.sh -y
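The “version in the string” is the short hash in parentheses in the iPXE banner, e.g. a19ac in the boot output from the original report. If you are collecting banners from several installs, something like this pulls the hash out of each line; a rough sketch with a made-up function name, assuming the banner format shown in this thread:

```shell
#!/usr/bin/env bash
# Sketch: read banner text on stdin and print the build hash from lines like
#   iPXE 1.0.0+ (a19ac) -- Open Source Network Boot Firmware -- http://ipxe.org
# ipxe_build_from_banner is a hypothetical helper name.
ipxe_build_from_banner() {
    # Match "iPXE <version> (<hex>)" and keep only the hex part.
    sed -n 's/.*iPXE [0-9.+]* (\([0-9a-f]\{4,\}\)).*/\1/p'
}

echo 'iPXE 1.0.0+ (a19ac) -- Open Source Network Boot Firmware -- http://ipxe.org' \
    | ipxe_build_from_banner
```

Lines that do not contain a banner produce no output, so a whole console capture can be piped through it.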
Where can I find a repository of the ipxe.efi that have been used in FOG?
@george1421 It is indeed. But I doubt it will work. I thought there was some effort in current iPXE to work with Hyper-V, though.