Hyper-V 2016 Gen2 VM (UEFI) fails to complete network boot



  • Server
    • FOG Version: 1.4.0
    • OS: CentOS 7.3.1611
    Client
    • Service Version: n/a
    • OS: n/a
    Description

    I am unable to complete a network boot of a Hyper-V 2016 Generation 2 (UEFI) virtual machine. I have reproduced the problem on separate Hyper-V 2016 hosts and VMs.

    I have tried kernels
    4.11.0
    4.10.10
    4.10.9
    4.10.1
    4.9.11
    4.9.4
    4.9.1

    This is the console output during the boot attempt:

    PXE Network Boot using IPv4 ( ESC to cancel )
    Performing DHCP Negotiation...
     Station IP address is 10.12.40.120
     Server IP address is 10.12.40.14
     NBP filename is ipxe.efi
     NBP filesize is 994048 Bytes
     Downloading NBP file...
     Successfully downloaded NBP file.
    iPXE initialising devices...ok
    
    iPXE 1.0.0+ (a19ac) -- Open Source Network Boot Firmware -- http://ipxe.org
    Features: DNS FTP HTTP HTTPS ISCSI NFS TFTP SRP VLAN AoE EFI Menu
    

    … then it immediately restarts the VM to begin again at:

    PXE Network Boot using IPv4 ( ESC to cancel )
    

    UEFI booting works on physical machines, and legacy booting continues to work on everything.

    VM UEFI booting was working on May 5th with 1.4.0-RC4.

    Separately, updating the kernels produces the following warnings in /var/log/httpd/error_log:

    PHP Warning:  ftp_mkdir(): Create directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 492, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC42NA==&arch=64
    PHP Warning:  ftp_rename(): RNFR command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 769, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC42NA==&arch=64
    PHP Warning:  ftp_mkdir(): Create directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 492, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC4zMg==&arch=32
    PHP Warning:  ftp_rename(): RNFR command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 769, referer: http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC4zMg==&arch=32
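
    The file= value in those referer URLs is base64, so decoding it shows which kernel the update page was fetching when the FTP calls failed. This is only a small sketch for reading the log; the function name is made up for illustration, and it does not explain why ftp_mkdir() or the rename failed.

```shell
#!/bin/sh
# decode_file_param URL
# Prints the base64-decoded value of the "file" query parameter in URL.
# (Hypothetical helper name, not part of FOG.)
decode_file_param() {
    # Keep only what follows "file=" and drop any later "&key=value" pairs.
    encoded=${1#*file=}
    encoded=${encoded%%&*}
    printf '%s' "$encoded" | base64 -d
    echo
}

decode_file_param 'http://devfog/fog/management/index.php?node=about&sub=kernel&file=aHR0cHM6Ly9mb2dwcm9qZWN0Lm9yZy9rZXJuZWxzL0tlcm5lbC5Ub21FbGxpb3R0LjQuMTEuMC42NA==&arch=64'
# prints: https://fogproject.org/kernels/Kernel.TomElliott.4.11.0.64
```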
    


  • Good to know. I won’t be able to test right away though. I’m busy doing summer stuff right now. I’ll get back to you as soon as possible.


  • Senior Developer

    The iPXE developers finally got a chance to look at this and have hopefully fixed it.

    I’ve updated the ipxe binaries directly after pushing 1.5.0-RC-6. Both working and dev-branch have the updates though.

    Please re-test and let us know if things are “still” working or if it breaks anything again.

    Thank you.

    I have been keeping the binaries up to date while leaving in my “reversion” code, which seemed to fix the problem for users, so that Hyper-V Gen 2 could still work. Hopefully it now works using iPXE’s own native source code.

    See here:
    https://git.ipxe.org/ipxe.git/commit/936657832f2262ad04bdf16b9229ce0b1d1c174f



  • Based on the timeframe in which this bug was submitted (May 2017), I’m guessing this is only a problem on Hyper-V build 1703 (as shipped in Windows 10 Enterprise x64 1703, Hyper-V Server 2016 build 1703, and Windows Server 2016 with Hyper-V build 1703). Most likely the issue is caused by the ARP problem found only in Gen 2 VMs in the PXE stack of Hyper-V build 1703. See this Microsoft forum post for more details on how ARP was broken in this build. There is still no resolution to my knowledge, but this post is most likely what will drive the fix from Microsoft’s side.

    https://social.technet.microsoft.com/Forums/en-US/436c67cb-1c7d-4c5f-8f62-3518a0cfaeb4/hyperv-build-1703-generation-2-pxe-has-faulty-arp-implementation?forum=winserverhyperv


  • Senior Developer

    @sudburr Nope, but 1.4.2 should have the patched binaries anyway.



  • Any love from the folks at ipxe.org yet?



  • Many thanks for the live help Tom!


  • Senior Developer

    While working this out through chat, I’ve pushed patched, working iPXE binaries that address this particular boot problem. I should note, however, that this is still not a FOG-specific problem; rather, something went wonky in the iPXE binaries. I’ve posted on their forums and hope to hear back soon.



  • ipxe.efi 276d6 also fails the same way.


  • Senior Developer

    @sudburr Mind retrying installer? I’m installing before the “big” change occurred.

    Also, to fix the issue you saw with the original checkout, please try:

    git reset --hard
    git checkout hyperv-ipxetest1
    git pull
    


  • Strangely:

    git checkout hyperv-ipxetest1
    error: Your local changes to the following files would be overwritten by checkout:
            packages/tftp/ipxe.efi
    Please, commit your changes or stash them before you can switch branches.
    Aborting
    

    So I ran:

    git checkout -- .
    

    That returned nothing, but then I ran:

    git checkout hyperv-ipxetest1
    Branch hyperv-ipxetest1 set up to track remote branch hyperv-ipxetest1 from origin.
    Switched to a new branch 'hyperv-ipxetest1'
    

    and

    git pull
    Already up-to-date.
    

    So I proceeded with installation and it reported:

    Version: 9220986 Installer/Updater
    

    iPXE.efi is now 356f … and it fails into a restart loop the same way.


  • Developer

    @Tom-Elliott Yes, from quickly looking over the code changes I would guess that b91cc98 is probably the main cause. It seems like a big change. Let’s see if you get a positive test from @sudburr.


  • Senior Developer

    So, take 2: I’ve applied a change for what I suspect is the problem. Apparently there were other changes, and the notable one is how the code handles the vmbus_probe call. This caused a problem with the build I originally posted (I didn’t notice the error messages, sorry).

    In the past, a failed vmbus_probe check only called hv_unmap_synic ( hv ).
    If hv_map_hypercall or hv_map_synic failed, the code called hv_unmap_hypercall ( hv ) and hv_free_message ( hv ), respectively.

    In the new code, the err_vmbus_probe failure path calls all three. The reason for the change, as far as I can gather, is that hv_map_hypercall and hv_map_synic never returned a failure, while vmbus_probe could potentially fail.

    My change simply comments out the hv_unmap_hypercall and hv_free_message calls, as they wouldn’t have been called in the past. This is still a guess.


  • Senior Developer

    Of note: this shows all the changes. Notice the three elements dealing with hyper v?

    I haven’t tried reverting the second commit in the change set, as I’m really hoping not to have to mess with it. I’m not saying it is or isn’t the problem; I’m just trying to pin down more specifically where the error is. In particular:

    @mcb30	mcb30	[hyperv] Do not fail if guest OS ID MSR is already set  …			a0f6e75
    @mcb30	mcb30	[hyperv] Remove redundant return status code from mapping functions  …			276d618
    @mcb30	mcb30	[hyperv] Cope with Windows Server 2016 enlightenments  …			b91cc98
    

    These above items appear to be the only things related to hyper v.

    I’m thinking the problem comes mainly from the first two commits. I’m almost certain it has to be 276d618, as the call in a0f6e75 appears to be looking for an int to test against. If the ints are removed, would it still fail in the same fashion? (I don’t really know, I’m just going on what I’m seeing.)

    https://github.com/ipxe/ipxe/compare/2d79b20...master
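
    The three [hyperv] entries listed above can also be pulled out of a local iPXE clone instead of being read off the compare page. A minimal sketch, assuming a clone whose history contains 2d79b20 and a master branch; the ./ipxe path and the function name are hypothetical:

```shell
#!/bin/sh
# list_subsystem_commits REPO BASE HEAD TAG
# Prints one-line summaries of commits in BASE..HEAD whose message has a
# line starting with "[TAG]" (iPXE subjects are prefixed this way).
# (Hypothetical helper name.)
list_subsystem_commits() {
    lsc_repo=$1 lsc_base=$2 lsc_head=$3 lsc_tag=$4
    # cd in a subshell rather than "git -C", which older git releases lack.
    (cd "$lsc_repo" && git log --oneline --grep="^\[$lsc_tag\]" "$lsc_base..$lsc_head")
}

# Hypothetical usage against an iPXE clone in ./ipxe:
# list_subsystem_commits ./ipxe 2d79b20 master hyperv
```

    Any commit it prints is only a candidate; confirming one still means building a binary with that commit reverted and booting the Gen 2 VM.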


  • Senior Developer

    @sudburr I’ve found what I think is the problem. I’m not sure why this would be a problem, but I need confirmation so I can provide the background for a proper bug report.

    I’ve built a set of ipxe binaries that I hope will work. Please try:

    git fetch --all
    git checkout hyperv-ipxetest1
    git pull
    cd bin
    ./installfog.sh -y
    

    Does this work? Essentially it’s the most current pull with the minor reversion from:
    https://github.com/ipxe/ipxe/commit/a0f6e75532c68f49b3e1c73ca88151d9663f5269

    All it does is revert this as I’m fairly sure the message is not the part breaking it:

    -		return -EBUSY;
    


  • Dropping ipxe.efi 2d79 onto a 1.4.0 install works happily.



    That’ll do the job. I was hoping there would be a way to download only the ipxe.efi files without needing to re-install the entire beasty. I’m sure there’s a way, but I don’t use Git professionally.

    RC-4 with fd6d1 works
    RC-5 with 84d4 works
    RC-6 to RC-9.2 with 2d79 works
    … and there we go, the problem starts with RC-9.3 and ipxe.efi 17887 .
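
    For pulling a single ipxe.efi out of an earlier release without running the installer, git can print a file as it existed at a given tag. This is a rough sketch, not an official FOG procedure: the clone location ~/fogproject, the /tftpboot directory and the function name are assumptions, while packages/tftp/ipxe.efi and the tags/1.4.0-RC-4 name come from this thread.

```shell
#!/bin/sh
# extract_file_at_rev REPO REV FILE DEST
# Writes FILE as it existed at revision REV of the git repository REPO
# to DEST, leaving DEST untouched if the revision or file does not exist.
# (Hypothetical helper name.)
extract_file_at_rev() {
    efr_repo=$1 efr_rev=$2 efr_file=$3 efr_dest=$4
    efr_tmp="$efr_dest.new.$$"
    # Write to a temporary file first so a failed lookup cannot clobber DEST.
    if (cd "$efr_repo" && git show "$efr_rev:$efr_file") > "$efr_tmp" 2>/dev/null; then
        mv "$efr_tmp" "$efr_dest"
    else
        rm -f "$efr_tmp"
        return 1
    fi
}

# Hypothetical usage; back up the current binary first:
# cp /tftpboot/ipxe.efi /tftpboot/ipxe.efi.bak
# extract_file_at_rev ~/fogproject tags/1.4.0-RC-4 packages/tftp/ipxe.efi /tftpboot/ipxe.efi
```

    This swaps only the one file named; whether other boot files from the same release need to match it is not something this thread settles.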


  • Senior Developer

    @sudburr What do you mean?

    I build at seemingly random intervals. So if you know RC-4 of 1.4.0 was the last known good, just install RC-4 from git:

    git checkout tags/1.4.0-RC-4
    cd bin
    ./installfog.sh -y
    

    Figure out what version is in the string after confirming it works.

    Then re-install the dev-branch (or whatever version you’re installing) with:

    git checkout dev-branch
    git pull
    cd bin
    ./installfog.sh -y
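
    For the “figure out what version is in the string” step, the build identifier shown in the boot banner (for example the a19ac in iPXE 1.0.0+ (a19ac)) can probably be read from the binary itself. A small sketch, assuming the string is stored as plain ASCII inside ipxe.efi and that the file lives in /tftpboot; neither is confirmed in this thread, and the function name is hypothetical:

```shell
#!/bin/sh
# ipxe_build_id FILE
# Prints the first "1.0.0+ (hash)" style version string found in FILE.
# (Hypothetical helper name.)
ipxe_build_id() {
    # -a treats the binary as text; -o prints only the matched part.
    grep -a -o '1\.0\.0+ ([0-9a-f]*)' "$1" | head -n 1
}

# Hypothetical usage:
# ipxe_build_id /tftpboot/ipxe.efi
```

    If it prints nothing, the banner on the VM console remains the reliable place to read the identifier.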
    


  • Where can I find a repository of the ipxe.efi builds that have been used in FOG?


  • Senior Developer

    @george1421 It is indeed, but I doubt it will work. I thought there was some effort in current iPXE to work with Hyper-V, though.

