UEFI booting with Yoga 370
-
@Iceman344 Yeah, one step closer. I feel this is failing very early in the process where iPXE hands off to the kernel. Alright, next with more debug output is 03_ipxe.efi (download). I am sorry for this step-by-step approach but AFAIK there is no other way to find where this actually hangs. We’ll take another 2 or 3 iterations till we actually hit the exact spot I suppose.
Would you mind using the normal forum picture upload function (the rightmost symbol above the text field) as those Photobucket pictures are not really working. Hope you didn’t mind me editing your other posts and adding the pics directly.
-
@sebastian-roth So, update! The new bootloader shows some more info and boots nicely to the menu. The problem right now is that the debug info goes by a little quickly. I got some footage with a camera but the 30fps limit doesn’t help, so I winged it with some burst mode pictures. Tomorrow I’ll capture the HDMI output directly to get more accurate info. Is there any way to make it dump the output to the file system it’s running on?
Also, I totally understand the iteration issue. I’ll be here to test!
This is as much info as I could get out of the footage.
This is after selecting “Client System Information” as a test.
Also, on a side note, I did not realise that pictures could be uploaded and inserted this way. I’ve replaced the ones in the previous posts.
-
@iceman344 said in UEFI booting with Yoga 370:
… and boots nicely to the menu.
This is because you used a different client this time - see the different MAC addresses in the pictures. For this host no task has been scheduled and therefore you got the FOG menu. But that’s ok I reckon. The last of the three pictures you posted lately is a good pointer on where things go wrong.
Although we must be very close to the “hang” (as I don’t see many function calls further down that way in the code) I still have no clue why this would stall. Possibly this is not a hang but more of an infinite loop. We’ll see from the next debug output - 04_ipxe.efi (download). Don’t worry about taking a slow motion video if it hangs. We just need to see the very last output. In case it loops over and over a video might be handy but I guess we could even go with a picture then. So no need for high tech video capturing I reckon.
Also, I totally understand the iteration issue. I’ll be here to test!
Thanks a lot for taking this up with me.
-
@sebastian-roth Ah yes, this is because I was using a Thunderbolt dock that was already in place, but had to fetch a new one as it was needed. The machine is still the same however.
So now this is the last message I get:
-
@Iceman344 Right now to me this looks as if iPXE is just waiting for the TCP connection to close. For some reason the other communication partner (FOG server in this case I reckon) seems to not properly close the connection and iPXE is waiting for it. This is just an assumption up to now as we don’t have a packet dump of the communication yet. Might be one of the next steps. Don’t get me wrong. I still think this is something caused by the Realtek USB NIC… we’ll figure it out at some point I am sure.
But first please try simply waiting. Use 05_ipxe.efi (again added more debug output - download), take a picture of the screen where it hangs and then just let it sit there for a couple of hours. Check on it every now and then to see if it went any further or if it just sits at this stage forever.
-
@sebastian-roth I loaded up the image and let it run for a bit (it took a while as it had to dump all the tcp handshakes). I let the laptop stay on for a while after this but nothing further happened.
-
@Iceman344 I am sorry but I still don’t see the logic behind this. Maybe I just can’t see the wood for the trees right now. Again added more debug output in 06_ipxe.efi (download) and also compiled 07_ipxe.efi that skips the TCP shutdown code altogether. Sure this is really ugly but let’s see what happens with this. Again try it out and take a picture when it hangs. Thanks!
-
@sebastian-roth Running 06_ipxe did not produce a different output to number 5 at first glance.
However 07_ipxe did throw out something else. It hung this time on removing devices. I’m guessing this is because of a bad handover from the bootloader to the kernel?
Not ugly imho, pretty cool to see the packets flow by
-
@Iceman344 This combination of hardware, USB NIC and UEFI firmware is just a piece of s**t I reckon. Sorry for using those words but I can’t believe it’s hanging on one of the other shutdown functions as well, now that we skipped the TCP shutdown function. There are six if I remember correctly…
I will look into this when I get home.
-
@Iceman344 Alright I have tried hard to understand what’s going on here and I have had some new insights. But still no solution I’m afraid. First off, here is another binary 08_ipxe.efi (download) which should print out which devices it tries to remove.
Trying this binary on a UEFI MacBook I see the following:
This is not actually hanging - I just added a sleep call so I could capture a good picture of this. Knowing a bit about TCP I see the transfer being properly finished and the connections closed (PSH = “push” the last bytes, FIN = then close the connection, ACK = acknowledging the last couple of bytes). Then we receive FIN ACK from the server and both finally terminate the connection with ACK. This is how it should be.
In the pictures you posted I see things being all over the place. In the picture you took with 05_ipxe.efi we see the client side wants to FIN ACK the connection but the server has not sent FIN yet. Looks like it kills the connection right in the middle somewhere. For 06 and 07 I see a proper PSH FIN ACK received from the server. The client sends a good looking FIN ACK but for whatever reason does not get the final ACK from the server.
Please do me a favor and capture a packet dump of the communication as well. See George’s instructions here: https://forums.fogproject.org/topic/9673/when-dhcp-pxe-booting-process-goes-bad-and-you-have-no-clue but use
tcpdump -w output.pcap host 192.168.12.x and port 80
as we only want to see the HTTP traffic of this one client (put in the correct IP of your test client). Please upload the PCAP file somewhere and post a link here or send me a private message with the download link.
@george1421 Maybe you have an idea what could be the issue here?
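For anyone following along with the PSH / FIN / ACK talk above, here is a minimal Python sketch of how those flags sit in the TCP header's flags byte, plus a very rough check of whether a close looks complete. The helper names (`decode_tcp_flags`, `closed_cleanly`) and the `"client"`/`"server"` labels are made up for illustration, and the check deliberately ignores sequence numbers, so treat it as a reading aid and not as what iPXE or Wireshark actually do.

```python
# TCP control flag bits in the header's flags byte (standard values).
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20
FLAG_NAMES = [(FIN, "FIN"), (SYN, "SYN"), (RST, "RST"),
              (PSH, "PSH"), (ACK, "ACK"), (URG, "URG")]


def decode_tcp_flags(flags_byte):
    """Return the names of all flags set in a TCP flags byte."""
    return [name for bit, name in FLAG_NAMES if flags_byte & bit]


def closed_cleanly(segments):
    """Rough check for a complete close (hypothetical helper).

    segments: list of (sender, flags_byte) in capture order, where sender
    is "client" or "server". Returns True only if both sides sent a FIN
    and the peer sent some segment with ACK set afterwards.
    Simplification: sequence/ack numbers are not compared.
    """
    peers = {"client": "server", "server": "client"}
    for side, peer in peers.items():
        fin_positions = [i for i, (sender, flags) in enumerate(segments)
                         if sender == side and flags & FIN]
        if not fin_positions:
            return False
        later = segments[fin_positions[0] + 1:]
        if not any(sender == peer and flags & ACK for sender, flags in later):
            return False
    return True


# The sequence described for the MacBook: server PSH FIN ACK,
# client FIN ACK, server final ACK.
good = [("server", PSH | FIN | ACK), ("client", FIN | ACK), ("server", ACK)]
# The sequence described for 06/07: the final ACK never arrives.
stuck = good[:2]

print(decode_tcp_flags(PSH | FIN | ACK))
print(closed_cleanly(good), closed_cleanly(stuck))
```

With the final ACK missing, the client's FIN is never acknowledged, which matches the "waiting for the connection to close" picture.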
-
@sebastian-roth Loaded up the image and took some pictures. I saw some devices being initialised so snapped that and the last line pushed to the screen. I don’t know if these were the devices you were referring to in the first lines of your post.
Now about the network traffic. Reading over your reasoning I agree and am afraid that the reality is very close to worse. Taking a quick look at the dump, it is filled with retransmissions because of the “Frame check sequence” being incorrect (packet checksums don’t match). It seems that all the incorrect packets have an FCS of 0xd3600000. I’m guessing it’s dropping the packets, as from time to time the loading dump freezes for a decent amount of time.
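As a side note on what the “Frame check sequence” is: the Ethernet FCS is a CRC-32 over the frame, which Python's `zlib.crc32` happens to compute with the same parameters. The sketch below (function names are hypothetical) shows how a correct FCS would be derived and checked. Since a real CRC-32 changes with the frame contents, the same value showing up on many different frames suggests those four bytes were not a checksum of that frame at all; what causes that in this setup is not something the sketch can tell.

```python
import struct
import zlib


def ethernet_fcs(frame_without_fcs: bytes) -> bytes:
    """Return the 4 FCS bytes as they would follow the frame on the wire.

    Ethernet uses CRC-32 (same parameters as zlib.crc32); the value is
    appended least significant byte first.
    """
    return struct.pack("<I", zlib.crc32(frame_without_fcs) & 0xFFFFFFFF)


def fcs_is_valid(frame_with_fcs: bytes) -> bool:
    """Check whether the last 4 bytes match the CRC-32 of the rest."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return ethernet_fcs(body) == fcs


# Made-up 60-byte frame body, just for demonstration.
body = bytes(range(60))
good_frame = body + ethernet_fcs(body)
# Same body with the constant trailer seen in the capture.
odd_frame = body + bytes.fromhex("d3600000")

print(fcs_is_valid(good_frame), fcs_is_valid(odd_frame))
```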
I also spotted this at the end of a first partial dump I did. However, I could not find the same again in the full capture. Probably the result of packet errors:
7zXZ Destination address too large XZ-compressed data is corrupt Bug in the XZ decompressor Destination physical address inappropriately aligned Destination virtual address inappropriately aligned XZ decompressor ran out of memory Input is not in the XZ format (wrong magic bytes) Input was encoded with settings that are not supported by this XZ decoder Kernel is not a valid ELF file Failed to allocate space for phdrs Avoiding potentially unsafe overlapping memcpy()! -- System halted EL64 EL32 Failed to handle fs_proto Failed to open volume initrd= Failed to alloc mem for rom Failed to read rom->vendor Failed to read rom->devid Failed to alloc mem for gdt efi_main() failed! exit_boot() failed! Failed to get handle for LOADED_IMAGE_PROTOCOL Failed to alloc lowmem for boot params Trying to load files to higher address Failed to alloc mem for pci_handle Failed to alloc mem for gdt structure efi_relocate_kernel() failed! efi= nochunk Failed to open file: Failed to get file info size Failed to get initrd info EFI stub: ERROR: Failed to alloc mem for file handle list Failed to alloc mem for file info EFI stub: ERROR: Failed to alloc highmem for files EFI stub: ERROR: We've run out of free low memory EFI stub: ERROR: Failed to read file EFI stub: ERROR: Failed to allocate usable memory for kernel. EFI stub: UEFI Secure Boot is enabled. EFI stub: ERROR: Could not determine UEFI Secure Boot status.
-
@Iceman344 Damn, I forgot to add the “skipping tcp_shutdown” part, so you ran into the same issue as earlier. Sorry for that. Try 09_ipxe.efi (download)!
Thanks for the packet capture. I’ll look into this later on when I have more time.
-
So last update from me. The device dump did get displayed now, nicely halting at EFI. Hope this gives a good insight into the problem. As I’m leaving my current job I’m handing this issue over to my boss. I explained the EFI testing steps and he’s been following it for some time so he is up to date.
Thanks for all the help and support, it’s great to see dedication to an open project like this. I’ll be around on the forum still as I’ll keep using FOG from now on personally.
-
@Iceman344 It’s been great to work with you. All the best to you.
TCP is pretty amazing! I’ve seen lots of packet dumps but it’s always a bit different and you see new things when taking a deep dive in. At first sight this looks as if the client (Realtek USB NIC) is just a bit slow processing the data it gets from the server. In this case it has to handle quite a lot of it as kernel/initrd is more than just a few HTTP bytes. For the case where one end is slower than the other, TCP has a good set of “regulative” algorithms. One is “flow control” (a.k.a. “window size” or “sliding window protocol”). Each end tells the other how much data it is willing to receive before the sender has to wait for an acknowledgement. A smaller window size slows down the transfer. But from what I can see the client does not make use of this.
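To put a number on “smaller window size slows down the transfer”: with a fixed window, a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by roughly window / RTT. A small Python sketch of that back-of-the-envelope bound follows; the function names and the example figures are assumptions for illustration, not measurements from this capture.

```python
def max_throughput_bytes_per_s(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput with a fixed receive window.

    At most one full window can be unacknowledged per round trip,
    so the sender cannot exceed window / RTT.
    """
    return window_bytes / rtt_seconds


def min_transfer_time_s(total_bytes: int, window_bytes: int,
                        rtt_seconds: float) -> float:
    """Lower bound on transfer time implied by the window limit."""
    return total_bytes / max_throughput_bytes_per_s(window_bytes, rtt_seconds)


# Hypothetical figures: a 32 MiB kernel+initrd download over a link
# with a 1 ms round trip, once with a 64 KiB window and once with 8 KiB.
payload = 32 * 1024 * 1024
for window in (64 * 1024, 8 * 1024):
    seconds = min_transfer_time_s(payload, window, 0.001)
    print(f"window {window} bytes -> at least {seconds:.2f} s")
```

A receiver that cannot keep up would normally advertise a smaller window so the sender backs off instead of having packets dropped.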
I found an interesting post (https://ask.wireshark.org/questions/17730/retransmissions) that states:
Packets get lost for any number of reasons. Here are a few likely candidates for large number of retransmissions:
- Full Duplex / Half Duplex mismatch (check the configuration of the network card and switch interfaces)
- The server transmits data with a high speed (say 1 GBit) and the receiver is connected with a lower speed (say 100 MBit). Drops occur if the receiver is signalling a large TCP window size, found in the TCP header.
- One of your routers is configured with a quality of service rule that enforces a certain bandwidth
- A broken cable offers very poor signal quality
- A wireless network is busy or suffers from interference
My first guess is server and client are going at different speeds. Could you please force the connection speed on the switch port on both server and client side to 100MBit/s! In case you cannot alter switch configuration you could also just take an old 100MBit/s mini switch and connect it in between. Again take a packet dump on the FOG server and see if you get TCP Dup ACK and TCP Retransmission packets.
The device dump did get displayed now, nicely halting at EFI.
Although I am not sure, I kind of hope that this is just caused by skipping the TCP cleanup stuff. Let’s see if we can get the connection/transfer to play nicely first and then see if we still hang on removing EFI…
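For what the TCP Dup ACK and TCP Retransmission labels roughly mean when looking at a dump, here is a simplified Python sketch. It is not Wireshark's actual analysis (which also considers window updates, SACK, timing and sequence wraparound); the function names and the tuple layout are assumptions made for this example.

```python
def count_retransmissions(data_segments):
    """Count segments that only repeat bytes already sent.

    data_segments: list of (seq, payload_len) from ONE sender, in
    capture order. A segment counts as a retransmission if it carries
    data and ends at or before the highest byte already seen.
    """
    highest_end = None
    count = 0
    for seq, length in data_segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        if highest_end is not None and end <= highest_end:
            count += 1
        else:
            highest_end = end
    return count


def count_dup_acks(acks):
    """Count empty ACKs that repeat the previous ack number.

    acks: list of (ack_number, payload_len) from the RECEIVING side,
    in capture order.
    """
    last_ack = None
    count = 0
    for ack, length in acks:
        if length == 0 and ack == last_ack:
            count += 1
        last_ack = ack
    return count


# Made-up example: the sender repeats bytes 101..200 once, and the
# receiver acknowledges 201 three times in a row.
sent = [(1, 100), (101, 100), (101, 100), (201, 100)]
acked = [(101, 0), (201, 0), (201, 0), (201, 0), (301, 0)]
print(count_retransmissions(sent), count_dup_acks(acked))
```

A capture with only a handful of these is normal; a capture full of them is what points at loss or a speed mismatch.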
-
Yay me. I’m on this ship now too. I just received a Yoga 370 ( 20JJS2F200 ) and I’m also unable to perform operations after a UEFI network boot.
I thought it may have been due to the new hardware this model has, Thunderbolt and WiGig but disabling them has no effect.
-
@sudburr Sorry to say this but yeah!! I am really happy that you joined the team. Hope you are willing to keep this up with me and keep the debugging going. Could you please get a packet dump using this command on your FOG server while booting up one of the Yogas:
tcpdump -w output.pcap host x.x.x.x and port 80
(just replace the x.x.x.x with the client’s IP). The capture file will be about 30-40 MB because the transferred kernel and initrd are within that capture file. But we need all this information to see if you have the same issue with retransmissions and such.
I haven’t heard from @Iceman344 or his boss yet. Hope they’ll be back again as well.
-
I don’t have a lot of spare time just now for much testing; on the road; but I do have a USB-C adapter coming in and I have hooked Lenovo on the line to hopefully help out too.
-
@sudburr Did you get to test this yet?
@Iceman344 Are you still around? Who’s your boss? How would he get in touch with us?
@Brian-Hoehn Are you still onto this topic?
-
Was able to squeeze in a baseline observation before I go onto other things and I’ve informed Lenovo as well.
I have the following:
1. Lenovo ThinkPad OneLink+ to RJ45 Adapter ( p/n: SC10J34224BB FRU: 00JT801 )
2. Lenovo ThinkPad USB 3.0 Ethernet Adapter ( p/n: SC10H30171/RTL8153 FRU: 03X6903 )
3. Lenovo USB-C to Ethernet Adapter ( p/n: SC10L66919/RTL8153-04 FRU: 03X7205 )
All three work as desired on a Lenovo ThinkPad 13 (20GKSOA700 BIOS 1.24) for both Legacy and UEFI network boot. I also tested 1&2 on a Lenovo Yoga 260 (20FES0VJ00 BIOS 1.59) successfully.
Since there isn’t a OneLink+ connector on a Yoga 370 (1.17) here are my findings for the two USB models.
2a. Using the Lenovo ThinkPad USB 3.0 Ethernet Adapter ( p/n: SC10H30171/RTL8153 FRU: 03X6903 )
- Legacy mode (no secure boot, CSM, Legacy only)
- Selecting PCI LAN > Realtek PXE B000 D14
- PXE boots okay, FOS environment hands off successfully and process succeeds
2b. Using the Lenovo ThinkPad USB 3.0 Ethernet Adapter ( p/n: SC10H30171/RTL8153 FRU: 03X6903 )
- UEFI mode (no secure boot, no CSM, UEFI only)
- Selecting PCI LAN > Thinkpad USB LAN-IPv4
- PXE boots okay, but FOS environment then hangs on black screen after loading their inits for performing the selected action
3a. Using the Lenovo USB-C to Ethernet Adapter ( p/n: SC10L66919/RTL8153-04 FRU: 03X7205 )
- Legacy mode (no secure boot, CSM, Legacy only)
- Selecting PCI LAN > IBA CL Slot 00FE v0109
- Attempts to PXE boot but fails with: PXE-E61: Media test failure, check cable ; PXE-M0F: Exiting Intel Boot Agent.
3b. Using the Lenovo USB-C to Ethernet Adapter ( p/n: SC10L66919/RTL8153-04 FRU: 03X7205 )
- UEFI mode (no secure boot, no CSM, UEFI only)
- Selecting PCI LAN > Intel Gigabit 0.0.18-IPv4
- goes directly to a blank screen for 5 seconds then returns to the F12 boot menu
There is clearly something amiss with the 370’s code.
Lenovo says that the 4X90E51405 should work, but I don’t have one … yet 8).
-
@sudburr Nice, thanks a lot for testing and posting all those details here!!
Do you think you can squeeze in another short test to capture a packet dump for us to look at? Run
tcpdump -w output.pcap host x.x.x.x and port 80
on the server, and boot up the client. When it hangs just quit tcpdump (Ctrl-C), upload the capture and post a link here.