UEFI booting with Yoga 370
-
@Brian-Hoehn @Iceman344 Besides testing the binary, could you also boot your Yogas into a debug upload task and run the following command when you get to the shell:
lsusb
Please take a picture and post it here. You might also want to send the exact name of the USB NIC(s) you are using.
-
@Brian-Hoehn @Iceman344 Any news on this??
-
@Sebastian-Roth we are 100% focused on organizing a conference atm so I have not had time for this issue. From the beginning to the end of August, however, I will be fully dedicated to this.
-
@sebastian-roth So, i followed the suggested steps and here are my results.
Booting with the EFI FOG stick with debug enabled gave some more information, though I’m sure it will say more to you than to me.
I pulled verbose lists of both the USB and PCI devices and attached them below.
As far as I can see the hub is under the 0x17ef vendor ID.
http://www.mediafire.com/file/dh153eh6ar3o391/lsusb.txt
http://www.mediafire.com/file/n7s11lvpz7v27cc/lspci.txt
Apologies for my late reply on the problem.
-
@Iceman344 Thanks for getting back to this and updating the information. Interestingly enough I see this in the lspci output:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (4) I219-V (rev 21)
And when I search for the specs of the Yoga 370 I find this indeed saying that the Yoga 370 comes with a proper Intel NIC chip onboard. But maybe you’d need to buy one of those extension cords to use the onboard Intel NIC. And we still don’t know if the UEFI firmware would still refuse to play nicely with us. But I thought I’d mention this in case you wanna give it a shot.
Ok, back to the USB NIC: it has USB ID 17ef:3062, which seems to be another one of those wonky RTL8152 (USB 2.0) or RTL8153 (USB 3.0) based chips - see here.
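To pick the NIC out of a longer lsusb listing by its USB ID, a small sketch like the following could help. Only the ID 17ef:3062 comes from this thread. The sample lines, their description text and the helper name `find_usb_ids` are made up for illustration.

```python
import re

# Matches the "ID vvvv:pppp description" part of a plain lsusb line.
LSUSB_ID = re.compile(r"ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\s*(.*)")


def find_usb_ids(lsusb_text, vendor):
    """Return (vendor, product, description) tuples for one vendor ID."""
    hits = []
    for line in lsusb_text.splitlines():
        m = LSUSB_ID.search(line)
        if m and m.group(1).lower() == vendor.lower():
            hits.append((m.group(1).lower(), m.group(2).lower(), m.group(3).strip()))
    return hits


# Hypothetical sample output; the descriptions are invented.
sample = (
    "Bus 002 Device 003: ID 17ef:3062 Example USB Ethernet adapter\n"
    "Bus 001 Device 001: ID 1d6b:0002 Example root hub\n"
)

for vid, pid, desc in find_usb_ids(sample, "17ef"):
    print(f"{vid}:{pid} {desc}")
```

On a real machine the text would come from the lsusb command itself rather than from a hard-coded sample.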
Here is the next debug build iPXE binary 02_ipxe.efi for you to test (DEBUG=image,init,efi_image). Let’s see if we can get more information. As I don’t have a machine here to test right now this is kind of a “blind build”. Let’s hope I got it all right. Please take a picture and post here again.
-
@sebastian-roth Yep, they do have a nice onboard NIC but as you also correctly noted they need a special adapter to break out the connector to be actually usable (100% crap imho). But as we got a bunch of the Lenovo hubs for this device we would love to be able to use those, so no extra desk diving and replugging needs to happen.
Anyway this is the result after booting the new bootloader,
-
@Iceman344 Yeah, one step closer. I feel this is failing very early in the process where iPXE hands off to the kernel. Alright, next with more debug output is 03_ipxe.efi (download). I am sorry for this step by step action but AFAIK there is no other way to find where this actually hangs. We’ll take another 2 or 3 iterations till we actually hit the exact spot I suppose.
Would you mind using the normal forum picture upload function (the rightmost symbol above the text field) as those photobucket pictures are not really working. Hope you didn’t mind me editing your other posts and adding the pics directly.
-
@sebastian-roth So update ! The new bootloader shows some more info and boots nicely to the menu. Problem right now is that the debug info goes by a little quick. I got some footage with a camera but the 30fps limit doesn’t help, so I winged it with some burst mode pictures. Tomorrow I’ll capture the HDMI output directly to get more accurate info. Is there any way to make it dump the output to the file system it’s running on ?
Also i totally understand the iteration issue. I’ll be here to test !
This is as much info as i could get out of the footage.
This after selecting “Client System Information” as a test
Also on a side note i did not realise that pictures could be uploaded and inserted this way. I’ve replaced the ones in the previous posts
-
@iceman344 said in UEFI booting with Yoga 370:
… and boots nicely to the menu.
This is because you used a different client this time - see the different MAC addresses in the pictures. For this host no task has been scheduled and therefore you got the FOG menu. But that’s ok I reckon. The last of the three pictures you posted lately is a good pointer on where things go wrong.
Although we must be very close to the “hang” (as I don’t see many function calls further down that way in the code) I still have no clue why this would stall. Possibly this is not a hang but more of an infinite loop. We’ll see from the next debug output, 04_ipxe.efi (download). Don’t worry about taking a slow motion video if it hangs. We just need to see the very last output. In case it loops over and over a video might be handy but I guess we could even go with a picture then. So no need for high tech video capturing I reckon.
Also i totally understand the iteration issue. I’ll be here to test !
Thanks a lot for taking this up with me.
-
@sebastian-roth Ah yes, this is because I was using an already in place Thunderbolt dock, but had to fetch a new one as the first one was needed elsewhere. The machine is still the same however.
So now this is the last message i get
-
@Iceman344 Right now to me this looks as if iPXE is just waiting for the TCP connection to close. For some reason the other communication partner (the FOG server in this case I reckon) seems to not properly close the connection and iPXE is waiting for it. This is just an assumption up to now as we don’t have a packet dump of the communication yet. Might be one of the next steps. Don’t get me wrong. I still think this is something caused by the Realtek USB NIC… we’ll figure it out at some point I am sure.
But first please try simply waiting. Use 05_ipxe.efi (again with more debug output added - download), take a picture of the screen where it hangs and then just let it sit there for a couple of hours. Check on it every now and then to see if it went any further or if it just sits at this stage forever.
-
@sebastian-roth I loaded up the image and let it run for a bit (it took a while as it had to dump all the tcp handshakes). I let the laptop stay on for a while after this but nothing further happened.
-
@Iceman344 I am sorry but I still don’t see the logic behind this. Maybe I just can’t see the wood for the trees right now. Again added more debug output in 06_ipxe.efi (download) and also compiled 07_ipxe.efi that skips the TCP shutdown code altogether. Sure this is really ugly but let’s see what happens with this. Again try it out and take a picture when it hangs. Thanks!
-
@sebastian-roth Running 06_ipxe did not produce a different output from number 5 at first glance.
However 07_ipxe did throw out something else. It hung this time on removing devices. I’m guessing this is because of a bad handover from the bootloader to the kernel ?
Not ugly imho, pretty cool to see the packets flow by
-
@Iceman344 This combination of hardware, USB NIC and UEFI firmware is just a piece of s**t I reckon. Sorry for using those words but I can’t believe it’s hanging on one of the other shutdown functions as well, now that we skipped the TCP shutdown function. There are six if I remember correctly…
I will look into this when I get home.
-
@Iceman344 Alright, I have tried hard to understand what’s going on here and I have had some new insights. But still no solution I’m afraid. First off, here is another binary 08_ipxe.efi (download) which should print out which devices it tries to remove.
Trying this binary on a UEFI MacBook I see the following:
This is not actually hanging - I just added a sleep call so I could capture a good picture of this. Knowing a bit about TCP I see the transfer being properly finished and the connections closed (PSH = “push” the last bytes, FIN = then close the connection, ACK = acknowledging the last couple of bytes). Then we receive FIN ACK from the server and both finally terminate the connection with ACK. This is how it should be.
In the pictures you posted I see things being all over the place. In the picture you took with 05_ipxe.efi we see the client side wants to FIN ACK the connection but the server has not sent FIN yet. Looks like it kills the connection right in the middle somewhere. For 06 and 07 I see a proper PSH FIN ACK received from the server. The client sends a good-looking FIN ACK but for whatever reason does not get the final ACK from the server.
Please do me a favor and capture a packet dump of the communication as well. See George’s instructions here: https://forums.fogproject.org/topic/9673/when-dhcp-pxe-booting-process-goes-bad-and-you-have-no-clue but use
tcpdump -w output.pcap host 192.168.12.x and port 80
as we only want to see the HTTP traffic of this one client (put in the correct IP of your test client). Please upload the PCAP file somewhere and post a link here or send me a private message with the download link.
@george1421 Maybe you have an idea what could be the issue here?
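To make the difference between a clean close and the hanging one more concrete, here is a minimal sketch that checks a list of observed TCP flags. It is a simplification and rests on assumptions: it ignores sequence numbers, takes the exchange seen with 06 and 07 as the model (server sends PSH FIN ACK, client answers FIN ACK, server sends the final ACK), and `orderly_close` is a hypothetical helper, not something from iPXE.

```python
def orderly_close(segments):
    """Return True if both sides sent FIN and each FIN was later ACKed.

    segments: list of (sender, flags), sender being "client" or "server",
    flags being a set such as {"FIN", "ACK"}.
    Simplified: any ACK sent after the peer's FIN counts as acknowledging it.
    """
    fin_seen = {"client": False, "server": False}
    fin_acked = {"client": False, "server": False}
    for sender, flags in segments:
        peer = "server" if sender == "client" else "client"
        # An ACK from this side covers a FIN the peer sent earlier.
        if "ACK" in flags and fin_seen[peer]:
            fin_acked[peer] = True
        if "FIN" in flags:
            fin_seen[sender] = True
    return all(fin_seen.values()) and all(fin_acked.values())


clean = [
    ("server", {"PSH", "FIN", "ACK"}),
    ("client", {"FIN", "ACK"}),
    ("server", {"ACK"}),
]
# Same exchange, but the final ACK from the server never arrives.
stuck = clean[:-1]

print(orderly_close(clean), orderly_close(stuck))
```

The second case is the one where a client that insists on a clean shutdown would keep waiting.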
-
@sebastian-roth Loaded up the image and took some pictures. I saw some devices being initialised so I snapped that and the last line pushed to the screen. I don’t know if these were the devices you were referring to in the first lines of your post.
Now about the network traffic. Reading over your reasoning I agree and am afraid that the reality is probably worse. Taking a quick look at the dump, it is filled with retransmissions because of the “Frame check sequence” being incorrect (packet checksums don’t match). It seems that all the incorrect packets have an FCS of 0xd3600000. I’m guessing it’s dropping the packets, as from time to time the loading dump freezes for a decent amount of time.
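For reference, the Ethernet frame check sequence is a CRC-32 over the frame contents, using the same polynomial as `zlib.crc32`, with the result sent least significant byte first. The sketch below recomputes it for a frame and compares it with the trailing four bytes. The frame contents are stand-in bytes, and the helper names are made up for illustration.

```python
import struct
import zlib


def ethernet_fcs(frame_without_fcs: bytes) -> bytes:
    """CRC-32 of the frame, packed least significant byte first."""
    return struct.pack("<I", zlib.crc32(frame_without_fcs) & 0xFFFFFFFF)


def fcs_matches(frame_with_fcs: bytes) -> bool:
    """Compare the trailing 4 bytes with a freshly computed FCS."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return ethernet_fcs(body) == fcs


payload = bytes(range(64))  # stand-in frame contents, not real traffic
good = payload + ethernet_fcs(payload)
# Flip one bit in the first byte to simulate corruption on the wire.
corrupted = bytes([good[0] ^ 0x01]) + good[1:]

print(fcs_matches(good), fcs_matches(corrupted))
```

Since a real FCS changes with the frame contents, one identical value across many different frames is very unlikely to match any of them, which fits what the dump shows.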
I also spotted this at the end of a first partial dump i did. Could however not find the same again in the full capture. Probably the result from packet errors:
7zXZ Destination address too large XZ-compressed data is corrupt Bug in the XZ decompressor Destination physical address inappropriately aligned Destination virtual address inappropriately aligned XZ decompressor ran out of memory Input is not in the XZ format (wrong magic bytes) Input was encoded with settings that are not supported by this XZ decoder Kernel is not a valid ELF file Failed to allocate space for phdrs Avoiding potentially unsafe overlapping memcpy()! -- System halted EL64 EL32 Failed to handle fs_proto Failed to open volume initrd= Failed to alloc mem for rom Failed to read rom->vendor Failed to read rom->devid Failed to alloc mem for gdt efi_main() failed! exit_boot() failed! Failed to get handle for LOADED_IMAGE_PROTOCOL Failed to alloc lowmem for boot params Trying to load files to higher address Failed to alloc mem for pci_handle Failed to alloc mem for gdt structure efi_relocate_kernel() failed! efi= nochunk Failed to open file: Failed to get file info size Failed to get initrd info EFI stub: ERROR: Failed to alloc mem for file handle list Failed to alloc mem for file info EFI stub: ERROR: Failed to alloc highmem for files EFI stub: ERROR: We've run out of free low memory EFI stub: ERROR: Failed to read file EFI stub: ERROR: Failed to allocate usable memory for kernel. EFI stub: UEFI Secure Boot is enabled. EFI stub: ERROR: Could not determine UEFI Secure Boot status.
-
@Iceman344 Damn, I forgot to add the “skipping tcp_shutdown” part, so you ran into the same issue as earlier. Sorry for that. Try 09_ipxe.efi (download)!
Thanks for the packet capture. I’ll look into this later on when I have more time.
-
So last update from me. The device dump did get displayed now, nicely halting at EFI. Hope this gives a good insight into the problem. As I’m leaving my current job I’m handing this issue over to my boss. I explained the EFI testing steps and he’s been following it for some time so he is up to date.
Thanks for all the help and support, it’s great to see dedication to an open project like this. I’ll be around on the forum still as I’ll keep using FOG personally from now on.
-
@Iceman344 It’s been great to work with you. All the best to you.
TCP is pretty amazing! I’ve seen lots of packet dumps but it’s always a bit different and you see new things when taking a deep dive in. At first sight this looks as if the client (Realtek USB NIC) is just a bit slow processing the data it gets from the server. In this case it has to handle quite a lot of it as kernel/initrd is more than just a few HTTP bytes. For the case where one end is slower than the other, TCP has a good set of “regulative” algorithms. One is “flow control” (a.k.a. “window size” or “sliding window protocol”). Each end tells the other how much data it may send before it has to wait for an acknowledgment. A smaller window size slows down the transfer. But from what I can see the client does not make use of this.
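As a rough illustration of how the window size regulates speed: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by window size divided by round-trip time. The numbers below are made-up examples, not values from the capture.

```python
def max_throughput_mbit(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


# Hypothetical values: 1 ms round trip on a local network.
for window in (65535, 16384, 4096):
    print(window, round(max_throughput_mbit(window, 0.001), 2))
```

A receiver that cannot keep up would normally advertise a smaller window and thereby lower this bound, instead of letting packets get dropped.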
I found an interesting post (https://ask.wireshark.org/questions/17730/retransmissions) that states:
Packets get lost for any number of reasons. Here are a few likely candidates for large number of retransmissions:
- Full Duplex / Half Duplex mismatch (check the configuration of the network card and switch interfaces)
- The server transmits data with a high speed (say 1 GBit) and the receiver is connected with a lower speed (say 100 MBit). Drops occur if the receiver is signalling a large TCP window size, found in the TCP header.
- One of your routers is configured with a quality of service rule that enforces a certain bandwidth
- A broken cable offers very poor signal quality
- A wireless network is busy or suffers from interference
My first guess is that server and client are running at different speeds. Could you please force the connection speed to 100 MBit/s on the switch ports on both the server and the client side? In case you cannot alter the switch configuration you could also just take an old 100 MBit/s mini switch and connect it in between. Again take a packet dump on the FOG server and see if you see TCP Dup ACK and TCP Retransmission packets.
The device dump did get displayed now, nicely halting at EFI.
Although I am not sure, I kind of hope that this is just caused by skipping the TCP cleanup stuff. Let’s see if we can get the connection/transfer to play nicely first and then see if we still hang on removing EFI…
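When comparing the two packet dumps, a rough way to count retransmissions is to look for data segments whose byte range was already sent before. This is only a heuristic sketch under stated assumptions: it looks at one direction of one connection, ignores sequence number wraparound and reordering, and is not how Wireshark classifies packets. The input tuples are invented.

```python
def count_retransmissions(segments):
    """Count data segments that only cover bytes already sent.

    segments: iterable of (seq, payload_len) for one direction of one
    TCP connection, in capture order.
    """
    highest_end = None
    retransmissions = 0
    for seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        if highest_end is not None and end <= highest_end:
            retransmissions += 1
        else:
            highest_end = end
    return retransmissions


# Hypothetical trace: the second 1000-byte segment is sent twice.
trace = [(1, 1000), (1001, 1000), (1001, 1000), (2001, 1000), (3001, 0)]
print(count_retransmissions(trace))
```

If forcing 100 MBit/s helps, this count should drop noticeably between the old and the new dump.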