HP Stream 11 G3 - locks up after init.xz?
-
@Sebastian-Roth said in HP Stream 11 G3 - locks up after init.xz?:
, schedule a debug upload task
Psssst… in the grub menu, via menu item #6 (I think), you can jump right to debug mode with no capture or deploy step required. You can’t continue to capture or deploy from there, but you can run the commands you outlined.
-
Strange. I rebooted the server and the 192.168.x.x address was no longer interfering with PXE booting. No signs of it in boot.php based on the suggestions from George at this point either.
In any case, if I run the first command from a FOS USB, it comes back with no results. What’s interesting is it tries three times to get an IP but seems to give up, despite its actually getting an address…?
Here’s what it locked up at with 05_undionly and the added host debug parameters DEBUG=init,device,undi,pci.
06_undionly with the extra debug params was just a black screen when it locked up, with bzImage… ok and init.xz… ok and a blinking cursor.
And 07, with debug params, just skips the FOG menu and boots to the OS. This is what’s on the screen for a split-second before it starts booting Windows.
-
@theterminator93 Thanks again for testing and posting pictures!
Strange. I rebooted the server and the 192.168.x.x address was no longer interfering with PXE booting.
That was my fault. I had the wrong script embedded in one of the older binaries. See my post here further down. Don’t worry about it.
In any case, if I run the first command from a FOS USB, it comes back with no results. What’s interesting is it tries three times to get an IP but seems to give up, despite its actually getting an address…?
FOS tries to contact the FOG webserver after getting an IP via DHCP to make sure it’s fully connected. When building George’s USB FOS you need to change myfogip=x.x.x.x in that script to match your FOG server IP. Otherwise you’ll run into this issue.
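A placeholder that was never replaced is easy to miss, so a quick sanity check before building the USB image may help. This is only a sketch: the helper name is_ipv4 is made up for illustration and is not part of George’s script, and 10.15.1.20 is simply the server address that shows up in the boot messages later in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of the FOS scripts): returns 0 if $1 looks
# like a dotted-quad IPv4 address, 1 otherwise (e.g. the x.x.x.x placeholder).
is_ipv4() {
    ip=$1
    # Reject empty input, any character other than digits and dots,
    # and leading, trailing or doubled dots.
    case "$ip" in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac
    # Split on dots and require exactly four fields.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    # Each field must be at most three digits and no larger than 255.
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] || return 1
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Example: check the value before putting it into the script.
myfogip=10.15.1.20   # assumption: replace with your own FOG server IP
if is_ipv4 "$myfogip"; then
    echo "myfogip looks like an IPv4 address: $myfogip"
else
    echo "myfogip is still a placeholder or malformed: $myfogip"
fi
```

This only checks the format of the value. It does not verify that a FOG server actually answers at that address.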
About the empty output: again my fault, sorry! I should have asked you to run lsusb instead! Please post a picture of the full output. One piece of information we already have thanks to the picture: it seems the driver r8152 is handling eth0… It will be interesting to see the USB IDs.
Here’s what it locked up at with 05_undionly
Alright, it hangs when trying to unload the UNDI root bus. Can’t tell you exactly what that is but it sounds like the UNDI implementation of that USB NIC is faulty. Possibly we can work around this but I need a little more time. Let me see.
06_undionly with the extra debug params was just a black screen when it locked up, with bzImage… ok and init.xz… ok and a blinking cursor.
Fine, so none of our iPXE header configs is causing this issue. 06 was the clean build.
And 07, with debug params, just skips the FOG menu and boots to the OS…
Nice! See this, when iPXE simply exits (shutting itself down to boot from hard disk) there is no hang on “Removed UNDI root bus”!
Hope I can give you some more information or binaries to test soon!
-
@theterminator93 Ok, here is the latest binary with even more debug output. It’s called 08_undionly.kpxe (link). Let’s hope we reach the end soon!
-
I tried running lsusb from a debug kernel off FOS. Nothing appeared with ethernet in the name, so I snapped this in case it proves useful.
And here is what we lock up at with 08…
-
@theterminator93 Oh yes, the lsusb output only has numbers. I’ll figure those out…
We are getting closer; it seems to hang when calling some internal UNDI wrapper. I commented out this call and compiled 09_undionly.kpxe for you. See if this runs all the way through!?
-
Ok, so the first entry of ‘lsusb’ is telling us you have a Realtek RTL8153 Gigabit Ethernet Adapter/Chip in that USB NIC. The WorkingDevices list in our wiki has this kind of USB NIC listed as confirmed working using undionly.kkpxe (double ‘k’) in one case and ipxe.efi in the other. Neither worked for you, right?! So I guess the firmware of this particular model is crappy. Let’s see what you get booting 09_undionly.kpxe…
-
Actually both undionly.kpxe and undionly.kkpxe managed to successfully give me the FOG menu. I didn’t try any of the EFI binaries since I built the Win10 image in legacy mode.
The good news is… 09_undionly.kpxe didn’t lock up and we appear to be in business!
It did spit out a bunch of debug output immediately before the screen went black (like usual) which I attempted to capture:
The only oddity now is that after I PXE boot once (maybe twice), subsequent boot attempts throw an error and reboot until I restart the FOG server.
tftp://10.15.1.20/default.ipxe… ok
http://10.15.1.20/fog/service/ipxe/boot.php… Connection reset (http://ipxe.org/0f0a6039)
Could not boot: Connection reset (http://ipxe.org/0f0a6039)
Chainloading failed, hit ‘s’ for the iPXE shell; reboot in 10 seconds
-
@theterminator93 Great to see we got through! At first sight this “connection reset” thing has nothing to do with the patched iPXE binary. I can’t think of how this would be related. But then, you never know.
Does this only happen when cold booting the device? Do you see anything in the Apache error logs when this is happening (see my signature on where to find those)?
-
Odd… I did nothing and it started working. I cold booted two times in a row and it worked both times.
But then I tried a different (same model) host just to be sure, and it threw the error. After that I tried the original host and it was erroring out too. Then I tried undionly.kpxe to see if it would at least give me a menu: same result. Then I tried a different subnet… and it worked. Apparently it’s something screwy with the network or switch.
As for the error log, nothing of significance. There are only events indicating that the OS is shutting down and starting again.
-
@theterminator93 Would be interesting to see if the connection is actually being reset by TCP packets. Please install tcpdump on your FOG server, then run tcpdump -w /tmp/reset.pcap port 80 and leave the command running. Start up your client and see if you run into the issue. Either way, stop tcpdump after that with Ctrl+C. In case you don’t see the error, just fire up the same command and boot up the client again until you see the “connection reset”. Then upload that PCAP file and post a link here.
-
@theterminator93 Any news on this?
-
@Quazz The debug 09_undionly.kpxe binary ended up working properly. Whatever Sebastian did as a workaround allowed it to go straight from PXE to the kernel.
@Sebastian-Roth Here’s a link to a pcap when a different machine did the connection reset jig today.
-
@theterminator93 Thanks for capturing a packet dump. Unfortunately I can’t find any TCP reset (Wireshark display filter: tcp.flags.reset==1), HTTP request to /fog/service/ipxe/boot.php (display filter: http.request.uri contains ipxe) or any other obvious issue in there.
I will look into what’s causing the initial hang issue in the next few days! Will let you know.
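For anyone who wants to repeat the TCP reset check without opening Wireshark, roughly the same test can be done with tcpdump itself by reading the capture back with a filter on the RST flag. A minimal sketch, assuming the capture was saved as /tmp/reset.pcap as suggested earlier; the function name count_resets is made up for illustration.

```shell
#!/bin/sh
# Count packets in a capture file that have the TCP RST flag set.
# $1: path to a pcap file, e.g. /tmp/reset.pcap from the capture step.
# tcpdump's "reading from file" notice goes to stderr and is discarded.
count_resets() {
    tcpdump -nn -r "$1" 'tcp[tcpflags] & tcp-rst != 0' 2>/dev/null | wc -l | tr -d ' '
}

# Example call (run on a machine that has tcpdump and the capture file):
# count_resets /tmp/reset.pcap
```

A result of 0 would match what the Wireshark filter showed: no reset packets in the capture. Dropping the wc part of the pipeline prints the matching packets instead of counting them.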
-
@theterminator93 I dug through the iPXE code and read Intel’s PXE spec but I am still not sure why it would hang/loop/lock up (?) right at this point. This is where iPXE calls basic PXE functions (provided by the NIC’s PXE code) to clean up the base memory before handing over (e.g. to the Linux kernel) - explained here. As far as I understand, the PXE specification was never very clear and therefore different vendors implement those PXE functions in different ways.
Years ago in the early days of iPXE (was called Etherboot then) the developers were not sure about the order to call those functions.
There are other PXE boot loaders out there doing things differently. For example see the pxelinux code. It even mentions that iPXE is doing it differently in a comment. Would be interesting to see if pxelinux is doing fine on your USB NIC. This is easy to test as we still have this stuff from the old days. Please change your DHCP option 67 from undionly.kpxe to pxelinux.0.old. You also want to change the timeout in /tftpboot/pxelinux.cfg/default to TIMEOUT 05 so it’s not just flicking through. Boot up your client and see what happens. My guess is that you see the pxelinux menu and then it properly chainloads ipxe.krn, which then probably hangs as it used to with the other iPXE binaries. In case it hangs on a kernel panic then we are hitting a different spot.
I also added more debugging statements to the code and compiled a new 10_undionly.kpxe (download, compiled with DEBUG=undinet) for you to test. Please post a picture of the messages on screen again. By the way I don’t see the pictures hosted on photobucket anymore - says “please update your account to enable 3rd party hosting”.
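For reference, the DEBUG= settings mentioned in this thread are build-time options of iPXE, passed to make in the src directory of the iPXE source tree. The sketch below only prints the commands (a dry run) so it can be tried without a source checkout; the function name ipxe_debug_build_cmd is made up for illustration, and the exact options used for the numbered test binaries here (embedded script, patches) are not shown in the thread.

```shell
#!/bin/sh
# Print (dry run) the make command for an iPXE debug build.
# $1: build target under bin/, $2: comma-separated list of debug objects.
ipxe_debug_build_cmd() {
    printf 'make bin/%s DEBUG=%s\n' "$1" "$2"
}

# The two debug settings that appear in this thread:
ipxe_debug_build_cmd undionly.kpxe init,device,undi,pci
ipxe_debug_build_cmd undionly.kpxe undinet
```

Running the printed command from the iPXE src directory would produce the binary under bin/, which can then be copied to the TFTP directory under a new name for testing.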