full registration hangs at bzimage
-
And to add to Georg’s great list of questions:
- Have you tried on different hosts (same hardware)?
- Have you tried on same hardware (different clients)?
-
Sorry for the delayed reply, I was busy with some other things recently…
@george1421 said in full registration hangs at bzimage:
Is /has the bios/firmware been updated in this system?
No
Is this running bios(legacy) or uefi mode?
Bios
Are you using the built in ethernet adapter for imaging?
Yes
Is this a new install of FOG?
Yes
What precisely do you have configured for dhcp option 67?
Don’t know what this is, Sorry! I’m using the DHCP in my pfsense
Have you tried on different hosts (same hardware)?
Yes - I have a couple of the mainboard mentioned (p5w) with Linux Mint 18.02 and Debian Wheezy and none of them is working
Have you tried on same hardware (different clients)?
My Laptop - a thinkpad w540 - has also Linux Mint 18.2 installed and with this machine the registration worked
I tried to pull the image from the thinkpad today and I received an error: “Could not mount the image folder” - can this also be the problem for the other PCs?
thanks
C. -
@christian99x Please post a picture of the pfSense settings here. You need to know about things like DHCP, option 66 aka next-server and option 67 aka filename. Read up on this stuff, I’d suggest. The internet is full of great explanations on that stuff.
I have a couple of the mainboard mentioned (p5w) with Linux Mint 18.02 and Debian Wheezy and none of them is working
So the issue is specific to that mainboard I’d say. Please boot up your Linux Mint and run lspci -nn | grep net. Post what you get on the screen here.
My Laptop - a thinkpad w540 - has also Linux Mint 18.2 installed and with this machine the registration worked
Ok, so in general PXE booting and registering clients seems to work.
I tried to pull the image from the thinkpad today and I received an error: “Could not mount the image folder” - can this also be the problem for the other PCs?
Nope, different issue! Probably best if you open a complete new thread on this so we don’t mix up things! It’s a lot easier for everyone to follow if we don’t discuss several issues in one thread.
And as you keep stressing the point which OS is installed on the client… this does not matter at all. You can PXE boot clients with pretty much any OS installed and even with no hard drive in the client at all.
Edit: Moved this to section hardware compatibility as I really think this is a very specific (network) issue with the onboard NIC on that mentioned mainboard.
-
Thanks for the quick answer!
Please post a picture of the pfSense settings here.
It’s rather large…
https://screenshots.firefoxusercontent.com/images/e08efdf4-40ed-4dde-9146-163a1291f95a.png
Anything else you need to know?
You need to know about things like DHCP, option 66 aka next-server and option 67 aka filename. Read up on this stuff, I’d suggest. The internet is full of great explanations on that stuff.
Yes, I will definitely!
So the issue is specific to that mainboard I’d say. Please boot up your Linux Mint and run lspci -nn | grep net. Post what you get on the screen here.
03:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller [11ab:4362] (rev 20)
Nope, different issue! Probably best if you open a complete new thread on this so we don’t mix up things! It’s a lot easier for everyone to follow if we don’t discuss several issues in one thread.
I will try to solve this on my own first …
-
@christian99x said:
It’s rather large…
https://screenshots.firefoxusercontent.com/images/e08efdf4-40ed-4dde-9146-163a1291f95a.png
Looking good as far as I see. The last couple of settings above the Save button are important for PXE booting: “Next Server” (option 66) and “Default BIOS file name” (option 67 for legacy BIOS systems). Seems fine.
03:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller [11ab:4362] (rev 20)
Ok, this is valuable information I suppose. Though a quick search hasn’t brought anything up yet. On the iPXE website they even say this NIC is supported. And well, yes, we see it does kind of work, but as soon as it starts loading a big file over HTTP it seems to hang. I think we need to do a packet dump to see if there are network (congestion) errors happening. Get your client ready but don’t start it yet. On your FOG server install tcpdump (apt-get/yum install tcpdump). Then run the following command and substitute x.x.x.x with the client’s IP address:
tcpdump -w /tmp/boot_issue.pcap host x.x.x.x
Now boot up the client and wait till it hangs at 5% or whatever. Wait another 10-20 seconds and then stop tcpdump on the FOG server (Ctrl+c). Upload the /tmp/boot_issue.pcap file to your dropbox/googledrive and post a link here so we can check it out.
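Before uploading, a quick look at what actually ended up in the capture can save a round trip. The sketch below only reads the file back with tcpdump and counts packets per protocol. The function names are made up for illustration, and the port numbers are assumptions about a default setup (DHCP on UDP 67/68, TFTP requests on UDP 69, FOG web server on TCP 80).

```shell
#!/bin/sh
# Count packets in a capture file that match a tcpdump filter expression.
# $1 = pcap file, $2 = filter expression
count_packets() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    tcpdump -nn -r "$1" "$2" 2>/dev/null | wc -l
}

# Print one line per protocol that a PXE boot is expected to produce.
# Note: "udp port 69" only catches the initial TFTP requests, because the
# data transfer itself continues on other UDP ports.
pcap_summary() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    echo "DHCP: $(count_packets "$1" 'udp port 67 or udp port 68')"
    echo "TFTP requests: $(count_packets "$1" 'udp port 69')"
    echo "HTTP: $(count_packets "$1" 'tcp port 80')"
}
```

Calling pcap_summary /tmp/boot_issue.pcap after stopping the capture prints the three counts. A zero in the HTTP line would mean the kernel download never shows up in the file.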
-
-
@christian99x Looking at the pcap we see it transfer undionly.kpxe without issue to the last block. At this point you should see the iPXE boot menu.
I think where you are getting stuck at XX% is when FOS loads. From your pcap I’m not seeing the pull request for bzImage.
Do you have the ability to take a second computer and a small hub, or a small switch with a mirror port (like SLM2008), and capture the traffic actually going in and out of that target computer? We really need to see the entire pxe booting conversation here. The tcpdumps from the FOG perspective only tell us what the FOG server is doing. We need to see from the target computer perspective what is getting to the target from the fog server, dhcp server, tftpboot, etc.
I know we are asking a lot here. You have an abnormal situation that is causing this to fail. What you have is abnormal at least from what we’ve seen historically.
-
@christian99x Is this packet dump somehow being filtered after capturing? The only thing I see is TFTP and ARP traffic. Missing is DHCP (should at least see broadcasts) and HTTP packets.
So either those were filtered out or your network is way more complex. Possibly DHCP server, client and FOG server are in three different network segments. That way we wouldn’t see the DHCP messages when capturing packets on the FOG server. But then… where are the HTTP packets? Maybe you filtered to only show UDP packets??
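One more possible explanation for the missing DHCP packets (an assumption, not something confirmed in this thread): a capture filter of the form host x.x.x.x only keeps packets that carry that address as source or destination. The DHCP broadcasts a booting client sends go from 0.0.0.0 to 255.255.255.255, so they would not match. That would not explain the missing HTTP packets, though. A rough sketch of a wider filter, with a hypothetical helper name:

```shell
#!/bin/sh
# Build a capture filter that keeps the client's own traffic AND DHCP.
# $1 = client IP address (placeholder, substitute the real one)
pxe_capture_filter() {
    printf 'host %s or udp port 67 or udp port 68\n' "$1"
}

# Example invocation (needs root; "ens3" is only an example interface name):
#   tcpdump -i ens3 -w /tmp/boot_issue.pcap "$(pxe_capture_filter x.x.x.x)"
```

The filter is passed to tcpdump as one quoted argument, exactly like the plain host filter before.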
-
Is this packet dump somehow being filtered after capturing?
No - I used the command you mentioned and uploaded the original file
Possibly DHCP server, client and FOG server are in three different network segments.
Not that I know, though I did not set it up…
Do you have the ability to take a second computer and a small hub or a small switch (like SLM2008) using a mirror port capture the traffic actually going in and out of that target computer.
I’ve never done something like this and I only got a rough idea how to do it, I really would appreciate if you can point me in the right direction
-
If I cancel with Ctrl-C and run ifstat in the iPXE shell, I receive the following message(s):
net0: 00:1a:92:9e:10:e1 using undionly on 0000:03:00.0 (open)
[Link:up, TX:150 TXE:1 RX:276 RXE:9]
[TXE: 1 x “Network unreachable (http://ipxe.org/28086011)”]
[RXE: 4 x “Operation not supported (http://ipxe.org/3c3f6303)”]
[RXE: 4 x “Error 0x42306001 (http://ipxe.org/42306001)”]
[RXE: 1 x “Invalid argument (http://ipxe.org/1c056002)”]
-
@christian99x said in full registration hangs at bzimage:
[Link:up, TX:150 TXE:1 RX:276 RXE:9]
Looks good! TX and RX have reasonable numbers. Don’t worry about the TXE / RXE, those are not problematic errors.
I keep wondering why we don’t see the HTTP traffic in the packet dump?!
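To put the counters from that ifstat line in proportion, here is a small sketch that pulls them out and prints the receive errors as a share of received packets. The helper names are hypothetical, and whether RX already includes the errored packets is an assumption, so treat the percentage as a rough indication only.

```shell
#!/bin/sh
# Pull one counter (TX, TXE, RX or RXE) out of an iPXE ifstat status line.
# $1 = counter name, $2 = status line
ifstat_counter() {
    printf '%s\n' "$2" | sed -n "s/.*$1:\([0-9][0-9]*\).*/\1/p"
}

# Errors as a percentage of a packet counter, one decimal place.
# $1 = error count, $2 = packet count
error_pct() {
    awk -v e="$1" -v t="$2" 'BEGIN { if (t > 0) printf "%.1f\n", 100 * e / t }'
}

# Status line copied from the ifstat output quoted above.
line='[Link:up, TX:150 TXE:1 RX:276 RXE:9]'
rx=$(ifstat_counter RX "$line")
rxe=$(ifstat_counter RXE "$line")
echo "RX errors: $rxe of $rx ($(error_pct "$rxe" "$rx")%)"
```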
-
I did another attempt with tcpdump and now I can see at least some HTTP traffic - maybe this helps:
https://www.dropbox.com/s/4sl1rog18suypmu/bootissue.pcap?dl=0
After ctrl-c tcpdump told me:
tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
97 packets captured
97 packets received by filter
0 packets dropped by kernel
67 packets dropped by interface
…still working on the mirror port capture…
Thanks!
-
@christian99x Now we see a lot more in the packet dump! Yeah. So it does request boot.php which is transferred just fine. Next is bg.png - here we already see some first TCP retransmission packets, though it seems to finish properly. Then the bzImage transfer begins and seems OK for the first couple of data and acknowledge packets going back and forth. But within the first microseconds the transfer seems to stall completely. From my point of view this is because the client machine does not acknowledge the packets anymore. The interesting thing is that we see ACKs from the client 15, 30 and 45 seconds after the stall. So it kind of seems that the client is not “dead”.
Unfortunately there is not much we can do for you I think. I’d need access to such a machine and a lot of time to debug what is causing the network stall. It’s a driver issue within iPXE I reckon.
But to be sure we’d need to rule out other things. Can you try connecting the FOG server and this single client using a dumb mini switch or even a crossover cable? Does it do the same thing?
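If you capture once on the normal network and once with the direct connection, the retransmissions in both dumps can be compared. A minimal sketch, assuming tshark (the command-line tool that ships with Wireshark) is installed on the machine holding the pcap files. The function name is made up for illustration.

```shell
#!/bin/sh
# Count the packets that Wireshark's TCP analysis flags as retransmissions.
# $1 = pcap file
count_retransmissions() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    tshark -r "$1" -Y 'tcp.analysis.retransmission' 2>/dev/null | wc -l
}
```

Running count_retransmissions on each capture gives one number per setup. A similar count in both would point away from the switch and cabling.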
-
It might be worthwhile to try a different boot file (eg ipxe.pxe instead of undionly.kpxe) as well.
-
@quazz said in full registration hangs at bzimage:
It might be worthwhile to try a different boot file (eg ipxe.pxe instead of undionly.kpxe) as well.
Changing the boot file to ipxe.pxe did the trick! The host performed the full registration successfully without any errors!
Should we continue? It will take me some time to do a mirror port capture (but I’m definitely okay doing this)
-
@christian99x said:
Should we continue? It will take me some time to do a mirror port capture (but I’m definitely okay doing this)
Don’t worry about the mirror port. It’s definitely fine to use ipxe.pxe if it works for you. Some work better (or at all) for different hardware. Just see if you can boot all your hardware using ipxe.pxe. If so, just stick to that. We default to undionly.kpxe because that causes the least issues. But as we see there are pieces of hardware around not liking the UNDI driver stuff.
@Quazz Thanks heaps for mentioning the other binaries. I had thought about this as well but forgot to mention it in my last post.
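For anyone switching boot files the same way: whatever name is entered as the DHCP filename (option 67) has to exist on the TFTP server. Below is a small sketch to check that on the FOG server. The /tftpboot default is an assumption about the install, so pass the real directory if yours differs. The function name is hypothetical.

```shell
#!/bin/sh
# Report which of the two boot files discussed here exist in a TFTP directory.
# $1 = TFTP root directory (assumed to be /tftpboot if omitted)
list_boot_files() {
    dir=${1:-/tftpboot}
    for f in undionly.kpxe ipxe.pxe; do
        if [ -f "$dir/$f" ]; then
            echo "$f: present"
        else
            echo "$f: missing"
        fi
    done
}
```

Calling list_boot_files without an argument checks the assumed default directory.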