Imaging Issue
-
Latest SVN still has the same issue
-
I still want to figure this out if possible. Can you enable kernel debug and set console level to 7 and then try booting to debug task?
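If it helps, this is roughly what I mean (the exact FOG setting names may differ a bit on your version, so treat this as a sketch):
[code]
# Option 1: pass it on the kernel command line for the host
# (e.g. add these to the host's kernel arguments before scheduling the task)
#   loglevel=7 debug

# Option 2: raise the console log level at runtime once you're at the debug shell
dmesg -n 7

# Equivalent via procfs (the first value is the console log level)
echo 7 > /proc/sys/kernel/printk
[/code]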
-
Sure. I have turned that on and see a whole lot of info fly by before it gets to [root@fogclient /]#
What specifically are you looking for, or is there a way for me to copy all that from the client and send to you?
-
Take a video with a smartphone, just upload it to YouTube, and then paste the link here.
Or you can share the video file via Dropbox or Mega.
-
Here they are:
Video of debug task: [media=youtube]wOMm6Lmj7XI[/media]
Video of imaging task: [media=youtube]TUhRb1vYP7U[/media]
-
Ok, new info - it might be, and probably is, a network issue on our end of things. The same model of machine works in another building. So I took one machine from the problem lab to this other building, and it works there as well.
Spanning tree portfast is enabled everywhere. But thinking this through now, I imaged that lab fine 3 weeks ago on SVN 2961 without issue. There haven’t been any changes on the network switches or anything as far as that goes. Is it possible that the timeout between link down and link up was shortened?
-
In the first video, the network info doesn’t show the interface to have a valid IP address.
This is sort of confirmed in the second video, where it says “link down”.
Obviously it’s getting an IP when it tries to network boot, because it loads the boot file.
Something is happening between the boot file and the rest of the process… You say it works fine on r2961, and these problem computers work fine in other buildings… what revision are the other buildings on? Any newer than r2961??
-
Do you have a hub (not a switch) ?
We can use a hub to capture all traffic that is going to one of these computers to see what it’s trying to do, using Wireshark.
If you don’t have a hub, you can run TCPDump on the FOG server and at least see every broadcast message and all messages to/from that computer and the FOG server.
Here’s a tutorial on TCPDump: [url]http://fogproject.org/wiki/index.php/TCPDump[/url]
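For example, something like this on the FOG server should catch all the DHCP/TFTP chatter plus anything to or from the problem machine (the interface name and client IP below are placeholders; adjust to your setup):
[code]
# Capture DHCP, TFTP and all traffic to/from the problem client into a file for Wireshark
tcpdump -i eth0 -w /tmp/fog-debug.pcap \
  'host CLIENT_IP or ether broadcast or udp port 67 or udp port 68 or udp port 69'
[/code]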
Unless someone else has a better idea.
-
Right, that’s sort of what it looks like to me. It’s almost like it needs to grab an IP a second time before imaging and that’s where it fails.
All buildings share one common FOG server. That server has been updated to the latest revision, so I can't go back and check 2961 very easily. The computer in question is the one I was moving back and forth. In building A, I have the problem where it doesn't get the IP in the debug task. I disconnect it, walk across the parking lot to building B, plug it back in, and it works perfectly.
-
Hmm that’s a thought. I don’t have a hub, but I can set up a monitor session on the switch. Give me about 10 minutes and I’ll upload a capture file.
-
How strange…
Maybe you’ve got a rogue DHCP in building A. Maybe it’s just a defunct patch cable?
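If you want a quick way to check for a rogue DHCP server (assuming you have nmap on a machine in that building), you can broadcast a discover and see who answers:
[code]
# Send a DHCPDISCOVER broadcast and list every server that responds
# (run from a machine on the building-A network; adjust the interface name)
sudo nmap --script broadcast-dhcp-discover -e eth0
[/code]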
-
No rogue DHCPs that I can see. I did a Wireshark capture of a DHCP renewal and only got an offer from our known DHCP server.
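(For anyone following along, the renewal test was basically something like this; the interface name is just an example.)
[code]
# Force a DHCP release/renew on the test machine while capturing
sudo dhclient -r eth0 && sudo dhclient -v eth0

# Capture only DHCP traffic during the renewal
# (in Wireshark, "bootp" works as a display filter; newer versions call it "dhcp")
sudo tshark -i eth0 -f "udp port 67 or udp port 68"
[/code]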
The Wireshark capture from the debug task really doesn't show much; it's mostly just TCP segments while it's downloading the kernel.
I'm ruling out the patch cable because there are 60 machines with this same issue.
-
But are these 60 machines with the same problem all behind the same network point when they're having the issue?
-
I just noticed something else… see the attached picture. It says "starting network" before the link is up. That seems a bit backwards to me, but is there a reason I'm missing?
[url=“/_imported_xf_attachments/1/1918_Debug.PNG?:”]Debug.PNG[/url]
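(In the debug shell, the link state at that point can be checked with something like this; eth0 is an assumption, and the tools actually baked into the init may vary.)
[code]
# Show interface state; the LOWER_UP flag means the physical link is up
ip link show eth0

# 1 = link up, 0 = link down (only readable once the interface is admin-up)
cat /sys/class/net/eth0/carrier
[/code]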
-
[QUOTE]But are these 60 machines with the same problem all behind the same network point when they're having the issue?[/QUOTE]
Yes. I'm just trying to think of anything else that may have changed… I haven't logged into the switches in a long time, and DHCP hasn't been modified at all. Is there anything else I might not be thinking of?
-
Can we see a TCPDump from FOG?
-
Yep - Here’s a TCPDump
[url]https://www.dropbox.com/s/exn3ol2ro9dhxad/FogDebugTask.pcap?dl=0[/url]
Fog Server: 10.162.1.212
Client loading debug task: 10.162.30.58
Two FOG storage nodes at 10.162.1.71 and 10.162.1.72
-
I found a read request from 10.162.30.58, asking 10.162.1.212 for default.ipxe
That's wrong, I think… it should ask for undionly.kpxe, right?
I’m still looking through the file…
Ok, ok… it first asks for undionly.kpxe, but the first request gets an error (lions and tigers and bears, oh my!):
[IMG]http://s22.postimg.org/up3qxx7e9/error_code_for_file_request.png[/IMG]
After that error, it asks again… and it looks like it gets the file…
and then after that completes, it asks for default.ipxe
[IMG]http://s22.postimg.org/oadwsqfg1/another_read_request.png[/IMG]
Then it appears to use TCP (rather than TFTP) to get default.ipxe…
and then stuff kinda goes crazy… there's a ton of duplicate ACKs…
[IMG]http://s27.postimg.org/6b37olf6r/Duplicate_ACK.png[/IMG]
Eventually it finishes up, then it asks for /fog/service/ipxe/init.xz
And it looks to be pulling that file for a while… then there are a few retransmission errors, then it just seems to disappear.
It ALMOST looks like it's being interrupted by other computers communicating with FOG. One moment it's communicating, and the next some other traffic comes in from 10.163.94.76, then 10.162.3.26 and 10.163.16.48… and everything goes silent from 10.162.30.58.
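For anyone who wants to dig through the capture themselves, something like this isolates the relevant traffic (assuming tshark is available; file name as uploaded above):
[code]
# All TFTP requests/errors in the capture
tshark -r FogDebugTask.pcap -Y "tftp"

# Everything to/from the problem client, to see exactly where it goes silent
tshark -r FogDebugTask.pcap -Y "ip.addr == 10.162.30.58"

# Retransmissions, duplicate ACKs and other TCP trouble involving the client
tshark -r FogDebugTask.pcap -Y "ip.addr == 10.162.30.58 && tcp.analysis.flags"
[/code]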
-
I think we’ll need Tom to answer that one… I’m not sure what the relation is between all the *pxe files.
-
That's actually the correct method for getting the files.
It gets undionly.kpxe first; undionly then chains to fetch the tftp://default.ipxe file. At least, that is what is supposed to happen.
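A quick way to sanity-check that chain from another machine on the network is to pull both files by hand over TFTP (server IP as posted above; this assumes the tftp-hpa client syntax):
[code]
# Manually fetch the two boot-chain files from the FOG server's TFTP service
tftp 10.162.1.212 -c get undionly.kpxe
tftp 10.162.1.212 -c get default.ipxe
[/code]
If both downloads succeed from a machine in building A, the problem is more likely in what happens after the chainload than in the TFTP service itself.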