Imaging Issue
-
[QUOTE]But these 60 machines having the same problem are behind the same point when they’re having the issue?[/QUOTE]
Yes. I’m just trying to think of anything else that may have changed… I haven’t logged into the switches in a long time, dhcp hasn’t been modified at all, anything else I might not be thinking of?
-
Can we see a TCPDump from FOG?
-
Yep - Here’s a TCPDump
[url]https://www.dropbox.com/s/exn3ol2ro9dhxad/FogDebugTask.pcap?dl=0[/url]
Fog Server: 10.162.1.212
Client loading debug task: 10.162.30.58
Two fog storage nodes at 10.162.1.71 and 10.162.1.72
-
I found a read request from 10.162.30.58, asking 10.162.1.212 for default.ipxe
That’s wrong, I think… it should ask for undionly.kpxe, right?
I’m still looking through the file…
Ok, ok… it first asks for undionly.kpxe, but the first request gets an error (lions and tigers and bears, oh my!):
[IMG]http://s22.postimg.org/up3qxx7e9/error_code_for_file_request.png[/IMG]
After that error, it asks again… and it looks like it gets the file…
and then after that completes, it asks for default.ipxe
[IMG]http://s22.postimg.org/oadwsqfg1/another_read_request.png[/IMG]
Then it appears to use TCP (unlike TFTP, which runs over UDP) to get default.ipxe…
and then stuff kinda goes crazy… there’s a ton of duplicate errors…
[IMG]http://s27.postimg.org/6b37olf6r/Duplicate_ACK.png[/IMG]
Eventually it finishes up, then it asks for /fog/service/ipxe/init.xz
And it looks to be pulling that file for a while… then there are a few retransmission errors, then it just seems to disappear.
It ALMOST looks like it’s being interrupted by other computers communicating with FOG. Because, one moment it’s communicating, and the next some other traffic comes in from 10.163.94.76 and then 10.162.3.26 and 10.163.16.48… and all goes silent from 10.162.30.58
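For anyone following along in the capture, here is a minimal Python sketch of how the TFTP packets described above are laid out per RFC 1350: a read request (opcode 1) carries a filename and a mode, and an error (opcode 5) carries a code and a message. The helper name `describe_tftp` and the sample bytes are made up for illustration; they are not taken from the pcap.

```python
import struct

# TFTP opcodes from RFC 1350 (OACK comes from RFC 2347).
OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR", 6: "OACK"}


def describe_tftp(payload: bytes) -> str:
    """Return a one-line summary of a TFTP UDP payload (hypothetical helper)."""
    if len(payload) < 2:
        return "too short to be TFTP"
    (opcode,) = struct.unpack("!H", payload[:2])
    name = OPCODES.get(opcode, f"unknown({opcode})")
    if opcode in (1, 2):
        # Filename and mode are NUL-terminated strings; options may follow.
        fields = payload[2:].split(b"\x00")
        filename = fields[0].decode("ascii", "replace")
        mode = fields[1].decode("ascii", "replace") if len(fields) > 1 else ""
        return f"{name} file={filename} mode={mode}"
    if opcode == 5 and len(payload) >= 4:
        # ERROR: 2-byte error code followed by a NUL-terminated message.
        (code,) = struct.unpack("!H", payload[2:4])
        message = payload[4:].split(b"\x00")[0].decode("ascii", "replace")
        return f"{name} code={code} msg={message}"
    return name


# Made-up sample payloads, not bytes from the capture.
print(describe_tftp(b"\x00\x01undionly.kpxe\x00octet\x00"))
print(describe_tftp(b"\x00\x05\x00\x00example abort\x00"))
```

An ERROR right after the first read request, followed by a successful retry, can simply be the PXE ROM asking for the file size and then aborting the transfer. Whether that is what happened here is an assumption; the error code in the screenshot would settle it.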
-
I think we’ll need Tom to answer that one… I’m not sure what the relation is between all the *pxe files.
-
That’s actually the correct method to get the files.
The client gets undionly.kpxe, and undionly then goes on to fetch the tftp://default.ipxe file. At least that is what is supposed to happen.
-
Oh by the way, I was using this filter because that was A BIG file…
[CODE]eth.dst == 00:25:84:01:ff:c0 || eth.src==00:25:84:01:ff:c0 && DHCP || TFTP[/CODE]
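One caution on that filter: how an unparenthesized mix of || and && gets grouped can depend on the Wireshark version, so explicit parentheses are safer. Below is a small Python sketch that builds the grouped string, assuming the intent was "traffic to or from this MAC that is also DHCP or TFTP". The helper name `host_filter` is hypothetical, and the protocol tokens are copied from the post as-is; the exact spelling Wireshark accepts may differ by version.

```python
from typing import Sequence


def host_filter(mac: str, protocols: Sequence[str]) -> str:
    """Build a display-filter string with explicit grouping (hypothetical helper)."""
    # Match frames going to or coming from the given MAC address.
    host = f"(eth.dst == {mac} || eth.src == {mac})"
    # Restrict to any one of the listed protocols.
    protos = "(" + " || ".join(protocols) + ")"
    return f"{host} && {protos}"


print(host_filter("00:25:84:01:ff:c0", ["DHCP", "TFTP"]))
```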
-
Well, here’s what I’m seeing…
Your FOG server and your FOG storage nodes are on the same segment. All the computers in all the buildings use this ONE fog server.
The problem computer works with that fog server in building B, but not in building A.
Additionally, it’s just these 60 machines…
Hmm…
Let’s do a test…
Grab a computer that works fine, one from another part of building A.
Take it to where those 60 machines are (the problem ones) and plug it in using one of their network ports.
[U]See if it works.[/U] If it doesn’t, you’ve pinpointed the switch as the problem, or the trunk config for wherever its uplink goes.
-
ALSO,
Take one of the problem computers and plug it into a network port that a working computer was using. See if it works there. If it does, again you’ve pinpointed the switch as the issue.
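The two swap tests can be read as a small decision table. Here is a rough Python sketch of that reasoning; the function name `localize` and the wording of the outcomes are illustrative, and the last two outcomes are an interpretation rather than something stated above.

```python
def localize(good_pc_on_suspect_port_works: bool,
             problem_pc_on_good_port_works: bool) -> str:
    """Map the results of the two swap tests to a likely fault area (a sketch)."""
    if not good_pc_on_suspect_port_works and problem_pc_on_good_port_works:
        # The fault stays with the port, not the machine.
        return "network path: suspect switch or its uplink/trunk"
    if good_pc_on_suspect_port_works and not problem_pc_on_good_port_works:
        # The fault follows the machine wherever it is plugged in.
        return "client side: the problem machines themselves"
    if good_pc_on_suspect_port_works and problem_pc_on_good_port_works:
        # Assumption: only the original machine/port pairing fails.
        return "interaction: only this machine on this path fails"
    # Assumption: both swaps failing points at something shared.
    return "inconclusive: both swaps fail, check something shared"


print(localize(False, True))
print(localize(True, False))
```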
-
Also a thing to try is using the realtek.pxe file. There was an issue with rtl8169 cards with the eeprom that caused all kinds of weird issues.
-
[quote=“SeqSupport@Edkey, post: 46075, member: 27616”]Also a thing to try is using the realtek.pxe file. There was an issue with rtl8169 cards with the eeprom that caused all kinds of weird issues.[/quote]
But the computers work fine with the same FOG server, but different physical location… I’m not sure the realtek.pxe file would help, but I’m sure it wouldn’t hurt anything to try.
-
[quote=“Wayne Workman, post: 46077, member: 28155”]But the computers work fine with the same FOG server, but different physical location… I’m not sure the realtek.pxe file would help, but I’m sure it wouldn’t hurt anything to try.[/quote]
I just threw it out there because we have had issues with certain older managed/unmanaged switches not liking tftp and our new rtl8169 nics.
-
[QUOTE]Grab a computer that works fine, one from another part of building A.
Take it to where those 60 machines are (the problem ones) and plug it in using one of their network ports.[/QUOTE]
I did try this earlier - that computer works fine on the same port as the problem computer. It’s so weird - like, everything points to it being a switch port issue, but then I take another machine to that same port and it’s okay.
[QUOTE]Also a thing to try is using the realtek.pxe file. There was an issue with rtl8169 cards with the eeprom that caused all kinds of weird issues.[/QUOTE]
I’ll give this a try hopefully soon. Both of those labs are full at the moment with classes.
-
Do the reverse when you can.
Take one of the problem computers, move it to a known-good port in another part of the building. One that’s [B]not[/B] connected to the same switch that the other 60 are on.
-
Test done - realtek.pxe and realtek.kpxe both do the same thing - link down, network starting, and then link up.
[QUOTE]Take one of the problem computers, move it to a known-good port in another part of the building. Preferably one that’s not connected to the same switch that the other 60 are on.[/QUOTE]
I did that earlier too, guess I forgot to mention that. Problem computer in Building A, moved to another room on another switch, same behavior.
-
Then the problem must be the top-level device in building A.
Because the Lenovo M72e Tiny works in Building “B”, with FOG server/nodes
But not in Building “A” with the [B]same[/B] FOG server/nodes.
YET, other clients in building “A” work with the [B]same[/B] FOG server/nodes; even if they are connected to the Lenovo M72e Tiny’s network ports.
I’m very sure this is a network issue.
-
I’d like to believe that and it really does appear to be a network issue. But what I don’t understand is the issue started happening immediately after the update.
2 weeks ago I was on r2961 and imaged the lab without a problem. This week I updated to r3287 and this problem appears. Nothing else changed - no switch configurations, no dhcp server changes. The only variable that changed here is the revision of fog.
I think tomorrow I’ll fire up another virtual server and install r2961. I’d like to have that running and temporarily point dhcp option 66 to the new server just to (hopefully) prove or disprove that the network is still functioning correctly. I’ll let you know what I find.
-
If it does work, we should slowly iterate up in the revisions until we find where it breaks.
This will help the developers with creating a fix.
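Stepping up one revision at a time would work, but a binary search gets there in fewer installs. The Python sketch below is a swapped-in shortcut, not an existing FOG tool: it assumes that once a revision is broken, every later revision is broken too. `first_bad_revision` and `fake_is_bad` are hypothetical names, and the break at r3100 is made up purely for the demo; in practice the check would be "install that revision and try to image a problem machine".

```python
from typing import Callable, List


def first_bad_revision(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Return the lowest revision in (good, bad] for which is_bad is True.

    Assumes `good` works, `bad` fails, and every revision after the
    first failing one also fails.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # the break is at mid or earlier
        else:
            good = mid  # the break is after mid
    return bad


tested: List[int] = []


def fake_is_bad(rev: int) -> bool:
    """Stand-in for a real test; pretends the break landed in r3100 (made up)."""
    tested.append(rev)
    return rev >= 3100


print(first_bad_revision(2961, 3287, fake_is_bad))
print(f"revisions tested: {len(tested)}")
```

Between r2961 and r3287 this needs at most nine test installs instead of a few hundred.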
-
What’s most intriguing to me is that it doesn’t work at one level but works fine elsewhere?
-
[QUOTE]What’s most intriguing to me is that it doesn’t work at one level but works fine elsewhere?[/QUOTE]
That’s very interesting to me too. And yet there are other machines connected to the same switch that are working, exact same port configurations.
So the latest update: I did a fresh install of Ubuntu 12.04 and fog r2961. Much to my surprise, I’m seeing the same behavior - the network is starting before the link is up, and it’s not getting an IP.
“Downloading kernels and inits…OK” Is it downloading the latest and greatest kernel/init, no matter which revision I’m on? I did go into settings and revert back to a kernel from February, but same results. What does the init.xz file do, and is there a way I can get a version of that from back in February?