No configuration methods succeeded.

JLE

Server

FOG Version:1.5.0-RC-2
OS: CentOS 7

Client

Service Version: 11.12
OS: Windows 10

Description

Random computers give this error when you turn them on.
This is not an extremely pressing issue as a few reboot cycles always seem to fix it, but it would be nice to know what is causing it.

Here’s some info about the setup:
1.) All Dell Optiplex 990s
2.) All latest BIOS versions.
3.) All in the same LAN (vlan 300)
4.) All set to legacy boot with NIC at the top, then HDD
5.) Spanning-tree is set to rapid mode on all switches. (c2960x)

There seems to be no pattern to the error - the same computer can successfully boot and then the very next time give this error. Then you can restart that one and it will succeed the next time. This leads me to believe it might be something network related and not a FOG setting? Any ideas on what I could check?

JLE

I ran this command after it failed and am now trying to track down why there is such a high RXE value.

george1421

Can you place a dumb (unmanaged) switch between the pxe booting computer and your building switch? This (on the surface) appears to be a spanning tree issue, where you don’t have one of the fast STP protocols enabled. The dumb switch will help us test this idea.

[edit] well now that I look at it, it may NOT be a spanning tree issue. I would still do that just so we can rule out a networking issue. But the randomness still makes me think spanning tree.

JLE

@george1421 Just put a little 5 port D-Link in place and rebooted the computer a few times. It has not failed yet. The computers in this room are connected to the second member of a 2960x cisco stack. The stack master has spanning-tree mode set to rapid. I guess I will try to make sure that each member also has the spanning-tree mode set that way? I thought that the master would set it for all.

george1421

@jle I can’t speak for the cisco switches as for the configuration. But it sounds like you did find the root of the issue.

The difference between default spanning tree and the fast protocols is in how they check for a loop: pessimistic vs optimistic checking. With default spanning tree, the port listens for up to 27 seconds for a BPDU packet and only then starts forwarding data. With the fast spanning tree protocols, the port starts forwarding data right away and then checks for the BPDU packet. FOG / FOS boots so fast that by the time standard spanning tree starts to forward data, FOS has already given up on the booting process.
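To make the timing argument concrete, here is a minimal sketch. The 27-second delay is the figure from the post above; the client retry schedule is purely hypothetical and not measured from any PXE ROM or iPXE build.

```python
def first_attempt_after_forwarding(forward_delay_s, attempt_times_s):
    """Return the time of the first DHCP attempt sent once the port is
    forwarding, or None if the client gives up before that happens."""
    for t in attempt_times_s:
        if t >= forward_delay_s:
            return t
    return None


# Hypothetical client: sends a DHCP Discover at these seconds after
# link-up and gives up after the last one.
attempts = [0, 1, 3, 7, 15]

# Standard spanning tree: the port blocks traffic for the whole delay.
print(first_attempt_after_forwarding(27, attempts))  # no attempt gets through

# Port forwarding immediately (fast protocol / edge port): first attempt works.
print(first_attempt_after_forwarding(0, attempts))
```

With these assumed numbers every Discover is sent while the port is still blocked, which is the failure mode described above. A dumb switch in between hides the link drop from the managed switch, so its port stays in the forwarding state.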

JLE

@george1421 Well, I think I have it figured out. Thanks for the kick in the right direction. The problem was apparently that the primary stp root for vlan 300 was a completely different switch in a different building. I set the main switch here in this building (where vlan 300 is anyways) to be the primary root and so far every one I’ve tested works. I am about to image a few labs so that will be the real test.

george1421

@jle Sweet, let us know how it goes.

You wouldn’t have thought it was the root bridge causing the issue. But if the root bridge is on the other side of a slow link, it could cause that type of response. We typically set the core switch with a priority of 4095 to ensure only the core switch is ever the stp root bridge.
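As a rough illustration of why setting the priority matters: spanning tree elects the bridge with the lowest bridge ID as root, comparing priority first and MAC address second. The sketch below assumes made-up switch names, MAC addresses and priority values; it is not taken from the network in this thread.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer so MACs compare numerically."""
    return int(mac.replace(":", ""), 16)


def elect_root(bridges):
    """bridges: list of (name, priority, mac).
    The lowest (priority, MAC) pair wins the root election."""
    return min(bridges, key=lambda b: (b[1], mac_to_int(b[2])))[0]


# Hypothetical switches, all left at the same default priority:
# the lowest MAC wins, which may be a switch in another building.
defaults = [
    ("core", 32768, "00:1a:2b:00:00:10"),
    ("other-building", 32768, "00:1a:2b:00:00:01"),
    ("lab-stack", 32768, "00:1a:2b:00:00:20"),
]
print(elect_root(defaults))

# Lowering the core's priority makes it win regardless of MAC address.
tuned = [("core", 4096, "00:1a:2b:00:00:10")] + defaults[1:]
print(elect_root(tuned))
```

With every switch at the same priority the election falls through to the MAC address, so the root can end up somewhere nobody intended.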

JLE

@george1421 Well we still have some kind of problem.

I am working with one machine now to try and solve it. It keeps getting the “no configuration methods succeeded message”. Placing the D-Link switch in between it and the cisco switch fixes the problem.

I traced the physical port back to the switch to be 100% sure, then logged in and checked the settings. It is part of vlan 300 with the stp mode set to rapid-pvst. It has portfast enabled.

It definitely seems like the switch is taking too long. Portfast is supposed to immediately put the port into a state of forwarding according to everything I have read but this PC flies right through the menu with a D-Link switch in place…

To complicate matters, when the D-Link is not in place there isn’t a 100% failure rate - it succeeds some times.

Does anyone have any ideas on what I could check next? For the time being I am reading up on c2960x stacks and STP, but I am considering just turning it off completely on something. Maybe make a new VLAN, turn STP off for it, set the switches to it in order to deploy my images and then revert back to normal afterwards.

george1421

@jle I don’t know the cisco switches, but typically you turn on rstp globally and then for each port you can define it as rstp or stp, or turn off stp on that port alone.

I don’t think anyone on the forums is a cisco network engineer to help. You might get a better quality answer if you post on the Spiceworks Community.

But since the d-link is keeping the cisco from seeing the network link wink (drop and come back up) as the target computer pxe boots, I’m almost 100% positive it’s a spanning tree issue with the port.

Sebastian Roth

@JLE Keep your eyes open for something called “port fast”. This tells the switch to skip the spanning tree listening and learning delay on this one particular port and start forwarding right away. You do not want this setting for all ports but just for the ones where you are absolutely sure that you have only clients/servers connected and never ever another switch that could cause you a loop. RSTP (rapid spanning tree) should work in most cases as well.

JLE

I checked the switch’s config files. Portfast is enabled globally, as is the spanning-tree mode (rapid-pvst). I have found something else out that seems a little odd to me. On the machines that are giving this error, when it gets to the “hit s to enter the shell” part I can do that, give iPXE a static IP, and then ping the DHCP server just fine.

Also, while turning on one of these machines and watching in the DHCP leases on the server I can indeed see that it does create the lease - but the odd thing is that by the time it gets to “no configuration methods succeeded” the lease is gone from the server.

@george1421 I am going to put together a detailed diagram complete with configs and network info and maybe something will jump out. I’ll probably toss that up on Spiceworks too.

Sebastian Roth

@JLE Sounds interesting! Try again and when you get to the iPXE shell just run the following commands:

ifstat
...
dhcp
...

Maybe it just needs a little bit of delay.

Quoting @JLE: “Also, while turning on one of these machines and watching in the DHCP leases on the server I can indeed see that it does create the lease - but the odd thing is that by the time it gets to ‘no configuration methods succeeded’ the lease is gone from the server.”

This is because the client does DHCP in several rounds. The first one is from the PXE ROM burned into the NIC. Second time would be from iPXE after downloading it via TFTP. And this still fails.

It would be best to get a packet dump of this. More often than not we see something in that dump that does the trick. See George’s instructions: https://forums.fogproject.org/topic/9673/when-dhcp-pxe-booting-process-goes-bad-and-you-have-no-clue

george1421

@sebastian-roth That reminds me, there are the iPXE kernels with a 10 second delay. I don’t know what conditions these were created for. Do you remember? I just checked and they are still in FOG 1.4.4 in the /tftpboot/10secdelay directory.

I think it’s important to find out why this condition exists in the networking infrastructure. It’s not normal or expected, but if all options fail, these iPXE kernels may be the solution.

JLE

Sweet deal. I updated to the latest trunk build, set the boot file on the dhcp server to the 10 second ipxe.kpxe file and now everything boots.

Is there an easy way for me to adjust that delay? Say…make it 5 seconds?

Something else I noticed:

The computers that were continuously failing with “no configuration method succeeded” were filling up the DHCP lease with bad address entries…

I tried pinging those IPs from everything I could just to make sure they’re not static on something (nothing should be outside of the management vlan here).

Another weird thing is that now that I am using the 10s delay to boot they do not fill up the dhcp scope anymore with bad addresses. They even get the old IP that was otherwise “bad” in the previous case. ?.?

george1421

@jle Please understand you didn’t fix the underlying networking issues, you only masked them by artificially slowing down the pxe booting process. The underlying condition is still there (and has probably always been there). Will this work for you? Probably.

So you have to decide: is it working well enough for your needs?

JLE

@george1421 Yeah, I know. I will still make that diagram outlining the problem as best I can and post it along with some wireshark data. I tried digging through some capture data following the dhcp discovery/offer/request/acknowledge and they were all there for the client and all of the numbers looked good.

JLE

Wireshark data posted privately. Some info about the data:
The problematic client is getting a dhcp address successfully of 10.241.96.20. (It says it for PXE and I can see it pop up on the DHCP server.)
I let the capture run from a computer right beside it (also on the same vlan and switch). The capture ran for two loops of “no configuration methods succeeded”. There are a lot of TCP retransmissions coming from the fog server.

To simplify matters early on I unplugged 3 of the members of the fog’s NIC team so it only has one right now and I am pretty sure the capture caught the tail end of a computer lab imaging.

I see the “malformed” dhcp packets, but I have no idea what is causing them… aside from laughing at their name I am reading up on the topic.

Sebastian Roth

@JLE That packet dump looks quite interesting. Something I have not come across in a long time. I am trying to write down what I see in the PCAP to hopefully make sense of it, as I don’t see what’s wrong yet. Maybe George can add to that as well.

First the PXE ROM of the NIC sends a DHCP Discover and does not get an answer. So 8 seconds later it sends another Discover (same information but just a new DHCP transaction ID). The second DHCP Discover is answered with a DHCP Offer very promptly (delay only ms). Transaction IDs of the second Discover and the following Offer match so the answer is definitely not a delayed one to the first Discover. Question: Why is the first DHCP Discover not answered? (this is happening again later on)
As far as I can see the DHCP Offer looks good (next-server and filename set properly).
Now the client is quiet for 16 seconds before sending a DHCP Request packet to complete the DHCP communication. This Request packet is promptly answered by a proper looking DHCP ACK. So client is finally happy I suppose.
Then I suspect the TFTP transfer to happen which was not captured. See the next bullet point.
Another 12 seconds after the first DHCP DORA (Discover, Offer, Request, ACK) finished I see a new DHCP Discover from that client. This time option 175 is set which is a clear sign for iPXE sending the packet. And the same thing is happening again. No answer from the DHCP server for 8 seconds and the client (now iPXE instead of the PXE ROM) sends another DHCP Discover which is promptly answered with a fine DHCP Offer.
After the Offer the client sends a third DHCP Discover and then a DHCP Request just a second later. I think this is where things start to go really wrong. I suppose iPXE is very confused about the DHCP server only answering the second DHCP Discover (matching transaction ID). I haven’t checked the iPXE code yet but I guess this is something unexpected now causing an issue in your case.
Following that is a row of DHCP Request packets from the client which are all answered by DHCP NAK (non ACK!) packets. So the DHCP server declines to give that offered IP information to the client. The result is the “No configuration methods succeeded” message in iPXE.
In the packet dump I see the same thing happening again a minute later. But one thing is different this time. The very first DHCP Discover sent by the client’s PXE ROM is answered within one second this time. But for the DHCP Discover sent by iPXE I see again the exact same behavior as described above.
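The check described above, pairing each Discover with an Offer by transaction ID, can be sketched in a few lines. The packet tuples below are hypothetical values shaped like the first round of the walkthrough, not data from the actual capture, and the function name is made up for illustration.

```python
def unanswered_discovers(packets):
    """packets: list of (time_s, msg_type, xid) tuples in capture order.
    Returns the (time_s, xid) of every Discover whose transaction ID
    never shows up in an Offer."""
    offered = {xid for _, kind, xid in packets if kind == "OFFER"}
    return [
        (t, xid)
        for t, kind, xid in packets
        if kind == "DISCOVER" and xid not in offered
    ]


# Hypothetical trace: first Discover ignored, second one answered promptly.
trace = [
    (0.00, "DISCOVER", 0x1A),
    (8.00, "DISCOVER", 0x1B),
    (8.01, "OFFER", 0x1B),
    (24.00, "REQUEST", 0x1B),
    (24.01, "ACK", 0x1B),
]
print(unanswered_discovers(trace))  # the Discover at t=0.0 has no matching Offer
```

Running the same check on captures taken at both the client port and the server port would show whether the first Discover is lost on the way to the server or the Offer is lost on the way back.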

I guess this can be fixed in iPXE but I doubt this is the right place to do so. There is something wrong within your network. Do those first DHCP Discover packets get lost somewhere along the way? Why is the second one answered so promptly then?

Ok, I’ll leave this to you for now. We all need to think about it and I am sure someone will come up with a proper explanation on what’s going wrong here.
