SVN 4380 Cloud 5419 (on Ubuntu 14.04.3) Fog not consistently tftp booting from location
-
Not 100% sure if this is a “bug report” or not, and not sure how to even write this one out as I like to provide as much info as possible but…
I’m having the darndest time getting a fog client to TFTP boot (and more importantly, pull down the image itself locally) consistently from a storage node in a remote subnet. Sometimes boot.php on the master install will direct clients to image from the local storage node which is 100% the desired behavior, and sometimes boot.php will direct the client to pull down bzimage etc from the master server.
On each storage node, I am replacing the storage node IP in the chain line of /tftpboot/default.ipxe with the IP of the master install. This is per some documentation I read once upon a time for an older version of fog, and it’s worked fine up until recently.
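For readers following along, the edit described above can be sketched as a small script. This is only an illustration: the sample file contents, both IP addresses, and the helper name `point_chain_at` are assumptions, not taken from the poster's setup, and a real /tftpboot/default.ipxe will contain more lines than shown.

```python
import re

def point_chain_at(ipxe_text: str, master_ip: str) -> str:
    """Replace the host in every 'chain http://HOST/...' line with master_ip."""
    pattern = re.compile(r"^(chain\s+https?://)([^/\s]+)(/.*)$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + master_ip + m.group(3), ipxe_text)

# Hypothetical file contents; 10.2.0.5 stands in for the storage node's own IP.
sample = """#!ipxe
chain http://10.2.0.5/fog/service/ipxe/boot.php##params
"""

# Hypothetical master server IP.
updated = point_chain_at(sample, "10.1.0.1")
print(updated)
```

Only the host part of the chain line changes, so the client still TFTP boots from the local node but asks the master's boot.php what to do next.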
Poking around the database in the location table, I’m noting that the lTftpEnabled column entries all contain o (as in lowercase orange, not zero) for locations that appear as TFTP Boot Enabled in the GUI, rather than 1 which is what I would expect?
In fact, changing this “o” to any value at all appears to retain the TFTP Boot Enabled setting in the GUI. Changing this value to 0 does change TFTP Boot Enabled in the GUI to “unchecked” as one would expect. Unchecking it in the GUI and saving the change actually blanks out this field, rather than changing it to zero.
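One possible explanation for this pattern, offered as an assumption rather than a confirmed reading of the FOG code: if the GUI evaluates the column with PHP's loose boolean conversion, then any non-empty string other than "0" counts as true. The sketch below mimics that rule in Python; `php_truthy` is a hypothetical helper name.

```python
def php_truthy(value):
    """Approximate PHP's loose boolean conversion for a string column value.

    In PHP, null, the empty string and the string "0" are false;
    every other string is true.
    """
    return value not in (None, "", "0")

# Values like the ones tried in the location table.
for stored in ["o", "1", "anything", "0", ""]:
    state = "checked" if php_truthy(stored) else "unchecked"
    print(repr(stored), "->", state)
```

Under that assumption, "o", "1" and arbitrary text all show as checked, while "0" and a blanked field both show as unchecked, which matches what is described above.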
I don’t know if the database oddity noted is relevant at all, but the expected behavior from my end is if the location is set for a host, then it needs to pull boot and image files strictly from that local tftp-enabled storage node.
Sorry if this is rambling. I’m having a difficult time nailing down problem duplication steps because it doesn’t seem to be acting consistently either way.
-
Are you using the location plugin? Tftp enabled is only for locations and it is not truly tftp directing. All that option does is tell the host to get its bzImage and init.xz from its specified location. What plugins do you have and what are their relevant entries?
-
@Tom-Elliott Yes, the location plugin is installed. bzImage and init.xz being pulled from the specified location is exactly something that I desire, so I’m not off my rocker (yet) so far, even if my initial understanding of what Tftp enabled does was a bit off!
Each storage node referenced by a given location is local to all clients in that location. Unfortunately, sometimes the client boots and pulls down bzimage and init (and then the image) from the master server, which is not local to that client.
Location plugin is the only one installed, and it lists its location as “…/lib/plugins/location/”
-
@Malos said:
… sometimes … sometimes … sometimes …
Are you able to reproduce under which circumstances clients boot from the right/wrong server? To me this sounds like there are several DHCP servers offering information to the clients. Sometimes they get the “correct” info first but sometimes not.
Are you willing and able to hook a hub in front of one of your clients and capture the traffic using wireshark/tcpdump? I’d be really interested to see the packet dump. Hopefully we can figure things out this way.
-
Got something concrete that I picked up on finally!
When a host with a pending task boots, it pulls down bzimage and init (and then the image) from the master server, and then if I shut off the host halfway through the task and reboot it, it boots and pulls bzimage and init/image from the correct storage node location, rinse repeat and pulls bzimage and init/image from master, rinse repeat from node etc.
This flipping action happens very consistently once the task has started.
OK! Now, if I power off the host before the bzimage and init pulldown finishes (so, before the screen flashes and clears over to the imaging process itself), booting the host again will pull everything down from the same server just as before. So it’s almost like something gets toggled in the database side of things, perhaps in the task itself right as the image kicks off that might be causing this?
-
@Sebastian-Roth
It looks like whatever is causing the flip is changing the taskNFSMemberID column in the task (in tasks table) to 1 (the master server in my case) or to 3 (the correct location storage node).
I would be willing to capture a dump somehow if you feel it would be helpful, but I’m very certain that there are not multiple DHCP servers, as there’s no other noted issues in my environment.
-
@Malos said:
I would be willing to capture a dump somehow if you feel it would be helpful, but I’m very certain that there are not multiple DHCP servers, as there’s no other noted issues in my environment.
https://wiki.fogproject.org/wiki/index.php/Troubleshoot_TFTP
There are steps in there for doing a capture on the fog server. But, since we’re looking at DHCP specifically - you can simply do a capture with Wireshark using any computer that is connected to the same network that you’ll be booting the trouble-host on. The capturing computer will hear all of the broadcast messages on the network and that’s what Sebastian was wanting to look at.
-
@Wayne-Workman Thanks for explaining and pointing this out. We’d actually see all the broadcasts and don’t really need a hub. You are right. But @Malos’s findings sound pretty reasonable (reproducible) and I think we better have a look down that alley before picking up the big gun.
-
I’ve poked through the code a little and to me it seems like things might go wrong here: lib/reg-task/TaskQueue.class.php
But I don’t know enough about the PHP code and @Tom-Elliott needs to have a look I suppose.
-
@Sebastian-Roth I found and fixed the issue last night. Thanks for pointing that out, but for this particular problem it was related to the change items hook of the location plugin and the location association class. The problem was I was trying to get the storage node from the association, which doesn’t maintain the node or group information. The other half of it was the storage group was getting the list of all enabled nodes, not all enabled nodes that are within its group. This should be fully fixed now.
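The second half of that fix can be illustrated with a toy model. This is not FOG's PHP code; the `Node` class, the group numbers, and both function names are made up for illustration. Only the node IDs 1 (master) and 3 (location storage node) come from the thread.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    group_id: int
    enabled: bool

# Hypothetical layout: node 1 is the master, node 3 the remote storage node.
NODES = [
    Node(node_id=1, group_id=1, enabled=True),
    Node(node_id=3, group_id=2, enabled=True),
    Node(node_id=4, group_id=2, enabled=False),
]

def enabled_nodes_buggy(nodes, group_id):
    """Returns every enabled node and ignores which group it belongs to."""
    return [n.node_id for n in nodes if n.enabled]

def enabled_nodes_fixed(nodes, group_id):
    """Returns only the enabled nodes that belong to the requested group."""
    return [n.node_id for n in nodes if n.enabled and n.group_id == group_id]

print(enabled_nodes_buggy(NODES, group_id=2))  # [1, 3]
print(enabled_nodes_fixed(NODES, group_id=2))  # [3]
```

With the unfiltered list, a task for a host in group 2 can end up on either node 1 or node 3, which would be consistent with the taskNFSMemberID flipping reported earlier in the thread.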
-
@Tom-Elliott Confirmed, tasks are pulling down the boot files and image data from the correct node consistently, and updating nodes pulls from the newly set node as well. Awesome work, thanks!