SVN 4380 Cloud 5419 (on Ubuntu 14.04.3) Fog not consistently tftp booting from location
-
Are you using the location plugin? Tftp enabled is only for locations and it is not truly tftp directing. All that option does is tell the host to get its bzImage and init.xz from its specified location. What plugins do you have and what are their relevant entries?
-
@Tom-Elliott Yes, the location plugin is installed. bzImage and init.xz being pulled from the specified location is exactly something that I desire, so I’m not off my rocker (yet) so far, even if my initial understanding of what Tftp enabled was a bit off!
The storage nodes referenced by a given location are local to all clients in that location. Unfortunately, a client sometimes boots and pulls down bzImage and init (and then the image) from the master server, which is not local to that client.
Location plugin is the only one installed, and it lists its location as “…/lib/plugins/location/”
-
@Malos said:
… sometimes … sometimes … sometimes …
Are you able to reproduce under which circumstances clients boot from the right/wrong server? To me this sounds like there are several DHCP servers offering information to the clients. Sometimes they get the “correct” info first but sometimes not.
Are you willing and able to hook a hub in front of one of your clients and capture the traffic using wireshark/tcpdump? I’d be really interested to see the packet dump. Hopefully we can figure things out this way.
-
Got something concrete that I picked up on finally!
When a host with a pending task boots, it pulls down bzImage and init (and then the image) from the master server. If I then shut off the host halfway through the task and reboot it, it pulls bzImage and init/image from the correct storage node location. Rinse and repeat: the next reboot pulls from the master again, the one after that from the node, and so on.
This flipping action happens very consistently once the task has started.
OK! Now, if I power off the host before the bzImage and init pulldown finishes (so, before the screen flashes and clears over to the imaging process itself), booting the host again pulls everything down from the same server as the previous boot, with no flip. So it’s almost like something gets toggled on the database side of things, perhaps in the task itself right as imaging kicks off, and that might be causing this?
-
@Sebastian-Roth
It looks like whatever is causing the flip is changing the taskNFSMemberID column in the task (in tasks table) to 1 (the master server in my case) or to 3 (the correct location storage node).
I would be willing to capture a dump somehow if you feel it would be helpful, but I’m very certain that there are not multiple DHCP servers, as there’s no other noted issues in my environment.
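For anyone wanting to confirm the same pattern, here is a minimal Python sketch of how a series of taskNFSMemberID readings, noted down after each reboot, could be checked for flips. The function name `find_flips` and the sample readings are hypothetical and only illustrate the 1/3 alternation described above; this is not FOG code.

```python
def find_flips(readings):
    """Return (index, previous, current) for every change between
    consecutive taskNFSMemberID readings."""
    flips = []
    for i in range(1, len(readings)):
        if readings[i] != readings[i - 1]:
            flips.append((i, readings[i - 1], readings[i]))
    return flips


# Hypothetical readings taken after four successive reboots of one host:
# 1 = master server, 3 = location storage node (IDs as in this thread).
readings = [1, 3, 1, 3]

for index, previous, current in find_flips(readings):
    print(f"reboot {index}: node changed from {previous} to {current}")
```

A stable task would yield an empty list, while the behaviour seen here yields a flip on every reboot.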
-
@Malos said:
I would be willing to capture a dump somehow if you feel it would be helpful, but I’m very certain that there are not multiple DHCP servers, as there’s no other noted issues in my environment.
https://wiki.fogproject.org/wiki/index.php/Troubleshoot_TFTP
There are steps in there for doing a capture on the fog server. But, since we’re looking at DHCP specifically - you can simply do a capture with Wireshark using any computer that is connected to the same network that you’ll be booting the trouble-host on. The capturing computer will hear all of the broadcast messages on the network and that’s what Sebastian was wanting to look at.
-
@Wayne-Workman Thanks for explaining and pointing this out. We’d actually see all the broadcasts and don’t really need a hub. You are right. But @Malos’s findings sound pretty reasonable (reproducible) and I think we better have a look down that alley before picking up the big gun.
-
I’ve poked through the code a little and to me it seems like things might go wrong here: lib/reg-task/TaskQueue.class.php
But I don’t know enough about the PHP code and @Tom-Elliott needs to have a look I suppose.
-
@Sebastian-Roth I found and fixed the issue last night. Thanks for pointing that out, but this particular problem was related to the change items hook of the location plugin and the location association class. The first half of the problem was that I was trying to get the storage node from the association, which doesn’t maintain the node or group information. The other half was that the storage group was getting the list of all enabled nodes, not all enabled nodes that are within its group. This should be fully fixed now.
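To illustrate the second half of that bug, here is a minimal Python sketch, not the actual FOG PHP code. The `StorageNode` class, both function names, and the group IDs are hypothetical; only the node IDs 1 (master) and 3 (location node) come from this thread.

```python
from dataclasses import dataclass


@dataclass
class StorageNode:
    node_id: int
    group_id: int
    enabled: bool


def enabled_nodes_unfiltered(nodes, group_id):
    # Behaviour as described for the bug: group_id is ignored,
    # so every enabled node is returned regardless of its group.
    return [n for n in nodes if n.enabled]


def enabled_nodes_in_group(nodes, group_id):
    # Behaviour as described for the fix: only enabled nodes
    # that belong to the requested group are returned.
    return [n for n in nodes if n.enabled and n.group_id == group_id]


# Hypothetical layout: group 1 holds the master, group 2 the location nodes.
NODES = [
    StorageNode(node_id=1, group_id=1, enabled=True),
    StorageNode(node_id=3, group_id=2, enabled=True),
    StorageNode(node_id=4, group_id=2, enabled=False),
]

print([n.node_id for n in enabled_nodes_unfiltered(NODES, 2)])
print([n.node_id for n in enabled_nodes_in_group(NODES, 2)])
```

With the unfiltered list, the master is a candidate for a host whose location points at group 2, which would be consistent with tasks landing on node 1 some of the time.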
-
@Tom-Elliott Confirmed, tasks are pulling down the boot files and image data from the correct node consistently, and updating nodes pulls from the newly set node as well. Awesome work, thanks!