FOGPingHosts Different Subnets (Location Plugin)
-
The FOGPingHosts service is unable to ping any hosts that are connected to a storage node (via location plugin) that is on a different subnet.
All the machines that are on the same subnet as the master FOG server are pinged successfully and show green in the host list.
On the storage node, “systemctl -l status FOGPingHosts” returns “a valid database connection could not be made”. Is this an expected limitation?
Thanks!
-
@markbam Not expected but I can’t tell you for sure which version of FOG had this properly working. Storage node and location plugin setup is something we as developers don’t get to test very often and so we rely on people like you who report things. Thanks!
Please check the database connection information you find in /var/www/html/fog/lib/fog/config.class.php on your storage node. Are you able to connect using those credentials on the command line?
mysql -h x.x.x.x -u root -p
-
I am unable to connect to mysql as root. However, if I change the user to match the config.class.php user information “fogstorage”, it connects.
mysql -h x.x.x.x -u fogstorage -p xxxxxxxxxxxxxxx
Everything else with my setup has been working correctly. It’s just the FOGPingHosts job that doesn’t seem to want to connect.
-
@markbam My fault. Somehow I had root in my head but actually meant fogstorage. If you can connect on the command line using those credentials, there might actually be a bug in FOGPingHosts. I will try to replicate the issue and see what I can find in the next few days.
-
Much appreciated. Thanks!
-
An update:
I brought down and up my entire FOG cluster and now the Storage Node’s FOGPingHosts service is running successfully. However, a different set of hosts show as green.
I restarted a few times and varied the order in which the servers start up. This seems to play a role in which hosts show as green.
It seems like the hosts on the subnet that is first powered on will report to FOG. The rest will not.
-
@markbam Interesting findings. Do you know that FOGPingHosts is not a live thing? The service polls all machines in a loop cycle but then sets itself to sleep for a few minutes before it checks the hosts again. Maybe this would make things look a bit different?!
-
Yup, I understand that it’s just a snapshot in time as it’s a service polled at a specific interval. But I wonder now if the results of the services are conflicting?
How I’m imagining the pseudo logic:
10:30am FOGPingHosts(FOGSERVER) Active
10:30am FOGSERVER(10.0.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Green.
10:30am FOGPingHosts(FOGSERVER) Sleeps
10:32am FOGPingHosts(FOGSTORAGE) Active
10:32am FOGSTORAGE(10.20.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Red.
10:32am FOGPingHosts(FOGSTORAGE) Sleeps
While the services are now asleep, this is the time when I’m viewing the host list from the GUI and only seeing the results of the subnet that was last pinged.
The cycle repeats…
10:40am FOGPingHosts(FOGSERVER) Active
10:40am FOGSERVER(10.0.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Green.
10:40am FOGPingHosts(FOGSERVER) Sleeps
10:42am FOGPingHosts(FOGSTORAGE) Active
10:42am FOGSTORAGE(10.20.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Red.
10:42am FOGPingHosts(FOGSTORAGE) Sleeps
-
@markbam Could you do me a favor? Install Wireshark on WinMachine001 (10.0.0.5) and watch the network packets. You can filter to only display the packets from/to your FOG server using this display filter:
ip.addr == x.x.x.x
(put in the FOG server IP…
-
I’ve installed Wireshark and I’m seeing FOGPingHosts fail at pinging the hosts on the FOGSERVER subnet. What’s odd is that I can successfully ping the hosts manually from the FOGSERVER.
As a sanity check, with Wireshark I can see successful FOGPingHosts pings on the FOGSTORAGE subnet.
-
@markbam What tool do you use to manually ping the hosts? FOG uses TCP port 445 for this check, not a normal ICMP ping.
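To compare like with like, here is a minimal Python sketch of a TCP-based reachability check against port 445. It only illustrates the idea of "try to open a TCP connection instead of sending ICMP"; the function name tcp_port_open and the 2-second timeout are assumptions for this example and are not taken from the FOG source.

```python
import socket


def tcp_port_open(host: str, port: int = 445, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time.

    Hypothetical helper for this thread, not FOG's own implementation.
    """
    try:
        # create_connection performs the TCP handshake; success means the
        # port accepted the connection, which ICMP ping cannot tell you.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Usage (10.0.0.5 is the example client address used in this thread):
# print(tcp_port_open("10.0.0.5"))
```

A host can answer ICMP ping and still fail this check, for example when a firewall blocks port 445, so the two results do not have to agree.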
-
I’ve been using the standard ICMP ping command to test if the hosts are even visible to the servers that I’m using.
-
@markbam Can you use a different tool, one that uses the same method as FOGPingHosts, to check the online state of the hosts? Are you familiar with tools like nmap (Linux) or other TCP port scanners?
-
Yup, I’m familiar with nmap.
I’m definitely seeing a lot of inconsistencies in the results I’m getting. Due to some unknown circumstance, FOG will log the pings from one server and ignore the returns from the others. I haven’t been able to reproduce exactly what triggers it to switch which server it decides to listen to but seems related to startup order.
This effort started as merely “nice to have”. I think I’d have to re-evaluate my network topology into something a bit less complicated in order to get any definitive answers though. I’ll probably revisit this sometime down the road.
Thanks for your time!
-
@markbam Wait a second before you head off. Are you sure the FOGPingHosts service is running properly on all nodes? Just asking because we have an open issue on startup scripts: https://github.com/FOGProject/fogproject/issues/268
Not sure if FOGPingHosts could show the same problem.
“I haven’t been able to reproduce exactly what triggers it to switch which server it decides to listen to but seems related to startup order.”
Can you give a bit more detail on what you mean by that and what exactly you looked at to see this? Maybe some idea pops up in my head when I understand this better.
-
The FOGPingHosts.service shows as active on all the servers.
With Wireshark running on each subnet, I see that FOGPingHosts is trying to ping all of the hosts.
On each subnet, I can see the pings return successfully when it hits a host that is alive.
From there is where the issue manifests: The FOG database only reflects the state of the hosts on a single subnet. Usually that is the subnet of the server that is turned on first (but I can’t 100% reproduce this).
So if a storage node is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
If the FOG server is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
-
@markbam This sounds like one node would overwrite all the states of another node. While I can’t say this is the case I’d really wonder if this is happening. Well, all nodes use one database but…
Would be really great if we can dig deeper to see if this is caused by FOG itself or something within your network.
Have you done MySQL logging yet? It’s fairly simple to set up but the logs will fill up fairly quickly and it will be a real quest to extract the information from them, I am afraid.
I’ve just had a quick look at the code. FOGPingHosts loops over all host objects, pings them one by one and updates the database one by one as well.
Ok, let’s give it a try:
- Choose a time when not many people are at work.
- Stop FOGPingHosts on all your nodes.
- Log in to your MySQL/MariaDB instance as root. Then run:
SET global general_log_file='/tmp/mysql.log';
SET global log_output = 'file';
SET global general_log = on;
- Start FOGPingHosts on all nodes.
- Watch the pinghosts.log on the nodes to see the services are doing work.
Not sure how long is appropriate to wait till you stop the DB logging again. Probably best if you just keep an eye on the log file size (ls -alh /tmp/mysql.log). If it grows above 10 MB quickly (don’t think it will but depends on the activity in your network - clients PXE booting and fog-client checking in) you might switch it off again (mysql shell: SET global general_log = off;).
If you gzip the log file it should still be possible to dig through. Try this first:
grep "UPDATE.*hosts" /tmp/mysql.log > /tmp/mysql_hosts_update.log
If you need assistance digging through the log file you can send me a private message to get my email to send the log file to.
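If the grep output is still hard to read, a small Python sketch like the one below can group the matching statements by the connection that issued them. It applies the same UPDATE.*hosts pattern as the grep above. The assumed line layout ("connection id, then the word Query, then the statement") is my reading of the general query log format and may differ between MySQL and MariaDB versions, and the function name host_updates_by_connection is made up for this example.

```python
import re
from collections import defaultdict

# Assumption: a general query log line looks roughly like
#   <timestamp> <connection id> Query <statement>
# The timestamp may be missing on some lines, so only the part from the
# connection id onwards is matched.
LINE_RE = re.compile(r"(\d+)\s+Query\s+(.*UPDATE.*hosts.*)")


def host_updates_by_connection(lines):
    """Group statements matching UPDATE.*hosts by connection id."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            conn_id = int(match.group(1))
            grouped[conn_id].append(match.group(2).strip())
    return dict(grouped)


# Usage (log path taken from the logging steps above):
# with open("/tmp/mysql.log", errors="replace") as log:
#     for conn_id, statements in host_updates_by_connection(log).items():
#         print(conn_id, len(statements))
```

To tell which connection id belongs to which node, you would still need to look at the matching Connect entries in the same log.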
-
I finally had a chance to go through this.
From what I can gather, it does appear to be conflicting. One server will ping and set the flag to 0, then another will ping and set the flag to 6, and so on back and forth.
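To make the back-and-forth concrete, here is a small Python simulation of two nodes writing into one shared table. It is only a sketch under assumptions: the values 0 and 6 are the flags I saw in the log, but treating 0 as "reachable" and 6 as "not reachable" is my guess, WinMachine101 is a hypothetical host on the storage node subnet, and the rule that every node writes a flag for every host is my reading of the behaviour, not confirmed from the FOG code.

```python
# Assumption: 0 and 6 are the two flag values seen in the query log;
# mapping them to "reachable" / "not reachable" is a guess for this sketch.
UP, DOWN = 0, 6


def run_cycles(hosts, reachable_by_node, cycles=2):
    """Let each node in turn write a flag for every host into one table."""
    table = {}      # stands in for the single shared database
    snapshots = []  # (node, copy of the table) after each node's pass
    for _ in range(cycles):
        for node, reachable in reachable_by_node.items():
            for host in hosts:
                # The node overwrites whatever the previous node stored.
                table[host] = UP if host in reachable else DOWN
            snapshots.append((node, dict(table)))
    return snapshots


hosts = ["WinMachine001", "WinMachine101"]  # WinMachine101 is hypothetical
reachable_by_node = {
    "FOGSERVER": {"WinMachine001"},
    "FOGSTORAGE": {"WinMachine101"},
}
for node, table in run_cycles(hosts, reachable_by_node):
    print(node, table)
```

In this model the flag of every host flips on each pass, and the GUI shows whatever the node that ran last wrote, which would match only one subnet showing green at a time.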
-
@markbam Great work! I guess I now understand the logic of this problem. Moving this topic to bug reports now. Can’t promise you when I will find the time to fix this. In case you or one of your co-workers have PHP skills we could work together on this.
-
@markbam Have not found the time to try and fix this but I have it on my list and will do so.