FOGPingHosts Different Subnets (Location Plugin)
-
An update:
I brought down and up my entire FOG cluster and now the Storage Node’s FOGPingHosts service is running successfully. However, a different set of hosts show as green.
I restarted a few times and varied the order in which the servers start up. This seems to play a role in which hosts show as green.
It seems like the hosts on the subnet that is first powered on will report to FOG. The rest will not.
-
@markbam Interesting findings. Do you know that FOGPingHosts is not a live thing? The service polls all machines in a loop but then puts itself to sleep for a few minutes before it checks the hosts again. Maybe this would make things look a bit different?!
-
Yup, I understand that it’s just a snapshot in time as it’s a service polled at a specific interval. But I wonder now if the results of the services are conflicting?
How I’m imagining the pseudo logic:
10:30am FOGPingHosts(FOGSERVER) Active
10:30am FOGSERVER(10.0.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Green.
10:30am FOGPingHosts(FOGSERVER) Sleeps
10:32am FOGPingHosts(FOGSTORAGE) Active
10:32am FOGSTORAGE(10.20.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Red.
10:32am FOGPingHosts(FOGSTORAGE) Sleeps
While the services are now asleep, this is the time when I’m viewing the host list from the GUI and only seeing the results of the subnet that was last pinged.
The cycle repeats…
10:40am FOGPingHosts(FOGSERVER) Active
10:40am FOGSERVER(10.0.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Green.
10:40am FOGPingHosts(FOGSERVER) Sleeps
10:42am FOGPingHosts(FOGSTORAGE) Active
10:42am FOGSTORAGE(10.20.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Red.
10:42am FOGPingHosts(FOGSTORAGE) Sleeps
-
@markbam Could you do me a favor? Install Wireshark on WinMachine001(10.0.0.5) and watch the network packets. You can filter to only display the packets from/to your FOG server using this display filter:
ip.addr == x.x.x.x
(put in the FOG server IP…
-
I’ve installed Wireshark and I’m seeing FOGPingHosts fail at pinging the hosts on the FOGSERVER subnet. What’s odd is that I can successfully ping the hosts manually from the FOGSERVER.
As a sanity check, with Wireshark I can see FOGPingHosts successfully pinging hosts on the FOGSTORAGE subnet.
-
@markbam What tool do you use to manually ping the hosts? FOG does use TCP port 445, so not normal ICMP ping.
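To test the same kind of reachability by hand, a TCP connect to port 445 is closer to what is described here than an ICMP ping. Below is a minimal sketch of such a check. It is not FOG's own code; the function name `tcp_port_open` and the 2 second timeout are assumptions made for illustration.

```python
import socket


def tcp_port_open(host: str, port: int = 445, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative helper only; name and default timeout are assumptions.
    """
    try:
        # create_connection performs a full TCP handshake, unlike ICMP ping.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Calling `tcp_port_open("10.0.0.5")` from each FOG node would show whether that node can open port 445 on WinMachine001, which can differ from the ICMP result when a firewall treats the two differently.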
-
I’ve been using the standard ICMP ping command to test if the hosts are even visible to the servers that I’m using.
-
@markbam Can you use a different tool to check the online state of the hosts, one that uses the same method as FOGPingHosts? Are you familiar with tools like nmap (Linux) or other TCP port scanners?
-
Yup, I’m familiar with nmap.
I’m definitely seeing a lot of inconsistencies in the results I’m getting. Due to some unknown circumstance, FOG will log the pings from one server and ignore the returns from the others. I haven’t been able to reproduce exactly what triggers it to switch which server it decides to listen to, but it seems related to startup order.
This effort started as merely “nice to have”. I think I’d have to rework my network topology into something a bit less complicated in order to get any definitive answers though. I’ll probably revisit this sometime down the road.
Thanks for your time!
-
@markbam Wait a second before you head off. Are you sure the FOGPingHosts service is running properly on all nodes? Just asking because we have an open issue on startup scripts: https://github.com/FOGProject/fogproject/issues/268
Not sure if FOGPingHosts could show the same problem.
You wrote: “I haven’t been able to reproduce exactly what triggers it to switch which server it decides to listen to but seems related to startup order.”
Can you give a bit more detail on what you mean by that and what exactly you looked at to see this? Maybe some idea pops up in my head when I understand this better.
-
The FOGPingHosts.service shows as active on all the servers.
With Wireshark running on each subnet, I see that FOGPingHosts is trying to ping all of the hosts.
On each subnet, I can see the pings return successfully when it hits a host that is alive.
From there is where the issue manifests: The FOG database only reflects the state of the hosts that are on one subnet. Usually it is the subnet of the server that is turned on first (but I can’t 100% reproduce this).
So if a storage node is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
If the FOG Server is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
-
@markbam This sounds like one node would overwrite all the states of another node. While I can’t say this is the case I’d really wonder if this is happening. Well, all nodes use one database but…
Would be really great if we can dig deeper to see if this is caused by FOG itself or something within your network.
Have you done MySQL logging yet? It’s fairly simple to set up, but the logs will fill up fairly quickly and it will be a real quest to extract the information from them, I am afraid.
I’ve just had a quick look at the code. FOGPingHosts loops over all host objects, pings them one by one and updates the database as well one by one.
Ok, let’s give it a try:
- Choose a time when not many people are at work.
- Stop FOGPingHosts on all your nodes.
- Login to your MySQL/MariaDB instance as root. Then run:
SET global general_log_file='/tmp/mysql.log';
SET global log_output = 'file';
SET global general_log = on;
- Start FOGPingHosts on all nodes.
- Watch the pinghosts.log on the nodes to see the services are doing work.
Not sure how long is appropriate to wait till you stop the DB logging again. Probably best if you just keep an eye on the log file size (ls -alh /tmp/mysql.log). If it grows above 10 MB quickly (don’t think it will but depends on the activity in your network - clients PXE booting and fog-client checking in) you might switch it off again (mysql shell: SET global general_log = off;).
If you gzip the log file it should still be possible to dig through. Try this first:
grep "UPDATE.*hosts" /tmp/mysql.log > /tmp/mysql_hosts_update.log
If you need assistance digging through the log file you can send me a private message to get my email to send the log file to.
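Once the grep output exists, grouping the statements by connection can make it easier to see which node wrote what. The sketch below is only an illustration under stated assumptions: it assumes each general-log line has the shape "optional timestamp, connection id, the word Query, then the statement", and the helper name `updates_by_connection` is made up for this example.

```python
import re
from collections import defaultdict

# Assumed line shape: "<optional timestamp> <connection id> Query <statement>".
# Only UPDATE statements that mention "hosts" are kept, like the grep above.
LINE_RE = re.compile(r"(\d+)\s+Query\s+(UPDATE\b.*\bhosts\b.*)", re.IGNORECASE)


def updates_by_connection(lines):
    """Group host UPDATE statements by the connection id that issued them."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            grouped[int(match.group(1))].append(match.group(2).strip())
    return dict(grouped)
```

Feeding it the lines of /tmp/mysql_hosts_update.log returns a dictionary keyed by connection id. The Connect entries in the full general log show which client opened each connection, so the ids can be mapped back to the FOG server or a storage node.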
-
I finally had a chance to go through this.
From what I can gather, it does appear to be conflicting. One server will ping and set the flag to 0, then another will ping and set the flag to 6, and then back and forth.
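The back and forth can be reproduced with a toy model. This is not FOG's code, just a sketch under assumptions: flag 0 is taken to mean the host answered and 6 that it did not, each node can only reach hosts on its own subnet, and the second host name `WinMachine101` and the subnet labels are placeholders.

```python
# Toy model of the behaviour seen in the log; not FOG's actual code.
# Assumption: flag 0 = host answered, flag 6 = host did not answer.
ANSWERED = 0
NO_ANSWER = 6

# Which subnet each host lives on (labels and second host are placeholders).
HOSTS = {"WinMachine001": "server-subnet", "WinMachine101": "storage-subnet"}


def ping_cycle(db, node_subnet):
    """One node pings every host and writes every result to the shared table."""
    for host, subnet in HOSTS.items():
        # Assumption: a node only reaches hosts on its own subnet.
        db[host] = ANSWERED if subnet == node_subnet else NO_ANSWER


def simulate(node_order):
    """Run one cycle per node in the given order and return the final table."""
    db = {}
    for node_subnet in node_order:
        ping_cycle(db, node_subnet)
    return db
```

With this model, `simulate(["server-subnet", "storage-subnet"])` leaves only the storage subnet's host marked as answered, and reversing the order flips the result. Whichever node wrote last decides what the GUI shows.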
-
@markbam Great work! I guess I now understand the logic of this problem. Moving this topic to bug reports now. Can’t promise you when I will find the time to fix this. In case you or one of your co-workers have PHP skills we could work together on this.
-
@markbam Have not found the time to try and fix this but I have it on my list and will do so.
-
Unfortunately neither I nor any of my coworkers know PHP. I mainly write in Perl and, while it is a bit similar to PHP, my code is by no means production quality, but I can try to help however I can.
I’ve looked at the FOGPingHosts.service and quickly thought of trying to create persistence based on time by adding a field to the SQL database that records the last successful ping time.
Touching the database seems a bit overreaching, but maybe a field like this already exists. I haven’t had the chance to dump it yet.
Pseudo Code: FogPingHosts.service
foreach(Host) {
    getCurrentTime = CurrentTime;
    timestampFromSQL = read_sql(timestampSQLField);
    timeSinceLastSuccessfulPing = getCurrentTime - timestampFromSQL;
    if(timeSinceLastSuccessfulPing > 180 Seconds) {
        if(PingHost == successful) {
            write_sql(timestampSQLField, getCurrentTime);
        }
    }
}
-
@markbam I’m unsure about how this would resolve the problem as it currently exists.
As I understand it, the problem is that nodes overwrite the status constantly.
While a time check would reduce server stress somewhat, it won’t prevent a certain subnet from ‘getting there’ first and setting an incorrect value. (presuming subnets can’t ping to other subnets, you’d only see what one specific subnet sees, in other words)
So the bug is that storage nodes try and ping all hosts rather than only their own, if I’m getting it right.
-
@Quazz I don’t know that it’s a bug. Yes, I understand this is unexpected behaviour. But remember the location plugin is a plugin. It’s not a default element of the core of FOG. The core elements attempt to ping all hosts each iteration. (So this is the expected behaviour.)
That’s not to say we shouldn’t have this in bug reports. We’ll need to add things to the location plugin to distribute the ping hosts work and enable location-specific pinging.
This is not currently coded for or thought of so I imagine this will take a while to work out.
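To make the idea concrete, here is a rough sketch of what location-specific pinging could look like. It is purely illustrative and not the plugin's actual code: the `Host` class, its `location` field, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    location: str  # hypothetical field: the location the host is assigned to


def hosts_for_node(hosts, node_location):
    """Return only the hosts assigned to the same location as the node."""
    return [h for h in hosts if h.location == node_location]


def ping_cycle(db, hosts, node_location, ping):
    """Ping and record only this node's own hosts, leaving all others untouched.

    `ping` is any callable taking a Host and returning a status value.
    """
    for host in hosts_for_node(hosts, node_location):
        db[host.name] = ping(host)
```

Because each node only writes rows for hosts in its own location, two nodes running their cycles in any order no longer overwrite each other's results.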
-
@markbam I started to implement that but need to push this further down the list. We need to focus on bugs of the core FOG code before that.
Note to myself: I have saved a diff of the current work in my workspace.