FOGPingHosts Different Subnets (Location Plugin)
-
@markbam Wait a second before you head off. Are you sure the FOGPingHosts service is running properly on all nodes? Just asking because we have an open issue on startup scripts: https://github.com/FOGProject/fogproject/issues/268
Not sure if FOGPingHosts could show the same problem.
I haven’t been able to reproduce exactly what triggers it to switch which server it decides to listen to but seems related to startup order.
Can you give a bit more detail on what you mean by that and what exactly you looked at to see this? Maybe an idea will pop up once I understand this better.
-
The FOGPingHosts.service shows as active on all the servers.
With Wireshark running on each subnet, I see that FOGPingHosts is trying to ping all of the hosts.
On each subnet, I can see the pings return successfully when it hits a host that is alive. This is where the issue manifests: the FOG database only reflects the state of the hosts on a single subnet, usually the subnet of the server that is turned on first (but I can’t reproduce this 100% of the time).
So if a storage node is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
If the FOG Server is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
-
@markbam This sounds like one node would overwrite all the states of another node. While I can’t say this is the case I’d really wonder if this is happening. Well, all nodes use one database but…
Would be really great if we can dig deeper to see if this is caused by FOG itself or something within your network.
Have you done MySQL logging yet? It’s fairly simple to set up, but the logs will fill up fairly quickly and I am afraid it will be a real quest to extract the information from them.
I’ve just had a quick look at the code. FOGPingHosts loops over all host objects, pings them one by one and updates the database one by one as well.
Ok, let’s give it a try:
- Choose a time when not many people are at work.
- Stop FOGPingHosts on all your nodes.
- Log in to your MySQL/MariaDB instance as root. Then run:
SET global general_log_file='/tmp/mysql.log'; SET global log_output = 'file'; SET global general_log = on;
- Start FOGPingHosts on all nodes.
- Watch the pinghosts.log on the nodes to see the services are doing work.
Not sure how long is appropriate to wait until you stop the DB logging again. Probably best if you just keep an eye on the log file size (ls -alh /tmp/mysql.log). If it grows above 10 MB quickly (I don’t think it will, but that depends on the activity in your network: clients PXE booting and fog-client checking in), you might switch it off again (mysql shell: SET global general_log = off;).
If you gzip the log file it should still be possible to dig through. Try this first:
grep "UPDATE.*hosts" /tmp/mysql.log > /tmp/mysql_hosts_update.log
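Once the UPDATE lines are extracted, grouping them by connection id can make it easier to see whether different connections write different values for the same host. The following is only a rough sketch: it assumes each general log line contains the connection id directly before the word Query, followed by the statement, and the `updates_by_connection` helper is made up for illustration. Check a few lines of your own log first, as the layout may differ between MySQL and MariaDB versions.

```python
import re
from collections import defaultdict

# Assumed line layout: "<timestamp> <connection id> Query <statement>".
# Only UPDATE statements that mention the hosts table are kept.
LINE_RE = re.compile(r"(\d+)\s+Query\s+(UPDATE\b.*\bhosts\b.*)", re.IGNORECASE)


def updates_by_connection(lines):
    """Group UPDATE statements on the hosts table by connection id."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            grouped[int(match.group(1))].append(match.group(2).strip())
    return dict(grouped)
```

Any iterable of lines works as input, for example an open file object for /tmp/mysql_hosts_update.log. If the full log also contains Connect entries, those should help map each connection id back to the node that opened it.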
If you need assistance digging through the log file you can send me a private message to get my email to send the log file to.
-
I finally had a chance to go through this.
From what I can gather, it does appear to be conflicting. One server will ping and set the flag to 0, then another will ping and set the flag to 6, and then back and forth.
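The back-and-forth can be reproduced with a toy model. This is only a sketch and not FOG code: the host names and the `run_ping_pass` helper are made up, and it assumes 0 means the host answered and 6 means it did not, which is what the logged values suggest.

```python
# Toy model of two ping services sharing one database table.
# Assumption: 0 = host answered the ping, 6 = host did not answer.

def run_ping_pass(reachable_hosts, all_hosts, database):
    """One pass of a node: ping every host and write every result."""
    for host in all_hosts:
        database[host] = 0 if host in reachable_hosts else 6


all_hosts = ["pc-a1", "pc-a2", "pc-b1"]  # hypothetical hosts
node_a_sees = {"pc-a1", "pc-a2"}         # node A can only reach subnet A
node_b_sees = {"pc-b1"}                  # node B can only reach subnet B

database = {}
run_ping_pass(node_a_sees, all_hosts, database)
after_a = dict(database)  # state right after node A's pass
run_ping_pass(node_b_sees, all_hosts, database)
after_b = dict(database)  # state right after node B's pass

print(after_a)
print(after_b)
```

Whichever node finishes its pass last decides what the database shows, which would match the flag flipping between 0 and 6.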
-
@markbam Great work! I guess I now understand the logic of this problem. Moving this topic to bug reports now. Can’t promise you when I will find the time to fix this. In case you or one of your co-workers have PHP skills we could work together on this.
-
@markbam Have not found the time to try and fix this but I have it on my list and will do so.
-
Unfortunately neither I nor any of my coworkers know PHP. I mainly write in Perl and, while it is a bit similar to PHP, my code is by no means production quality, but I can try to help however I can.
I’ve looked at the FOGPingHosts.service and quickly thought of trying to create persistence based on time by adding a field to the SQL database that records the last successful ping time.
Touching the database seems a bit overreaching but maybe a field like this already exists. I haven’t had the chance to dump it yet.
Pseudo Code: FogPingHosts.service
foreach(Host) {
    getCurrentTime = CurrentTime;
    timestampFromSQL = read_sql(timestampSQLField);
    timeSinceLastSuccessfulPing = getCurrentTime - timestampFromSQL;
    if(timeSinceLastSuccessfulPing > 180 Seconds) {
        if(PingHost == successful) {
            write_sql(timestampSQLField, getCurrentTime);
        }
    }
}
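A runnable take on the same idea, as a sketch only: a plain dict stands in for the database field, `can_reach` stands in for the real ping, and the `is_active` check with its window of twice the interval is an assumption that is not part of the pseudo code. The point it illustrates is that only successful pings are written, so a node that cannot reach a host never erases another node's success.

```python
import time

MIN_INTERVAL = 180  # seconds, taken from the pseudo code above


def ping_pass(can_reach, hosts, last_success, now=None):
    """Record a timestamp only when a ping succeeds; never write failures."""
    now = time.time() if now is None else now
    for host in hosts:
        # Skip hosts that some node reached recently.
        if now - last_success.get(host, 0) > MIN_INTERVAL:
            if can_reach(host):
                last_success[host] = now


def is_active(host, last_success, now, window=2 * MIN_INTERVAL):
    """Treat a host as up if any node reached it within the window."""
    return now - last_success.get(host, 0) <= window
```

With this scheme the displayed status would be derived from the age of the timestamp instead of from the last write.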
-
@markbam I’m unsure about how this would resolve the problem as it currently exists.
As I understand it, the problem is that nodes overwrite the status constantly.
While a time check would reduce server stress somewhat, it won’t prevent a certain subnet from ‘getting there’ first and setting an incorrect value. (presuming subnets can’t ping to other subnets, you’d only see what one specific subnet sees, in other words)
So the bug is that storage nodes try and ping all hosts rather than only their own, if I’m getting it right.
-
@Quazz I don’t know that it’s a bug. Yes, I understand this is unexpected behaviour. But remember the location plugin is a plugin. It’s not a default element of the core of FOG. The core elements attempt to ping all hosts each iteration (so this is the expected behavior).
That’s not to say we shouldn’t have this in bug reports, and we’ll need to add things to the location plugin to distribute the PING Hosts work and enable location-specific pinging.
This is not currently coded for or thought of so I imagine this will take a while to work out.
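To make the idea of location-specific pinging concrete, here is a minimal sketch. It is not how the location plugin stores its data: the `Host` class, the `location_id` field and the `hosts_for_node` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    location_id: int  # hypothetical link between a host and a location


def hosts_for_node(node_location_id, hosts):
    """Return only the hosts assigned to this node's location."""
    return [h for h in hosts if h.location_id == node_location_id]


hosts = [Host("pc-a1", 1), Host("pc-a2", 1), Host("pc-b1", 2)]

# A node at location 2 would only ping and update pc-b1.
print([h.name for h in hosts_for_node(2, hosts)])
```

Each node would then only write the status of hosts it is responsible for. Hosts without any location would still need a fallback, for example leaving them to the main server.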
-
@markbam I started to implement that but need to push this further down the list. We need to focus on bugs of the core FOG code before that.
Note to myself: I have saved a diff of the current work in my workspace.