Disappearing hosts from host list
-
@foglalt Ok, give me a day or two - I’ll put together a monitoring script that will record when the problem occurs - the script will also grab the last 100 lines from apache access logs and error logs - and say the last 50 entries from the history table. That information will have the clues that we can use to find what’s causing this.
-
Ok, just take your time. I am glad that you guys are so responsive in all ways of user interaction! I don’t even understand that moron on the forum claiming that fogproject is dead… We should thank you for your work, not complain in such a nonsensical way.
-
@foglalt I put together a script:
https://github.com/wayneworkman/fog-community-scripts/blob/master/troubleshootingTools/monitor-missing-primary-mac.sh
For future readers, this script will be here in the troubleshooting tools:
https://github.com/FOGProject/fog-community-scripts
It monitors the count of hosts that are missing a primary MAC. If the count goes above 0, it gets those host IDs, the last 100 lines of the Apache access log, the last 100 lines of the Apache error log, and the last 50 entries from FOG’s history table, and dumps them to the file
/root/troubleshooting.log
Once that log file is written to, the script will not write anything else.
Get the script onto your FOG server, then set it up to run as a cron task every minute. There are lots of tutorials if you search for cron or crontab, but generally the steps are as follows:
    sudo -i
    crontab -e
    # This might ask you if you want to use vim basic or vim tiny, it doesn't matter which.
    # Go into insert mode with:
    i
    # Paste in the below line:
    * * * * * /root/monitor-missing-primary-mac.sh
    # Or wherever you put the script.
    # Leave insert mode with the escape key
    esc
    # Save and close in vim with
    :wq
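For anyone who would rather not edit the crontab in Vim, the same entry can probably be added non-interactively. This is only a sketch: add_cron_line is a made-up helper name, and the script path is the same /root location assumed in the steps above. It reads the current crontab on stdin and prints it back with the new line appended, unless that exact line is already there.

```shell
#!/bin/sh
# add_cron_line is a hypothetical helper (not part of FOG or cron).
# It reads an existing crontab on stdin and prints it to stdout with the
# given line appended, unless an identical line is already present.
add_cron_line() {
    new_line="$1"
    existing="$(cat)"
    if printf '%s\n' "$existing" | grep -Fxq -- "$new_line"; then
        # Already scheduled: pass the crontab through unchanged.
        printf '%s\n' "$existing"
    elif [ -z "$existing" ]; then
        # No crontab yet: the new line is the whole crontab.
        printf '%s\n' "$new_line"
    else
        printf '%s\n%s\n' "$existing" "$new_line"
    fi
}

# Usage (as root; adjust the path to wherever you put the script):
#   crontab -l 2>/dev/null \
#     | add_cron_line '* * * * * /root/monitor-missing-primary-mac.sh' \
#     | crontab -
```

Running it twice should leave only one copy of the entry, so it is safe to re-run if you are not sure whether the job was already installed.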
There are probably better tutorials out there on how to use Vim & create a cron job, but I put together this little Vim tutorial some years ago:
https://wiki.fogproject.org/wiki/index.php?title=Vi
If you need further instructions or clarification on how to set up the script, just ask. When the file /root/troubleshooting.log appears on your server, look through it yourself first and remove any sensitive information. Maybe this file will even help you figure out the bug yourself. If not, you can share the file here with us and we can probably get a very good idea of how to reproduce this issue.
-
And again, impressive response time! Thanks again. Ok, I will do the preparation and wait for the rabbit to jump out of its nest. If I understand correctly, a log appearing means it found something. Let’s wait and hope. I will report back on the next hit. (Anyway, you gave me good ideas on how to do some things for myself as I read through your script, nice work!)
-
@foglalt Do we have the rabbit yet?
-
No, not yet. We do nothing during the weekend, and our imaging comes in pulses: as we replace PCs we create new imaging waves. Trust me, it will happen. (Why post is solved?)
-
@foglalt said in Disappearing hosts from host list:
Why post is solved?
Because we found a solution that fixed the problem earlier, and I marked that post as a solution. Now we are continuing in this same thread because it’s the same problem, trying to find the root cause to this issue. We can ask the @moderators to mark the thread as unsolved I suppose.
-
Oh, I see. I will let you know when the file appears.
-
Here it comes again. I now have a log file that your script created, and I think this is not the usual situation. Here is what happened now:
task was deploying and DURING that task the host data was updated with new mac address (i am aware that it is not best usage it was not me)
So I have the log, where should I put it? Mail maybe? It has a lot of info about our setup, so maybe it is not best to have it on a forum ;) Let me know how to share it with you.
-
@foglalt You can sanitize the log and post it here, or you could email it to me and I can look over it - PM me for my email address. To reset the ‘trap’, you just need to delete/move/rename the log that appeared. If you choose to sanitize and post here, don’t just wholesale delete stuff. For example, instead of completely removing an address or hostname, replace it with a sanitized one. Use search & replace for this - and ensure the integrity of the log is still valid.
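The search & replace approach could look roughly like the sketch below. sanitize_value is a hypothetical helper, and the addresses and hostnames in the usage lines are placeholders, not values from anyone's real setup. It replaces every literal occurrence of one string with another, so the same real value always maps to the same stand-in and the log stays internally consistent.

```shell
#!/bin/sh
# sanitize_value is a hypothetical helper: it reads log text on stdin and
# replaces every literal occurrence of $1 with $2 on stdout.
sanitize_value() {
    # Escape regex metacharacters in the search string so that, for
    # example, the dots in an IP address only match real dots.
    pattern="$(printf '%s' "$1" | sed 's|[]\\/$*.^[]|\\&|g')"
    # Escape the characters that are special in a sed replacement.
    replacement="$(printf '%s' "$2" | sed 's|[\\/&]|\\&|g')"
    sed "s/$pattern/$replacement/g"
}

# Usage (REAL_IP and REAL_HOSTNAME stand for your own values; 192.0.2.x
# is an address range reserved for documentation):
#   sanitize_value 'REAL_IP' '192.0.2.10' < /root/troubleshooting.log \
#     | sanitize_value 'REAL_HOSTNAME' 'SERVER' \
#     > /root/troubleshooting.sanitized.log
```

Writing to a separate output file keeps the original log untouched, so you can compare the two before sharing anything.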
-
@wayne-workman Hehe, sanitize I will. I already started reading it, but work came into the office in waves, so I postponed it until tomorrow. I will send it or post it here from work. By the way, I kept your address from the last DB transaction we did together.
-
@foglalt said in Disappearing hosts from host list:
here is what happened now:
task was deploying and DURING that task the host data was updated with new mac address (i am aware that it is not best usage it was not me)
No matter how it happened, we can learn from this and possibly improve FOG while doing so.
-
I realised that only the IP could have been sensitive, so I altered it; the rest is unmodified. I am curious whether the log helps this time (as I mentioned, it now has an extra step beyond normal daily work: the MAC was changed while the selected host was being deployed).
(Logs as long as mine cannot be inserted here, and only pictures can be uploaded, so I am sending it to your email.)
-
Got it, I will look over it tonight. At any rate, reset the trap.
-
@foglalt I looked over the log.
What I am seeing is that hostID 5 lost its primary MAC address on February 28th at 09:29:01 your time.
Knowing that you set this check up to run every minute, that means this problem happened between 9:28 AM and 9:29 AM.
Good thing the script gets the last 100 lines of the Apache access log; these relevant lines were nearly at the top of the output:
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "POST /fog/management/index.php?node=host&sub=edit&id=5&tab=host-general HTTP/1.1" 302 726 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host&sub=edit&id=5 HTTP/1.1" 302 698 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host HTTP/1.1" 200 2848 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
One POST, two GET requests. I’m going to assume the POST request caused the primary mac to be disassociated from the host.
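For future readers who want to narrow down a window like this themselves, here is a rough sketch of pulling only the POST requests for a given minute out of an Apache access log. posts_in_minute is a hypothetical helper, and it assumes the usual common/combined log layout shown above, where the fourth whitespace-separated field is the timestamp and the sixth is the request method.

```shell
#!/bin/sh
# posts_in_minute is a hypothetical helper: it reads an Apache access log
# (common or combined format) on stdin and prints only the POST requests
# whose timestamp starts with the given prefix, e.g. "28/Feb/2018:09:28".
posts_in_minute() {
    # Field 4 looks like [28/Feb/2018:09:28:10 and field 6 like "POST
    awk -v stamp="[$1" 'index($4, stamp) == 1 && $6 == "\"POST"'
}

# Usage (the log path depends on your distribution, for example
# /var/log/apache2/access.log or /var/log/httpd/access_log):
#   posts_in_minute '28/Feb/2018:09:28' < /var/log/apache2/access.log
```

Matching on a timestamp prefix keeps it flexible: a shorter prefix such as 28/Feb/2018:09 would cover the whole hour instead of a single minute.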
I do not see any relevant logs from the history table during this time, and the apache error log is basically totally silent.
So now we have something to go on. I’ll try to replicate the problem this weekend but I do have a very busy weekend so I might not get to it. I am very confident that the primary MAC of this host is being either deleted or disassociated during a host information update via the GUI - on a POST request.
@Foglalt if you can attempt to replicate this problem via editing & saving host information and figure out exactly what’s causing it, it would help. @developers any insight or help in isolating the issue further would be appreciated.
-
If I had to guess, this is due to how FOG does MAC address updates. Manually entering a new MAC in the primary MAC field should cause the new MAC to become primary and turn the original MAC address into an associated MAC. I haven’t played too much with this action, however, as it can become extremely difficult to manage. I mean, we’re having to check four very different things at the same time, not including the primary check itself (mac, pending, ignore client, ignore image). In a future version, possibly 1.5.1, this will be handled, I think, much better. My intent is to move all MAC addresses to their own tab. Within that tab you have a single text entry to insert new MACs, then you have a table that will present the primary, pending, image ignore, and client ignore. Getting the logic designed for 1.5.x is a bit more difficult, but I do have this already coded for 1.6 (nowhere near ready for even alpha testing yet, though much closer than most may think).
-
@foglalt said in Disappearing hosts from host list:
task was deploying and DURING that task the host data was updated with new mac address (i am aware that it is not best usage it was not me)
This is probably the key here, right?!
-
@sebastian-roth It might be related. There are no entries in the history table at the time the host lost its primary MAC. There are entries in it for that host though, but just not during that 1 minute window.
I think the issue is a much larger scope work-flow bug rather than something you can see in a couple lines of code and point your finger at.
I’m still going to try to reproduce the problem.
-
So I just want to clarify here, he’s been running your script for one month. One person does one thing and the host magically disappears (this seems like a timing issue considering the host was in a tasking). This isn’t a workflow problem, in my eyes. You find it great, but 1 host in 1 month due to 1 person making 1 change at the wrong time is NOT something that can be accounted for in all cases. The DB is being written to, and at the time of the “change” this might mean the host became invalidated because of the shift of MAC address at the same time the tasking is updating (which now becomes invalidated because the MAC is being switched).
I’m only working on what this seems like. Timing issues are never going to be fully fixable when working with stuff that writes to different areas of a database. The only way we could prevent this is to prevent any edits to any image, group, and/or host that has an item in a tasking. Again, though, this still has a temporary (albeit very small) timing issue, in that a person editing a host at the exact same time a person is deploying a host task could cause exactly the same issue.
I’m trying to give you this from a developer’s standpoint. Just stating it’s a “much more serious workflow bug” doesn’t help a single thing, because if it were a serious workflow bug, we’d have many many many more reports of this.