Disappearing hosts from host list
-
Got it, I will look over it tonight. At any rate, reset the trap.
-
@foglalt I looked over the log.
What I am seeing is that hostID 5 lost its primary MAC address on February 28th at 09:29:01 your time.
Knowing that you set this check up to run every minute, that means this problem happened between 9:28 AM and 9:29 AM.
Good thing the script grabs the last 100 lines of the Apache access log; the relevant lines were near the top of the output:
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "POST /fog/management/index.php?node=host&sub=edit&id=5&tab=host-general HTTP/1.1" 302 726 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host&sub=edit&id=5 HTTP/1.1" 302 698 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host HTTP/1.1" 200 2848 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
One POST, two GET requests. I'm going to assume the POST request caused the primary MAC to be disassociated from the host.
I do not see any relevant entries in the history table during this time, and the Apache error log is essentially silent.
So now we have something to go on. I'll try to replicate the problem this weekend, but I have a very busy weekend so I might not get to it. I am very confident that the primary MAC of this host is being either deleted or disassociated during a host information update via the GUI, on a POST request.
@Foglalt if you can attempt to replicate this problem by editing and saving host information, and figure out exactly what's causing it, that would help. @developers any insight or help in isolating the issue further would be appreciated.
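For anyone who wants to repeat this filtering on their own log, here is a minimal sketch of the idea: keep only POST requests to the host edit page for one host ID in the minute before the check fired. It is not the monitoring script itself. The function names are made up for illustration, and the sample lines are the three entries above with the referrer and user agent shortened to "-".

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlsplit, parse_qs

# Month names are mapped by hand so parsing does not depend on the system locale.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Leading fields of an Apache "combined" log line: client, identity, user,
# [timestamp], "METHOD path protocol", status.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def parse_ts(text):
    """Parse a timestamp such as '28/Feb/2018:09:28:10 +0100'."""
    day, month, rest = text.split("/", 2)
    return datetime.strptime(f"{day}/{MONTHS[month]:02d}/{rest}",
                             "%d/%m/%Y:%H:%M:%S %z")


def host_edit_posts(lines, host_id, detected_at, window=timedelta(minutes=1)):
    """Return (timestamp, client IP, status) for POSTs editing host_id
    that fall within `window` before `detected_at`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m["method"] != "POST":
            continue
        query = parse_qs(urlsplit(m["path"]).query)
        if query.get("sub") != ["edit"] or query.get("id") != [str(host_id)]:
            continue
        ts = parse_ts(m["ts"])
        if detected_at - window <= ts <= detected_at:
            hits.append((ts, m["ip"], int(m["status"])))
    return hits


# Sample input: the three log entries above, referrer and user agent shortened.
SAMPLE = [
    '10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "POST /fog/management/index.php?node=host&sub=edit&id=5&tab=host-general HTTP/1.1" 302 726 "-" "-"',
    '10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host&sub=edit&id=5 HTTP/1.1" 302 698 "-" "-"',
    '10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host HTTP/1.1" 200 2848 "-" "-"',
]
# Time at which the check reported hostID 5 without a primary MAC.
DETECTED = datetime.strptime("2018-02-28 09:29:01 +0100", "%Y-%m-%d %H:%M:%S %z")

for ts, ip, status in host_edit_posts(SAMPLE, 5, DETECTED):
    print(ts.isoformat(), ip, status)
```

Only the POST survives the filter, which is the request I am suspicious of.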
-
If I had to guess, this is due to how FOG does MAC address updates. Manually entering a new MAC in the primary MAC field should cause the new MAC to become primary and turn the original MAC address into an associated MAC. I haven't played with this action much, however, as it can become extremely difficult to manage. We have to check four very different things at the same time (MAC, pending, ignore client, ignore image), not including the primary check itself. In a future version, and possibly 1.5.1, this will be handled much better, I think. My intent is to move all MAC addresses to their own tab. Within that tab you have a single text entry to insert new MACs, then a table that presents the primary, pending, image ignore, and client ignore settings. Getting the logic designed for 1.5.x is a bit more difficult, but I do have this already coded for 1.6 (nowhere near ready for even alpha testing yet, though much closer than most may think).
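To make the intended behaviour concrete, here is a minimal sketch of the rule described above: the new MAC becomes primary and the old primary stays on the host as an associated MAC. This is not FOG's actual code. `MacEntry` and `set_primary` are hypothetical names, and clearing the pending flag on promotion is an assumption made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MacEntry:
    # Hypothetical model of one MAC row; field names mirror the flags
    # mentioned in the post, not a real FOG class.
    mac: str
    primary: bool = False
    pending: bool = False
    ignore_client: bool = False
    ignore_image: bool = False


def set_primary(macs, new_mac):
    """Return a new list in which new_mac is the only primary MAC.

    The previous primary is kept as an associated (non-primary) entry.
    The whole list is built first and returned in one go, so the caller
    never sees an intermediate state with zero primaries.
    """
    new_mac = new_mac.lower()
    existing = next((m for m in macs if m.mac.lower() == new_mac), None)
    # Demote every other entry, keeping its remaining flags untouched.
    others = [replace(m, primary=False) for m in macs if m.mac.lower() != new_mac]
    if existing is None:
        promoted = MacEntry(mac=new_mac, primary=True)
    else:
        # Assumption: a MAC promoted to primary is no longer pending.
        promoted = replace(existing, mac=new_mac, primary=True, pending=False)
    result = [promoted] + others
    assert sum(m.primary for m in result) == 1  # invariant: exactly one primary
    return result


before = [MacEntry("aa:bb:cc:00:00:01", primary=True)]
after = set_primary(before, "AA:BB:CC:00:00:02")
for entry in after:
    print(entry.mac, "primary" if entry.primary else "associated")
```

If the demote step and the promote step were written to the database as two separate updates, anything that fails or interleaves between them would leave a host with no primary MAC, which matches the symptom in this thread. That is only a possible explanation, not a confirmed one.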
-
@foglalt said in Disappearing hosts from host list:
task was deploying and DURING that task the host data was updated with new mac address (i am aware that it is not best usage it was not me)
This is probably the key here, right?!
-
@sebastian-roth It might be related. There are no entries in the history table at the time the host lost its primary MAC. There are entries in it for that host, just not during that one-minute window.
I think the issue is a much larger-scope workflow bug rather than something you can see in a couple of lines of code and point your finger at.
I'm still going to try to reproduce the problem.
-
So I just want to clarify here: he's been running your script for one month. One person does one thing and the host magically disappears (this seems like a timing issue, considering the host was in a tasking). This isn't a workflow problem, in my eyes. You found it, great, but 1 host in 1 month due to 1 person making 1 change at the wrong time is NOT something that can be accounted for in all cases. The DB is being written to, and at the time of the "change" this might mean the host became invalidated because of the shift of MAC address at the same time the tasking is updating (which now becomes invalidated because the MAC is being switched).
I'm only working on what this seems like. Timing issues are never going to be fully fixable when working with stuff that writes to different areas of a database. The only way we could prevent this is to prevent any edits to any image, group, and/or host that has an item in a tasking. Again, though, this still has a temporary (albeit very small) timing issue, in that a person editing a host at the exact same time another person is deploying a host task could cause exactly the same issue.
I'm trying to give you this from a developer's standpoint. Just stating it's a "much more serious workflow bug" doesn't help a single thing, because if it were a serious workflow bug, we'd have many, many more reports of this.
-
Thank you guys for cooperating on killing this bug, or whatever it is.
@Wayne-Workman I will try to do frequent host updates at random (actually I think it cannot be replicated during "non-tasking time"; maybe it is about load on the server).
@Sebastian-Roth I don't think the key is that the MAC was changed during imaging. As far as I know, FOG database updates are done when things start and finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes. (By the way, my guess is that the task finish doesn't do anything with the MAC address. Host data is updated at the end of a task, but why would it care about the MAC of the same host? I mean the host is looked up for the update by host ID, not by MAC. But of course that is a guess; you guys know more of the details.)
@Tom-Elliott I think the same. It is surely not a full-scale, everybody-hits-it kind of bug, as I see no other posts about it. I think it is related more to our methods of managing host deployment. That was the reason I asked on the forum before how people do this: to see what is different, and how, between our methods.
I have these guesses:
- maybe load on the server causes this timing issue
- maybe the host details are updated one by one and something can interfere with that, leaving the host "broken in a way".
The strange part is that we are not a huge company with tons of people using FOG. We have many hosts (around 2k) which we only clone with FOG. We do not manage them with the FOG client, no printer management, etc. We involve FOG only in host deployment. Very few deployments are done on site (computer labs); all the others are deployed, or even uploaded, in one office, where our 2 guys use FOG for daily deployment. That means very few concurrent database updates. Sometimes we do more and sometimes very few (we actually do a full company PC upgrade in waves of timed deployment; as there are not many of us, we have to do it as a long series of deployments, not 1-2 huge company-wide waves). I mention this only to show what is behind the curtain. One month of running the script is not a big thing. In some parts of the year it means 100 deployments in 2 weeks, sometimes 100 over a few months.
Anyway, about this actual issue. If I take seriously what my colleague said (why would I assume they lie? This screws up their work, not mine, so they need a solution, not a cover-up), then this time the disappearance was "unique", as they had not done it this way before ("we have never done it this way").
By the way, the script is reset to catch the next instance (I may tune it again to send me a message as soon as it happens, as it took 1-2 days before I was even informed that it happened). So, waiting for the next occurrence.
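The kind of check and immediate notification I have in mind could look roughly like this. It is only a sketch, not the script that is actually running. The table and column names (`hosts`, `hostMAC`, `hmHostID`, `hmPrimary` and so on) are my assumption of how the FOG database looks, and an in-memory SQLite database stands in for the real MySQL server so the example runs on its own.

```python
import sqlite3

# Assumed schema: one row per host, one row per MAC, hmPrimary = 1 marks
# the primary MAC. Verify the names against the real database before use.
QUERY = """
SELECT h.hostID, h.hostName
FROM hosts AS h
LEFT JOIN hostMAC AS m
       ON m.hmHostID = h.hostID AND m.hmPrimary = 1
WHERE m.hmID IS NULL
ORDER BY h.hostID
"""


def hosts_without_primary_mac(conn):
    """Return (hostID, hostName) for every host with no primary MAC row."""
    return conn.execute(QUERY).fetchall()


def notify(rows, send=print):
    """Send one message per affected host; `send` could be a mail function."""
    for host_id, _name in rows:
        send(f"Found hostID '{host_id}' without a primary MAC.")
    return len(rows)


def make_demo_db():
    """Build a tiny in-memory database with one healthy and one broken host."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hosts (hostID INTEGER PRIMARY KEY, hostName TEXT);
        CREATE TABLE hostMAC (hmID INTEGER PRIMARY KEY, hmHostID INTEGER,
                              hmMAC TEXT, hmPrimary INTEGER);
        INSERT INTO hosts VALUES (5, 'laci1'), (6, 'example-host');
        -- host 5 only has a non-primary MAC left; host 6 is healthy
        INSERT INTO hostMAC VALUES (1, 5, 'aa:bb:cc:00:00:01', 0),
                                   (2, 6, 'aa:bb:cc:00:00:02', 1);
    """)
    return conn


if __name__ == "__main__":
    notify(hosts_without_primary_mac(make_demo_db()))
```

Run from cron every minute, with `send` swapped for something that mails me, this would close the 1-2 day gap before I hear about it.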
-
Oh, one more thing. We use FOG version 1.4.4 at the moment. It was a fully clean install (OS, FOG, database; only the existing images were moved from storage to /images on the new machine). Is that relevant? We also installed a new server for use at a far site (a slow and unstable connection prevents replication, so images are only uploaded to that server once prepared). This remote site has 1.5.0, as it was finally released as stable (congratulations and thanks from the community, by the way; we are happy to see it).
Should we upgrade before going on with debugging? (Actually I don't like the word upgrade at this point, as we have had this issue for a long time now and even a fully clean installation showed the same, so the cause may lie in how we do things, not only in FOG.) The difference between 1.5.0 and 1.4.4 is huge visually, but can one be upgraded to the other by downloading the new tgz and running the installer? Won't it corrupt the database if I upgrade between these two? (Our far site is a side project for me to experiment with FOG upgrades and test new versions more often.)
-
@foglalt said in Disappearing hosts from host list:
as far as I know about fog database updates are done upon things finish and starting point. I mean if i change mac during deployment, database is updated on change and upon task finish
Actually, it's updated every couple of seconds during the task with the progress of the task.
-
@junkhacker You mean there are updates even if nothing changed? Like a refresh?
-
@foglalt If an imaging task is taking place, the database is updated by the computer being imaged with the progress. You know the progress bar that says how complete the task is? That's in the database.
-
I don't understand how changing a MAC address while it's being deployed could cause this, though. I've not tried to replicate the issue yet.
-
@junkhacker Oh, there you scored! I totally forgot this. By the way, does it write to the host data using the host ID, or does it use the MAC for lookups? If the latter, it may cause the issue, am I right?
-
@Wayne-Workman The issue happened again. I am preparing the log that the script generated and will send it to you by mail again for analysis.
-
@foglalt OK, I'll go over it this evening. Also, have you considered updating to 1.5.2? I think 1.3 is so old now that it's probably a waste of time to worry too much about it.
-
Er, my active one is not 1.3. When this first happened, it was about 1.2 or so. Then, as time passed, we tried many things, upgrades, etc. Not too long ago mine was the most up-to-date one (1.4.4); then of course, as development went on (happy to see that, by the way), it "grew older". But this has happened to me on every up-to-date version, so I think the true cause lies far deeper than it seems at first glance. I have hope in your script.
-
@foglalt I think the issue has been present for a long while. As soon as we know enough about it and can reproduce it, I would expect that a fix can be created quickly. The problem is knowing what's causing it and knowing how to reproduce it. We were grasping at straws, so I wrote this script to grasp at straws more efficiently lol.
-
@foglalt What is the MAC address of the host with hostID 43?
-
@developers I have found a pattern - not sure what it means but I wanted to share to get your eyes on it.
These next two code blocks are from the first log the script provided:
Date & time: 2018. febr. 28., szerda, 09:29:01 CET
Found hostID '5' without a primary MAC.
From the history table, there are some events about one hour earlier with the same hostID.
[2018-02-28,08:28:10],Host,ID: 5 NAME: laci1 has been successfully updated. laci 2018-02-28 08:28:10 10.10.36.124
[2018-02-28,08:28:10],MACAddressAssociation,ID: 131 has been successfully updated. laci 2018-02-28 08:28:10 10.10.36.124
[2018-02-28,08:27:45],Host,ID: 5 NAME: laci1 has been successfully updated. laci 2018-02-28 08:27:45 10.10.36.124
These next two blocks are from the recent log the script provided.
Date & time: 2018. ápr. 10., kedd, 09:52:01 CEST
Found hostID '43' without a primary MAC.
Again, we have an event in the history table from about one hour earlier with the same hostID.
[2018-04-05,08:53:10],Host,ID: 43 NAME: laci6 has been successfully updated. laci 2018-04-05 08:53:10 146.110.36.124
Could it be that the timestamps from the history table are just wrong and these events actually happened at the same time? If so, I'm betting that changing the name of a host can somehow, sometimes, cause this primary MAC missing issue.
-
@wayne-workman The timestamps are accurate. The first event you're showing is from February 28th, the second from April 5th. I don't know if the timestamps are in UTC or adjusted to the timezone the user is in.
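The gap between each detection and its history entry is easy to check with a little date arithmetic. A minimal sketch, assuming both values in each pair are in the same timezone (which is not confirmed in this thread); the `gap` helper is made up for the example:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gap(detected, history_event):
    """Time between a history-table entry and the script's detection."""
    return datetime.strptime(detected, FMT) - datetime.strptime(history_event, FMT)


# hostID 5: detected 2018-02-28 09:29:01, latest history entry 2018-02-28 08:28:10
first = gap("2018-02-28 09:29:01", "2018-02-28 08:28:10")
# hostID 43: detected 2018-04-10 09:52:01, history entry 2018-04-05 08:53:10
second = gap("2018-04-10 09:52:01", "2018-04-05 08:53:10")

print(first)   # 1:00:51
print(second)  # 5 days, 0:58:51
```

So only the first pair is about one hour apart; the second is five days and about an hour.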