Disappearing hosts from host list
-
Thank you guys for helping to kill this bug-or-what.
@Wayne-Workman I will try to do frequent host updates at random (actually, I think it cannot be reproduced during “non-tasking time”; maybe it is about load on the server).
@Sebastian-Roth I don't think the key is that the MAC is changed during imaging. As far as I know, FOG database updates are done when things start and when they finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes. (By the way, my guess is that the task finish doesn't touch the MAC address: host data is updated at the end of a task, but why would it care about the MAC of the same host? I mean the host is looked up for the update by host ID, not by MAC. But of course that is a guess; you guys know more of the details.)
@Tom-Elliott I think the same. It is surely not a full-scale, everybody-hits-it kind of bug, as I see no other posts about it. I think it is more related to our method of managing host deployment. That was why I asked on the forum earlier how people do this: to see what is different in our methods, and how.
I have a few guesses:
- maybe load on the server causes a timing issue
- maybe the host details are updated one by one and something can interfere with that, leaving the host “broken” in some way.
The strange part is that we are not a huge company with tons of people using FOG. We have many hosts (about 2k), which we only clone with FOG. We do not manage them with the FOG client, no printer management, etc. We involve FOG only in host deployment. Only very few deployments are done “on-site” (computer labs); all the others are deployed, or even uploaded, in one office, where our two guys use FOG for daily deploying. That means very few concurrent database updates. Sometimes we do more and sometimes very few (we actually do a full company PC upgrade in waves of timed deployments; as we are not many, we have to do it as a long series of deployments, not 1-2 huge company-wide waves). I mention this only to give a look behind the curtain. Running the script for 1 month is not a big thing. In some parts of the year that means 100 deployments in 2 weeks, in others 100 over a few months.
Anyway, about this actual issue. If I take seriously what my colleagues said (why would I assume they lie? This screws up their work, not mine, so they need a solution, not a cover-up), then this time the disappearance was “unique”, as they had not done it like this before (“we have never done it this way, ever”).
By the way, the script is set loose to catch the next instance (I may tune it once more to send me a message as soon as it happens, as it took 1-2 days before I was even informed that it had happened). So, waiting for the next occurrence.
-
Oh, one more thing. We use FOG version 1.4.4 at the moment. It was a fully clean install (OS, FOG, database; only the existing images were moved from storage to /images on the new machine). Is that relevant? We also installed a new server for use at a remote site (a slow and unstable connection prevents replication, so images are only uploaded to that server once they are prepared). This remote site runs 1.5.0, as it was finally released as stable (congrats and thanks from the community, by the way; we are happy to see it).
Should we upgrade before going on with debugging? (Actually, I don't like the word upgrade at this point, as we have had this issue for a long time now and even a fully clean installation gave the same result, so maybe the cause is in our way of doing things, not only in FOG.) The difference between 1.5.0 and 1.4.4 is huge visually, but can I upgrade between the two versions by downloading the new tgz and running the installer? Won't it corrupt the database if I upgrade between these two? (Our remote site is a side project for me to experiment with FOG upgrades and to test new versions more often.)
-
@foglalt said in Disappearing hosts from host list:
As far as I know, FOG database updates are done when things start and when they finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes.
Actually, it’s updated every couple of seconds during the task with the task's progress.
-
@junkhacker You mean there are updates even if nothing has changed? Like a refresh?
-
@foglalt If an imaging task is taking place, the computer being imaged updates the database with its progress. You know the progress bar that shows how complete the task is? That’s in the database.
-
I don’t understand how changing a MAC address while the host is being deployed could cause this, though. I haven’t tried to replicate the issue yet.
-
@junkhacker Oh, there you scored a point! I totally forgot this. By the way, does it write to the host data using the host ID, or does it use the MAC for lookups? If the latter, that may cause the issue, am I right?
-
@Wayne-Workman The issue happened again. I am preparing the log that the script generated and will send it to you by mail again for analysis.
-
@foglalt Ok, I’ll go over it this evening. Also, have you considered updating to 1.5.2? I think 1.3 is so old now that it’s probably a waste of time to worry too much about it.
-
Er, my active one is not 1.3. When this first happened it was about 1.2 or so. Then, as time passed, we tried many things, upgrades, etc. Not too long ago mine was the most up-to-date version (1.4.4); then of course, as development goes on (happy to see that, by the way), it “grew older”. But this has happened to me on every single up-to-date version, so I think the true cause lies far deeper than it seems at first glance. I have hope in your script.
-
@foglalt I think the issue has been present for a long while. As soon as we know enough about it and can reproduce it, I would expect that a fix can be created quickly. The problem is knowing what’s causing it and knowing how to reproduce it. We were grasping at straws, so I wrote this script to grasp at straws more efficiently lol.
-
@foglalt What is the MAC address of the host with hostID 43?
-
@developers I have found a pattern - not sure what it means but I wanted to share to get your eyes on it.
These next two code blocks are from the first log the script provided:
Date & time: 2018. febr. 28., szerda, 09:29:01 CET
Found hostID '5' without a primary MAC.
From the history table, there are some events about one hour earlier with the same hostID.
[2018-02-28,08:28:10],Host,ID: 5 NAME: laci1 has been successfully updated. laci 2018-02-28 08:28:10 10.10.36.124
[2018-02-28,08:28:10],MACAddressAssociation,ID: 131 has been successfully updated. laci 2018-02-28 08:28:10 10.10.36.124
[2018-02-28,08:27:45],Host,ID: 5 NAME: laci1 has been successfully updated. laci 2018-02-28 08:27:45 10.10.36.124
These next two blocks are from the recent log the script provided.
Date & time: 2018. ápr. 10., kedd, 09:52:01 CEST
Found hostID '43' without a primary MAC.
Again, we have an event in the history table from about one hour earlier with the same hostID.
[2018-04-05,08:53:10],Host,ID: 43 NAME: laci6 has been successfully updated. laci 2018-04-05 08:53:10 146.110.36.124
Could it be that the timestamps from the history table are just wrong and these events actually happened at the same time? If so, I’m betting that changing the name of a host can somehow - sometimes - cause this primary MAC missing issue.
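To make the gap easy to check, here is a minimal Python sketch that computes the time between each history-table event and the moment the monitor script flagged the host. The timestamps are copied from the two logs above; treating both as the same local server time is an assumption.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# hostID -> (history-table event, time the monitor script flagged the host).
# Values are copied from the log excerpts; both are assumed to be in the
# same local time zone of the server.
cases = {
    5: ("2018-02-28 08:28:10", "2018-02-28 09:29:01"),
    43: ("2018-04-05 08:53:10", "2018-04-10 09:52:01"),
}


def gap(event, detected):
    """Return the time elapsed between the history event and the detection."""
    return datetime.strptime(detected, FMT) - datetime.strptime(event, FMT)


gaps = {host_id: gap(*pair) for host_id, pair in cases.items()}
for host_id, delta in gaps.items():
    print(f"hostID {host_id}: {delta}")
```

For hostID 5 the gap is 1:00:51. For hostID 43 it is 5 days, 0:58:51, so only the time of day lines up with the one-hour pattern; the calendar dates are five days apart.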
-
@wayne-workman The timestamps are accurate. The event you’re showing is from February 28th, vs April 5th. I don’t know if the timestamps are in UTC or correct to the timezone the user is in.
-
@tom-elliott The timestamps are from whatever is set on his server. You see the 1-hour pattern on both occasions, though? Makes no sense to me. The only thing I can think of is a problem in a service that runs hourly. This is the script he’s running every minute via cron: https://github.com/FOGProject/fog-community-scripts/blob/master/troubleshootingTools/monitor-missing-primary-mac.sh
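For anyone who has not opened the script, the kind of check suggested by its name and its log message ("Found hostID ... without a primary MAC") can be sketched roughly as below. This is a minimal Python illustration against an in-memory SQLite database, not the shell script itself. The table and column names (`hosts`, `hostMAC`, `hmHostID`, `hmPrimary`) are assumptions made for the example, and the sample rows and MAC addresses are made up.

```python
import sqlite3


def hosts_without_primary_mac(conn):
    """Return the IDs of hosts that have no MAC row flagged as primary."""
    rows = conn.execute(
        """
        SELECT h.hostID
        FROM hosts AS h
        LEFT JOIN hostMAC AS m
               ON m.hmHostID = h.hostID AND m.hmPrimary = 1
        WHERE m.hmID IS NULL
        ORDER BY h.hostID
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical, simplified schema and made-up sample data for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hosts (hostID INTEGER PRIMARY KEY, hostName TEXT);
    CREATE TABLE hostMAC (
        hmID INTEGER PRIMARY KEY,
        hmHostID INTEGER,
        hmMAC TEXT,
        hmPrimary INTEGER
    );
    INSERT INTO hosts VALUES (5, 'laci1'), (43, 'laci6'), (44, 'example-host');
    INSERT INTO hostMAC (hmHostID, hmMAC, hmPrimary) VALUES
        (5,  '00:11:22:33:44:55', 1),
        (44, '00:11:22:33:44:66', 1),
        (43, '00:11:22:33:44:77', 0);  -- only a non-primary MAC is left
    """
)

missing = hosts_without_primary_mac(conn)
print(missing)
```

Run from cron every minute, a check like this narrows down when a host lost its primary MAC to within about a minute, which is what makes comparing against the history table possible.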
-
Tomorrow or later tonight I will send you another one. Yes, it happened again. Maybe it will at least show the same pattern. By the way, the host is not renamed, but the MAC is changed. Every time!
-
@foglalt When you say the MAC is changed, are you changing it via the GUI or is this something that’s happening that is not supposed to happen? Please elaborate.
-
@wayne-workman It is simply the following:
- pc1 comes in for reinstallation/installation; its MAC is registered on a dummy-like host (for example “laci1”).
- an image is selected, a deploy task is set, pc1 finishes and is turned off
- pc2 comes in for the same purpose; host “laci1” gets a MAC overwrite (select the MAC field in the GUI, type in the new MAC, press the update button).
The missing host is detected by your script; previously it was detected when pc3 came in for processing.
This is why I asked before how others do mass cloning. My colleagues use this method because with it you don't need to rebuild the cloning groups (no image update? Then you don't need to change the image name). It works like this: another 5 PCs come in, you put them on the table, plug in the cables, register the new MACs and launch the cloning process. We normally never keep hosts in the database, as you may have seen in our database before. We only have a few in it: a few dedicated ones (like the image creator's machine and some others).
-
One more thing about yesterday's instance of the problem: during the host update the new MAC was registered on the host (this time laci2, as you may see in the log), and when the task start was being prepared (those hosts are in a group for mass deployment) my colleague realised that the group had fewer members than needed. Voilà, missing host. So it disappeared after the host update (or during it).
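To illustrate the earlier guess that host details might be updated step by step and get interrupted in between, here is a small Python sketch with an in-memory SQLite database. It is only a model of that hypothesis, not FOG's actual code: the `hostMAC` table, its columns, and both `replace_mac_*` functions are hypothetical. It shows how a "delete old MAC, then insert new MAC" update can leave a host without a primary MAC if something fails between the two steps, and how a single transaction avoids that.

```python
import sqlite3


class Interrupted(Exception):
    """Stands in for anything that aborts the update half-way."""


def make_db():
    # Hypothetical, simplified table: one primary MAC for host 1.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hostMAC (hmHostID INTEGER, hmMAC TEXT, hmPrimary INTEGER)"
    )
    conn.execute("INSERT INTO hostMAC VALUES (1, 'aa:aa:aa:aa:aa:aa', 1)")
    conn.commit()
    return conn


def primary_count(conn, host_id):
    return conn.execute(
        "SELECT COUNT(*) FROM hostMAC WHERE hmHostID = ? AND hmPrimary = 1",
        (host_id,),
    ).fetchone()[0]


def replace_mac_stepwise(conn, host_id, new_mac, fail_between=False):
    """Two separately committed steps: a failure in between loses the MAC."""
    conn.execute(
        "DELETE FROM hostMAC WHERE hmHostID = ? AND hmPrimary = 1", (host_id,)
    )
    conn.commit()
    if fail_between:
        raise Interrupted
    conn.execute("INSERT INTO hostMAC VALUES (?, ?, 1)", (host_id, new_mac))
    conn.commit()


def replace_mac_atomic(conn, host_id, new_mac, fail_between=False):
    """Both steps in one transaction: a failure rolls the delete back."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "DELETE FROM hostMAC WHERE hmHostID = ? AND hmPrimary = 1", (host_id,)
        )
        if fail_between:
            raise Interrupted
        conn.execute("INSERT INTO hostMAC VALUES (?, ?, 1)", (host_id, new_mac))


def primaries_left_after_failure(replace):
    conn = make_db()
    try:
        replace(conn, 1, "bb:bb:bb:bb:bb:bb", fail_between=True)
    except Interrupted:
        pass
    return primary_count(conn, 1)


stepwise_left = primaries_left_after_failure(replace_mac_stepwise)
atomic_left = primaries_left_after_failure(replace_mac_atomic)
print("stepwise:", stepwise_left, "atomic:", atomic_left)
```

In this model the stepwise variant ends with no primary MAC after an interrupted update, while the transactional variant keeps the old one. Whether anything like this happens inside FOG is exactly what still needs to be found out.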
-
@foglalt So is it accurate to say “When an existing host’s MAC address is changed to something else, sometimes the primary MAC address is lost”?
Also, you know you can image without registering, right?