Disappearing hosts from host list

Wayne Workman

@sebastian-roth It might be related. There are no entries in the history table at the time the host lost its primary MAC. There are entries in it for that host, just not during that one-minute window.

I think the issue is a much larger scope work-flow bug rather than something you can see in a couple lines of code and point your finger at.

I’m still going to try to reproduce the problem.

Tom Elliott

So I just want to clarify here: he’s been running your script for one month. One person does one thing and the host magically disappears (this seems like a timing issue, considering the host was in a tasking). This isn’t a workflow problem, in my eyes. You find it great, but 1 host in 1 month due to 1 person making 1 change at the wrong time is NOT something that can be accounted for in all cases. The DB is being written to, and at the time of the “change” this might mean the host became invalidated because the MAC address shifted at the same time the tasking was updating (and the tasking now becomes invalidated because the MAC is being switched).

I’m only working on what this seems like. Timing issues are never going to be fully fixable when working with stuff that writes to different areas of a database. The only way we could prevent this is to prevent any edits to any image, group, and/or host that has an item in a tasking. Again, though, this still has a temporary (albeit very small) timing issue, in that a person editing a host at the exact same time a person is deploying a host task could cause exactly the same issue.
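
A minimal sketch of the kind of timing window described above, assuming purely for illustration that a MAC change is stored as two separately committed writes. The table and column names (`host_mac`, `host_id`, `is_primary`) and both helper functions are hypothetical; nothing in this thread confirms that FOG performs the update this way.

```python
import sqlite3


def make_db():
    # Hypothetical one-table layout, NOT the real FOG schema.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE host_mac (host_id INTEGER, mac TEXT, is_primary INTEGER)"
    )
    db.execute("INSERT INTO host_mac VALUES (5, '00:11:22:33:44:55', 1)")
    db.commit()
    return db


def has_primary(db, host_id):
    # True if the host currently has a row flagged as its primary MAC.
    count = db.execute(
        "SELECT COUNT(*) FROM host_mac WHERE host_id = ? AND is_primary = 1",
        (host_id,),
    ).fetchone()[0]
    return count > 0


def replace_mac_two_steps(db, host_id, new_mac):
    # Assumed (unconfirmed) pattern: remove the old primary, then add the
    # new one, as two separately committed writes.
    db.execute(
        "DELETE FROM host_mac WHERE host_id = ? AND is_primary = 1", (host_id,)
    )
    db.commit()
    # Anything that reads at this point sees a host with no primary MAC.
    seen_between_writes = has_primary(db, host_id)
    db.execute("INSERT INTO host_mac VALUES (?, ?, 1)", (host_id, new_mac))
    db.commit()
    return seen_between_writes


def replace_mac_single_update(db, host_id, new_mac):
    # One statement: there is no intermediate state for a reader to observe.
    db.execute(
        "UPDATE host_mac SET mac = ? WHERE host_id = ? AND is_primary = 1",
        (new_mac, host_id),
    )
    db.commit()


if __name__ == "__main__":
    db = make_db()
    print(replace_mac_two_steps(db, 5, "66:77:88:99:aa:bb"))  # False
    print(has_primary(db, 5))  # True
```

In this toy model the window closes again once the second write lands. A host that stays without a primary MAC would need something else to go wrong inside that window, for example another writer acting on the half-updated state.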

I’m trying to give you this from a developer’s standpoint. Just stating it’s a “much more serious workflow bug” doesn’t help a single thing, because if it were a serious workflow bug, we’d have many, many more reports of this.

Foglalt

Thank you guys for cooperating in killing this bug-or-what.

@Wayne-Workman I will try doing frequent host updates at random (actually I think it is not reproducible during “non-tasking time”; maybe it is about load on the server).

@Sebastian-Roth I don’t think the key is “the MAC is changed during imaging”. As far as I know, FOG database updates are done when things start and finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes. (By the way, my guess is that the task finish doesn’t do anything with the MAC address. Host data is updated at the end of a task, but why would that care about the MAC of the same host? I mean the host is looked up for the update by host ID, not by a MAC lookup. But of course this is a guess; you guys know more of the details.)

@Tom-Elliott I think the same. It is surely not a full-scale, everybody-hits-it kind of bug, as I see no other posts about it. I think it is related more to our methods of managing host deployment. That is why I asked on the forum before how people do this: to see what is different between our methods, and how.

My guesses are these:

maybe load on the server causes this timing issue
maybe the host details are updated one by one and something can interfere with that, leaving the host “broken in a way”.

The strange part is that we are not a huge company with tons of people using FOG. We have many hosts (about 2k), which we only clone with FOG. We do not manage them with the FOG client, no printer management, etc. We involve FOG only in host deployment. Only very few deployments are done “on-site” (computer labs); all the others are deployed, or even uploaded, in one office, where our 2 guys use FOG for daily deploying. That means very few concurrent database updates. Sometimes we do more and sometimes very few (we actually do a full company PC upgrade in waves of timed deployment; as we are not many, we have to do it as a long series of deployments, not 1-2 huge company-wide waves). I mention this only to give a look behind the curtain. One month of running the script is not a big thing. In some parts of the year it means 100 deployments in 2 weeks, sometimes 100 over a few months.

Anyway, about this actual issue. If I take seriously what my colleague said (why would I assume they lie? This screws up their work, not mine, so they need a solution, not a cover-up), then this time the disappearance was “unique”, as they had not done this before (“we have never done it this way, ever”).

By the way, the script is ready again to catch the next instance (I will maybe tune it once more to send me a message as soon as it happens, as it took 1-2 days before I was even informed that it had happened). So, waiting for the next occurrence.

Foglalt

Oh, one more thing. We use FOG version 1.4.4 at the moment. It was a fully clean install (OS, FOG, database; only the existing images were moved from storage to /images on the new machine). Is that relevant? We also installed a new server for use at a far site (a slow and unstable connection prevents replication, so images are only uploaded to that server once prepared). This remote site has 1.5.0, as it was finally released as stable (congratulations and thanks from the community, by the way; we are happy to see it).

Should we upgrade before going on with debugging? (Actually I don’t like the word upgrade at this point, as we have had this issue for a long time now and even a fully clean installation gave the same result, so the cause may lie in how we do things, not only in FOG itself.) The difference between 1.5.0 and 1.4.4 is huge visually, but can the upgrade be done by downloading the new tgz and running the installer? Won’t it corrupt the database if I upgrade between these two versions? (Our far site is a side project for me to experiment with FOG upgrades and to test new versions more often.)

Junkhacker

@foglalt said in Disappearing hosts from host list:

as far as I know, FOG database updates are done when things start and finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes

actually, it’s updated every couple of seconds during the task with the progress of the task.

Foglalt

@junkhacker You mean there are updates even if nothing changed? Like a refresh?

Junkhacker

@foglalt if an imaging task is taking place, the database is updated by the computer being imaged with the progress. you know the progress bar that says how complete the task is? that’s in the database.

Wayne Workman

I don’t understand how changing a MAC address while it’s being deployed could cause this, though. I still haven’t tried to replicate the issue.

Foglalt

@junkhacker Oh, there you’ve scored! I totally forgot this. By the way, does it write to the host data using the host ID, or does it use the MAC for lookups? If the latter, it may cause the issue, am I right?
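
To make the question concrete, here is a small sketch of the two lookup styles. It does not show how FOG actually writes task progress: the tables (`host_mac`, `task`), their columns, and both update functions are invented for illustration only.

```python
import sqlite3


def make_db():
    # Hypothetical tables, NOT the real FOG schema.
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE host_mac (host_id INTEGER, mac TEXT);
        CREATE TABLE task (host_id INTEGER, percent INTEGER);
        INSERT INTO host_mac VALUES (5, 'aa:aa:aa:aa:aa:aa');
        INSERT INTO task VALUES (5, 0);
        """
    )
    return db


def update_progress_by_id(db, host_id, percent):
    # Lookup by host ID: unaffected by any change to the MAC.
    cur = db.execute(
        "UPDATE task SET percent = ? WHERE host_id = ?", (percent, host_id)
    )
    return cur.rowcount  # number of task rows updated


def update_progress_by_mac(db, mac, percent):
    # Lookup by MAC: the client only knows its own (old) MAC.
    cur = db.execute(
        "UPDATE task SET percent = ? "
        "WHERE host_id IN (SELECT host_id FROM host_mac WHERE mac = ?)",
        (percent, mac),
    )
    return cur.rowcount  # number of task rows updated


if __name__ == "__main__":
    db = make_db()
    print(update_progress_by_mac(db, "aa:aa:aa:aa:aa:aa", 10))  # 1
    # Someone overwrites the MAC in the GUI while the task is running.
    db.execute("UPDATE host_mac SET mac = 'bb:bb:bb:bb:bb:bb' WHERE host_id = 5")
    print(update_progress_by_mac(db, "aa:aa:aa:aa:aa:aa", 20))  # 0
    print(update_progress_by_id(db, 5, 20))  # 1
```

In this toy model a MAC-based lookup simply stops finding the host after the overwrite, while an ID-based lookup keeps working. Whether either matches what FOG does is exactly the open question.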

Foglalt

@Wayne-Workman The issue happened again. I am preparing the log that the script generated and will send it to you by mail again for analysis.

Wayne Workman

@foglalt Ok, I’ll go over it this evening. Also, have you considered updating to 1.5.2? I think 1.3 is so old now that it’s probably a waste of time to worry too much about it.

Foglalt

Er, my active one is not 1.3. When this first happened, it was about 1.2 or so. Then, as time passed, we tried many things, upgrades, etc. Not too long ago mine was the most up-to-date one (1.4.4); then of course, as development goes on (happy to see that, by the way), it “grew older”. But this has happened to me on every up-to-date version, so I think the true reason lies far deeper than it seems at first glance. I have hope in your script.

Wayne Workman

@foglalt I think the issue has been present for a long while. As soon as we know enough about it and can reproduce it, I would expect that a fix can be created quickly. The problem is knowing what’s causing it and knowing how to reproduce it. We were grasping at straws, so I wrote this script to grasp at straws more efficiently lol.

Wayne Workman

@foglalt What are the MAC addresses of the host with hostID 43?

Wayne Workman

@developers I have found a pattern - not sure what it means but I wanted to share to get your eyes on it.

These next two code blocks are from the first log the script provided:

Date & time: 2018. febr. 28., szerda, 09:29:01 CET
Found hostID '5' without a primary MAC.

From the history table, there are some events about one hour earlier with the same hostID.

[2018-02-28,08:28:10],Host,ID: 5 NAME: laci1 has been successfully updated.	laci	2018-02-28 08:28:10	10.10.36.124
[2018-02-28,08:28:10],MACAddressAssociation,ID: 131 has been successfully updated.	laci	2018-02-28 08:28:10	10.10.36.124
[2018-02-28,08:27:45],Host,ID: 5 NAME: laci1 has been successfully updated.	laci	2018-02-28 08:27:45	10.10.36.124

These next two blocks are from the recent log the script provided.

Date & time: 2018. ápr. 10., kedd, 09:52:01 CEST
Found hostID '43' without a primary MAC.

Again, we have an event in the history table from about one hour earlier with the same hostID.

[2018-04-05,08:53:10],Host,ID: 43 NAME: laci6 has been successfully updated.	laci	2018-04-05 08:53:10	146.110.36.124

Could it be that the timestamps from the history table are just wrong and these events actually happened at the same time? If so, I’m betting that changing the name of a host can somehow - sometimes - cause this primary MAC missing issue.

Tom Elliott

@wayne-workman The timestamps are accurate. The event you’re showing is from February 28th, vs April 5th. I don’t know if the timestamps are in UTC or correct to the timezone the user is in.
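
For reference, the gap between each detection and the newest history entry quoted above can be computed directly. This is only a sketch: the detection times are transcribed by hand from the script output, and it assumes both sets of timestamps are in the same local time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gap(detected, last_event):
    # Time between the script's detection and the newest history entry.
    return datetime.strptime(detected, FMT) - datetime.strptime(last_event, FMT)


# hostID 5: detection vs. newest history entry from the first log
first = gap("2018-02-28 09:29:01", "2018-02-28 08:28:10")
# hostID 43: detection vs. the history entry from the recent log
second = gap("2018-04-10 09:52:01", "2018-04-05 08:53:10")

print(first)   # 1:00:51
print(second)  # 5 days, 0:58:51
```

The time-of-day difference is close to one hour in both cases, but the second pair is also five calendar days apart.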

Wayne Workman

@tom-elliott The timestamps are from whatever is set on his server. You see the 1-hour pattern on both occasions, though? It makes no sense to me. The only thing I can think of is a problem in a service that runs hourly. This is the script he’s running every minute via cron: https://github.com/FOGProject/fog-community-scripts/blob/master/troubleshootingTools/monitor-missing-primary-mac.sh
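
The idea behind such a check can be sketched as follows. This is not the linked script (which is a shell script run against the real FOG database); it is a self-contained illustration with made-up table and column names (`host`, `host_mac`, `is_primary`) and an in-memory database.

```python
import sqlite3


def make_sample_db():
    # Hypothetical schema and sample rows, NOT the real FOG tables.
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE host (host_id INTEGER, name TEXT);
        CREATE TABLE host_mac (host_id INTEGER, mac TEXT, is_primary INTEGER);
        INSERT INTO host VALUES (5, 'laci1'), (7, 'example'), (43, 'laci6');
        -- host 5 has a primary MAC, host 43 only a non-primary one,
        -- host 7 has no MAC rows at all
        INSERT INTO host_mac VALUES (5, 'aa:aa:aa:aa:aa:aa', 1);
        INSERT INTO host_mac VALUES (43, 'bb:bb:bb:bb:bb:bb', 0);
        """
    )
    return db


def hosts_without_primary_mac(db):
    # Hosts for which no MAC row flagged as primary exists.
    rows = db.execute(
        "SELECT h.host_id FROM host h "
        "LEFT JOIN host_mac m "
        "  ON m.host_id = h.host_id AND m.is_primary = 1 "
        "WHERE m.host_id IS NULL "
        "ORDER BY h.host_id"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    for host_id in hosts_without_primary_mac(make_sample_db()):
        print(f"Found hostID '{host_id}' without a primary MAC.")
```

Run once a minute, a check like this can only say that the primary MAC was already gone at that moment; it cannot say which write removed it, which is why the history table is consulted afterwards.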

Foglalt

Tomorrow or later tonight I will send you another one. Yes, it happened again. Maybe it shows the same pattern at least. By the way, the host is not renamed. But the MAC is changed. Every time!

Wayne Workman

@foglalt When you say the MAC is changed, are you changing it via the GUI or is this something that’s happening that is not supposed to happen? Please elaborate.

Foglalt

@wayne-workman It is simply the following:

pc1 comes in for installation/reinstallation; its MAC is registered in a “dummy-like host” (for example “laci1”).
an image is selected, the task is set to deploy, pc1 finishes and is turned off.
pc2 comes in for the same purpose; host “laci1” gets a MAC overwrite (MAC field selected in the GUI, new MAC typed in, update button pressed).

The missing host is detected by your script; previously it was detected when pc3 came in for processing.

This is why I asked before how others do mass cloning. My colleagues use this method because with it you don’t need to remake cloning groups (no image update, so you don’t need to change the image name). It goes like this: another 5 PCs arrive, you put them on the table, plug in the cables, register the new MACs and launch the cloning process. We normally never keep hosts in the database, as you may have seen in our database before. We only have a few in it: a few dedicated ones (like the image creator’s machine and some others).
