Disappearing hosts from host list

Foglalt

@wayne-workman hehe, sanitize I will. I already started reading it, but work came into the office in waves, so I postponed it until tomorrow. I will send it or post it here from work. Btw, I kept your address from the last DB transaction we did together.

Wayne Workman

@foglalt said in Disappearing hosts from host list:

Here is what happened now:
A task was deploying, and DURING that task the host data was updated with a new MAC address (I am aware that this is not best usage; it was not me).

No matter how it happened, we can learn from this and possibly improve FOG while doing so.

Foglalt

@wayne-workman

I realised that only the IP could have been sensitive, so I altered it; the rest is unmodified. I am curious whether the log helps this time. As I mentioned, it now has an extra step compared to normal daily work: the MAC was changed while the selected host was being deployed.

(The logs are too long to be inserted here, and only pictures can be uploaded, so I am sending them to your email.)

Wayne Workman

@foglalt

Got it, I will look over it tonight. At any rate, reset the trap.

Wayne Workman

@foglalt I looked over the log.

What I am seeing is that hostID 5 lost its primary MAC address on February 28th at 09:29:01 your time.

Knowing that you set this check up to run every minute, that means this problem happened between 9:28 AM and 9:29 AM.

Good thing the script gets the last 100 lines of the Apache access log; these relevant lines were nearly at the top of the output:

10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "POST /fog/management/index.php?node=host&sub=edit&id=5&tab=host-general HTTP/1.1" 302 726 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host&sub=edit&id=5 HTTP/1.1" 302 698 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"
10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host HTTP/1.1" 200 2848 "http://SERVER/fog/management/index.php?node=host&sub=edit&id=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0"

One POST, two GET requests. I’m going to assume the POST request caused the primary MAC to be disassociated from the host.
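The filtering described above (pulling host-edit POST requests for one host out of the Apache access log within a time window) could be automated along these lines. This is only a minimal Python sketch and not part of the monitoring script discussed in this thread: the names `parse_line` and `host_edit_posts` are made up for illustration, the sample lines are shortened copies of the ones above, and it assumes the log uses the combined format with English month names.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Matches the leading fields of an Apache "combined" log line.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    query = parse_qs(urlsplit(m.group("url")).query)
    return {
        "ip": m.group("ip"),
        # %b assumes English month abbreviations (C/English locale).
        "time": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": m.group("method"),
        "status": int(m.group("status")),
        "query": {key: values[0] for key, values in query.items()},
    }


def host_edit_posts(lines, host_id, start, end):
    """Collect POST requests to the host edit page for host_id in [start, end]."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        q = entry["query"]
        if (
            entry["method"] == "POST"
            and q.get("node") == "host"
            and q.get("sub") == "edit"
            and q.get("id") == str(host_id)
            and start <= entry["time"] <= end
        ):
            hits.append(entry)
    return hits


# Shortened copies of the three log lines above (referer and user agent cut).
SAMPLE = [
    '10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "POST /fog/management/index.php?node=host&sub=edit&id=5&tab=host-general HTTP/1.1" 302 726 "-" "-"',
    '10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host&sub=edit&id=5 HTTP/1.1" 302 698 "-" "-"',
    '10.10.36.124 - - [28/Feb/2018:09:28:10 +0100] "GET /fog/management/index.php?node=host HTTP/1.1" 200 2848 "-" "-"',
]

CET = timezone(timedelta(hours=1))
window_start = datetime(2018, 2, 28, 9, 28, 0, tzinfo=CET)
window_end = datetime(2018, 2, 28, 9, 29, 1, tzinfo=CET)

hits = host_edit_posts(SAMPLE, 5, window_start, window_end)
for hit in hits:
    print(hit["time"].isoformat(), hit["ip"], hit["query"])
```

Narrowing the window to the minute between two runs of the check keeps the output short enough to read by eye.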

I do not see any relevant logs from the history table during this time, and the apache error log is basically totally silent.

So now we have something to go on. I’ll try to replicate the problem this weekend but I do have a very busy weekend so I might not get to it. I am very confident that the primary MAC of this host is being either deleted or disassociated during a host information update via the GUI - on a POST request.

@Foglalt if you can attempt to replicate this problem via editing & saving host information and figure out exactly what’s causing it, it would help. @developers any insight or help in isolating the issue further would be appreciated.

Tom Elliott

If I had to guess, this is due to how FOG does MAC address updates. Manually entering a new MAC in the primary MAC field should cause the new MAC to become primary and turn the original MAC address into an associated MAC. I haven’t played too much with this action, however, as it can become extremely difficult to manage. I mean, we’re having to check four very different things at the same time (MAC, pending, ignore client, ignore image), not including the primary check itself. In a future version, and possibly 1.5.1, this will be handled, I think, much better. My intent is to move all MAC addresses to their own tab. Within that tab you have a single text entry to insert new MACs, then a table that will present the primary, pending, image ignore, and client ignore. Getting the logic designed for 1.5.x is a bit more difficult, but I do have this already coded for 1.6 (nowhere near ready for even alpha testing yet, though much closer than most may think).

Sebastian Roth

@foglalt said in Disappearing hosts from host list:

A task was deploying, and DURING that task the host data was updated with a new MAC address (I am aware that this is not best usage; it was not me).

This is probably the key here, right?!

Wayne Workman

@sebastian-roth It might be related. There are no entries in the history table at the time the host lost its primary MAC. There are entries in it for that host, just not during that 1 minute window.

I think the issue is a much larger scope work-flow bug rather than something you can see in a couple lines of code and point your finger at.

I’m still going to try to reproduce the problem.

Tom Elliott

So I just want to clarify here: he’s been running your script for one month. One person does one thing and the host magically disappears (this seems like a timing issue, considering the host was in a tasking). This isn’t a workflow problem, in my eyes. You find it great, but 1 host in 1 month due to 1 person making 1 change at the wrong time is NOT something that can be accounted for in all cases. The DB is being written to, and at the time of the “change” this might mean the host became invalidated because of the shift of MAC address at the same time the tasking is updating (which now becomes invalidated because the MAC is being switched).

I’m only working on what this seems like. Timing issues are never going to be fully fixable when working with stuff that writes to different areas of a database. The only way we could prevent this is to prevent any edits to any image, group, and/or host that has an item in a tasking. Again, though, this still has a temporary (albeit very small) timing issue, in that a person editing a host at the exact same time a person is deploying a host task could cause exactly the same issue.
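To make the idea of blocking edits on a host that has an active tasking concrete, here is a hedged Python sketch. It is not FOG code: FOG is written in PHP against MySQL/MariaDB, while this uses an in-memory SQLite database only to stay self-contained, and the `hosts` and `tasks` tables and the `update_primary_mac` function are hypothetical. Doing the check and the update inside one write transaction is one way to shrink the remaining timing window.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical schema."""
    # isolation_level=None: autocommit mode, transactions are managed by hand.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT, primary_mac TEXT);
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, host_id INTEGER, active INTEGER);
        """
    )
    return conn


def update_primary_mac(conn, host_id, new_mac):
    """Change the primary MAC only if the host has no active task.

    Returns True if the update was applied, False if it was refused.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so the check and the
    # update cannot be interleaved with another writer on this database.
    conn.execute("BEGIN IMMEDIATE")
    try:
        busy = conn.execute(
            "SELECT COUNT(*) FROM tasks WHERE host_id = ? AND active = 1",
            (host_id,),
        ).fetchone()[0]
        if busy == 0:
            conn.execute(
                "UPDATE hosts SET primary_mac = ? WHERE id = ?",
                (new_mac, host_id),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return busy == 0


conn = make_db()
conn.execute("INSERT INTO hosts VALUES (5, 'lab-pc', '00:11:22:33:44:55')")
conn.execute("INSERT INTO tasks VALUES (1, 5, 1)")  # host 5 is being deployed

refused = update_primary_mac(conn, 5, "66:77:88:99:aa:bb")
conn.execute("UPDATE tasks SET active = 0 WHERE id = 1")  # task finished
applied = update_primary_mac(conn, 5, "66:77:88:99:aa:bb")
print(refused, applied)
```

The guard only helps if task creation goes through the same kind of transaction; otherwise a task created between the check and the update would still slip through.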

I’m trying to give you this from a developer’s standpoint. Just stating it’s a “much more serious workflow bug” doesn’t help a single thing, because if it were a serious workflow bug, we’d have many, many, many more reports of this.

Foglalt

Thank you, guys, for cooperating on killing this bug-or-what.

@Wayne-Workman I will try to do frequent host updates at random (actually, I think it cannot be replicated during “non-tasking time”; maybe it is about load on the server).

@Sebastian-Roth I don’t think the key is “MAC is changed during imaging”. As far as I know, FOG database updates are done when things start and finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes. My guess is that the task finish doesn’t do anything with the MAC address: host data is updated at the end of a task, but why would it care about the MAC of the same host? I mean, the host is looked up for the update by host ID, not by MAC. But of course it is a guess; you guys know more of the details.

@Tom-Elliott I think the same. It is surely not a full-scale, everybody-hits-it kind of bug, as I see no other posts about it. I think it is related more to our methods of managing host deployment. That was the reason I asked on the forum before how people do this: to see what is different and how, to find the difference in our methods.

My guesses are these:

Maybe load on the server causes this timing issue.
Maybe the host details are updated one by one, and something can interfere with that, leaving the host “broken in a way”.

The strange part is that we are not a huuuuge company with tons of people using FOG. We have many hosts (about 2k) that we only clone with FOG. We do not manage them with the FOG client, no printer management, etc. We involve FOG only in host deployment. Very little of it is done “on-site” (computer labs); everything else is deployed, or even uploaded, in one office, where our two guys use FOG for daily deploying. That means very few concurrent database updates. Sometimes we do more, and sometimes very few (we actually do a full company PC upgrade in waves of timed deployment; as there are not many of us, we have to do it as a long row of deployments, not one or two huge company-wide waves). I mention this only to let you see behind the curtains. One month of running the script is not a big thing. In some parts of the year it means 100 deployments in 2 weeks, sometimes 100 over a few months.

Anyway, about this actual issue. If I take seriously what my colleague said (why would I assume they lie? This screws up their work, not mine, so they need a solution, not a cover-up), then this time the disappearance was “unique”, as they had not done this before (“we have never done it this way, ever”).

Btw, the script is re-armed to catch the next instance (maybe I will tune it further to send me a message as soon as it happens, as it took 1-2 days before I was even informed this time). So, waiting for the next occurrence.

Foglalt

Oh, one more thing. We use FOG version 1.4.4 at the moment. It was a fully clean install (OS, FOG, database; only the existing images were moved from storage to /images on the new machine). Is that relevant? We also installed a new server for use at a far site (a slow and unstable connection prevents replication, so images are only uploaded to that server once they are prepared). This remote site has 1.5.0, as it was finally released as stable (congrats and thanks from the community, by the way; we are happy to see it).

Should we upgrade before going on with debugging? (Actually, I don’t like the word upgrade at this point, as we have had this issue for a long time now and even a fully clean installation gave the same result, so maybe the cause lies in our methods, not only in FOG itself.) The difference between 1.5.0 and 1.4.4 is huge visually, but can 1.4.4 be upgraded by downloading the new tgz and running the installer? Won’t it corrupt the database if I upgrade between these two versions? (Our far site is a side project for me to experiment with FOG upgrades and test new versions more often.)

Junkhacker

@foglalt said in Disappearing hosts from host list:

As far as I know, FOG database updates are done when things start and finish. I mean, if I change the MAC during deployment, the database is updated on the change and again when the task finishes.

actually, it’s updated every couple of seconds during the task with the progress of the task.

Foglalt

@junkhacker You mean there are updates even if nothing changed? Like a refresh?

Junkhacker

@foglalt if an imaging task is taking place, the database is updated by the computer being imaged with the progress. you know the progress bar that says how complete the task is? that’s in the database.

Wayne Workman

I don’t understand how changing a MAC address while the host is being deployed could cause this, though. I’ve still not tried to replicate the issue.

Foglalt

@junkhacker Oh, you score a point there! I totally forgot this. Btw, does it write to the host data using the host ID, or does it use the MAC for lookups? If the latter, it may cause the issue, am I right?
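The difference being asked about can be shown with a toy model. This is purely illustrative Python and says nothing about how FOG actually looks up the host for progress updates, which is not established in this thread; the dictionary layout and both function names are invented for the example.

```python
def update_progress_by_id(hosts, host_id, percent):
    """Write progress to the host record found by its ID."""
    host = hosts.get(host_id)
    if host is None:
        return False
    host["progress"] = percent
    return True


def update_progress_by_mac(hosts, mac, percent):
    """Write progress to the host record found by its current MAC."""
    for host in hosts.values():
        if host["mac"] == mac:
            host["progress"] = percent
            return True
    return False  # no host carries this MAC any more


# Hypothetical host table: ID 5 is being imaged.
hosts = {5: {"mac": "00:11:22:33:44:55", "progress": 0}}
booted_mac = hosts[5]["mac"]  # the MAC the client booted with

# Someone edits the host mid-task and enters a new MAC.
hosts[5]["mac"] = "66:77:88:99:aa:bb"

by_id = update_progress_by_id(hosts, 5, 40)             # still finds the host
by_mac = update_progress_by_mac(hosts, booted_mac, 50)  # lookup now misses
print(by_id, by_mac, hosts[5]["progress"])
```

If the lookup were by MAC, a mid-task MAC change would leave the client reporting against a record that can no longer be found, which is the scenario the question is about.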

Foglalt

@Wayne-Workman The issue happened again. I am preparing the log that the script generated and will send it to you by mail again for analysis.

Wayne Workman

@foglalt Ok, I’ll go over it this evening. Also, have you considered updating to 1.5.2? I think 1.3 is so old now that it’s probably a waste of time to worry too much about it.

Foglalt

Er, my active one is not 1.3. When this first happened, it was about 1.2 or so. Then, as time passed, we tried many things, upgrades, etc. Not too long ago mine was the most up-to-date one (1.4.4); then of course, as development went on (happy to see that, btw), it “grew older”. But this has happened to me on every up-to-date version, so I think the true cause lies far deeper than it seems at first glance. I have hope in your script.

Wayne Workman

@foglalt I think the issue has been present for a long while. As soon as we know enough about it and can reproduce it, I would expect that a fix can be created quickly. The problem is knowing what’s causing it and knowing how to reproduce it. We were grasping at straws, so I wrote this script to grasp at straws more efficiently lol.
