FOS checkin time

Wayne Workman

This is a needed feature.

FOS checkin time should be a kernel argument that defines how often a host waiting in line to image checks in with the FOG Server for an open slot.

Right now, it’s set to 3 seconds. I recommend that the default be 30 seconds, but if this feature is implemented I’d gladly change whatever the default is to 30 seconds myself.

Here’s my reasoning for 30.

Hosts boot and check in at different times; they usually don’t all check in at exactly the same moment. A 30 second interval doesn’t mean that when one computer is done, it’s 30 seconds before the next starts. It just means that COULD happen. Equally, the next computer COULD begin imaging within the next second!

What I’m asking for is to allow the FOS checkin interval to be user-controlled from the web interface.

Thanks,
Wayne

Tom Elliott

@Wayne-Workman Checkin “checks” every 5 seconds.

Tom Elliott

I disagree that this should be a user definable element. We keep a record of how long the queued tasks have been waiting, so that one system being unplugged or removed does not prevent the rest of the queue from moving along.

I am working to fix the progress reporter, as I believe it is set to check in every three seconds. While it does put a bit of polling load on the server, this is typically very minimal even at 3 second intervals.

Wayne Workman

@Tom-Elliott I don’t understand how the timekeeping couldn’t all be dynamically based on a user-adjustable number. How would one computer being unplugged keep others from continuing - and how is that tied to the check in time?

I’m specifically talking about the FOS checkin time that waits for a slot. Even 5 seconds is a lot when there could potentially be 500 in queue. Having 50 in queue while replication to 14 nodes and one upload are happening simultaneously renders MySQL unable to keep up. Anything at all to reduce the load on MySQL would help, and making this user definable would solve it for us.

ITSolutions

I guess I am not really understanding what this would solve. The check in is a small packet that shouldn’t slow down the imaging process. Even with a hundred machines checking in, I don’t see where this would make much impact on the system.

If I am wrong could you explain what purpose this would serve?

Wayne Workman

@ITSolutions It’s when the system is under very heavy load. The little things DO count. This is how Linux developers have been forever, little things count. Little efficiencies here and there add up. With the mindset of “Oh, that little thing, I don’t care”, soon with many of those you have a bloated inefficient system - like Windows.

But anyway, yesterday we had 1 image capturing, 6 computers deploying, and replication of a snapin and a new image (uploaded earlier) to 14 storage nodes. To put the cherry on the cake, 50 hosts were queued waiting for image deployment.

Those 50 hosts, every 5 seconds, check for a slot. The server was under tremendous load already. I think 5 seconds is excessive, and I want to set a custom value of 30 (and I will). MySQL could not keep up. I was getting the “Update Schema” page left and right, intermittently, when trying to do anything during this time. My custom status reporting script also quit working during this time; it kept erroring with “too many connections”. I did raise the MySQL max connections to 500 and that seemed to help, but anything at all to reduce load and improve efficiency is still a good thing.
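To put rough numbers on that polling load, here is a minimal sketch in Python. It assumes each queued host makes exactly one slot-check request per interval, which is a simplification; `checkins_per_second` is a hypothetical helper name, not part of FOG.

```python
def checkins_per_second(queued_hosts: int, interval_seconds: float) -> float:
    """Average slot-check requests per second reaching the server.

    Assumption: every queued host sends exactly one request per interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return queued_hosts / interval_seconds


if __name__ == "__main__":
    # Compare the intervals mentioned in this thread for 50 queued hosts.
    for interval in (3, 5, 30):
        rate = checkins_per_second(50, interval)
        print(f"50 hosts, {interval}s interval: {rate:.2f} requests/second")
```

Under that assumption, 50 hosts at a 5 second interval produce 10 requests per second, and a 30 second interval cuts that to roughly 1.7.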

And - I’d argue very strongly that 30 seconds is absolutely acceptable. It doesn’t mean there’s a 30 second wait between one host getting done and another starting. It means there will be a 0 - 30 second wait, and the more computers there are waiting at a time, the less chance the wait will be anywhere near 30 seconds. Plus 30 seconds is not a long time. I want a 30 second checkin time for the waiting phase.
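The "0 - 30 second wait" argument can be made concrete with a small sketch. It assumes every waiting host polls on the same fixed interval and that their checkin times are spread independently and uniformly across that interval; real hosts may not behave this neatly. Under that assumption, the delay between a slot opening and the first host noticing is the minimum of N uniform draws, whose mean is interval / (N + 1). The function names are illustrative only.

```python
import random


def expected_wait(hosts: int, interval: float) -> float:
    """Mean of the minimum of `hosts` independent Uniform(0, interval) draws."""
    return interval / (hosts + 1)


def simulate_wait(hosts: int, interval: float,
                  trials: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each host's next checkin lands somewhere in the coming interval;
        # the freed slot is claimed by whichever host checks in first.
        total += min(rng.uniform(0, interval) for _ in range(hosts))
    return total / trials


if __name__ == "__main__":
    for hosts in (1, 12, 50):
        print(hosts, expected_wait(hosts, 30), simulate_wait(hosts, 30))
```

With 12 hosts and a 30 second interval the formula gives 30/13, about 2.3 seconds; with 50 hosts it is 30/51, about 0.6 seconds.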

Wayne Workman

Server

FOG Version: 1.3.0 RC-20 svn 6011
OS: CentOS 7

Description

I’m again experiencing very high CPU usage due to Apache and MySQL being slammed.

I figured out this is due to a mere 12 hosts that are queued for imaging. The server doing the imaging isn’t even the main server, it’s a remote node. And these 12 hosts reporting in over and over so often, with probably inefficient SQL and methodology, is killing the main server’s 8 cores.

Again, 12 hosts queued for imaging is doing this, maxing out an 8-core server that isn’t even the server doling out the images.

This sort of load makes the FOG system as a whole almost unusable. 12 hosts queued for imaging and waiting for an open slot is causing a 15-server system to be almost unusable.

Tom Elliott

@Wayne-Workman The issue isn’t the reporting in. There’s literally a delay of 3 seconds between checkins. This is unlikely to be what’s causing your high load.

Progress is only updated for each host every 3 seconds. This is why there’s a 5 second delay on the task management page.

More likely, 12 hosts imaging means that you have 12 open connections to the db (by proxy of the node). The transfer of the data to the db is nearly instant (whatever the delay would be to run 12 individual SQL statements).

This (also) is unlikely causing a high load.

The fact that you had a capture going (writing) and 6 deploys going (6 different reads) accounts for a portion of the load on the server.

If you want to disable persistent connections (which should prevent your ‘too many connections’ issue), edit the file:
/var/www/fog/lib/db/pdodb.class.php at line 64 and change the true to false. This would tell you quite quickly whether things are working properly. I use persistent connections in an attempt to help speed up connections to the server, as many times the data being requested is coming from a “continuous” source.

If you want to set a different timeout, feel free to edit the inits. Specifically, edit the file in the FOS filesystem located at:

/bin/fog.statusreporter

Lines 6 and 14.

Change them from:
usleep 3000000
to
sleep 30
or
usleep 30000000
changing the 30 to whatever you want the value to be (usleep takes microseconds, so 30000000 is 30 seconds). I doubt it will help the scenario unless you disable the persistent connections, though.

(Either way, the connections will have to be made, but the load is not likely coming from the updating.)
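As a rough illustration of the delay edit described above, here is a small Python sketch that rewrites `usleep` lines in a script's text to a new delay given in seconds. The contents of /bin/fog.statusreporter are not shown in this thread, so the sample script in the usage block is purely hypothetical, and `set_checkin_delay` is an illustrative helper, not a FOG tool.

```python
import re

# Matches a line consisting only of "usleep <microseconds>", keeping its indent.
USLEEP_LINE = re.compile(r"^([ \t]*)usleep[ \t]+\d+[ \t]*$", re.MULTILINE)


def set_checkin_delay(script_text: str, seconds: int) -> str:
    """Return script_text with every standalone usleep line set to `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    microseconds = seconds * 1_000_000  # usleep takes microseconds
    return USLEEP_LINE.sub(
        lambda match: f"{match.group(1)}usleep {microseconds}", script_text
    )


if __name__ == "__main__":
    # Hypothetical stand-in for the real script, which is not quoted here.
    sample = "#!/bin/sh\nwhile true; do\n    usleep 3000000\ndone\n"
    print(set_checkin_delay(sample, 30))
```

The helper only touches lines that consist of a bare `usleep` call, so other commands in the script are left alone.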

You can have a look at your /var/www/fog/service/progress.php file as well if you’d like.

Wayne Workman

@Tom-Elliott said in FOS checkin time:

The fact that you had a capture going (writing) and 6 deploys going (6 different reads) accounts for a portion of the load on the server.

That wasn’t the case yesterday.

Yesterday, no captures were going. 3 computers were imaging, 12 were waiting in queue to image. The 3 were imaging from a remote server, not even the main server. And the main server’s CPU was maxed out.
