1.3.4 - high cpu load - client login

andjjru

@Junkhacker /opt/fog/snapins

Sorry for the confusion earlier with us all participating here, I should’ve introduced myself.

Junkhacker

@andjjru i’m not entirely familiar with that part of the code, but the md5sum task must be part of the image replicator service. i knew we did that for snapins, but i didn’t think we did it for images. anyway, perhaps you should try disabling IMAGEREPLICATORGLOBALENABLED until most of the clients have had a chance to check in and reset their keys.

andjjru

@Junkhacker Alright I disabled IMAGEREPLICATORGLOBALENABLED and that process went away. Thanks.

Wayne Workman

@UWPVIOLATOR said in 1.3.4 - high cpu load - client login:

What is this doing that it is pulling so much resources? Happening multiple times a day.

Not long ago, a week or so, a change was made so the entire images got hashed instead of just the first 10 megs.

Turn the occurrence of this way down:
Web Interface -> FOG Configuration -> FOG Settings -> FOG Linux Service Sleep Times -> IMAGEREPSLEEPTIME. Set that to something like 24 hours (86400 seconds).

The default is 600 seconds. What I’m guessing is you have a ton of images and it takes hours to hash them all - thus your FOG Server is always slammed.
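To put rough numbers on that guess, here is a minimal sketch. It assumes the replicator hashes everything, then sleeps for IMAGEREPSLEEPTIME, then starts over; the total image size and hashing throughput are made-up illustration values, not measurements from anyone's server.

```python
def hashing_duty_cycle(total_image_bytes, hash_bytes_per_second, sleep_seconds):
    """Fraction of wall-clock time spent hashing, assuming hash-then-sleep cycles."""
    hashing_seconds = total_image_bytes / hash_bytes_per_second
    return hashing_seconds / (hashing_seconds + sleep_seconds)


# Hypothetical figures: 2 TB of images, 200 MB/s sustained hashing throughput.
TOTAL_BYTES = 2_000_000_000_000
THROUGHPUT = 200_000_000

for sleep in (600, 86400):  # the default vs. 24 hours
    share = hashing_duty_cycle(TOTAL_BYTES, THROUGHPUT, sleep)
    print(f"IMAGEREPSLEEPTIME={sleep}: hashing about {share:.0%} of the time")
```

With a short sleep time the server spends nearly all of its time hashing; raising the sleep time only shrinks that share, it does not make a single pass any cheaper.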

Tom Elliott

@Wayne-Workman I will go back to using 10 MB. It wasn’t using the first 10 MB though, because that was constantly pinging for traffic across FTP. It only checked the file sizes originally. While this worked, it failed to detect changes in files like d1.partitions that might have been updated.

So I re-added the “file hash” checking as a means, but made it so the hashing was done at the “local” nodes rather than at the single “side”.

I am trying to check things out.

I’m thinking about testing the last 10 MB of the file though, as it’s fully possible the first 10 MB would be the same, but much less likely that the last 10 MB would be.

Wayne Workman

@Tom-Elliott This appears to work for hashing the last 10 megs. It also works for files that are smaller than 10 MB.

[root@fog-server Acerbase]# tail -c 10485760 d1p1.img | md5sum
326ea3163c9bc3e202fa323e47f02b23  -
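
For anyone who wants the same check outside the shell, a minimal Python sketch of that one-liner follows. The function name `tail_md5` is just for illustration and is not something from the FOG code base.

```python
import hashlib
import os

TAIL_BYTES = 10 * 1024 * 1024  # 10485760, the same count passed to tail -c above


def tail_md5(path, nbytes=TAIL_BYTES):
    """MD5 hex digest of the last nbytes of a file (the whole file if it is smaller)."""
    size = os.path.getsize(path)
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # For files under nbytes this seeks to 0, matching what tail -c does.
        fh.seek(max(0, size - nbytes))
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `tail_md5("d1p1.img")` from the image directory should give the same digest as the tail/md5sum pipeline for the same file.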

Tom Elliott

@Wayne-Workman I’m probably going to go with sha512sum to reduce the potential for collisions (though md5 shouldn’t have too many).

Wayne Workman

@Tom-Elliott Doesn’t really matter what you choose now that we’re only going to hash the last 10 megs. Speed differences in them won’t be noticeable.

Tom Elliott

Updated working-1.3.5.

I want to push up RC-11, but want to hear more back about the inits (which will have to wait until at least tomorrow, I think).

UWPVIOLATOR

Good News! CPU load was low all day today. Mid day we turned the FOG Service back on at a few sites. We will monitor this tomorrow morning then add more clients back and report back.

What we did.

Disabled IMAGEREPLICATORGLOBALENABLED
Increased MaxRequestWorkers from 150 to 500 in mpm_prefork_module
Reset all host encryption: UPDATE hosts SET hostPubKey = '', hostSecToken = '';

Tom Elliott

Do you guys think it’s suitable to solve this yet, or just hold for a little bit?

Would you guys mind jumping on the working branch to see if the changes there will help fix the issue more directly? I believe the high load was coming from the constant md5summing that was happening for each image every cycle the replicator service was running.

I’ve switched to using the first and last 10 MB of each file on both the remote and local systems, hashing those together (thanks @Wayne-Workman and @Junkhacker) and comparing the results. It’s still entirely possible for load to get high (if it has to replicate multiple images/snapins at the same time), but the “checking” process should be less CPU intensive.
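
A rough Python sketch of that idea, not the actual replicator code: the function names are hypothetical, both files are read from local paths here for simplicity (in practice each node would compute its own hash), and the size check and the handling of files under 20 MB are my own assumptions.

```python
import hashlib
import os

CHUNK_BYTES = 10 * 1024 * 1024  # 10 MB from each end of the file


def head_tail_sha512(path, nbytes=CHUNK_BYTES):
    """SHA-512 over the first and last nbytes of a file, fed into one digest."""
    size = os.path.getsize(path)
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        digest.update(fh.read(nbytes))  # first nbytes (or the whole small file)
        if size > nbytes:
            # Assumption: never hash overlapping bytes twice for short files.
            fh.seek(max(nbytes, size - nbytes))
            digest.update(fh.read(nbytes))
    return digest.hexdigest()


def probably_same(path_a, path_b, nbytes=CHUNK_BYTES):
    """Cheap equality check: identical size and identical head+tail hash."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return head_tail_sha512(path_a, nbytes) == head_tail_sha512(path_b, nbytes)
```

The trade-off is that a change confined to the middle of a large file, with the size unchanged, would go unnoticed. That is the price of reading at most 20 MB per file instead of the whole image.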

ablohowiak

We had <25% of clients enabled this morning. We need to clean up the snapshots on the server before we can update. It might be this afternoon or Monday.

We’re still going to re-enable clients in steps just to be safe. By Tuesday morning all clients should be running again.

Thanks again for all your time and effort!

UWPVIOLATOR

@Tom-Elliott

One thing we noticed that is still not working is WOL to Groups. WOL works for individual hosts but not for a Group. Also, Report Management does not return anything for any of the reports.


Florent

@Tom-Elliott said in 1.3.4 - high cpu load - client login:

11

Hi,
We have the same problem.
Is this problem resolved in RC10?
Or when will RC11 be available?

Regards.

Tom Elliott

@Florent What is:

“We have the same problem.”?

I ask because there seem to be multiple issues being described in this thread, while the primary issue was related to high CPU. Are you referring to high CPU usage being an issue?

Florent

@Tom-Elliott
Thanks for your response (my English is not very good).

Yes, we have had high CPU usage since we deployed the new client (0.11.9) via GPO.

We tried modifying FOG_CLIENT_CHECKIN_TIME, but values over 60 seconds seem to have no effect.
In our client logs we generally see the client contacting the server every 60-200 seconds.
We have more than 1500 clients.

If this is the problem, is it possible to change the check-in time to 15 minutes?

Or, if this is not the problem, where can I find information to identify its source in detail?

Regards
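
For a rough sense of scale, here is a back-of-the-envelope sketch of the check-in load. It assumes each check-in is a single request and that clients are spread evenly over the interval; the 2-second request duration is a made-up value, not something measured.

```python
def checkins_per_second(clients, interval_seconds):
    """Average request rate if every client checks in once per interval."""
    return clients / interval_seconds


def concurrent_requests(clients, interval_seconds, seconds_per_request):
    """Little's law: average requests in flight = arrival rate * time per request."""
    return checkins_per_second(clients, interval_seconds) * seconds_per_request


CLIENTS = 1500
ASSUMED_REQUEST_SECONDS = 2.0  # hypothetical time Apache spends on one check-in

for interval in (60, 200, 900):  # 900 s is the 15 minutes asked about
    rate = checkins_per_second(CLIENTS, interval)
    busy = concurrent_requests(CLIENTS, interval, ASSUMED_REQUEST_SECONDS)
    print(f"every {interval}s: {rate:.1f} req/s, about {busy:.1f} workers busy")
```

Under these assumptions, moving from 60 seconds to 15 minutes cuts the average request rate by a factor of 15.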

UWPVIOLATOR

@Florent

Do you see something like this in your processes on your FOG server?

We Disabled IMAGEREPLICATORGLOBALENABLED until Tom fixes the image checking in the next RC.

Florent

@UWPVIOLATOR
Do we just change this in the web interface under FOG Settings, or do we also need to restart Apache afterwards?

Tom Elliott

@Florent In the GUI.

FOG Configuration Page->FOG Settings->FOG Linux Service Enabled

ablohowiak

Tom,
We added back about 75% of our clients and the load has remained stable and UI responsive. I was trying to update to 1.3.5-RC10, but the install failed.

Downloading inits, kernels, and the fog client…Failed!

Feb 28 12:47:24 FogDB systemd[1]: Starting MySQL Community Server...
Feb 28 12:47:26 FogDB systemd[1]: Started MySQL Community Server.
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)

At this point there’s basically no fog site in apache. I’m reverting back to my last snapshot.
