HIGH CPU Fog Services after update r5029 v6759
-
@Wayne-Workman So I checked both storage nodes, and here is what top shows on each:
SN 1 Master
top - 09:25:03 up 11 min, 2 users, load average: 5.06, 4.59, 2.69
Tasks: 196 total, 6 running, 190 sleeping, 0 stopped, 0 zombie
%Cpu(s): 80.6 us, 19.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 1014980 total, 880752 used, 134228 free, 73704 buffers
KiB Swap: 1037308 total, 0 used, 1037308 free. 442564 cached Mem

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1100 root     20   0   40444  20268  14684 R  44.1   2.0   4:14.83 FOGSnapinReplic
 1135 root     20   0   40444  20480  14896 R  39.5   2.0   4:13.61 FOGImageReplica
 1529 root     20   0   40444  20716  15136 R  38.9   2.0   4:11.95 FOGPingHosts
 1507 root     20   0   40444  20628  15044 R  38.5   2.0   4:14.75 FOGMulticastMan
 1482 root     20   0   40444  20476  14896 R  38.2   2.0   4:17.92 FOGTaskSchedule
 2614 fog      20   0   11284   3780   3008 S   0.3   0.4   0:00.29 sshd
 2630 fog      20   0    5572   2792   2320 R   0.3   0.3   0:00.98 top
    1 root     20   0    4732   3768   2504 S   0.0   0.4   0:02.64 init
    2 root     20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
    3 root     20   0       0      0      0 S   0.0   0.0   0:00.08 ksoftirqd/0
    5 root      0 -20       0      0      0 S   0.0   0.0   0:00.00 kworker/0:0H
    7 root     20   0       0      0      0 S   0.0   0.0   0:00.65 rcu_sched
    8 root     20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_bh
    9 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 migration/0
   10 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/0
   11 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/1
   12 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 migration/1
SN 2
top - 09:26:04 up 12 min, 2 users, load average: 5.05, 4.72, 2.84
Tasks: 221 total, 6 running, 215 sleeping, 0 stopped, 0 zombie
%Cpu(s): 84.5 us, 15.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 7735460 total, 918676 used, 6816784 free, 61320 buffers
KiB Swap: 7828476 total, 0 used, 7828476 free. 411264 cached Mem

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1437 root     20   0   40564  20524  14944 R  92.8   0.3   9:28.80 FOGTaskSchedule
 1176 root     20   0   40564  20400  14820 R  89.4   0.3   9:47.33 FOGMulticastMan
 1415 root     20   0   40564  20612  15032 R  77.5   0.3   9:27.74 FOGSnapinReplic
 1460 root     20   0   40564  20724  15148 R  75.8   0.3   9:36.11 FOGPingHosts
 1381 root     20   0   40564  20500  14920 R  64.5   0.3   9:34.41 FOGImageReplica
    7 root     20   0       0      0      0 S   0.3   0.0   0:00.36 rcu_sched
 2678 fog      20   0    5572   2868   2376 R   0.3   0.0   0:00.67 top
    1 root     20   0    4732   3896   2608 S   0.0   0.1   0:01.43 init
    2 root     20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
    3 root     20   0       0      0      0 S   0.0   0.0   0:00.03 ksoftirqd/0
    5 root      0 -20       0      0      0 S   0.0   0.0   0:00.00 kworker/0:0H
    8 root     20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_bh
    9 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 migration/0
   10 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/0
   11 root     rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/1
-
I’m guessing you have a lot of images and snapins, or don’t have really powerful machines running fog, or a combination of the two.
There has been code added recently to checksum the first 10 megs of image files, and I think perhaps snapin files too. This is part of replication; it’s there to ensure files match across storage nodes. This takes some amount of HDD and CPU usage. If the system can’t handle the load adequately, then the httpd sessions lag behind and begin to build up with cert building/encryption/decryption while communicating with the fog client.
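To make the cost of that check concrete, here is a minimal sketch of hashing only the first 10 MiB of a file and comparing two copies. It is an illustration, not FOG's code: the use of SHA-256 and the function names `partial_sum` and `same_head` are assumptions, since the thread does not say which hash is used.

```shell
# Hash only the first 10 MiB of a file, so large images are cheap to compare.
# ASSUMPTION: SHA-256 is used for illustration; the hash FOG uses is not
# stated in this thread.
partial_sum() {
    local file="$1"
    local bytes=$((10 * 1024 * 1024))
    head -c "$bytes" -- "$file" | sha256sum | cut -d' ' -f1
}

# Succeed when two copies (e.g. master copy and node copy) share the same
# partial hash.
same_head() {
    [ "$(partial_sum "$1")" = "$(partial_sum "$2")" ]
}
```

Each call reads up to 10 MiB from disk and hashes it, so running it for every image and snapin on a short timer adds up on a small machine.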
My advice to you would be to turn down your replication check frequency, and to also turn down your fog client check in frequency in order to reduce the overall load on your system.
Web Interface -> FOG Configuration -> FOG Settings -> FOG Client -> FOG_SERVICE_CHECKIN_TIME
and
Web Interface -> FOG Configuration -> FOG Settings -> FOG Linux Service Sleep Times -> IMAGEREPSLEEPTIME
and
Web Interface -> FOG Configuration -> FOG Settings -> FOG Linux Service Sleep Times -> SNAPINREPSLEEPTIME
These setting changes won’t take effect immediately, but the load should drop within 5 or so minutes.
-
@Wayne-Workman Ok, will try that if I can get the web interface to load…
And I have no plugins at all.
-
@Raymond-Bell said:
@Wayne-Workman Ok, will try that if I can get the web interface to load…
If it comes down to it, we can give you the MySQL commands to edit these values manually via CLI.
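A rough sketch of what such a manual edit could look like follows. The setting keys are the ones named above, but the table and column names (`globalSettings`, `settingKey`, `settingValue`) are assumptions that should be checked against your own database first, and `setting_sql` is a hypothetical helper that only prints the statement.

```shell
# Print an UPDATE statement for one FOG setting without executing it.
# ASSUMPTION: settings live in a table named globalSettings with columns
# settingKey and settingValue; verify your own schema before running anything.
setting_sql() {
    local key="$1" value="$2"
    # Only accept plain setting names and whole numbers of seconds.
    case $key in
        ''|*[!A-Z_]*) echo "bad key: $key" >&2; return 1 ;;
    esac
    case $value in
        ''|*[!0-9]*) echo "bad value: $value" >&2; return 1 ;;
    esac
    printf "UPDATE globalSettings SET settingValue='%s' WHERE settingKey='%s';\n" \
        "$value" "$key"
}
```

For example, `setting_sql IMAGEREPSLEEPTIME 3600` prints a statement you can review and then pipe into the `mysql` client against your FOG database. The value 3600 is only a placeholder.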
-
@Wayne-Workman I say let’s just start by stopping the fog services.
Restart apache2 (do not restart the fog services).
Considering the main box has a 15-minute load of 120 or more, a 5-minute load of 130, and a 1-minute load of 150, adjusting the times is not going to help at all.
sudo service FOGMulticastManager stop
sudo service FOGImageReplicator stop
sudo service FOGSnapinReplicator stop
sudo service FOGScheduler stop
sudo service FOGPingHosts stop
sudo service apache2 stop && sleep 60 && sudo service apache2 start
I have it waiting one minute after stopping httpd to give the main server some breathing room. These should be run on the “Main Server” and of course you can run them on the nodes if you want to.
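For context on those load numbers: load average roughly counts tasks that are running or waiting to run, so it is usually read against the number of CPU cores. A small sketch of that comparison, where `overloaded` is a hypothetical helper name:

```shell
# Succeed when the 1-minute load average exceeds the number of CPU cores.
# Both arguments default to the live values; pass them explicitly to test.
overloaded() {
    local load="${1:-$(cut -d' ' -f1 /proc/loadavg)}"
    local cores="${2:-$(nproc)}"
    # awk handles the floating-point comparison; exit status 0 means "overloaded".
    awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'
}
```

Running `overloaded && echo "load is above core count"` on the main server after stopping the services shows whether it has recovered yet.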
-
@Tom-Elliott Do I need to start the fog services back up, or reboot the server?
-
@Tom-Elliott After stopping services and adding start time
top - 10:13:03 up 25 min, 2 users, load average: 78.25, 102.97, 94.38
Tasks: 348 total, 151 running, 197 sleeping, 0 stopped, 0 zombie
%Cpu(s): 56.4 us, 42.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.3 si, 0.0 st
KiB Mem: 4355692 total, 3087288 used, 1268404 free, 115580 buffers
KiB Swap: 1046524 total, 0 used, 1046524 free. 2046424 cached Mem

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5509 mysql    20   0  366700  81812   9928 S  24.5   1.9   2:50.14 mysqld
 9466 www-data 20   0  111532  19076  12524 R   7.9   0.4   0:01.21 apache2
 9497 www-data 20   0  111520  19052  12532 R   7.6   0.4   0:01.01 apache2
 9408 www-data 20   0  111540  19232  12652 R   7.3   0.4   0:01.74 apache2
 9321 www-data 20   0  108848  17092  10668 R   7.0   0.4   0:03.56 apache2
 9509 www-data 20   0  111516  19060  12524 R   7.0   0.4   0:01.02 apache2
 9480 www-data 20   0  112532  21168  13356 R   6.6   0.5   0:01.20 apache2
 9542 www-data 20   0  108180  16388  10720 R   6.6   0.4   0:00.62 apache2
 9350 www-data 20   0  111048  19940  13484 R   6.3   0.5   0:02.03 apache2
 9420 www-data 20   0  112620  20932  13364 R   6.3   0.5   0:01.45 apache2
 9432 www-data 20   0  111532  20444  13492 R   6.3   0.5   0:01.44 apache2
 9446 www-data 20   0  108840  17232  10836 R   6.3   0.4   0:01.28 apache2
 9519 www-data 20   0  111520  19172  12660 R   6.3   0.4   0:00.65 apache2
 9540 www-data 20   0  112540  20908  13140 R   6.3   0.5   0:00.72 apache2
 9322 www-data 20   0  108844  17316  10892 R   6.0   0.4   0:03.12 apache2
 9399 www-data 20   0  111024  19928  13484 R   6.0   0.5   0:01.68 apache2
 9313 www-data 20   0  110716  20192  14052 R   5.3   0.5   0:04.32 apache2
 9370 www-data 20   0  112892  21304  13492 R   5.3   0.5   0:01.86 apache2
 9434 www-data 20   0  111524  20416  13492 R   5.3   0.5   0:01.57 apache2
 9462 www-data 20   0  108184  16196  10636 R   5.3   0.4   0:01.02 apache2
 9501 www-data 20   0  108184  16416  10848 R   5.0   0.4   0:00.95 apache2
 9548 www-data 20   0  108184  16404  10900 R   5.0   0.4   0:00.52 apache2
 9609 www-data 20   0  108100  14404   9164 R   5.0   0.3   0:00.15 apache2
 9601 www-data 20   0  108100  14404   9164 R   4.6   0.3   0:00.20 apache2
 9592 www-data 20   0  108160  15908  10336 R   4.3   0.4   0:00.32 apache2
 9340 www-data 20   0  108596  16884  10720 R   4.0   0.4   0:02.19 apache2
 9360 www-data 20   0  111556  20472  13492 R   4.0   0.5   0:01.97 apache2
 9423 www-data 20   0  108820  16588  10464 R   4.0   0.4   0:01.48 apache2
 9439 www-data 20   0  111532  20296  13356 R   4.0   0.5   0:01.44 apache2
 9514 www-data 20   0  111500  18828  12308 R   4.0   0.4   0:00.61 apache2
 9604 www-data 20   0  108100  14476   9164 R   4.0   0.3   0:00.12 apache2
 9607 www-data 20   0  107840  14344   9104 R   4.0   0.3   0:00.12 apache2
 9610 www-data 20   0  108000  14284   9040 R   4.0   0.3   0:00.12 apache2
 9606 www-data 20   0  108132  14360   9100 R   3.6   0.3   0:00.11 apache2
 9608 www-data 20   0  107856  14344   9104 R   3.6   0.3   0:00.11 apache2
 9329 www-data 20   0  108596  16924  10756 R   3.3   0.4   0:02.45 apache2
 9368 www-data 20   0  108712  17532  11500 R   3.3   0.4   0:01.75 apache2
 9393 www-data 20   0  108180  16444  10848 R   3.3   0.4   0:01.41 apache2
 9492 www-data 20   0  108192  16000  10464 R   3.3   0.4   0:00.88 apache2
 9504 www-data 20   0  108824  16828  10720 R   3.3   0.4   0:00.87 apache2
 9516 www-data 20   0  108172  16044  10528 R   3.3   0.4   0:00.69 apache2
 9561 www-data 20   0  108672  16204  10336 R   3.3   0.4   0:00.35 apache2
 9605 www-data 20   0  108100  14472   9100 R   3.3   0.3   0:00.10 apache2
 9611 www-data 20   0  107888  14284   9040 R   3.3   0.3   0:00.10 apache2
 9327 www-data 20   0  111628  20544  13492 R   3.0   0.5   0:02.73 apache2
 9436 www-data 20   0  108180  16476  10892 R   3.0   0.4   0:01.41 apache2
 9470 www-data 20   0  112848  21128  13364 R   3.0   0.5   0:01.27 apache2
 9310 root     20   0  107168  25596  21100 S   2.6   0.6   0:01.98 apache2
 9538 www-data 20   0  108688  16672  10764 R   2.6   0.4   0:00.64 apache2
-
@Raymond-Bell Adjust the fields I pointed out earlier, then I’d say reboot.
-
@Raymond-Bell For now, I’d say leave the services off until you are sure the load has finally balanced out.
Then start the services one at a time and watch the load closely. I don’t think starting them now would be a problem, but I’m starting to think you’re seeing conflicting runs due to potentially dual webroots being in place. (Again, it’s all theory, as I really don’t know.)
If you wait five minutes between starting the services, it should give you a good baseline as to which one (or several) is causing issues. If you see one service starting to drive up the load, stop it and go on to the next. If all services do the same, we have a baseline to try finding info on.
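The staged start described above could be scripted roughly like this. The service names are taken from the stop commands earlier in the thread; `staged_start`, `WAIT`, and `DRY_RUN` are hypothetical names introduced only for this sketch.

```shell
# Start the FOG services one at a time, pausing after each so the load it
# adds can be observed before the next one starts.
# WAIT (seconds, default 300) and DRY_RUN are hypothetical knobs for this sketch.
staged_start() {
    local wait_secs="${WAIT:-300}"
    local svc
    for svc in FOGMulticastManager FOGImageReplicator FOGSnapinReplicator \
               FOGScheduler FOGPingHosts; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "would start $svc"
        else
            sudo service "$svc" start
        fi
        sleep "$wait_secs"
        # uptime prints the 1, 5 and 15 minute load averages.
        echo "after $svc: $(uptime)"
    done
}
```

Try `DRY_RUN=1 WAIT=0 staged_start` first to see the order. If the load climbs after a particular service, stop that one with `sudo service <name> stop` before moving on.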
-
@Tom-Elliott Restarted all services like you suggested and now none of them are spiked, but I still see a lot of apache2 processes. Is this normal in this update?
-
I can confirm the 100% cpu usage issue. I don’t have any storage nodes in my setup and have a c2750 based atom processor running my FOG virtual machine. I only have 2 test images, each around 1G.
I found that when I downgraded to version 6753, the high cpu usage disappeared. 6755, 6757 and up all produced the issue.
The next git commit has this in the log.
Author: Tom Elliott <tommygunsster@gmail.com>
Date: Wed Mar 16 00:42:03 2016 +0000
Ensure variables are set even on initial startup (init.php).
git-svn-id: https://svn.code.sf.net/p/freeghost/code/trunk@5027 71f96598-fa45-0410-b640-bcd6f8691b32
Hope that helps!
-baggar11
-
@baggar11 what os is fog running on your server?
-
@Raymond-Bell is server load better?
-
@Tom-Elliott said:
@Raymond-Bell is server load better?
Yes on the server, but I did the same thing on the nodes and they all still run at high CPU.
-
Ubuntu 14.04 here
-
I just updated my home FOG setup (which includes many nodes) to r6769 and I cannot replicate the issue. I’m using Fedora 23
And my replication setting is set to 60 seconds, and I’ve got the slowest setup in town (running 4 OSs on a single Core 2 duo, and P4s with 100 meg switches).
It’s either an Ubuntu thing or a New Client related thing. And I’m leaning towards it being an Ubuntu thing.
It’s also possible that some particular scenario caused replication to go AWOL, but we won’t know until we can see a setup that is affected and figure out what’s going on.
-
I was just able to test my 1 fog client system at home. It doesn’t make any difference.
I think this is an Ubuntu issue.
-
@Wayne-Workman and all, I made a rather significant change (though from the outside it shouldn’t matter) to hopefully make an attempt at figuring this out.
Basically, please try out the latest. The first thing I noticed was a very similar result on my storage nodes (one is Ubuntu 15, the other Fedora 23), and I found that my particular issue was due to the service sleep time being parsed as a string rather than an integer. This would cause the FOG services to keep cycling (after initial reboot), probably due to improper connection finding. I’m hoping this is fixed, and that the FOG server also performs much better as a result.
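As an illustration of that class of bug (not FOG's actual code), here is a small shell sketch that normalises a configured sleep time to a whole number before it is used, falling back to a default when the value is not a plain integer. A service loop whose sleep value ends up meaning "no delay" would cycle constantly, which fits the CPU pattern reported above. `sleep_secs` and the 60-second default are hypothetical.

```shell
# Return a whole number of seconds, falling back to a default when the
# configured value is empty or not a plain non-negative integer.
sleep_secs() {
    local raw="$1" fallback="${2:-60}"
    case $raw in
        ''|*[!0-9]*) echo "$fallback" ;;   # empty, padded, or non-numeric
        *) echo "$raw" ;;
    esac
}
```

A service loop would then call `sleep "$(sleep_secs "$configured_value")"` so that it always pauses for a real interval between runs.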
-
@Tom-Elliott Thanks Tom. Have these changes been pushed to Git too? That’s what I’m using…