HIGH CPU Fog Services after update r5029 v6759


  • Testers

    My server is maxed out after the svn update, and I see a lot of apache2 processes running. I have stopped and restarted all FOG services and apache2. Same results…

    top - 08:54:26 up 39 min,  2 users,  load average: 150.15, 130.19, 120.70
    Tasks: 357 total, 131 running, 226 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 77.0 us, 21.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
    KiB Mem:   4355692 total,  3191892 used,  1163800 free,   108260 buffers
    KiB Swap:  1046524 total,        0 used,  1046524 free.  1993812 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     9601 root      20   0   42548  20896  15292 R  81.6  0.5   1:41.08 FOGSnapinReplic
     9610 root      20   0   42548  21276  15668 R  79.7  0.5   1:37.41 FOGTaskSchedule
     9581 root      20   0   42548  21008  15400 R  78.4  0.5   1:45.02 FOGImageReplica
     9619 root      20   0   42548  21016  15408 R  73.8  0.5   1:38.89 FOGPingHosts
     6818 www-data  20   0  111740  20580  13672 R   4.9  0.5   0:18.63 apache2
     6772 www-data  20   0  112616  21216  13180 R   3.6  0.5   0:16.42 apache2
     6931 www-data  20   0  111580  20780  13780 R   3.3  0.5   0:15.84 apache2
     8981 www-data  20   0  108844  17132  11068 R   3.3  0.4   0:01.87 apache2
     9206 www-data  20   0  108844  16768  10536 R   3.3  0.4   0:01.64 apache2
     7212 www-data  20   0  111540  21016  14348 R   2.9  0.5   0:08.32 apache2
     7122 www-data  20   0  110964  19332  12952 R   2.6  0.4   0:07.91 apache2
     9201 www-data  20   0  108828  17192  11068 R   2.6  0.4   0:01.64 apache2
     6232 mysql     20   0  386180  97720   9964 S   2.0  2.2   2:22.31 mysqld
     8582 www-data  20   0  109068  17304  10920 R   2.0  0.4   0:02.44 apache2
     8547 www-data  20   0  108756  17228  11068 R   1.6  0.4   0:02.07 apache2
     6732 www-data  20   0  112004  20700  13404 R   1.0  0.5   0:17.99 apache2
     6863 www-data  20   0  112088  21292  13784 R   1.0  0.5   0:17.01 apache2
     6948 www-data  20   0  109608  18104  11164 R   1.0  0.4   0:15.35 apache2
     6961 www-data  20   0  111872  20612  13320 R   1.0  0.5   0:13.70 apache2
     7042 www-data  20   0  111496  20684  13764 R   1.0  0.5   0:14.84 apache2
     7142 www-data  20   0  111340  19648  12888 R   1.0  0.5   0:09.01 apache2
     7189 www-data  20   0  111540  19476  12828 R   1.0  0.4   0:08.34 apache2
     8358 www-data  20   0  109068  17216  10920 R   1.0  0.4   0:02.07 apache2
     8529 www-data  20   0  108892  17096  10920 R   1.0  0.4   0:02.05 apache2
     8650 www-data  20   0  109084  17656  11068 R   1.0  0.4   0:01.81 apache2
     8717 www-data  20   0  108856  17152  10984 R   1.0  0.4   0:02.11 apache2
     8885 www-data  20   0  110168  18372  10920 R   1.0  0.4   0:01.60 apache2
     8936 www-data  20   0  108864  17144  11068 R   1.0  0.4   0:01.67 apache2
     9158 www-data  20   0  108704  16372  10600 R   1.0  0.4   0:01.53 apache2
     9212 www-data  20   0  110172  18076  10748 R   1.0  0.4   0:01.73 apache2
     6733 www-data  20   0  112860  22440  14300 R   0.7  0.5   0:22.91 apache2
     6805 www-data  20   0  113036  22132  13804 R   0.7  0.5   0:17.20 apache2
     6826 www-data  20   0  111832  21016  13764 R   0.7  0.5   0:16.97 apache2
     6890 www-data  20   0  112944  21836  13724 R   0.7  0.5   0:16.44 apache2
     6917 www-data  20   0  112920  21620  13404 R   0.7  0.5   0:16.21 apache2
     6938 www-data  20   0  111236  20484  13828 R   0.7  0.5   0:15.97 apache2
     7014 www-data  20   0  111804  21052  13828 R   0.7  0.5   0:16.61 apache2
     7050 www-data  20   0  111616  20652  13868 R   0.7  0.5   0:15.19 apache2
     7086 www-data  20   0  112304  20612  12888 R   0.7  0.5   0:08.34 apache2
     7091 www-data  20   0  111008  19468  13040 R   0.7  0.4   0:08.66 apache2
     7100 www-data  20   0  111268  19388  12952 R   0.7  0.4   0:08.75 apache2
     7110 www-data  20   0  108988  17280  11132 R   0.7  0.4   0:07.41 apache2
     7129 www-data  20   0  111544  20888  14200 R   0.7  0.5   0:08.41 apache2
     7155 www-data  20   0  110964  19684  13300 R   0.7  0.5   0:09.00 apache2
     7177 www-data  20   0  111260  19408  12952 R   0.7  0.4   0:08.47 apache2
     7195 www-data  20   0  111220  19520  12888 R   0.7  0.4   0:08.85 apache2
     7203 www-data  20   0  111316  19276  12804 R   0.7  0.4   0:07.28 apache2
     7208 www-data  20   0  111424  20100  13336 R   0.7  0.5   0:06.72 apache2
     7221 www-data  20   0  111376  19436  12892 R   0.7  0.4   0:08.27 apache2
     7229 www-data  20   0  112312  21008  13272 R   0.7  0.5   0:07.58 apache2
     8316 www-data  20   0  108876  16748  10600 R   0.7  0.4   0:02.31 apache2
     8495 www-data  20   0  111256  19016  12764 R   0.7  0.4   0:02.02 apache2
    


  • Wanted to post another update. Updated to 6893 this morning. Removed my sleep and service stop commands from rc.local and rebooted. CPU seems to be in check again. Thanks for all your hard work!



  • @Tom-Elliott
    I thought I had the latest version.
    I just updated and rebooted. It has now been working fine for a couple of minutes.

    I think it's OK now.


  • Senior Developer

    @boeleke What?

    You updated and are still seeing CPU load issues? The service_lib script now implicitly defines a sleep time if one cannot be found otherwise.


  • Testers

    @boeleke said:

    @Raymond-Bell

    You are right. CPU is fixed indeed … until you reboot :worried:

    OK, I have not rebooted yet.



  • @Raymond-Bell

    You are right. CPU is fixed indeed … until you reboot :worried:


  • Testers

    @Tom-Elliott After updating to r5054 this morning,
    it looks like it is fixed now, but I have not rebooted any of them…
    Server

    top - 07:57:51 up 23:03,  3 users,  load average: 134.00, 122.87, 122.12
    Tasks: 351 total,  32 running, 312 sleeping,   0 stopped,   7 zombie
    %Cpu(s): 55.3 us, 42.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.7 si,  0.0 st
    KiB Mem:   4355692 total,  3461944 used,   893748 free,    69372 buffers
    KiB Swap:  1046524 total,    16896 used,  1029628 free.  2231712 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    17801 mysql     20   0  386360  85396   9744 S  19.5  2.0   3:07.93 mysqld
    20119 www-data  20   0  111272  19556  13116 R  11.3  0.4   0:04.27 apache2
    21537 www-data  20   0       0      0      0 Z   9.9  0.0   0:01.06 apache2
    21239 www-data  20   0  111188  19952  13344 R   8.6  0.5   0:02.65 apache2
    19002 www-data  20   0  111044  19584  13116 R   7.0  0.4   0:08.43 apache2
    18759 www-data  20   0  111296  19812  13092 R   6.6  0.5   0:13.16 apache2
    19062 www-data  20   0  110988  19788  13752 R   6.6  0.5   0:07.04 apache2
    20017 www-data  20   0  110996  18988  12572 S   6.6  0.4   0:04.30 apache2
    18845 www-data  20   0  111300  19240  12688 R   6.3  0.4   0:13.16 apache2
    18875 www-data  20   0  111256  20412  13732 R   6.3  0.5   0:12.60 apache2
    
    

    Storage Node Master

    top - 07:58:14 up 19:25,  2 users,  load average: 0.01, 0.06, 0.12
    Tasks: 201 total,   1 running, 195 sleeping,   0 stopped,   5 zombie
    %Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:   1014980 total,   975276 used,    39704 free,   106172 buffers
    KiB Swap:  1037308 total,    22532 used,  1014776 free.   443116 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        1 root      20   0    4616   3552   2432 S   0.0  0.3   0:02.85 init
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
        3 root      20   0       0      0      0 S   0.0  0.0   0:03.88 ksoftirqd/0
        5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
        7 root      20   0       0      0      0 S   0.0  0.0   0:26.13 rcu_sched
        8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
        9 root      rt   0       0      0      0 S   0.0  0.0   0:00.02 migration/0
       10 root      rt   0       0      0      0 S   0.0  0.0   0:00.18 watchdog/0
       11 root      rt   0       0      0      0 S   0.0  0.0   0:00.20 watchdog/1
       12 root      rt   0       0      0      0 S   0.0  0.0   0:00.02 migration/1
    
    

    Storage Node 2nd

    top - 07:58:31 up 22:59,  3 users,  load average: 0.01, 0.02, 0.05
    Tasks: 227 total,   1 running, 225 sleeping,   0 stopped,   1 zombie
    %Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:   7735460 total,  7466792 used,   268668 free,    69996 buffers
    KiB Swap:  7828476 total,        0 used,  7828476 free.  6881904 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        7 root      20   0       0      0      0 S   0.3  0.0   0:29.70 rcu_sched
    11185 fog       20   0    5572   2852   2352 R   0.3  0.0   0:02.77 top
        1 root      20   0    4732   3844   2580 S   0.0  0.0   0:01.60 init
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
        3 root      20   0       0      0      0 S   0.0  0.0   0:04.09 ksoftirqd/0
        5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
        8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
        9 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration/0
       10 root      rt   0       0      0      0 S   0.0  0.0   0:00.30 watchdog/0
       11 root      rt   0       0      0      0 S   0.0  0.0   0:00.32 watchdog/1
       12 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration/1
    
    


  • @baggar11 & @Tom-Elliott

    As Baggar says: thanks for taking the time to look at these things.
    It's just annoying that it is taking 100% CPU and not working properly.
    Take your time, but it's probably a problem for all Ubuntu 14.04 users after one update.



  • @Tom-Elliott said:

    @baggar11 Okay, I’ve added a sleep time of 10 seconds, just in case of these situations where the int returned is 0. We must have at least 1 second sleeptime ( I believe ), and this just isn’t happening.

    The only word of caution I can think of, now, is while this might help with CPU Cycles, it will not make the FOG Services actually work properly as from Ubuntu’s standpoint the service is already running. While I could, potentially, come up with a way to restart the services more appropriately, I don’t know where to start at the moment.

    I’m not a coder so take this with a grain of salt. The following is a similar approach to another open source project I use.

    Make a script to run as a cron job every 5 minutes. On 1st boot, the script essentially checks for things like MySQL and network being up, and if all checks out starts the FOG services. All other times it runs, it would really only check for FOG services running and then exit. heh… :)

    Of course, it’s probably a waste of time, as I assume systemd will take care of these things more intelligently than regular init and/or upstart. Fedora is already using it, and Ubuntu’s next LTS will be running it.

    I really appreciate you taking a look into these things Tom. FOG is a very cool project.
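The cron idea described above could be sketched roughly as follows. This is only an illustration, not anything FOG ships: the function names are hypothetical, it assumes `mysqladmin` is installed, and it assumes the FOG init scripts answer `service NAME status` with a zero exit code when running. Note that it would not catch the case discussed in this thread where a service counts as "running" while it is spinning.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: start FOG services only once the DB answers.
# Service names are the ones listed elsewhere in this thread.
FOG_SERVICES="FOGImageReplicator FOGMulticastManager FOGPingHosts FOGScheduler FOGSnapinReplicator"

# Assumption: mysqladmin is available; "ping" exits 0 when the server is up.
db_ready() { mysqladmin ping --silent >/dev/null 2>&1; }

# Assumption: the init script supports "status" and exits 0 when running.
service_running() { service "$1" status >/dev/null 2>&1; }

start_service() { service "$1" start; }

ensure_fog_services() {
    if ! db_ready; then
        echo "database not ready, skipping"
        return 1
    fi
    for svc in $FOG_SERVICES; do
        if ! service_running "$svc"; then
            echo "starting $svc"
            start_service "$svc"
        fi
    done
}
```

To use it, one would add a final `ensure_fog_services` call to the script and run it from cron every 5 minutes, for example with a line such as `*/5 * * * * root /usr/local/sbin/fog-watchdog.sh` in `/etc/crontab` (the script path is made up for this example).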


  • Senior Developer

    @baggar11 Okay, I’ve added a sleep time of 10 seconds, just in case of these situations where the int returned is 0. We must have at least 1 second sleeptime ( I believe ), and this just isn’t happening.

    The only word of caution I can think of, now, is while this might help with CPU Cycles, it will not make the FOG Services actually work properly as from Ubuntu’s standpoint the service is already running. While I could, potentially, come up with a way to restart the services more appropriately, I don’t know where to start at the moment.



  • @Tom-Elliott said:

    @baggar11 as I understand it, this is still potentially due to starting the services before its dependencies have started. This is different from the pegged on update, I think.

    Can you simply try reinstall and see if it’s still happening? Or did update always work, only reboot caused these issues?

    For my upgrades, anything after 6753 would peg my cpu after a reboot. During and after the upgrade process using the install.sh, everything was fine. It’s only after the reboot that the services would max out the cpu. Probably the infinite loop issue as you wrote about below.


  • Senior Developer

    @baggar11 as I understand it, this is still potentially due to starting the services before its dependencies have started. This is different from the pegged on update, I think.

    Can you simply try reinstall and see if it’s still happening? Or did update always work, only reboot caused these issues?



  • @Tom-Elliott said:

    @baggar11, @Raymond-Bell I believe I may have fixed this now.
    With any luck, this issue will be gone now. Please update and let me know.

    Just updated to 6795, removed my “restart” lines from rc.local and rebooted. CPU was pegged again. Restarting the services manually brought cpu back down to idle.


  • Senior Developer

    Why might this affect the CPU? The services look to the DB to get timeout values (and logs). The DB itself is fully operational, but the connection has not yet been initiated by FOG, so as far as the services are concerned the DB is not available, and the lookup returns an int of 0 for the sleep time. Put that into an infinite loop (there are at least 2 that manage start/restart of the services). The loop is already told to sleep for some period of time on each pass, but that period is set to 0 (meaning 0 seconds), so it gets stuck spinning in the loop without ever actually starting.
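The failure mode described here (a loop whose sleep interval comes back as 0) and the fallback mentioned elsewhere in the thread can be illustrated with a small guard. This is a shell sketch for illustration only, not the code in FOG's service_lib; the function name `safe_sleep_time` is made up, and the 10-second default simply mirrors the value mentioned in the thread.

```shell
#!/bin/sh
# Hypothetical helper: turn whatever the DB lookup returned into a usable
# sleep interval, so a service loop can never spin with "sleep 0".
# $1 = value read from the database (may be empty, 0, or non-numeric)
# $2 = fallback in seconds (defaults to 10)
safe_sleep_time() {
    value=$1
    fallback=${2:-10}
    case $value in
        ''|*[!0-9]*)
            # Empty or not a plain non-negative integer: use the fallback.
            echo "$fallback"
            ;;
        *)
            if [ "$value" -ge 1 ]; then
                echo "$value"
            else
                echo "$fallback"   # 0 would make the loop busy-wait
            fi
            ;;
    esac
}
```

A service loop would then call something like `sleep "$(safe_sleep_time "$db_value")"` on each pass, where `db_value` stands for whatever the lookup produced.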


  • Senior Developer

    @baggar11, @Raymond-Bell I believe I may have fixed this now.

    My issue – in theory?

    I am moving almost EVERYTHING to static form (where I can). This allows me to get items back without having to instantiate a whole class object. The problem: the DB has to be available for FOG to read its info, and the services are checking for the item before the DB connection is established. This was totally an oversight, and for that I’m sorry to all.

    With any luck, this issue will be gone now. Please update and let me know.

    Thanks.



  • I just realized after going through another git pull upgrade that the CPU spike will probably happen every time since the services are reinstalled. Here’s how I fixed my problem again, hopefully in a cleaner way. After upgrading to 6791, it seems to be working so far with a couple reboot tests.

    Add this to your rc.local. Adjust sleep time based on your setup. My FOG setup is virtual, so 5 seconds seems to work well.

    sleep 5
    service FOGImageReplicator restart
    service FOGMulticastManager restart
    service FOGPingHosts restart
    service FOGScheduler restart
    service FOGSnapinReplicator restart
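As a variation on the fixed `sleep 5` above, the wait could poll for the database instead of guessing a delay. The sketch below is an assumption-laden alternative, not something from the FOG installer: `wait_for` is a hypothetical helper, and the commented usage assumes `mysqladmin` is installed and reachable from rc.local.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1 = maximum attempts, $2 = delay in seconds between attempts,
# remaining arguments = the command to run.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Possible use in rc.local (left commented out; assumes mysqladmin exists):
# if wait_for 30 1 mysqladmin ping --silent; then
#     for svc in FOGImageReplicator FOGMulticastManager FOGPingHosts \
#                FOGScheduler FOGSnapinReplicator; do
#         service "$svc" restart
#     done
# fi
```

The upside over a fixed sleep is that slower machines wait as long as they need (up to the limit) while fast ones restart the services almost immediately.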


  • Something weird. After the new update, my server's CPU usage dropped to 10-20%. After a reboot, it went back to 100%.

    Pretty annoying.


  • Testers

    @Tom-Elliott Yes, same result after running:

    sudo service FOGMulticastManager stop
    sudo service FOGImageReplicator stop
    sudo service FOGSnapinReplicator stop
    sudo service FOGScheduler stop
    sudo service FOGPingHosts stop
    
    
    sudo update-rc.d FOGMulticastManager disable
    sudo update-rc.d FOGImageReplicator disable
    sudo update-rc.d FOGSnapinReplicator disable
    sudo update-rc.d FOGScheduler disable
    sudo update-rc.d FOGPingHosts disable
    
    sudo service FOGMulticastManager start
    sudo service FOGImageReplicator start
    sudo service FOGSnapinReplicator start
    sudo service FOGScheduler start
    sudo service FOGPingHosts start
    
    top - 10:48:18 up  1:48,  2 users,  load average: 4.14, 4.92, 5.02
    Tasks: 202 total,   6 running, 196 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 66.6 us, 32.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
    KiB Mem:   1014980 total,   966204 used,    48776 free,    19852 buffers
    KiB Swap:  1037308 total,    12316 used,  1024992 free.   529972 cached Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     5970 root      20   0   41920  20164  15184 R  45.6  2.0   0:07.37 FOGSnapinReplic
     6026 root      20   0   41920  20392  15408 R  41.3  2.0   0:05.85 FOGPingHosts
     5990 root      20   0   41920  20240  15256 R  39.6  2.0   0:06.34 FOGTaskSchedule
     5950 root      20   0   41920  20488  15504 R  39.3  2.0   0:06.90 FOGImageReplica
     5930 root      20   0   41920  20116  15132 R  34.3  2.0   0:05.32 FOGMulticastMan
     5779 fog       20   0   11284   3900   3128 S   0.3  0.4   0:00.04 sshd
        1 root      20   0    4732   3788   2548 S   0.0  0.4   0:02.83 init
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
        3 root      20   0       0      0      0 S   0.0  0.0   0:00.78 ksoftirqd/0
        5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
        7 root      20   0       0      0      0 S   0.0  0.0   0:05.27 rcu_sched
        8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
        9 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration/0
       10 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 watchdog/0
       11 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 watchdog/1
       12 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/1
       13 root      20   0       0      0      0 S   0.0  0.0   0:00.12 ksoftirqd/1
    

  • Testers

    @Tom-Elliott said:

    @Raymond-Bell Did you restart the services after adding the rc.local? I’m sorry if I’m asking obvious questions, but I really just need to make sure. Current update should help alleviate CPU usage as well though.

    Yes, I restarted and rebooted, and I am also up to r5038 6777.
    I will do it again to make sure.


  • Senior Developer

    @Raymond-Bell Did you restart the services after adding the rc.local? I’m sorry if I’m asking obvious questions, but I really just need to make sure. Current update should help alleviate CPU usage as well though.

