FOG Unresponsive under "heavy" load
-
Ubuntu 14.04
FOG 1.5.3-1.5.4
I have run into a strange issue ever since updating to 1.5.3, and it continues in 1.5.4. It all started when I went to image a lab of 27 PCs the other day. We PXE booted and quick deployed each client (the way we do it all of the time). Everything seemed to be going fine: 20 hosts started imaging and the others sat in the queue (our max clients on the storage node is 20). I walked away, and when I came back the lab was in various states of error. Some had quit imaging partway through and others never started. I didn’t catch the actual error as we went into panic mode, but it said something about being unable to open the web server at our FOG server’s IP address.
I went to a non-imaged PC and tried to get to the web interface. It timed out. I suspected the apache2 service had crashed, so I checked its status. It appeared to be running, but I restarted it anyway. No change. I rebooted the FOG server and shotgun-upgraded to 1.5.4. I also noticed on the vSphere console that memory usage had spiked to the top (8GB). The combination of all this brought it back up. This was 2 days ago.
Today, I tried to wipe the imaging log, as it had built up about 3 years of entries. I also tried to update the kernel to Tom Elliott’s 4.17.0 build. The UI became unresponsive and, if refreshed, timed out. The same thing happened afterward: I was unable to PXE boot to FOG. I restarted the apache2 service and rebooted, which made it responsive once more.
Prior to the 1.5.3 update, everything seemed stable. I am now worried about the reliability of my FOG setup. Please let me know any logs I can provide to help troubleshoot.
Thanks!
-
@fry_p OK, I have a feeling I know what it is, but let’s collect some information.
When you look at top and sort by processor, what consistently holds the top CPU spots?
What about top memory?
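If it’s easier than eyeballing System Monitor, something like this from a terminal should give the same picture (just a quick sketch; inside top you can also press Shift+P or Shift+M to sort by CPU or memory):
ps aux --sort=-%cpu | head -n 10
ps aux --sort=-%mem | head -n 10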
-
Hi @george1421 ,
At present, things seem quiet, but here are screenshots of System Monitor on Ubuntu.
Top CPU usage:
Top Memory Usage:
So yeah, when not in crisis mode, things seem normal to me.
-
@fry_p That tells me that php-fpm is doing its job and serving the PHP pages. We just found an issue on Debian where it wasn’t.
So it looks like you are running into the issue only during multicasting? Or was that 20 unicast images?
-
@george1421 20 unicast deployments. I also seem to have triggered it when truncating the imaging log in MySQL.
-
@fry_p Well, I have no proof of this, but my intuition is telling me that php-fpm is probably running out of memory when you are unicasting to that many systems. So as an experiment I want you to do this:
- We need to locate a file called www.conf in the /etc directory. It should be in a directory that has php-fpm in the path. Use this command:
find /etc -name www.conf
- Edit that file. Down towards the bottom you should see a section with a few entries that start with php_admin_value. Add a new line with this (see the example after these steps):
php_admin_value[memory_limit] = 256M
The exact placement of the line doesn’t really matter, but keep it in the admin value section.
- Save and exit your text editor.
- Restart php-fpm and apache (make sure you don’t have imaging running when you do this):
sudo systemctl restart php-fpm
sudo systemctl restart apache2
Now when you have time, or on your next big image push, see if you run into the issue again.
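For reference, once the line is in, the tail of www.conf should look something like this (the surrounding entries are only illustrative; they vary by distro and PHP version):
php_admin_value[error_log] = /var/log/fpm-php.www.log
php_admin_flag[log_errors] = on
php_admin_value[memory_limit] = 256M
Also note that on Ubuntu the php-fpm unit is usually versioned, so the restart command may need to be something like:
sudo systemctl restart php7.0-fpm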
-
@george1421 We will be doing an exorbitant amount of lab re-imaging this summer, so testing will not be an issue. I will do this and certainly report back, probably in the next few days.
-
@george1421 I made the changes, but it got me thinking. I’ve been meaning to rebuild our FOG server on a more proper OS (CentOS 7), so on Friday night I did just that. I’ll let you know if I have any issues with mass unicasting now, but the variables have changed. I feel a lot better about the stability with the new install for now.
-
I’m wondering if the Ondemand FPM handler is the better choice for FOG in general, and for cases like this specifically.
In my experience, Ondemand is only marginally slower than Dynamic or Static, but uses far less RAM on average. It’s also far easier to set up correctly since you don’t have to tune minimum children or anything like that.
The problem with the current setup is that FPM processes that have claimed a lot of RAM will only respawn after they’ve hit the request limit, which could take ages in certain scenarios.
With Ondemand you can specify an idle timeout, so that if a process is doing nothing it will be killed off and the memory freed back to the system.
I would also recommend the Event MPM for Apache alongside this. There is little point in staying with Prefork when we are using FPM anyway.
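Something along these lines in the FPM pool config (the same www.conf mentioned above) is what I mean; the exact numbers are just a sketch and should be sized to the server:
pm = ondemand
pm.max_children = 20
pm.process_idle_timeout = 10s
pm.max_requests = 500
And switching Apache to the Event MPM on Ubuntu/Debian is roughly this, assuming mod_php is currently enabled (the php module name depends on the installed PHP version):
sudo a2dismod php7.0 mpm_prefork
sudo a2enmod mpm_event proxy_fcgi
sudo systemctl restart apache2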
-
I am running Ubuntu 16.04 server in VMware. I was never able to get 1.5.4 to multicast until I made the changes outlined here to the www.conf file. I was suffering from the same things fry_p had problems with: downloading boot.php would just sit at “…” for days. Now it works (like it used to).
Thank you, george1421!
Edit: Well, I take that back (a little bit). After a 20-PC multicast session, none of the PCs were able to ‘update the database’. I had to cancel the session, reboot the FOG server, and reboot the PCs. At least the image was successfully blasted out, otherwise I would be having a bad day right about now.
-
@librarymark
And while trying to multicast 8 PCs, now I get this again:
-
@librarymark
And after I reboot the server and the multicast actually runs, the PCs are stuck at this:
and FOG’s webpage says this:
-
@librarymark Do you get php memory exhaustion in the logs?
Would also be interested in seeing your
free -m
and
top
(Shift+M) stats when this happens.
-
@librarymark said in FOG Unresponsive under "heavy" load:
Edit: Well, I take that back (a little bit). After a 20-PC multicast session, none of the PCs were able to ‘update the database’. I had to cancel the session, reboot the FOG server, and reboot the PCs. At least the image was successfully blasted out, otherwise I would be having a bad day right about now.
It would be interesting to know the memory usage when this broke.
Also, just for clarity, what updates did you make to the www.conf file? Just upping the memory to 256MB?
-
@librarymark OK, for the gateway timeout, let’s work with that. I think if you look in the apache error log, you will see a PHP timeout waiting for php-fpm to respond. What we need to do is tell apache to wait a bit longer before timing out.
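If you want to confirm before changing anything, something like this should surface the relevant entries on Ubuntu (the log path may differ on other distros):
grep -i 'timeout\|proxy_fcgi' /var/log/apache2/error.log | tail -n 20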
About how long does it take to push out your image to 20 computers?
-
@george1421
I just upped the memory in /etc/php/7.1/fpm/pool.d/www.conf:
php_admin_value[memory_limit] = 256M
-
@librarymark What I want you to test is outlined in this post: https://forums.fogproject.org/topic/11713/503-service-unavailable-error/40
I want you to update this section
<Proxy "fcgi://127.0.0.1:9000">
    ProxySet timeout=500
</Proxy>
Set the timeout in seconds to be just a bit longer than your push time.
-
Where do I find the “push time”?
I edited the file /etc/apache2/sites-enabled/001-fog.conf, and it now looks like this:
<VirtualHost *:80>
    <Proxy "fcgi://127.0.0.1:9000">
        ProxySet timeout=300
    </Proxy>
    <FilesMatch "\.php$">
        SetHandler "proxy:fcgi://127.0.0.1:9000/"
    </FilesMatch>
    KeepAlive Off
    ServerName 10.5.0.61
    DocumentRoot /var/www/html/
    <Directory /var/www/html/fog/>
        DirectoryIndex index.php index.html index.htm
    </Directory>
    RewriteEngine On
    RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK)
    RewriteRule .* - [F]
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-d
    RewriteRule ^/fog/(.*)$ /fog/api/index.php [QSA,L]
</VirtualHost>
Is that correct? In any case I will not be able to test now because we just opened (public library). It might be a few days.
-
@librarymark Right, that looks good. Make sure you set the timeout to the right number of seconds. Right now, as configured, apache will wait 5 minutes for php-fpm to respond before giving up. If your image push time is more than 5 minutes, you need to adjust this number.
[edit] Sorry, I was not clear: “push time” is the time it takes to send the image to all 20 computers when using a multicast image.
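As a rough example: if a full push takes 7 minutes, that is 420 seconds, so something like this would leave a little headroom:
<Proxy "fcgi://127.0.0.1:9000">
    ProxySet timeout=450
</Proxy>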
-
@george1421
My multicast sessions usually take about 5-7 minutes to complete. Is that what you mean?