[dev-branch] multicast: for some hosts DB not updated after restore

shruggy

See this post.
Apache logs don’t contain anything of note.
PHP-FPM log during (or shortly after) multicast restore sessions sometimes contains these warnings:

[04-Jan-2020 16:38:01] NOTICE: [pool www] child 29241 started
[05-Jan-2020 02:54:37] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 17 total children
[05-Jan-2020 02:54:38] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 22 total children
...
[25-Jan-2020 18:00:58] NOTICE: [pool www] child 9916 started
[25-Jan-2020 18:54:59] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 17 total children
[25-Jan-2020 18:55:00] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 22 total children
[25-Jan-2020 18:55:01] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 27 total children
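
To see how quickly the pool grows during a multicast session, the "seems busy" warnings can be reduced to their idle and total child counts. The following is a minimal sketch, assuming the log lines have exactly the shape shown above; the function name parse_busy_warnings and the regular expression are illustrative and not part of PHP-FPM or FOG.

```python
import re

# Pattern for the "seems busy" warning as it appears in the excerpt above.
# Assumption: the wording of the message matches the quoted log lines.
BUSY_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] WARNING: \[pool (?P<pool>\S+)\] seems busy .*?"
    r"spawning (?P<spawning>\d+) children, "
    r"there are (?P<idle>\d+) idle, and (?P<total>\d+) total children"
)


def parse_busy_warnings(lines):
    """Return one dict per 'seems busy' warning; all other lines are skipped."""
    events = []
    for line in lines:
        match = BUSY_RE.search(line)
        if match:
            events.append({
                "timestamp": match.group("ts"),
                "pool": match.group("pool"),
                "spawning": int(match.group("spawning")),
                "idle": int(match.group("idle")),
                "total": int(match.group("total")),
            })
    return events


# Lines copied from the 25-Jan-2020 excerpt above.
busy_msg = ("seems busy (you may need to increase pm.start_servers, "
            "or pm.min/max_spare_servers)")
sample = [
    "[25-Jan-2020 18:00:58] NOTICE: [pool www] child 9916 started",
    f"[25-Jan-2020 18:54:59] WARNING: [pool www] {busy_msg}, "
    "spawning 8 children, there are 0 idle, and 17 total children",
    f"[25-Jan-2020 18:55:00] WARNING: [pool www] {busy_msg}, "
    "spawning 16 children, there are 0 idle, and 22 total children",
    f"[25-Jan-2020 18:55:01] WARNING: [pool www] {busy_msg}, "
    "spawning 32 children, there are 0 idle, and 27 total children",
]

for event in parse_busy_warnings(sample):
    print(event["timestamp"], "idle:", event["idle"], "total:", event["total"])
```

In this excerpt the pool has no idle children for three consecutive seconds while the total climbs from 17 to 27.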

Output of egrep '^pm\.(start|min|max)' /etc/php-fpm.d/www.conf

pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 2000

Output of free -h on FOG server

              total        used        free      shared  buff/cache   available
Mem:           3,9G        362M        372M        8,6M        3,2G        3,5G
Swap:          4,0G          0B        4,0G

Output of lscpu | egrep '^Core|^Socket'

Core(s) per socket:    2
Socket(s):             1

Output of ps --no-headers -o rss,cmd -C php-fpm|awk '{sum+=$1}END{print sum/NR/1024"M"}'

21.2727M

Output of mysql -u root fog -e 'select l.taskID,s.tsName from taskLog as l,taskStates as s where l.taskStateID=s.tsID and l.id between 3282 and 3348 order by l.taskID'

+--------+-------------+
| taskID | tsName      |
+--------+-------------+
| 1701   | In-Progress |
| 1701   | Complete    |
| 1702   | In-Progress |
| 1702   | Complete    |
| 1703   | In-Progress |
| 1703   | Complete    |
| 1704   | In-Progress |
| 1704   | Complete    |
| 1705   | In-Progress |
| 1706   | In-Progress |
| 1706   | Complete    |
| 1707   | In-Progress |
| 1707   | Complete    |
| 1708   | In-Progress |
| 1708   | Complete    |
| 1709   | In-Progress |
| 1709   | Complete    |
| 1710   | In-Progress |
| 1710   | Complete    |
| 1711   | In-Progress |
| 1711   | Complete    |
| 1712   | In-Progress |
| 1712   | Complete    |
| 1713   | In-Progress |
| 1713   | Complete    |
| 1714   | In-Progress |
| 1714   | Complete    |
| 1715   | In-Progress |
| 1715   | Complete    |
| 1716   | In-Progress |
| 1716   | Complete    |
| 1717   | In-Progress |
| 1717   | Complete    |
| 1718   | In-Progress |
| 1718   | Complete    |
| 1719   | In-Progress |
| 1719   | Complete    |
| 1720   | In-Progress |
| 1721   | In-Progress |
| 1721   | Complete    |
| 1722   | In-Progress |
| 1722   | Complete    |
| 1723   | In-Progress |
| 1723   | Complete    |
| 1724   | In-Progress |
| 1724   | Complete    |
| 1725   | In-Progress |
| 1725   | Complete    |
| 1726   | In-Progress |
| 1726   | Complete    |
| 1727   | In-Progress |
| 1727   | Complete    |
| 1728   | In-Progress |
| 1728   | Complete    |
| 1729   | In-Progress |
| 1730   | In-Progress |
| 1730   | Complete    |
| 1731   | In-Progress |
| 1732   | In-Progress |
| 1732   | Complete    |
| 1733   | In-Progress |
| 1733   | Complete    |
| 1734   | In-Progress |
| 1735   | In-Progress |
| 1736   | In-Progress |
| 1736   | Complete    |
+--------+-------------+
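
In the table above, tasks 1705, 1720, 1729, 1731, 1734 and 1735 have an In-Progress row but no Complete row. Picking these out of the mysql output can be sketched as below. The helper names parse_mysql_table and tasks_without_complete are made up for illustration, and only a short excerpt of the table is embedded.

```python
def parse_mysql_table(text):
    """Parse two-column mysql table output into (task_id, state) pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or not cells[0].isdigit():
            continue  # skip the header row
        rows.append((int(cells[0]), cells[1]))
    return rows


def tasks_without_complete(rows):
    """Return the sorted task IDs that never reached the 'Complete' state."""
    seen = set()
    completed = set()
    for task_id, state in rows:
        seen.add(task_id)
        if state == "Complete":
            completed.add(task_id)
    return sorted(seen - completed)


# Excerpt of the query output above (tasks 1703 to 1706 only).
excerpt = """
+--------+-------------+
| taskID | tsName      |
+--------+-------------+
| 1703   | In-Progress |
| 1703   | Complete    |
| 1704   | In-Progress |
| 1704   | Complete    |
| 1705   | In-Progress |
| 1706   | In-Progress |
| 1706   | Complete    |
+--------+-------------+
"""

print(tasks_without_complete(parse_mysql_table(excerpt)))
```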

shruggy

@shruggy said in 1.5.7.89: partclone doesn't capture an image in dd mode: wrong options in fog.upload:

After the coming Microsoft Patch Day (probably over the next weekend) I am planning to capture another disk image with this and deploy it to my pool in multi-cast mode.

I did it last weekend and the results are mixed. Yes, the image was successfully captured and then restored to 36 PCs in multi-cast. But: On five hosts I got this error message after restoring the image:

Reattempting to update database: Failed

The image was restored successfully on those hosts nevertheless. Only the FOG database wasn’t updated. All 36 PCs are identical hardware.

In the Imaging Log the End column for those five hosts says:

-0001-11-30 00:00:00

while the Duration column says:

2020 years 1 month 18 days 15 hours 35 minutes 43 seconds

It looks like somehow the data for Start timestamp got written into Duration?

Sebastian Roth

@shruggy said in [dev-branch] multicast: for some hosts DB not updated after restore:

WARNING: [pool www] seems busy (you may need to increase pm.start_servers

How many hosts do you have with fog-client installed? From those logs I would assume you have a lot.

I would try adjusting /etc/php-fpm.d/www.conf to:

pm.max_children = 100
pm.start_servers = 10
pm.min_spare_servers = 10
pm.max_spare_servers = 20
pm.max_requests = 2000

Don’t forget to restart php-fpm after adjustment.
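
Before restarting, it can help to sanity-check that the numbers fit together. The sketch below encodes a few consistency rules for a pm = dynamic pool. These rules are an assumption about what PHP-FPM expects and are not taken from this thread; php-fpm -t remains the authoritative check. The function name check_dynamic_pool is hypothetical.

```python
def check_dynamic_pool(max_children, start_servers, min_spare, max_spare):
    """Return a list of problems with pm = dynamic settings (empty if none).

    Assumed rules, not an official specification:
      * spare server counts are positive
      * min_spare <= max_spare <= max_children
      * min_spare <= start_servers <= max_spare
    """
    problems = []
    if min_spare < 1 or max_spare < 1:
        problems.append("spare server counts must be positive")
    if min_spare > max_spare:
        problems.append("pm.min_spare_servers exceeds pm.max_spare_servers")
    if max_spare > max_children:
        problems.append("pm.max_spare_servers exceeds pm.max_children")
    if not (min_spare <= start_servers <= max_spare):
        problems.append("pm.start_servers is outside the spare server range")
    return problems


# The values suggested above.
print(check_dynamic_pool(max_children=100, start_servers=10,
                         min_spare=10, max_spare=20))
```

An empty list means none of the assumed rules is violated.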

You might also want to increase the fog-client check-in time (FOG web UI -> FOG Configuration -> FOG Settings -> …)

shruggy

@Sebastian-Roth You can mark this as solved now. I didn’t go with the adjustments you suggested, though: I wanted to first try the configuration suggested at https://www.sitepoint.com/php-fpm-tuning-using-pm-static-max-performance and it worked.

Here is an excerpt from my current /etc/php-fpm.d/www.conf (the changed lines are the first two and the last):

pm = static
pm.max_children = 40
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 35
pm.max_requests = 500

I have a pool of 38 identical hosts.

tec618

Hi.
The same thing is happening in our case (30 PCs with identical hardware, and the FOG server installed on Ubuntu 18.04). When multicasting to 12 PCs, on some hosts I received this error message after restoring the image: “Trying to update the database: Failed”, and the database (imagingLog table) does not record the end time of the deployment.

The Apache logs contain nothing of note and the PHP-FPM log contains no warnings. What could be happening in our case?

Thanks in advance

george1421

@shruggy I’m interested in this issue. How many systems do you typically image at the same time with multicast? How much memory do you have on the fog server?

george1421

@tec618 Can you follow shruggy’s guidance? Update the www.conf file (it will be somewhere under /etc; hint: find /etc -name www.conf): change pm to static (pm = static) and set pm.max_children = 50. Save the file, then run sudo systemctl restart php-fpm to restart the php-fpm service.

We will need to watch the available RAM on your system, since each php-fpm child process will consume a bit of memory.
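
A rough upper bound for the pool's memory use is the number of children times the average resident size of one child. The sketch below uses the figures shruggy posted earlier in this thread (about 21.27 MB average RSS per php-fpm process, 3.5 GB shown as available by free -h). Those numbers come from a different server than tec618's, so they serve only as an illustration, and the function name estimate_pool_rss_mb is made up. RSS also counts memory shared between children, so the real footprint is likely lower.

```python
def estimate_pool_rss_mb(max_children, avg_rss_mb):
    """Upper-bound estimate of pool memory: children times average RSS."""
    return max_children * avg_rss_mb


AVG_RSS_MB = 21.2727       # average RSS per php-fpm process, from the ps/awk output
AVAILABLE_MB = 3.5 * 1024  # "available" column of free -h (3,5G)

for children in (40, 50, 100):
    estimate = estimate_pool_rss_mb(children, AVG_RSS_MB)
    share = estimate / AVAILABLE_MB
    print(f"{children} children: about {estimate:.0f} MB "
          f"({share:.0%} of available memory)")
```

With 50 children this comes to a little over 1 GB, which fits comfortably into the memory reported on that server.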

tec618

OK, I will follow @shruggy’s guidance and report the results tomorrow.

In any case, note that the FOG server is a virtual machine with Ubuntu 18.04 and 4 GB RAM. The host server runs the latest version of CentOS 7 and virtualizes with KVM.

shruggy

@george1421 said in [dev-branch] multicast: for some hosts DB not updated after restore:

@shruggy How many systems do you typically image at the same time with multicast? How much memory do you have on the fog server?

Usually, it’s 36 systems at once. The setup is similar to @tec618’s: FOG on a VM with 4GB RAM, but both the VM and the hosting server run CentOS 7, and it’s Xen, not KVM. PHP 7.3 from Remi’s repo.
