FOG v1.5.7 on Ubuntu 18.04 random problem multicast

tec618

Exactly @Sebastian-Roth, PCs wait on the blue screen forever and do not start the next partition. It does not always occur on the same computer or on the same partition. In this image we have until nine partitions.

I can also prove what he says @george1421 because it happens in the same classroom.

george1421

@tec618 said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

I can also prove what he says @george1421 because it happens in the same classroom.

I can see the logic in that if there are not enough php-fpm workers to service the requests that the target systems may appear to hang if the server is to late to respond to the clients request. At this moment I don’t know if its one worker per multicast client or if one php-fpm worker can service multiple requests from multiple clients. We haven’t research it to that level yet.

How many computers do you have running the fog client? What is your client check in time out?

I agree that I’m probably off base here, but in the other post your conditions seem similar.

Sebastian Roth

@tec618 said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

It does not always occur on the same computer or on the same partition

So does that mean it’s not always happening on the first partition but usually on one of the later partitions?

In this image we have until nine partitions.

While FOG should be able to handle that amount of partitions I am still wondering why you have that many? Dual boot Windows/Linux? ChromeOS?

tec618

Hello again.
After modifying the file “www.conf”, I have returned to perform a multicast task with 6 identical computers and this is the result:

4 computers have completed the task perfectly
1 computer has failed to update the database (as described in the post: https://forums.fogproject.org/topic/14143/dev-branch-multicast-for-some-hosts-db-not-updated-after-restore/2?_=1580905777518)
and another computer has been locked when switching to partition 8 (this is the photo)

During the operation, the RAM (and CPUs) has been sufficient:
Captura de pantalla de 2020-02-06 13-22-03.png

Another interesting fact is that after finishing the deployment of computers, fog has eliminated the task (with a computer locked on the blue screen)

What parameter can we check? What could be the origin of this problem?

Sebastian Roth

@Tom-Elliott Have you ever seen a PC failing to proceed from one partition to the other in multicast?

@tec618 What happens if you deploy unicast to that PC that failed to pick up partition 8?

Tom Elliott

@Sebastian-Roth no I’ve never seen this before. While it doesn’t display, does it at least complete? I ask because of the nature of multicast.

If you have 10 machines to image and all 10 connect, imaging will proceed immediately. If one of those hosts were shut off after, then imaging would only proceed after a specified timeout, I think we default to 10 minutes.

Sebastian Roth

@tec618 Yeah, the timeout mentioned by Tom is a good point. When you see one machine not picking up on one of the partitions. Do the others sit there and wait for some amount of time as well?

The other thing that just came to my mind is checking /var/log/fog/multicast.log. While I don’t expect to see something out of the ordinary in there it’s still worth a try. You will see many lines with “No new tasks found” but at some point there should be a section of logs starting with “Task ID: xxx Name: Multi-Cast Task - yyy is new”. Please post the full block of lines here.

shruggy

@Sebastian-Roth said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

@Tom-Elliott Have you ever seen a PC failing to proceed from one partition to the other in multicast?

IIRC, I had this same problem once or twice, after one of FOG upgrades, probably half a year ago. I guess that prompted me to try out the dev-branch back then, the problem went away and I stayed on the dev-branch since then.

@tec618 What happens if you deploy unicast to that PC that failed to pick up partition 8?

In my case, it never happened in the unicast mode. I have a setup with 4 legacy primary MBR partitions. NTFS (500 MB) - NTFS (237 GB) - ext2 (500 MB) - LVM2 (237 GB). d1.partitions:

label: dos
label-id: 0x871158f2
device: /dev/sda
unit: sectors

/dev/sda1 : start=        2048, size=     1024000, type=7, bootable
/dev/sda2 : start=     1026048, size=   499081216, type=7
/dev/sda3 : start=   500107264, size=     1024000, type=83
/dev/sda4 : start=   501131264, size=   499083264, type=8e

IIRC, the freeze up happened after restoring the 3rd partition.

tec618

@Sebastian-Roth said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

@tec618 What happens if you deploy unicast to that PC that failed to pick up partition 8?

if we deploy unicast that PC no problem. In unicast does not fail, only in multicast

tec618

@Tom-Elliott said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

@Sebastian-Roth no I’ve never seen this before. While it doesn’t display, does it at least complete? I ask because of the nature of multicast.

If you have 10 machines to image and all 10 connect, imaging will proceed immediately. If one of those hosts were shut off after, then imaging would only proceed after a specified timeout, I think we default to 10 minutes.

The multicast task is not completed on the computer that fails. The solution is to turn off the computer and start the task in unicast.
We have observed that in the partition change the computers do not wait 10 minutes. This is only done when starting multicast deployment

tec618

@Sebastian-Roth These are the lines of the multicast.log file of the commented multicast task …

[02-06-20 1:03:55 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is new
[02-06-20 1:03:55 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 image file found, file: /images/L4Ene-5
[02-06-20 1:03:55 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 6 clients found
[02-06-20 1:03:55 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 sending on base port 51530
[02-06-20 1:03:55 pm]  | Command: /usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 600 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p1.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p2.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p3.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p4.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p5.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p6.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p7.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p8.img;
[02-06-20 1:03:55 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has started
[02-06-20 1:04:05 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 1:04:15 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 1:04:25 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
..... [This line is repeated many times. Always the same.]
[02-06-20 2:08:24 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:08:34 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:08:44 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:08:54 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:04 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:14 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:24 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:34 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is no longer running
[02-06-20 2:09:34 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has been killed
[02-06-20 2:09:44 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is new
[02-06-20 2:09:44 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 image file found, file: /images/L4Ene-5
[02-06-20 2:09:44 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 6 clients found
[02-06-20 2:09:44 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 sending on base port 51530
[02-06-20 2:09:44 pm]  | Command: /usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 600 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p1.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p2.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p3.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p4.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p5.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p6.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p7.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p8.img;
[02-06-20 2:09:44 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has started
[02-06-20 2:09:54 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has been completed
[02-06-20 2:09:54 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has been killed
[02-06-20 2:09:54 pm]  | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is now completed
[02-06-20 2:10:04 pm] Task not created as there are no associated tasks
[02-06-20 2:10:04 pm] Or there was no number defined for joining session
[02-06-20 2:10:04 pm]  * No new tasks found
[02-06-20 2:10:14 pm]  * No new tasks found

Sebastian Roth

@tec618 said:

... /usr/local/sbin/udp-sender --max-wait 600 ... d1p1.img;/usr/local/sbin/udp-sender ... --max-wait 10 ... d1p2.img; ...

Good I asked about the log and had a closer look at this now. I remember that we have changed the timeouts but this was a long time ago so I didn’t remember. Currently the first timeout is set to 10 minutes so hosts that boot quicker than others to the multicast don’t rush off and let others behind. But for subsequent partitions the timeout is only 10 seconds. This was made because when you have one out of 20 hosts that fails on the first partition (for whatever reason) then the whole set of hosts would wait for 10 minutes on each partition.

Now possibly the 10 seconds is too little for some of your machines. Edit the file /var/www/html/fog/lib/service/multicasttask.class.php line 662 and change number 10 to 30 sec.

Cancel all multicast tasks that might still be running and restart the service: systemctl restart FOGMulticastManager

Tom Elliott

@Sebastian-Roth I made the change in working-1.6. I took a more cautious approach to set to one minute instead of 30 seconds.

Sebastian Roth

@Tom-Elliott I am not sure this is a good idea as default. This means that between every partition it needs to wait for at least one minute if one of the multicast client fails…

Tom Elliott

@Sebastian-Roth That, I think, is okay. 30 seconds could be too short a time (as the partition get’s re-expanded) potentially. It’s one minute vs 10 seconds or 30 seconds. I’m just taking a cautious approach. Only one minute to wait between a partition isn’t too much to ask I don’t think. I can certainly shorten the time. Or we could add a global setting for the admin to select an appropriate time.

Sebastian Roth

@Tom-Elliott said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

Or we could add a global setting for the admin to select an appropriate time.

That would surely be a nice feature for 1.6.

You are probably right hat 1 minute is not asking too much. Though we haven’t seen issues with 10 seconds for many years now. I wouldn’t change that for 1.5.x. The OP is welcome to just manually adjust this.

tec618

@Sebastian-Roth @Tom-Elliott thanks for your comments.
It is very possible that this is the reason for our multicast problems. It may also be the reason for the host database update problem at the end of the multicast task (post https://forums.fogproject.org/topic/14143/dev-branch-multicast-for-some-hosts-db-not-updated-after-restore/2?_=1580905777518).
I will modify that value (60 seconds), I will test it and I will tell you the result.
This new parameter would be a good improvement for the next version 1.6

tec618

Hello again. I think the problem of blocking when changing partition has been fixed. Yesterday I executed a multicast task with 12 computers and all deployed each partition. But 2 computers showed this other error message at the end of the task (at the end of the deployment of all partitions):

Then, in the “Imaging log” report it does not correctly show the end time of this computer’s task (“-0001-11-30 00:00:00” and the duration “2020 years 2 months …”).

Is it possible to correct this too?

Thanks.

Sebastian Roth

@tec618 How many hosts with fog-client software installed do you have in your network?

Please take a look at the apache and php-fpm logs (see my signature) and post valid information here.

tec618

@Sebastian-Roth said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:

@tec618 How many hosts with fog-client software installed do you have in your network?

In our subnet there are really 32 computers plus the fog server (virtual machine). The 32 computers have the fog client software installed on the 2 mounted operating systems (Windows and Ubuntu)

Please take a look at the apache and php-fpm logs (see my signature) and post valid information here.

We have just completed another multicast task with 12 computers to attach the lines of the requested files (and previously we have restarted the MV of the FOG server). In this case, only one computer failed to update the database at the end.

/var/log/apache2/error.log

[Wed Feb 12 00:06:04.802457 2020] [mpm_prefork:notice] [pid 988] AH00163: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.1 configured -- resuming normal operations
[Wed Feb 12 00:06:04.802492 2020] [core:notice] [pid 988] AH00094: Command line: '/usr/sbin/apache2'
[Wed Feb 12 08:40:18.331499 2020] [proxy_fcgi:error] [pid 8178] [client 192.168.36.252:37258] AH01071: Got error 'PHP message: PHP Warning:  Illegal offset type in /var/www/fog/lib/fog/fogmanagercontroller.class.php on line 296\n', referer: http://192.168.36.252/fog/management/index.php?node=host&sub=edit&id=96
[Wed Feb 12 08:47:22.410088 2020] [mpm_prefork:notice] [pid 988] AH00169: caught SIGTERM, shutting down
[Wed Feb 12 08:47:46.067103 2020] [mpm_prefork:notice] [pid 940] AH00163: Apache/2.4.29 (Ubuntu) OpenSSL/1.1.1 configured -- resuming normal operations
[Wed Feb 12 08:47:46.067352 2020] [core:notice] [pid 940] AH00094: Command line: '/usr/sbin/apache2'
[Wed Feb 12 09:00:23.764176 2020] [proxy_fcgi:error] [pid 986] [client 192.168.36.116:55576] AH01071: Got error 'PHP message: PHP Warning:  count(): Parameter must be an array or an object that implements Countable in /var/www/fog/lib/pages/imagemanagementpage.class.php on line 1125\n', referer: http://192.168.36.252/fog/management/index.php?node=image&sub=list

/var/log/php7.2-fpm.log

[11-Feb-2020 23:29:30] NOTICE: [pool www] child 5974 started
[12-Feb-2020 00:12:32] NOTICE: [pool www] child 19941 exited with code 0 after 20696.983251 seconds from start
[12-Feb-2020 00:12:32] NOTICE: [pool www] child 8678 started
[12-Feb-2020 00:12:57] NOTICE: [pool www] child 19943 exited with code 0 after 20719.070195 seconds from start
[12-Feb-2020 00:12:57] NOTICE: [pool www] child 8703 started
[12-Feb-2020 00:13:40] NOTICE: [pool www] child 19942 exited with code 0 after 20762.680041 seconds from start
[12-Feb-2020 00:13:40] NOTICE: [pool www] child 8740 started
[12-Feb-2020 00:13:57] NOTICE: [pool www] child 19960 exited with code 0 after 20760.003372 seconds from start
[12-Feb-2020 00:13:57] NOTICE: [pool www] child 8757 started
[12-Feb-2020 00:14:08] NOTICE: [pool www] child 19970 exited with code 0 after 20763.628227 seconds from start
[12-Feb-2020 00:14:08] NOTICE: [pool www] child 8767 started
[12-Feb-2020 00:14:08] NOTICE: [pool www] child 19969 exited with code 0 after 20766.224074 seconds from start
[12-Feb-2020 00:14:08] NOTICE: [pool www] child 8768 started
[12-Feb-2020 00:16:29] NOTICE: [pool www] child 19999 exited with code 0 after 20869.088276 seconds from start
[12-Feb-2020 00:16:29] NOTICE: [pool www] child 8890 started
[12-Feb-2020 01:01:01] NOTICE: [pool www] child 20933 exited with code 0 after 22715.786627 seconds from start
[12-Feb-2020 01:01:01] NOTICE: [pool www] child 11497 started
[12-Feb-2020 03:10:38] NOTICE: [pool www] child 24868 exited with code 0 after 26547.790084 seconds from start
[12-Feb-2020 03:10:38] NOTICE: [pool www] child 19375 started
[12-Feb-2020 06:52:50] NOTICE: [pool www] child 5974 exited with code 0 after 26599.969300 seconds from start
[12-Feb-2020 06:52:50] NOTICE: [pool www] child 32550 started
[12-Feb-2020 07:34:55] NOTICE: [pool www] child 8678 exited with code 0 after 26543.871750 seconds from start
[12-Feb-2020 07:34:55] NOTICE: [pool www] child 3028 started
[12-Feb-2020 07:35:30] NOTICE: [pool www] child 8703 exited with code 0 after 26553.244558 seconds from start
[12-Feb-2020 07:35:30] NOTICE: [pool www] child 3063 started
[12-Feb-2020 07:36:38] NOTICE: [pool www] child 8740 exited with code 0 after 26577.416929 seconds from start
[12-Feb-2020 07:36:38] NOTICE: [pool www] child 3124 started
[12-Feb-2020 07:36:40] NOTICE: [pool www] child 8757 exited with code 0 after 26563.254436 seconds from start
[12-Feb-2020 07:36:40] NOTICE: [pool www] child 3125 started
[12-Feb-2020 07:36:55] NOTICE: [pool www] child 8767 exited with code 0 after 26567.924853 seconds from start
[12-Feb-2020 07:36:55] NOTICE: [pool www] child 3137 started
[12-Feb-2020 07:36:58] NOTICE: [pool www] child 8768 exited with code 0 after 26570.386211 seconds from start
[12-Feb-2020 07:36:58] NOTICE: [pool www] child 3146 started
[12-Feb-2020 07:39:10] NOTICE: [pool www] child 8890 exited with code 0 after 26560.913797 seconds from start
[12-Feb-2020 07:39:10] NOTICE: [pool www] child 3396 started
[12-Feb-2020 08:24:50] NOTICE: [pool www] child 11497 exited with code 0 after 26629.620151 seconds from start
[12-Feb-2020 08:24:50] NOTICE: [pool www] child 6007 started
[12-Feb-2020 08:47:22] NOTICE: Terminating ...
[12-Feb-2020 08:47:22] NOTICE: exiting, bye-bye!
[12-Feb-2020 08:47:46] NOTICE: fpm is running, pid 744
[12-Feb-2020 08:47:46] NOTICE: ready to handle connections
[12-Feb-2020 08:47:46] NOTICE: systemd monitor interval set to 10000ms

In this log file there are no new lines referring to this last multicast task

FOG v1.5.7 on Ubuntu 18.04 random problem multicast

168

11.6k

17.1k

154.5k