FOG v1.5.7 on Ubuntu 18.04 random problem multicast
-
Hello.
We have a classroom with 30 computers with identical hardware and a multi-boot setup with several partitions. With version 1.5.7 (we do not know if the version is the reason), when multicasting to 12 PCs, some PCs do not start deploying the next partition at the partition change, and the task never finishes on those hosts. It does not always happen to the same PCs; it is random.
We have reviewed logs, memory, free space, … but we could not find the cause.
What parameter would have to be modified to correct this problem? What can we check?
Thanks
-
-
@george1421 From what I read between the lines I don’t think this topic is related. While that one was an issue at the end of multicasting when hosts update the DB, this topic here seems to be about an issue when hosts step from one partition to the next.
@tec618 What do you see on the screen of the hosts that don’t proceed? Do they come up with the blue partclone screen and wait like that forever?
-
Exactly @Sebastian-Roth, the PCs wait on the blue screen forever and do not start the next partition. It does not always occur on the same computer or on the same partition. In this image we have up to nine partitions.
I can also confirm what @george1421 says, because it happens in the same classroom.
-
@tec618 said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:
I can also confirm what @george1421 says, because it happens in the same classroom.
I can see the logic in that: if there are not enough php-fpm workers to service the requests, the target systems may appear to hang when the server is too late to respond to the clients’ requests. At this moment I don’t know if it’s one worker per multicast client or if one php-fpm worker can service requests from multiple clients. We haven’t researched it to that level yet.
How many computers do you have running the fog client? What is your client check-in timeout?
I agree that I’m probably off base here, but in the other post your conditions seem similar.
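In case it helps while you test this theory: the php-fpm worker counts live in its pool file www.conf (on Ubuntu 18.04 that is typically /etc/php/7.2/fpm/pool.d/www.conf — the exact path is an assumption about your setup). The values below are only an illustrative starting point, not a recommendation:

; illustrative php-fpm pool settings in www.conf
pm = dynamic
pm.max_children = 50      ; upper bound on worker processes
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15

Then restart the service, e.g. systemctl restart php7.2-fpm.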
-
@tec618 said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:
It does not always occur on the same computer or on the same partition
So does that mean it’s not always happening on the first partition but usually on one of the later partitions?
In this image we have up to nine partitions.
While FOG should be able to handle that many partitions, I am still wondering why you have so many. Dual boot Windows/Linux? ChromeOS?
-
Hello again.
After modifying the file “www.conf”, I ran a multicast task again with 6 identical computers, and this is the result:
- 4 computers completed the task perfectly
- 1 computer failed to update the database (as described in the post: https://forums.fogproject.org/topic/14143/dev-branch-multicast-for-some-hosts-db-not-updated-after-restore/2?_=1580905777518)
- another computer locked up when switching to partition 8 (this is the photo)
During the operation, the RAM (and CPUs) were sufficient:
Another interesting fact is that after the deployment finished on the other computers, FOG removed the task (with one computer still locked on the blue screen).
What parameter can we check? What could be the origin of this problem?
-
@Tom-Elliott Have you ever seen a PC failing to proceed from one partition to the other in multicast?
@tec618 What happens if you deploy unicast to that PC that failed to pick up partition 8?
-
@Sebastian-Roth no I’ve never seen this before. While it doesn’t display progress, does it at least complete? I ask because of the nature of multicast.
If you have 10 machines to image and all 10 connect, imaging will proceed immediately. If one of those hosts were shut off afterwards, then imaging would only proceed after a specified timeout; I think we default to 10 minutes.
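For illustration, that gathering behaviour corresponds to udp-sender’s receiver options as used by FOG itself (flags copied from the multicast.log further down; the receiver count and image path here are made up, and this is my reading of the udpcast options): the sender starts transmitting once --min-receivers clients have joined, or once --max-wait seconds have elapsed.

/usr/local/sbin/udp-sender --interface ens3 --min-receivers 10 \
    --max-wait 600 --portbase 51530 --full-duplex --ttl 32 \
    --nokbd --nopointopoint --file /images/example/d1p1.img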
-
@tec618 Yeah, the timeout mentioned by Tom is a good point. When you see one machine not picking up one of the partitions, do the others sit there and wait for some amount of time as well?
The other thing that just came to my mind is checking /var/log/fog/multicast.log. While I don’t expect to see something out of the ordinary in there, it’s still worth a try. You will see many lines with “No new tasks found”, but at some point there should be a section of logs starting with “Task ID: xxx Name: Multi-Cast Task - yyy is new”. Please post the full block of lines here.
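While the task runs, a simple way to watch that log live and filter out the noise (plain standard tools, nothing FOG-specific):

tail -f /var/log/fog/multicast.log | grep -v "No new tasks found"
-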
@Sebastian-Roth said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:
@Tom-Elliott Have you ever seen a PC failing to proceed from one partition to the other in multicast?
IIRC, I had this same problem once or twice after one of the FOG upgrades, probably half a year ago. I guess that prompted me to try out the dev-branch back then; the problem went away and I have stayed on the dev-branch since.
@tec618 What happens if you deploy unicast to that PC that failed to pick up partition 8?
In my case, it never happened in the unicast mode. I have a setup with 4 legacy primary MBR partitions. NTFS (500 MB) - NTFS (237 GB) - ext2 (500 MB) - LVM2 (237 GB).
d1.partitions:
label: dos
label-id: 0x871158f2
device: /dev/sda
unit: sectors

/dev/sda1 : start= 2048, size= 1024000, type=7, bootable
/dev/sda2 : start= 1026048, size= 499081216, type=7
/dev/sda3 : start= 500107264, size= 1024000, type=83
/dev/sda4 : start= 501131264, size= 499083264, type=8e
IIRC, the freeze up happened after restoring the 3rd partition.
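For reference: if I read it correctly, that file is in sfdisk’s dump format, so an equivalent dump can be produced by hand with (device name illustrative):

sfdisk -d /dev/sda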
-
@Sebastian-Roth said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:
@tec618 What happens if you deploy unicast to that PC that failed to pick up partition 8?
If we deploy unicast to that PC, there is no problem. It does not fail in unicast, only in multicast.
-
@Tom-Elliott said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:
@Sebastian-Roth no I’ve never seen this before. While it doesn’t display progress, does it at least complete? I ask because of the nature of multicast.
If you have 10 machines to image and all 10 connect, imaging will proceed immediately. If one of those hosts were shut off afterwards, then imaging would only proceed after a specified timeout; I think we default to 10 minutes.
The multicast task is never completed on the computer that fails. The solution is to turn off the computer and start the task in unicast.
We have observed that at a partition change the computers do not wait 10 minutes; that only happens when the multicast deployment starts.
-
@Sebastian-Roth These are the lines from the multicast.log file for the multicast task mentioned above …
[02-06-20 1:03:55 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is new
[02-06-20 1:03:55 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 image file found, file: /images/L4Ene-5
[02-06-20 1:03:55 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 6 clients found
[02-06-20 1:03:55 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 sending on base port 51530
[02-06-20 1:03:55 pm] | Command: /usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 600 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p1.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p2.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p3.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p4.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p5.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p6.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p7.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p8.img;
[02-06-20 1:03:55 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has started
[02-06-20 1:04:05 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 1:04:15 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 1:04:25 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
..... [This line is repeated many times. Always the same.]
[02-06-20 2:08:24 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:08:34 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:08:44 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:08:54 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:04 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:14 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:24 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is already running with pid: 4610
[02-06-20 2:09:34 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is no longer running
[02-06-20 2:09:34 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has been killed
[02-06-20 2:09:44 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is new
[02-06-20 2:09:44 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 image file found, file: /images/L4Ene-5
[02-06-20 2:09:44 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 6 clients found
[02-06-20 2:09:44 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 sending on base port 51530
[02-06-20 2:09:44 pm] | Command: /usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 600 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p1.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p2.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p3.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p4.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p5.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p6.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p7.img;/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 --max-wait 10 --portbase 51530 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/L4Ene-5/d1p8.img;
[02-06-20 2:09:44 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has started
[02-06-20 2:09:54 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has been completed
[02-06-20 2:09:54 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 has been killed
[02-06-20 2:09:54 pm] | Task ID: 25 Name: Multi-Cast Task - 15-16-27al30 is now completed
[02-06-20 2:10:04 pm] Task not created as there are no associated tasks
[02-06-20 2:10:04 pm] Or there was no number defined for joining session
[02-06-20 2:10:04 pm] * No new tasks found
[02-06-20 2:10:14 pm] * No new tasks found
-
@tec618 said:
... /usr/local/sbin/udp-sender --max-wait 600 ... d1p1.img;/usr/local/sbin/udp-sender ... --max-wait 10 ... d1p2.img; ...
Good thing I asked about the log, as I have now had a closer look at this. I remember that we changed the timeouts, but that was a long time ago, so I didn’t remember the details. Currently the first timeout is set to 10 minutes so that hosts that boot into the multicast quicker than others don’t rush off and leave the others behind. But for subsequent partitions the timeout is only 10 seconds. This was done because when one out of 20 hosts fails on the first partition (for whatever reason), the whole set of hosts would otherwise wait for 10 minutes on every partition.
Now possibly the 10 seconds is too little for some of your machines. Edit the file /var/www/html/fog/lib/service/multicasttask.class.php at line 662 and change the number 10 to 30 seconds. Cancel all multicast tasks that might still be running and restart the service:
systemctl restart FOGMulticastManager
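To make clear what that number does, here is a minimal PHP sketch of the logic (illustrative only, not the verbatim multicasttask.class.php source; the variable names are my own): FOG chains one udp-sender call per partition image, and the --max-wait value differs between the first partition and the rest, as the multicast.log above shows.

<?php
// Illustrative sketch, not the actual FOG source: one udp-sender
// invocation is chained per partition image. The first partition waits
// up to 600 s for all receivers; later partitions wait only 10 s.
$images = ['d1p1.img', 'd1p2.img', 'd1p3.img']; // hypothetical list
$cmd = '';
foreach ($images as $index => $img) {
    $maxWait = ($index === 0) ? 600 : 10; // <-- the 10 you would raise to 30
    $cmd .= sprintf(
        '/usr/local/sbin/udp-sender --interface ens3 --min-receivers 6 '
        . '--max-wait %d --portbase 51530 --full-duplex --ttl 32 '
        . '--nokbd --nopointopoint --file /images/L4Ene-5/%s;',
        $maxWait,
        $img
    );
}
echo $cmd . "\n";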
-
@Sebastian-Roth I made the change in working-1.6. I took a more cautious approach and set it to one minute instead of 30 seconds.
-
@Tom-Elliott I am not sure this is a good idea as a default. It means that between every partition it needs to wait for at least one minute if one of the multicast clients fails…
-
@Sebastian-Roth That, I think, is okay. 30 seconds could potentially be too short a time (as the partition gets re-expanded). It’s one minute vs 10 seconds or 30 seconds; I’m just taking a cautious approach. Waiting one minute between partitions isn’t too much to ask, I don’t think. I can certainly shorten the time. Or we could add a global setting for the admin to select an appropriate time.
-
@Tom-Elliott said in FOG v1.5.7 on Ubuntu 18.04 random problem multicast:
Or we could add a global setting for the admin to select an appropriate time.
That would surely be a nice feature for 1.6.
You are probably right that 1 minute is not asking too much. Though, since we haven’t seen issues with 10 seconds for many years now, I wouldn’t change that for 1.5.x. The OP is welcome to adjust this manually.
-
@Sebastian-Roth @Tom-Elliott thanks for your comments.
It is very possible that this is the cause of our multicast problems. It may also be the reason for the host database update problem at the end of the multicast task (post https://forums.fogproject.org/topic/14143/dev-branch-multicast-for-some-hosts-db-not-updated-after-restore/2?_=1580905777518).
I will modify that value (60 seconds), test it and report the result.
This new parameter would be a good improvement for the next version, 1.6.