Multicast randomly hangs around 70-90% on last partition

benc

@george1421 I edited www.conf and set the values you specified. There was only one copy, located in /etc/php/7.1/fpm/pool.d/. The server at this location is running Ubuntu Server 18.04.1 and FOG 1.4.4. Tried a multicast with 5 machines and it failed the first time at 86%, then I tried it a second time and it failed at 93%. My third attempt was with 2 clients, and it failed at 90%. I attached a pic of a client after each attempt just after it hung. After it hangs, elapsed time on the clients is still counting up, and GB/min is slowly decreasing since there is no activity. I looked at each client to confirm they all showed the same thing. They all stop at the same block and the Partclone screens are identical. I checked the hard drive LEDs on each client to make sure none of them are hung or showing signs of drive failure. All clients have a 120GB SSD. During the first 3 or 4 minutes, the speed is around 11-12 GB/min, but I can usually tell when it’s about to fail because the speed will start dropping to around 8-9 GB/min. HD LEDs on each client flash as expected, and none of them are staying lit constantly. I have also tried using different images in case it is something related to the image. Also, when I use unicast to deploy, it works fine on all machines and runs near 12 GB/min the whole way through.

george1421

@benc The fix I posted only addresses a specific issue with FOG 1.5.4. It will not help FOG 1.4.4, since that version doesn’t use php-fpm.

So what does the multicast log in /opt/fog/logs say? Does it give you a clue as to why it’s failing?

benc

@george1421 I haven’t had a chance to dig much deeper into this issue since my last post but I did look through the logs of the last server I was using. I see a lot of timeouts, but I’m not familiar enough with the logs to be able to identify the issue. I can post a log here if that would help. Which log(s) would be relevant?

Fernando Gietz

Hi,

We are also having performance problems with multicast tasks, on FOG 1.5.4. At some sites the performance is good, but at others it is very, very bad.

We have two FOG servers: the old one running version 0.30 and the new one running 1.5.4. We can deploy without problems with the old one (both servers are in the same VLAN and deploy to the same VLANs), but not with the new one.

We are testing the network to find the problem, so far without success. We have noticed, though, that on the new server the --mcast-data-address value is always the same, while on the old FOG server it is always different. Could this parameter be the problem?
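To make the comparison concrete, here is a minimal Python sketch that pulls the --mcast-data-address value out of a set of udp-sender command lines and reports any address used by more than one task. The sample command lines and addresses are made up for illustration, not taken from a real FOG server; only the --mcast-data-address option name comes from the observation above.

```python
import re
from collections import Counter

# Matches "--mcast-data-address <value>" or "--mcast-data-address=<value>".
ADDR_RE = re.compile(r"--mcast-data-address[ =](\S+)")


def data_addresses(command_lines):
    """Return every --mcast-data-address value found, in order of appearance."""
    return [m.group(1) for line in command_lines for m in ADDR_RE.finditer(line)]


def reused_addresses(command_lines):
    """Return the sorted list of addresses that appear in more than one command."""
    counts = Counter(data_addresses(command_lines))
    return sorted(addr for addr, n in counts.items() if n > 1)


# Hypothetical command lines, one per multicast task (illustrative values only).
sample_commands = [
    "udp-sender --mcast-data-address 239.0.0.1 --portbase 9000",
    "udp-sender --mcast-data-address 239.0.0.1 --portbase 9002",
    "udp-sender --mcast-data-address 239.0.0.7 --portbase 9004",
]

found = data_addresses(sample_commands)
reused = reused_addresses(sample_commands)
print("addresses seen:", found)
print("reused across tasks:", reused)
```

Running the same check against the command lines from both servers would show whether the new one really reuses a single address for every task.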

benc

0_1532980312008_multicast.log.udpcast.1.log

I pulled one of the multicast logs from the last server I used. Had to add .log to the end of the file to upload it here. Hope this helps.

Sebastian Roth

@benc Thanks for the multicast log. Really strange behavior, I find. Why would the last partition behave differently on multicast than all the other partitions do (just thinking out loud)?
From the log it seems kind of random. Sometimes it’s just one client not answering and next it’s all of them at the same time:

Timeout notAnswered=[2] notReady=[2] nrAns=5 nrRead=5 nrPart=6 avg=106
Timeout notAnswered=[0,1,2,3,4,5] notReady=[0,1,2,3,4,5] nrAns=0 nrRead=0 nrPart=6 avg=105
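
To see whether the timeouts cluster on particular clients, lines like the two above can be tallied with a short script. This is only a sketch in Python: the regular expression is written against the two lines quoted here, and treating nrAns=0 as "no client answered" is an assumption about what that field means.

```python
import re
from collections import Counter

# Pattern built from the timeout lines quoted above.
TIMEOUT_RE = re.compile(
    r"Timeout notAnswered=\[(?P<not_answered>[\d,]*)\] "
    r"notReady=\[(?P<not_ready>[\d,]*)\] "
    r"nrAns=(?P<nr_ans>\d+) nrRead=(?P<nr_read>\d+) "
    r"nrPart=(?P<nr_part>\d+) avg=(?P<avg>\d+)"
)


def parse_ids(field):
    """Turn '0,1,2' into [0, 1, 2]; an empty field gives an empty list."""
    return [int(x) for x in field.split(",") if x]


def summarize_timeouts(lines):
    """Count timeout lines and how often each client slot was not answering."""
    per_client = Counter()
    total = 0
    all_silent = 0
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a timeout line
        total += 1
        per_client.update(parse_ids(match.group("not_answered")))
        # Assumption: nrAns=0 means no client answered at all.
        if int(match.group("nr_ans")) == 0:
            all_silent += 1
    return {"timeouts": total, "all_silent": all_silent, "per_client": dict(per_client)}


sample = [
    "Timeout notAnswered=[2] notReady=[2] nrAns=5 nrRead=5 nrPart=6 avg=106",
    "Timeout notAnswered=[0,1,2,3,4,5] notReady=[0,1,2,3,4,5] nrAns=0 nrRead=0 nrPart=6 avg=105",
]
summary = summarize_timeouts(sample)
print(summary)
```

To run it on a whole log, pass the open file instead of the sample list, for example summarize_timeouts(open(path)) with path pointing at the udpcast log.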

What kind of filesystems do you have on those four partitions? Is it all FAT32 or NTFS?

Can you post the contents of /images/Val-Public/d1.partitions (as well … fixed_size_partitions and …minimum.partitions if this is a resizable image type) - just trying to get a bigger picture here.

benc

@sebastian-roth My guess is that the problem, whatever it is, only shows up on a partition that is over a certain size. Or perhaps it has to do with the time elapsed. I wouldn’t think it has anything to do with the type of partition or the data it contains. This image is a pretty straightforward UEFI Windows 10 install. The first 3 partitions are whatever Windows puts there during install. The last partition is NTFS. I am thinking about finding another smaller image to test with and see if maybe the smaller image multicasts successfully.

0_1533058838262_d1.fixed_size_partitions.log
0_1533058851863_d1.minimum.partitions.log
0_1533058865275_d1.partitions.log

Sebastian Roth

@benc said in Multicast randomly hangs around 70-90% on last partition:

I am thinking about finding another smaller image to test with and see if maybe the smaller image multicasts successfully.

Definitely give that a try. See if you can pinpoint what exactly is causing this. So far I have no clue, I am afraid.
The partition files you posted seem perfectly fine from my point of view.

Would you be able to put a different hard drive in two or three of these PCs, just to test multicast on those and see if it makes any difference?

benc

@sebastian-roth I will try putting different hard drives in the clients, and if that shows the same results I’ll probably just reinstall Win10 on one of the machines, capture that, and use that as my smaller test image.

benc

I am working in a different location today, and both of the multicasts that I tried have worked all the way through. I copied the same image I have been using all along from the last location’s server to this location’s server, deployed it to 1 PC, changed a few settings, captured, and used multicast to deploy it to 5 PCs and then 3 PCs. I have attached one of the successful multicast logs from the server at this location.

0_1533155793310_multicast.log.udpcast.12.log

benc

The last 3 of my FOG servers I’ve been working with have successfully completed all multicasts. It looks like I’ve got the issue with about half of my servers. Haven’t yet found the issue or the difference between the working servers and the non-working servers. I may just try to reinstall Ubuntu Server 16 and start there. One thing I did try on the last server with the issue was to reinstall FOG. It failed on every package that had curl in it. Don’t know anything about curl but maybe that’s a clue.

benc

I’ve tried putting new drives in 2 PCs and trying a multicast again. Same results. I tried the same thing at another location and actually got up to 98% before it got stuck. I tried a couple more times and it hung randomly around 90%.

I’m starting to think that I made a mistake by trying to keep our FOG servers up to date. I’m relatively new to the Linux world and I just assumed that running apt-get update / apt-get upgrade / apt-get dist-upgrade / do-release-upgrade every now and then was probably a good idea to keep security tight. I have not had time to rebuild any of my FOG servers yet to see if that fixes my issues. When I do rebuild, I’ll most likely just throw a new drive in and start over. For long-term stability and reliability, what distro/version should I go with? Most of my experience in Linux has been with Ubuntu so I’d like to stay with that, but I’m open to suggestions.

Sebastian Roth

@benc Running system upgrades as you do is not a bad thing. It’s wise to keep your system up to date! In the Linux world such an upgrade usually either breaks things badly (which is rare!) or not at all. Sure, there are situations where an upgrade introduces subtle issues like this one, but I don’t see that very often. So keep this good habit of keeping your systems updated!

From what I see, we can be fairly sure this is neither a general issue with the clients nor a general problem with FOG, as you see it happening at some locations while it works fine at others. I wouldn’t say it’s impossible, but I highly doubt this problem arises from upgrading your server OS packages. To me this sounds like some kind of network traffic shaping / limiter kicking in after a certain amount of traffic has passed through in one session.

Do you have different switches (configurations) at those locations?

PS: Debian and CentOS are pretty solid systems. Debian is closer to what you are used to from Ubuntu. CentOS is more enterprise-like, being based on RHEL.

benc

@sebastian-roth The switches at each location are identical, and the configuration is fundamentally the same except that some locations have two switches stacked together to provide enough ports. One VLAN, same addressing scheme, same types of devices connected. Right now I’m really combing through the details of the configs, comparing the working locations to the ones that don’t. There could also be something with the fact that some locations have two switches and others have just one. That shouldn’t matter, but who knows. I’ll check back in with my findings.

Fernando Gietz

I think it is interesting to look at this post:

Multicast data address not change from one task to another one
