SOLVED Multicast randomly hangs around 70-90% on last partition
Hello. Let me start by saying FOG is freaking awesome. I use it every day in our environment. Most of what I do is unicast, but occasionally it would be beneficial to use multicast. We have 15 libraries with a FOG server at each location. Each one is standalone, and they are not replicating images or anything like that. Only one library is able to successfully finish a multicast. When trying to multicast at any other location, the task starts just fine and every machine waits until everyone has joined and is ready to begin. The first 3 partitions image ok, then somewhere towards the end of the 4th and final partition every machine hangs. The 4th partition is a little over 40GB, so it runs for probably 3 or 4 minutes before hanging. This behavior is consistent across every other location except for that one. I have not been able to figure out what is different at that location. The server, switches, router, and even the client machines at every location are identical. All fog servers were running Ubuntu Server 16.04 when we started using fog about a year ago, but I’ve tried 16.10, 17.04, 17.10, and 18.04 in the process of troubleshooting just to see what would happen. Right now the fog server at the location that is working is running Ubuntu 18.04. This past week I’ve tried at 4 other locations and the multicast always hangs before it finishes that last partition.
I’m good with networking and Windows environments, but not so much Linux.
edit: I forgot to mention that I’ve tried the multicast on 2 machines at a time all the way up to 16 at a time. Our switches are Juniper EX-3300, 48 ports with PoE. Fog server and all clients are on the same VLAN.
I think this post is interesting to look at:
@sebastian-roth The switches at each location are identical, and the configuration is fundamentally the same except that some locations have two switches stacked together to provide enough ports. One VLAN, same addressing scheme, same types of devices connected. Right now I’m really combing through the details of the configs, comparing the working locations to the ones that don’t. There could also be something with the fact that some locations have two switches and others have just one. That shouldn’t matter, but who knows. I’ll check back in with my findings.
@benc Running system upgrades as you do is not a bad thing. It’s wise to keep your system up to date! In the Linux world such an upgrade usually either breaks things badly (which is rare!) or not at all. Sure, there are situations where an upgrade might introduce subtle issues like this, but that’s not something I see very often. So keep this good habit of keeping your systems updated!
From what I see, we can be fairly sure this is neither a general issue with the clients nor a general problem with FOG, as you see it happening at some locations but working fine at others. I wouldn’t say it’s impossible, but I highly doubt this problem arises from upgrading your server OS packages. To me this sounds like some kind of network traffic shaping / limiter kicking in once a certain amount of traffic has passed through in one session.
Do you have different switches (configurations) at those locations?
PS: Debian and CentOS are pretty solid systems. Debian is closer to what you are used to from Ubuntu. CentOS is more enterprise-like, being based on RHEL.
I’ve tried putting new drives in 2 PCs and trying a multicast again. Same results. I tried the same thing at another location and actually got up to 98% before it got stuck. I tried a couple more times and it hung randomly around 90%.
I’m starting to think that I made a mistake by trying to keep our FOG servers up to date. I’m relatively new to the Linux world and I just assumed that running apt-get update / apt-get upgrade / apt-get dist-upgrade / do-release-upgrade every now and then was probably a good idea to keep security tight. I have not had time to rebuild any of my FOG servers yet to see if that fixes my issues. When I do rebuild, I’ll most likely just throw a new drive in and start over. For long-term stability and reliability, what distro/version should I go with? Most of my experience in Linux has been with Ubuntu so I’d like to stay with that, but I’m open to suggestions.
The last 3 of my FOG servers I’ve been working with have successfully completed all multicasts. It looks like I’ve got the issue with about half of my servers. Haven’t yet found the issue or the difference between the working servers and the non-working servers. I may just try to reinstall Ubuntu Server 16 and start there. One thing I did try on the last server with the issue was to reinstall FOG. It failed on every package that had curl in it. Don’t know anything about curl but maybe that’s a clue.
I am working in a different location today, and both of the multicasts that I tried have worked all the way through. I copied the same image I have been using all along from the last location’s server to this location’s server, deployed it to 1 PC, changed a few settings, captured, and used multicast to deploy it to 5 PCs and then 3 PCs. I have attached one of the successful multicast logs from the server at this location.
@sebastian-roth I will try putting different hard drives in the clients, and if that shows the same results I’ll probably just reinstall Win10 on one of the machines, capture that, and use that as my smaller test image.
I am thinking about finding another smaller image to test with and see if maybe the smaller image multicasts successfully.
Definitely give that a try. See if you can pinpoint what exactly is causing this. So far I have no clue, I am afraid.
The partition files you posted seem perfectly fine from my point of view.
Would you be able to put in a different hard drive in two or three of these PCs just for testing multicast on those and see if it makes any difference?
@sebastian-roth My guess is that the problem, whatever it is, only shows up on a partition that is over a certain size. Or perhaps it has to do with the time elapsed. I wouldn’t think it has anything to do with the type of partition or the data it contains. This image is a pretty straightforward UEFI Windows 10 install. The first 3 partitions are whatever Windows puts there during install. The last partition is NTFS. I am thinking about finding another smaller image to test with and see if maybe the smaller image multicasts successfully.
@benc Thanks for the multicast log. I find this behavior really strange. Why would the last partition behave differently on multicast than all the other partitions do (just thinking out loud)?
From the log it seems kind of random. Sometimes it’s just one client not answering and next it’s all of them at the same time:
Timeout notAnswered= notReady= nrAns=5 nrRead=5 nrPart=6 avg=106
Timeout notAnswered=[0,1,2,3,4,5] notReady=[0,1,2,3,4,5] nrAns=0 nrRead=0 nrPart=6 avg=105
What kind of filesystems do you have on those four partitions? Is it all FAT32 or NTFS?
Can you post the contents of /images/Val-Public/d1.partitions (as well as …fixed_size_partitions and …minimum.partitions if this is a resizable image type)? Just trying to get a bigger picture here.
I pulled one of the multicast logs from the last server I used. Had to add .log to the end of the file to upload it here. Hope this helps.
We are having performance problems with multicast tasks too, with FOG version 1.5.4. At some sites the performance is good, but at others it is very, very bad.
We have two FOG servers: the old one running version 0.30 and the new one running 1.5.4. We can deploy without problems with the old one (both servers are in the same VLAN and deploy to the same VLANs), but not with the new one.
We have been testing the network to find the problem, without success so far. We have noticed, though, that on the new server --mcast-data-address always has the same value, while on the old FOG server it is always different. Could this parameter be the problem?
@george1421 I haven’t had a chance to dig much deeper into this issue since my last post but I did look through the logs of the last server I was using. I see a lot of timeouts, but I’m not familiar enough with the logs to be able to identify the issue. I can post a log here if that would help. Which log(s) would be relevant?
@benc The fix I posted will only address a specific issue with FOG 1.5.4. It will not help FOG 1.4.4 since it doesn’t use php-fpm.
So what does the multicast log say in /opt/fog/logs? Does it give you a clue as to why it’s failing?
@george1421 I edited www.conf and set the values you specified. There was only one copy, located in /etc/php/7.1/fpm/pool.d/. The server at this location is running Ubuntu Server 18.04.1 and FOG 1.4.4. Tried a multicast with 5 machines and it failed the first time at 86%, then I tried it a second time and it failed at 93%. My third attempt was with 2 clients, and it failed at 90%. I attached a pic of a client after each attempt just after it hung. After it hangs, elapsed time on the clients is still counting up, and GB/min is slowly decreasing since there is no activity. I looked at each client to confirm they all showed the same thing. They all stop at the same block and the Partclone screens are identical. I checked the hard drive LEDs on each client to make sure none of them are hung or showing signs of drive failure. All clients have a 120GB SSD. During the first 3 or 4 minutes, the speed is around 11-12 GB/min, but I can usually tell when it’s about to fail because the speed will start dropping to around 8-9 GB/min. HD LEDs on each client flash as expected, and none of them are staying lit constantly. I have also tried using different images in case it is something related to the image. Also, when I use unicast to deploy, it works fine on all machines and runs near 12 GB/min the whole way through.
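One way to compare the failed attempts is to convert each stall percentage into a data volume and an elapsed time. The sketch below assumes the last partition holds about 40 GB of data (the first post says "a little over 40GB") and that the percentage shown by Partclone maps linearly onto that volume; both are assumptions, and the function names are made up for illustration.

```python
def data_at_hang(partition_gb, percent_done):
    """GB already transferred when the session stalled at percent_done."""
    return partition_gb * percent_done / 100.0


def minutes_to_hang(partition_gb, percent_done, rate_gb_per_min):
    """Minutes of transfer before the stall, at a constant rate."""
    return data_at_hang(partition_gb, percent_done) / rate_gb_per_min


PARTITION_GB = 40.0  # assumed data volume of the last partition
RATE_GB_PER_MIN = 11.5  # midpoint of the 11-12 GB/min reported above

if __name__ == "__main__":
    for pct in (86, 93, 90):  # the three attempts described above
        gb = data_at_hang(PARTITION_GB, pct)
        mins = minutes_to_hang(PARTITION_GB, pct, RATE_GB_PER_MIN)
        print(f"{pct}%: about {gb:.1f} GB after about {mins:.1f} min")
```

If the stalls always landed on the same volume or the same elapsed time, that would hint at a fixed limit somewhere; a wide spread would point elsewhere.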
@george1421 Something I forgot to mention is that I’m running FOG 1.4.4 at all of my locations except one. I upgraded one of them to 1.5.4 to see what would happen. Same results. I am going to try editing www.conf as you suggested. Will report back shortly.
Let’s assume this is the issue we found after FOG 1.5.4 was released. Similar posts have addressed other multicasting issues with FOG 1.5.4. What the developers have seen is that under certain conditions php-fpm runs out of usable memory during a multicast. Probably the most useful change for you is bumping the memory from the default of 32MB to 256MB.
- Change to the /etc directory from the fog server linux command prompt.
- Search for the www.conf file. It can be in a number of locations depending on which version of PHP is installed. Use this command:
find /etc -name www.conf
(hopefully you will only find one)
- Edit that file and ensure these settings are accurate. Don’t just add them, since all of them should already be there except php_admin_value[memory_limit] = 256M; you will need to add that entry.
php_admin_value[memory_limit] = 256M
pm.max_requests = 2000
pm.max_children = 35
pm.min_spare_servers = 5
pm.start_servers = 5
- Save and exit your text editor.
- Reboot the fog server.
- See if that fixes what is wrong. You really should only see this strangeness under heavy load, but I guess it might show up sooner under certain conditions.
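With several servers to check, the comparison against the values listed in the steps above can be scripted. This is a minimal sketch, not a FOG tool: the function names are made up for illustration, and it assumes www.conf uses plain `key = value` lines with `;` comments and `[section]` headers.

```python
# Values taken from the steps above.
EXPECTED = {
    "php_admin_value[memory_limit]": "256M",
    "pm.max_requests": "2000",
    "pm.max_children": "35",
    "pm.min_spare_servers": "5",
    "pm.start_servers": "5",
}


def read_settings(text):
    """Collect active 'key = value' pairs, skipping comments and section headers."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_pool_config(text, expected=EXPECTED):
    """Return {key: (found, wanted)} for each setting that is missing or different."""
    found = read_settings(text)
    return {k: (found.get(k), v) for k, v in expected.items() if found.get(k) != v}


# Made-up file contents, only to demonstrate the check.
SAMPLE = """[www]
; pm.max_requests = 500
pm.max_children = 5
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_requests = 2000
"""

if __name__ == "__main__":
    for key, (found, wanted) in check_pool_config(SAMPLE).items():
        print(f"{key}: found {found!r}, expected {wanted!r}")
```

To use it on a real server, pass the contents of the file located with `find /etc -name www.conf` to `check_pool_config` instead of SAMPLE; an empty result means all five settings match.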
Also, we found there is something strange going on in the Linux kernels after 4.15.2, so I’m going to recommend that you downgrade your FOG/FOS kernel to 4.15.2. The issue with later kernels is that it’s taking 3-5 minutes to create the disk structure under certain circumstances, whereas with 4.15.2 and older it takes only seconds to create the structure.
Now, the kernel will not impact your issue, but the incomplete processing might be related to the missing php-fpm configuration setting.
@sebastian-roth Yes, I have tried multicasting only two machines, and I’ve tried several different pairs in case it was a hard drive like you said. I’ve seen a failed drive twice before, and I’ve seen that if one machine drops off the network or dies for some reason, the others will hang because they only go as fast as the slowest member.
Lenovo ThinkCentre M90
Intel Core i5 @ 2.67 GHz
2TB mechanical SATA hard drive
onboard gigabit network card
ByteSpeed mini desktop
Intel Core i3 @ 3.9GHz
120GB SATA SSD
onboard gigabit network card
@benc Just to recap what you have tried so far: have you done a multicast session with only two machines, and tried that several times using two different machines each time? I just want to make sure this is not an issue caused by a couple of damaged disks in your PCs that might slow down or even break a whole multicast session, as everyone would be waiting for the slowest member.
What kind of client machines do you have? Exact specs might help here.