Hosts drop out of Multicast Session on Partition switch

Critchleyb

Just to list the Changes I have made to attempt to improve stability:

Within

/var/www/html/fog/service/multicasttask.class.php

Line 491

Added: ' --retries-until-drop 100',
Added: ' --max-bitrate 500m',

Line 641:
Changed Value: From 10 to 30

Ran the following commands

$ sysctl -w net.core.rmem_max=26214400
$ sysctl -w net.core.rmem_default=26214400

I’m also going to try deploying only to devices on the same switch, so as to avoid traffic being passed through our router.

I’ve taken a look at hdparm, but don’t understand how to create postinitscripts or how to use hdparm currently so I’m not going to take a shot at that just yet.

I’ll next be able to test this out next Tues/Weds, so I will let you know how it goes.

Critchleyb

No luck with the changes I’ve made unfortunately.

I attempted to image 6 devices on the same switch yesterday, only 3 completed deployment.

One stopped at 47% where it fell behind on blocks, another at 75%, and another took too long syncing between partitions and dropped out.

Not entirely sure what is causing this, but I think it’s a mix of sender and receiver. It’s worth noting that I am attempting to deploy 150GB on Partition 1 and 400GB on Partition 2 from a mid-spec laptop.
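For a rough sense of scale, the sizes above can be set against the --max-bitrate 500m cap mentioned earlier. This is only a back-of-the-envelope sketch: it assumes the full 550 GB actually crosses the wire (FOG images are usually compressed, so the real figure is probably lower), that "500m" means 500 Mbit/s, and that the sender sustains that rate the whole time. The function name is made up for illustration.

```shell
#!/bin/bash
# Lower bound on transfer time in whole seconds (rounded down):
# size in decimal GB, rate in Mbit/s. Both inputs are assumptions.
min_transfer_seconds() {
    local size_gb="$1" rate_mbit="$2"
    echo $(( size_gb * 8 * 1000 / rate_mbit ))
}

# 150 GB + 400 GB at a 500 Mbit/s cap
secs="$(min_transfer_seconds 550 500)"
printf '%dh %dm %ds\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
```

Under those assumptions the best case is a bit under two and a half hours, before any retransmissions or waiting between partitions.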

Sebastian Roth

@Critchleyb Have you done several runs? Is it always the same clients having the issue? Are they all on the same switch?

You might want to try setting the sysctl parameters on the clients as well. For that, edit /images/dev/postinitscripts/fog.postinit and add the sysctl calls at the end of the file. Make sure that file is executable (chmod +x /images/dev/postinitscripts/fog.postinit), then start a new multicast task for your clients and boot them up.
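A minimal sketch of that step, in case it helps. The two sysctl lines and the default path are taken from this thread; the helper names and the POSTINIT override are made up for illustration, so it can be tried on a scratch copy first. It assumes the existing file ends with a newline.

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already there,
# so running this twice does not duplicate the sysctl calls.
append_once() {
    local line="$1" file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Path from the post above; set POSTINIT to a scratch file to experiment.
POSTINIT="${POSTINIT:-/images/dev/postinitscripts/fog.postinit}"

add_client_sysctls() {
    append_once 'sysctl -w net.core.rmem_max=26214400' "$POSTINIT"
    append_once 'sysctl -w net.core.rmem_default=26214400' "$POSTINIT"
    chmod +x "$POSTINIT"
}
```

Calling add_client_sysctls on the FOG server (as a user allowed to write to that file) should leave the two lines at the end of the script.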

Critchleyb

@Sebastian-Roth Hey Again,

Yeah I’ve done several runs. The Same Devices Dropped out which leads me to think those ones have hardware issues. Though before imaging they performed fine.

I only get around 5 - 6 hours in the Live environment to deploy, so couldn’t get extensive testing done. I’m getting the Site to send those devices to our main office where I am going to test deploy on our main fog server to see if it was just the laptop struggling to deploy to them.

I just can’t have such a high failure rate in my environment. Each time one fails and needs to have a unicast deployment to fix it adds 2 hours to the job. And the sites are quite far away and have led to me having some late nights trying to get them back up and running for the next day.

I also tried Half Duplex just to test and the results were the same. All devices were on the same switch & subnet / vlan.

Really stuck with what to do. The idea was to build a solution to re-image devices across the country without having to send entire replacement PCs. But without better hardware to support faster unicasting, I can’t see it as a viable option using a portable server.

Sebastian Roth

@Critchleyb Not sure what else we could advise. We have tuned things a fair bit but still you seem to have clients drop out of multicast fairly often. Are you good with network analyzing using Wireshark? Don’t really think you’ll find something causing this in your network but it might be worth a try having a look.

Maybe crank up --retries-until-drop to values like 1000. But on the other hand a slow client will just pull down the speed for everyone else.

Something that might be causing such dropouts could be network driver issue. Are all the clients you have cloned in the last batch exactly the same? What network cards do they have? Exact model? If it turns out to be a Linux network driver issue that might explain things.

Critchleyb

@Sebastian-Roth I’m not great with it, but could maybe get our network guy to take a look if I was able to replicate it within our office. Not an easy thing to replicate though with the equipment we have available.

They didn’t get dropped, they just fell a few blocks behind and slowed everything to a crawl. I eventually turned them off and the others returned to normal speed.

All the clients were exactly the same: they have Gigabyte GA-Z170M-D3H motherboards with Intel I219-V network cards. However, I’ve never had this issue unicasting to them over the past couple of years from our main FOG server.

Sebastian Roth

@Critchleyb I have thought about this a bit more. Most issues arise when you try to multicast across switches and even more across routers/subnets. This can be very hard to get to work properly if you’re not one of these network wizards. But you said you had the problem even when multicasting clients all being on the same switch. This rules out the switch from my point of view as all switches I have seen handle multicast within their own fabrics fairly well.

You might want to test with iPerf to see if you get the same results. Using iPerf you might get a feeling for what’s going on in your network. For that, edit /images/dev/postinitscripts/fog.postinit on your FOG server to look like the script below. Be careful: this file is used by every client running a task, so don’t do this in an environment where someone else might be using this FOG server to deploy clients at the same time.

#!/bin/bash
## This file serves as a starting point to call your custom pre-imaging/post init loading scripts.
## <SCRIPTNAME> should be changed to the script you're planning to use.
## Syntax of post init scripts are
#. ${postinitpath}<SCRIPTNAME>
curl -ko iperf https://iperf.fr/download/ubuntu/iperf_2.0.9
chmod +x iperf
./iperf -s -u -B 224.0.0.1 -p 9000 -i 1

Now also install iperf on your FOG server. Make sure it’s version 2.x as the newer 3.x does not support multicast testing anymore. Most distros should have it in the repos or you can get it here.

Get together a couple of clients you want to test with and manually schedule a debug task for those (doesn’t matter if it’s capture or deploy as we don’t want to run the task itself anyway). You cannot schedule as debug via group so this has to be done individually for each host. Let them all boot up and hit ENTER twice to get to the shell. Then start the iperf test simply by running the command fog. Do this with all your test clients and let them wait there.

Now back to the FOG server run this command to start the test: iperf -c 224.0.0.1 -u -p 9000 -i 1 -b 1000M -t 10

This will do a first 10 second test with max bandwidth of 1000MBit/s. Adjust the last two parameters and start the test runs from your FOG server as often as you like. Once the clients are up and waiting you can start testing from the server again and again without touching the clients.

Note that the interesting results only show up on the clients’ screens. There you see transfer rates and, more importantly, lost packets (in total and as a percentage).

Hint: The logic of client/server is a bit reverse in this iPerf test. In multicast mode you want one client (the FOG server) to send data to all your multicast server listeners (hosts).
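If the client output is saved somewhere, the loss figures can be pulled out with a small filter. This is a sketch only: it assumes iperf 2 prints its UDP report lines with a trailing "lost/total (percent%)" field, as in the sample line in the comment, and the function name and the iperf.log file are made up for illustration.

```shell
#!/bin/bash
# Print "lost total percent" for every UDP report line read from stdin.
# Assumed line shape (iperf 2 server side), for example:
#   [  3]  0.0- 1.0 sec   128 KBytes  1.05 Mbits/sec   0.026 ms    0/   89 (0%)
# Lines without a "lost/total (pct%)" field are skipped.
iperf_loss() {
    sed -n 's|.* \([0-9][0-9]*\)/ *\([0-9][0-9]*\) (\([0-9.e+-]*\)%).*|\1 \2 \3|p'
}
```

For example, iperf_loss < iperf.log would print one line per report interval, which makes it easier to compare clients than reading the screens side by side.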

Sebastian Roth

@Critchleyb said in Hosts drop out of Multicast Session on Partition switch:

Intel I219-V

Maybe it really is a driver issue. Makes me wonder if other people have same problems. On the other hand you said that it would unicast to all PCs just fine, right?!

I found this fairly old topic. It talks about I219-LM cards, but I have heard about issues when offloading is enabled and drivers cannot handle it properly. There is also another topic that talks about disabling energy saving. So in the same script where you added the iperf stuff, you might comment that out and add the following two lines to test:

ethtool --offload eth0 gso off gro off tso off
ethtool --set-eee eth0 eee off

Then schedule a normal multicast task for your clients and see if it’s any better or even worse?!
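The two lines above hardcode eth0. In case the interface shows up under a different name on the clients, here is a sketch that looks the name up first. The function names are made up for illustration, it assumes the first non-loopback entry under /sys/class/net is the wired NIC, and it only prints the ethtool commands rather than running them.

```shell
#!/bin/bash
# Print the first non-loopback interface name found under a sysfs-style
# directory (default /sys/class/net). Returns 1 if none is found.
first_nic() {
    local dir="${1:-/sys/class/net}" path name
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        name="${path##*/}"
        [ "$name" = "lo" ] && continue
        printf '%s\n' "$name"
        return 0
    done
    return 1
}

# Print (not run) the two ethtool commands from the post for one interface.
nic_tuning_commands() {
    local nic="$1"
    printf 'ethtool --offload %s gso off gro off tso off\n' "$nic"
    printf 'ethtool --set-eee %s eee off\n' "$nic"
}
```

In a debug session, nic_tuning_commands "$(first_nic)" shows what would be run, and the output can be checked by eye before adding the lines to the postinit script.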

Critchleyb

@Sebastian-Roth Thanks for all the info. I’m going to try and get a test environment working as its difficult to test this without affecting a live environment and potentially breaking some of the devices they have booked out later in the day >.<

That way I’ll be able to get a whole lot more info without the threat of running out of time and affecting a live environment!

george1421

@Sebastian-Roth In regards to iperf: when I was bandwidth benchmarking some hardware designs, I copied iperf v2 (he’ll have to search for it) to the FOG server. Then I PXE booted a target computer in debug mode. From the target FOS system I just used scp to copy the file from the FOG server to the target computer. This way my testing didn’t impact normal imaging, as it would with the postinit scripts. It gives the same results, just without needing to create a postinit script. It’s just a different way to go about it; neither way is better than the other IMO.

FWIW, I could flood a 1 GbE network link with just 2 simultaneous unicast deployments. That 3rd unicast made the packet error rates jump up quite a bit.

Sebastian Roth

@Critchleyb Did you come to a conclusion with this, or get a little further with the commands and ideas provided?
