Hosts drop out of Multicast Session on Partition switch



  • Hello,
    I’ve just deployed a multicast task to 12 devices in my first multicast test in a live environment.

    The 12 devices booted into FOG fine, started the multicast deployment once they had all joined, and began imaging the first partition. That only took a couple of seconds, but when it switched to the 500GB main partition, 5 of the 12 machines did not start deploying and instead just sat on a blank partclone screen. The 7 remaining devices carried on with the deployment.

    Advice on steps to fix this appreciated!

    I’m on Debian 9 with FOG version 1.5.5, imaging from a portable laptop using dnsmasq.


  • Developer

    @Critchleyb said in Hosts drop out of Multicast Session on Partition switch:

    Intel I219-V

    Maybe it really is a driver issue. Makes me wonder if other people have the same problems. On the other hand, you said that it would unicast to all PCs just fine, right?!

    I found this rather old topic. It talks about I219-LM cards, but I have heard about issues when offloading is enabled and drivers cannot handle it properly. Here is another topic that talks about disabling energy saving. So in the same script where you added the iperf stuff, you might comment that out and add the following two lines to test:

    ethtool --offload eth0 gso off gro off tso off
    ethtool --set-eee eth0 eee off
    

    Then schedule a normal multicast task for your clients and see if it’s any better or even worse?!
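
    The two ethtool lines above hard-code eth0. Here is a minimal sketch for applying them to whichever interface the client actually has. The helper names and the DRY_RUN switch are made up for illustration, the sysfs lookup assumes a Linux client, and the first interface found is not guaranteed to be the one used for imaging:

```shell
#!/bin/sh
# Hypothetical helper, not part of FOG: pick the first non-loopback interface
# listed under /sys/class/net (assumes a Linux client).
first_iface() {
    for path in /sys/class/net/*; do
        name=${path##*/}
        [ "$name" = "lo" ] && continue
        [ -e "$path" ] || continue
        printf '%s\n' "$name"
        return 0
    done
    return 1
}

# Print the two ethtool commands from the post for a given interface.
offload_cmds() {
    printf 'ethtool --offload %s gso off gro off tso off\n' "$1"
    printf 'ethtool --set-eee %s eee off\n' "$1"
}

iface=$(first_iface) || iface=eth0   # fall back to the name used in the post
if [ "${DRY_RUN:-1}" = "1" ]; then
    offload_cmds "$iface"            # default: only show what would run
else
    offload_cmds "$iface" | sh       # DRY_RUN=0: actually apply the settings
fi
```

    By default this only prints the commands, so it can be checked on a debug client before anything is changed.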


  • Developer

    @Critchleyb I have thought about this a bit more. Most issues arise when you try to multicast across switches and even more across routers/subnets. This can be very hard to get to work properly if you’re not one of these network wizards. But you said you had the problem even when multicasting clients all being on the same switch. This rules out the switch from my point of view as all switches I have seen handle multicast within their own fabrics fairly well.

    You might want to test with iPerf to see if you get the same results. Using iPerf you might get a feeling for what’s going on in your network. For that, edit /images/dev/postinitscripts/fog.postinit on your FOG server to look like the script below. Be careful: this file is used by every client doing a task, so don’t do this in an environment where someone else might be using this FOG server to deploy clients at the same time.

    #!/bin/bash
    ## This file serves as a starting point to call your custom pre-imaging/post init loading scripts.
    ## <SCRIPTNAME> should be changed to the script you're planning to use.
    ## Syntax of post init scripts are
    #. ${postinitpath}<SCRIPTNAME>
    curl -ko iperf https://iperf.fr/download/ubuntu/iperf_2.0.9
    chmod +x iperf
    ./iperf -s -u -B 224.0.0.1 -p 9000 -i 1
    

    Now also install iperf on your FOG server. Make sure it’s version 2.x as the newer 3.x does not support multicast testing anymore. Most distros should have it in the repos or you can get it here.

    Get together a couple of clients you want to test with and manually schedule a debug task for those (doesn’t matter if it’s capture or deploy as we don’t want to run the task itself anyway). You cannot schedule as debug via group so this has to be done individually for each host. Let them all boot up and hit ENTER twice to get to the shell. Then start the iperf test simply by running the command fog. Do this with all your test clients and let them wait there.

    Now back to the FOG server run this command to start the test: iperf -c 224.0.0.1 -u -p 9000 -i 1 -b 1000M -t 10

    This will do a first 10 second test with max bandwidth of 1000MBit/s. Adjust the last two parameters and start the test runs from your FOG server as often as you like. Once the clients are up and waiting you can start testing from the server again and again without touching the clients.
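
    Stepping through several bitrates can be scripted. A small sketch, assuming the listeners are already waiting as described above; the list of rates is an arbitrary pick and the DRY_RUN switch is made up for illustration:

```shell
#!/bin/sh
# Build the sender command from the post for one target bitrate and duration.
iperf_cmd() {
    printf 'iperf -c 224.0.0.1 -u -p 9000 -i 1 -b %s -t %s\n' "$1" "$2"
}

# Step through a few example bitrates (arbitrary picks), 10 seconds each.
sweep() {
    for rate in 100M 250M 500M 1000M; do
        iperf_cmd "$rate" 10
    done
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    sweep          # default: just list the commands
else
    sweep | sh     # DRY_RUN=0: run them one after another on the FOG server
fi
```

    Comparing the loss figures on the clients between runs may show the rate at which packets start to go missing.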

    Note that you only get the interesting results by watching the clients’ screens. There you see transfer rates and, more importantly, lost packets (in total and as a percentage).
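
    If the client output is saved to a file, the loss figure can be pulled out mechanically. This is a rough sketch that assumes the usual iperf 2 UDP report layout, where each line ends in "lost/total (percent%)". The function name is hypothetical and the sample line uses made-up numbers, not results from this thread:

```shell
#!/bin/sh
# Hypothetical helper: pull the loss percentage out of an iperf 2 UDP server
# report line, which ends in "lost/total (percent%)".
loss_percent() {
    printf '%s\n' "$1" | sed -nE 's|.* ([0-9]+)/ *([0-9]+) \(([0-9.]+)%\).*|\3|p'
}

# Illustrative line in the iperf 2 layout (made-up numbers).
sample='[  3]  0.0-10.0 sec   596 MBytes   500 Mbits/sec   0.021 ms 1275/425170 (0.3%)'
loss_percent "$sample"   # prints 0.3
```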

    Hint: The logic of client/server is a bit reverse in this iPerf test. In multicast mode you want one client (the FOG server) to send data to all your multicast server listeners (hosts).



  • @Sebastian-Roth I’m not great with it, but could maybe get our network guy to take a look if I was able to replicate it within our office. Not an easy thing to replicate though with the equipment we have available.

    They didn’t get dropped, they just fell a few blocks behind and slowed everything to a crawl. I eventually turned them off and the others returned to normal speed.

    All the clients were exactly the same: they have a Gigabyte GA-Z170M-D3H motherboard with Intel I219-V network cards. However, I’ve never had this issue unicasting to them from our main FOG server over the past couple of years.


  • Developer

    @Critchleyb Not sure what else we could advise. We have tuned things a fair bit but still you seem to have clients drop out of multicast fairly often. Are you good with network analyzing using Wireshark? Don’t really think you’ll find something causing this in your network but it might be worth a try having a look.

    Maybe crank up --retries-until-drop to values like 1000. But on the other hand a slow client will just pull down the speed for everyone else.

    Something that might be causing such dropouts could be a network driver issue. Are all the clients you have cloned in the last batch exactly the same? What network cards do they have? Exact model? If it turns out to be a Linux network driver issue, that might explain things.



  • @Sebastian-Roth Hey Again,

    Yeah, I’ve done several runs. The same devices dropped out, which leads me to think those ones have hardware issues, though before imaging they performed fine.

    I only get around 5 - 6 hours in the Live environment to deploy, so couldn’t get extensive testing done. I’m getting the Site to send those devices to our main office where I am going to test deploy on our main fog server to see if it was just the laptop struggling to deploy to them.

    I just can’t have such a high failure rate in my environment. Each time one fails and needs to have a unicast deployment to fix it adds 2 hours to the job. And the sites are quite far away and have led to me having some late nights trying to get them back up and running for the next day.

    I also tried Half Duplex just to test and the results were the same. All devices were on the same switch & subnet / vlan.

    Really stuck with what to do. The idea was to build a solution to re-image devices across the country without having to send entire replacement PCs. But without better hardware to support faster unicasting, I can’t see it as a viable option using a portable server.


  • Developer

    @Critchleyb Have you done several runs? Is it always the same clients having the issue? Are they all on the same switch?

    You might want to try setting the sysctl parameters on the clients as well. For that, edit /images/dev/postinitscripts/fog.postinit and add the sysctl calls at the end of the file. Make sure that file is executable (chmod +x /images/dev/postinitscripts/fog.postinit), start a new multicast task for your clients and boot them up.
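
    A sketch of what those lines at the end of fog.postinit could look like. The value 26214400 is the one used elsewhere in this thread. The APPLY switch is made up so the snippet can be tried without changing anything, and it assumes the sysctl command is available in the client environment:

```shell
#!/bin/sh
# Receive-buffer size used elsewhere in this thread (25 MiB in bytes).
RMEM_BYTES=26214400

# Print one "key=value" pair per line for the two settings under discussion.
rmem_settings() {
    for key in net.core.rmem_max net.core.rmem_default; do
        printf '%s=%s\n' "$key" "$RMEM_BYTES"
    done
}

# Apply each pair; a failure is reported but does not abort the imaging task.
# APPLY is a made-up switch: the default only prints the pairs.
if [ "${APPLY:-0}" = "1" ]; then
    rmem_settings | while read -r kv; do
        sysctl -w "$kv" || echo "could not set $kv" >&2
    done
else
    rmem_settings
fi
```

    When pasting this into fog.postinit, the guard could be dropped so the values are always applied.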



  • No luck with the changes I’ve made unfortunately.

    I attempted to image 6 devices on the same switch yesterday, only 3 completed deployment.

    One stopped at 47% where it fell behind on blocks, another at 75%, and another took too long syncing between partitions and dropped out.

    Not entirely sure what is causing this, but I think it’s a mix of sender and receiver. It’s worth noting that I am attempting to deploy 150GB on partition 1 and 400GB on partition 2 from a mid-spec laptop.
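
    For a sense of scale, here is a back-of-the-envelope estimate of how long that much data takes at the 500m bitrate cap mentioned in this thread. It assumes the full quoted sizes cross the wire with no retransmissions; compressed images would mean less data, so treat the result as a rough figure only:

```shell
#!/bin/sh
# Rough transfer time: decimal gigabytes sent at a capped bitrate in Mbit/s.
# GB * 8000 / Mbit gives whole seconds.
transfer_seconds() {
    echo $(( $1 * 8000 / $2 ))
}

total_gb=$((150 + 400))                   # partition sizes quoted in the post
secs=$(transfer_seconds "$total_gb" 500)  # 500 matches --max-bitrate 500m
echo "$secs seconds, about $((secs / 60)) minutes"
```

    Under those assumptions the session needs roughly two and a half hours without a single stall, which leaves plenty of time for one slow receiver to fall behind.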



  • @Sebastian-Roth Thanks for that.

    Just to list the Changes I have made to attempt to improve stability:

    Within

    /var/www/html/fog/lib/service/multicasttask.class.php
    
    Line 491
    
    Added: ' --retries-until-drop 100',
    Added: ' --max-bitrate 500m',
    
    Line 641:
    Changed Value: From 10 to 30
    
    Ran the following commands
    
    $ sysctl -w net.core.rmem_max=26214400
    $ sysctl -w net.core.rmem_default=26214400
    

    I’m also going to try to only deploy to devices on the same switch, so as to avoid traffic being passed through our router.

    I’ve taken a look at hdparm, but don’t currently understand how to create postinitscripts or how to use hdparm, so I’m not going to take a shot at that just yet.

    I’m next able to test this out next Tues/Weds, so I will let you know how it goes.


  • Developer

    @Critchleyb said in Hosts drop out of Multicast Session on Partition switch:

    I’m assuming 10 means Mins and not Secs

    I think it’s 10 seconds! But that should still be more than enough time for clients to join the next session, right?! If you think this is an issue, go to the same file I mentioned earlier and you will find the value 10 in line 641. I forgot to mention that you need to restart the service after changes: systemctl restart FOGMulticastManager

    About the “rmem…” values. Here is an article, not related to UDPcast but pretty much to the point what you are looking for. It talks about setting the value as high as 25 MB (26214400) so definitely way higher than what you see as default. Now that I write this I remember that we had this in the forums some time ago: https://forums.fogproject.org/topic/11249/uncompleted-multicast
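
    To put those numbers into context: the kernel reports these buffer sizes in bytes. A small sketch for converting between units and checking the current values on a Linux machine (the helper names are made up for illustration):

```shell
#!/bin/sh
# Convert between the raw byte counts the kernel reports and friendlier units.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }
bytes_to_kib() { echo $(( $1 / 1024 )); }

mib_to_bytes 25        # 26214400, the value suggested above
bytes_to_kib 163840    # 160, so a reported value of 163840 is 160 KiB

# Show the live values where the files exist (Linux only).
for f in /proc/sys/net/core/rmem_default /proc/sys/net/core/rmem_max; do
    if [ -r "$f" ]; then
        echo "$f: $(cat "$f") bytes"
    fi
done
```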

    As well here is an interesting post you wanna read: https://forums.fogproject.org/topic/12252/multicast-deploy-terribly-slow-and-huge-re-xmits-percentage



  • @Sebastian-Roth

    Did you actually see each and every client starting the transfer (partclone window filling up with more text and figures)? If not then I expect those clients to have been too slow to join

    I didn’t see them fill up with text, the assumption was based on my logs showing “Starting Transfer” and then Timeout Messages coming afterwards for multiple clients.

    I forgot to mention something very important. The MaxWait you set in the FOG settings is only used for the very first partition. Partition two, three and so on have a timeout of 10!

    That’s interesting. I did notice that on the partition switch some of the devices came up with a “Clearing NTFS flag” line underneath the partclone screen, whereas some went straight through to partition two. However, the hosts that carried on deploying did so within seconds of the partition switch, so I don’t think it reached the max wait time of 10 minutes (I’m assuming 10 means mins and not secs).

    About the switch and multicast stuff… This is a huge and a bit complex topic. Can’t give you an appropriate answer to that just now. Most switches handle multicast fairly well without config adjustment. The more different components are involved the higher the chance you need to manually adjust configs to make it work properly.

    Okay! I don’t entirely understand how it’s all working in depth. I think some of the hosts in my environment might be having issues if the data needs to bounce through the switch, so in the future I’ll image all the hosts on a single switch each time to keep things simple.

    Be aware that network equipment or the receiver may be dropping packets because of a bandwidth which is too high. Try limiting it using “max-bitrate”.

    The receiver may also be dropping packets because it cannot write the data to disk fast enough. Use hdparm to optimize disk access on the receiver. Try playing with the settings in /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max, i.e. setting them to a higher value.

    I’ve taken a look at these files and they each contain a single value of “163840” without context. Not sure what this value relates to or what an acceptable value to change it to would be.

    Could you link to the page you found this info at? I’m also not sure where the “max-bitrate” value is, and if it’s on the page I don’t want to keep bothering you! :P

    Really appreciate all the help you and the other devs / mods here take time to give. You lot do a great job, the best support with any piece of software I’ve ever had, and it’s open source! haha.


  • Developer

    @Critchleyb said in Hosts drop out of Multicast Session on Partition switch:

    That explains why some clients were dropping on partition changes, though it is interesting to me that the transfer began so all of the clients checked in, but then instantly timed out on deploy.

    Not exactly sure if that is true. Did you actually see each and every client starting the transfer (partclone window filling up with more text and figures)? If not, then I expect those clients to have been too slow to join. I forgot to mention something very important. The MaxWait you set in the FOG settings is only used for the very first partition. Partition two, three and so on have a timeout of 10! We decided to do this because clients should be pretty much “in sync” deploying the first partition and shouldn’t take long to join the session for the next partition. On the other hand, let’s say you have one out of 16 clients having a major issue and not being able to join the session. You’d have a long delay between every partition if a MaxWait of 600 were used for all partitions.
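
    The trade-off can be put into numbers. This sketch adds up the longest the sender could sit waiting for a client that never joins, assuming the wait applies once per partition; the function name is made up for illustration:

```shell
#!/bin/sh
# Worst-case total wait before data flows: the first partition may wait up to
# first_wait seconds, each later partition up to later_wait seconds.
worst_case_wait() {
    first_wait=$1 later_wait=$2 partitions=$3
    echo $(( first_wait + later_wait * (partitions - 1) ))
}

worst_case_wait 600 10 2    # 610 with the shorter timeout on partition two
worst_case_wait 600 600 2   # 1200 if MaxWait applied to every partition
```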

    About the switch and multicast stuff… This is a huge and a bit complex topic. Can’t give you an appropriate answer to that just now. Most switches handle multicast fairly well without config adjustment. The more different components are involved the higher the chance you need to manually adjust configs to make it work properly.



  • @Sebastian-Roth Thanks, I’ll work through and see if I get any success. Unfortunately the site I was imaging at I wont be able to head back to, but I’m heading out to another site mid next week.

    Interesting stuff about each partition being a single UDPcast session. That explains why some clients were dropping on partition changes, though it is interesting to me that the transfer began so all of the clients checked in, but then instantly timed out on deploy.

    As for network setup, I do have a question on that. As I understand it, multicasting sends data to the switch, and then hosts request that data and it is pulled down.

    Our setup sometimes involves multiple Meraki 225 48-port switches that all have an uplink going to a Cisco 4331 router. The FOG server is plugged into one of the Meraki 225 ports on the same VLAN as the hosts.

    If I connect the FOG Server to Switch 1, and then try to multicast to two Hosts, one on Switch 1 & one on Switch 2, is this going to cause any issues?


  • Developer

    @Critchleyb Let’s keep the most obvious question in focus here: why do some of your clients time out at all? UDPcast does some re-sending of lost packets and has some wait time (will get to that later), but it seems like something within your network (or the clients themselves) drops/rejects packets or simply can’t keep up. As I don’t know your network structure and setup, I have no idea what could be causing this. We’d probably need to capture packets using tcpdump on the FOG server or on one of the clients (via a mirror port on the switch) and take a close look at them in Wireshark.

    Is it always the same clients dropping out or always different ones? I would advise you to test in smaller groups, e.g. try four in a multicast session and if those work nicely try another batch of four different ones. Note down which ones work fine and which cause problems.
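
    Splitting a host list into batches of four is easy to script. A minimal sketch with placeholder host names (substitute the real ones); the function name is made up for illustration:

```shell
#!/bin/sh
# Print the given host names in rows of at most `size` names, one row per
# multicast test session.
batches() {
    size=$1
    shift
    printf '%s\n' "$@" | xargs -n "$size"
}

# Placeholder host names.
batches 4 pc01 pc02 pc03 pc04 pc05 pc06 pc07 pc08 pc09 pc10 pc11 pc12
```

    Each printed row is one test session; marking the rows that finish cleanly makes it easier to see whether the same machines fail every time.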

    Now about the timeout…

    The FOG setting MaxWait is set to 600, however as i understand it, this is the time it will wait to start the session and doesn’t affect the amount of time UDPCast will wait for a host during an active task.

    You need to know that every partition is a single UDPcast session. The FOG multicast manager starts a command like this: udp-sender ... d1p1.img ; udp-sender ... d1p2.img ; ... with one udp-sender command per partition, running one after the other. You are right, the --max-wait parameter only tells udp-sender how long to wait for clients before it starts.
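
    To make the chaining concrete, here is a sketch that prints such a command line. It uses a heavily reduced option set (the real command carries more options, elided as "..." above) and assumes that sessions after the first get a shorter wait. The function name and the wait values passed in are illustrative only:

```shell
#!/bin/sh
# Print one udp-sender call per partition image, joined with ";".
# The first call gets the long wait, later ones the short one.
build_chain() {
    first_wait=$1 later_wait=$2
    shift 2
    secs=$first_wait
    chain=""
    for img in "$@"; do
        chain="${chain}${chain:+ ; }udp-sender --max-wait $secs --nokbd --file $img"
        secs=$later_wait
    done
    printf '%s\n' "$chain"
}

build_chain 600 10 d1p1.img d1p2.img
```

    Because each udp-sender is a separate process, every client has to rejoin at each partition boundary, which is where a slow client can miss the start.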

    Looking through the man page I found an interesting option that you might give a try - --retries-until-drop:

    How many time to send a REQACK until dropping a receiver. Lower retrycounts make “udp-sender” faster to react to crashed receivers, but they also increase the probability of false alerts …

    Edit /var/www/html/fog/lib/service/multicasttask.class.php, jump to line 491 and add a new line to make it look like this:

                sprintf(' %s', $duplex),
                ' --ttl 32',
                ' --nokbd',
                ' --nopointopoint',
                ' --retries-until-drop X',
            );
            $buildcmd = array_values(array_filter($buildcmd));
    

    I couldn’t figure out what the default for this value is. Maybe try a value of 100 and then work your way down. Just an idea. Could also try 5 at first and work your way upwards.

    As well I found an interesting comment in the man page about dropped packets:

    Be aware that network equipment or the receiver may be dropping packets because of a bandwidth which is too high. Try limiting it using “max-bitrate”.
    The receiver may also be dropping packets because it cannot write the data to disk fast enough. Use hdparm to optimize disk access on the receiver. Try playing with the settings in /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max, i.e. setting them to a higher value.

    You could adjust those values on the clients in a postinitscript.



  • I’ve had a look at the logs. It seems like the clients are being treated as timed out with the error:

    Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=10000
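
    Lines like this can be tallied from a log. This is a sketch that reads log text on stdin, so it assumes nothing about where the log lives. The function name is made up, and the field layout is taken only from the single line quoted here:

```shell
#!/bin/sh
# Count udp-sender "Timeout" lines and list the distinct nrPart values seen.
timeout_summary() {
    awk '
        /Timeout notAnswered=/ {
            n++
            for (i = 1; i <= NF; i++)
                if ($i ~ /^nrPart=/) { v = $i; sub(/^nrPart=/, "", v); parts[v] = 1 }
        }
        END {
            printf "timeouts=%d", n
            for (p in parts) printf " nrPart=%s", p
            printf "\n"
        }'
}

printf '%s\n' 'Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=10000' | timeout_summary
```

    Feeding it a whole log file via stdin redirection gives a quick count of how often receivers timed out in a session.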
    

    I understand UDPcast is no longer under active development, but is there anywhere in the code where I am able to increase the timeout duration whilst a task is in progress?

    The FOG setting MaxWait is set to 600. However, as I understand it, this is the time it will wait to start the session and doesn’t affect the amount of time UDPcast will wait for a host during an active task.

    I finished most of my deployment yesterday, but didn’t start a single multicast deployment where at least 2 clients didn’t drop out. In my environment this causes some issues, as I’m trying to fit reimaging in before bookings to use the devices, and time frames can be tight :(


 
