Hosts drop out of Multicast Session on Partition switch



  • Hello,
    I’ve just deployed a multicast task to 12 devices in my first multicast test in a live environment.

    The 12 devices booted into FOG fine, started the multicast deployment once they had all joined, and began imaging the first partition. That only took a couple of seconds, but when it switched to the 500GB main partition, 5 of the 12 machines did not start deploying and instead just sat on a blank partclone screen. The remaining 7 devices carried on with deployment.

    Advice on steps to fix this appreciated!

    I’m on Debian 9, FOG version 1.5.5, imaging from a portable laptop using dnsmasq.



  • @Sebastian-Roth Thanks for that.

    Just to list the Changes I have made to attempt to improve stability:

    Within

    /var/www/html/fog/lib/service/multicasttask.class.php
    
    Line 491
    
    Added: ' --retries-until-drop 100',
    Added: ' --max-bitrate 500m',
    
    Line 641:
    Changed Value: From 10 to 30
    
    Ran the following commands
    
    $ sysctl -w net.core.rmem_max=26214400
    $ sysctl -w net.core.rmem_default=26214400
    

    I’m also going to try deploying only to devices on the same switch, so as to avoid traffic being passed through our router.

    I’ve taken a look at hdparm, but don’t understand how to create postinitscripts or how to use hdparm currently so I’m not going to take a shot at that just yet.

    I’ll next be able to test this next Tues/Weds, so I will let you know how it goes.


  • Developer

    @Critchleyb said in Hosts drop out of Multicast Session on Partition switch:

    I’m assuming 10 means Mins and not Secs

    I think it’s 10 seconds! But that should still be more than enough time for clients to join the next session, right?! If you think this is an issue, go to the same file I mentioned earlier and you will find the value 10 in line 641. I forgot to mention that you need to restart the service after changes - systemctl restart FOGMulticastManager

    About the “rmem…” values. Here is an article, not related to UDPcast but pretty much to the point of what you are looking for. It talks about setting the value as high as 25 MB (26214400), so definitely way higher than what you see as the default. Now that I write this, I remember that we had this in the forums some time ago: https://forums.fogproject.org/topic/11249/uncompleted-multicast

    As well here is an interesting post you wanna read: https://forums.fogproject.org/topic/12252/multicast-deploy-terribly-slow-and-huge-re-xmits-percentage



  • @Sebastian-Roth

    Did you actually see each and every client starting the transfer (partclone window filling up with more text and figures)? If not then I expect those clients to have been too slow to join.

    I didn’t see them fill up with text, the assumption was based on my logs showing “Starting Transfer” and then Timeout Messages coming afterwards for multiple clients.

    I forgot to mention something very important. The MaxWait you set in the FOG settings is only used for the very first partition. Partition two, three and so on have a timeout of 10!

    That’s interesting. I did notice that on the partition switch some of the devices came up with a “Clearing NTFS flag” line underneath the partclone screen, whereas some went straight through to partition two. On the partition switch, though, the hosts that carried on deploying did so within seconds, so I don’t think it reached the max wait time of 10 minutes (I’m assuming 10 means Mins and not Secs)

    About the switch and multicast stuff… This is a huge and a bit complex topic. Can’t give you an appropriate answer to that just now. Most switches handle multicast fairly well without config adjustment. The more different components are involved the higher the chance you need to manually adjust configs to make it work properly.

    Okay! I don’t entirely understand how it’s all working in depth. I think some of the hosts in my environment might be having issues if the data needs to bounce through the switch, so in the future I’ll image all the hosts on a single switch each time to keep things simple.

    Be aware that network equipment or the receiver may be dropping packets because of a bandwidth which is too high. Try limiting it using “max-bitrate”.

    The receiver may also be dropping packets because it cannot write the data to disk fast enough. Use hdparm to optimize disk access on the receiver. Try playing with the settings in /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max, i.e. setting them to a higher value.

    I’ve taken a look at these files and they each contain a single value of “163840” without context. Not sure what this value relates to or what an acceptable value to change it to would be.
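
    For context, these two files hold the default and maximum socket receive buffer sizes in bytes, so 163840 works out to 160 KiB, while the 26214400 suggested elsewhere in this thread is 25600 KiB (25 MiB). A minimal sketch of that arithmetic, where the helper name to_kib is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of FOG or UDPcast): bytes -> whole KiB.
to_kib() {
    echo $(( $1 / 1024 ))
}

echo "current:   $(to_kib 163840) KiB"     # value found in the rmem files
echo "suggested: $(to_kib 26214400) KiB"   # value suggested in this thread
```

    On a live system the current values can be read with sysctl net.core.rmem_default net.core.rmem_max.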

    Could you link to the page you found this info at? I’m also not sure where the “Max-Bitrate” value is, and if it’s on the page I don’t want to keep bothering you! :P

    Really appreciate all the help you and the other devs / mods here take time to give. You lot do a great job, the best support with any piece of software I’ve ever had, and it’s open source! haha.


  • Developer

    @Critchleyb said in Hosts drop out of Multicast Session on Partition switch:

    That explains why some clients were dropping on partition changes, though it is interesting to me that the transfer began, so all of the clients had checked in, but then instantly timed out on deploy.

    Not exactly sure if that is true. Did you actually see each and every client starting the transfer (partclone window filling up with more text and figures)? If not then I expect those clients to have been too slow to join. I forgot to mention something very important. The MaxWait you set in the FOG settings is only used for the very first partition. Partition two, three and so on have a timeout of 10! We decided to do this because clients should be pretty much “in sync” deploying the first partition and shouldn’t take long to join the session for the next partition. And on the other hand, let’s say you have one out of 16 clients having a major issue and not being able to join the session. You’d have a long wait between every partition if the MaxWait of 600 were used for all partitions.

    About the switch and multicast stuff… This is a huge and a bit complex topic. Can’t give you an appropriate answer to that just now. Most switches handle multicast fairly well without config adjustment. The more different components are involved the higher the chance you need to manually adjust configs to make it work properly.



  • @Sebastian-Roth Thanks, I’ll work through and see if I get any success. Unfortunately I won’t be able to head back to the site I was imaging at, but I’m heading out to another site mid next week.

    Interesting stuff about each partition being a single UDPcast session. That explains why some clients were dropping on partition changes, though it is interesting to me that the transfer began, so all of the clients had checked in, but then instantly timed out on deploy.

    As for network setup, I do have a question on that. As I understand it, multicasting sends data to the switch, and then hosts request that data and it is pulled down.

    Our setup sometimes involves multiple Meraki 225 48 Port switches that all have an uplink going to a Cisco 4331 Router. The FOG server is plugged into one of the Meraki 225 ports on the same VLAN as the hosts.

    If I connect the FOG Server to Switch 1, and then try to multicast to two Hosts, one on Switch 1 & one on Switch 2, is this going to cause any issues?


  • Developer

    @Critchleyb Let’s keep the most obvious question in focus here. Why do some of your clients time out at all? UDPcast does some re-sending of lost packets and has some wait time (will get to that later), but it seems like something within your network (or the clients themselves) drops/rejects packets or simply can’t keep up. As I don’t know your network structure and setup, I have no idea what could be causing this. We’d probably need to capture packets using tcpdump on the FOG server or one of the clients (mirror port on the switch) and take a close look at it in Wireshark.
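
    A capture along those lines could look roughly like this. It is only a sketch: the interface name and output file are placeholders, and the script prints the command instead of running it, since tcpdump needs root and the right interface for your setup.

```shell
#!/bin/sh
# Dry run: print a tcpdump command that would capture multicast UDP traffic.
# IFACE and OUTFILE are placeholders; adjust both to your environment.
IFACE="${IFACE:-eth0}"
OUTFILE="${OUTFILE:-multicast.pcap}"

capture_cmd() {
    # -n: no name resolution, -i: interface, -w: write raw packets to a file
    # 'udp and ip multicast' is a pcap filter limiting the capture to
    # UDP packets sent to a multicast address.
    echo "tcpdump -n -i $IFACE -w $OUTFILE 'udp and ip multicast'"
}

capture_cmd
```

    The resulting .pcap file can then be opened in Wireshark.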

    Is it always the same clients dropping out or always different ones? I would advise you to test in smaller groups, e.g. try four in a multicast session and if those work nicely try another batch of four different ones. Note down which ones work fine and which cause problems.

    Now about the timeout…

    The FOG setting MaxWait is set to 600, however as I understand it, this is the time it will wait to start the session and doesn’t affect the amount of time UDPCast will wait for a host during an active task.

    You need to know that every partition is a single UDPcast session. The FOG multicast manager starts a command like this: udp-sender ... d1p1.img ; udp-sender ... d1p2.img ; .... - one udp-sender command for each partition, running one after the other. You are right, the --max-wait parameter is only telling udp-sender how long to wait for clients before it starts.
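
    To illustrate the chaining: a rough sketch that only prints one udp-sender call per partition image, using the MaxWait value for the first one and the shorter 10 second timeout mentioned elsewhere in this thread for the rest. The function name and the trimmed option list are assumptions for illustration; the real command FOG builds carries more options than shown here.

```shell
#!/bin/sh
# Sketch only: print (do not run) one udp-sender call per partition image.
# FIRST_WAIT mirrors the FOG MaxWait setting; NEXT_WAIT is the shorter
# timeout described in this thread for every later partition.
FIRST_WAIT=600
NEXT_WAIT=10

build_chain() {
    wait_s=$FIRST_WAIT
    chain=""
    for img in "$@"; do
        cmd="udp-sender --file $img --max-wait $wait_s --nokbd"
        if [ -z "$chain" ]; then
            chain="$cmd"
        else
            chain="$chain ; $cmd"
        fi
        wait_s=$NEXT_WAIT   # only the first session gets the long wait
    done
    echo "$chain"
}

build_chain d1p1.img d1p2.img
```

    Because each partition is its own session, a client that is late for the second session has far less time to join than it had for the first.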

    Looking through the man page I found an interesting option that you might give a try - --retries-until-drop:

    How many time to send a REQACK until dropping a receiver. Lower retrycounts make “udp-sender” faster to react to crashed receivers, but they also increase the probability of false alerts …

    Edit /var/www/html/fog/lib/service/multicasttask.class.php, jump to line 491 and add a new line to make it look like this:

                sprintf(' %s', $duplex),
                ' --ttl 32',
                ' --nokbd',
                ' --nopointopoint',
                ' --retries-until-drop X',
            );
            $buildcmd = array_values(array_filter($buildcmd));
    

    I couldn’t figure out what the default for this value is. Maybe try a value of 100 and then work your way down. Just an idea. Could also try 5 at first and work your way upwards.

    As well I found an interesting comment in the man page about dropped packets:

    Be aware that network equipment or the receiver may be dropping packets because of a bandwidth which is too high. Try limiting it using “max-bitrate”.
    The receiver may also be dropping packets because it cannot write the data to disk fast enough. Use hdparm to optimize disk access on the receiver. Try playing with the settings in /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max, i.e. setting them to a higher value.

    You could adjust those values on the clients in a postinitscript.
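
    A minimal sketch of what such a script might contain, assuming the usual FOG layout where postinit scripts live under /images/dev/postinitscripts and are called from fog.postinit (check your own install, that path is an assumption here). The function name, the PROC_DIR override and the target value are made up for illustration; the override only exists so the logic can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch: raise the receive buffer settings only if they are below a target.
# PROC_DIR can be pointed at a scratch directory for a dry run.
PROC_DIR="${PROC_DIR:-/proc/sys/net/core}"
TARGET="${TARGET:-26214400}"

raise_rmem() {
    for name in rmem_default rmem_max; do
        file="$PROC_DIR/$name"
        [ -w "$file" ] || continue        # skip if missing or not writable
        current=$(cat "$file")
        if [ "$current" -lt "$TARGET" ]; then
            echo "$TARGET" > "$file"
            echo "$name: $current -> $TARGET"
        fi
    done
}
```

    Calling raise_rmem at the end of the script applies the change on the client before imaging starts.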



  • I’ve had a look at the logs. It seems like the clients are being treated as timed out with the error:

    Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=10000
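
    To get a feel for how often this happens, the matching lines can be counted. A small sketch that reads log text from standard input; the function name is made up, the second sample line is a placeholder, and no meaning is assigned to the individual fields here.

```shell
#!/bin/sh
# Sketch: count udp-sender "Timeout" lines in log text read from stdin.
count_timeouts() {
    # grep -c prints the number of matching lines (0 if there are none);
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -c 'Timeout notAnswered=' || true
}

# Sample input: the line quoted in this post plus one unrelated placeholder.
printf '%s\n' \
    'Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=10000' \
    'some other log line' \
    | count_timeouts
```

    In practice the function would be fed the multicast log file from the FOG server instead of the sample text.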
    

    I understand UDPCast is no longer under active development, but is there anywhere in the code where I am able to increase the timeout duration whilst a task is in progress?

    The FOG setting MaxWait is set to 600, however as I understand it, this is the time it will wait to start the session and doesn’t affect the amount of time UDPCast will wait for a host during an active task.

    I finished most of my deployment yesterday, but didn’t start a single multicast deployment where at least 2 clients didn’t drop out. In my environment this causes some issues, as I’m trying to fit reimaging in before bookings to use the devices, and time frames can be tight :(


 
