Multicast deploy terribly slow and huge re-xmits percentage
-
Hi,
I’m trying to deploy a Linux image to about 100 workstations. For testing purposes I’ve tried with only one classroom (16 workstations).
With these 16 workstations, deploying in multicast was terribly slow (between 20 and 50 MB/min), and in the udpcast log I see a lot of timeouts with a really high re-xmits percentage (about 230%), like this:
Timeout notAnswered=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] notReady=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] nrAns=0 nrRead=0 nrPart=16 avg=270 bytes= 79 416 064 re-xmits=0129589 (237.5%) slice=0112 - 8
Timeout notAnswered=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] notReady=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] nrAns=0 nrRead=0 nrPart=16 avg=154 bytes= 81 536 000 re-xmits=0132880 (237.2%) slice=0112 - 4
Timeout notAnswered=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] notReady=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] nrAns=0 nrRead=0 nrPart=16 avg=249 bytes= 90 015 744 re-xmits=0146857 (237.5%) slice=0112 - 1
Timeout notAnswered=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] notReady=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] nrAns=0 nrRead=0 nrPart=16 avg=226 bytes= 96 375 552 re-xmits=0157193 (237.4%) slice=0112 - 11
Timeout notAnswered=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] notReady=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] nrAns=0 nrRead=0 nrPart=16 avg=265 bytes=114 476 544 re-xmits=0185615 (236.0%) slice=0112 - 4
If I deploy this image to only 2 workstations, the speed grows to about 1 GB/min, but I still get timeouts and a re-xmits percentage around 60%, like this:
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=936
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=906
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=1076
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=1030 bytes= 9 438 906K re-xmits=4274638 ( 64.3%) slice=0112 - 0
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=1033 bytes= 9 573 791K re-xmits=4335096 ( 64.3%) slice=0112 - 1
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=713
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=1077
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=821
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=921 bytes= 9 578 887K re-xmits=4338016 ( 64.3%) slice=0112 - 0
Timeout notAnswered=[0,1] notReady=[0,1] nrAns=0 nrRead=0 nrPart=2 avg=1098
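To compare runs like the two above without reading the raw log by eye, the status fields can be pulled out with a small script. This is only a sketch: the regular expression is inferred from the log lines quoted in this thread, not from any udpcast documentation, and the names `STATUS_RE` and `parse_status` are made up for illustration. The meaning of the `K` suffix is not interpreted; it is only recorded.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official udpcast format description).
STATUS_RE = re.compile(
    r"bytes=\s*(?P<amount>[\d ]+?)(?P<kilo>K?)\s+"
    r"re-xmits=(?P<rexmits>\d+)\s+\(\s*(?P<pct>[\d.]+)%\)\s+"
    r"slice=(?P<slice>\d+)"
)


def parse_status(log_text):
    """Return one dict per 'bytes= ... re-xmits= ... slice= ...' status found."""
    rows = []
    for m in STATUS_RE.finditer(log_text):
        rows.append({
            # Digits are printed in space-separated groups, e.g. "79 416 064".
            "amount": int(m.group("amount").replace(" ", "")),
            # True when the amount carries a "K" suffix (unit left uninterpreted).
            "kilo": m.group("kilo") == "K",
            "rexmits": int(m.group("rexmits")),
            "rexmit_pct": float(m.group("pct")),
            "slice": int(m.group("slice")),
        })
    return rows


# Two status fragments copied from the logs in this thread.
SAMPLE = (
    "nrPart=16 avg=270 bytes= 79 416 064 re-xmits=0129589 (237.5%) slice=0112 - 8\n"
    "nrPart=2 avg=1030 bytes= 9 438 906K re-xmits=4274638 ( 64.3%) slice=0112 - 0\n"
)

for row in parse_status(SAMPLE):
    print(row["amount"], row["kilo"], row["rexmits"], row["rexmit_pct"])
```

Feeding a whole udpcast log through `parse_status` gives a list that can be plotted or averaged, which makes it easier to see whether a change (port speed, bitrate limit) actually moves the re-xmits percentage.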
I’ve tried some tweaks, like:
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.rmem_default=16777216
But nothing improves the stability of multicasting… Does anybody have an idea? I’m really stuck, and I can’t deploy only 1 or 2 machines at a time.
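One way to check whether receive-buffer tweaks like the ones above can help at all is to look at the kernel's UDP counters on the receiving machine: on Linux, `/proc/net/snmp` has a `Udp:` header line followed by a `Udp:` value line, and a growing `RcvbufErrors` value means datagrams were dropped because the socket buffer was full. The sketch below only parses that file; the function names and the sample numbers are made up for illustration, and whether this can be run inside the imaging environment on the clients is an assumption.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    # The file holds pairs of lines per protocol: names first, then values.
    udp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Udp:")
    ]
    if len(udp_lines) < 2:
        return {}
    names, values = udp_lines[0], udp_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


# Illustrative sample with made-up numbers, not taken from this thread.
SAMPLE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 30 500 30 0\n"
)

counters = parse_udp_counters(SAMPLE)
print(counters["InErrors"], counters["RcvbufErrors"])

# On a real receiver, something like:
#   before = read_udp_counters(); ...run the transfer...; after = read_udp_counters()
# and compare after["RcvbufErrors"] with before["RcvbufErrors"].
```

If `RcvbufErrors` stays flat while re-xmits stay high, the loss is probably happening before the socket buffer (NIC, driver, or switch port), and raising `rmem_max` would not be expected to change much.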
Thanks,
Regards,
Bruno -
Just for information: the workstations are Dell OptiPlex 7010s connected on a extreme network switch (100mb ports), and the FOG server is hosted on a Dell PowerEdge R420 server connected on the same stack (1 Gb port).
-
@bmacadre said in Multicast deploy terribly slow and huge re-xmits percentage:
connected on a extreme network switch (100mb ports)
That’s your problem.
The switch is important with multicasting.
There is a certain amount of processing power involved with replicating a packet to all ports - and cheap switches just don’t cut it.
There’s also maximum total throughput to consider. For example, at home I have a consumer-grade Cisco Small Business switch. It has 5 ports at 1Gbps each, but its total internal throughput is 3Gbps. That means I would never be able to multicast at home using that switch at 1Gbps speeds for more than 2 computers at a time. However, I have a new 8-port 1Gbps z-link switch from China (for 28 bucks new) that has an internal throughput of 5Gbps, meaning that device would be able to multicast to 4 computers at once with 1Gbps speed to each.
Again, cheap equipment just doesn’t cut it when it comes to multicast and really needing every port to operate at its maximum speed. The higher-end Cisco equipment usually doesn’t have a problem with this, though: it has the horsepower and typically has very high total internal throughput.
-
Thanks for replying.
@wayne-workman said in Multicast deploy terribly slow and huge re-xmits percentage:
There is a certain amount of processing power involved with replicating a packet to all ports - and cheap switches just don’t cut it.
That’s the thing: these aren’t cheap switches (they were really expensive when we bought them many years ago). They have an internal bandwidth of 48.8 Gb/s (x250e) and 128 Gb/s (x450e), so there is no problem on this side.
After some research and many, many tests, I found a problem with workstations at 100Mb. I think it’s probably a bug (or something that needs tweaking) in the workstations’ network driver (kernel 4.17). Let me explain:
First of all: to avoid congestion on the switch, I set a max bitrate of 80mb in the storage configuration.
- First test: workstations and server on an x250e (workstations on 100Mb ports and server on a 1Gb port). Result: many packets are dropped (about 1 million for a 10 GB image) and about 50% re-xmits.
- Second test: workstations and server on an x450e, all on 1Gb ports (auto-neg). Result: no drops at all and 0% re-xmits.
- Third test: workstations and server on an x450e, all on 1Gb ports, but all the workstations’ ports are fixed at 100Mb/full duplex. Result: same as the first test.
Conclusion: the problem is not the switches; they can easily manage this load. So I think there’s a problem on the client side, but I have no idea what it is.
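To put the figures from these tests in proportion, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions: "10 GB" is taken as 10 × 10^9 bytes, the "80mb" limit as 80 × 10^6 bits per second, the payload per packet as 1456 bytes (believed to be the udpcast default block size, but not confirmed anywhere in this thread), and a re-xmits percentage of 50% as meaning 50% extra traffic on top of the image itself.

```python
# Figures quoted in the tests above.
IMAGE_BYTES = 10 * 10**9        # "10 GB image" (assumed decimal gigabytes)
DROPPED_PACKETS = 1_000_000     # "about 1 million" drops
MAX_BITRATE_BPS = 80 * 10**6    # "80mb" max bitrate (assumed 80 Mbit/s)
REXMIT_RATIO = 0.50             # "about 50% of re-xmits"

# Assumption: payload bytes per packet (believed udpcast default block size).
BLOCK_BYTES = 1456


def packet_count(image_bytes, block_bytes):
    """Number of packets needed to send the image once (integer ceiling)."""
    return -(-image_bytes // block_bytes)


def transfer_seconds(image_bytes, bitrate_bps):
    """Time to send the image once at a fixed bitrate, ignoring overhead."""
    return image_bytes * 8 / bitrate_bps


packets = packet_count(IMAGE_BYTES, BLOCK_BYTES)
drop_share = DROPPED_PACKETS / packets
ideal_s = transfer_seconds(IMAGE_BYTES, MAX_BITRATE_BPS)
with_rexmits_s = ideal_s * (1 + REXMIT_RATIO)
limit_mb_per_min = MAX_BITRATE_BPS / 8 * 60 / 10**6

print(f"packets for one pass : {packets}")
print(f"dropped share        : {drop_share:.1%}")
print(f"ideal time at limit  : {ideal_s / 60:.1f} min")
print(f"with 50% re-xmits    : {with_rexmits_s / 60:.1f} min")
print(f"bitrate limit        : {limit_mb_per_min:.0f} MB/min")
```

Under these assumptions the 80mb limit alone would still allow roughly 600 MB/min, so the 20 to 50 MB/min seen in the first post is far below what the bitrate cap explains.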
Regards,
Bruno -
-
@bmacadre I don’t understand where the 100Mb is coming from? You said in your second post:
extreme network switch (100mb ports)
Which makes me think the switch is a 100Mbps switch.
-
The Extreme Networks x250e-48 switch has 48 10/100Mb ports and 2 1Gb ports (combo copper/fiber). The x450e-48 has 50 10/100/1000Mb ports (two of them are combo copper/fiber).
So in my first post (and my first try), the workstations and the server were connected to an x250e (workstations on 100Mb ports, server on a 1Gb port).
Sorry for my bad explanation (and my bad English). I hope it’s clearer now.
Regards,
Bruno -
@bmacadre Try to put the FOG Server on one of the 100Mbps ports. This would obviously severely hinder unicast imaging, but if you’re mostly doing multicast then this might work better.