@Critchleyb Let’s keep the most obvious question in focus here: why do some of your clients time out at all? UDPcast does some re-sending of lost packets and has some wait time (will get to that later), but it seems like something within your network (or the clients themselves) drops or rejects packets or simply can’t keep up. As I don’t know your network structure and setup, I have no idea what could be causing this. We’d probably need to capture packets using tcpdump on the FOG server or on one of the clients (mirror port on the switch) and take a close look at them in Wireshark.
Is it always the same clients dropping out or always different ones? I would advise you to test in smaller groups, e.g. try four in a multicast session and, if those work nicely, try another batch of four different ones. Note down which ones work fine and which cause problems.
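For the packet capture mentioned above, something along these lines could be a starting point. This is only a sketch: the interface name eth0 and the output path are assumptions, so adjust them to your server. It filters on plain udp rather than a specific port, as I don't know which ports your setup uses.

```shell
#!/bin/sh
# Assumed values: replace with your actual interface and a path with enough free space.
IFACE="eth0"
OUT="/tmp/fog-multicast.pcap"

# Build the capture command: UDP traffic only, no name resolution (-n),
# full packets (-s 0), written to a file (-w) that Wireshark can open.
capture_cmd() {
    printf 'tcpdump -i %s -n -s 0 -w %s udp\n' "$IFACE" "$OUT"
}

# Print the command for review; run the printed line as root while a
# multicast session is active and stop it with Ctrl+C afterwards.
capture_cmd
```

Copy the resulting file to your workstation and open it in Wireshark.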
Now about the timeout…
The FOG setting MaxWait is set to 600, however as i understand it, this is the time it will wait to start the session and doesn’t affect the amount of time UDPCast will wait for a host during an active task.
You need to know that every partition is a single UDPcast session. The FOG multicast manager starts a command like this: udp-sender ... d1p1.img ; udp-sender ... d1p2.img ; ... - one udp-sender command for each partition, running one after the other. You are right, the --max-wait parameter only tells udp-sender how long to wait for clients before it starts.
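To illustrate the "one session per partition" point, here is a small sketch that prints such a chain. The flag values (4 receivers, 600 seconds) and the file names are placeholders I picked for the example, not what FOG actually generates on your server.

```shell
#!/bin/sh
# Print one udp-sender invocation per partition image, joined with ';'
# so that each session only starts after the previous one has finished.
# Flag values and file names are placeholders, not FOG's real command line.
build_chain() {
    chain=""
    for img in "$@"; do
        cmd="udp-sender --file $img --min-receivers 4 --max-wait 600 --nokbd"
        if [ -z "$chain" ]; then
            chain="$cmd"
        else
            chain="$chain ; $cmd"
        fi
    done
    printf '%s\n' "$chain"
}

build_chain d1p1.img d1p2.img
```

Each udp-sender in the chain applies its own --max-wait before it starts sending, which is why that setting does not help with a client that stalls in the middle of a partition.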
Looking through the man page I found an interesting option that you might give a try, --retries-until-drop:
How many time to send a REQACK until dropping a receiver. Lower retrycounts make “udp-sender” faster to react to crashed receivers, but they also increase the probability of false alerts …
Edit /var/www/html/fog/lib/service/multicasttask.class.php, jump to line 491 and add a new line to make it look like this:
sprintf(' %s', $duplex),
' --ttl 32',
' --nokbd',
' --nopointopoint',
' --retries-until-drop X', // new line, replace X with a number
);
$buildcmd = array_values(array_filter($buildcmd));
I couldn’t figure out what the default for this value is. Maybe try a value of 100 and then work your way down. Just an idea. Could also try 5 at first and work your way upwards.
I also found an interesting comment in the man page about dropped packets:
Be aware that network equipment or the receiver may be dropping packets because of a bandwidth which is too high. Try limiting it using “max-bitrate”.
The receiver may also be dropping packets because it cannot write the data to disk fast enough. Use hdparm to optimize disk access on the receiver. Try playing with the settings in /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max, i.e. setting them to a higher value.
You could adjust those values on the clients in a postinitscript.
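A minimal sketch of what such a script could contain is below. The 8 MiB value is just a number I picked as a starting point, not a tested recommendation, and the function name set_rmem is made up for this example.

```shell
#!/bin/sh
# Sketch of a postinit snippet. The 8 MiB value is an assumption, tune as needed.
RMEM_BYTES=8388608

# Write the buffer size into rmem_default and rmem_max under the given directory.
# Returns non-zero as soon as one of the writes fails.
set_rmem() {
    dir="$1"
    for name in rmem_default rmem_max; do
        echo "$RMEM_BYTES" > "$dir/$name" || return 1
    done
}

# Apply to the running kernel; this needs root and only warns on failure.
set_rmem /proc/sys/net/core || echo "could not raise receive buffers" >&2
```

The settings only last until the client reboots, so they need to be applied on every imaging boot. You can check them on a client with cat /proc/sys/net/core/rmem_max.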