Uncompleted multicast



  • I’m testing multicast as a deployment method with FOG 1.5.0-RC-10 and I am experiencing problems. I haven’t used multicast with FOG before, so I don’t know if this is a new issue or an old one.

    After a few minutes of running a multicast task (3 Xen VMs), one of the hosts fails, and then after several minutes another host fails.

    When a host fails, a “partclone fail” message is visible on the screen, the host reboots automatically after 1 minute, and the multicast task is canceled. The remaining hosts are still being deployed, but they also fail after some time. When the multicast on the first host is aborted and the host rebooted, “Task (18) Multi-Cast Task has been completed.” is written to the multicast log. After the reboot, the host boots and Windows starts repairing disk errors.

    The screenshots and the log file are available on the Google Drive: https://drive.google.com/open?id=1gVETBU9oJAVLqBOCyNm3AjqKYLYHMWlU

    Is there anything I could try to get multicast working?

    Also, in case of an error like this, I would expect the host to end up in a clear error state or an infinite imaging loop, because if the task failed at 95%, I probably wouldn’t notice that it did not complete correctly.



  • @george1421 said in Uncompleted multicast:

    This fix is interesting too.

    $ sysctl -w net.core.rmem_max=16777216
    $ sysctl -w net.core.rmem_default=16777216
    

    The rmem* changes helped when I was testing directly with udp-sender/receiver and the received data was directed to /dev/null. The successful test deploy with 1-vCPU VM hosts (which adds disk writes) that I mentioned in previous posts was done in half-duplex mode, which I had forgotten to change back to full-duplex. Full-duplex on the Xen VM hosts is still problematic: a high number of re-xmits and aborted multicasts.

    The Xen servers’ internal switches are difficult to debug and I’m not sure where the packets are being dropped, but there are some lost packets on the Xen servers’ internal switches too. A Xen VM booted in debug mode was still dropping RX packets when full-duplex multicast was used. I’ll consult our Xen Server support to check the settings on the Xen servers. From a few posts I have seen, Xen’s Open vSwitch has very small queues by default, which could be a problem for UDPcast.

    On the physical host, before changing rmem*, there were dropped packets on the Extreme switch port connected to the PC, so my conclusion was that the problem was (mainly) on the receivers.

    In my environment, I think I’ll stick to half-duplex mode, which lets the receivers sync more often so that buffers are not maxed out. I did not find another udp-sender parameter that controls how many packets are sent without confirmation from the receivers.
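    For reference, a sketch of the earlier manual test switched to half-duplex (parameters copied from the udp-sender test further down the thread; the image path is whatever was used there):

```shell
# Same manual test as before, but half-duplex: the sender pauses for
# receiver acknowledgements more often, which keeps the receive buffers
# from filling up on slow receivers.
cat d1p2.img | udp-sender --min-receivers 3 --portbase 9000 \
    --interface eth0 --half-duplex --ttl 2
```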


  • Developer

    @george1421 said in Uncompleted multicast:

    @sebastian-roth said in Uncompleted multicast:

    Do you remember the debugging session with Tom on 3rd/4th of April 2017?

    Sorry, no, I don’t remember; I’ve slept a few times since then. (lol)

    @george1421 @Tom-Elliott @Joe-Schmitt So should we revert this commit you think?


  • Moderator

    @sebastian-roth said in Uncompleted multicast:

    Do you remember the debugging session with Tom on 3rd/4th of April 2017?

    Sorry, no, I don’t remember; I’ve slept a few times since then. (lol)

    This is an interesting thread. I can say I’ve seen with LACP that if it’s not configured right, random clients can end up unable to talk to the central server. But I don’t think that is the case here, because the image starts to transmit and then fails sometime later.

    This fix is interesting too.

    $ sysctl -w net.core.rmem_max=16777216
    $ sysctl -w net.core.rmem_default=16777216
    

    I seem to recall these settings from NFS performance tuning several years ago. There are also transmit parameters along the same lines.

    So from the FOG perspective, I wonder if these should be set by default by the FOG installer? There should be no risk in setting these values; the only cost is memory consumed on the FOG server.
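    If the installer did adopt them, a sketch of how the values could be persisted across reboots (the file name is my assumption; the values are the ones from this thread):

```shell
# Hypothetical /etc/sysctl.d/ fragment so the buffer sizes survive a
# reboot, mirroring the runtime "sysctl -w" commands above.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-fog-multicast.conf
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
EOF
sudo sysctl --system   # reload all sysctl.d fragments
```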

    While I worked with multicasting last summer, I wonder if there are similar kernel settings that need to be applied on the target computer? For testing purposes we could set these values using a FOG postinit script. While multicasting works, I wonder if it could be better.

    I know I’m rambling a bit; it’s early here in the USA and I need a second cup of coffee to get going.



  • @sebastian-roth said in Uncompleted multicast:

    Should we mark this solved then?

    Thanks for the help! You can mark this as solved.

    Should I create a bug report about accepting aborted multicast stream as a valid deployment?


  • Developer

    @kkroflin Yeah, you’re definitely on the right track there! I see you’re pretty good with this kind of stuff. Should we mark this solved then?

    There were still many timeouts even on the physical receiver with re-xmits=0, which I would associate with a server that can send packets at a substantially quicker rate than the receivers can process them.

    Yeah, this is just what happens with UDP when the sender and the receivers (machines/VMs) can handle different amounts of traffic. TCP handles this better. A quick search on the web revealed this post: https://github.com/esnet/iperf/issues/261
    Seems like you already found the right parameters to play with: https://www.ibm.com/support/knowledgecenter/en/SSQPD3_2.4.0/com.ibm.wllm.doc/UDPSocketBuffers.html
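    The mismatch described in that iperf issue can be reproduced directly, by the way. A sketch (hostname and bitrate are placeholders, not values from this thread):

```shell
# On the receiver (assumed hostname "receiver-host"):
iperf3 -s

# On the sender: offer UDP traffic (-u) at 800 Mbit/s for 10 seconds.
# The final report shows how many datagrams the receiver actually lost,
# i.e. how far the receiver falls behind the offered rate.
iperf3 -c receiver-host -u -b 800M -t 10
```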

    I think this is a really great thread we should keep in mind if we see similar questions coming up!



  • @sebastian-roth said in Uncompleted multicast:

    @kkroflin Great that you tested this on the command line! To me this looks like a crude network issue. Are you good with analyzing network traffic in Wireshark? I’d think you can find something there. Maybe it’s some kind of jumbo frame misconfiguration!?

    We don’t use jumbo frames. I tried Wireshark on the receiver and noticed that the packets were mostly missing in large chunks. When I limited max-bitrate on the sender, there were no issues and the receiver was only notifying the sender about completed blocks. Slowing down the receiver with an additional “gzip -9” after gunzip increased re-xmits and timeouts.

    I found that changing the receive buffers in the receiver’s OS helped a lot. Re-xmits dropped to 0 on the physical client after changing the buffers from the default 212992 to 16777216:

    $ sysctl -w net.core.rmem_max=16777216
    $ sysctl -w net.core.rmem_default=16777216
    

    On the Xen VM receivers there were still re-xmits, but significantly fewer. The transfer rate went from 50 Mbps to 100 Mbps on the Xen VM receivers after changing rmem. Increasing the buffers even more reduced re-xmits further.

    My guess is that changing rmem has also lowered the probability that a receiver is dropped by the server for not confirming received packets and not answering (dropped) control packets. I think that dropping slow receivers when packets are being lost could also be reduced by increasing the --retries-until-drop parameter on the sender, but I have not tried that.

    After changing rmem, all Xen VM receivers completed the transfer.

    There were still many timeouts even on the physical receiver with re-xmits=0, which I would associate with a server that can send packets at a substantially quicker rate than the receivers can process them.
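    A back-of-the-envelope check (my numbers, not measured here) suggests why a 16 MB buffer is in the right ballpark: at gigabit line rate, a receiver that stalls for 100 ms while writing to disk must absorb roughly 12.5 MB into its receive buffer to avoid drops.

```shell
# Back-of-envelope sizing for net.core.rmem_max.
# Assumptions: 1 Gbit/s link, receiver falls behind for 100 ms.
rate_bits=1000000000   # link rate in bits per second
stall_ms=100           # how long the receiver may stall
bytes_needed=$(( rate_bits / 8 * stall_ms / 1000 ))
echo "$bytes_needed"   # prints 12500000, i.e. ~12 MB, so 16 MB is plausible
```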


  • Developer

    @kkroflin Great that you tested this on the command line! To me this looks like a crude network issue. Are you good with analyzing network traffic in Wireshark? I’d think you can find something there. Maybe it’s some kind of jumbo frame misconfiguration!?

    I think I will be able to use multicast with physical hosts, with or without a little bit of tuning, but it would still be nice to have some error state if imaging has failed.

    I do understand that. I’ve looked through the forum posts myself, and it might have been a matter of trying to debug this issue.
    @george1421 Do you remember the debugging session with Tom on 3rd/4th of April 2017? Could it be that Tom pushed this commit? Or do you have any other idea?



  • I did more testing yesterday directly with udp-sender/receiver and got the same error. I noticed a very high number of re-xmits and timeouts. My guess is that the hosts eventually were not able to keep up with the sender and were dropped with the message “Dropped by server now=37642 last=37640”.

    $ cat d1p2.img | udp-sender --min-receivers 3 --portbase 9000 --interface eth0 --full-duplex --ttl 2
    $ udp-receiver --portbase 9000 | gunzip -c > /dev/null
    
    # High number of timeouts and re-xmits like this:
    Timeout notAnswered=[0] notReady=[0] nrAns=2 nrRead=2 nrPart=3 avg=1199
    bytes=177 521 344  re-xmits=0030530 ( 25.0%) slice=0112 -   2
    Timeout notAnswered=[1,2] notReady=[1,2] nrAns=1 nrRead=1 nrPart=3 avg=1281
    
    # Eventually first receiver dropped with message:
    Dropped by server now=37642 last=37640
    Block received: 0/0/0
    Transfer complete.
    gunzip: unexpected end of file
    
    

    Imaging physical hosts worked well. The Xen VM hosts also imaged well when I changed the vCPUs from 1 to 4. Also, with 1 vCPU, changing multicast to half-duplex, limiting the bitrate to 60 Mbps, or setting the --fec 8x8/128 udp-sender argument worked OK.
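    For reference, sketches of the two sender-side workarounds (the half-duplex one just swaps --full-duplex for --half-duplex); parameters are copied from the manual test above, and the values are the ones that worked for me:

```shell
# Cap the send rate at 60 Mbit/s so 1-vCPU receivers can keep up:
cat d1p2.img | udp-sender --min-receivers 3 --portbase 9000 \
    --interface eth0 --full-duplex --ttl 2 --max-bitrate 60m

# Or add forward error correction so receivers can reconstruct lost
# packets locally instead of requesting re-xmits:
cat d1p2.img | udp-sender --min-receivers 3 --portbase 9000 \
    --interface eth0 --full-duplex --ttl 2 --fec 8x8/128
```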

    I think I will be able to use multicast with physical hosts, with or without a little bit of tuning, but it would still be nice to have some error state if imaging has failed. Currently I have confirmed a successful transfer by looking at multicast.log.udpcast.29 and seeing that the receivers were disconnected only after the transfer had been completed.

    @sebastian-roth said in Uncompleted multicast:

    Can you please post a picture of that just to make sure we’re looking at the same things here.

    Here is the md5sum on fog-server and fog-server-storage:
    0_1514975484993_4-md5.png
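    For anyone repeating this check, a command-line sketch of the same comparison (the hostname and image path are my assumptions; adjust for your /images layout):

```shell
# Compare the image checksum on the master with the storage node.
# "fog-server-storage" and the path are placeholders.
master=$(md5sum /images/myimage/d1p2.img | cut -d' ' -f1)
storage=$(ssh fog-server-storage md5sum /images/myimage/d1p2.img | cut -d' ' -f1)
[ "$master" = "$storage" ] && echo "checksums match" \
    || echo "MISMATCH: re-transfer the image to the storage node"
```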

    Other than that… What network setting is used with those VMs? Is FOG running on the host itself? Where is the storage node, and where is the master node running?

    Both the fog-server and the test VM hosts are on Xen servers that are connected via 2x10 Gbps LACP links to two stacked Cisco switches. The fog-server is currently on one Xen server and the VM hosts are on another. The FOG server and the hosts each have one NIC and are in the same VLAN in this test, although the situation across VLANs is the same.

    fog-server is a production VM that has been used in this configuration twice (once on RC-9 and a second time on RC-9 or RC-10, I’m not sure) for imaging 180 computers via unicast across different VLANs.

    fog-server-storage is a test storage node that I have used a few times, but it’s usually not in use because the primary server can handle the needed workload. fog-server-storage is a Hyper-V VM connected via a 2x1 Gbps LACP connection.


  • Developer

    @kkroflin said:

    I’ve checked master and storage nodes and md5 checksum is the same.

    Can you please post a picture of that just to make sure we’re looking at the same things here.

    Other than that… What network setting is used with those VMs? Is FOG running on the host itself? Where is the storage node, and where is the master node running?



  • @sebastian-roth said in Uncompleted multicast:

    FOG storage node where the image hadn’t been transferred properly. After he fixed that (re-transferred the image from master to storage) the deploy went fine.

    I’ve checked the master and storage nodes and the md5 checksums are the same. I have also redeployed the image and checked that only the master node was being used, but the same issue occurred. Unicasting with the same image works OK.

    Tom changed that behavior not too long ago; see here. Though I don’t know why off the top of my head. I think there was an issue with some devices (maybe MS Surface) that would fail “lightly” and we wanted to make those work. Maybe see if you can find a discussion on this in the forum from around that date (4th of April 2017).

    I looked at Tom’s posts around that date but haven’t noticed anything related :(


  • Developer

    @kkroflin The important error message here is pigz: skipping: <stdin>: corrupted -- incomplete deflate data

    It means that the image is somehow corrupt and cannot be extracted. We had someone with this not long ago, and in his case it turned out that he had a FOG storage node where the image hadn’t been transferred properly. After he fixed that (re-transferred the image from master to storage) the deploy went fine.

    Previously I haven’t used multicast with FOG so I don’t know if this is a new issue or an old one.

    Multicast is one of FOG’s key features and has been part of it for years.

    Also, in case of an error like this, I would expect to receive some clear error state …

    Tom changed that behavior not too long ago; see here. Though I don’t know why off the top of my head. I think there was an issue with some devices (maybe MS Surface) that would fail “lightly” and we wanted to make those work. Maybe see if you can find a discussion on this in the forum from around that date (4th of April 2017).


 
