
    Posts made by kkroflin

    • RE: Uncompleted multicast

      @george1421 said in Uncompleted multicast:

      This fix is interesting too.

      $ sysctl -w net.core.rmem_max=16777216
      $ sysctl -w net.core.rmem_default=16777216
      

      The rmem* changes helped when I was testing directly with udp-sender/udp-receiver and the received data was directed to /dev/null. The successful test deploy with 1 vCPU VM hosts (which adds disk writes on top) that I mentioned in previous posts was done in half-duplex mode, which I had forgotten to change back to full-duplex. Full-duplex on the Xen VM hosts is still problematic: a high number of re-xmits and aborted multicasts.

      The Xen servers’ internal switches are difficult to debug and I’m not sure where the packets are being dropped, but there are some lost packets on the Xen servers’ internal switches too. A Xen VM booted in debug mode was still dropping RX packets when full-duplex multicast was used. I’ll try to consult our Xen Server support to check the settings on the Xen servers. From a few posts I have seen, Xen’s Open vSwitch has very small queues by default, which could be a problem for UDPcast.

      On the physical host, before changing rmem*, there were dropped packets on the Extreme switch port connected to the PC, so my conclusion was that the problem was (mainly) on the receivers.

      In my environment I think I’ll stick to half-duplex mode, which allows receivers to sync more often so that their buffers are not maxed out. I did not find any other udp-sender parameter that would control how many packets are sent without confirmation from the receivers.
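      As a rough sanity check on why bigger receive buffers delay that max-out, here is some back-of-envelope shell arithmetic (the 100 Mbit/s rate is just an illustrative figure, roughly what my Xen VM receivers reached):

```shell
# Back-of-envelope: how long can a 16 MiB socket buffer absorb an
# unread 100 Mbit/s stream before the kernel starts dropping packets?
RMEM=16777216                    # bytes, i.e. net.core.rmem_max above
RATE=$((100 * 1000000 / 8))      # 100 Mbit/s expressed in bytes per second
MS=$((RMEM * 1000 / RATE))       # milliseconds of burst the buffer can hold
echo "${MS} ms"
```

      With the default 212992-byte buffer the same arithmetic gives about 17 ms, which would fit the “packets missing in large chunks” symptom I saw in Wireshark.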

      posted in FOG Problems
    • RE: Uncompleted multicast

      @sebastian-roth said in Uncompleted multicast:

      Should we mark this solved then?

      Thanks for the help! You can mark this as solved.

      Should I create a bug report about accepting an aborted multicast stream as a valid deployment?

      posted in FOG Problems
    • RE: Uncompleted multicast

      @sebastian-roth said in Uncompleted multicast:

      @kkroflin Great you tested this on the command line! To me this looks like a crude network issue. Are you good with wireshark analysis network traffic? I’d think you can find something there. Maybe it’s some kind of jumbo frame misconfiguration!?

      We don’t use jumbo frames. I tried Wireshark on the receiver and noticed that the packets were mostly missing in large chunks. When I limited max-bitrate on the sender, there were no issues and the receiver was only notifying the sender about completed blocks. Slowing down the receiver with an additional “gzip -9” after gunzip increased re-xmits and timeouts.

      I found that changing the receive buffers in the receiver’s OS helped a lot. Re-xmits dropped to 0 on the physical client after increasing the buffers from the default 212992 to 16777216:

      $ sysctl -w net.core.rmem_max=16777216
      $ sysctl -w net.core.rmem_default=16777216
      

      On the Xen VM receivers there were still re-xmits, but significantly fewer. The transfer rate on the Xen VM receivers increased from 50 Mbps to 100 Mbps after changing rmem. Increasing the buffers even more reduced re-xmits further.

      My guess is that changing rmem also lowered the probability of a receiver being dropped by the server for not confirming received packets and not answering (dropped) control packets. I think that the dropping of slow receivers while packets are being lost could also be reduced by increasing the --retries-until-drop parameter on the sender, but I have not tried that.

      After changing rmem, all Xen VM receivers completed the transfer.

      There were still many timeouts, even on the physical receiver with re-xmits=0, which I would attribute to the sender being able to send packets at a substantially higher rate than the receivers can process them.
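      One thing to keep in mind: sysctl -w is not persistent across reboots. If the larger buffers turn out to be needed permanently, a fragment like this would persist them (the file name is my own choice; any standard sysctl.conf location should work):

```
# /etc/sysctl.d/99-udpcast.conf (hypothetical file name)
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
```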

      posted in FOG Problems
    • RE: Uncompleted multicast

      I did more testing yesterday directly with udp-sender/udp-receiver and got the same error. I noticed a very high number of re-xmits and timeouts. My guess is that the hosts eventually were not able to keep up with the sender and were dropped with the message “Dropped by server now=37642 last=37640”.

      $ cat d1p2.img | udp-sender --min-receivers 3 --portbase 9000 --interface eth0 --full-duplex --ttl 2
      $ udp-receiver --portbase 9000 | gunzip -c > /dev/null
      
      # High number of timeout and re-xmits like this:
      Timeout notAnswered=[0] notReady=[0] nrAns=2 nrRead=2 nrPart=3 avg=1199
      bytes=177 521 344  re-xmits=0030530 ( 25.0%) slice=0112 -   2
      Timeout notAnswered=[1,2] notReady=[1,2] nrAns=1 nrRead=1 nrPart=3 avg=1281
      
      # Eventually first receiver dropped with message:
      Dropped by server now=37642 last=37640
      Block received: 0/0/0
      Transfer complete.
      gunzip: unexpected end of file
      
      

      Imaging physical hosts worked well. The Xen VM hosts also imaged well once I changed vCPUs from 1 to 4. With 1 vCPU, changing multicast to half-duplex, limiting the bitrate to 60 Mbps, or setting the --fec 8x8/128 udp-sender argument also worked OK.
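      For context on the --fec value: as I read the udpcast documentation, 8x8/128 means 8 interleaved stripes, each carrying 8 redundant slices per 128 data slices, so a number of lost slices can be reconstructed without a re-xmit. The bandwidth cost is modest:

```shell
# Redundancy overhead of udpcast --fec 8x8/128 (stripes x redundancy /
# stripesize, per my reading of the docs): 8 extra slices per 128 data slices.
REDUNDANT=8
STRIPESIZE=128
OVERHEAD=$((REDUNDANT * 10000 / STRIPESIZE))  # in hundredths of a percent
echo "overhead: ${OVERHEAD}/100 %"
```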

      I think I will be able to use multicast with the physical hosts with little or no tuning, but it would still be nice to have some error state if imaging has failed. Currently I have confirmed a successful transfer by looking at multicast.log.udpcast.29 and seeing that the receivers were disconnected only after the transfer had completed.
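      Lacking a real error state, one crude way to script that manual check would be to grep for the premature-drop message rather than the (unreliable) completion message. This assumes the drop also shows up in the server-side log, which I have not verified; the message text is taken from the udp-sender output in this thread, and a temp file stands in for the real log path:

```shell
# Hypothetical check: flag a multicast log that recorded a receiver being
# dropped mid-transfer. A temp file stands in for multicast.log.udpcast.29.
LOG=$(mktemp)
printf '%s\n' "Dropped by server now=37642 last=37640" > "$LOG"

if grep -q "Dropped by server" "$LOG"; then
    STATUS="receiver dropped - deploy suspect"
else
    STATUS="no drops logged"
fi
echo "$STATUS"
rm -f "$LOG"
```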

      @sebastian-roth said in Uncompleted multicast:

      Can you please post a picture of that just to make sure we’re looking at the same things here.

      Here is the md5sum on fog-server and fog-server-storage:
      0_1514975484993_4-md5.png

      Other than that… What network setting is used with those VMs? Is FOG running on the host itself? Where is the storage node, where the master node running?

      Both fog-server and the test VM hosts are on Xen servers that are connected via 2x10 Gbps LACP links to two stacked Cisco switches. fog-server is currently on one Xen server and the VM hosts are on another. The FOG server and the hosts each have one NIC and they are in the same VLAN in this test, although the situation across VLANs is the same.

      fog-server is a production VM that has been used in this configuration twice (once on RC-9 and a second time on RC-9 or RC-10, I’m not sure which) for imaging 180 computers via unicast across different VLANs.

      fog-server-storage is a test storage node that I have used a few times, but it is usually not in use because the primary server can handle the needed workload. fog-server-storage is a Hyper-V VM with a 2x1 Gbps LACP connection.

      posted in FOG Problems
    • RE: Uncompleted multicast

      @sebastian-roth said in Uncompleted multicast:

      FOG storage node where the image hasn’t been transfered to properly. After he fixed that (re-transfer the image from master to storage) the deploy went fine.

      I’ve checked the master and storage nodes and the md5 checksums are the same. I have also redeployed the image and verified that only the master node was being used, but the same issue occurred. Unicasting with the same image works OK.

      Tom changed that behavior not that long ago - see here. Though I don’t know why from the top of my head. I think there was an issue with some devices (maybe MS Surface) that would fail “lightly” and we wanted to make those work. Maybe see if you can find a discussion on this in the forum from around that date (4th of April 2017).

      I looked at Tom’s posts around that date but haven’t noticed anything related 😞

      posted in FOG Problems
    • Uncompleted multicast

      I’m testing multicast as a deployment method with FOG 1.5.0-RC-10 and I am experiencing problems. I haven’t used multicast with FOG before, so I don’t know whether this is a new issue or an old one.

      After a few minutes of running a multicast task (3 Xen VMs), one of the hosts fails, and after several more minutes another host fails.

      When a host fails, a “partclone fail” message is visible on the screen, the host is rebooted automatically after 1 minute, and the multicast task is canceled. The remaining hosts are still being deployed, but they also fail after some time. When the multicast on the first host is aborted and the host rebooted, “Task (18) Multi-Cast Task has been completed.” is written to the multicast log. After the reboot the host boots into Windows, which starts repairing disk errors.

      The screenshots and the log file are available on the Google Drive: https://drive.google.com/open?id=1gVETBU9oJAVLqBOCyNm3AjqKYLYHMWlU

      Is there anything I could try to get multicast working?

      Also, in case of an error like this, I would expect the host to end up in some clear error state, or in an infinite imaging loop, because if the task failed at 95% I probably wouldn’t notice that it had not completed correctly.

      posted in FOG Problems
    • RE: Init - Support for classless static route

      I wasn’t aware of postinit scripts :(. Looking at postinit now, it seems the network mount would have to go through the default router (the firewall), and we have tried to minimize what we allow through the firewall. XenServer PXE boot (on the testing host) doesn’t support classless static routes either, so we had to allow TFTP through to make this work.

      Maybe we could reconfigure our “single FOG server - multiple subnets/VLANs” setup? We tried creating multiple network adapters on the same FOG server so that clients in each network would communicate with the server through an IP address in their own subnet, but we were not able to get this to work, and I’m not sure FOG supports it, as I have seen that FOG requires a network interface to be specified in some places.

      posted in Feature Request
    • Init - Support for classless static route

      Recently we reconfigured our network so that clients are behind NAT with their default router set to our firewall. As our FOG server (1.5.0-RC9) is on a different subnet and the firewall has limited capacity, we have added a classless static route to the FOG server (and some other local servers).

      We found that the udhcpc script in FOG’s init was not setting the classless routes defined on our (non-FOG) DHCP server, so I changed the scripts in init to support this. The files are available on GitHub: https://github.com/kkroflin/fog-init-scripts

      It would be great if this or something similar is implemented in FOG.
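      For anyone landing here: busybox udhcpc (which FOG’s init uses, if I’m not mistaken) already decodes option 121 into a $staticroutes variable of the form “dest/prefix gateway [dest/prefix gateway …]”, so the event script mostly just has to walk the pairs. A minimal sketch, with echo standing in for the real ip route call:

```shell
# Sketch of applying classless static routes (DHCP option 121) in a udhcpc
# event script. Assumes busybox udhcpc's $staticroutes format:
# "dest/prefix gateway [dest/prefix gateway ...]".
apply_staticroutes() {
    set -- $1                          # rely on word splitting into pairs
    while [ "$#" -ge 2 ]; do
        echo "ip route add $1 via $2"  # echo instead of executing, for illustration
        shift 2
    done
}

apply_staticroutes "10.11.0.0/16 192.168.1.254 10.12.0.0/16 192.168.1.254"
```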

      posted in Feature Request