Slow Unicast Deploy on new Machines

tomierna

@george1421 I’ve started a debug session, turned off as many of those advanced features as it would let me, and am currently tailing messages.

ethtool gave the following error:

Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
Actual changes:
scatter-gather: off
        tx-scatter-gather: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp6-segmentation: off
generic-segmentation-offload: off [requested on]
generic-receive-offload: off

The only text relating to the device in messages (first two lines are for a different driver in the kernel, right?):

e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
e1000: Copyright(c) 1999-2006 Intel Corporation.
e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
e1000e: Copyright(c) 1999-2015 Intel Corporation.
e1000e 0000:00:1f.6 0000:00:1f.6 (uninitialized): registered PHC clock
e1000e 0000:00:1f.6 eth0: (PCI Express:2.5GT/s:Width x1) MY:MA:CA:DD:RE:SS
e1000e 0000:00:1f.6 eth0: Intel(R) PRO/1000 Network Connection
e1000e 0000:00:1f.6 eth0: MAC: 12, PHY: 12, PBA No: 1000FF-0FF
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Looking at the output of ifconfig -a for the device while the deploy is running

eth0      Link encap:Ethernet  HWaddr MY:MA:CA:DD:RE:SS  
          inet addr:10.0.0.179  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2923202 errors:0 dropped:657 overruns:0 frame:0
          TX packets:493054 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4385852212 (4.0 GiB)  TX bytes:50905921 (48.5 MiB)
          Interrupt:16 Memory:ed200000-ed220000

That dropped RX packets sure looks suspicious!

Is there anything else I can poke while the deploy is running? I’m prepared to abandon this deploy and twist other knobs…

tomierna

The deploy of the main partition is finished and I’m holding off on finishing it up to get some more stats.

There haven’t been any substantive additional lines in /var/log/messages.

ifconfig -a shows a couple thousand dropped packets.

ethtool -S eth0 shows:

NIC statistics:
     rx_packets: 14019324
     tx_packets: 2266954
     rx_bytes: 21033100462
     tx_bytes: 236885691
     rx_broadcast: 845
     tx_broadcast: 4
     rx_multicast: 14
     tx_multicast: 0
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 14
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 2672
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_csum_offload_good: 14018965
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 1
     rx_smbus: 46
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0
     rx_hwtstamp_cleared: 0
     uncorr_ecc_errors: 0
     corr_ecc_errors: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0

rx_missed_errors corresponds roughly with the ifconfig dropped packets.

tomierna

After closely watching the deploy process a few times with statistics resets in between, I can confirm 16-24 link-down events is normal, because of the number of boots and reboots including Snap-In runs.

Some of these machines still have their BIOS date set incorrectly, and that makes KMS activation not work, so the 24-count includes the initial Snap-In to activate, and then the subsequent reboots and re-Snap-In to activate properly once the time is coherent.

I’m going to run another debug session soon and this time I’m going to increase the RX Ring Buffer to maximum - I’ve seen some chatter about this helping mitigate dropped packets with the e1000e card.

tomierna

Changing the ring buffer size to maximum didn’t do anything to help the speed, and the number of dropped packets is climbing.

In the meantime, I’ve also captured a new image from a t410i and deployed it to five machines as unicast. I was getting a solid 5GB/min for all of them, so my server and 10GbE link are working swimmingly. The deploy of all five of those took about seven minutes.

Tom Elliott

@tomierna as basic as this sounds would it be simple enough to test with a new patch cable?

tomierna

For comparison purposes, from a debug deploy’s shell prompt, I forced a t480’s port to 100Mbit/Full Duplex and then continued the deploy.

It started at roughly 550MB/min, and has settled to 405MB/min.

ethtool -S eth0 shows no increase in rx_missed_errors, but shows a large number of rx_crc_errors: 247402.

The switch port has a similar number of errors shown as “Collision Frames”.

Midstream, I forced 1GbE auto. The switch port agreed that it had re-negotiated at 1000Mb. There has been no increase in speed on the deploy, but now the CRC errors aren’t increasing, and instead the dropped packets count is increasing slowly (as before with 1000Mbit negotiated).

It seems like while the physical link is negotiating at 1000Mbit, something is throttling until it settles at around 55Mbit/sec.

To try to test things another way, I created an Ubuntu 18.04 USB boot drive and booted a t480 from that. Then, with my 2GB random-data file on the FOG server, I downloaded it via Firefox. It gave me a solid 100MB/sec.

dmesg on Ubuntu verified it’s using the same e1000e driver as the FOG kernel is using.

My next step is to fully install Ubuntu so I can create a 24GB random file to better test a download of that duration and size.

@Tom-Elliott - re: new patch cable - I’ll try, but this was happening with all of the stations, of which at least two of the cables were brand new. The Ubuntu test was on the same port and with the same cable, and it was getting proper 1GbE speeds.

tomierna

Sooooo, here’s where it gets weird.

I’m currently deploying to a t480 in debug mode.

It’s showing about 480MB/minute or 64Mbit/sec.

I’ve also got a fast USB3 SSD, formatted as EXT4 connected.

In a shell connected to the t480, I’ve used both cURL and wget to copy my 32GB randomized test file from the FOG server’s web server at 100+MB/sec.

I think this excludes the network card?

Thinking maybe it is an NFS issue, I copied the same image being deployed to the internal m.2 ssd to the external USB3 ssd with rsync, so I could see the copy speed. It copied in 3 minutes. That’s a 19GB .Img file.

So, it’s not NFS?

How about partimage vs. partclone? Nope: the old t410i image was partimage. The new one is partclone. All five simultaneous unicast deploys from earlier today of the new t410i image were 1GbE speed.

I’m really flummoxed at this point. Right now I’m getting GbE speeds on network copies to the external USB3 while a deploy is running at less than 100Mb/sec speeds writing to the internal m.2 drive.

Is it some sort of incompatibility with the m.2 ssd? How would I test that?

tomierna

I’m excluding the m.2 ssd as being problematic while booted into the FOG deploy kernel.

I stopped the deploy, formatted the largest partition with ext4, and then re-did my rsync test from the NFS images share.

Solidly 100MB/s, or 3 minutes for the whole image to copy.

george1421

I have to had it to you, you’ve tried about every permutation I could think of.

I finally found my old post about making fog go fast (benchmarking). https://forums.fogproject.org/topic/10459/can-you-make-fog-imaging-go-fast

So now the question in my mind is, is it the network, linux driver, or the local hard disk. Some of the commands from the go fast post can give you an idea of the commands needed to test the network and local disk writing speeds using the debug session from the FOS engine.

But I’m suspecting that the issue is related to your collisions and packet drops. I reset the port counters on the switch in the build up room today and deployed to 10 different systems with no collisions or crc errors. I’m just wondering do you see the same errors if you connect the 480 to your core switch, the one with the 10GB uplinks? It could be the switch too. (??).

For your m.2 test, you can use the dd command from my post to create a local 1GB file and compare both the read and write stats to my post too. But I’m really leaning towards the network is at fault some how.

tomierna

@george1421 I’m wrapping up for today, and I’ll work on it some more Monday.

I’ll read through your post for sure, so thanks for that!

The collisions/crc errors only happened when I forced the port into 100Mb/Full link mode.

Rx packet drops are what accrued while in 1GbE link, and there weren’t nearly as many.

I’ll test again next week, but I don’t think the dropped packets counter went up when I was doing straight network copies - I think that counter only went up during the deploys.

I’m satisfied that the m.2 is not the bottleneck, based on my final test today.

You ask me to connect to my core switch, but the topology is much more flat than that; Server 10GbE Fiber -> Imaging Station Switch -> Imaging client machines. There is no other network hardware in between.

george1421

@tomierna said in Slow Unicast Deploy on new Machines:

You ask me to connect to my core switch, but the topology is much more flat than that; Server 10GbE Fiber -> Imaging Station Switch -> Imaging client machines. There is no other network hardware in between.

The 480s did this before you upgraded to the 10GbE links too?

tomierna

@george1421 Yes, there was a single GbE connection between the server and the previous switch, an unmanaged Netgear GS116.

The first set of deploys I tried were similarly slow on the previous switch. The project to change out the switch for a managed one and add the 10GbE link was long planned, but since I couldn’t get any info out of the GS116, I figured having a management console would help debug things.

Doing five unicast t410i’s each at gigabit speeds makes me think the 10GbE link and VM are not the problem.

Sebastian Roth

@tomierna You’ve done a great job debugging that issue and trying to nail it down.

on Ubuntu verified it’s using the same e1000e driver as the FOG kernel is using

Ubuntu just as other distros do add many custom kernel patches. Possibly there is one or the other patch fixing exactly that issue you see on those machines. Be it network driver, m.2 drivers or what! Within FOS (the FOG mini linux OS) we use a plain vanilla kernel with a handful of patches added. Whatever you find, we are happy to add a patch to our kernels to help you make this run faster!

tomierna

@sebastian-roth - I think I’ve shown it’s not a FOG Client kernel issue, since I’ve been able to do http and NFS copies in a debug shell at GbE speeds. (See my post from a few days ago)

This week I’m going to try other versions of PartClone, as well as doing some other checking of port statistics when just doing the http and NFS copies.

tomierna

I just re-tested and am confirming GbE transfer speeds and no errors on the switch port or on the t480’s ethernet from a debug shell when rsync-ing the image files from the NFS mount to the m.2 SSD.

Does anyone have older or newer static binaries of Partclone which will run in the client fog boot image?

I’d like to test a couple of older versions, and possibly a newer test version. For old, maybe 0.2.88 and 0.2.80, and for newer, maybe 0.3.11?

I can’t figure out how to build them from source in a way that works on the debug-boot machine.

tomierna

Doing some more testing.

I finally compiled partclone 0.2.88, 0.2.91 and 0.3.11 and copied the binaries and enough of the libraries over to my debug-deploy machine so that they would run without error.

None of the versions made any difference in deploy speed. 0.2.91 and 0.3.11 took longer to write the GPT, but I think I remember reading about that somewhere.

Next I used gunzip to expand one of my partition files so I could run partclone from the shell to see if excluding pigz and cat from the FOS machine made a difference. It did not.

I added -d2 to the command to increase the error log verbosity.

The log doesn’t show any errors, but it does show all the reads and writes. The default buffer is 1MB, and so each of the reads is 256 4096-byte blocks.

Many of the writes say “write 1048576, 0 left”, but a large number of the writes appear to be fragmented. Here are three read/write cycles, with the first being non-fragmented, and the next two being fragmented:

read more: io_all: read 1049600, 0 left.
io_all: write 1048576, 0 left.
blocks_read = 256 and copied = 535552
read more: io_all: read 1049600, 0 left.
io_all: write 753664, 0 left.
io_all: write 57344, 0 left.
io_all: write 237568, 0 left.
blocks_read = 256 and copied = 535808
read more: io_all: read 1049600, 0 left.
io_all: write 204800, 0 left.
io_all: write 327680, 0 left.
io_all: write 360448, 0 left.
io_all: write 155648, 0 left.
blocks_read = 256 and copied = 536064

I don’t yet have a comparison log from a t410i to see if this type of write fragmentation pattern is normal, or if it’s potentially a reason for the slowdown.

Again, when I format the partition as ext4 and do a straight copy or rsync from the NFS server or wget or curl from the http server, I get 100+MB/sec. Only when it passes through partclone does it slow down to 400-500MB/min.

Tom Elliott

@tomierna then maybe this is ram related? We use fifo to handle receiving the data. This then stores the incoming data into ram, and from ram it is read and then written to disk.

I’m not sure how familiar you are with linux, but maybe this can help others as well.

The fifo literally means first in first out. It allows a file based access to a non file and is typically used for receiving a stream of data, of which we are when pulling from NFS. The data stream is likely working properly and meeting the upper limits of data that can be received in RAM.

In the near future we will be using the same process but with http/s streams. Maybe this can help us understand where the issue is?

Hopefully this can be of some use.

tomierna

Another day, another data point:

I’ve copied the decompressed image file to an external USB3 SSD (over NFS, at 100+MB/s with rsync), and while in the debug-deploy shell, I ran partclone using the SSD as the source.

The partclone session started out fast, but like with the NFS-based sessions, after about 5%, started to slow down. By 7%, transfer speed from SSD was at around 800MB/min, and by 10% was at 600MB/min.

/var/log/partclone.log showed similar write fragmentation patterns to what I posted last night.

I’m going to look next at kernel tunables to see if there are any io buffers I can set to be larger.

tomierna

On a partclone mailing list a test to determine if write io was a bottleneck was mentioned: restore to /dev/null.

I tried that, and got a solid 13GB/min from the SSD and 7.3GB/min from the NFS share.

This tells me write performance to the m.2 drive is probably the culprit.

Any kernel parameters I should be looking at? I will be doing a diff between sysctl -a on an Ubuntu 18.04 machine and the FOS client kernel.

george1421

@tomierna It might be related to the kernel driver itself. You’ve done a hell of a lot of debugging here and I’ve lost a bit of where you are in the process.

What kernel version works in ubuntu?

In ubuntu, if you run lspci -NN it might show you the device and the kernel driver being used. Thinking about it, it might not show you the disk controller if its not connected via the internal pcie buss.

And if booted into a fog debug session and then mounted the m.2 drive (/dev/sda1) over /mnt then ran this dd command.

dd if=/dev/zero of=/mnt/test1.img bs=1G count=1 oflag=direct
What throughput does it provide. This is just writing zeros to the test1.img file as fast as it can.

Slow Unicast Deploy on new Machines

107

12.6k

17.5k

156.4k