Posts made by tomierna

tomierna

@george1421 I’m wrapping up for today, and I’ll work on it some more Monday.

I’ll read through your post for sure, so thanks for that!

The collisions/CRC errors only happened when I forced the port into 100Mb/Full link mode.

Rx packet drops are what accrued while in 1GbE link, and there weren’t nearly as many.

I’ll test again next week, but I don’t think the dropped packets counter went up when I was doing straight network copies - I think that counter only went up during the deploys.

I’m satisfied that the m.2 is not the bottleneck, based on my final test today.

You asked me to connect to my core switch, but the topology is much flatter than that: Server 10GbE Fiber -> Imaging Station Switch -> Imaging client machines. There is no other network hardware in between.

tomierna

I’m ruling out the M.2 SSD as the problem while booted into the FOG deploy kernel.

I stopped the deploy, formatted the largest partition with ext4, and then re-did my rsync test from the NFS images share.

Solidly 100MB/s, or 3 minutes for the whole image to copy.

tomierna

Sooooo, here’s where it gets weird.

I’m currently deploying to a t480 in debug mode.

It’s showing about 480MB/minute or 64Mbit/sec.
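For anyone checking the arithmetic, here is a minimal sketch of the unit conversion. It assumes the deploy screen's MB means 10^6 bytes; if it is really MiB the result shifts by about 5%. The function name is made up for illustration.

```python
def mb_per_min_to_mbit_per_sec(mb_per_min: float) -> float:
    """Convert megabytes per minute to megabits per second (decimal units)."""
    bits_per_byte = 8
    seconds_per_minute = 60
    return mb_per_min * bits_per_byte / seconds_per_minute


# The rate shown on the t480 deploy screen.
print(mb_per_min_to_mbit_per_sec(480))
```

By the same conversion, a gigabit link carrying about 100MB/sec of payload works out to roughly 6000MB/min on the deploy screen.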

I’ve also got a fast USB3 SSD, formatted as EXT4 connected.

In a shell connected to the t480, I’ve used both cURL and wget to copy my 32GB randomized test file from the FOG server’s web server at 100+MB/sec.

I think this excludes the network card?

Thinking it might be an NFS issue, I used rsync to copy the image that is currently being deployed to the internal M.2 SSD onto the external USB3 SSD, so I could see the copy speed. It copied in 3 minutes. That’s a 19GB .img file.

So, it’s not NFS?

How about partimage vs. partclone? Nope: the old t410i image was partimage. The new one is partclone. All five simultaneous unicast deploys from earlier today of the new t410i image were 1GbE speed.

I’m really flummoxed at this point. Right now I’m getting GbE speeds on network copies to the external USB3 while a deploy is running at less than 100Mb/sec speeds writing to the internal m.2 drive.

Is it some sort of incompatibility with the m.2 ssd? How would I test that?

tomierna

For comparison purposes, from a debug deploy’s shell prompt, I forced a t480’s port to 100Mbit/Full Duplex and then continued the deploy.

It started at roughly 550MB/min, and has settled to 405MB/min.

ethtool -S eth0 shows no increase in rx_missed_errors, but shows a large number of rx_crc_errors: 247402.

The switch port has a similar number of errors shown as “Collision Frames”.

Midstream, I forced 1GbE auto. The switch port agreed that it had re-negotiated at 1000Mb. There has been no increase in speed on the deploy, but now the CRC errors aren’t increasing, and instead the dropped packets count is increasing slowly (as before with 1000Mbit negotiated).

It seems like while the physical link is negotiating at 1000Mbit, something is throttling until it settles at around 55Mbit/sec.

To try to test things another way, I created an Ubuntu 18.04 USB boot drive and booted a t480 from that. Then, with my 2GB random-data file on the FOG server, I downloaded it via Firefox. It gave me a solid 100MB/sec.

dmesg on Ubuntu verified it’s using the same e1000e driver as the FOG kernel is using.

My next step is to fully install Ubuntu so I can create a 24GB random file to better test a download of that duration and size.

@Tom-Elliott - re: new patch cable - I’ll try, but this was happening with all of the stations, of which at least two of the cables were brand new. The Ubuntu test was on the same port and with the same cable, and it was getting proper 1GbE speeds.

tomierna

Changing the ring buffer size to maximum didn’t do anything to help the speed, and the number of dropped packets is climbing.

In the meantime, I’ve also captured a new image from a t410i and deployed it to five machines as unicast. I was getting a solid 5GB/min for all of them, so my server and 10GbE link are working swimmingly. The deploy of all five of those took about seven minutes.

tomierna

After closely watching the deploy process a few times with statistics resets in between, I can confirm that 16-24 link-down events are normal, because of the number of boots and reboots including Snap-In runs.

Some of these machines still have their BIOS date set incorrectly, and that makes KMS activation not work, so the 24-count includes the initial Snap-In to activate, and then the subsequent reboots and re-Snap-In to activate properly once the time is coherent.

I’m going to run another debug session soon and this time I’m going to increase the RX Ring Buffer to maximum - I’ve seen some chatter about this helping mitigate dropped packets with the e1000e card.
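A rough sketch of how the maximum could be picked out of `ethtool -g` output and turned into the matching `ethtool -G` command. The sample text below is illustrative only (these numbers are not from my machine), and the helper name is hypothetical. The code only builds the command; it does not run it.

```python
import re


def max_rx_ring_command(interface: str, ethtool_g_output: str) -> list:
    """Build the ethtool command that raises the RX ring to its pre-set maximum."""
    # The first line starting with "RX:" after "Pre-set maximums:" holds the limit.
    match = re.search(
        r"Pre-set maximums:.*?^RX:\s*(\d+)", ethtool_g_output, re.S | re.M
    )
    if match is None:
        raise ValueError("could not find the pre-set RX maximum")
    return ["ethtool", "-G", interface, "rx", match.group(1)]


# Illustrative `ethtool -g eth0` output; the values are assumptions.
sample = """Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
"""

print(" ".join(max_rx_ring_command("eth0", sample)))
```

The resulting command would need to be run as root from the debug shell, and the setting does not survive a reboot.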

tomierna

The deploy of the main partition is finished and I’m holding off on finishing it up to get some more stats.

There haven’t been any substantive additional lines in /var/log/messages.

ifconfig -a shows a couple thousand dropped packets.

ethtool -S eth0 shows:

NIC statistics:
     rx_packets: 14019324
     tx_packets: 2266954
     rx_bytes: 21033100462
     tx_bytes: 236885691
     rx_broadcast: 845
     tx_broadcast: 4
     rx_multicast: 14
     tx_multicast: 0
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 14
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 2672
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_csum_offload_good: 14018965
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 1
     rx_smbus: 46
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0
     rx_hwtstamp_cleared: 0
     uncorr_ecc_errors: 0
     corr_ecc_errors: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0

rx_missed_errors corresponds roughly with the ifconfig dropped packets.
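With this many counters it is easy to miss the one that matters, so here is a small sketch that filters `ethtool -S` output down to the non-zero error-style counters. The keyword list is my own guess at what counts as a problem counter, not anything defined by the driver, and the sample is a shortened copy of the output above.

```python
def nonzero_error_counters(ethtool_output: str) -> dict:
    """Return error/drop-style counters whose value is greater than zero."""
    # Assumed list of substrings that mark a counter as a problem indicator.
    keywords = ("error", "missed", "dropped", "collision", "failed", "no_buffer")
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # skips the "NIC statistics:" header and blank lines
        if any(keyword in name for keyword in keywords) and int(value) > 0:
            counters[name] = int(value)
    return counters


sample = """NIC statistics:
     rx_packets: 14019324
     rx_errors: 0
     rx_crc_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 2672
     tx_dropped: 0
"""

print(nonzero_error_counters(sample))
```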

tomierna

@george1421 I’ve started a debug session, turned off as many of those advanced features as it would let me, and am currently tailing messages.

ethtool gave the following error:

Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
Actual changes:
scatter-gather: off
        tx-scatter-gather: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp6-segmentation: off
generic-segmentation-offload: off [requested on]
generic-receive-offload: off

The only text relating to the device in messages (first two lines are for a different driver in the kernel, right?):

e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
e1000: Copyright(c) 1999-2006 Intel Corporation.
e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
e1000e: Copyright(c) 1999-2015 Intel Corporation.
e1000e 0000:00:1f.6 0000:00:1f.6 (uninitialized): registered PHC clock
e1000e 0000:00:1f.6 eth0: (PCI Express:2.5GT/s:Width x1) MY:MA:CA:DD:RE:SS
e1000e 0000:00:1f.6 eth0: Intel(R) PRO/1000 Network Connection
e1000e 0000:00:1f.6 eth0: MAC: 12, PHY: 12, PBA No: 1000FF-0FF
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Looking at the output of ifconfig -a for the device while the deploy is running:

eth0      Link encap:Ethernet  HWaddr MY:MA:CA:DD:RE:SS  
          inet addr:10.0.0.179  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2923202 errors:0 dropped:657 overruns:0 frame:0
          TX packets:493054 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4385852212 (4.0 GiB)  TX bytes:50905921 (48.5 MiB)
          Interrupt:16 Memory:ed200000-ed220000

That dropped RX packets count sure looks suspicious!
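To put the drop count in proportion, here is a minimal sketch that pulls the RX counters out of the older net-tools `ifconfig` format shown above and works out a rough drop percentage. It assumes dropped packets can simply be divided by the RX packet total, which may not match exactly how the driver counts them.

```python
import re


def rx_drop_percent(ifconfig_output: str) -> float:
    """Return dropped RX packets as a percentage of RX packets."""
    match = re.search(
        r"RX packets:(\d+) errors:\d+ dropped:(\d+)", ifconfig_output
    )
    if match is None:
        raise ValueError("no RX packets line found")
    packets, dropped = int(match.group(1)), int(match.group(2))
    return 100.0 * dropped / packets if packets else 0.0


line = "RX packets:2923202 errors:0 dropped:657 overruns:0 frame:0"
print(f"{rx_drop_percent(line):.4f}% of received packets dropped")
```

The fraction is small, so on its own it does not prove the drops are what is limiting the deploy speed.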

Is there anything else I can poke while the deploy is running? I’m prepared to abandon this deploy and twist other knobs…

tomierna

Also, for what it’s worth, “Spanning Tree State” is set to Disable and operation mode is set to RSTP on the switch.

tomierna

@george1421 that makes sense, but for one imaging session each port saw at least 21 link down events during this 9-unit unicast.

One was 25 link down events, one was 32, and one was 33.

Doesn’t that seem a little odd?

tomierna

I reset the statistics in the GC728X Switch before starting the most recent batch of unicast deployments (I’m doing 9 right now).

There is a column in the port statistics for the switch called “Link Down Events”, and they are counting upwards. 15% into the deploy, most of the counters are at 11, and some are higher, in the 20s.

There is a thread on Spiceworks saying this card was flapping on Cisco switches. The poster ruled out green ethernet. It’s off on my switch too.

Tomorrow I will reset the statistics again and then capture an image to see if the link-down events count up during a capture. I’m also preparing a new t410i image, so I’ll be able to test if the ports flap with those machines on either capture or deploy.

I’ll try turning off the various auto-negotiation stuff to see if that makes the flapping cease.

tomierna

@george1421 - our imaging network is not routed, and I don’t have access to file servers on the rest of our LAN.

So, to test download speeds, I created a 2GB (random) file in /var/www/fog/client/ and added a case to download.php to allow me to download directly from the FOG Server with Chrome.

I was getting consistently 85-95MB/s.
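For reference, a sketch of one way to generate a random test file like this. Random data is used so that compression anywhere along the path cannot inflate the measured speed. The function name and chunk size are arbitrary choices, and the demo writes only a few megabytes to a temporary directory.

```python
import os
import tempfile


def write_random_file(path: str, size_bytes: int, chunk_bytes: int = 1 << 20) -> None:
    """Write size_bytes of random data to path, one chunk at a time."""
    remaining = size_bytes
    with open(path, "wb") as handle:
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            handle.write(chunk)
            remaining -= len(chunk)


# Small demonstration only; a real test file would be far larger.
demo_path = os.path.join(tempfile.mkdtemp(), "random.bin")
write_random_file(demo_path, 5 * 1024 * 1024)
print(os.path.getsize(demo_path))
```

For the real thing, the path would point somewhere under the web root and the size would be something like 2 * 10**9 bytes.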

I’m not sure of the best way to test a file upload that would show me speed, short of installing an FTP program and doing sFTP back and forth, but upload speed isn’t a problem in the FOG-booted scenario.

tomierna

@george1421

The I219-LM shows the following under Hardware Ids in the Details tab of Device Manager:

PCI\VEN_8086&DEV_15D7&SUBSYS_225D17AA&REV_21
PCI\VEN_8086&DEV_15D7&SUBSYS_225D17AA
PCI\VEN_8086&DEV_15D7&CC_020000
PCI\VEN_8086&DEV_15D7&CC_0200

[MOD Note] linux device translation [8086:15D7] - Geo
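The translation in the mod note can be sketched as a small parser: take the VEN and DEV fields from the Windows hardware ID and join them as vendor:device, the form `lspci -nn` prints. The function name is made up for illustration.

```python
import re


def windows_hwid_to_pci_id(hardware_id: str) -> str:
    """Turn a Windows PCI hardware ID into a vendor:device pair."""
    match = re.search(
        r"VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})", hardware_id
    )
    if match is None:
        raise ValueError(f"not a PCI hardware ID: {hardware_id!r}")
    vendor, device = match.groups()
    # lspci prints the IDs in lowercase hex.
    return f"{vendor}:{device}".lower()


print(windows_hwid_to_pci_id(r"PCI\VEN_8086&DEV_15D7&SUBSYS_225D17AA&REV_21"))
```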

tomierna

There was a BIOS update for the t480 machines, but after installing it on one, and running a unicast deploy, it doesn’t seem to have fixed anything.

1.12 was the original and 1.14 is the current BIOS, and there is a note about Ethernet instability when net booting before Windows starts, but alas, deploying is still slow.

Also, I don’t know if I answered it, but the switch statistics show very few packet or frame errors. I also checked that Green Ethernet was disabled on the switch.

tomierna

Thanks for the responses so far. I’ll try to answer the questions from everyone.

@george1421 - My FOG Server is a VM on a XenServer (7.3). The VM is running CentOS 7.4.1708, and has 4GB RAM and 2 CPUs allocated. Looking at the memory usage on the server, it doesn’t appear critical, but I have plenty of RAM in the master, so I can certainly try adding more. Disk subsystem is a large number of 2TB drives (24 I think?) in a RAID configuration, though I’d have to check the management console to say which config. It’s hardware RAID though, and the XenCenter for that VM doesn’t seem to show that it is taxing the disk subsystem. This is a pretty beefy VM server.

The t410i machines have 7200RPM 500GB drives. The t480 machines have 256GB M.2 SSDs.

Deploying one t480 ends up between 400-500MB/min.

Deploying one t410i shows expected throughput from a 1Gb port.

Deploying multiple t480s (unicast) ends up between 400-500MB/min on each machine.

I’ve not deployed multiple t410i (unicast) since trading out the switch and going to a 10GbE link to the server, but with the previous hardware, they shared a single 1GbE link to the server and it showed in the statistics.

Looking at the BIOS on the t410i, I don’t see any UEFI switch, so I presume it’s running in traditional boot.

With the t480s, UEFI is set to “only” and CSM Support has to be on, otherwise rEFInd complains and forces a “press any key” screen to appear after imaging completes.

For testing purposes, I changed the boot setting to Legacy Only on one t480, and it hasn’t made a difference in Deploy speed.

I will check the firmware version, but it is probably current since the machines were build-to-order, and they were shipped directly to us only a couple of weeks ago.

Deploying a t410i on a particular network port and then trying a t480 on the same port shows the t410i with proper throughput, and the t480 with very low throughput.

Re: 1GbE vs 100Mb/s link, I have checked in the switch management interface when a t480 deploy is running, and the link speed is listed as 1GbE.

@Tom-Elliott - I’m also leaning toward the t480s having some sort of strange issue. It could be BIOS-level, or maybe client kernel level. The fact that it captures at a normal fraction of link speed but deploys at much reduced speed makes me think it’s not the FOG Server.

@sebastian-roth - I started with whatever client kernel was installed with 1.5.0, and updated to 4.17.0 in an attempt to debug. Should I try going back farther?

Some additional data and stuff I’ve tried today:

The 8-machine multicast that I mentioned starting never completed. It stalled at 22%, and I let it sit for a bit to see if it would recover. It never did.
As mentioned above, I changed one to Legacy BIOS mode, and that didn’t change the deploy throughput.
I’ve looked through the t480 BIOS config pretty closely, and I don’t see anything related to network that I think would make a difference.
Early on when starting to make these machines image smoothly, I had to turn off the IP6 stack for netboot.

tomierna

Deploying t410i gets 5-7GB/min, or around 900Mbit/second.

Deploying t480 on the same switch port and cable gets 400-500MB/min, or around 60Mbit/second.

Capture of images is 5-7GB/min from either model. That’s what is so strange.

Thanks for the note about the newest kernels, I’ll downgrade.

Re: Partimage vs. Partclone compression, I can try, but I don’t think that accounts for a 15-20x speed differential.

Re: Perceived transfer rates vs. actuals: the image for our t410i is 24GB in size. The t480 image is 19GB. A t410i will deploy in less than 10 minutes, and a t480 will take over an hour.
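As a sanity check on perceived vs. actual rates, here is a minimal sketch that works backwards from image size and wall-clock time to an implied average rate. It assumes 1GB = 1000MB and that the whole deploy time is spent moving the stored image, which ignores compression and setup time, so treat the numbers as rough bounds.

```python
def implied_rate_mb_per_min(image_gb: float, minutes: float) -> float:
    """Average transfer rate implied by image size and wall-clock time."""
    return image_gb * 1000 / minutes


# t410i: 24GB in under 10 minutes, so this is a lower bound on its rate.
t410i = implied_rate_mb_per_min(24, 10)
# t480: 19GB in over an hour, so this is an upper bound on its rate.
t480 = implied_rate_mb_per_min(19, 60)

print(f"t410i >= {t410i:.0f} MB/min, t480 <= {t480:.0f} MB/min")
print(f"ratio >= {t410i / t480:.1f}x")
```

Even with those generous bounds the gap is several-fold, so it is not just a quirk of how the progress screen reports speed.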

tomierna

I just started a Multicast deploy of eight of the t480 machines, and they are doing the same thing.

Another data point: whether Unicast or Multicast, the deploys to the t480s show an absurd speed on the Active Tasks display for the first 4-5% of the deploy (over GbE speeds), and then it goes down steadily until settling around 400-500MB/minute.

tomierna

I’ve been using FOG since around .32, and I’m running up against a strange slow-deployment scenario on FOG 1.5.0.

For years, we’ve been imaging our inventory of Lenovo t410i computers. The topology of our FOG system has the server and clients sharing a private LAN, with Gigabit throughout. The switch is unmanaged.

The t410i machines are imaged from a Partimage image, as these were converted from our old server.

We regularly see 5-6 GB/min when either Capturing or Deploying these singly, and multicast deployments to up to 8 machines give similar results.

We just bought a bunch of Lenovo t480 machines, and I’ve been working on the image for these. Capturing images from these is about as fast as from the t410i’s, but when deploying them I’m only seeing between 400 and 500MB/min.

While a machine is in mid-deploy, the switch link lights are indicating 1Gb link. I’ve tried the newest kernels for the client machines with no difference.

I swapped out the non-managed Gigabit switch with a new, managed one (because I upgraded the link to our server to 10Gb fiber), and have seen no change in deploy speed on these machines. The new switch’s management console shows the machines are linking at GbE speeds.

It’s not the cables; I can put a t410i on any of the same positions, and they deploy at Gb speeds.

The Ethernet chipset on the t480 clients is the Intel I219-LM.

The new switch is a Netgear GC728X. The server card is an Intel X520-DA2 with 10GTek transceivers on both ends.

Is there any BIOS or client kernel-level stuff I should look at?

Thoughts?