Slow Unicast Deploy on new Machines
@george1421 - our imaging network is not routed, and I don’t have access to file servers on the rest of our LAN.
So, to test download speeds, I created a 2GB (random) file in /var/www/fog/client/ and added a case to download.php to allow me to download directly from the FOG Server with Chrome.
I was getting consistently 85-95MB/s.
I’m not sure the best way to test a file upload that would show me speed, short of installing an FTP program and doing sFTP back and forth, but upload speed isn’t a problem in the FOG-booted scenario.
We bought 50 of these machines and one arrived with a cracked screen. I just received the replacement from the RMA of that broken machine, and of course it images at full speed.
The replacement machine came with a Samsung m.2 drive, part: MZVLW256HEHP-000L7
The other 49 machines have the Lenovo equivalent: LENSE20256GMSP34MEAT2TA
I’ve contacted my Lenovo rep with the hopes that I can work with an engineer to narrow down a fix.
@tomierna Once you have Windows up and running, if you download and upload a 1GB+ file, do you get about the same transfer rates (remembering our point of view has changed from the server to the client)? If you have an M.2 SATA/NVMe drive, I would still expect better results downloading from the network (which is kind of the opposite flow from what you are seeing today with imaging).
The I219-LM shows the following under Hardware Ids in the Details tab of Device Manager:
[MOD Note] linux device translation [8086:15D7] - Geo
@tomierna While this won’t fix anything, can you go into Windows Device Manager and record the hardware ID here?
It should look (nothing) like this:
The above is for an Intel 82579LM network adapter. That ID translates to a Linux ID of [8086:1502]. With that ID we can search to see if other Linux folks are seeing similar results. But since the 480s are so new, there may be some undiscovered issue with the Linux driver for that NIC.
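For anyone following along, the translation from a Windows hardware ID to the Linux-style vendor:device pair can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the sample strings are built from the two IDs mentioned in this thread rather than copied from a real Device Manager screen (real entries usually carry extra SUBSYS and REV fields, which the pattern ignores).

```python
import re

# Windows lists PCI devices as PCI\VEN_xxxx&DEV_yyyy&...; Linux tools such as
# lspci -nn show the same pair as [xxxx:yyyy].
_HWID_PATTERN = re.compile(r"VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})")


def windows_hwid_to_pci_id(hwid: str) -> str:
    """Return the lowercase 'vendor:device' pair found in a Windows hardware ID."""
    match = _HWID_PATTERN.search(hwid)
    if match is None:
        raise ValueError(f"no VEN_/DEV_ pair found in {hwid!r}")
    vendor, device = match.groups()
    return f"{vendor.lower()}:{device.lower()}"


if __name__ == "__main__":
    # Illustrative strings only, assembled from the IDs quoted in this thread.
    print(windows_hwid_to_pci_id(r"PCI\VEN_8086&DEV_1502"))  # 8086:1502
    print(windows_hwid_to_pci_id(r"PCI\VEN_8086&DEV_15D7"))  # 8086:15d7
```

The resulting pair is what you would paste into a search alongside the kernel version when looking for known driver issues.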
There was a BIOS update for the t480 machines, but after installing it on one, and running a unicast deploy, it doesn’t seem to have fixed anything.
1.12 was the original and 1.14 is the current BIOS, and there is a note about Ethernet instability when net booting before Windows starts, but alas, deploying is still slow.
Also, I don’t know if I answered it, but the switch statistics show very few packet or frame errors. I also checked that Green Ethernet was disabled on the switch.
@tomierna Very well then.
Your FOG server is sufficient and probably can be ruled out as the root of your issues here. I’m also leaning towards something unique with this new hardware.
Since you were/are using unmanaged switches this is probably not the issue, but we have seen on enterprise managed switches that the “green ethernet” [IEEE 802.3az] settings sometimes get confused and cause the communications to switch to backup mode. But one would think this would happen in either mode (capture and deploy), not just in deploy mode. Unmanaged switches typically don’t support this green function, so I don’t think this is the case here.
Thanks for the responses so far. I’ll try to answer the questions from everyone.
@george1421 - My FOG Server is a VM on a XenServer (7.3). The VM is running CentOS 7.4.1708, and has 4GB RAM and 2 CPUs allocated. Looking at the memory usage on the server, it doesn’t appear critical, but I have plenty of RAM in the master, so I can certainly try adding more. Disk subsystem is a large number of 2TB drives (24 I think?) in RAID configuration, though I’d have to check the management console to say which config. It’s hardware raid though, and the XenCenter for that VM doesn’t seem to show that it is taxing the disk subsystem. This is a pretty beefy VM server.
The t410i machines have 7200RPM 500GB drives. The t480 machines have 256GB M.2 SSDs.
Deploying one t480 ends up between 400-500MB/m.
Deploying one t410i shows expected throughput from a 1Gb port.
Deploying multiple t480 (unicast) ends up between 400-500MB/m on each machine.
I’ve not deployed multiple t410i (unicast) since trading out the switch and going to a 10GbE link to the server, but with the previous hardware, they shared a single 1GbE link to the server and it showed in the statistics.
Looking at the BIOS on the t410i, I don’t see any UEFI switch, so I presume it’s running in traditional boot.
With the t480s, UEFI is set to “only” and CSM Support has to be on, otherwise rEFInd complains and forces a “press any key” screen to appear after imaging completes.
For testing purposes, I changed the boot setting to Legacy Only on one t480, and it hasn’t made a difference in Deploy speed.
I will check the firmware version, but it is probably current since the machines were build-to-order, and they were shipped directly to us only a couple of weeks ago.
Deploying a t410i on a particular network port and then trying a t480 on the same port shows the t410i with proper throughput, and the t480 with very low throughput.
Re: 1GbE vs 100Mb/s link, I have checked in the switch management interface when a t480 deploy is running, and the link speed is listed as 1GbE.
@Tom-Elliott - I’m also leaning toward the t480s having some sort of strange issue. It could be BIOS-level, or maybe client kernel level. The fact that it captures at a normal fraction of link speed but deploys at much reduced speed makes me think it’s not the FOG Server.
@sebastian-roth - I started with whatever client kernel was installed with 1.5.0, and updated to 4.17.0 in an attempt to debug. Should I try going back farther?
Some additional data and stuff I’ve tried today:
The 8-machine multicast that I mentioned starting never completed. It stalled at 22%, and I let it sit for a bit to see if it would recover. It never did.
As mentioned above, I changed one to Legacy BIOS mode, and that didn’t change the deploy throughput.
I’ve looked through the t480 BIOS config pretty closely, and I don’t see anything related to network that I think would make a difference.
Early on when starting to make these machines image smoothly, I had to turn off the IP6 stack for netboot.
@tomierna Interesting case you have there. From the information given so far I would suspect the Intel I219-LM NIC and/or driver to be the problem. But it’s kinda strange you see the slowness only when deploying. I will try to investigate and see if we can find some known driver issues.
Would be interesting to know if downgrading the kernel as suggested by George will also help with this issue. I doubt it, but sure, give it a go. There is nothing to lose.
@george1421 Based on this, it would seem, to me, the 480 has a 10/100 NIC (possibly) vs the 410 having a 10/100/1000 NIC?
Just my thoughts on the whole thing.
Typically, because of the compression applied, you will see faster than your network speeds, though not by too much. For example, on a 1Gb network (both sides) and using SSD (both sides) you could see 13-18 GB/min, where on a 1Gb network the theoretical (goldilocks?) maximum (translated) would be 7.5 GB/min.
So compression is important here, as CPU and write-to-disk speeds are often much faster than the network itself. (This is also partially why Network->Disk is faster than Disk->Disk, as the disk in question has to spin up and seek to the other location on the disk, same disk or not.)
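To put numbers on the compression point, here is a rough Python sketch. It assumes decimal units (1 GB = 1000 MB), ignores protocol overhead, and assumes the displayed rate counts uncompressed bytes written at the target; the function names are just for illustration.

```python
def link_ceiling_gb_per_min(link_mbit_per_s: float) -> float:
    """Raw line rate of a link in GB/min (decimal units, no protocol overhead)."""
    mb_per_s = link_mbit_per_s / 8  # 8 bits per byte
    return mb_per_s * 60 / 1000     # seconds -> minutes, MB -> GB


def implied_compression_ratio(displayed_gb_per_min: float,
                              link_mbit_per_s: float) -> float:
    """Compression ratio needed for the displayed rate to fit through the link."""
    return displayed_gb_per_min / link_ceiling_gb_per_min(link_mbit_per_s)


if __name__ == "__main__":
    print(f"1Gb ceiling: {link_ceiling_gb_per_min(1000)} GB/min")
    for displayed in (13, 18):
        ratio = implied_compression_ratio(displayed, 1000)
        print(f"{displayed} GB/min displayed implies roughly {ratio:.2f}:1 compression")
```

Under those assumptions, seeing 13-18 GB/min on a 1Gb link only requires the image to compress somewhere around 2:1, which is why the displayed figure can exceed the wire speed.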
It really seems that the NIC on the 480’s is different than the 410’s, or some other variable. Seeing as things seem normal on one, and not on the other, it really points to the machine being the problem, not something fog is doing.
@tomierna OK just because I’m a type ‘A’ person.
500MB/m translates to 8.3MB/s
A single 100Mb/s link moves about 12.5MB/s theoretical maximum.
A 1GbE link has a theoretical throughput of 125MB/s.
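The arithmetic above, as a quick Python sketch for anyone who wants to plug in their own numbers. Units are decimal (1 MB = 8 Mbit) and the helper names are made up for this post.

```python
def mb_per_min_to_mb_per_s(rate_mb_per_min: float) -> float:
    """Convert the MB/min figure shown during imaging to MB/s."""
    return rate_mb_per_min / 60


def mbit_per_s_to_mb_per_s(link_mbit_per_s: float) -> float:
    """Theoretical maximum of a link in MB/s (8 bits per byte, no overhead)."""
    return link_mbit_per_s / 8


def link_utilisation(rate_mb_per_min: float, link_mbit_per_s: float) -> float:
    """Fraction of the link's theoretical maximum that a given rate uses."""
    return mb_per_min_to_mb_per_s(rate_mb_per_min) / mbit_per_s_to_mb_per_s(link_mbit_per_s)


if __name__ == "__main__":
    print(f"500 MB/min = {mb_per_min_to_mb_per_s(500):.1f} MB/s")
    for link in (100, 1000):
        print(f"{link} Mb/s link: {mbit_per_s_to_mb_per_s(link)} MB/s max, "
              f"500 MB/min uses {link_utilisation(500, link):.0%} of it")
```

The takeaway is that 500MB/m sits below even a 100Mb/s link's ceiling, so a port that negotiated at the wrong speed would not be ruled out by the rate alone.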
(this is still a process of finding out where the problem isn’t. I’m still trying to build a truth table in my head).
So you can put a T410 and T480 on the same network jack and then deploy the same image to that target computer, and both are in the same firmware mode (uefi or bios) and the 410 has 6GB/m and the 480 has 500MB/m?
Deploying t410i gets 5-7GB/min, or around 900Mbit/second.
Deploying t480 on the same switch port and cable gets 400-500MB/min, or around 60Mbit/second.
Capture of images is 5-7GB/min from either model. That’s what is so strange.
Thanks for the note about the newest kernels, I’ll downgrade.
Re: Partimage vs. Partclone compression, I can try, but I don’t think that accounts for a 15-20x speed differential.
Re: Perceived transfer rates vs. actuals: the image for our t410i is 24GB in size. The t480 image is 19GB. A t410i will deploy in less than 10 minutes, and a t480 will take over an hour.
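As a sanity check on those times, a simple estimate is image size divided by the displayed rate. The sketch below assumes decimal units and that the image size and the displayed rate count the same bytes, which may not hold (stored image size is compressed, while the displayed rate may not be), so treat the output as a rough lower bound rather than a prediction.

```python
def deploy_minutes(image_gb: float, rate_mb_per_min: float) -> float:
    """Estimated deploy time in minutes for an image at a steady rate."""
    if rate_mb_per_min <= 0:
        raise ValueError("rate must be positive")
    return image_gb * 1000 / rate_mb_per_min  # GB -> MB, then divide by MB/min


if __name__ == "__main__":
    # Sizes and rate ranges as quoted in this thread.
    cases = {
        "t410i (24GB image)": (24, (5000, 7000)),
        "t480 (19GB image)": (19, (400, 500)),
    }
    for name, (size_gb, (slow, fast)) in cases.items():
        print(f"{name}: {deploy_minutes(size_gb, fast):.1f} to "
              f"{deploy_minutes(size_gb, slow):.1f} minutes")
```

Even on this optimistic estimate, the t480 comes out roughly ten times slower than the t410i for a smaller image.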
@tomierna If you have a managed switch, do you see any collisions or other issues when you look at the port counters?
Tell us a bit more about your FOG server itself.
Is it physical or virtual?
What host OS is running on the FOG server?
How much ram is in the FOG server?
What does the disk subsystem look like? Is it a single SATA HDD, SSD, or RAID?
If you only deploy a single unicast stream to a single target computer do you get 500MB/m transfer rates?
What happens if you run 2 unicast deployments at the same time, does it change your throughput?
Are the 410s and 480s in the same firmware mode (BIOS or UEFI)?
Also make sure the 480s have the latest firmware installed.
OK, also just for clarity, if you use the same network port, do the T480 and T410 provide the same throughput?
I just started a Multicast deploy of eight of the t480 machines, and they are doing the same thing.
Another data point: whether Unicast or Multicast, the deploys to the t480s show an absurd speed on the Active Tasks display for the first 4-5% of the deploy (over GbE speeds), and then it goes down steadily until settling around 400-500MB/minute.
Well, 1.5.0 has some issues. So does 1.5.4, but we have a few workarounds.
Just to be clear, deploying to T410i computers goes fast, but the T480 on the same network port is only getting 500MB/m transfer rates?
I can tell you that if you have the very latest kernels, you should downgrade to 4.15.2; the latest kernels are having an issue creating the MBR/GPT partitions on the disk. This isn’t a FOG issue per se, but a Linux kernel issue. Downgrading to 4.15.2 running under FOG 1.5.4 helps quite a bit.
If you are using Partimage images, it may be worth deploying the image once and then capturing it again under a new image definition that uses partclone and zstd compression. This format and compressor give far better decompression rates than the older gzip and Partimage.
The transfer rates partclone displays are a little deceiving: the figure is a composite of the network transfer rate and the decompression rate at the target computer. In FOG, the target computer does all of the heavy lifting during imaging. The FOG server itself just moves the image file from the hard drive to the network. The target computer then decompresses the image and writes it to the target media.