HP ProBook 640 G8 imaging extremely slowly
-
@george1421 The working one is different, 8086:15e3
-
@jacob-gallant OK, the 15e3 NIC is an older NIC whose support was first introduced in the 4.6 Linux kernel. Support for the 15fc was first introduced in the 5.5 Linux kernel, and we are currently trying 5.6.18, right? (From the FOS Linux debug console you can key in
uname -r
to get the kernel version.)
Here is an experimental FOS Linux kernel 5.10.2. Download this file and rename it to
bzImage
(case is important):
https://drive.google.com/file/d/1-4HyQD8ttz_GCE_vKrvuydFVqcPUMqzU/view?usp=sharing
Rename the existing bzImage file in the
/var/www/html/fog/service/ipxe
directory and drop the new file in there. Let’s see if this kernel gives us a better deployment. I know there was again a major rewrite in the 5.9.x series of the Linux kernel, akin to what happened with 5.5.
-
@george1421 Same results with 5.10.12 I’m afraid. We were using 5.6.18 for all of the previous tests, that’s right.
-
@jacob-gallant Well nuts. I was hoping the updated kernel would function better. Yes we need 5.6.18 to have support for that network interface, if you were using 4.19x the network interface wouldn’t work at all.
-
@Jacob-Gallant @george1421 So far it all looks like a driver issue in the Linux kernel, though I really wonder why we can’t find other users’ reports about this NIC.
Maybe this is some kind of jumbo frame issue?
@Jacob-Gallant Would you be willing to capture a short part of the network traffic on your FOG server and upload the PCAP so we can take a look? Schedule a debug deploy task. Boot the host up and run
ip a s
and note down the IP address before you start the job via the
fog
command. Now run
tcpdump -w /tmp/dump.pcap host x.x.x.x
as root on your FOG server, using the IP address you noted down. Leave tcpdump running and step through the deploy task on the machine. Shortly after the first blue partclone screen starts, stop tcpdump on your FOG server (Ctrl+C) so the PCAP file does not grow too large. I am fairly sure we will already see the retransmits at that point and might find out why.
Just copy the file /tmp/dump.pcap from your server and upload it to a share we can access.
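A minimal sketch of the capture step above, assuming tcpdump is installed on the FOG server. The helper name capture_cmd and the 20000-packet cap are illustrative assumptions and not part of FOG. The -c option makes tcpdump exit on its own after that many packets, so the file cannot grow unbounded if Ctrl+C is missed. The function only prints the command so it can be reviewed before it is run as root.

```shell
#!/bin/sh
# capture_cmd IP [OUTFILE] [MAX_PACKETS]
# Prints a tcpdump command line that captures traffic for one host and
# stops by itself after MAX_PACKETS packets (tcpdump's -c option).
# Hypothetical helper for illustration; the defaults are assumptions.
capture_cmd() {
    ip="$1"
    out="${2:-/tmp/dump.pcap}"
    max="${3:-20000}"
    if [ -z "$ip" ]; then
        echo "usage: capture_cmd IP [OUTFILE] [MAX_PACKETS]" >&2
        return 1
    fi
    printf 'tcpdump -c %s -w %s host %s\n' "$max" "$out" "$ip"
}
```

Calling capture_cmd with the address noted from ip a s prints the line to paste into a root shell. If tshark happens to be available, something like tshark -r /tmp/dump.pcap -Y tcp.analysis.retransmission | wc -l should give a rough count of retransmitted segments in the capture.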
-
@george1421 I’m currently researching this issue. I do see others with speed problems with this series of nic adapters.
-
@sebastian-roth @george1421 Thanks to you both for all of your time. Here is the capture:
https://drive.google.com/file/d/1WS8e2R9kR-ZjpqzgikmSg0CakZJYJi4h/view?usp=sharing
-
@Jacob-Gallant I looked at the PCAP for quite some time now. We see clear signs of “network congestion” - meaning that packets are being re-transmitted causing the TCP connection to slow down.
The connection starts just fine and the host sends a file read request to the FOG server. Now the FOG server starts to send a first large packet. The standard Ethernet MTU is 1500 bytes (1518 bytes for the full frame) and the FOG server sends 7240 bytes in one single TCP packet - a so-called jumbo frame.
So I am wondering if you can improve speed by disabling LRO (Large Receive Offload), TSO (TCP Segmentation Offload) and GSO (Generic Segmentation Offload) using ethtool. Schedule and boot into another debug deploy session. On the shell run:
ip a s
ethtool -K eth0 lro off
ethtool -K eth0 tso off
ethtool -K eth0 gso off
The first command is just to confirm the network interface name (could be
eth0
or something different) to use with ethtool afterwards. You can try disabling all three at once, or just one at a time, and give it a try.
There are various I219-V cards/chips listed with different PCI IDs. Searching for 8086:15fc I couldn’t find much on the web, but searching for I219-V there are a few people complaining about issues:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802691
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1785171
https://forums.linuxmint.com/viewtopic.php?t=327435
https://access.redhat.com/solutions/3615791
Though I really doubt that any of those match your exact situation.
-
@sebastian-roth Apologies for the delay in getting back to you. I’ve been working from home so far this week, so I didn’t have access to the device. Unfortunately these steps didn’t improve anything.
-
@Jacob-Gallant After running the commands mentioned above, can you run
ethtool -k eth0
(lower-case k this time, with your interface name in place of eth0), take a picture of the output and post it here?
-
@sebastian-roth Here you are! https://photos.app.goo.gl/WnHEE63jFEjKvT4N9
-
@Jacob-Gallant Ok, seems like it actually did disable TSO and GSO. Can’t find LRO in the output but maybe the driver doesn’t support that.
Unfortunately I am running out of ideas with that.
Can you try booting up from a Linux Live DVD/USB and do some network testing with that? Try a distro with a very recent kernel if possible.
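For reference, the offload test from earlier in this thread (disable LRO, TSO and GSO, then confirm with ethtool -k) could be scripted roughly as below. This is only a sketch: the function names and the ETHTOOL override variable are assumptions made for illustration, and the feature names in the filter are the long names ethtool normally prints.

```shell
#!/bin/sh
# disable_offloads [IFACE]
# Turns off LRO, TSO and GSO on IFACE (default eth0, an assumption).
# Set ETHTOOL=echo to preview the calls instead of running ethtool.
disable_offloads() {
    iface="${1:-eth0}"
    for feature in lro tso gso; do
        # Keep going even if the driver refuses to change one feature.
        "${ETHTOOL:-ethtool}" -K "$iface" "$feature" off || true
    done
}

# offload_state
# Reads `ethtool -k IFACE` output on stdin and keeps only the three
# top-level lines relevant to this test.
offload_state() {
    grep -E '^[[:space:]]*(large-receive-offload|tcp-segmentation-offload|generic-segmentation-offload):'
}
```

In a debug deploy shell this would be used as disable_offloads eth0 followed by ethtool -k eth0 | offload_state, which avoids having to photograph the full feature list.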
-
@sebastian-roth Hi Sebastian, apologies again for the delayed response. I ran a live USB of Ubuntu 20.10 and network performance was normal. We also have Windows 10 loaded manually on one of the devices and it performs normally as well. Unfortunately it seems specific to FOG.
-
@Jacob-Gallant Well I did expect Windows to have normal network speed. But Ubuntu is using the Linux Kernel and therefore a pretty similar driver for this network card. What tests did you do for network speeds? Iperf again to really be able to compare results?
Please boot up Ubuntu again and run the following commands in a root command shell:
uname -a
lspci -nn | grep -A 2 -i net
There is some light at the end of the tunnel if Ubuntu doesn’t show the same issue. But it will be a long struggle to find out why: we would have to compare kernel versions and the enormous list of patches Ubuntu adds to the official kernel.
Now that I write this I think it’s better to test other live distros as well, try Arch Live and maybe SystemRescueCD. With every distro run the same iperf test to be able to compare results and run the above commands, posting results here.
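To make the per-distro iperf runs easier to compare, the retransmit count can be pulled out of the client output. The sketch below assumes iperf3 (with iperf3 -s running on the FOG server and iperf3 -c pointed at it from the live system) and assumes the usual Linux text layout, where the summary line ending in "sender" carries the Retr value as its second-to-last field. The helper name retr_count is made up for illustration.

```shell
#!/bin/sh
# retr_count
# Reads iperf3 client text output on stdin and prints the Retr value
# from every summary line whose last field is "sender".
# Assumes the standard column layout of iperf3 on Linux.
retr_count() {
    awk '$NF == "sender" { print $(NF - 1) }'
}
```

Example use on each live distro: iperf3 -c x.x.x.x | retr_count, with x.x.x.x standing for the FOG server address. Noting that number next to the uname -a output for each distro gives a like-for-like comparison.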
-
@sebastian-roth I hadn’t used iperf, just a regular speed test (speedtest.net). Here are the results from iperf for Ubuntu (still quite a few retries when connecting to the main FOG server):
https://photos.app.goo.gl/FDvPSgLoKVAUWDpY7
Here are the results from the commands above:
https://photos.app.goo.gl/tcVtyXBZnWzbVN1B6
And here are the iperf3 results from Arch:
https://photos.app.goo.gl/qXU7b5tn8b5ohAan9
I can’t get SystemRescueCD to work as of yet; it will not connect to the network at all. I’ll post the results when I get them.
-
@jacob-gallant said in HP ProBook 640 G8 imaging extremely slowly:
Here are the results from iperf for ubuntu (still quite a few retries when connecting to the main FOG server):
https://photos.app.goo.gl/FDvPSgLoKVAUWDpY7
From my point of view this looks pretty similar to what we had with FOG, with many retries: https://photos.app.goo.gl/xXFPLZFHAJT7dPEo9
Arch shows the retries as well. I really wonder why we don’t find more people reporting issues with that driver/NIC.
About the
lspci
command: I am sorry, I got that wrong, just typing it off the top of my head. I meant:
lspci -k | grep -A 2 -i net
(so we see which kernel driver is used)
-
@sebastian-roth How does this look? https://photos.app.goo.gl/2iZT3HmDE3A1wxJH9
-
@jacob-gallant said in HP ProBook 640 G8 imaging extremely slowly:
How does this look? https://photos.app.goo.gl/2iZT3HmDE3A1wxJH9
Yes, perfect. So we know Ubuntu, using a 5.8.x kernel (with many specific patches included), is using the same kernel driver
e1000e
that we also use with FOS. From the iperf output it looks to me like Ubuntu has the same issue with a high number of retries, same as Arch Linux. You don’t seem to notice the issue when testing with speedtest.net, but I think that test is not valid in this case, because packets from the internet usually come in smaller portions (lower path MTU than in the local subnet where you have jumbo frames) and would not cause the same slowness.
Sorry, I had put some hope in this after the first tests with Ubuntu. Now I think it’s just the same.
As a last resort we might compile a one-off kernel for you using the driver provided by Intel, though I have to say that I haven’t looked into this yet and it might turn out to be a hassle. Not sure yet.
-
@sebastian-roth OK, totally understand. Just let me know! Thanks for everything Sebastian.
-
@Jacob-Gallant There is one more thing you might want to look at with the current kernel before you get into testing the patched one below. Schedule a debug deploy task and when you get to the shell run
ip link show | grep mtu
and see what number it states right after the keyword mtu.
Although it did not compile straight from the Intel code, it wasn’t too much work to fix and get it to build.
Download the patched kernel binary and put it in the
/var/www/html/fog/service/ipxe/
directory on your FOG server. Now edit the host settings of your HP ProBook 640 G8 and set Host Kernel to
bzImage-5.10.12-e1000e-3.8.4
Then schedule a deploy task and watch the screen when it PXE boots. It should say
bzImage-5.10.12-e1000e-3.8.4...ok
when loading the kernel.
It will be interesting to hear if deployment speeds are in a normal range with this kernel.
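Dropping a test kernel into the ipxe directory, as described above and earlier in the thread, can be done so that nothing already there is overwritten. The following is a sketch only: install_kernel is a hypothetical helper, and the .bak suffix for a displaced file is an assumption rather than anything FOG itself does.

```shell
#!/bin/sh
# install_kernel SRC DEST_DIR [NAME]
# Copies the kernel file SRC into DEST_DIR under NAME (default: the
# file name of SRC). If DEST_DIR/NAME already exists it is first
# renamed to NAME.bak so the previous kernel can be restored.
install_kernel() {
    src="$1"
    dest_dir="$2"
    name="${3:-$(basename "$src")}"
    if [ ! -f "$src" ] || [ ! -d "$dest_dir" ]; then
        echo "usage: install_kernel SRC DEST_DIR [NAME]" >&2
        return 1
    fi
    if [ -e "$dest_dir/$name" ]; then
        mv "$dest_dir/$name" "$dest_dir/$name.bak" || return 1
    fi
    cp "$src" "$dest_dir/$name"
}
```

For the patched kernel this would be something like install_kernel ./bzImage-5.10.12-e1000e-3.8.4 /var/www/html/fog/service/ipxe run as root on the FOG server. Passing bzImage as the third argument covers the earlier case where the download has to replace the default kernel.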
Just for reference if we need to re-compile this again:
- When using the Intel driver code v3.8.4 there are still calls to PM QoS functions that don’t exist in 5.10.x kernels anymore. I swapped out the function names as seen in this post on the kernel mailing list.
- Next is a function call that was completely removed.
- Then I commented out the use of xdp_umem_page *pages in kcompat.c, as this was removed in the mainline kernel and is only used for older kernel versions in kcompat.c anyway.
- Finally I re-enabled CONFIG_PM in our kernel config to get past the last compile error. A different solution would be to move the function definition of e1000e_pm_thaw outside the #ifdef CONFIG_PM block.
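Before re-compiling, it may help to confirm that the kernel config really has CONFIG_PM switched on, since the last compile error above depends on it. A small sketch, assuming a standard kernel .config file where enabled options appear as CONFIG_PM=y and disabled ones as a "# CONFIG_PM is not set" comment; the helper name is made up for illustration.

```shell
#!/bin/sh
# has_config_pm FILE
# Succeeds (exit status 0) if the kernel config FILE enables CONFIG_PM.
has_config_pm() {
    grep -q '^CONFIG_PM=y$' "$1"
}
```

Typical use from the kernel source tree: has_config_pm .config && echo "CONFIG_PM enabled".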