Lenovo T14 Gen 2
I have been using FOG for the past 6 months with great success, thank you so much to the developers!
We just got in a new Lenovo model, the T14 Gen 2. I created my golden image and it captured with no problems. Now, however, when I go to deploy the image, I get very slow deployment speed. I have done some troubleshooting and I can’t determine the cause.
Sometimes it goes really slowly (40 hours for a 29GB image) and sometimes it hangs in various places before partclone even runs. No problem pxe booting or sending inventory to server.
- I have the FOG server (running kernel 5.10.34, FOG version 1.5.9) on a switch that the PCs are directly connected to. I tested with this switch removed from the equation, same speed. (~10-50MB/min)
- I tried deploying a different image to the same PC just in case my new image was causing the problem, same result.
- I tried deploying an image to a different model PC (Lenovo T490) and it worked perfectly, usually about 5-10 minutes to deploy.
- I have checked top on the server, and it is hardly under any load at all.
This feels like a network issue, but like I said, on another laptop it works perfectly.
Can anyone please help point me in the right direction?
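For anyone wanting to sanity-check the numbers above, here is a minimal sketch that turns an image size and a transfer rate into a rough time estimate. The function name eta_hours is made up for illustration, and it assumes 1 GB = 1024 MB.

```shell
#!/bin/sh
# eta_hours SIZE_GB RATE_MB_PER_MIN
# Rough deploy time in hours. Hypothetical helper, assumes 1 GB = 1024 MB.
eta_hours() {
    awk -v gb="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", gb * 1024 / rate / 60 }'
}

eta_hours 29 10   # slow end of the reported range
eta_hours 29 50   # fast end of the reported range
```

For the 29GB image this gives roughly 49.5 hours at 10MB/min and 9.9 hours at 50MB/min, so the 40 hour estimate is consistent with the reported rates.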
@mmarquis Well too bad we can’t fix this easily. As a next step I would grab a Linux Live boot ISO with a recent Linux kernel and see if networking is slow there as well. Maybe use Arch, burn a CD/DVD or write the ISO to a USB key drive and boot it up. When it’s up check if you have an IP and can ping the FOG server. Then mount the FOG NFS share and copy one of the image files over:
mkdir /images
mount -t nfs -o nolock,proto=tcp,rsize=32768,intr,noatime 192.168.x.y:/images /images
To get some stats when copying you can use different tools but I am not sure if those are part of the Arch Live ISO. See which one is working for you:
pv /images/IMAGENAME/d1p3.img > /tmp/test.img
rsync --progress /images/IMAGENAME/d1p3.img /tmp
dd if=/images/IMAGENAME/d1p3.img of=/tmp/test.img status=progress
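If none of those tools is on the live ISO, a plain cp with a timer gives a rough average rate. This is only a sketch: copy_rate is a made-up helper name, it assumes date +%s and awk are available, and it rounds anything under one second up to one second.

```shell
#!/bin/sh
# copy_rate SRC DST
# Copy a file and print the average rate in MB/s (1 MB = 1048576 bytes).
# Hypothetical helper; elapsed time is measured in whole seconds.
copy_rate() {
    src=$1
    dst=$2
    start=$(date +%s)
    cp "$src" "$dst" || return 1
    end=$(date +%s)
    bytes=$(wc -c < "$dst")
    awk -v b="$bytes" -v s="$((end - start))" \
        'BEGIN { if (s < 1) s = 1; printf "%.1f MB/s\n", b / 1048576 / s }'
}

# Example, using the same paths as above:
#   copy_rate /images/IMAGENAME/d1p3.img /tmp/test.img
```

Because the timer only has one-second resolution, the result is only meaningful for files that take at least several seconds to copy.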
Today I put the PC and FOG server on an even dumber switch than they were already on. (Just kidding, but this one is a 5 port un-managed Netgear switch we all know so well.)
Same behavior on the dumber switch unfortunately.
I’m about to send out my last one, the one I’ve been testing with. I don’t know if we’ll ever get this model again, lol. I really hope the next time I image with FOG it goes back to normal. I’ll have new machines in next week, so I suppose we’ll see what happens.
I appreciate the help, at this point it’s just the mystery of the thing for me. Let me know if you have any other thoughts or ideas.
Can’t say I know what I’m looking at, but I hope you find something cool in there.
Unfortunately not. It’s an Intel network chip using the Linux e1000e driver from what we see in the outputs. I was hoping to find some messages that point to network driver issues but there is nothing in the dmesg output. Looks all clean.
Did you try my suggestion on using a dumb mini switch to connect the Lenovo T14 Gen 2 to your normal network switch?
@mmarquis Well that’s interesting. I suggest you schedule a debug deploy task for another host which doesn’t have the network problems to make sure all the steps outlined really work in your setup. We have done this with many other people over the years and it usually works.
The system booting up to do the actual work is a lightweight custom Linux OS based on buildroot.org. We call it FOS. No iptables tools and no rules set. It’s still based on old school init.d scripts instead of systemd. To check if SSHd is running:
ps ax | grep ssh
Do you see a proper IP address being assigned in the output of ip a s? On the one hand I could imagine the network issue to be a problem but then I think a task would fail way earlier in the process if network is completely down because at the beginning of a task it checks into the FOG server and errors out on failure. Why would that work but not SSH? Possible but kind of unlikely.
Edit: Now that I read your post again and think about it I wonder if it’s just inbound traffic that is problematic on the Lenovo. But still, response packets from the task checkin obviously make it through. Can’t imagine that inbound TCP SYN packets are dropped because of a driver issue.
The other option you have to get the dmesg outputs over is using a USB thumb drive (best formatted as FAT32). Plug that in, use lsblk to find its device name, mount it and copy over the text files. In this case you only have one command shell, on the device itself. So take the first dmesg output and copy it to the USB drive before starting the task (command fog). Step through the task and when you have had enough of waiting in a slow partclone screen you should be able to cancel that with ctrl-c to get back to the shell and grab another dmesg output.
I’m having trouble ssh’ing into this thing.
I even put them on their own little network, and I can ping from the Linux session to my Windows device, but I can’t ping the PC we’re troubleshooting.
The bash shell doesn’t recognize “iptables” or “systemctl”
It feels like we’re now troubleshooting something else, so I apologize. But I’m not able to do the steps you outlined.
(Incidentally, I can ping and SSH to the FOG server from the same Windows PC.)
I’m not sure if this is just the same network driver issue we’re facing in general. Any thoughts?
@mmarquis Sounds like a network (driver) issue. Finding and fixing this issue is probably going to be a long endeavor with deep knowledge of the Linux kernel involved. Though there are a few easy things you can try before diving into the big ocean.
There is a slight chance it’s some kind of EEE (Energy-Efficient Ethernet) thing causing this. I suppose that would only happen if the Lenovo T14 Gen 2 is connected to an EEE-capable switch. So you might check the switch settings and disable EEE (single port or altogether) or even easier, grab an old dumb mini switch and hook that in between. That way EEE should not be triggered in the driver because the dumb mini switch lacks EEE functionality.
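On a live Linux system that ships ethtool (it may not be part of FOS), the EEE state can also be checked from the client side. A minimal sketch, assuming the interface is called eth0 and that the ethtool output contains a line starting with "EEE status:"; eee_state is a made-up helper name.

```shell
#!/bin/sh
# eee_state
# Read the output of `ethtool --show-eee <iface>` on stdin and print
# whatever follows "EEE status:". Hypothetical helper.
eee_state() {
    sed -n 's/^[[:space:]]*EEE status:[[:space:]]*//p'
}

# Typical use (eth0 is a placeholder interface name):
#   ethtool --show-eee eth0 | eee_state
# To switch EEE off for a test (needs root, driver must support it):
#   ethtool --set-eee eth0 eee off
```

If the driver does not support changing EEE, the dumb mini switch is still the simpler test.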
Second thing you might want to look at is getting a full dmesg output to see if there is information on why speed is so slow.
- Schedule a new deploy task for one of your Lenovo T14 Gen 2 devices but this time enable the Schedule as debug task checkbox just before you hit the button to create the task in the FOG web UI.
- Boot the device up as usual and you will end up in a command shell (after you hit ENTER twice as shown on the screen).
- Run command ip a s to find out the IP address pulled from the DHCP server and then passwd to set a temporary root user password in this FOS session.
- Use PuTTY or any other SSH command tool to connect to the device and login as root. Now you have two command shells.
- Run command
dmesg > dmesg_bootup.txt
in the SSH command window. Output will be written to a text file and you won’t see it on screen. Leave the SSH command shell open, we’ll need that later again.
- Use WinSCP or any other SSH file transfer tool to login and copy over the dmesg_bootup.txt file to your computer. Leave that connected as well.
- Now go back to the command shell on the device itself and fire up the FOG task using the simple command fog. Now you need to step through the process by pressing the ENTER key. Go all the way through to where you have the blue partclone window with slow speed (possibly not the first partclone screen but one of the later ones with the biggest partition).
- Go back on the SSH command shell and get another dmesg output:
dmesg > dmesg_deploy1.txt
- See if you can copy the new dmesg_deploy1.txt to your computer as well using WinSCP (reload the file listing to see the newly created file).
- At this stage we should have enough information and you can stop the task by shutting down the Lenovo T14 Gen 2 through a halt command on the SSH terminal and cancel it in the FOG web UI.
Upload the text files here in the forums or to an external file share and post a link here.
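Before uploading, the two files from the steps above can be compared to see what the kernel logged during the deploy. A rough sketch, assuming both files exist in the current directory; dmesg_new and net_lines are made-up helper names and the grep patterns are only a starting guess.

```shell
#!/bin/sh
# dmesg_new OLD NEW
# Print lines that appear in NEW but not in OLD (exact whole-line match).
dmesg_new() {
    grep -Fxv -f "$1" "$2"
}

# net_lines
# Keep only lines that look network related. The patterns are an assumption.
net_lines() {
    grep -iE 'e1000e|link'
}

# Example with the file names used above:
#   dmesg_new dmesg_bootup.txt dmesg_deploy1.txt | net_lines
```

No output from the pipeline just means no new lines matched the patterns, so it is still worth posting the full files.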
Okay, so doing a little more troubleshooting today.
Plugged the T14 into a Lenovo USB-C docking station and presto! Now we’re getting good speeds again.
I believe this would indicate some weirdness with the LAN card. Do you guys recommend any way of fixing that?
Thank you Sebastian!
I scoured the BIOS settings and did some googling, I don’t see any VMD settings on this BIOS.
For some reason I’m having trouble uploading an image, but here’s a link to a picture of the screen:
Regardless, the output is:
4.0 GB copied in 2.23 s (1.9 GB/s)
(I can type out all of the output if needed, but I figured that was the key data.)
You want to read all of this as they found this to be caused by some UEFI setting (“Storage Controller for VMD”). The USB key connected to the computer just kind of masked the issue but they were able to fix it by changing that particular UEFI setting. It could still be the same (VMD) thing in your case, just that using the USB key is not masking the issue. So still check your settings to see if you can find it on the Lenovo T14 Gen 2 as well.
Other than that you’ll need to do some debugging to find out if it’s disk or network IO causing the slowness. Disk IO is easier to test, so start with that. Cancel the current deploy task and schedule a new one for this same host but this time mark the checkbox for debug in the web UI just before you click the “Create Task” button.
Now boot the host machine up as usual and it should bring you to a command shell. Here you run lsblk to find out what the hard drive is named - could be /dev/nvme.... Now with that run the following two commands (replace /dev/sda with the name you found):
The dd command will write zeros to your disk and for that reason will ERASE any data on it. It won’t cause any harm to your disk but data will be wiped out!! Use with caution and make sure you think about it and understand before going ahead.
hdparm -Tt /dev/sda
dd if=/dev/zero of=/dev/sda bs=1G count=4 oflag=direct
Take a picture of the output on screen (or note it down) and post here.
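Since the dd test wipes data, one option is to wrap it in a small guard that refuses to run until the device name is typed in again. This is just a sketch: safe_dd_test is a made-up helper name, it assumes the debug shell has read and dd, and the dd line itself is the same command as above.

```shell
#!/bin/sh
# safe_dd_test DEV
# Run the 4 x 1G write test only on a block device and only after the
# device name has been typed in again. Hypothetical helper.
safe_dd_test() {
    dev=$1
    # Refuse anything that is not a block device (typos, regular files).
    [ -b "$dev" ] || { echo "not a block device: $dev" >&2; return 1; }
    printf 'This ERASES data on %s. Type the device name to continue: ' "$dev"
    read -r answer
    [ "$answer" = "$dev" ] || { echo "aborted" >&2; return 1; }
    dd if=/dev/zero of="$dev" bs=1G count=4 oflag=direct
}

# Example (replace with the name shown by lsblk):
#   safe_dd_test /dev/sda
```

The guard does not make the test any less destructive, it only makes it harder to hit the wrong disk by accident.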
Thank you Sebastian!
I did as you suggested.
So, it seemed to have an effect. At first it was hitting 300+MB/min and my estimate was around 2 hours. Now it dropped back down to 15MB/min with a ~30+hr estimate.
@mmarquis Try restarting the whole imaging with the USB key connected right from the early startup. So shut the host down hard and PXE boot into the existing task again.
Thank you for the response!
Ha! That is strange. I think this little black bar popped up when I did that:
“sd 0:0:0:0: [sda] No Caching mode page found”
Have given it about 5 minutes, no change in speed.
@mmarquis This is going to sound strange, but insert a usb flash drive (any size) and see if imaging speed returns to normal.