slow speed and timeout issues
-
I am having issues when I try to image a Dell 7070. It has hard drive Model – mz-9lq256a
Part Number – MZ9LQ256HAJD-000D1. It is slow and sometimes disconnects. If I switch the hard drive to a Toshiba Kxg50znv256 256 GB NVMe drive, everything is fast and works again. I have seen this post, but the fix does not work: https://forums.fogproject.org/topic/13777/extremely-slow-deploy-to-nvme-drives/17?lang=en-US&page=1
I know this for a fact and can replicate the issue if I put back the hard drive that it came with (mz-9lq256a).
Does anyone know how to fix this? I am working with Dell on getting the hard drives replaced, but I would love to find a way to work around it. We have a bunch of 7070s in stock.
-
@mmoore5553 What version of fog are you using to start with?
What have you done in the two posted links to debug this issue?
-
I am sorry. I applied nvme_core.default_ps_max_latency_us=5500 and nvme_core.default_ps_max_latency_us=0 in the settings.
I updated to the latest version last night, 1.5.8.28.
This is as far as I have gone, since it appears to be a hard drive issue or an issue with how the software sees the hard drive. I did not know what else to try.
-
I have an update on this. Will follow up a bit later with the status for the FOG devs.
-
@mmoore5553 Just to make sure we are hunting down the right rabbit… Do you have more than one hard drive of that mz-9lq256a model? Test on at least two or three to make sure this is reproducible on all of them. Also test on two or three different Dell 7070 machines to make sure!
-
Here is the executive brief on this issue.
With the MZ9LQ256HAJD-000D1 installed, iPXE has trouble downloading the background image of the iPXE menu. With the Toshiba NVMe drive it works normally. There is something that iPXE is looking at or needs from local storage on this 7070.
TLDR version:
I had an extended chat session with the OP. On the same computer, without changing anything but the NVMe drive, the system was failing to download the background image for the iPXE menu. After about 20 seconds it had only downloaded 5% of the image (according to the video). I had the OP update the firmware just to rule out the UEFI firmware being at fault. We next set up booting FOS Linux from a USB flash drive. Once the USB flash drive was set up, the OP attempted to image with the MZ9LQ256 drive. FOS did image the drive, but the OP said it was slower than normal. Sustained speed was about 4.2GB/m. I had the OP repeat the same steps with the Toshiba drive. The Toshiba drive imaged at 8.3GB/m, what the OP called normal. I had the OP add
nvme_core.default_ps_max_latency_us=0
to the kernel parameters on the USB flash drive and attempt to reimage the slow drive again. This time the slow drive imaged at the normal rate of 8.3GB/m. So in FOS Linux the slow drive needed the latency parameter where the Toshiba drive did not. These tests were done on the same hardware with only the NVMe drive changing. So it appears that iPXE is trying to do something with that slow disk. Once FOS Linux boots it images fine.
-
@Sebastian-Roth This happened on every 7070 that we had. They all had the same hard drive. If I switched to the bigger NVMe hard drive, everything worked as expected. I worked with George to try to track down more issues.
I am willing to test in any way that you want me to. I can be a guinea pig.
If you want to see the videos that I took, I will be more than happy to share them with you. I can take more if needed.
I am more than happy to make a donation.
-
@Sebastian-Roth My initial thoughts were a bit off. Initially I thought, what the heck does iPXE care about the disk subsystem? It transfers files from the FOG server to memory and then executes them…
Then I looked at the code and “SANBOOT” popped out. It needs to init the disk subsystem in case iPXE needs to chain to boot from the local hard drive. Now I don’t know what more that means, other than that the disk subsystem IS one of the things it inits.
-
@george1421 As far as I understand the iPXE code I would expect sanboot code only to be used/initialized when you really use the
sanboot
command. The PXE environment is somewhat restricted in terms of memory, so I reckon iPXE would not init stuff that is not explicitly needed.
Searching the forums for 7070, we seem to have a few posts on that:
https://forums.fogproject.org/topic/13851/massive-packet-loss-nic-issues-with-new-dell-7070-ultra-in-fog/
https://forums.fogproject.org/topic/13933/issues-with-optiplex-7070/
@mmoore5553 Please take a close look at the last topic (issues with optiplex 7070) and give that a try!
-
Good news. I had some time to play today and found the fix in the BIOS: go to Advanced Configuration, then to ASPM, and disable it. This setting controls the handshake between the device and the PCI Express hub to determine the best ASPM mode supported by the device. Once that was disabled, everything was fast again and I could use the new hard drive and the onboard NIC.
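
For anyone who wants to look at this from the Linux side after changing the BIOS setting, here is a minimal sketch that reports which ASPM policy the running kernel is using. It assumes the kernel was built with PCIe ASPM support, which exposes the policy in /sys/module/pcie_aspm/parameters/policy with the active entry in square brackets. The helper name active_aspm_policy is made up for illustration. This only shows the kernel's policy; it says nothing about what iPXE sees before Linux boots.

```shell
#!/bin/sh
# Print the active ASPM policy from a policy string such as
# "[default] performance powersave powersupersave".
# The kernel marks the active policy with square brackets.
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

policy_file=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy_file" ]; then
    echo "Active ASPM policy: $(active_aspm_policy "$(cat "$policy_file")")"
else
    echo "No ASPM policy file found (kernel built without PCIe ASPM support?)"
fi
```

If lspci is available, the per-device link state should also show up on the LnkCtl line of lspci -vv.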
-
@mmoore5553 Thanks for getting back to us with this information. I am sure this will be helpful for others as well. So the issue seems to be caused by some energy saving mode.
-
@Sebastian-Roth Yes. This is new in the upcoming models. I had to reach out to Dell once I found it and let them know about it. They said the newer BIOS will have it. They had no clue that this would cause an issue.
-
@mmoore5553 I suppose the Linux kernel developers are onto this issue as well. Possibly it’s already fixed in one of the more recent kernel lines (5.4.x or 5.6.x).
If you are really keen you could compile your own custom kernel using a newer version and see if it’s fixed upstream already. Just let us know if you need help with that.
-
@Sebastian-Roth Here is a one-off kernel v5.5.3 that I created for some reason in Feb 2020… https://drive.google.com/open?id=1thopskSYJd7ueDQeFg_VT4eeNcrNHvIx
-
@Sebastian-Roth I mean, this seems to be the very same problematic code (or closely related to it) that has already given us various issues, primarily low speeds, which were subsequently partially (or, for some disks, fully) addressed by setting the latency kernel parameter to 0 by default.
That said, as we then discovered, in some cases that is not sufficient and we had to disable ASPM using the NVMe CLI utility for them to work normally. I don’t believe that was ever integrated into FOS, since we don’t fully know whether this could cause issues on otherwise properly working drives.
For NVMe disks specifically, it’s APST (Autonomous Power State Transition), the drive’s own power saving feature, rather than PCIe ASPM itself.
sudo nvme set-feature -f 0x0c -v=0 /dev/nvme0
That line should disable it (assuming disk name is nvme0)
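
To check whether the set-feature call took effect, a sketch along these lines could read feature 0x0c back with nvme get-feature and pick out the APST enable flag. It assumes nvme-cli is installed, that the disk is /dev/nvme0 as above, and that the human-readable (-H) output contains a line of the form "Autonomous Power State Transition Enable (APSTE): Enabled" or "Disabled"; that wording is an assumption and may differ between nvme-cli versions. The helper name apst_state is hypothetical.

```shell
#!/bin/sh
# Extract "Enabled" or "Disabled" from nvme get-feature -H output.
# Assumes a line containing "(APSTE): <state>"; prints nothing otherwise.
apst_state() {
    printf '%s\n' "$1" | sed -n 's/.*(APSTE): *\([A-Za-z]*\).*/\1/p'
}

dev=/dev/nvme0   # assumed device name
if command -v nvme >/dev/null 2>&1 && [ -e "$dev" ]; then
    # Feature 0x0c is Autonomous Power State Transition; usually needs root.
    out=$(nvme get-feature "$dev" -f 0x0c -H 2>/dev/null) || out=""
    echo "APST on $dev: $(apst_state "$out")"
else
    echo "nvme-cli or $dev not available, skipping check"
fi
```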
-
@Quazz Good point, but it might not help in this case. If I remember correctly, @george1421 sent me a link to a video where we saw that the symptom in this case was that it took literally minutes to download the kernel binary when PXE booting a task, so before the kernel is even loaded.
@Sebastian-Roth said:
I suppose the Linux kernel developers are onto this issue as well. Possibly it’s already fixed in one of the more recent kernel lines (5.4.x or 5.6.x).
Now that makes my last comment sound really stupid!
Probably something we’d need to dig into with the iPXE developers. But they seem very busy and unresponsive in the last months and I don’t think we’ll get very far with this.
To really pin that down, one of us devs would need the mentioned hardware to test on. But I am wondering if it’s worth it, as we can’t promise to get it fixed. Could be a firmware bug really.
-
@Sebastian-Roth Yes, it was on the iPXE side. During testing we booted FOS Linux off a USB flash drive and it imaged fine. Well, to be clear, it imaged fine once we added
nvme_core.default_ps_max_latency_us=0
to the USB/GRUB boot parameters. So that kind of points to an iPXE/hardware/UEFI firmware conflict with this specific NVMe disk.
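
If the parameter does not seem to help on a given machine, it is worth confirming that it actually reached the kernel. Here is a minimal sketch that could be run from a shell on the booted FOS system. It looks the parameter up on the kernel command line and then reads the value the nvme_core driver reports in sysfs. The helper name cmdline_param is made up for illustration, and the sysfs path assumes the nvme_core driver is present in the running kernel.

```shell
#!/bin/sh
# Print the value of a key=value kernel parameter from a command-line string.
# If the key appears more than once, the last value is printed.
# Prints nothing if the key is absent.
cmdline_param() {
    printf '%s\n' "$2" | tr ' ' '\n' | awk -v key="$1" '
        index($0, key "=") == 1 { val = substr($0, length(key) + 2); found = 1 }
        END { if (found) print val }'
}

param=nvme_core.default_ps_max_latency_us
if [ -r /proc/cmdline ]; then
    echo "On command line: $(cmdline_param "$param" "$(cat /proc/cmdline)")"
fi

# Value the driver is actually using.
param_file=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$param_file" ]; then
    echo "In driver: $(cat "$param_file")"
fi
```

An empty "On command line" value means the parameter never made it into the boot entry, which would point at the boot configuration rather than the drive.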