slow speed and timeout issues
-
@Sebastian-Roth This happened on every 7070 model that we had. They all had the same hard drive. If I switched to the bigger NVMe drive, everything worked as expected. I worked with George to try to track down more issues.
I am willing to test in any way that you would want me to. I can be a guinea pig.
If you want to see the videos that I took, I will be more than happy to share them with you. I can take more if needed.
I am more than happy to make a donation.
-
@Sebastian-Roth My initial thoughts were a bit off. Initially I thought, what the heck does iPXE care about the disk subsystem? It transfers files from the FOG server to memory and then executes them…
Then I looked at the code and “SANBOOT” popped out. It needs to init the disk subsystem in case iPXE needs to chain to boot from the local hard drive. Now I don’t know what more that means other than that the disk subsystem IS one of the things it inits.
-
@george1421 As far as I understand the iPXE code, I would expect the sanboot code only to be used/initialized when you really use the sanboot command. The PXE environment is somewhat restricted in terms of memory, so I reckon iPXE would not init stuff that is not explicitly needed.
Searching the forums for 7070, we seem to have a few posts on that:
https://forums.fogproject.org/topic/13851/massive-packet-loss-nic-issues-with-new-dell-7070-ultra-in-fog/
https://forums.fogproject.org/topic/13933/issues-with-optiplex-7070/
@mmoore5553 Please take a close look at the last topic (issues with optiplex 7070) and give that a try!
-
Good news. I had some time to play today and found a setting in the BIOS: go to Advanced Configuration, then to ASPM. I had to disable that. It controls the handshake between the device and the PCI Express hub to determine the best ASPM mode supported by the device. Once that was disabled, everything was fast again and I could use the new hard drive and the onboard NIC.
-
@mmoore5553 Thanks for getting back to us with this information. I am sure this will be helpful for others as well. So the issue seems to be caused by some energy saving mode.
-
@Sebastian-Roth Yes. This is new in the upcoming models. I had to reach out to Dell once I found it and let them know about it. They said the newer BIOS will have them. They had no clue that this would cause an issue.
-
@mmoore5553 I suppose the Linux kernel developers are onto this issue as well. Possibly it’s already fixed in one of the more recent kernel lines (5.4.x or 5.6.x).
If you are really keen you could compile your own custom kernel using a newer version and see if it’s fixed upstream already. Just let us know if you need help with that.
-
@Sebastian-Roth Here is a one-off kernel v5.5.3 that I created for some reason in Feb 2020… https://drive.google.com/open?id=1thopskSYJd7ueDQeFg_VT4eeNcrNHvIx
-
@Sebastian-Roth I mean, this seems to be the very same problematic code (or closely related to it) that has already given us various issues, primarily low speeds, which was subsequently partially (or, for some disks, fully) addressed by setting the latency kernel parameter to 0 by default.
That said, as we then discovered, in some cases that is not sufficient and we had to disable the drive's power saving using the NVMe CLI utility for the disks to work normally. I don’t believe that was ever integrated into FOS since we don’t fully know if this could cause issues on otherwise properly working drives.
For NVMe disks specifically it’s APST (Autonomous Power State Transition), the drive-level counterpart to PCIe ASPM.
sudo nvme set-feature -f 0x0c -v=0 /dev/nvme0
That line should disable it (assuming disk name is nvme0)
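To check what state the drive is in before changing anything, something like the sketch below could help. It only prints the nvme-cli commands rather than running them, so it is safe to try on any machine. The function names are made up for illustration, and the long options (--feature-id, --value, --human-readable) are the nvme-cli spellings of -f, -v and -H as far as I know; adjust the device name if yours is not nvme0.

```shell
#!/bin/sh
# Feature ID 0x0c is Autonomous Power State Transition (APST).
APST_FID=0x0c

# Hypothetical helper: print the command that reads the current APST setting.
apst_query_cmd() {
    printf 'nvme get-feature %s --feature-id=%s --human-readable\n' "$1" "$APST_FID"
}

# Hypothetical helper: print the command that turns APST off (value 0).
apst_disable_cmd() {
    printf 'nvme set-feature %s --feature-id=%s --value=0\n' "$1" "$APST_FID"
}

# Assumption: the controller is /dev/nvme0 unless another device is passed in.
dev=${1:-/dev/nvme0}
apst_query_cmd "$dev"
apst_disable_cmd "$dev"
```

Run the printed get-feature command first (with sudo) and only use the set-feature one if APST is reported as enabled.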
-
@Quazz Good point but it might not help in this case. If I remember correctly @george1421 sent me a link to a video where we saw that the symptom in this case was that it took literally minutes to download the kernel binary on PXE booting a task - so even before the kernel is even loaded.
@Sebastian-Roth said:
I suppose the Linux kernel developers are onto this issue as well. Possibly it’s already fixed in one of the more recent kernel lines (5.4.x or 5.6.x).
Now that makes my last comment sound really stupid!
Probably something we’d need to dig into with the iPXE developers. But they seem very busy and unresponsive in the last months and I don’t think we’ll get very far with this.
To really pin that down one of us devs would need the mentioned hardware to test on. But I am wondering if it’s worth it as we can’t promise to get it fixed. Could be a firmware bug really.
-
@Sebastian-Roth Yes, it was on the iPXE side. During testing we booted FOS Linux off a USB flash drive and it imaged fine. Well, to be clear, it imaged fine once we added nvme_core.default_ps_max_latency_us=0 to the usb/grub boot parameters. So that kind of points to ipxe/hardware/uefi firmware that had a conflict with this specific NVMe disk.
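For anyone repeating this test, a quick way to confirm the parameter actually made it onto the kernel command line is a small helper like the one below. This is only a sketch: cmdline_value is a name I made up, and the sample boot line is invented for the demonstration. Without a second argument it reads /proc/cmdline on the running system.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a kernel command-line parameter.
# Usage: cmdline_value NAME [CMDLINE_STRING]
# Returns non-zero if the parameter is not present.
cmdline_value() {
    name=$1
    # Fall back to the running kernel's command line if no string is given.
    cmdline=${2-$(cat /proc/cmdline)}
    # Unquoted on purpose so the shell splits the line on whitespace.
    for word in $cmdline; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Invented sample boot line, for illustration only.
sample='BOOT_IMAGE=/bzImage root=/dev/ram0 nvme_core.default_ps_max_latency_us=0'
cmdline_value nvme_core.default_ps_max_latency_us "$sample"
```

On a booted system with the nvme_core module loaded, the effective value can also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.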