Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG
-
@Quazz said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:
https://bugzilla.redhat.com/show_bug.cgi?id=1652865
Some other people having a similar issue, problem possibly fixed on newer kernel version.
@no0NE Go to the Kernel Update page and try grabbing the Kernel 5.1.16
Thanks for the info!
I’m quite sure it’s related to the kernels; I’ve tried Ubuntu 19.10 with kernel 5.3.0-18 and it works fine there. (What confused me and made me start this thread was that identical PCs, one to start with and now three tested, all worked fine with the same firmware, OS, network, cables, etc.!) I forgot to mention that I also tried the TomElliot 5.1.16 (“mac/nvme fix”) kernel during my kernel troubleshooting this Monday, with the same results.
So I can’t return the PC to Dell as DOA, but I’ll try calling their support to ask what may differ between the physical PCs in terms of settings, hardware, firmware, etc.
-
I’ve been running into this issue for a few years now. It seems there is always one machine in each order of new Dell PCs that won’t image. The network just stops responding randomly during the boot-up phase into Partclone, and if it makes it to Partclone it only transfers for about 10-15 seconds and then stops.
Transferring the SSD into the exact same model from the same order will let me image the drive, so it seems like it’s something within each Dell PC build that causes this… however, I don’t even know where to start troubleshooting it…
My solution has been to just take the drive out of the machine that won’t transfer an image and swap it with one I’ve already imaged, which has been getting increasingly difficult since Dell keeps making the bottom panels of their laptops thinner and thinner…
-
@MHImager Does this happen on models with NVMe SSDs only?
-
@Quazz Nope, even when we were ordering Dell PCs with SATA SSDs this was an issue.
I’m currently experiencing the issue with one of our new Latitude 5500 laptops; we ordered two, and one imaged fine while the other isn’t able to fully transfer an image. However, in just the few hours I’ve been playing around with it, I’ve managed to get it to transfer a small number of blocks by cycling the NIC (unplugging and re-plugging the cable). It’s a very strange issue. I can’t figure out why it happens exactly, but I suspect it has something to do with either the MAC address burned into the NIC or the driver that iPXE uses.
-
@MHImager Would you be able to boot some Linux Live CD/DVD and do some network transfer testing on that?!
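A minimal Python sketch of such a transfer test, assuming Python 3 is available on the Live system and that you point it at some large file reachable over HTTP. The names `timed_download` and `find_stalls` are made up for illustration, and the 2-second stall threshold is arbitrary. It pulls the file in chunks, records when each chunk arrives, and reports any gaps longer than the threshold, which is roughly what a stalled imaging transfer would look like.

```python
import time
import urllib.request


def find_stalls(timestamps, max_gap=2.0):
    """Return (start_time, gap_seconds) for every gap between
    consecutive chunk arrivals that is longer than max_gap."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > max_gap:
            stalls.append((prev, gap))
    return stalls


def timed_download(url, chunk_size=1 << 20, timeout=30):
    """Download url in chunks, discard the data, and return the
    list of chunk arrival times plus the total byte count.
    The URL is whatever large file you choose (an assumption,
    not a FOG-specific endpoint)."""
    timestamps = [time.monotonic()]
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            timestamps.append(time.monotonic())
    return timestamps, total
```

Calling `timed_download(url)` and passing the returned timestamps to `find_stalls` should show whether the link goes quiet mid-transfer; a read that hangs longer than `timeout` raises an exception instead, which is a result in itself.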
-
@Sebastian-Roth Funnily enough, I just tried live booting Ubuntu 18.04.3 on the laptop with the imaging issues and I was able to ping my internal domain controller and the fog server without any packet loss and a very good response time… I would say that rules out hardware?
-
@MHImager said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:
I was able to ping my internal domain controller and the fog server without any packet loss and a very good response time… I would say that rules out hardware?
I am not exactly sure. From what you wrote so far it sounds as if you see the network issues right from the beginning when you PXE boot. That would mean it’s not packet loss due to high network load, so pinging can be counted as a valid test in this case.
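To put a number on it, the ping summary can be collected by a small script and repeated as often as needed. A rough Python sketch, assuming the iputils `ping` found on most Linux Live systems and its usual "N packets transmitted, M received" summary line; the function names are hypothetical:

```python
import re
import subprocess

# Matches the iputils summary line, e.g.
# "20 packets transmitted, 20 received, 0% packet loss, time 19030ms"
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(output):
    """Return (sent, received, loss_percent), or None if the
    summary line is not found in the ping output."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    loss = 100.0 * (sent - received) / sent if sent else 0.0
    return sent, received, loss


def ping_loss(host, count=20):
    """Run `ping -c count host` and parse its summary."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_ping_summary(result.stdout)
```

Running `ping_loss` against the FOG server from the Live system gives a loss figure for the same port and cable that fail during imaging.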
But then I wonder what is causing this because we have some very different components in the chain. iPXE is just for booting it up and then the Linux kernel takes over for imaging. If both have the issue it would mean both have driver issues.
Did you do the Live Linux test on the same network switch/port as you do the imaging?
-
@Sebastian-Roth Very true, and yes, I left the laptop in the exact same port/switch that I ran my live Ubuntu test in.
It seems to fail at any stage once booted to the FOG menu. I typically go Deploy Image -> (username and password) -> select the image I want to deploy. Sometimes it fails right after I hit Enter there, while trying to download one of the files it needs before it clears the drive and starts Partclone (I believe boot.php and bzImage). If it doesn’t fail there, it typically moves on to Partclone and then fails while trying to copy the partition image files.
I’ve tried a handful of different kernels with no luck, but the weird thing is I was still able to image a different laptop of the exact same model and config, with the only difference known to me being the MAC address…
-
@MHImager If it works reliably with another machine of the same model, I would start questioning the integrity of the problematic machine. Since things seem to be failing inconsistently, it might be worth running a memtest and putting it through Dell’s onboard diagnostics. While not at all common, I have received a couple of machines with dodgy RAM that have done weird things over the years.
-
@MHImager said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:
with the only difference known to me being the MAC address…
Are you sure about that? Firmware version? Exact same hardware? I mean, did you buy those as a batch or several orders?
-
@Sebastian-Roth said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:
Are you sure about that? Firmware version? Exact same hardware? I mean, did you buy those as a batch or several orders?
Updated both laptops to BIOS version 1.5.1. Exact same SKU. They were ordered along with 4 OptiPlex 5070s (one of those had the same issue; I had to swap the drive to image it) and 2 all-in-ones, which luckily imaged fine!
About a year ago we placed an order for about 35 all-in-one Dell PCs (OptiPlex 5260 AIO) and one of the machines in that batch did the same thing. Also in that order were about 15 laptops (Latitude 5490) and again, same thing: one machine had the same behaviour and I had to swap drives to image it. I should mention I was using legacy PXE boot back then; now I’m using UEFI.
-
@MHImager If you see such behaviour even with different models, I’d think about something else causing this. Maybe it’s one switch, or port, or cable, or something like that?
-
Sorry for reviving an old thread (of mine).
Just wanted to post the solution in case anyone encounters the issue and finds this thread while looking for a fix. We solved this around the time FOG 1.5.8 came out, in early spring 2020.
No matter what we did, we still had the issue with roughly 7-10% of our Dell 7070 Ultra PCs after we received the final batch of user PCs.
Soon after 1.5.8 rolled out I tried upgrading our FOG test server to it, and afterwards it all worked fine with the troublesome PCs!
We tried copying the kernel and, as far as we know, all the files loaded up to the start of the imaging session from the test environment (1.5.8) to production (1.5.7), but the issue was still there. After a while we upgraded the production FOG server to 1.5.8 and it started working there as well. Something in 1.5.7 glitches in some cases with specific hardware, even among identical models (I can’t find any difference other than perhaps the MAC addresses in use, if for some reason the NIC drivers misbehave with MACs ending in certain values…?). 1.5.8 solved that, and it was not the kernel itself that fixed it.
(If anyone knows what could cause this: I’m just technically curious about the reason for the errors. Does any developer know what changed, other than the kernel, that could have solved this?)
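If anyone wants to test the MAC hunch instead of guessing, one could tally the MAC endings of failing and working machines side by side. A rough Python sketch, purely illustrative: the function names are made up, no real data is included, and you would feed in your own lists of MAC addresses.

```python
from collections import Counter


def normalise_mac(mac):
    """Strip common separators and lowercase, returning 12 hex digits."""
    digits = mac.lower().replace(":", "").replace("-", "").replace(".", "")
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def suffix_counts(macs, octets=1):
    """Count how many MACs end in each trailing-octet value."""
    return Counter(normalise_mac(m)[-2 * octets:] for m in macs)


def compare_suffixes(failing, working, octets=1):
    """Map each suffix to (count among failing, count among working)."""
    bad = suffix_counts(failing, octets)
    good = suffix_counts(working, octets)
    return {s: (bad[s], good[s]) for s in sorted(set(bad) | set(good))}
```

If the failing machines cluster on a few suffixes that never show up among the working ones, the hunch gains some weight; with only a handful of machines it proves nothing either way.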
You can close this otherwise dead thread
-
@no0NE It would be interesting to see whether the 5.6.x or later version of the FOS Linux kernel would also have addressed this issue. You can add this kernel using the FOG Settings -> Kernel menu. You WILL need the 5.6.0 or later kernel to support the newest hardware that is out there.
FWIW: The Linux kernel developers are no longer backporting new drivers to the 4.19.x series of kernels, so you will need to upgrade sooner or later.
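If you want to check a given kernel release against that 5.6.0 floor in a script, a comparison along these lines should do. This is a minimal Python sketch; `kernel_at_least` is a hypothetical helper, and it assumes release strings of the usual `major.minor[.patch]-suffix` form, for example as printed by `uname -r`.

```python
import re


def kernel_at_least(release, minimum=(5, 6, 0)):
    """Return True if a release string such as '5.3.0-18-generic'
    is at least `minimum`, compared as (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '5.6') is treated as 0.
    version = tuple(int(part) if part else 0 for part in match.groups())
    return version >= minimum
```

On a running Linux system, `platform.release()` from the standard library returns the release string to pass in.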
-
@no0NE Really interesting this was solved by 1.5.8 but not the Kernel. Did you also copy over the inits to check if those would make a difference?
Anyhow, thanks for letting us know!
-
@Sebastian-Roth As far as I remember, we replaced the kernels (both 32 and 64 bit) and the inits/bzImage, and expanded to more and more files during troubleshooting to try to find the culprit. Nothing helped other than upgrading to 1.5.8. I assume it’s something Dell specific with some weird OEM magic that messes up something somehow somewhere…