Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG



  • Hi.

    Our company recently purchased a few Dell 7070 Ultras so we can start preparing for a switch to this PC/setup in our production environment.

    So far I’ve successfully captured and deployed a test image from one of my two test PCs of this model. My problem is that the second PC, of the exact same model and batch, suddenly shows massive packet loss and high response times after loading the bzImage/kernel file (and throughout the rest of the imaging process): roughly 500-5000+ ms, with a lot of packets dropped altogether. As a result, imaging takes a weekend instead of ~5 min. Imaging also works correctly on our legacy hardware running undionly.kkpxe/BIOS.

    Once in Windows/UEFI/anywhere other than FOG, the PC has normal response times and everything works perfectly, which suggests the NIC itself is fine…

    The problem starts before any imaging/capturing begins, as soon as the kernel is loaded, which probably points to a driver issue. What confuses me, in that case, is why the first PC works like a charm every time…

    I’ve manually upgraded the kernel to Kernel.TomElliott.4.19.64.64 (from the included .48 kernel) - no difference.

    At this stage I’d like to test more PCs of this model, but it will take a couple of months before that’s possible.

    So my question is: can you point me in a direction for further troubleshooting, or is there a newer kernel/driver that might simply work better? It is a brand new model and even a completely new series from Dell, after all…

    I’ve tried swapping ports and cables between the PC that works and the one that doesn’t; it’s always this specific PC that fails, with any combination of cables etc. Once, about 99% of the imaging process suddenly seemed to work and I managed to deploy the image to the PC, but that’s once in 50+ tries. Randomly during imaging it might start working with ~1 ms for 5-10 seconds and then stop working again. Sometimes (maybe 10% of the time), if I pull the Ethernet cable out for a couple of seconds and put it back in, it also works for the first 5-10 seconds… It really feels like a driver issue.
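
    To put numbers on the loss and latency described above, a saved `ping` log can be summarized with a short script. This is only a sketch: the sample lines, the address and the timings are made up for illustration, and `summarize` is a hypothetical helper that assumes the usual Linux `ping` reply format.

```python
import re
import statistics

# Made-up sample in the format printed by Linux `ping`; these are NOT
# measurements from the affected PC. Replies 3 and 4 are missing on purpose.
SAMPLE = """\
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.912 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=1840 ms
64 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=4210 ms
64 bytes from 192.168.1.10: icmp_seq=6 ttl=64 time=1.03 ms
"""

# Matches the round-trip time of one reply line.
REPLY_TIME = re.compile(r"time=([\d.]+) ms")


def summarize(ping_output, sent):
    """Return loss percentage and latency stats for `sent` echo requests."""
    times = [float(m.group(1)) for m in REPLY_TIME.finditer(ping_output)]
    received = len(times)
    return {
        "received": received,
        "loss_pct": 100.0 * (sent - received) / sent,
        "min_ms": min(times) if times else None,
        "median_ms": statistics.median(times) if times else None,
        "max_ms": max(times) if times else None,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE, sent=6))
```

    Running the same summary on the working and the failing PC gives a like-for-like comparison instead of eyeballing the console.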

    Do you think it’s a new/other kernel version that should solve this, or a newer FOG version altogether or something else?

    FOG 1.5.7 stable, ARM (FOG test environment on Raspberry Pi 4, 4 GB, Raspbian Buster, latest updates as of last week)

    Imaging/capturing: Dell 7070 Ultra, i5-8365U, 8 GB RAM, UEFI, with default ipxe.efi and the .48 x64 and .64 x64 FOG/Tom kernels

    NIC: Intel I219-LM


    UPDATE: A new BIOS, 1.1.2, was released on 21st October. I updated to this version and the same issue remains on this PC. With a fresh Windows 10 install from Microsoft the network works perfectly; with Debian Buster, FOG 1.5.7 etc. the issue is the same. Everything still works perfectly on the first PC.

    Today (22nd October) I heard from our branch office that, with our help, they’ve managed to run FOG on the two Dell 7070 Ultras there, and they work fine as well! Very strange that this single PC would behave like this, and only in a Linux environment.


    Any help greatly appreciated, thanks!
    Best regards,
    Robin, IT Specialist. (With client PC environment responsibility among other things)


  • Developer

    @MHImager If you see this behaviour even with different models, I’d suspect something else is causing it. Maybe it’s one switch or port or cable or…?



  • @Sebastian-Roth said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:

    Are you sure about that? Firmware version? Exact same hardware? I mean, did you buy those as a batch or several orders?

    Updated both laptops to BIOS version 1.5.1. Exact same SKU. They were ordered along with 4 OptiPlex 5070s (one of those had the same issue; I had to swap the drive to image it) and 2 all-in-ones, which luckily imaged fine!

    About a year ago we placed an order for about 35 Dell all-in-one PCs (OptiPlex 5260 AIO), and one of the machines in that batch did the same thing. That order also included about 15 laptops (Latitude 5490) and, again, one machine had the same behaviour and I had to swap drives to image it. I should mention I was using legacy PXE boot back then; now I’m using UEFI.


  • Developer

    @MHImager said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:

    with the only known difference to me would be the MAC address…

    Are you sure about that? Firmware version? Exact same hardware? I mean, did you buy those as a batch or several orders?
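
    One way to check this systematically is to capture the `key: value` driver/firmware dump that `ethtool -i <interface>` prints on each machine and diff the two. The sketch below assumes that output format; `parse_kv` and `diff` are hypothetical helper names, and the sample values are invented, not taken from the machines in this thread.

```python
def parse_kv(text):
    """Parse 'key: value' lines (as printed by `ethtool -i`) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like a PCI bus address
        # that contain colons themselves stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def diff(a, b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Invented sample dumps for a working and a failing machine.
GOOD = """driver: e1000e
version: 3.2.6-k
firmware-version: 0.5-4
bus-info: 0000:00:1f.6
"""
BAD = """driver: e1000e
version: 3.2.6-k
firmware-version: 0.4-4
bus-info: 0000:00:1f.6
"""

if __name__ == "__main__":
    for key, (good, bad) in diff(parse_kv(GOOD), parse_kv(BAD)).items():
        print(f"{key}: working={good!r} failing={bad!r}")
```

    An empty diff would support the "only the MAC address differs" observation; any difference gives something concrete to bring to the vendor.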



  • @MHImager If it works reliably with another machine of the same model, I would start questioning the integrity of the problematic machine. Since things seem to be failing inconsistently, it might be worth running a memtest and Dell’s onboard diagnostics. While not at all common, I have received a couple of machines with dodgy RAM over the years that did weird things.



  • @Sebastian-Roth Very true, and yes, I left the laptop on the exact same port/switch that I ran my live Ubuntu test on.

    It always seems to fail at some stage once booted to the FOG menu. I typically go Deploy Image -> (username and password) -> select the image I want to deploy. Sometimes it fails right after I hit Enter there, while trying to download one of the files it needs before it clears the drive and starts partclone (I believe these are boot.php and bzImage). If it doesn’t fail there, it typically moves on to partclone and then fails while copying the partition image files.

    I’ve tried a handful of different kernels with no luck, but the weird thing is I was still able to image a different laptop of the exact same model and config, with the only known difference to me would be the MAC address…


  • Developer

    @MHImager said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:

    I was able to ping my internal domain controller and the fog server without any packet loss and a very good response time… I would say that rules out hardware?

    I am not exactly sure. From what you wrote so far, it sounds as if you see the network issues right from the beginning when you PXE boot. That would mean it’s not packet loss due to high network load, so pinging counts as a valid test in this case.
    But then I wonder what is causing this because we have some very different components in the chain. iPXE is just for booting it up and then the Linux kernel takes over for imaging. If both have the issue it would mean both have driver issues.
    Did you do the Live Linux test on the same network switch/port as you do the imaging??



  • @Sebastian-Roth Funnily enough, I just tried live booting Ubuntu 18.04.3 on the laptop with the imaging issues and I was able to ping my internal domain controller and the fog server without any packet loss and a very good response time… I would say that rules out hardware?


  • Developer

    @MHImager Would you be able to boot some Linux Live CD/DVD and do some network transfer testing on that?!
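
    A rough way to run such a transfer test without extra tools, assuming Python 3 is available on the live system, is a plain TCP sender/receiver pair. This is a sketch under that assumption, not a replacement for a dedicated benchmarking tool; all function names are hypothetical, and the demo below only exercises the loopback interface so it can run on a single machine.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call


def receive_all(server_sock, result):
    """Accept one connection; record bytes received and seconds elapsed."""
    conn, _ = server_sock.accept()
    start = time.perf_counter()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # sender closed the connection
                break
            total += len(data)
    result["bytes"] = total
    result["seconds"] = time.perf_counter() - start


def send_bytes(host, port, n_bytes):
    """Send n_bytes of zeros to host:port, then close the connection."""
    payload = bytes(CHUNK)
    with socket.create_connection((host, port)) as sock:
        remaining = n_bytes
        while remaining > 0:
            piece = payload[:remaining]  # a full chunk, or the final remainder
            sock.sendall(piece)
            remaining -= len(piece)


def loopback_demo(n_bytes=8 * 1024 * 1024):
    """Run receiver and sender on 127.0.0.1 to check the code path."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()
    send_bytes("127.0.0.1", port, n_bytes)
    receiver.join()
    server.close()
    return result


if __name__ == "__main__":
    r = loopback_demo()
    print(f"{r['bytes']} bytes in {r['seconds']:.3f} s")
```

    For a real test, the receiver would listen on one machine (for example the imaging server) and `send_bytes` would be called from the live-booted client with that machine's address; a stall after a few seconds, as seen during imaging, should then show up as a very long or never-ending transfer.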



  • @Quazz Nope, this was an issue even when we were ordering Dell PCs with SATA SSDs.
    I’m currently experiencing the issue with one of our new Latitude 5500 laptops: we ordered two, one imaged fine and the other isn’t able to fully transfer an image. However, in just the few hours I’ve been playing around with it, I’ve managed to get it to transfer a small number of blocks by cycling the NIC (unplugging/replugging the cable).

    It’s a very strange issue, I can’t seem to figure out why it happens exactly, but I suspect it has something to do with either the MAC address burned into the NIC or just the driver that iPXE uses.


  • Moderator

    @MHImager Does this happen on models with NVMe SSDs only?



  • I’ve been running into this issue for a few years now. It seems there is always one machine in each order of new Dell PCs that won’t image. The network just stops responding randomly during the bootup phase into partclone, and if it makes it to partclone it only transfers for about 10-15 seconds and then stops.

    Transferring the SSD into the exact same model from the same order lets me image the drive, so it seems like it’s something within each Dell PC build that causes this… however, I don’t even know where to start troubleshooting it…

    My solution has been to just take the drive out of the machine that won’t transfer an image and swap it with one I’ve already imaged, which has been getting increasingly difficult since Dell keeps making the bottom panels of their laptops thinner and thinner…



  • @Quazz said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:

    https://bugzilla.redhat.com/show_bug.cgi?id=1652865

    Some other people having a similar issue, problem possibly fixed on newer kernel version.

    @no0NE Go to the Kernel Update page and try grabbing the Kernel 5.1.16

    Thanks for the info!
    I’m quite sure it’s related to the kernel: I’ve tried Ubuntu 19.10 with kernel 5.3.0-18 and it works fine there… (What confused me and made me start this thread is that the other PCs of the same model, one to start with and now three tested, all worked fine with the same firmware, OS, network/cables etc.!)

    I forgot to mention that I also tried the TomElliott 5.1.16 (“mac/nvme fix”) kernel during my kernel troubleshooting this Monday, with the same results.

    So I can’t return the PC to Dell as DOA, but I’ll try calling their support to ask what may differ between the physical PCs in terms of settings, hardware, firmware etc.


  • Moderator

    https://bugzilla.redhat.com/show_bug.cgi?id=1652865

    Some other people having a similar issue, problem possibly fixed on newer kernel version.

    @no0NE Go to the Kernel Update page and try grabbing the Kernel 5.1.16


  • Developer

    @no0NE Are there any power-saving features available on the NIC under Windows that are enabled? Sometimes those get stuck in Windows and the Linux side isn’t able to re-enable performance mode.



  • UPDATE: A new BIOS, 1.1.2, was released on 21st October. I updated to this version and the same issue remains on this PC.

    • Fresh Windows 10 install from Microsoft - network works perfectly. Also, every Ubuntu version I’ve tried works fine, including the latest, 19.10 (kernel 5.3.0-18).

    • Debian Buster, PXE FOG 1.5.7 etc. - same issue; still works perfectly on the first PC.

    Today (22nd October) I heard from our branch office that, with our help, they’ve managed to run FOG on the two Dell 7070 Ultras there, and they work fine as well! Very strange that this single PC would behave like this, and only in a Linux environment. If it weren’t for it working fine in Windows I’d return it as DOA, but I’ll reach out to Dell for any help on what may be causing this.



  • @george1421

    Thanks for the feedback.
    We switched to Dell from another brand, so we don’t have any adapters lying around, but I have a Dell DA200 USB-C dongle with Ethernet at home. Good idea to try that; I’ll bring it tomorrow just to test! :)

    Last week I quickly tried some version of Ubuntu live (I think 18.04) on it and it worked that time. But I’ve now downloaded Debian Buster and tried that as well, just to be as sure as possible, and the same problem shows up there! Only on one of the PCs, though: Debian Buster on “the first” PC of the same model continues to work fine with the same cables and BIOS. Thanks for pushing me to actually double-check that!

    I’ll have to submit this PC to Dell as DOA soon, but I’m unsure how they’ll see it since it works in Windows. I hope they show some goodwill, with us being a new Dell customer with a big order coming.
    Before doing that, I’m now reinstalling Windows 10 manually from USB (Windows 10 media creation tool), just to make sure that still works.

    Btw, we began our environment change focused on the 3060; then the 3070 came out and we were preparing to purchase that model, but we slowly drifted towards optimizing our physical environment with the 7070 Ultra instead and settled on this model. It’s quite different from the 3060/3070 since it’s built on laptop parts to begin with. It’s still quite expandable and configurable, but no full-size PCIe etc. ;)


    UPDATE

    I talked with a Dell tech who helped me brainstorm a few things, and we double-checked the revision of the NIC, which is the same on both/all machines (Rev 11 of the Intel I219-LM Ethernet NIC). We didn’t get any wiser, so I’m basically still at the same spot.

    My conclusion is that it seems to work on the newest kernels even on this troublesome PC, so I’m leaning towards building my own kernel, which should sort it out (I’d probably have done that by now if it weren’t for all the other PCs already working…).



  • Moderator

    @no0NE I’d still recommend trying a different cable/switch port just to rule it out completely, even if the first device seems to work properly on it.

    I also found the following on this topic that’s worth checking out: https://wiki.hetzner.de/index.php/Low_performance_with_Intel_i218/i219_NIC/en


  • Developer

    @no0NE So that leads us to the suggestion George made earlier. Boot some kind of Linux Live OS and see if it causes the same problems on this hardware (but not on the other one).



  • @Sebastian-Roth said in Massive packet loss/NIC issues with new Dell 7070 Ultra in FOG:

    @no0NE Have you had the first device (working fine) on the same switch port and with the same cable that you now see the problem with the second “faulty” device?

    Absolutely sure the firmware is the exact same version on both?

    Yes & yes, sadly. Beyond that, I’ve tried several other straight and crossover cables to be fully sure. I’m running out of ideas to troubleshoot, hence this post… :) Good suggestions if I hadn’t already checked, thanks! :)

