Dell 7730 precision laptop deploy GPT error message
-
@Sebastian-Roth said in Dell 7730 precision laptop deploy GPT error message:
@george1421 said in Dell 7730 precision laptop deploy GPT error message:
Do we have empirical evidence that these disks are being swapped as being reported by the uefi bios?
See @jmason’s lsblk listings. From my point of view this is evidence enough. The disks are different sizes and do swap. As far as I understood the postings, the output was always taken on the same deploy system. One time the 477 GB drive is last and the other time it’s first.
I have indeed been working with only 1 laptop for the deploy attempts and only 1 laptop from which the capture image was created, and both are identical machines. I have 20 of them in total.
-
@george1421 said in Dell 7730 precision laptop deploy GPT error message:
@Sebastian-Roth (this is more of a brain dump than an answer)
The Precision 7730 generation is pretty new, so the first thing I would check/watch for is firmware update availability.
There is a BIOS update with the following fixes, but don’t see anything related to our issue. I can go ahead and update the system with all the latest fixes available if requested.
Dell Support for Precision 7730
- Fixes the issue where the mouse lags when the Dell TB16 dock is unplugged or plugged in.
- Fixes the issue where the system cannot set hard drive password with Dell Client Configuration Toolkit.
- Fixes the issue where the system always boots to Rufus formatted USB drives instead of internal hard drive.
- Improves system performance under heavy load when connected to Dell TB18DC Dock.
-
@jmason said in Dell 7730 precision laptop deploy GPT error message:
There is a BIOS update with the following fixes, but don’t see anything related to our issue. I can go ahead and update the system with all the latest fixes available if requested.
I would do this no matter what even though the change log shows the fix primarily dealing with the usb-c dock.
-
@jmason First let me say, thank you for being so detailed and helping debug this issue. I know it takes quite a bit of time to do this iterative testing, so thank you.
So I see from the pictures you have a 20% rate where it looks like NVMe disk 1 inits before disk 0. Can we correlate the 6th and 9th order with the swap shown by lsblk?
I also see from the picture that the PCI addresses for the NVMe disks are not changing, at least the location vs. name. My intuition tells me that this problem is probably rooted in the Linux kernel and/or a hardware/Linux kernel race condition.
-
Here are two images showing the correlation. I only had to attempt deploy twice this time to get the difference to appear.
Well, I just ran it a third time and got another difference, but it appears it might be just like the first image, only with the output in a different order.
-
@jmason Great stuff. Thanks for that as well. From those pictures it looks like the order of initialization (dmesg output nvme0n1 before nvme1n1 or vice versa) does not correlate with the disks being in a different order.
- picture: init nvme1n1 before nvme0n1 - nvme0n1 954 GB disk / nvme1n1 477 GB disk
- picture: init nvme0n1 before nvme1n1 - nvme0n1 477 GB disk / nvme1n1 954 GB
- picture: init nvme0n1 before nvme1n1 - nvme0n1 954 GB disk / nvme1n1 477 GB
I suppose if you’d have three disks it could be any combination…
Let’s see what we can do about this. Can you please get a couple of different Linux Live ISOs and do exactly the same testing on those.
- Debian: https://cdimage.debian.org/debian-cd/current-live/amd64/iso-hybrid/debian-live-9.7.0-amd64-xfce.iso
- Ubuntu: http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso
- Arch: https://mirror.orbit-os.com/archlinux/iso/2019.01.01/archlinux-2019.01.01-x86_64.iso
- SystemRescueCD: https://osdn.net/projects/systemrescuecd/storage/releases/6.0.1/systemrescuecd-6.0.1.iso
Please see if all of those behave exactly the same (random change on every reboot) or if the disk order seems stable. At least boot each OS ten times.
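To make the reboot comparison less dependent on reading screenshots, the check could be scripted. The following is only a sketch: it assumes you save the output of `lsblk -b -d -n -o NAME,SIZE` (one "name size-in-bytes" line per disk) on every boot, and the helper names and byte sizes are made up for illustration.

```python
from collections import Counter


def parse_lsblk(output):
    """Map device name -> size in bytes from `lsblk -b -d -n -o NAME,SIZE` text."""
    disks = {}
    for line in output.strip().splitlines():
        name, size = line.split()
        disks[name] = int(size)
    return disks


def order_signature(disks):
    """Device names sorted by size, smallest first. Changes when disks swap names."""
    return tuple(sorted(disks, key=lambda name: disks[name]))


def count_orderings(boot_outputs):
    """Count how often each name-to-size ordering occurred across boots."""
    return Counter(order_signature(parse_lsblk(out)) for out in boot_outputs)


# Illustrative values only: a ~477 GiB and a ~954 GiB disk trading names.
boot_a = "nvme0n1 1024209543168\nnvme1n1 512110190592\n"
boot_b = "nvme0n1 512110190592\nnvme1n1 1024209543168\n"
for ordering, count in count_orderings([boot_a, boot_b, boot_a]).items():
    print(ordering, count)
```

If more than one ordering shows up for the same machine and the same ISO, the naming is not stable across reboots on that system.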
-
@Sebastian-Roth Grabbing the ISOs now, debian and ubuntu just updated their release.
https://cdimage.debian.org/debian-cd/current-live/amd64/iso-hybrid/debian-live-9.8.0-amd64-xfce.iso
http://releases.ubuntu.com/18.04.2/ubuntu-18.04.2-desktop-amd64.iso
-
I am unable to get Debian live to start up after the initial run without install menu.
Ubuntu showed the behavior on the 3rd reboot with lsblk and on the 5th reboot with dmesg, while reboot 7 was different from all previous ones. I’ll move on to the other 2 ISOs next.
-1-
-2-
-3-
-4-
-5-
-6- was like -3-
-7-
-
@jmason What mode is SATA operation set to in the BIOS?
I’m guessing it’s in RAID mode. Could it be that, because RAID mode expects at least 2 drives regardless of the RAID level, the RAID controller is changing how the disks are presented to the OS?
Essentially, because it expects RAID, the order in which the disks are listed wouldn’t matter for building the array and having things work properly. But because we aren’t in a RAID configuration, we are seeing the issue?
-
@Tom-Elliott I specifically have set the mode to AHCI for these systems.
-
First 4 reboots in archlinux showed all different inits it appears.
reboot -5- was like -1-
-
@Sebastian-Roth said in Dell 7730 precision laptop deploy GPT error message:
SystemRescueCD
5 reboots, 1st 4 times different init, 5th same as 1st
-
@jmason To me this seems to be enough evidence that it’s a general “issue” or known to work as intended. I suspect this to be “normal” as PCIe initialization probably returns the disks in different order. Weird thing is that I can’t find much about this being a particular issue with NVMe disks.
-
@george1421 said in Dell 7730 precision laptop deploy GPT error message:
@jmason said in Dell 7730 precision laptop deploy GPT error message:
There is a BIOS update with the following fixes, but don’t see anything related to our issue. I can go ahead and update the system with all the latest fixes available if requested.
I would do this no matter what even though the change log shows the fix primarily dealing with the usb-c dock.
I did load all available bios/firmware updates and retested the behavior and it is still the same.
-
@Sebastian-Roth Is it feasible to have an option for multiple disk non-resizeable, plus some kind of checkbox/option to notify FOG that the machines are identical drive-wise/hardware-wise, and would it make a difference? It’s been a long time since I did any coding, and it wasn’t related to this at all, so I’m just throwing a thought out.
-
@jmason Sorry if it sounded like I’d leave you alone now that we are fairly sure it’s just “normal” behaviour. I am still thinking about how we can solve this for you and others. I have not come up with a great solution yet, so I keep postponing the implementation in the hope of a flash of genius.
What is your deadline to get those devices imaged?
-
@Sebastian-Roth Everything I’ve found on this issue refers to using the disk’s UUID to identify which one to apply it to. That doesn’t help us much, as every drive on a system would have its own UUID. So how do we identify which is which? I know it doesn’t help anything. Nothing from SATA to PATA and NVMe is guaranteed to have a persistent naming scheme in Linux. Luckily SATA and PATA seem to follow the channel pattern in how they’re connected and named. With NVMe being on a PCIe channel, enumeration depends on how fast a disk feels like revealing itself to the system.
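For reference, Linux does expose per-disk identifiers as symlinks under /dev/disk/by-id that point at whatever kernel name the disk got on the current boot. The sketch below (the function name is hypothetical) just lists that mapping. As noted above, it does not solve the imaging problem by itself, because every one of the identical laptops will have different identifiers.

```python
import os


def stable_ids(by_id_dir="/dev/disk/by-id"):
    """Map each whole-disk by-id link name to the kernel device it points at."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        if "-part" in entry:
            continue  # skip partition links, keep whole disks only
        link = os.path.join(by_id_dir, entry)
        # The links are relative (e.g. ../../nvme0n1); resolve and keep the name.
        mapping[entry] = os.path.basename(os.path.realpath(link))
    return mapping


if __name__ == "__main__" and os.path.isdir("/dev/disk/by-id"):
    for disk_id, device in stable_ids().items():
        print(disk_id, "->", device)
```

Running this on two boots where the disks swapped would show the same identifiers pointing at different nvmeXn1 names.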
-
@Tom-Elliott You are spot on! The only thing I have come up with so far is saving the disks’ sector sizes (in multiple disk mode only) and trying to match those again on deployment. Kind of ugly and possibly error-prone, but I could give it a try.
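A rough sketch of what such size matching might look like, purely as an illustration and not as FOG code. It assumes the value saved at capture time is each disk's total size in sectors, and the function and parameter names are invented. It refuses to guess when a size is missing or ambiguous, which is where the error-prone part would show up (for example two disks of identical size).

```python
def match_disks_by_size(captured, present):
    """Match captured disk numbers to current devices by size.

    captured: {disk_number: size_in_sectors} saved with the image (assumed format)
    present:  {device_name: size_in_sectors} detected on the target machine
    Returns {disk_number: device_name}; raises ValueError if a match is
    missing or ambiguous.
    """
    devices_by_size = {}
    for device, size in present.items():
        devices_by_size.setdefault(size, []).append(device)

    result = {}
    for number, size in captured.items():
        candidates = devices_by_size.get(size, [])
        if len(candidates) != 1:
            raise ValueError(
                f"disk {number}: expected exactly one device with {size} sectors, "
                f"found {len(candidates)}"
            )
        result[number] = candidates[0]
    return result


# Illustrative sector counts for a ~477 GiB and a ~954 GiB disk.
image = {1: 1000215216, 2: 2000409264}
swapped_boot = {"nvme0n1": 2000409264, "nvme1n1": 1000215216}
print(match_disks_by_size(image, swapped_boot))
```

With the swapped boot above, disk 1 of the image lands on nvme1n1 and disk 2 on nvme0n1, regardless of which name the kernel handed out first.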
-
@Sebastian-Roth said in Dell 7730 precision laptop deploy GPT error message:
What is your deadline to get those devices imaged?
I have until mid March before my first full implementation with these new training laptops. I can always image them individually via usb until a working solution is found (aka someone learns how to control the nvme and its feelings of revealing).
-
@Tom-Elliott said in Dell 7730 precision laptop deploy GPT error message:
@Sebastian-Roth Everything I’ve found on this issue refers to using the disk’s UUID to identify which one to apply it to. That doesn’t help us much, as every drive on a system would have its own UUID.
When registering a system Host into FOG, you’d have to store the UUIDs of the drives and then specify which one would be your disk0/sda and disk1/sdb, etc. Thinking out loud is all.
Then on deploy if the UUID fields and their mappings are set you use that, otherwise operate as usual.
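Thinking the same idea through in code, as a sketch only: the stored mapping, its format, and every name below are hypothetical and nothing here exists in FOG. If identifiers were saved for the host and all of them are found on this boot, use them; otherwise fall back to plain enumeration order.

```python
def resolve_targets(stored, detected, default_order):
    """Decide which device each disk slot of the image goes to.

    stored:        {slot: disk_id} saved at host registration (may be empty)
    detected:      {disk_id: device_name} found on this boot
    default_order: device names in kernel enumeration order
    """
    if stored and all(disk_id in detected for disk_id in stored.values()):
        # Every stored identifier is present: follow the saved mapping.
        return {slot: detected[disk_id] for slot, disk_id in stored.items()}
    # No mapping, or a drive is missing or was replaced: operate as usual.
    return dict(enumerate(default_order))


stored = {0: "id-small-disk", 1: "id-large-disk"}
this_boot = {"id-small-disk": "nvme1n1", "id-large-disk": "nvme0n1"}
print(resolve_targets(stored, this_boot, ["nvme0n1", "nvme1n1"]))
print(resolve_targets({}, this_boot, ["nvme0n1", "nvme1n1"]))
```

The catch raised earlier in the thread still applies: the identifiers would have to be recorded per host, since each of the 20 laptops has its own.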