Dell 7730 precision laptop deploy GPT error message
-
These are all identical systems, so could this mean some part of the capture process is incorrect, or is something else going on?
One thing I've noticed is that under Windows the devices are disk 0 (500GB Linux) and disk 1 (1TB Windows), while under PXE boot they are nvme0n1 (1TB Windows) and nvme1n1 (500GB Linux). Not sure if that would make any difference; I'm thinking not, since it's UEFI.
Would specifying one of these NVMe drives as the Host Primary Disk make a difference?
-
@jmason Will you schedule a capture/deploy for both your master image computer and your target computer, but schedule it with the debug option.
PXE boot both the source and destination computers. You will enter debug mode and be dropped to a Linux command prompt. At the Linux command prompt, key in
lsblk
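If the lsblk build in the FOS debug environment supports output columns (most util-linux builds do, but treat that as an assumption on my part), a variant like the following also prints each drive's model and serial number, which makes it easier to tell the two physical disks apart regardless of which nvmeX name they were given on that boot:
lsblk -o NAME,SIZE,TYPE,MODEL,SERIAL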
Post the output from both systems. This will print the geometry of both the source and destination disk(s).
-
They are identical
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 953.9G  0 disk
|-nvme0n1p1 259:6    0   650M  0 part
|-nvme0n1p2 259:7    0   128M  0 part
|-nvme0n1p3 259:8    0 952.1G  0 part
`-nvme0n1p4 259:9    0   990M  0 part
nvme1n1     259:1    0   477G  0 disk
|-nvme1n1p1 259:2    0   128M  0 part
|-nvme1n1p2 259:3    0   200M  0 part
|-nvme1n1p3 259:4    0     1G  0 part
`-nvme1n1p4 259:5    0 475.6G  0 part
Interesting thing: right after my post, I attempted to add /dev/nvme1n1 as the Host Primary Disk, booted the task in debug mode, and got an error that it couldn't find the hard drives. I cancelled the task, set it to /dev/nvme0n1, and got the same message. I then cleared the Host Primary Disk field and restarted again. This time there was no Failed message. Before, I had been getting:
Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Done
Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Failed
but this time it was:
Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Done
Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Done
and the deploy started. It has completed deploying nvme0n1 (the Windows drive) and the partitions to nvme1n1 (the Linux drive).
Though it seems to be working, I am puzzled as to what changed, when all I did was change the Host Primary Disk parameter a few times; after setting it back to empty and restarting, it worked.
I guess it is possible that the machine I was deploying to had something misaligned somewhere, as I have been attempting to deploy to it over and over. But I have also been restoring the original image using Macrium and testing it before each FOG deploy attempt.
I will also now attempt to deploy again to the same laptop.
-
@jmason I can't give you a reference on this but it's actually a likely cause (one that I had not thought of before, grrrhhh): disk enumeration can put your two disks in reverse order. This is a known issue in Linux and is usually worked around through persistent block device naming.
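As a quick illustration of what I mean by persistent naming (assuming udev populates these symlinks in the FOS environment, which I have not verified): the kernel-assigned nvme0n1/nvme1n1 names can swap between boots, while the /dev/disk/by-id/ links are built from model and serial and therefore stay with the physical disk:
ls -l /dev/disk/by-id/ | grep nvme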
Try deploying a couple of times in a row, always using debug mode, and run
lsblk
before starting the task. See if it's exactly how we imagine it to be (the disk order changing).
-
@george1421 On the other hand, I am wondering why we have not had other people reporting this in the past. What if you have a PC with two drives, one for the OS and one for data? You only ever want to image the OS disk, but it could happen that you deploy to the data disk?! Just thinking out loud here.
-
Yes, I'll do that; when I just attempted to redeploy, the original error returned. These are also pretty new systems, so that could be a reason for not seeing it much before.
-
@jmason Yes, possibly (hopefully) this is more or less an issue specific to NVMe drives. Haha. Well, I'll keep my head spinning on how we could possibly solve this, as we have no influence on the order in which the Linux kernel enumerates your disks. We'd need to save the disk identifiers and store those with the image… I suppose.
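Just to sketch the idea (purely hypothetical, not something FOG does today): at capture time we could record each controller's serial number, and at deploy time resolve that serial back to whatever device name the kernel happened to assign on that boot, e.g. via sysfs:
# hypothetical sketch: WANTED_SERIAL would be stored with the image at capture time
WANTED_SERIAL="S1234567890"
for ctrl in /sys/class/nvme/nvme*; do
    # the serial may be space padded in sysfs, so strip whitespace before comparing
    if [ "$(tr -d ' ' < "$ctrl/serial")" = "$WANTED_SERIAL" ]; then
        # assumes a single namespace per controller (namespace 1)
        echo "serial $WANTED_SERIAL is /dev/$(basename "$ctrl")n1 on this boot"
    fi
done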
-
Looks like that is what it is doing. After the failed redeploy (which I didn't run in debug mode that time, of course) I ran it in debug and lsblk gives:
nvme0n1     259:0    0   477G  0 disk
nvme1n1     259:1    0 953.9G  0 disk
|-nvme1n1p1 259:2    0   128M  0 part
|-nvme1n1p2 259:3    0   200M  0 part
|-nvme1n1p3 259:4    0     1G  0 part
`-nvme1n1p4 259:5    0 475.6G  0 part
However, it did not hit the error this time and appears to be deploying again now, but I can't see that working for both disks' partitions with the mismatch… weird.
-
Well, if you need any testing of anything, just let me know. I'll be more than happy to run things on these systems.
-
@jmason Thanks for testing. I'll see what I can do for you. I guess it will take a bit of time to figure something out.
-
@jmason Hmmmm, the more I read the less I think we can do something about it. This is not something FOG or the Linux kernel is doing wrong. It's more or less a matter of how the Dell UEFI firmware hands back the NVMe drive information to the Linux kernel: one boot it's this way round and the next boot it might be the other way. When installing an OS on disk this is not much of an issue, because the partitions have UUIDs and labels and those can be used to identify which partition to mount for booting the OS. But in the case of cloning we have laptops with different physical disks (and identifiers), so there is no way we can use that information.
Possibly we could save the sector (or disk) size information in the case of “Multiple Partition Image - All Disks (Not Resizable)”, but then what happens if someone comes along with two identically sized disks in their machine?
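Roughly what I have in mind, as a sketch only (the variable name and value are placeholders, not anything FOG stores today): match the target disk by the byte size recorded with the image.
# pick whichever NVMe disk matches the size recorded at capture time
SAVED_BYTES=1024209543168   # placeholder value
lsblk -dbno NAME,SIZE /dev/nvme?n1 | awk -v want="$SAVED_BYTES" '$2 == want { print "/dev/" $1 }'
As said, this falls apart as soon as both disks report the same size, because the match is no longer unique.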
Hmmm, need more time to think about this. @george1421 @Wayne-Workman any ideas from your side?
Edit: By the way… I can imagine this being an issue when capturing the image as well. One time d1p* comes from disk A and d2p* from disk B, and the next time it's the reverse.
-
@Sebastian-Roth (this is more of a brain dump than an answer)
- Do we have empirical evidence that these disks are being swapped as being reported by the uefi bios? It would be a bit more telling if that second disk (for debugging purposes) could be exchanged for a different size disk, then the test run again. I might see the order being swapped between models of computers, but not on the same computer depending on the boot. I might think this is an oddity in the uefi firmware.
- The Precision 7730 generation is pretty new, so the first thing I would check/watch for is firmware update availability (see the sketch after this list).
- Do we know if this issue is model or machine specific?
- It could also be a linux kernel issue where one of the drives may init faster/slower than the other, so it's detected by the linux OS at different times.
- It would be interesting to compare the FOS boot logs between the two states to see if there are any telling events. But again, trying to get it to break and knowing when it's broken is the hardest part.
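A quick way to check the installed firmware level from the FOS debug shell, assuming dmidecode is included in the init (it usually is, since it is used for hardware inventory, but verify):
dmidecode -s system-product-name
dmidecode -s bios-version
Compare the reported BIOS version against the latest one Dell lists for the Precision 7730.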
-
@george1421 said in Dell 7730 precision laptop deploy GPT error message:
Do we have empirical evidence that these disks are being swapped as being reported by the uefi bios?
See @jmason’s
lsblk
listings. From my point of view this is evidence enough. The disks are different sizes and do swap. As far as I understood the postings, the output was always taken on the same deploy system. One time the 477 GB drive is last and the other time it's first.
The Precision 7730 generation is pretty new, so the first thing I would check/watch for is firmware update availability.
Definitely a good point!!
It would be interesting to compare the FOS boot logs between the two states to see if there are any telling events.
Good one as well! @jmason Can you please schedule a debug deploy job. Boot that machine and run
dmesg | grep -i nvm
Take a picture and
reboot
the machine. When you are back to the shell, run
dmesg | grep -i nvm
again and take a picture. Do this maybe ten times to see if we see a difference there.
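If the pictures get unwieldy, the part worth comparing between boots is which PCI function each controller gets bound to; assuming the kernel prints the usual NVMe probe line, this narrows the output down:
dmesg | grep -iE 'nvme[0-9]+: pci function'
The dmesg timestamps also show which controller was probed first on that boot.
-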
It would also be interesting to see if the 4.15.2 kernels give us the same random results, to see whether this randomness is Linux kernel related or not. (Actually I'd like to push it earlier than 4.13.x, but the inits would get in the way, because we had an issue with kernels after that and the Dell Precision swappable NVMe drives, which has since been fixed.) I'm not really sure what this will tell us other than whether the problem was introduced in later kernels.
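If you do test different kernels, it's worth confirming from the debug shell which kernel actually booted before drawing conclusions:
uname -r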
-
I ran it 10 times and noticed a slight difference on the 6th and 9th time.
-
@Sebastian-Roth said in Dell 7730 precision laptop deploy GPT error message:
@george1421 said in Dell 7730 precision laptop deploy GPT error message:
Do we have empirical evidence that these disks are being swapped as being reported by the uefi bios?
See @jmason’s
lsblk
listings. From my point of view this is evidence enough. The disks are different sizes and do swap. As far as I understood the postings, the output was always taken on the same deploy system. One time the 477 GB drive is last and the other time it's first.
I have indeed been working with only one laptop for the deploy attempts and only one laptop from which the capture image was created, and both are identical machines. I have 20 of them in total.
-
@george1421 said in Dell 7730 precision laptop deploy GPT error message:
@Sebastian-Roth (this is more of a brain dump than an answer)
The Precision 7730 generation is pretty new, so the first thing I would check/watch for is firmware update availability.
There is a BIOS update with the following fixes, but I don't see anything related to our issue. I can go ahead and update the system with all the latest fixes available if requested.
Dell Support for Precision 7730
- Fixes the issue where the mouse lags when the Dell TB16 dock is unplugged or plugged in.
- Fixes the issue where the system cannot set hard drive password with Dell Client Configuration Toolkit.
- Fixes the issue where the system always boots to Rufus formatted USB drives instead of internal hard drive.
- Improves system performance under heavy load when connected to Dell TB18DC Dock.
-
@jmason said in Dell 7730 precision laptop deploy GPT error message:
There is a BIOS update with the following fixes, but don’t see anything related to our issue. I can go ahead and update the system with all the latest fixes available if requested.
I would do this no matter what, even though the changelog shows the fixes primarily dealing with the USB-C dock.
-
@jmason First let me say, thank you for being so detailed and helping debug this issue. I know it takes quite a bit of time to do this iterative testing, so thank you.
So I see from the pictures you have a 20% rate where it looks like NVMe disk 1 inits before disk 0. Can we correlate the 6th and 9th runs with the swap as shown by
lsblk
? I also see from the pictures that the PCI addresses for the NVMe disks are not changing, at least the location vs. name. My intuition is telling me that this problem is probably rooted in the Linux kernel and/or a hardware/Linux-kernel race condition.
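For the record, the name-to-PCI-address mapping can be dumped straight from sysfs in the debug shell, without relying on the photos; a small sketch:
# print the full sysfs path of each NVMe namespace; the PCI function
# (e.g. 0000:3b:00.0) appears in the path regardless of the nvmeX name
for d in /sys/block/nvme*n1; do
    readlink -f "$d"
done
If the PCI portion of the path stays the same while the nvmeX number moves between the two addresses, that would back up the race-condition theory.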
-
Here are two images showing the correlation. I only had to attempt the deploy twice this time to get the difference to appear.
Well, I just ran it a third time and got another difference, but it appears it might be just like the first image displayed, only with the output in a different order.