Dell 7730 precision laptop deploy GPT error message

jmason

@Sebastian-Roth said in Dell 7730 precision laptop deploy GPT error message:

Let’s try to tackle this. Please schedule another debug deploy task. Start up the client and fire up fog when you get to the shell. Step through and when you are back to the console after the error please type sgdisk -gl /images/DELL7730_Win10_Centos7/d2.mbr /dev/nvme1n1 (most probably returns an error as well. Please take a picture or copy&paste the error message if you are connected via SSH to the client)

Now as I mentioned I am not a linux person, so I typed fog at the prompt after booting into debug.

I get:

####
#   An Error has been detected !
####

Fatal Error: Unknown request type :: Null

Kernel variables and settings:
bzImage loglevel=4 initrd=init.xz root=/dev/ram0 rw ramdisk_size-127000 web=http://192.168.0.1/fog/ consoleblank=0 rootfstype=ext4 shutdown=1 mac=macaddressoflaptop ftp=192.168.0.1 storage=192.168.0.1:images/dev storageip=192.168.0.1 osid=50 irqpoll hostname=mylaptop isdebug=yes shutdown=1
 * Press [Enter] key to continue

Then back to command prompt #

Is there something I’m missing to run deploy in debug mode?

Looks like I have to type in more than just fog to get the deploy to run in debug.

I found these directions but they are about 3 years old:

https://wiki.fogproject.org/wiki/index.php/Debug_Mode#Deploy_Debug

Are these still correct for the current version?

Sebastian Roth

@jmason How did you schedule the task? Just go to the host’s settings in the web UI, click Basic Tasks, then Deploy, and just before you create the task, tick the checkbox for debug. This way it is very similar to a non-debug deploy; the difference is that you have to start it manually (by running the very simple command fog, which should work if you schedule the task as described!) and you are asked to step through the whole process instead of it running without interaction. Give it a try.

If you have scheduled the task as described already, then I am wondering if you do PXE boot the laptop or USB boot?!

jmason

I just overlooked the debug checkbox…running now

jmason

After stepping through the debug deploy as requested, I entered sgdisk -gl /images/DELL7730_Win10_Centos7/d2.mbr /dev/nvme1n1 and got some interesting output:

Creating new GPT entries.
Warning! Current disk size doesn't match that of the backup!
Adjusting sizes to match, but subsequent problems are possible!

Warning! Secondary partition table overlaps the last partition by 1000160625 blocks!
You will need to delete this partition or resize it in another utility

Problem: partition 3 is too big for the disk.

Problem: partition 4 is too big for the disk.
Aborting write operation!
Aborting write of new partition table.

Sebastian Roth

@jmason Well seems clear enough to me, is the source disk larger than the destination disk?
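
For scale, the overlap that sgdisk reports can be converted into a size. A minimal sketch, assuming the count is in 512-byte logical sectors (the output does not state the block size); blocks_to_gib is a made-up helper name, not part of FOG or sgdisk:

```shell
#!/bin/sh
# Convert a count of 512-byte blocks to whole GiB (integer division).
# Assumption: sgdisk is counting 512-byte logical sectors on these drives.
blocks_to_gib() {
    echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}

blocks_to_gib 1000160625   # prints 476
```

That is roughly the whole capacity of a 500GB-class drive, which would be consistent with a partition table taken from the 1TB disk being written to the 500GB disk.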

jmason

These are all identical systems, so could this mean some part of the capture process is possibly incorrect or something else?

One thing I’ve noticed is that under Windows the devices are disk 0 (500GB Linux) and disk 1 (1TB Windows), while under PXE boot they are nvme0n1 (1TB Windows) and nvme1n1 (500GB Linux). Not sure if that would make any difference; I’m thinking not, since it’s UEFI.

Would specifying one of these nvme drives as the Host Primary Disk make a difference?

george1421

@jmason Will you schedule a capture/deploy on both your master image computer and your target computer, with the debug option ticked?

PXE boot both the source and destination computers. You will enter debug mode and be dropped to a Linux command prompt. At the Linux command prompt, key in lsblk and post the output from both systems. This will print the geometry of both the source and destination disk(s).
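
lsblk identifies disks by kernel name and size only. A minimal sketch, assuming the debug shell exposes the usual sysfs attributes, that also prints each NVMe disk's serial number so two boots can be compared disk by disk. The function name list_nvme_disks and its optional directory argument are made up for illustration:

```shell
#!/bin/sh
# Print "name size_in_bytes serial" for every NVMe disk listed in sysfs.
# $1 is the sysfs block directory (defaults to /sys/block); it is a
# parameter only so the function can be tried against a test directory.
list_nvme_disks() {
    sys_block=${1:-/sys/block}
    for dev in "$sys_block"/nvme*n*; do
        [ -r "$dev/size" ] || continue
        name=${dev##*/}
        sectors=$(cat "$dev/size")        # sysfs reports 512-byte units
        if [ -r "$dev/device/serial" ]; then
            serial=$(tr -d ' \n' < "$dev/device/serial")
        else
            serial=unknown
        fi
        echo "$name $(( sectors * 512 )) $serial"
    done
}

list_nvme_disks
```

If the same serial number shows up under nvme0n1 on one boot and nvme1n1 on the next, the kernel names have swapped.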

jmason

@george1421

They are identical

NAME            MAJ:MIN RM    SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 953.9G  0 disk           
|-nvme0n1p1 259:6    0   650M  0 part
|-nvme0n1p2 259:7    0   128M  0 part
|-nvme0n1p3 259:8    0 952.1G  0 part
`-nvme0n1p4 259:9    0   990M  0 part
nvme1n1     259:1    0   477G  0 disk           
|-nvme1n1p1 259:2    0   128M  0 part
|-nvme1n1p2 259:3    0   200M  0 part
|-nvme1n1p3 259:4    0     1G  0 part
`-nvme1n1p4 259:5    0 475.6G  0 part

Interesting thing: right after my post, I set /dev/nvme1n1 as the Host Primary Disk and booted the task in debug mode, but got an error that it couldn’t find the hard drives. I cancelled the task, set it to /dev/nvme0n1 instead, and got the same message. Then I cleared the Host Primary Disk field and restarted again. This time there was no Failed message. Instead of

Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Done
Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Failed

I got

Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Done
Erasing current MBR/GPT Tables …Done
Restoring Partition Tables (GPT)…Done

and the deploy started. It has completed deploying nvme0n1 (Windows drive) and the partitions to nvme1n1 (Linux drive).

Though it seems to be working, I am puzzled as to what changed and when. All I did was change the Host Primary Disk parameter a few times; after setting it back to empty and restarting, it worked.

I guess it is possible that the machine I am deploying to had something misaligned somewhere, as I have been attempting to deploy to it over and over. But I have also been restoring the original image using Macrium and testing it before each FOG deploy attempt.

Will also now attempt to deploy again to the same laptop.

Sebastian Roth

@jmason I can’t give you a reference on this, but it’s actually a likely cause (one that I had not thought of before, grrrhhh) that disk enumeration can put your two disks in reverse order. This is known in Linux and usually worked around through persistent block device naming.

Try deploying a couple of times in a row always using the debug mode and run lsblk before starting the task. See if it’s exactly how we imagine it to be (changing disk order).
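
On a regular Linux system, persistent naming means going through the symlinks in /dev/disk/by-id, which stay attached to the same physical drive whichever kernel name it gets. A minimal sketch, assuming the boot environment populates /dev/disk/by-id at all (the thread does not confirm this for the FOG debug shell); resolve_by_id and the serial number in the usage line are made up for illustration:

```shell
#!/bin/sh
# Resolve a persistent by-id name to the current kernel device node.
# $1: substring of the link name, e.g. the drive's serial number
# $2: directory holding the links (defaults to /dev/disk/by-id)
resolve_by_id() {
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/nvme-*"$1"*; do
        [ -L "$link" ] || continue
        # Skip the per-partition links (names ending in -partN).
        case ${link##*/} in *-part[0-9]*) continue ;; esac
        readlink -f "$link"
        return 0
    done
    return 1
}

# Hypothetical usage on the client:
# resolve_by_id FAKESERIAL01    # might print /dev/nvme0n1 or /dev/nvme1n1
```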

Sebastian Roth

@george1421 On the other hand, I am wondering why we have not had other people reporting this in the past. What if you have a PC with two drives, one for the OS and one for data? You only ever want to image the OS disk, but it could happen that you deploy to the data disk?! Just thinking out loud here.

jmason

Yes I’ll do that, as when I just attempted to redeploy the original error returned. These are also pretty new systems so that could be a reason for not seeing it much before.

Sebastian Roth

@jmason Yes, possibly (hopefully) this is something being more or less an issue of NVMe drives. Haha. Well, I’ll keep my head spinning on how we could possibly solve this as we have no influence on the order the Linux kernel enumerates your disks. We’d need to save disk identifier and store those with the image… I suppose.

jmason

@Sebastian-Roth

Looks like that is what it is doing. After the failed redeploy (didn’t run in debug that time, of course) I ran it in debug and lsblk gives:

nvme0n1     259:0    0   477G  0 disk    
nvme1n1     259:1    0 953.9G  0 disk           
|-nvme1n1p1 259:2    0   128M  0 part
|-nvme1n1p2 259:3    0   200M  0 part
|-nvme1n1p3 259:4    0     1G  0 part
`-nvme1n1p4 259:5    0 475.6G  0 part

However, it did not hit the error this time and appears to be deploying again now, but I can’t see that working for both partitions with the mismatch… weird.

jmason

@Sebastian-Roth

Well if you need any testing of anything just let me know, I’ll be more than happy to run things on these systems

Sebastian Roth

@jmason Thanks for testing. I’ll see what I can do for you. Guess I will take a bit of time to figure something out.

Sebastian Roth

@jmason Hmmmm, the more I read, the less I think we can do something about it. This is not something FOG or the Linux kernel is doing wrong. It’s more or less a matter of how the Dell UEFI firmware hands back the NVMe drive information to the Linux kernel. One boot it’s this way round and the next boot it might be the other way. When installing an OS on disk this is not much of an issue, because you have partitions with UUIDs and labels on the disks and those can be used to identify which partition to mount for booting the OS. But in the case of cloning we have laptops with different physical disks (and identifiers), so there is no way we can use that information.

Possibly we could save the sector (or disk) size information in case of “Multiple Partition Image - All Disks (Not Resizable)” but then what happens if someone comes along with two identical size disks in their machines?
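
The size-matching idea, including the equal-size problem, can be sketched as follows. This is only an illustration of the reasoning, not how FOG works: pick_disk_by_size is a made-up name, and the byte sizes in the usage lines are hypothetical values for a 500GB-class and a 1TB-class drive.

```shell
#!/bin/sh
# Read "name size_in_bytes" lines on stdin and print the one disk whose
# size equals $1. Fail when no disk, or more than one disk, has that size.
pick_disk_by_size() {
    awk -v want="$1" '
        $2 == want { name = $1; n++ }
        END {
            if (n == 1) { print name; exit 0 }
            printf "cannot pick a disk: %d candidates of size %s\n", n, want > "/dev/stderr"
            exit 1
        }'
}

# Different sizes: the match is unambiguous.
printf 'nvme0n1 512110190592\nnvme1n1 1024209543168\n' |
    pick_disk_by_size 1024209543168    # prints nvme1n1

# Identical sizes: the function refuses to guess and returns non-zero.
printf 'nvme0n1 512110190592\nnvme1n1 512110190592\n' |
    pick_disk_by_size 512110190592 || echo "ambiguous"
```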

Hmmm, need more time to think about this. @george1421 @Wayne-Workman any ideas from your side?

Edit: By the way… I can imagine this being an issue when capturing the image as well. One time d1p* are from disk A and d2p* from disk B and next time it’s in reverse.

george1421

@Sebastian-Roth (this is more of a brain dump than an answer)
Do we have empirical evidence that these disks are being swapped as being reported by the uefi bios? It would be a bit more telling if that second disk (for debugging purposes) could be exchanged for a different size disk, then run the test again. I might see the order being swapped between models of computers, but not the same computer depending on the boot. I might think this is an oddity in the uefi firmware. The Precision 7730 generation is pretty new, so the first thing I would check/watch for is firmware update availability.

Do we know if this issue is model or machine specific? It could also be a Linux kernel issue where one of the drives may init faster/slower than the other, so it’s detected by the Linux OS at different times. It would be interesting to compare the FOS boot logs between the two states to see if there are any telling events. But again, trying to get it to break and knowing when it’s broken is the hardest part.

Sebastian Roth

@george1421 said in Dell 7730 precision laptop deploy GPT error message:

Do we have empirical evidence that these disks are being swapped as being reported by the uefi bios?

See @jmason’s lsblk listings. From my point of view this is evidence enough. The disks are different sizes and do swap. As far as I understand the posts, the output was always taken on the same deploy system. One time the 477 GB drive is last and the other time it’s first.

The Precision 7730 generation is pretty new, so the first thing I would check/watch for is firmware update availability.

Definitely a good point!!

It would be interesting to compare the FOS boot logs between the two states to see if there are any telling events.

Good one as well! @jmason Can you please schedule a debug deploy job? Boot that machine and run dmesg | grep -i nvm. Take a picture and reboot the machine. When you are back at the shell, run dmesg | grep -i nvm again and take a picture. Do this maybe ten times to see if there is a difference.
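
To make the ten listings easier to compare, the dmesg output can be reduced to which controller name landed on which PCI address. A minimal sketch, assuming the kernel in use logs lines of the form "nvme nvme0: pci function 0000:3d:00.0" (the PCI addresses in this example are made up); nvme_probe_order is a hypothetical helper name:

```shell
#!/bin/sh
# Read dmesg text on stdin and print "controller pci_address" pairs in
# the order the kernel logged them.
nvme_probe_order() {
    sed -n 's/.*nvme \(nvme[0-9][0-9]*\): pci function \([0-9a-fA-F:.]*\).*/\1 \2/p'
}

# On the client, in the debug shell:
# dmesg | nvme_probe_order
```

If nvme0 sits on a different PCI address from one boot to the next, the swap happens when the controllers are probed rather than later in the deploy scripts.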

george1421

It would also be interesting to see if the 4.15.2 kernels give us the same random results, to find out whether this randomness is Linux kernel related or not. (Actually I’d like to push it earlier than 4.13.x, but the inits would get in the way, because we had issues with kernels after that and the Dell Precision swappable NVMe drives, which have since been fixed.) I’m not really sure what this will tell us other than whether the problem was introduced in later kernels.

jmason

@Sebastian-Roth

I ran it 10 times and noticed a slight difference on the 6th and 9th time.
