Error Restoring GPT Partition Tables
-
@Quazz Right, since this is now proven to happen on more than one hardware type, this is something the Linux kernel developers should be working on. From the nvme printout we can clearly see the drives changing location. The Linux kernel developers may get around this by scanning the nvme drives, looking for a disk that has the boot blocks on it, and picking that one for disk 0. In FOS’ case we aren’t trying to boot from the media. But I’m only guessing here. We surely need to do a bunch more research now that we have detailed info.
-
@george1421 As far as I understand, a normal Linux install simply uses UUIDs to find its disks instead of relying on the device order (the order itself is still not predictable: if any device isn’t working properly, the devices that follow it move up and take the preceding names)
e.g. with 3 disks:
/dev/sda /dev/sdb /dev/sdc
Disk /dev/sdb fails to load properly:
/dev/sda /dev/sdb (was /dev/sdc previously!)
Since no such data is ever stored on FOS, it’s basically a race between the devices and whoever wins is the one on top.
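To make the renaming in the example above concrete, here is a minimal sketch. It is only a toy model, not how the kernel actually works internally: `assign_names` and the `disk-A`/`disk-B`/`disk-C` labels are made up for illustration, and the only assumption is that names are handed out in the order the devices show up.

```python
def assign_names(probe_order, prefix="sd"):
    """Hand out sda, sdb, ... in the order the disks became available.

    Toy model: the name depends only on the position in the probe order,
    not on which physical disk it is.
    """
    return {f"{prefix}{chr(ord('a') + i)}": disk for i, disk in enumerate(probe_order)}


# All three disks come up in the usual order.
all_present = assign_names(["disk-A", "disk-B", "disk-C"])

# The second disk fails to load, so the third one moves up and becomes sdb.
one_failed = assign_names(["disk-A", "disk-C"])
```

In this model `one_failed["sdb"]` is the disk that used to be `sdc`, which is the same effect as in the /dev/sdb example.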
-
@george1421 Created an issue over at GitHub:
https://github.com/FOGProject/fos/issues/27
This one will need some consideration I think, it’s not really straightforward.
-
@tlehrian Ok I have a few more tasks for you, well kind of.
- Does this Lenovo computer have the latest firmware updates installed?
- Do this Lenovo computer’s nvme drives have the latest firmware installed? (You should be able to get disk firmware for this model of computer from the Lenovo site.)
- How many reboots did it take to get the drive order to flip? Did you have to do a hard boot to get them to flip, or was just a reboot enough?
- Does this laptop have legacy (bios) mode? If it does, do these drives change order in bios mode?
So right now it’s not clear in my head if it’s a hardware, firmware (bios), or Linux kernel error.
-
@george1421 After reading up a bit more, it seems the conclusion is that it’s not so much about the disks themselves, but rather that the assignment is arbitrary when they’re connected to different controllers. With NVME that is always the case, since every drive has its own controller; with SATA (on most modern consumer grade hardware at least) it usually is not.
-
@george1421 Ok. This is on an HP Z2 G4 workstation.
-
and
-
I’m not sure about the firmware, but will check and let you know which firmware was installed. This will probably be tomorrow as there is a scheduled power outage for our campus tomorrow morning and we’ve shut these machines down for the day in preparation.
-
It took 2 reboots to switch, and did not require a hard reboot.
-
The BIOS does have legacy mode…I have not tried this in legacy mode, mainly as we are moving our dual-boot setup using GRUB2 to EFI on boot. I’ve never booted this machine in legacy mode. Secure boot is disabled.
-
-
@george1421 I forgot I still had one of these up and running. It looks like the firmware for the MOBO is not at latest version, and the BIOS is also not at latest version. Not sure about NVME firmware. I think this is the 981 which is an OEM version of the Samsung 970, but not sure if the same firmware applies.
-
@tlehrian What I’m trying to rule out is whether it’s a Lenovo thing that is causing the drives to switch. I don’t think it is, but I wanted to rule out hardware (or stale firmware) as the culprit.
As for the question about bios (legacy) mode: I wanted to see, with the same physical hardware, whether the devices swap location in bios mode too. That would tell us if the swapping is related to the uefi firmware.
Right now I don’t know where the issue is, so I’m trying to rule out where the problem isn’t first, to get the number of possibilities down.
-
@tlehrian Is the initial issue solved at all? I seem to have lost track of that part of the discussion.
-
@Sebastian-Roth Thanks for asking!
The initial issue was that we were receiving the error restoring GPT partition tables sometimes on these particular machines. We have two NVME M.2 drives on these machines, that we now know are becoming available seemingly randomly based upon a race condition as to which one becomes available to the OS first. Our drives happen to be different sizes (256GB/512GB), and we want the larger drive to be the primary drive with the OS installed, and the smaller drive to be an extra storage drive. I have set up an image as Multiple Partition - All Disks (not resizeable) to deploy to these machines. Also, these machines are booting UEFI, without secure boot.
If the correct drive wins the race, the image goes through swimmingly. If it doesn’t, I get the error that started this thread as the drive it’s trying to restore the partition tables to is not large enough to hold the image.
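As a rough illustration of why the restore only works when the right drive wins, the sketch below picks the target by size instead of by name. It is not what FOS does today; it assumes the per-disk `size` file in sysfs (reported in 512-byte units), and `disk_sizes`, `pick_target` and `required_bytes` are hypothetical names, with `required_bytes` assumed to be the size of the disk the non-resizable image was captured from.

```python
from pathlib import Path

# sysfs reports block device sizes in 512-byte units.
SECTOR_BYTES = 512


def disk_sizes(sys_block=Path("/sys/block"), prefix="nvme"):
    """Return {device_name: size_in_bytes} for whole disks matching prefix."""
    sizes = {}
    for entry in sorted(sys_block.glob(f"{prefix}*")):
        size_file = entry / "size"
        if size_file.is_file():
            sizes[entry.name] = int(size_file.read_text().strip()) * SECTOR_BYTES
    return sizes


def pick_target(sizes, required_bytes):
    """Return the smallest disk that is at least required_bytes, or None."""
    candidates = [(size, name) for name, size in sizes.items() if size >= required_bytes]
    return min(candidates)[1] if candidates else None
```

With the two drives from this thread (256060514304 and 512110190592 bytes) and an image taken from the larger one, `pick_target` would return whichever name the 512GB drive happens to have on that boot, and `None` if only the small drive is present.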
So, long story short, no. But that’s OK because now I know what’s causing the issue, and although it’s a bit annoying, I can work around it knowing that it’s not a larger hardware issue. And I’m happy to help troubleshoot to try to get to a resolution for it, as it would benefit me greatly as this setup composes about 25% of my lab deployments at present.
-
@george1421 Sure. I understand about narrowing down the possibilities. I can switch one of these to legacy and see what happens with the drives. To be clear, these are HP workstations with Samsung NVME M.2 drives. Are either of those companies related to Lenovo in some way?
-
@tlehrian said in Error Restoring GPT Partition Tables:
Are either of those companies related to Lenovo in some way?
Sorry I’m working on too many threads at the moment. I got the threads mixed up with similar issues.
-
@george1421 No worries. Just making sure there wasn’t something you knew that I didn’t
-
@george1421 I placed one of these machines in legacy mode to see if the drives would exhibit the same behavior and ran the tests you prescribed earlier. Indeed, after two reboots, they did switch, so we can probably rule BIOS type out. Here are the outputs:
State 1:
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 238.5G  0 disk
nvme1n1     259:1    0   477G  0 disk
|-nvme1n1p1 259:2    0   499M  0 part
|-nvme1n1p2 259:3    0   100M  0 part
|-nvme1n1p3 259:4    0    16M  0 part
|-nvme1n1p4 259:5    0 341.2G  0 part
`-nvme1n1p5 259:6    0 135.1G  0 part

> nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S499NX0M113634       SAMSUNG MZVLB256HAHQ-000H2               1         2.95 GB / 256.06 GB        512 B + 0 B      EXD71HAQ
/dev/nvme1n1     S498NA0M403426       SAMSUNG MZVLB512HAJQ-000H2               1         149.49 GB / 512.11 GB      512 B + 0 B      EXA71HAQ

> nvme id-ctrl /dev/nvme0n1 -H
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S499NX0M113634
mn      : SAMSUNG MZVLB256HAHQ-000H2
fr      : EXD71HAQ
rab     : 2
ieee    : 002538
cmic    : 0
  [2:2] : 0 PCI
  [1:1] : 0 Single Controller
  [0:0] : 0 Single Port
mdts    : 9
cntlid  : 4
ver     : 10200
rtd3r   : 186a0
rtd3e   : 7a1200
oaes    : 0
  [8:8] : 0 Namespace Attribute Changed Event Not Supported
oacs    : 0x17
  [15:4] : 0x1 Reserved
  [3:3] : 0 NS Management and Attachment Not Supported
  [2:2] : 0x1 FW Commit and Download Supported
  [1:1] : 0x1 Format NVM Supported
  [0:0] : 0x1 Sec. Send and Receive Supported
acl     : 7
aerl    : 7
frmw    : 0x16
  [4:4] : 0x1 Firmware Activate Without Reset Supported
  [3:1] : 0x3 Number of Firmware Slots
  [0:0] : 0 Firmware Slot 1 Read/Write
lpa     : 0x3
  [1:1] : 0x1 Command Effects Log Page Supported
  [0:0] : 0x1 SMART/Health Log Page per NS Supported
elpe    : 255
npss    : 4
avscc   : 0x1
  [0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta   : 0x1
  [0:0] : 0x1 Autonomous Power State Transitions Supported
wctemp  : 354
cctemp  : 355
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 256060514304
unvmcap : 0
rpmbs   : 0
  [31:24]: 0 Access Size
  [23:16]: 0 Total Size
  [5:3] : 0 Authentication Method
  [2:0] : 0 Number of RPMB Units
sqes    : 0x66
  [7:4] : 0x6 Max SQ Entry Size (64)
  [3:0] : 0x6 Min SQ Entry Size (64)
cqes    : 0x44
  [7:4] : 0x4 Max CQ Entry Size (16)
  [3:0] : 0x4 Min CQ Entry Size (16)
nn      : 1
oncs    : 0x1f
  [5:5] : 0 Reservations Not Supported
  [4:4] : 0x1 Save and Select Supported
  [3:3] : 0x1 Write Zeroes Supported
  [2:2] : 0x1 Data Set Management Supported
  [1:1] : 0x1 Write Uncorrectable Supported
  [0:0] : 0x1 Compare Supported
fuses   : 0
  [0:0] : 0 Fused Compare and Write Not Supported
fna     : 0
  [2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
  [1:1] : 0 Crypto Erase Applies to Single Namespace(s)
  [0:0] : 0 Format Applies to Single Namespace(s)
vwc     : 0x1
  [0:0] : 0x1 Volatile Write Cache Present
awun    : 1023
awupf   : 0
nvscc   : 1
  [0:0] : 0x1 NVM Vendor Specific Commands uses NVMe Format
acwu    : 0
sgls    : 0
  [0:0] : 0 Scatter-Gather Lists Not Supported
subnqn  :
ps 0 : mp:7.02W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-

> nvme id-ctrl /dev/nvme1n1 -H
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S498NA0M403426
mn      : SAMSUNG MZVLB512HAJQ-000H2
fr      : EXA71HAQ
rab     : 2
ieee    : 002538
cmic    : 0
  [2:2] : 0 PCI
  [1:1] : 0 Single Controller
  [0:0] : 0 Single Port
mdts    : 9
cntlid  : 4
ver     : 10200
rtd3r   : 186a0
rtd3e   : 7a1200
oaes    : 0
  [8:8] : 0 Namespace Attribute Changed Event Not Supported
oacs    : 0x17
  [15:4] : 0x1 Reserved
  [3:3] : 0 NS Management and Attachment Not Supported
  [2:2] : 0x1 FW Commit and Download Supported
  [1:1] : 0x1 Format NVM Supported
  [0:0] : 0x1 Sec. Send and Receive Supported
acl     : 7
aerl    : 7
frmw    : 0x16
  [4:4] : 0x1 Firmware Activate Without Reset Supported
  [3:1] : 0x3 Number of Firmware Slots
  [0:0] : 0 Firmware Slot 1 Read/Write
lpa     : 0x3
  [1:1] : 0x1 Command Effects Log Page Supported
  [0:0] : 0x1 SMART/Health Log Page per NS Supported
elpe    : 255
npss    : 4
avscc   : 0x1
  [0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta   : 0x1
  [0:0] : 0x1 Autonomous Power State Transitions Supported
wctemp  : 354
cctemp  : 355
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 512110190592
unvmcap : 0
rpmbs   : 0
  [31:24]: 0 Access Size
  [23:16]: 0 Total Size
  [5:3] : 0 Authentication Method
  [2:0] : 0 Number of RPMB Units
sqes    : 0x66
  [7:4] : 0x6 Max SQ Entry Size (64)
  [3:0] : 0x6 Min SQ Entry Size (64)
cqes    : 0x44
  [7:4] : 0x4 Max CQ Entry Size (16)
  [3:0] : 0x4 Min CQ Entry Size (16)
nn      : 1
oncs    : 0x1f
  [5:5] : 0 Reservations Not Supported
  [4:4] : 0x1 Save and Select Supported
  [3:3] : 0x1 Write Zeroes Supported
  [2:2] : 0x1 Data Set Management Supported
  [1:1] : 0x1 Write Uncorrectable Supported
  [0:0] : 0x1 Compare Supported
fuses   : 0
  [0:0] : 0 Fused Compare and Write Not Supported
fna     : 0
  [2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
  [1:1] : 0 Crypto Erase Applies to Single Namespace(s)
  [0:0] : 0 Format Applies to Single Namespace(s)
vwc     : 0x1
  [0:0] : 0x1 Volatile Write Cache Present
awun    : 1023
awupf   : 0
nvscc   : 1
  [0:0] : 0x1 NVM Vendor Specific Commands uses NVMe Format
acwu    : 0
sgls    : 0
  [0:0] : 0 Scatter-Gather Lists Not Supported
subnqn  :
ps 0 : mp:7.02W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-
After one reboot, the lsblk command reversed the order of the listing, but still had the drives identified correctly:
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0   477G  0 disk
|-nvme1n1p1 259:2    0   499M  0 part
|-nvme1n1p2 259:3    0   100M  0 part
|-nvme1n1p3 259:4    0    16M  0 part
|-nvme1n1p4 259:5    0 341.2G  0 part
`-nvme1n1p5 259:6    0 135.1G  0 part
nvme0n1     259:1    0 238.5G  0 disk
(I did not run the nvme commands in this state)
After one more reboot, the drives switched:
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0   477G  0 disk
|-nvme0n1p1 259:2    0   499M  0 part
|-nvme0n1p2 259:3    0   100M  0 part
|-nvme0n1p3 259:4    0    16M  0 part
|-nvme0n1p4 259:5    0 341.2G  0 part
`-nvme0n1p5 259:6    0 135.1G  0 part
nvme1n1     259:1    0 238.5G  0 disk

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S498NA0M403426       SAMSUNG MZVLB512HAJQ-000H2               1         149.49 GB / 512.11 GB      512 B + 0 B      EXA71HAQ
/dev/nvme1n1     S499NX0M113634       SAMSUNG MZVLB256HAHQ-000H2               1         2.95 GB / 256.06 GB        512 B + 0 B      EXD71HAQ

> nvme id-ctrl /dev/nvme0n1 -H
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S498NA0M403426
mn      : SAMSUNG MZVLB512HAJQ-000H2
fr      : EXA71HAQ
rab     : 2
ieee    : 002538
cmic    : 0
  [2:2] : 0 PCI
  [1:1] : 0 Single Controller
  [0:0] : 0 Single Port
mdts    : 9
cntlid  : 4
ver     : 10200
rtd3r   : 186a0
rtd3e   : 7a1200
oaes    : 0
  [8:8] : 0 Namespace Attribute Changed Event Not Supported
oacs    : 0x17
  [15:4] : 0x1 Reserved
  [3:3] : 0 NS Management and Attachment Not Supported
  [2:2] : 0x1 FW Commit and Download Supported
  [1:1] : 0x1 Format NVM Supported
  [0:0] : 0x1 Sec. Send and Receive Supported
acl     : 7
aerl    : 7
frmw    : 0x16
  [4:4] : 0x1 Firmware Activate Without Reset Supported
  [3:1] : 0x3 Number of Firmware Slots
  [0:0] : 0 Firmware Slot 1 Read/Write
lpa     : 0x3
  [1:1] : 0x1 Command Effects Log Page Supported
  [0:0] : 0x1 SMART/Health Log Page per NS Supported
elpe    : 255
npss    : 4
avscc   : 0x1
  [0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta   : 0x1
  [0:0] : 0x1 Autonomous Power State Transitions Supported
wctemp  : 354
cctemp  : 355
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 512110190592
unvmcap : 0
rpmbs   : 0
  [31:24]: 0 Access Size
  [23:16]: 0 Total Size
  [5:3] : 0 Authentication Method
  [2:0] : 0 Number of RPMB Units
sqes    : 0x66
  [7:4] : 0x6 Max SQ Entry Size (64)
  [3:0] : 0x6 Min SQ Entry Size (64)
cqes    : 0x44
  [7:4] : 0x4 Max CQ Entry Size (16)
  [3:0] : 0x4 Min CQ Entry Size (16)
nn      : 1
oncs    : 0x1f
  [5:5] : 0 Reservations Not Supported
  [4:4] : 0x1 Save and Select Supported
  [3:3] : 0x1 Write Zeroes Supported
  [2:2] : 0x1 Data Set Management Supported
  [1:1] : 0x1 Write Uncorrectable Supported
  [0:0] : 0x1 Compare Supported
fuses   : 0
  [0:0] : 0 Fused Compare and Write Not Supported
fna     : 0
  [2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
  [1:1] : 0 Crypto Erase Applies to Single Namespace(s)
  [0:0] : 0 Format Applies to Single Namespace(s)
vwc     : 0x1
  [0:0] : 0x1 Volatile Write Cache Present
awun    : 1023
awupf   : 0
nvscc   : 1
  [0:0] : 0x1 NVM Vendor Specific Commands uses NVMe Format
acwu    : 0
sgls    : 0
  [0:0] : 0 Scatter-Gather Lists Not Supported
subnqn  :
ps 0 : mp:7.02W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-

> nvme id-ctrl /dev/nvme1n1 -H
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S499NX0M113634
mn      : SAMSUNG MZVLB256HAHQ-000H2
fr      : EXD71HAQ
rab     : 2
ieee    : 002538
cmic    : 0
  [2:2] : 0 PCI
  [1:1] : 0 Single Controller
  [0:0] : 0 Single Port
mdts    : 9
cntlid  : 4
ver     : 10200
rtd3r   : 186a0
rtd3e   : 7a1200
oaes    : 0
  [8:8] : 0 Namespace Attribute Changed Event Not Supported
oacs    : 0x17
  [15:4] : 0x1 Reserved
  [3:3] : 0 NS Management and Attachment Not Supported
  [2:2] : 0x1 FW Commit and Download Supported
  [1:1] : 0x1 Format NVM Supported
  [0:0] : 0x1 Sec. Send and Receive Supported
acl     : 7
aerl    : 7
frmw    : 0x16
  [4:4] : 0x1 Firmware Activate Without Reset Supported
  [3:1] : 0x3 Number of Firmware Slots
  [0:0] : 0 Firmware Slot 1 Read/Write
lpa     : 0x3
  [1:1] : 0x1 Command Effects Log Page Supported
  [0:0] : 0x1 SMART/Health Log Page per NS Supported
elpe    : 255
npss    : 4
avscc   : 0x1
  [0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta   : 0x1
  [0:0] : 0x1 Autonomous Power State Transitions Supported
wctemp  : 354
cctemp  : 355
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 256060514304
unvmcap : 0
rpmbs   : 0
  [31:24]: 0 Access Size
  [23:16]: 0 Total Size
  [5:3] : 0 Authentication Method
  [2:0] : 0 Number of RPMB Units
sqes    : 0x66
  [7:4] : 0x6 Max SQ Entry Size (64)
  [3:0] : 0x6 Min SQ Entry Size (64)
cqes    : 0x44
  [7:4] : 0x4 Max CQ Entry Size (16)
  [3:0] : 0x4 Min CQ Entry Size (16)
nn      : 1
oncs    : 0x1f
  [5:5] : 0 Reservations Not Supported
  [4:4] : 0x1 Save and Select Supported
  [3:3] : 0x1 Write Zeroes Supported
  [2:2] : 0x1 Data Set Management Supported
  [1:1] : 0x1 Write Uncorrectable Supported
  [0:0] : 0x1 Compare Supported
fuses   : 0
  [0:0] : 0 Fused Compare and Write Not Supported
fna     : 0
  [2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
  [1:1] : 0 Crypto Erase Applies to Single Namespace(s)
  [0:0] : 0 Format Applies to Single Namespace(s)
vwc     : 0x1
  [0:0] : 0x1 Volatile Write Cache Present
awun    : 1023
awupf   : 0
nvscc   : 1
  [0:0] : 0x1 NVM Vendor Specific Commands uses NVMe Format
acwu    : 0
sgls    : 0
  [0:0] : 0 Scatter-Gather Lists Not Supported
subnqn  :
ps 0 : mp:7.02W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-
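The swap is easiest to see by keying on the serial numbers, which stay with the physical drive while the node names change. A small sketch follows, using only the node/serial pairs from the two nvme list outputs in this post; the helper names `by_serial` and `moved_drives` are made up for illustration.

```python
def by_serial(listing):
    """Invert {node: serial} into {serial: node}."""
    return {serial: node for node, serial in listing.items()}


def moved_drives(before, after):
    """Return {serial: (old_node, new_node)} for drives whose node changed."""
    old, new = by_serial(before), by_serial(after)
    return {s: (old[s], new[s]) for s in old if s in new and old[s] != new[s]}


# Node -> serial, copied from the nvme list output before the reboots ...
state_1 = {"/dev/nvme0n1": "S499NX0M113634", "/dev/nvme1n1": "S498NA0M403426"}
# ... and after the second reboot.
state_3 = {"/dev/nvme0n1": "S498NA0M403426", "/dev/nvme1n1": "S499NX0M113634"}

swapped = moved_drives(state_1, state_3)
```

Here `swapped` contains both serials, each mapped to its old and new node, which matches what the id-ctrl dumps show: same drives, exchanged names.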
-
@tlehrian said in Error Restoring GPT Partition Tables:
The initial issue was that we were receiving the error restoring GPT partition tables sometimes on these particular machines. We have two NVME M.2 drives on these machines, …
Right, yeah. I just totally lost the thread here. When reading some of the messages I had the impression that there was another problem causing this. No worries.
-
@tlehrian Thank you for taking the time to help us debug this. Without the hardware the developers rely on good quality feedback from you guys.
Ok so we are down to firmware on the computer and firmware on the drives themselves (if that’s a thing).
Something else I’m thinking about (just out loud at the moment): does a commercial Linux distribution like Ubuntu or CentOS do the same thing? If it does, then it should fail to boot every second time or so if the drives are swapping position. If it boots correctly every time, how does it do that? Just thinking about a computer when it cold boots: it looks to the bios to find which drive to boot. The reference for uefi is disk/path/file. It doesn’t (at the point of cold boot) know anything about uuid, so how does it find the boot drive? (assuming that it works every time)
-
@george1421 If you actually install Linux, it will generally mount it using the UUID, which should always be the same.
As for the BIOS, the PCI path doesn’t change for NVME drives; it’s just that which one initializes first can vary from boot to boot (and thus the block naming scheme in Linux gets messed up). BIOS looks for the boot file on the PCI path afaik.
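If the PCI path really is stable, it could serve as the identifier instead of the nvmeXnY name. Below is a minimal sketch of pulling the PCI address out of a block device’s sysfs path. It assumes /sys/block/<name> resolves to a path that contains the PCI addresses of the device and its parent bridges; the function names and the example path in the usage note are hypothetical and not taken from these machines.

```python
import os
import re

# Matches a PCI address (domain:bus:device.function) as one path component.
PCI_ADDR = re.compile(r"/([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])(?=/|$)")


def pci_address(sysfs_path):
    """Return the last PCI address in a resolved sysfs path, or None.

    The last match is the device closest to the disk, i.e. the NVMe
    controller itself rather than an upstream bridge.
    """
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def pci_address_of(block_name, sys_block="/sys/block"):
    """Resolve /sys/block/<block_name> and extract its PCI address."""
    return pci_address(os.path.realpath(os.path.join(sys_block, block_name)))
```

For a made-up path such as /sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/nvme0n1, `pci_address` returns 0000:02:00.0. Whether that address stays fixed per slot on the HP machines in this thread would still need to be checked across reboots.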
However, I’d be interested in seeing if this issue occurs on successive Ubuntu/Centos LiveCD boots
-
@Quazz I was curious about the LiveCD thing, so I tried it. I can confirm that it DOES reorder the drives on successive LiveCD boots. lsblk switched the drive order after one successive boot. In fact I think I ran into this when first installing the OSes on these machines, and thought it was weird when it happened, but didn’t think much of it at the time.
-
@tlehrian Ok so this IS something that the Linux kernel developers are going to have to address. It’s not something that only impacts FOG, but all distros of Linux.
-
@george1421 said in Error Restoring GPT Partition Tables:
It’s not something that only impacts FOG, but all distros of Linux.
As @Quazz said, distros don’t have that issue because they mostly use UUIDs to identify partitions. Once the identifier is set and configured in your grub.conf/fstab there is no issue finding the right one again. But we can’t do that as we need to identify the whole disk, one that we possibly have never seen before (fresh machine).
So I kind of understand why this topic is not being discussed in the Linux world too much. But I am still wondering why this is the case for NVMe drives and if there is a way to query the controller itself. Within FOG we can do a lot of things. We can even implement our very own low level tool in C to query that information for us.
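Before writing a low level tool, it may be worth checking what the kernel already exposes. The sketch below reads the controller attributes from sysfs and looks a controller up by serial number. It assumes that /sys/class/nvme/<controller>/ provides `serial`, `model` and `firmware_rev` files; the function names are hypothetical, and this is an idea to experiment with, not something FOS currently does.

```python
from pathlib import Path

ATTRS = ("serial", "model", "firmware_rev")


def read_attr(ctrl_dir, name):
    """Return the stripped contents of a sysfs attribute, or None if absent."""
    attr_file = ctrl_dir / name
    return attr_file.read_text().strip() if attr_file.is_file() else None


def nvme_controllers(sys_class=Path("/sys/class/nvme")):
    """Return {controller_name: {attribute: value}} for every NVMe controller."""
    found = {}
    if not sys_class.is_dir():
        return found
    for ctrl_dir in sorted(sys_class.iterdir()):
        found[ctrl_dir.name] = {attr: read_attr(ctrl_dir, attr) for attr in ATTRS}
    return found


def controller_for_serial(controllers, serial):
    """Return the controller name that reports the given serial, or None."""
    for name, attrs in controllers.items():
        if attrs["serial"] == serial:
            return name
    return None
```

On a fresh machine the serial is of course not known in advance, so this only helps once a host’s drives have been recorded somewhere. For the first contact, some other rule (size, model, PCI address) would still be needed.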