Kernel temporarily losing partition information when running partprobe


  • Developer

    While trying to push an image to the clients I ran into a problem that took me several hours to dig into. I hope this helps someone else too, and maybe FOG could be updated to work around the issue as well. The client comes up with this error message:

    [CODE] * Looking for Hard Disks…Done
     * Using Hard Disk: /dev/sda
     * Erasing current MBR/GPT Tables…Done
     * Restoring MBR…Done
     * Extended partitions…
    Error: Error informing the kernel about modifications to partition
    /dev/sda5 – Device or resource busy. This means Linux won’t know
    about any changes you made to /dev/sda5 until you reboot – so you
    shouldn’t mount it or use it in any way before rebooting.
    Error: Failed to add partition 5 (Device or resource busy)[/CODE]

    sda5 (or sda as a whole) CANNOT be busy, as only FOG is running and nothing else! The error is not a showstopper right away: FOG starts writing the partition images to disk. sda1, sda2 and sda3 all went fine, [B]but sda5 and sda6 failed[/B]!

    While trying to find out what’s causing this, I stumbled upon this: [url]https://github.com/Excito/parted/blob/master/tests/t2310-dos-extended-2-sector-min-offset.sh[/url]

    [CODE]…
    # Ensure that parted leaves at least 2 sectors between the beginning
    # of an extended partition and the first logical partition.
    # Before parted-2.3, it could be made to leave just one, and that
    # would cause trouble with the Linux kernel.
    …[/CODE]

    Looking at my partition table, I could see what was causing the trouble:

    [CODE]# partition table of /dev/sda
    unit: sectors

    /dev/sda1 : start= 2048, size= 206797, Id= 7, bootable
    /dev/sda2 : start= 210944, size=257535916, Id= 7
    /dev/sda3 : start=257748991, size=224474114, Id= f
    /dev/sda4 : start= 0, size= 0, Id= 0
    /dev/sda5 : start=257748992, size=150334138, Id= 7
    /dev/sda6 : start=408084480, size= 74138625, Id= 7[/CODE]

    Note that sda3 (the extended partition) and sda5 (the first logical partition) start only one sector apart!
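    A quick way to spot this layout in a dump is sketched below. This is only a sketch: `check_ext_gap` is a made-up helper name, it assumes the `sfdisk -d` line format shown above (`Id=`, or `type=` as newer sfdisk versions print it), and it assumes that types 5, f and 85 mark an extended partition that is listed before its logical partitions.

```shell
#!/bin/sh
# check_ext_gap (hypothetical helper): reads an `sfdisk -d` dump on stdin and
# reports logical partitions (number >= 5) that start fewer than 2 sectors
# after the start of the extended partition. Exit status 1 if any was found.
check_ext_gap() {
    awk '
    /start=/ {
        dev = $1
        num = dev
        sub(/.*[^0-9]/, "", num)              # keep the trailing partition number

        line = $0
        gsub(/[ \t]/, "", line)               # drop the column padding
        start = line
        sub(/.*start=/, "", start); sub(/,.*/, "", start)
        ptype = line
        sub(/.*(Id|type)=/, "", ptype); sub(/,.*/, "", ptype)
        ptype = tolower(ptype)

        if (ptype == "5" || ptype == "f" || ptype == "85") {
            # Assumption: these three type codes mean "extended partition".
            ext_start = start + 0
            have_ext = 1
        } else if (have_ext && num + 0 >= 5 && start - ext_start < 2) {
            printf "%s starts only %d sector(s) after the extended partition\n", dev, start - ext_start
            bad = 1
        }
    }
    END { exit (bad ? 1 : 0) }
    '
}
```

    It could be run as `sfdisk -d /dev/sda | check_ext_gap` before capturing an image, so that the layout can be fixed on the source machine first.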

    So what is really happening? FOG zaps the old partition table on disk (sgdisk -Z), restores the MBR (dd), creates the partitions (sfdisk) and then forces the kernel to re-read the partition table (partprobe). Partprobe doesn’t like the partition scheme and leaves the kernel with an only half-populated list of partitions. It looks like this:

    [CODE]# ls -al /dev/sda*
    brw------- 1 root root 8, 0 Jan 21 03:33 /dev/sda
    brw------- 1 root root 8, 1 Jan 21 03:33 /dev/sda1
    brw------- 1 root root 8, 2 Jan 21 03:33 /dev/sda2
    brw------- 1 root root 8, 3 Jan 21 03:33 /dev/sda3
    brw------- 1 root root 8, 5 Jan 21 03:33 /dev/sda5
    brw------- 1 root root 8, 6 Jan 21 03:33 /dev/sda6
    # cat /proc/partitions
    major minor #blocks name

    8 0 241107738 sda
    8 1 103398 sda1
    8 2 128767958 sda2
    8 3 1 sda3
    8 5 75167069 sda5
    8 6 37069312 sda6
    # partprobe /dev/sda
    Error: Error informing the kernel about modifications to partition
    /dev/sda5 – Device or resource busy. This means Linux won’t know
    about any changes you made to /dev/sda5 until you reboot – so you
    shouldn’t mount it or use it in any way before rebooting.
    Error: Failed to add partition 5 (Device or resource busy)
    # ls -al /dev/sda*
    brw------- 1 root root 8, 0 Jan 21 03:33 /dev/sda
    brw------- 1 root root 8, 1 Jan 21 03:33 /dev/sda1
    brw------- 1 root root 8, 2 Jan 21 03:33 /dev/sda2
    brw------- 1 root root 8, 3 Jan 21 03:33 /dev/sda3
    # cat /proc/partitions
    major minor #blocks name

    8 0 241107738 sda
    8 1 103398 sda1
    8 2 128767958 sda2
    8 3 1 sda3
    #[/CODE]

    The partition table on disk is still fine; it’s just the kernel that doesn’t know about it! Several commands (sfdisk -R, hdparm -z, blockdev --rereadpt) make the kernel re-read this information. For example, if I run ‘hdparm -z /dev/sda’ I get sda5 and sda6 back and can run partclone (by hand) without any further trouble!
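    One way to combine these tools is to fall back from one to the next, roughly as sketched here. `reread_partitions` is a made-up name, not a FOG function, and the sketch assumes that each tool returns a non-zero exit status when it fails (as partprobe appears to do in the case above).

```shell
#!/bin/sh
# reread_partitions (hypothetical helper): ask the kernel to re-read the
# partition table of disk $1, trying each tool in turn until one succeeds.
# Assumption: every tool signals failure through a non-zero exit status.
reread_partitions() {
    disk="$1"
    partprobe "$disk" >/dev/null 2>&1 && return 0
    blockdev --rereadpt "$disk" >/dev/null 2>&1 && return 0
    hdparm -z "$disk" >/dev/null 2>&1 && return 0
    echo "Failed to re-read the partition table of $disk" >&2
    return 1
}
```

    A caller would use it as `reread_partitions /dev/sda || exit 1` and could then compare /proc/partitions against the table on disk before starting partclone.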

    I know this is a very special (and stupid) problem, and I really hope no one else runs into it. [B]But could FOG protect people from this by using a tool other than partprobe to make the kernel re-read the partition information?[/B]


  • Senior Developer

    Here’s the same, albeit slightly modified, patch:

    [code]Index: src/buildroot/package/fog/scripts/usr/share/fog/lib/funcs.sh
    --- src/buildroot/package/fog/scripts/usr/share/fog/lib/funcs.sh (revision 2932)
    +++ src/buildroot/package/fog/scripts/usr/share/fog/lib/funcs.sh (working copy)
    @@ -106,6 +106,12 @@
             fi
     }
     # $1 is the partition
    +getPartType()
    +{
    +        parttype=`blkid -po udev $1 | grep PART_ENTRY_TYPE | awk -F'=' '{print $2}'`;
    +        echo $parttype;
    +}
    +# $1 is the partition
     # Returns the size in bytes.
     getPartSize()
     {
    @@ -774,9 +780,9 @@
     # $1 is the drive
     runPartprobe()
     {
    -        hdparm -z $1 &>/dev/null;
    -        if [ ! -f "${1}1" ]; then
    -                partprobe $1 &>/dev/null;
    +        partprobe $1 &>/dev/null || hdparm -z $1 &>/dev/null;
    +        if [ "$?" != "0" ]; then
    +                handleError "Failed to read back partitions";
             fi
     }

    @@ -1023,6 +1029,7 @@
             local imgPartitionType="$6";
             local partNum="";
             local fstype="";
    +        local parttype="";
             local imgpart="";

             partNum=${part:$diskLength};
    @@ -1030,7 +1037,8 @@
                     mkfifo /tmp/pigz1;
                     echo " * Processing Partition: $part ($partNum)";
                     fstype=`fsTypeSetting $part`;
    -                if [ "$fstype" != "swap" ]; then
    +                parttype=`getPartType $part`;
    +                if [ "$fstype" != "swap" ] || [ "$parttype" != "0x5" -a "$parttype" != "0xf" ]; then
                             echo -n " * Using partclone.";
                             echo $fstype;
                             sleep 5;
    @@ -1042,8 +1050,12 @@
                             clear;
                             echo " * Image uploaded";
                     else
    -                        echo " * Not uploading swap partition";
    -                        saveSwapUUID "${imagePath}/d${intDisk}.original.swapuuids" "$part";
    +                        if [ "$parttype" == "0x5" -o "$parttype" == "0xf" ]; then
    +                                echo " * Not uploading content of extended partition";
    +                        else
    +                                echo " * Not uploading swap partition";
    +                                saveSwapUUID "${imagePath}/d${intDisk}.original.swapuuids" "$part";
    +                        fi
                     fi
                     rm /tmp/pigz1;
             else[/code]
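
    One thing to double-check in the condition above: read literally, the `||` makes the test true for every partition whose fstype is not "swap", extended or not, and also for every swap partition whose type is not 0x5/0xf, so the else branch would hardly ever run. The intent is presumably "not swap and not extended". A minimal sketch of that check follows; `should_image` is a made-up name for illustration, not part of funcs.sh.

```shell
#!/bin/sh
# should_image (hypothetical helper): succeeds when a partition should be
# handed to partclone, i.e. it is neither swap nor an extended container.
# $1 is the filesystem type, $2 is the MBR partition type (e.g. 0x7, 0xf).
should_image() {
    fstype="$1"
    parttype="$2"
    # All three tests must hold, hence && rather than ||.
    [ "$fstype" != "swap" ] && [ "$parttype" != "0x5" ] && [ "$parttype" != "0xf" ]
}
```

    With this, `should_image "$fstype" "$parttype"` fails for swap and for types 0x5/0xf, which is when the "Not uploading" branch should be taken.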

  • Developer

    Here is a patch. Only tested in QEMU so far!!! Use with care.

    [url="/_imported_xf_attachments/1/1638_partype.patch.txt?:"]partype.patch.txt[/url]


  • Developer

    [QUOTE]Not copying it wouldn’t be a bad idea though either as it doesn’t contain anything of importance, but again I can’t see how the copying of this would cause problems.[/QUOTE]
    Because the OLD start sector of the logical partitions is stored in that small chunk of data, and it will overwrite the numbers that sfdisk wrote when creating the whole NEW partition table.

    [QUOTE]Do you know what filetype it comes up as?[/QUOTE]
    In my case partclone.imager (raw) is being used when storing this partition. As far as I remember ‘blkid’ is used to discover the filesystem on partitions. We probably could use this to skip taking an image on extended partitions as well.

    [CODE]blkid -po udev /dev/sda3
    ID_PART_TABLE_TYPE=dos
    ID_PART_ENTRY_SCHEME=dos
    ID_PART_ENTRY_TYPE=0x5

    [/CODE]
    Types 5 and f are both used for extended partitions.
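
    A sketch of how that output could be turned into a check is shown below. Both function names are made up for illustration, and the only input format assumed is the `blkid -po udev` output shown above.

```shell
#!/bin/sh
# part_entry_type (hypothetical helper): reads `blkid -po udev <partition>`
# output on stdin and prints the value of ID_PART_ENTRY_TYPE, if present.
part_entry_type() {
    sed -n 's/^ID_PART_ENTRY_TYPE=//p'
}

# is_extended_type (hypothetical helper): succeeds for the two extended
# partition types mentioned above, 0x5 and 0xf.
is_extended_type() {
    case "$1" in
        0x5|0xf) return 0 ;;
        *)       return 1 ;;
    esac
}
```

    For example, `is_extended_type "$(blkid -po udev /dev/sda3 | part_entry_type)" && echo "skip"` would print "skip" for the partition above.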


  • Senior Developer

    I don’t imagine it would.

    The extended partition itself contains no data; only the logical partitions do. However, it doesn’t make sense that restoring a file of 1 or 2 sectors (about 4k of data) would “break” anything.

    Not copying it wouldn’t be a bad idea though either as it doesn’t contain anything of importance, but again I can’t see how the copying of this would cause problems.

    Do you know what filetype it comes up as? Off the top of my head I forget which command reports this to us.


  • Developer

    [QUOTE]So my guess is that Windows is messing up things after the deploy.[/QUOTE]
    I was wrong!! The partition table is ruined even before FOG reboots the machine! :(

    After adding some more debugging steps to the download script I found out that deploying sda3 (the extended partition) with partclone is what messes things up! It turns out that in some cases Linux has a different understanding of extended partitions than Windows has. In my case the start sector of sda6 is slightly smaller when captured with ‘sfdisk -d’ while uploading an image. This wouldn’t be a problem if we didn’t write sda3 (and with it the old start sector of sda6) back to disk!

    One solution would be to go back and only allow Linux to have extended partitions but I doubt this is what users would like to see. Another way would be to check the partition type and skip partclone.restore if it is an extended partition. Could that cause trouble somewhere else??


  • Developer

    As I don’t have access to those computers on the weekend, I simulate things with QEMU at home. It’s great for testing this kind of partitioning issue, as running through an upload/download cycle only takes a couple of minutes with a small test container image!

    But I wasn’t able to reproduce the problem at home as I don’t have a full Windows installation running in QEMU here. So my guess is that Windows is messing up things after the deploy. But I really wonder why it does when it is actually happy with things before I try to image it to another computer!?

    I’ll do some more research next week and will let you know as soon as possible.


  • Senior Developer

    I do want to figure out what is happening and why; I just don’t know where to begin. I really appreciate the help in tracking the bug and narrowing the problem down. Hopefully we can get it “just working” across the board for you and any others who may run into the same type of problem.


  • Developer

    Deploying the image works now (no error messages), but somehow the second logical partition is screwed up. I haven’t had time to look into this yet. Here are a few things I noted down:

    1. Deploy task with partclone went through (including sda5 and sda6) without error
    2. While running a debug download I had a quick look at the partition table (fdisk -l), which seemed alright at that point
    3. Windows 7 boots up and is happy but is completely missing sda6 (2nd logical partition)!!
    4. So I quickly booted a debian live CD to check on the partition table (fdisk -l) and it looked a bit crooked:
      [CODE]…
      /dev/sda6 671145921 671147201 640+ 1 FAT12[/CODE]
      All the other partitions are good! But sda6 should look like this:
      [CODE]…
      /dev/sda6 408084480 445153792 37069312 7 HPFS/NTFS/exFAT[/CODE]
      As you can see those numbers are well beyond the disk capacity. Trying to dump the partition table with sfdisk I got the following error:
      [CODE]ERROR: sector 408084417 does not have an msdos signature[/CODE]
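
    The signature sfdisk complains about is the 0x55 0xAA byte pair that the MBR and each extended boot record carry at offset 510 of their sector. A rough way to check a given sector by hand is sketched here; `has_boot_signature` is a made-up name, and the sketch assumes 512-byte sectors.

```shell
#!/bin/sh
# has_boot_signature (hypothetical helper): succeeds if sector $2 of the
# device or image file $1 ends with the 0x55 0xAA boot record signature.
# Assumption: sectors are 512 bytes long.
has_boot_signature() {
    sig=$(dd if="$1" bs=512 skip="$2" count=1 2>/dev/null |
          od -An -tx1 -j510 -N2 2>/dev/null | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

    For the case above that would be `has_boot_signature /dev/sda 408084417 || echo "no signature in that sector"`.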

    Anyone got an idea? [B]ntfsfix[/B] is being called after restoring the partitions. Could that cause trouble? I am still not sure if all this is a very rare problem (caused by a stupid partition layout) or if others might also run into this. I’ll do some more tests next week and hopefully will be able to get it all sorted.

    Any comments on this are more than welcome!

    PS: We have been happily using FOG for some time now, and this problem is not a real issue for us as we have already fixed our partition layout. I just want to make sure that no one else has trouble with this.


  • Developer

    Thanks!! I get to try it on Friday and will report back then…


  • Senior Developer

    Could you update to 2919? You should then receive the updated init files, which now use blockdev --rereadpt <device> to have the kernel re-read the partition information.

