FOG 1.5.6: Auto resize is unpredictable



  • When using FOG to deploy Windows 10, I’m running into some issues with the partition resize mechanism.

    On previous versions of FOG (sorry, I don’t have the version numbers handy), the auto-resize would only touch the last partition on the disk. For example, take this layout:
    partition 1: 500MB Windows Reserved
    partition 2: 10GB Windows Recovery
    partition 3: XXXGB Operating System

    If you capture this with a previous version of FOG, it automatically leaves the first two partitions alone, capturing them as-is with no resize, and shrinks the final partition to minimum size before capture. On deploying this image to a new computer, the first two partitions are restored exactly as they were captured, while the last partition is resized to fill the rest of the target disk.

    In this version of FOG (1.5.6), that is not what happens. With the layout above, it resizes both partition 2 and partition 3 to minimum size and captures them. When deploying this image, the result is unpredictable: I have seen it leave the second partition at minimum size, or expand it to 30GB, for reasons I can’t identify. Worst of all, it doesn’t reliably resize the final partition to fill the drive’s remaining space. Sometimes it does, sometimes it doesn’t.

    Now I have discovered a workaround for this issue. By booting a Linux rescue CD and flagging my recovery partition as either Hidden NTFS (MBR partition table) or Windows Reserved (GPT partition table), the FOG capture notices this and does not attempt to resize my recovery partition. Under these conditions, it seems to properly resize the final partition to fill the remainder of the drive.

    What would be best is an option in the image specification that allows us to control which partitions are resized.
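    For reference, the same type flip can be scripted with sfdisk from util-linux (assuming a version recent enough to have --part-type; the rescue CD in the post above presumably used a different tool). This sketch runs against a throwaway image file rather than a real disk, and the partition numbers are examples:

```shell
#!/bin/sh
# Sketch: flag a partition as Hidden NTFS so FOG's resizer skips it.
# Runs against a scratch image file instead of a real disk.
img=$(mktemp)
truncate -s 64M "$img"

# Create an MBR table with a single NTFS-type partition (type 7).
printf 'label: dos\nstart=2048, size=65536, type=7\n' | sfdisk -q "$img"

# Change partition 1 to Hidden NTFS (MBR type 0x17).
sfdisk -q --part-type "$img" 1 17

# On a GPT disk, the Microsoft Reserved type GUID would be used instead:
#   sfdisk --part-type /dev/sdX 2 E3C9E316-0B5C-4DB8-817D-F92DF00215AE

sfdisk -d "$img" | grep 'type=17'
rm -f "$img"
```

    The same command works on a real device node; just double-check the partition number first with sfdisk -l.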


  • Developer

    @Quazz Thanks heaps for digging into this even more. I guess it will be very hard to actually track down when and how this was introduced.

    This problem is rare enough on its own already (personally haven’t come across it at all)

    I would have said so myself a day or two ago. But then I even ran into the same issues when debugging a different partition layout problem for another user in my VM test setup (VirtualBox).


  • Moderator

    @Sebastian-Roth As far as I understand, any tool that uses BLKRRPART can be denied access because of udev event handling.

    It’s usually very fast and therefore unlikely to cause issues, but since it is executed parallel to everything else it can potentially get in the way from time to time.

    As far as I know, other programs that interact with the storage device act in the same way and also use BLKRRPART (such as fdisk and hdparm).

    For now, I’d say, leave it as is and monitor for reported problems. This problem is rare enough on its own already (personally I haven’t come across it at all) that it doesn’t really warrant changing everything, with the possibility of affecting something else. (It doesn’t look like it should, but you never know.)

    edit: It’s important to note that the kernel version could affect how BLKRRPART functions, and the udev events would be handled by the kernel as well AFAIK, so perhaps it’s as simple as a difference in kernel versions?
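    On that udev note, a possible mitigation (purely an assumption on my part, not something tested in this thread) is to drain udev’s event queue before partition work. udevadm may not exist in a minimal init environment, so the sketch guards for that:

```shell
#!/bin/sh
# Sketch: wait for udevd to finish processing queued events before any
# partition-table writes. Guarded, since udevadm may be absent in a
# minimal init environment.
if command -v udevadm >/dev/null 2>&1; then
    udevadm settle --timeout=10
fi
echo "proceeding with partition work"
```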


  • Developer

    @Eric-Johnson I don’t want to take over this topic as it’s been mainly you and @Quazz who figured this one out! But I did add the changes and would be thankful if you’d give it a try in your setup. @Cheetah2003 and all the others are welcome too!

    Download 64 Bit and 32 Bit and put those in the /var/www/html/fog/service/ipxe/ directory on your FOG server. It’s probably a good idea to rename the original files instead of overwriting them.

    And I am still wondering if we need to use flock for fdisk (not sfdisk) calls, and maybe others as well. I have a feeling that some change in buildroot caused this random failure in the first place. I am fairly sure this is not something we had in the inits for months and years. But I guess it’s not worth spending the time to track it down to the exact line of code that is causing it. Maybe it’s a combination of udev and sfdisk - very likely one or both are involved.


  • Moderator

    @Sebastian-Roth said in FOG 1.5.6: Auto resize is unpredictable:

    @fry_p On first read this doesn’t sound related to me but we’ll see. I know you are very busy right now but would you mind opening a new topic with this one as well and post more information there? Picture of the error would be great!

    I figured a new topic may be in order. I will open one up if I have time at the beginning of next week, when I would be able to replicate it. If I can get some time, I will gather as much info as possible, including error messages and pictures.


  • Developer

    @fry_p On first read this doesn’t sound related to me but we’ll see. I know you are very busy right now but would you mind opening a new topic with this one as well and post more information there? Picture of the error would be great!


  • Moderator

    Just want to chime in to say good job everyone! I would have thrown my hat in for testing fixes if I wasn’t sweating my buns off in hot school buildings reassembling classroom technology. I have experienced a pattern where, right after uploading an image, as the upload completes and the machine attempts to come out of sysprep, I get an error. Now I am not sure it is related to this, but I have heard of someone else having the same issue. After re-deploying to the very same machine with the image that was just captured, everything works just fine. Regardless, it looks like you guys already have a good idea of what is going on, so I’ll shut up now :D


  • Developer

    @Eric-Johnson @Quazz You guys rock!! Thank you very much for digging that deep and finding this!

    From what you describe, it pretty much is some kind of timing issue, and therefore perfectly matches the findings about the device being busy/locked at that stage. While I still don’t really understand why this is happening now but has not occurred (or not been reported) much in the past, I am pretty confident that you have hit the nail on the head with this.

    Looking through the scripts I see even more places where sfdisk might bite us and I have a feeling that some other issue reports we currently have might be caused by this same busy/lock problem with sfdisk somewhere else in the scripts. Here is a full (?) list of the places:

    roth@x220:~/workspace/fog/fos/Buildroot/board/FOG/FOS/rootfs_overlay$ find . -type f -exec grep "sfdisk " {} /dev/null \; | grep -v -e ":[[:space:]]*#" -e majorDebugEcho
    ./usr/share/fog/lib/partition-funcs.sh:    sfdisk -d $disk 2>/dev/null > $file
    ./usr/share/fog/lib/partition-funcs.sh:    sfdisk $disk < $file >/dev/null 2>&1
    ./usr/share/fog/lib/partition-funcs.sh:    sfdisk -d $disk 2>/dev/null | egrep '(Id|type)=\ *[5f]' | wc -l
    ./usr/share/fog/lib/partition-funcs.sh:            parttype=$(sfdisk -d $disk 2>/dev/null | awk -F[,=] "/^$escape_part/{print \$6}")
    ./usr/share/fog/lib/partition-funcs.sh:            sfdisk -d $disk
    ./usr/share/fog/lib/funcs.sh:    local count=$(sfdisk -d $disk 2>/dev/null | awk /start=\ *[1-9]/'{print $4+0}' | sort -n | head -n1)
    ./usr/share/fog/lib/funcs.sh:    [[ $hasgpt -eq 0 ]] && have_extended_partition=$(sfdisk -l $disk 2>/dev/null | egrep "^${disk}.* (Extended|W95 Ext'd \(LBA\))$" | wc -l)
    ./usr/share/fog/lib/funcs.sh:            sfdisk -d $disk 2>/dev/null > $sfdiskfilename
    ./usr/share/fog/lib/funcs.sh:            sfdisk -d $disk 2>/dev/null > $sfdiskfilename
    ./usr/share/fog/lib/funcs.sh:            sfdisk $disk < $sfdiskoriginalpartitionfilename >/dev/null 2>&1
    ./usr/share/fog/lib/funcs.sh:            sfdisk $disk < $sfdisklegacyoriginalpartitionfilename >/dev/null 2>&1
    ./usr/share/fog/lib/funcs.sh:    fsid="$(sfdisk -d "$disk" |  grep "$part" | sed -n 's/.*Id=\([0-9]\+\).*\(,\|\).*/\1/p')"
    

    Question: Is there a good reason that we shouldn’t add flock ... to all those calls as well? @Eric-Johnson @Quazz @Tom-Elliott

    Though I don’t know if the same thing can even happen when sfdisk is just reading (-d for dumping the table) instead of writing to disk, I reckon adding flock ... wouldn’t hurt in any case. On the other hand, it sounds a bit like “we don’t know what’s happening here, so let’s just hit everything with a rock that looks like it might be dangerous” - not my kind of style.
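    For illustration, this is roughly what locking a read-only dump would look like, following the same pattern as the write-side fix. The lock target below is a temp file standing in for the whole-disk device, so the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch: take the same exclusive lock for read-only sfdisk calls.
# $disk and $file are temp-file stand-ins for e.g. /dev/sda and the
# dump file used by the FOG scripts.
disk=$(mktemp)
file=$(mktemp)

# Original:  sfdisk -d $disk 2>/dev/null > $file
# Locked:    flock $disk sfdisk -d $disk 2>/dev/null > $file
# Demonstrated here with a harmless command held under the lock:
flock "$disk" sh -c 'echo "table dump would run here"' > "$file"

cat "$file"
rm -f "$disk" "$file"
```

    Per the man page excerpt quoted later in the thread, the exclusive flock on the whole-disk node is what makes udevd skip event handling on the device.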



  • Found a couple more sfdisk calls that might need protecting…

    [root@fog ipxe]# diff -ur /mnt/init-orig/usr/share/fog/lib /mnt/init/usr/share/fog/lib
    diff -ur /mnt/init-orig/usr/share/fog/lib/funcs.sh /mnt/init/usr/share/fog/lib/funcs.sh
    --- /mnt/init-orig/usr/share/fog/lib/funcs.sh   2019-05-04 17:58:07.000000000 -0400
    +++ /mnt/init/usr/share/fog/lib/funcs.sh        2019-07-12 12:57:23.000000000 -0400
    @@ -1977,7 +1977,7 @@
             sfdiskLegacyOriginalPartitionFileName "$imagePath" "$disk_number"
             if [[ -r $sfdiskoriginalpartitionfilename ]]; then
                 dots "Inserting Extended partitions (Original)"
    -            sfdisk $disk < $sfdiskoriginalpartitionfilename >/dev/null 2>&1
    +            flock $disk sfdisk $disk < $sfdiskoriginalpartitionfilename >/dev/null 2>&1
                 case $? in
                     0)
                         echo "Done"
    @@ -1988,7 +1988,7 @@
                 esac
             elif [[ -e $sfdisklegacyoriginalpartitionfilename ]]; then
                 dots "Inserting Extended partitions (Legacy)"
    -            sfdisk $disk < $sfdisklegacyoriginalpartitionfilename >/dev/null 2>&1
    +            flock $disk sfdisk $disk < $sfdisklegacyoriginalpartitionfilename >/dev/null 2>&1
                 case $? in
                     0)
                         echo "Done"
    diff -ur /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh /mnt/init/usr/share/fog/lib/partition-funcs.sh
    --- /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh 2019-05-04 17:58:07.000000000 -0400
    +++ /mnt/init/usr/share/fog/lib/partition-funcs.sh      2019-07-12 10:26:47.000000000 -0400
    @@ -73,7 +73,7 @@
         local file="$2"
         [[ -z $disk ]] && handleError "No disk passed (${FUNCNAME[0]})\n   Args Passed: $*"
         [[ -z $file ]] && handleError "No file to receive from passed (${FUNCNAME[0]})\n   Args Passed: $*"
    -    sfdisk $disk < $file >/dev/null 2>&1
    +    flock $disk sfdisk $disk < $file >/dev/null 2>&1
         [[ ! $? -eq 0 ]] && majorDebugEcho "sfdisk failed in (${FUNCNAME[0]})"
     }
     # $1 is the name of the disk drive
    


  • Two 4-station multicasts. No resize problems. The diff below is the only change. I think this one is done! :)

    diff -u /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh /mnt/init/usr/share/fog/lib/partition-funcs.sh 
    --- /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh 2019-05-04 17:58:07.000000000 -0400
    +++ /mnt/init/usr/share/fog/lib/partition-funcs.sh      2019-07-12 10:26:47.000000000 -0400
    @@ -73,7 +73,7 @@
         local file="$2"
         [[ -z $disk ]] && handleError "No disk passed (${FUNCNAME[0]})\n   Args Passed: $*"
         [[ -z $file ]] && handleError "No file to receive from passed (${FUNCNAME[0]})\n   Args Passed: $*"
    -    sfdisk $disk < $file >/dev/null 2>&1
    +    flock $disk sfdisk $disk < $file >/dev/null 2>&1
         [[ ! $? -eq 0 ]] && majorDebugEcho "sfdisk failed in (${FUNCNAME[0]})"
     }
     # $1 is the name of the disk drive
    

  • Senior Developer

    This is really good information. I suppose it could be especially useful with the speed of NVMe outpacing so many other disk access mechanisms; maybe udev was the issue the whole time in regards to “sporadic access” and usability?

    Thank you all for the debugging and communication.



  • @Quazz

    Reading the man page eh? Silly me downloading util-linux and tracing through source… LOL. (Hadn’t really started that except the download part :) )

    Three times with flock $disk sfdisk $disk < $file now, no resize errors.

    Gonna remove all the debug code and let em run to completion now.

    I think ya got it though! :) Thanks!


  • Moderator

    @Eric-Johnson Very interesting, here’s what the sfdisk man page has to say about it.

    sfdisk uses BLKRRPART (reread partition table) ioctl to make sure
           that the device is not used by system or another tools (see also
           --no-reread).  It's possible that this feature or another sfdisk
           activity races with udevd.  The recommended way how to avoid possible
           collisions is to use exclusive flock for the whole-disk device to
           serialize device access.  The exclusive lock will cause udevd to skip
           the event handling on the device.  For example:
    
                  flock /dev/sdc sfdisk /dev/sdc
    
           Note, this semantic is not currently supported by udevd for MD and DM
           devices.
    

    So perhaps adding flock $disk before sfdisk $disk < $file will resolve it.

    That makes the line read as flock $disk sfdisk $disk < $file.



  • Bad news. On the 3rd deploy with sync;sync;sync, one failed.

    Good news:

    I moved all the metering to AFTER the sfdisk call (and didn’t send the sfdisk output to /dev/null, as suggested by @Sebastian-Roth). Found something interesting: twice on one of the 4 machines, sfdisk failed with a “disk in use” error. The input is good; there is no issue with the tmp file. But sfdisk seems to think the disk is in use. I think we have the smoking gun.

    Adding some df and swapon -s calls now to see if I can spot the disk being used by some other tool. The next step is to dig up the sfdisk source and see what makes it think the disk is in use.

    Maybe --force or --no-reread, as suggested by the error message, as a BFMI workaround? :)

    --- /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh 2019-05-04 17:58:07.000000000 -0400
    +++ /mnt/init/usr/share/fog/lib/partition-funcs.sh      2019-07-12 08:37:10.000000000 -0400
    @@ -73,7 +73,14 @@
         local file="$2"
         [[ -z $disk ]] && handleError "No disk passed (${FUNCNAME[0]})\n   Args Passed: $*"
         [[ -z $file ]] && handleError "No file to receive from passed (${FUNCNAME[0]})\n   Args Passed: $*"
    -    sfdisk $disk < $file >/dev/null 2>&1
    +#    sfdisk $disk < $file >/dev/null 2>&1
    +    sfdisk $disk < $file 
    +echo " "
    +echo $file
    +ls -l $file
    +cat $file
    +Pause #ESJ
    +
         [[ ! $? -eq 0 ]] && majorDebugEcho "sfdisk failed in (${FUNCNAME[0]})"
     }
     # $1 is the name of the disk drive
    @@ -403,6 +410,8 @@
         #[[ $status -eq 0 ]] && applySfdiskPartitions "$disk" "$tmp_file1" "$tmp_file2"
         processSfdisk "$minf" filldisk "$disk" "$disk_size" "$fixed" "$orig" > "$tmp_file2"
         status=$?
    +#Pause #ESJ
    +
         if [[ $ismajordebug -gt 0 ]]; then
             echo "Debug"
             majorDebugEcho "Trying to fill with the disk with these partititions:"
    

    [screenshot of the sfdisk error]
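    For anyone reproducing this, here is a sketch of the kind of “who is holding the disk?” checks mentioned above. $disk and the exact tool set are assumptions about the init environment, so everything is guarded:

```shell
#!/bin/sh
# Sketch: quick diagnostics for a "disk in use" failure. $disk is a
# hypothetical example device.
disk=/dev/sda
swapon -s 2>/dev/null || true              # any active swap?
df -h 2>/dev/null | grep "$disk" || true   # any mounted filesystems from it?
if command -v fuser >/dev/null 2>&1; then
    fuser -v "$disk" 2>&1 || true          # processes holding the device node
fi
echo "diagnostics done"
```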


  • Moderator

    @Sebastian-Roth From what I gather, it’s an issue that isn’t happening under every circumstance (personally I haven’t run into it, having deployed to hundreds of devices over the course of each new init release), but does happen reliably under certain conditions.

    It seems odd that a momentary delay in the script, or a sync of the “disk” (as in the ramdisk), would prevent this issue from appearing again.

    From what I can tell, people did start reporting such issues after a certain version, perhaps a buildroot update (just looked, and Buildroot version 2019.02 has bugfixes up to 2019.02.4 now! Changes list: https://git.busybox.net/buildroot/plain/CHANGES?id=2019.02.4). People tend to wait a while before updating, so we may have to look a bit further back.


  • Developer

    @Quazz While I am not against your idea of trying to move this from a file to a variable, I really wonder why we are seeing this issue only lately. Did we always have this issue and people just didn’t report it? I doubt that! Or is it a change we pushed out in the last few weeks? I have gone through the commits and nothing really makes sense from my point of view.

    I am just trying to take one step back before we head into fixing this. Is this hardware specific? @Eric-Johnson
    Should we try a 1.5.5 or even 1.5.4 FOG server and see if it’s the same?


  • Moderator

    @Eric-Johnson I mostly wanted to see it before and after your cat and ls commands and such, but I don’t think it matters, since 2000+ should be more than enough for just about anything anyway.

    I was thinking that perhaps certain hardware wouldn’t generate enough entropy in time and that that would cause issues.

    @Sebastian-Roth @Tom-Elliott Do you reckon it would be a good idea to try to store the output in a variable, as opposed to a file, and use it as such?
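    To make that concrete, here is a minimal sketch of the variable-instead-of-temp-file idea. The table string is fake so it runs without a real disk, and processSfdisk is the FOG helper seen in the diffs above:

```shell
#!/bin/sh
# Sketch: keep the generated partition table in a shell variable instead
# of a temp file. A hard-coded fake table stands in for processSfdisk
# output.
table="label: dos
/dev/sda1 : start=2048, size=1024000, type=83"

# Instead of:  processSfdisk ... > "$tmp_file2"; sfdisk $disk < "$tmp_file2"
# one could:   table=$(processSfdisk ...)
#              printf '%s\n' "$table" | flock $disk sfdisk $disk
printf '%s\n' "$table" | head -n 1
```

    This would sidestep any tmpfs/ramdisk timing questions entirely, at the cost of a slightly larger refactor of the helper functions.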


  • Developer

    @Tom-Elliott While you might be pointing to the right spot here, I don’t think adding a time-based sleep is a great idea.


  • Senior Developer

    @Eric-Johnson What about replacing sync with sleep 5? Sync just ensures cached items waiting in memory get pushed to disk; maybe the scripts are working just a slight bit too fast?



  • Current status:

    Tried the clever suggestion from @Quazz … it didn’t pan out.

    But it inspired me to think outside the box. I tried using /root/sfblahblah.$$ instead of /tmp (since tmpfs is different from a ramdisk…). That was just as unsuccessful.

    Just adding a “sync;sync;sync” (because anything I tell you three times is true) seems successful. On two tests so far.

    Maybe?

    Current working diff…

    [root@fog ipxe]# diff -u /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh /mnt/init/usr/share/fog/lib/partition-funcs.sh
    --- /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh	2019-05-04 17:58:07.000000000 -0400
    +++ /mnt/init/usr/share/fog/lib/partition-funcs.sh	2019-07-11 18:10:20.000000000 -0400
    @@ -401,8 +401,17 @@
         #    majorDebugPause
         #fi
         #[[ $status -eq 0 ]] && applySfdiskPartitions "$disk" "$tmp_file1" "$tmp_file2"
    +#    processSfdisk "$minf" filldisk "$disk" "$disk_size" "$fixed" "$orig" 
         processSfdisk "$minf" filldisk "$disk" "$disk_size" "$fixed" "$orig" > "$tmp_file2"
         status=$?
    +sync;sync;sync
    +#	echo $tmp_file2
    +#	ls -l $tmp_file2
    +#        cat $tmp_file2
    +#echo "after"
    +#cat /proc/sys/kernel/random/entropy_avail
    +#Pause #ESJ
    +	
         if [[ $ismajordebug -gt 0 ]]; then
             echo "Debug"
             majorDebugEcho "Trying to fill with the disk with these partititions:"
    [root@fog ipxe]#
    
