FOG 1.5.6: Auto resize is unpredictable
-
@Cheetah2003 I am totally with you that the “auto-detection” and the resizing algorithms are far from perfect! It’s difficult, if not impossible, to suit every situation and partition layout. So on the one hand I do understand you are asking for more control over this via settings. On the other hand, we have had this code in place for many years and, despite its drawbacks, have mostly been able to help people make it work for them.
I’d be very happy to have a couple of weeks to sit down, rewrite the whole resize and partition layout handling code and make it all more robust and customizable. But right now the FOG dev team is very small, and I myself barely have enough time to answer questions and fix things for people in the forums and on GitHub. I can’t see myself redoing the partition handling or even just adding the requested settings to the image definition.
Don’t get me wrong, I am not saying we shouldn’t do it. I’d love to, but we just won’t find the time for such a major change (which I reckon it is). Would you be willing to dive in and work on this? I’ll surely assist as much as I can.
What should it be doing when it encounters more than one resizeable partition? How does it decide how to expand these on a target disk? No one seems interested in answering this?
Please go ahead, dive into the code and find out. I have played with this part of FOG a fair bit, but it’s not code I wrote and I have never reached the point where I would say I fully understand it. From what I have figured out by playing with and reading the code, though, I’d be surprised if its behavior were actually random. Sure, it does not always do what I expect, but I have never seen a single partition layout being deployed randomly. But hey, what do I know.
Start out here: https://github.com/FOGProject/fos/blob/master/Buildroot/board/FOG/FOS/rootfs_overlay/usr/share/fog/lib/procsfdisk.awk
-
@Sebastian-Roth said in FOG 1.5.6: Auto resize is unpredictable:
Please go ahead, dive in the code and find out.
Yeah, I understand you’re just an open source project and most likely short-handed. If time permits, I would love to study what’s under the hood and figure out how it all works.
But like everyone else, I have to survive first. So we’ll see. I already have a non-production FOG server on my desktop at home (I set that up to get all those screenshots for you guys!), so maybe, as time and interest permit, I will do just that!
And I apologize for the bold. Just been asking that question ALL WEEK and never got a response until I asked again in bold. So… hard to buy ‘shouting not needed’ when that’s what finally got a reply.
Anyway, thanks for all your help, I do appreciate it!
-
@Cheetah2003 said:
Just been asking that question ALL WEEK and never got a response until I asked again in bold.
I get your point here. But you need to understand as well that this is not about how often or how loudly you ask, but simply about how much time and willingness we/I have to answer all the questions. I tend to put off questions that I can’t answer off the top of my head, without letting people know. Maybe not the nicest way, so sorry for that.
I have a little more time today than I have had during this very busy week, so I will try to give you some tools to look into this.
- Using a host/client in debug mode is usually the best way, because there might be minor differences in tool versions and functionality compared to a day-to-day Linux system that only show up when you make extensive use of those tools as we do in the scripts. -> Schedule a debug task (deploy or capture, whichever you are after) and boot the client till you get to the shell.
- Working on the client/host console really sucks, so an SSH connection that lets you copy & paste is really helpful! -> When the client/host is up and on the shell, run passwd to set a root password and ip a s to get the IP address if you don’t already know it. With that you should be able to connect to that client via SSH.
- Preparing the environment, like mounting the NFS share from your FOG server, might still be inconvenient. -> Start the task using the command fog, step through to the point where the NFS share is mounted and the directory is prepared (e.g. an upload task creates the directory to upload to) and then cancel the whole thing (Ctrl+C) to get back to the shell and work from there.
- The most important piece of the resize logic is understanding the partition layouts. There is no way around reading and understanding as much as you can of d1.partitions (simply the sfdisk -d /dev/... output of the original layout), d1.minimum.partitions (same as before but after it was shrunk down) and d1.fixed_size_partitions (enumeration of partitions that won’t be resized or moved).
- Now if you want to play with the partition resize logic I suggest you run that magic script manually. Follow the above steps and, when you have things ready, you can simply run the following command to let it calculate the partition table it would use to deploy to a target disk:
/usr/share/fog/lib/procsfdisk.awk -v SECTOR_SIZE=512 -v CHUNK_SIZE=512 -v MIN_START=2048 -v action=filldisk -v target=/dev/sda -v sizePos=187136208 -v diskSize=187136208 -v fixedList=1:2 /images/d1.minimum.partitions /images/d1.partitions
Hints:
- SECTOR_SIZE and CHUNK_SIZE are 512 in pretty much all cases. New disks with a real 4096-byte sector size are around, but we have not seen many of those yet.
- MIN_START is the start sector of the first partition, quite often 2048 but it can be different!
- action=filldisk is just what you usually want.
- diskSize is the full sector count of the target disk you want to deploy to - you can find the sector count by running sfdisk -d /dev/... on the target disk and looking at the last-lba line.
- fixedList is the list of partitions that should not be touched.
- Finally you tell it to read the rest of the information from d1.minimum.partitions and d1.partitions.
Running the command will print out the new partition layout that FOS would use to deploy to the target disk. I’d say, play with this for a bit and let us know what you find. I am fairly sure there are things in this that don’t add up. Possibly you’ll even find a bug in there that we have not come across in all these years.
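To tie the steps together, a manual session could look roughly like this. The two awk one-liners for MIN_START and the disk size are just my shorthand for reading the sfdisk -d output, and the device name and fixedList are examples you need to adapt to your own d1.* files:
# On the FOS debug shell of the client:
passwd                 # set a root password so you can SSH in
ip a s                 # note the client's IP address
# ...then connect from your workstation: ssh root@<client-ip>

fog                    # start the scheduled task, Ctrl+C once /images is mounted

# Read MIN_START and the target disk's sector count from sfdisk (see the hints above):
sfdisk -d /dev/sda
MIN_START=$(sfdisk -d /dev/sda | awk '/start=/ {print $4+0; exit}')
DISK_SIZE=$(sfdisk -d /dev/sda | awk '/^last-lba:/ {print $2}')

# Let the resize logic calculate the layout it would deploy:
/usr/share/fog/lib/procsfdisk.awk -v SECTOR_SIZE=512 -v CHUNK_SIZE=512 \
    -v MIN_START=$MIN_START -v action=filldisk -v target=/dev/sda \
    -v sizePos=$DISK_SIZE -v diskSize=$DISK_SIZE -v fixedList=1:2 \
    /images/d1.minimum.partitions /images/d1.partitions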
-
@Sebastian-Roth Out of curiosity I tried that out and got some strange results (the resizable partitions get early “start positions”). I then ran
gawk /usr/share/fog/lib/procsfdisk.awk --lint -v SECTOR_SIZE=512 -v CHUNK_SIZE=512 -v MIN_START=2048 -v action=filldisk -v target="/dev/sda" -v diskSize=187136208 -v fixedList="1:2" d1.minimum.partitions d1.partitions
(gawk has a --lint option, apparently)
This threw a slew of warnings. I haven’t had the opportunity to go through them yet (I suspect most of them are irrelevant), but I figured I’d mention the option, as it could help in figuring this out.
EDIT: Aside from two (minor) issues (such as line 545, the unquoted gpt label check), I couldn’t find anything through the lint option personally. (Although I do wonder why non-fixed partitions seem to not be passed their original size. EDIT2: figured out this bit and corrected it, but the output is still strange, will paste below.)
# Partition table is consistent.
label: gpt
label-id: 6D7D4E9F-F276-4554-945E-D42EF1DB667D
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1871362046
/dev/sda1 : start= 1870042624, size= 1083392, type=DE94BBA4-06D1-4D40-A16A-BFD50179D6AC, uuid=0E09A256-6313-43EA-9C45-1BDB234A17A3, name="Basic data partition", attrs="RequiredPartition GUID:63"
/dev/sda2 : start= 1871126016, size= 202752, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=E004F3EB-3497-45AC-8BC2-40BF62ECF868, name="EFI system partition", attrs="GUID:63"
/dev/sda3 : start= 1871328768, size= 32768, type=E3C9E316-0B5C-4DB8-817D-F92DF00215AE, uuid=B4553686-67E2-4177-BC7D-AC092860D2CF, name="Microsoft reserved partition", attrs="GUID:63"
/dev/sda4 : start= 2048, size= 188101120, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=0005FBFB-A630-456B-9938-D501F6F70B00, name="Basic data partition"
/dev/sda5 : start= 188103168, size= 1681939456, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=5F76FA4B-76D2-43B0-8ECB-F3EB8596E490, name="Basic data partition"
The problem appears to stem from the for loop at line 562.
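For reference, this is the kind of thing an unquoted label check looks like in awk; an illustration only, not the actual line 545:
# Without quotes, gpt is an uninitialized variable (empty string), not the
# literal string "gpt", so the comparison never matches and --lint warns
# about a reference to an uninitialized variable.
awk 'BEGIN {
    label = "gpt"
    if (label == gpt)   print "unquoted check matched"   # never printed
    if (label == "gpt") print "quoted check matched"     # printed
}'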
-
@Quazz The output looks really strange. Three things I noticed:
- For whatever reason I missed one parameter: sizePos=... As far as I can see it shouldn’t be relevant in the case where we use action=filldisk, but in the scripts it’s set to the same value as the disk size. So you might try and see if it makes a difference to add it (I have updated my post).
- Again something I might have messed up: I used quotes for the two parameters target and fixedList although they are not quoted in the original scripts. It shouldn’t make a difference, but we’ll see.
- You are using gawk - is this for a good reason? Do you run the script on a FOS machine or some other Linux OS?
-
@Sebastian-Roth Corrected the command line as per the info given; same results, however.
I am using gawk because I’m running the tests on my CentOS machine, and if I don’t explicitly call gawk it runs awk (which, for some reason, isn’t symlinked to gawk on this system), which misses a variety of the features the script requires (and lint).
Forcing a skip in the for loop at line 563 makes the output look more normal, but then the partition table doesn’t make sense since the starts aren’t properly recalculated.
My gawk version seems to be older than the one Buildroot has been shipping for a while, so I’ll see if a newer version delivers better output.
Gawk version was the issue indeed. (was 4.0.2, now 4.2.1)
New output:
# Partition table is consistent.
label: gpt
label-id: 6D7D4E9F-F276-4554-945E-D42EF1DB667D
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1871362046
/dev/sda1 : start= 2048, size= 1083392, type=DE94BBA4-06D1-4D40-A16A-BFD50179D6AC, uuid=0E09A256-6313-43EA-9C45-1BDB234A17A3, name="Basic data partition", attrs="RequiredPartition GUID:63"
/dev/sda2 : start= 1085440, size= 202752, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=E004F3EB-3497-45AC-8BC2-40BF62ECF868, name="EFI system partition", attrs="GUID:63"
/dev/sda3 : start= 1288192, size= 32768, type=E3C9E316-0B5C-4DB8-817D-F92DF00215AE, uuid=B4553686-67E2-4177-BC7D-AC092860D2CF, name="Microsoft reserved partition", attrs="GUID:63"
/dev/sda4 : start= 1320960, size= 188101120, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=0005FBFB-A630-456B-9938-D501F6F70B00, name="Basic data partition"
/dev/sda5 : start= 189422080, size= 1681939456, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=5F76FA4B-76D2-43B0-8ECB-F3EB8596E490, name="Basic data partition"
By the way, @Cheetah2003, looking over the code, I think what it tries to do is assign the new size so that a resizable partition takes the same percentage of the disk as it did on the previous image. At least that seems to be the intention.
The output here looks valid and is more or less what we’d expect the script in its current iteration to produce; but I could be missing something, of course.
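To spell out that percentage idea, something along these lines; just a sketch of how I read it, not the actual code in procsfdisk.awk, and the numbers are examples only:
# Scale a resizable partition by the ratio of target disk size to original
# disk size (all values in sectors, example numbers only).
orig_disk=1871362046      # sector count of the disk the image was captured from
target_disk=3906029134    # sector count of the disk we deploy to
orig_size=1681939456      # size of the resizable partition in the image

new_size=$(( orig_size * target_disk / orig_disk ))
echo "resizable partition: $orig_size -> $new_size sectors"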
-
@Quazz said in FOG 1.5.6: Auto resize is unpredictable:
Gawk version was the issue indeed. (was 4.0.2, now 4.2.1)
Not that I really expected this, but I somehow had a feeling that using another environment could give different results. Thanks for testing and verifying.
-
@Cheetah2003 Are you still keen to look into this?
-
@Sebastian-Roth said in FOG 1.5.6: Auto resize is unpredictable:
@Cheetah2003 Are you still keen to look into this?
I’d be happy to. What do you want me to do?
Also, for what it’s worth, I’m not sure multi-partition resizing is really necessary. I can’t really think of any use cases for this ‘feature.’
The percentage thing described earlier sounds pretty dubious, especially if you’re capturing 5 partitions from a 50GB disk… and the recovery partition is 20% of that space (10GB)… you don’t need that taking 20% of a target drive. That would be kinda crazy.
So really, IMHO, sizing partitions as a percentage of the original drive they were captured from seems kinda not-useful. I still think this should be controllable entirely from the image specification. But I think that would require the image specification to actually pull info out of the captured image so it can offer the user options for how to handle the partitions contained within it. Probably a pretty big rewrite of that entire part of the system. I’d love to see this, but yeah, it’s going to be a big task from my perspective.
So I’ll be happy to peek/test whatever you need help with, as time permits, but I’m a little unsure of the goal.
-
@Cheetah2003 A couple of posts further up (four days ago) I posted instructions on how to manually run the resize calculation script. That’s a good place to start playing and to see how this all works. I am fairly sure it is not without flaws, and it would be great if you are keen to look into it and report back what you find.
-
I’m just joining in here, but we are seeing something like the same problem. FOG 1.5.6. We have Dell 3430s we are getting ready for deployment this fall.
A 3430 is a new model for us, and the first we have that doesn’t let you use an MBR boot disk, just GPT. Got everything working with a GPT clonemaster which, for various ugly reasons, has partitions like this:
[root@fog clonemaster10-lab-gpt]# cat d1.minimum.partitions
label: gpt
label-id: 701D9ABD-7D9A-11E9-B9AE-5254009E1079
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 257228766
/dev/sda1 : start= 2048, size= 1124352, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=701D9AB9-7D9A-11E9-B9AE-5254009E1079
/dev/sda2 : start= 1126400, size= 234728416, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=701D9ABA-7D9A-11E9-B9AE-5254009E1079
/dev/sda3 : start= 255332352, size= 204800, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=701D9ABB-7D9A-11E9-B9AE-5254009E1079, name="attrs=\x22GUID:63"
sda2 is the real Windows 10 partition…
cat d1.partitions
label: gpt
label-id: 701D9ABD-7D9A-11E9-B9AE-5254009E1079
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 257228766
/dev/sda1 : start= 2048, size= 1124352, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=701D9AB9-7D9A-11E9-B9AE-5254009E1079
/dev/sda2 : start= 1126400, size= 254204148, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=701D9ABA-7D9A-11E9-B9AE-5254009E1079
/dev/sda3 : start= 255332352, size= 204800, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=701D9ABB-7D9A-11E9-B9AE-5254009E1079, attrs="GUID:63"
cat d1.fixed_size_partitions
:3:3
All seemed great till we tried it on some new machines and, after fog/oobe/namechange/domainjoin, SOME of them wouldn’t let anyone log in. It turns out the middle partition didn’t get extended correctly in some cases, so Windows was out of disk space.
I can multicast to 4 identical machines and on 3 of them /dev/sda2 gets resized correctly, but on one it doesn’t. And the one it fails on is not always the same… Funky, eh?
When I do a debug deploy with ismajordebug=9 it always works…
I was going to dig into my memory and rebuild an init.xz that has ismajordebug=9 next, to see if that makes the 4-host multicast work, or points to the problem.
Oh, and a manual run of /usr/share/fog/lib/procsfdisk.awk in debug mode seems to be producing the correct output.
I’m vaguely wondering if $tmp_file2 is getting hosed somehow before fillSfdiskWithPartitions calls applySfdiskPartitions… But like I said, I cannot get the problem to replicate in majordebug mode yet.
Would be glad to instrument fog.download in any way you suggest.
More tomorrow if I find anything useful.
E
-
@Cheetah2003 OS-built recovery partitions have partition flags that keep them fixed size.
The reason for multi-partition resize is the case where you have your normal partition layout (fixed-size partitions 1-3 plus the Windows partition) and an additional data partition, e.g.:
/dev/sda1 200MB
/dev/sda2 800MB
/dev/sda3 200MB
/dev/sda4 30GB
/dev/sda5 200GB
You can’t automagically know which of the last two partitions to resize and which to ignore. Windows needs room to breathe, but if you deploy this to a 2TB drive, then having a 1.8TB Windows partition and a 200GB data partition feels silly.
I agree that the current method isn’t good enough, of course, but it’s not without its logic.
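To put rough numbers on that example (back-of-the-envelope only, everything in GB, fixed partitions rounded to about 1GB total):
# Compare the two obvious policies when deploying the ~230GB example image
# to a 2TB drive. Numbers are illustrative, not what the script computes.
avail=$(( 2000 - 1 ))   # space left after the small fixed partitions

# Policy A: grow only the Windows partition (sda4), leave the data partition alone
echo "grow one     : sda4=$(( avail - 200 ))GB  sda5=200GB"

# Policy B: grow both in proportion to their share of the source layout
echo "proportional : sda4=$(( 30 * avail / 230 ))GB  sda5=$(( 200 * avail / 230 ))GB"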
Back to the topic of trying to figure this out (this being why partitions sometimes don’t resize): as far as I can tell, these resize issues only occur on GPT-based layouts.
I’ll be looking over partition-funcs.sh in that sort of a direction.
-
@Quazz Do we know enough about the problem yet to say “the problem started with Windows 10 version XXXX”? I’m a bit surprised that, if this is a GPT disk layout issue, we haven’t had this problem before now. Or is it related to changes in FOS that caused this issue to come up (like building FOS from a newer release of Buildroot causing packages to be updated)?
-
@george1421 There were some changes to GPT-related stuff; not a lot, but some.
I also think I remember a case where an existing image only started showing odd issues after updating FOG, so I’m currently leaning towards FOS, especially since I have experienced no problems on the latest Windows 10 versions at all.
So I’m guessing there’s something funky going on under certain conditions, but not sure what. Given the ambiguity it might not even have anything to do with GPT, but since those were the only relevant changes to the files currently being examined it seems the most likely path all the same.
-
I think I have found HOW the problem (or at least my problem) is happening, but still not clear on WHY…
/usr/share/fog/lib/partition-funcs.sh line 76 in restoreSfdiskPartitions is where the resize occurs:
sfdisk $disk < $file >/dev/null 2>&1
[[ ! $? -eq 0 ]] && majorDebugEcho "sfdisk failed in (${FUNCNAME[0]})"
$file is an sfdisk input built in processSfdisk via /usr/share/fog/lib/procsfdisk.awk and stored in $tmp_file2 = /tmp/sfdisk2.$$
But if $tmp_file2 is empty, $? from that sfdisk is still 0 (i.e. a silent error). I found this via testing in a debug deploy.
Not sure why /tmp/sfdisk2.$$ is ending up empty semi-randomly. Still tracking that down. /tmp is a tmpfs filesystem and the target machine has 16GB of RAM, so I doubt it is filling up…
-
@Eric-Johnson Just to collect a bit more data: in your FOG UI, under FOG Configuration->FOG Settings->TFTP Server->KERNEL RAMDISK SIZE, what is the value there, 127000? If so, does the reliability change if you set it to 255000? This increases the amount of virtual disk FOS Linux has available during imaging.
-
@Quazz said in FOG 1.5.6: Auto resize is unpredictable:
@Cheetah2003 OS-built recovery partitions have the partition flags that keeps them fixed size.
@Quazz Argh. As I said several times, this isn’t an OS-built recovery partition. I built it myself. Are you even reading my posts???
@Eric-Johnson Welcome. And yeah, what you’re describing sounds very similar to the issue I had with the previous version of FOG that required I move my recovery partition to be before the OS partition, making the OS partition last on the disk for resize to work properly.
@Sebastian-Roth Sure sure. I’ll do some experiments and report back any findings. Might be a few days, so I hope you’re not in a hurry.
-
This other thread seems to be related (maybe as a cousin) to this issue. In that thread the drive is not being expanded again after being captured by FOG.
ref: https://forums.fogproject.org/topic/13479/install-windows-error-after-capturing-image
-
@george1421 It was indeed 127000… And of course, since bigger is better, I set it to 511000. Will report back on the effect! Thanks!
-
TLDR: still not 100% convinced it is fixed…
Today’s fun started with metering init.xz like this:
diff -u /mnt/init-orig/usr/share/fog/lib/funcs.sh /mnt/init/usr/share/fog/lib/funcs.sh
--- /mnt/init-orig/usr/share/fog/lib/funcs.sh	2019-05-04 17:58:07.000000000 -0400
+++ /mnt/init/usr/share/fog/lib/funcs.sh	2019-07-10 12:29:53.000000000 -0400
@@ -1690,6 +1690,10 @@
     runPartprobe "$disk"
 }
 # Waits for enter if system is debug type
+Pause() {
+    echo " * Press [Enter] key to continue"
+    read -p "$*"
+}
 debugPause() {
     case $isdebug in
         [Yy][Ee][Ss]|[Yy])
diff -u /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh /mnt/init/usr/share/fog/lib/partition-funcs.sh
--- /mnt/init-orig/usr/share/fog/lib/partition-funcs.sh	2019-05-04 17:58:07.000000000 -0400
+++ /mnt/init/usr/share/fog/lib/partition-funcs.sh	2019-07-10 15:39:58.000000000 -0400
@@ -401,8 +401,15 @@
     #    majorDebugPause
     #fi
     #[[ $status -eq 0 ]] && applySfdiskPartitions "$disk" "$tmp_file1" "$tmp_file2"
+    processSfdisk "$minf" filldisk "$disk" "$disk_size" "$fixed" "$orig"
+    Pause
     processSfdisk "$minf" filldisk "$disk" "$disk_size" "$fixed" "$orig" > "$tmp_file2"
     status=$?
+    echo $tmp_file2
+    ls -l $tmp_file2
+    cat $tmp_file2
+    Pause
+
     if [[ $ismajordebug -gt 0 ]]; then
         echo "Debug"
         majorDebugEcho "Trying to fill with the disk with these partititions:"
Printing out the output of processSfdisk, pausing, then doing it for real and ls’ing $tmp_file2 and printing it out, then pausing again.
In one sense this worked. Multicast to 4 machines and all came out right. Previous multicasts to the same four machines would have 1 or 2 displaying the failure… But with the metering… No failures.
OK, so I decided: let’s just quit if $tmp_file2 is zero length… The next version of init.xz had this diff:
diff -u init-orig/usr/share/fog/lib/partition-funcs.sh init/usr/share/fog/lib/partition-funcs.sh
--- init-orig/usr/share/fog/lib/partition-funcs.sh	2019-05-04 17:58:07.000000000 -0400
+++ init/usr/share/fog/lib/partition-funcs.sh	2019-07-10 14:29:47.000000000 -0400
@@ -73,6 +73,7 @@
     local file="$2"
     [[ -z $disk ]] && handleError "No disk passed (${FUNCNAME[0]})\n Args Passed: $*"
     [[ -z $file ]] && handleError "No file to receive from passed (${FUNCNAME[0]})\n Args Passed: $*"
+    [[ ! -s $file ]] && handleError "in /usr/share/fog/lib/partition-funcs.sh fillSfdiskWithPartitions $tmp_file2 is zero length" #ESJ
     sfdisk $disk < $file >/dev/null 2>&1
     [[ ! $? -eq 0 ]] && majorDebugEcho "sfdisk failed in (${FUNCNAME[0]})"
 }
@@ -403,6 +404,9 @@
     #[[ $status -eq 0 ]] && applySfdiskPartitions "$disk" "$tmp_file1" "$tmp_file2"
     processSfdisk "$minf" filldisk "$disk" "$disk_size" "$fixed" "$orig" > "$tmp_file2"
     status=$?
+
+    [[ ! -s $tmp_file2 ]] && handleError "in /usr/share/fog/lib/partition-funcs.sh fillSfdiskWithPartitions $tmp_file2 is zero size" #ESJ
+
     if [[ $ismajordebug -gt 0 ]]; then
         echo "Debug"
         majorDebugEcho "Trying to fill with the disk with these partititions:"
Checked $tmp_file2 in two places, once when it is created and once right before it is used. That would exit if $tmp_file2 was zero size, right?
Did a clone to the four machines… all worked well…
Did another clone… Crapola… One of the four didn’t resize. So it wasn’t zero length… But the other metering was not there, so I don’t know what was there… I have done a bunch of metered multicasts since then, all with no errors. A Heisenberg bug: if you look too closely it always works… Sigh…
I am going to try a few with no metering and KERNEL RAMDISK SIZE set to 511000, per my supersized take on @george1421’s suggestion.
But I am feeling like I am missing something…
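What I’m thinking of metering with next (just a sketch; the spot and variable names are from my reading of fillSfdiskWithPartitions, and the paths are guesses): dump the generated sfdisk input to the console and keep a copy around, without pausing, so the timing stays close to a normal run:
# Right after processSfdisk writes $tmp_file2:
echo "=== sfdisk input for $disk ($(wc -c < "$tmp_file2") bytes) ==="      #ESJ
cat "$tmp_file2"                                                           #ESJ
cp "$tmp_file2" "/tmp/sfdisk2.keep.$$"                                     #ESJ
[[ ! -s $tmp_file2 ]] && handleError "empty sfdisk input (${FUNCNAME[0]})" #ESJ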