Dedupe storage - how to best bypass pigz packaging?
-
Uploaded images are always passed through PIGZ regardless of the compression number. 0 does not mean “no” compression; it just means minimal.
We don’t currently have a way to allow uncompressed images, and for most users this is more than suitable.
Whether or not the image is captured zipped, I think de-duplication engines would still have a hard time ensuring data is not copied twice. While it is true that Partclone is a block imaging utility, the way Partclone stores the image is a separate matter.
That said,
You could try editing the capture utility so it does not enforce compression at all. Specifically, you would edit the /usr/share/fog/lib/funcs.sh file.
Because you’d be removing the zipped nature of the files, you would also need to add an “image type” option along the lines of partclone, partimage, and uncompressed partclone.
For deploy (where the image is currently decompressed) you would, in the current state of the files, edit lines 678 and 681, replacing pigz -d -c </tmp/pigz1 with cat /tmp/pigz1.
For upload you’d edit lines:
1544 to read as:
cat $fifo | split -a 3 -d -b 200m - ${file}. &
1547 to read as:
cat $fifo > ${file}.000 &
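To see what the edited upload lines do without touching a real init, here is a minimal sketch. It is not FOG's code: the function name capture_uncompressed, the scratch paths, and the tiny 1000-byte split size are all made up for the demo (the real line uses 200m), and a plain file stands in for partclone's output.

```shell
#!/bin/sh
# Sketch of the uncompressed upload path: a writer fills a FIFO and
# split(1) cuts the raw stream into numbered pieces, with no pigz in between.
# ASSUMPTION: names, paths and sizes here are illustrative, not FOG's own.

capture_uncompressed() {
    src=$1      # stand-in for the stream partclone would produce
    file=$2     # output prefix, e.g. .../d1p1.img
    chunk=$3    # split size (the edited line 1544 uses 200m)
    fifo=$(mktemp -u "${TMPDIR:-/tmp}/fifo.XXXXXX")
    mkfifo "$fifo" || return 1
    # Reader side: same shape as the edited line, cat into split.
    cat "$fifo" | split -a 3 -d -b "$chunk" - "${file}." &
    reader=$!
    # Writer side: in the real init this would be partclone writing to the FIFO.
    cat "$src" > "$fifo"
    wait "$reader"
    rm -f "$fifo"
}

workdir=$(mktemp -d)
head -c 2500 /dev/zero > "$workdir/fake.img"
capture_uncompressed "$workdir/fake.img" "$workdir/d1p1.img" 1000
ls "$workdir"
```

The pieces come out as d1p1.img.000, d1p1.img.001 and so on, and concatenating them in order gives back the original stream byte for byte. Note that -d (numeric suffixes) is a GNU split option.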
The nice part is that, with postinitscripts now available, you can edit this file and have a script copy the modified file into place before the main tasking begins.
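A rough sketch of what such a postinit step could look like. The helper name install_patched_funcs and the file name funcs.sh.nocompress are made up for illustration, and where the patched copy actually lives depends on your setup; the demo runs against a scratch directory instead of the live /usr/share/fog/lib/funcs.sh.

```shell
#!/bin/sh
# Sketch of a postinit step that swaps in a patched funcs.sh.
# ASSUMPTION: the patched copy is called funcs.sh.nocompress (made-up name)
# and is reachable from the client when the script runs.

install_patched_funcs() {
    patched=$1   # patched copy of funcs.sh
    target=$2    # on a real client: /usr/share/fog/lib/funcs.sh
    if [ ! -r "$patched" ]; then
        echo "patched file not found: $patched" >&2
        return 1
    fi
    # Keep the stock file so the change is easy to undo.
    if [ -f "$target" ]; then
        cp -f "$target" "$target.orig" || return 1
    fi
    cp -f "$patched" "$target"
}

# Demo against a scratch directory rather than the live init.
demo=$(mktemp -d)
echo "stock"   > "$demo/funcs.sh"
echo "patched" > "$demo/funcs.sh.nocompress"
install_patched_funcs "$demo/funcs.sh.nocompress" "$demo/funcs.sh"
cat "$demo/funcs.sh"
```

In a real postinit script the last call would point at the actual patched file and at /usr/share/fog/lib/funcs.sh.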
-
@Tom-Elliott Tom, as a test, would it be possible to just use clonezilla to capture a disk image and store it on the target storage array? What I’m interested in is whether it is even possible to dedup a huge binary file. Since both clonezilla and FOG use partclone to clone the image, would both programs produce a file with a similar structure? If so, it would give the OP a way to test without hacking too much on the fog init scripts.
-
@george1421 I’m working on a prototype init that will allow use without compression.
-
Prototype is up and appears to be working properly. With this new change it should theoretically be possible to use clonezilla images within fog, once the files are renamed to fog’s naming format. The same goes in reverse: fog images in uncompressed format should be able to move into clonezilla, provided the images are renamed to clonezilla naming standards.
-
@Tom-Elliott Just so I’m clear, this feature is only available with the latest release of FOG (1.3.5.rc5)?
-
@Tom-Elliott Thanks Tom! I will test out the funcs.sh change to see how the upload results change.
@george1421 Yes, a clonezilla image directly uploaded will dedupe (with no compression.) The file is chunked into pieces and deduped based on those chunks. The art is to match the dedupe chunk size with the data inside the image and to match the chunk boundaries between the algorithm and the incoming data.
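A toy illustration of the chunk-boundary point, assuming plain fixed-size chunking (a real dedupe appliance may well use something smarter, such as variable-size chunks). The function name count_chunks and the 4096-byte chunk size are made up for the demo, and sha256sum is assumed to be installed.

```shell
#!/bin/sh
# Sketch: count total vs. unique fixed-size chunks in a file.
# ASSUMPTION: fixed-size chunking with a 4096-byte chunk; real dedupe
# engines choose their own chunk size and boundary scheme.

count_chunks() {   # usage: count_chunks FILE CHUNK_BYTES -> prints "total unique"
    tmp=$(mktemp -d) || return 1
    split -a 6 -b "$2" "$1" "$tmp/c."
    total=$(ls "$tmp" | wc -l | tr -d ' ')
    unique=$(sha256sum "$tmp"/c.* | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
    rm -rf "$tmp"
    echo "$total $unique"
}

# Build a 4096-byte block filled with one character.
mk_block() { head -c 4096 /dev/zero | tr '\0' "$1"; }

work=$(mktemp -d)
# Two "images" (a b c) and (a b d), stored back to back on chunk boundaries.
{ mk_block a; mk_block b; mk_block c; mk_block a; mk_block b; mk_block d; } > "$work/aligned"
# Same data, but the second image is pushed one byte off the boundary.
{ mk_block a; mk_block b; mk_block c; printf x; mk_block a; mk_block b; mk_block d; } > "$work/shifted"

count_chunks "$work/aligned" 4096   # prints: 6 4
count_chunks "$work/shifted" 4096   # prints: 7 7
```

In the aligned case two of the six chunks are duplicates and can be dropped. A single extra byte in front of the second image shifts every boundary after it, and nothing matches any more even though the data is almost identical.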
-
@george1421 Correct. Really it’s in the working branch; it will be available for re-installs of rc5 but not in the GUI.
-
1.3.5 RC 6 has been released and should have this ‘uncompressed’ capability implemented more properly.
-
@Tom-Elliott Excitedly downloads new RC
-
sorry i didn’t see this thread earlier, but i have experimented with dedup of uncompressed fog images on a zfs filesystem to see if it was worth it. i saw far smaller gains than when using compression. but i really can’t say i’m highly experienced in dedup, so maybe i did something wrong. let us know how your experiments go
-
@Junkhacker I too am interested in seeing how the dedup rate compares to the same file compressed at level 6, if only for information. It would be interesting to know.
-
I think de-duplication across multiple images (like 10 images) would be much more suitable than the same 10 images with compression. But again, this means the benefit will only show when you are dealing with multiple images. For one or two images it’s probably not going to be worth it.
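For anyone who wants to put numbers on this with their own images, here is one rough way to compare the two approaches. Treat it as a sketch only: gzip -6 stands in for pigz at level 6, the dedup is naive fixed-size chunking at 4096 bytes, the function names are made up, and the demo files are random data that shares a common base. Random data does not compress at all, so the demo is rigged in dedup's favour; real images will give different numbers.

```shell
#!/bin/sh
# Sketch: bytes needed to store a set of images, compressed vs. deduped.
# ASSUMPTION: gzip -6 as a stand-in for pigz level 6, and fixed-size
# chunk dedup as a stand-in for whatever the storage array really does.

compressed_bytes() {   # usage: compressed_bytes FILE...
    total=0
    for f in "$@"; do
        sz=$(gzip -6 -c "$f" | wc -c | tr -d ' ')
        total=$((total + sz))
    done
    echo "$total"
}

dedup_bytes() {        # usage: dedup_bytes CHUNK_BYTES FILE...
    chunk=$1; shift
    tmp=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        split -a 6 -b "$chunk" "$f" "$tmp/$n."
    done
    # Keep one chunk per distinct hash, then add up the sizes of those chunks.
    sha256sum "$tmp"/* | sort -u -k1,1 | awk '{print $2}' | xargs cat | wc -c | tr -d ' '
    rm -rf "$tmp"
}

# Synthetic demo: three "images" sharing a 64 KiB base plus 4 KiB each of their own.
work=$(mktemp -d)
head -c 65536 /dev/urandom > "$work/base"
for i in 1 2 3; do
    { cat "$work/base"; head -c 4096 /dev/urandom; } > "$work/img$i"
done
echo "gzip -6 total: $(compressed_bytes "$work"/img?) bytes"
echo "deduped total: $(dedup_bytes 4096 "$work"/img?) bytes"
```

Pointing both functions at a handful of real uncompressed captures would give the comparison being asked about here, at least for fixed-size chunking.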
-
@Tom-Elliott it was quite a while ago i did my testing (and of course i didn’t actually document anything…) but i was working with about 3 or 5 images seeing how much they shared so i could estimate from there, and it wasn’t good. very little duplication detected in spite of the obvious duplication that was taking place.
someone who knows what they’re doing might have much different results