UNSOLVED Dedupe storage - how to best bypass pigz packaging?

  • Server
    • FOG Version: 1.2.0
    • OS: RHEL 6.8
    • Service Version: NA
    • OS: Windows (Various)

    In working with a deduplication storage location for images, I’ve found that pigz prevents any realistic dedupe results. (Even the same image backed up twice is not recognized by the dedupe engine.)

    I’ve attempted to modify the uploadFormat function in the init’s usr/share/fog/lib/funcs.sh file to change pigz behavior (-i and -0 hardcoded) without any useful results. I’m now considering how best to remove pigz from uploadFormat. But this will potentially impact image pushes, since pigz is also called for decompression.

    1. Has the FOG team looked at/considered dedupe storage locations in the past?
    2. Am I approaching this in a good/backwards way?
    3. Any other suggestions on how to collect a ‘clean’ IMG upload without any gzip packaging?


            if [ ! -n "$1" ]; then
                    echo "Missing Cores";
            elif [ ! -n "$2" ]; then
                    echo "Missing file in file out";
            elif [ ! -n "$3" ]; then
                    echo "Missing file name to store";
            fi
            if [ "$imgFormat" == "2" ]; then
                    # pigz -p $1 $PIGZ_COMP < $2 | split -a 3 -d -b 200m - ${3}. &
                    pigz -i -p 1 -0 < $2 | split -a 3 -d -b 200m - ${3}. &
            else
                    if [ "$imgType" == "n" ]; then
                            # pigz -p $1 $PIGZ_COMP < $2 > ${3}.000 &
                            pigz -i -p 1 -0 < $2 > ${3}.000 &
                    else
                            # pigz -p $1 $PIGZ_COMP < $2 > $3 &
                            pigz -i -p 1 -0 < $2 > $3 &
                    fi
            fi
  • Developer

    @Tom-Elliott it was quite a while ago i did my testing (and of course i didn’t actually document anything…) but i was working with about 3 or 5 images seeing how much they shared so i could estimate from there, and it wasn’t good. very little duplication detected in spite of the obvious duplication that was taking place.
    someone who knows what they’re doing might have much different results

  • Senior Developer

    I think de-duplication across multiple images (like 10 images) would be much more suitable than storing the same 10 images with compression. But again, this means the benefit will only be seen when you are dealing with multiple images. For one or two images it’s probably not going to be worth the effort.

  • Moderator

    @Junkhacker I too am interested in seeing how the dedup rate compares to the same file compressed at level 6, if only for information. It would be interesting to know.

  • Developer

    sorry i didn’t see this thread earlier, but i have experimented with dedup of uncompressed fog images on a zfs filesystem to see if it was worth it. i saw far less gains than when using compression. but i really can’t say i’m highly experienced in dedup, so maybe i did something wrong. let us know how your experiments go

  • Developer

    @Tom-Elliott Excitedly downloads new RC

  • Senior Developer

    1.3.5 RC 6 has been released and should have this ‘uncompressed’ capability implemented more properly.

  • Senior Developer

    @george1421 Correct. Really the working branch, it will be available for re-installs of rc5 but not in the GUI.

  • @Tom-Elliott Thanks Tom! I will test out the funcs.sh change to see how the upload results change.

    @george1421 Yes, a clonezilla image directly uploaded will dedupe (with no compression.) The file is chunked into pieces and deduped based on those chunks. The art is to match the dedupe chunk size with the data inside the image and to match the chunk boundaries between the algorithm and the incoming data.
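
    The chunking idea above can be tried by hand. What follows is a minimal sketch, not how any particular dedupe engine works: it cuts files into fixed-size chunks, hashes each chunk, and counts how many hashes are unique. The function name dedupe_estimate and the chunk size are assumptions for illustration, and it relies on GNU split with --filter (coreutils 8.13 or newer, so not the stock RHEL 6 version).

```shell
#!/bin/sh
# Hypothetical helper: estimate fixed-size-chunk dedupe potential.
# Usage: dedupe_estimate CHUNK_SIZE FILE...
# Prints "<total chunks> <unique chunks>".
dedupe_estimate() {
    chunk_size=$1
    shift
    for f in "$@"; do
        # GNU split runs the filter once per chunk, with the chunk on stdin,
        # so each chunk yields one "hash  -" line and nothing is written to disk.
        split -b "$chunk_size" --filter='sha256sum' "$f"
    done | awk '{ total++; if (!seen[$1]++) unique++ }
                END { print total + 0, unique + 0 }'
}
```

    Running it over two captures of the same machine, for example dedupe_estimate 128K first.img second.img (file names are placeholders), gives a rough feel for how much a fixed-chunk engine could share. It also shows the boundary problem: the 12 bytes AAAABBBBAAAA at chunk size 4 give 3 chunks with 2 unique, but prefixing a single extra byte shifts every boundary and all 4 resulting chunks become unique.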

  • Moderator

    @Tom-Elliott Just so I’m clear, this feature is only available with the latest release of FOG (1.3.5.rc5)?

  • Senior Developer

    Prototype is up and appears to be working properly. With this new change it should theoretically be possible to use clonezilla images within fog, once the files are renamed to fog’s naming format. The same goes in reverse: fog images in uncompressed format should be able to move into clonezilla, provided the images are renamed to clonezilla naming standards.

  • Senior Developer

    @george1421 I’m working a prototype init that will enable use without compression.

  • Moderator

    @Tom-Elliott Tom, as a test, would it be possible to just use clonezilla to capture a disk image and store it on the target storage array? What I’m interested in is whether it is even possible to dedup a huge binary file. Since both clonezilla and FOG use partclone to clone the image, would both programs produce files with a similar structure? If so, it would give the OP a way to test without hacking too much on the fog init scripts.

  • Senior Developer

    Uploaded images are always passed through PIGZ regardless of the number set for the compression. 0 does not mean “no” compression, it just means minimal.

    We don’t, currently, have a means for allowing uncompressed images and for most this is more than suitable.

    As far as whether it is captured zipped or not, I think de-duplication engines would still have a hard time ensuring data is not copied twice. While it is true that Partclone is a block imaging utility, the way partclone stores the image is totally separate.

    That said,

    You could give a shot at editing the capture utility to not enforce compression at all. In particular, you would edit the /usr/share/fog/lib/funcs.sh file.

    Also, since you’d be removing the zipped nature of the files, you would need to add an “image type” covering partclone, partimage, and uncompressed partclone.

    To edit, you would (given the current state of the files) change lines 678 and 681, replacing pigz -d -c </tmp/pigz1 with cat /tmp/pigz1.

    For upload you’d edit lines:
    1544 to read as:
    cat $fifo | split -a 3 -d -b 200m - ${file}. &
    1547 to read as:
    cat $fifo > ${file}.000 &

    The nice part, with postinitscripts now, you can edit this file and have it copy the modified file into place before the main tasking begins.
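
    Since line numbers drift between releases, the decompression edit can also be sketched as a pattern replacement. This is only an illustration under assumptions: patch_funcs is a made-up helper name, and the pattern is the exact command text quoted above, so spacing differences in a given funcs.sh would make it a no-op. The upload lines are not covered here because their original text is not quoted in this thread; those still need a manual edit.

```shell
#!/bin/sh
# Hypothetical helper: write a patched copy of funcs.sh in which the
# decompression step quoted in this thread is replaced by a plain cat.
# Usage: patch_funcs SOURCE_FILE PATCHED_COPY
patch_funcs() {
    src=$1   # original funcs.sh
    dst=$2   # where to write the patched copy
    # '#' is used as the sed delimiter because the pattern contains '/'.
    sed -e 's#pigz -d -c </tmp/pigz1#cat /tmp/pigz1#g' "$src" > "$dst"
}
```

    A cautious way to use it is to patch into a scratch file first, for example patch_funcs /usr/share/fog/lib/funcs.sh /tmp/funcs.sh, review the result with diff, and only then have a postinit script copy the patched file into place.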

  • Hi George,

    I’ve tested extensively with ‘compression 0’ for pigz but cannot get any deduplication since (I assume) the data is still being chunked at 128K and placed in a gzip wrapper. Once the .img is written, the file command shows the data as ‘gzip compressed data.’ As shown in the uploadFormat code, I have hard-set pigz to -0 compression (as well as set the same within the FOG client properties.)
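
    The gzip wrapper mentioned above is easy to confirm without the file command, since every gzip stream starts with the two magic bytes 1f 8b. A minimal sketch follows; is_gzip is a made-up helper name, not something that ships with FOG.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the file starts with the
# gzip magic bytes 1f 8b, fail otherwise.
# Usage: is_gzip FILE
is_gzip() {
    # od prints the first two bytes as hex; tr strips spaces and the newline.
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}
```

    For example, after pigz -0 < disk.img > test.img (file names are placeholders), is_gzip test.img && echo "still gzip-wrapped" should report the wrapper even though the compression level is 0.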

    Also, this dedupe storage is used for other data storage successfully (VM datastores, general fileserver data, etc.) (I’m an ‘old-hand’ with dedupe storage systems.)

    Thanks for the info about dev review/focus on dedupe filesystems. So with that in mind, I am only looking for suggestions/guidance on how best to bypass pigz (or if that is a large task affecting huge parts of FOG.)

  • Moderator

    AFAIK: The dev team isn’t even looking at dedup storage, even in 1.3.x. That is outside the scope of FOG imaging.

    What I can tell you is that the data compression that FOG uses (or any type of data compression) will mess with the dedup algorithm. If you want to use storage deduplication, change the image compression factor to 0 and then let your storage device manage the image. Or increase your image compression value and don’t worry about dedup’ing the image.