deduplication of image files possible yet?

  • There have been a couple of posts over the past couple of years about deduplication, partclone 0.3.x (to drop the rolling checksum), and pigz vs gzip with the --rsyncable parameter.


    From these posts there seemed to be some success, but it required versions of partclone, etc. that were not yet in FOG.

    Is it possible to configure recent FOG versions so that the images can be successfully deduplicated yet? If so, what needs to be configured to make it possible? I am running FOG 1.5.8.

  • Developer

    @george1421 i meant to reply to this a long time ago, but here goes.

    testing on deduping of those images has been done. they dedup quite well. the dedup changes affect zstd and pigz compressed images. pigz compressed images actually dedup better, but the compression and performance are worse. it’s a tradeoff to be evaluated by the individual.

    dedup is only possible with the newer version of partclone, because earlier versions integrated a rolling checksum into the image format. the newer version lets us choose no checksum.

    the compressed binary file is dedupable thanks to the --rsyncable flag on compression that is supported by both pigz and zstd.

    like george said, any deduping would be the responsibility of the underlying filesystem or storage, not built into fog itself.
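
    For anyone who wants to test this on an existing image file, here is a minimal Python sketch that recompresses a stream with zstd's --rsyncable flag. The flags used (-T#, -c, --rsyncable) are the ones listed in the zstd help output in this thread. The function names are made up for illustration, and this is not how FOG itself invokes the compressor during capture.

```python
import subprocess


def rsyncable_zstd_cmd(level=3, threads=0):
    """Build a zstd command line that reads stdin and writes stdout."""
    # -T0 lets zstd pick the thread count from the number of cores.
    # --rsyncable makes zstd periodically resynchronize its compression
    # state, so unchanged input regions tend to yield identical output bytes.
    return ["zstd", f"-{level}", f"-T{threads}", "--rsyncable", "-c"]


def compress_rsyncable(src_path, dst_path, level=3):
    """Compress src_path into dst_path; needs the zstd binary on PATH."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        subprocess.run(rsyncable_zstd_cmd(level), stdin=src, stdout=dst, check=True)
```

    Calling compress_rsyncable() on an uncompressed copy of a partition image (any paths you pass are your own, nothing here is a FOG default) gives a file you can place on dedup storage to compare against the original.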

  • Moderator

    @mfinn999 Deduplication really hasn’t been studied on the FOG captured images. On the two links you provided there was discussion about adding certain options to the utilities that capture the image. Those options were added to the FOG code base.


    Beyond that, no other testing has been done. Also realize that FOG does nothing in regard to dedup; that role should be handled by the host OS or host hardware of the FOG server.

    Also understand that the options that were added to the image capture do not specifically address dedup operations. Those (new) settings will only impact newly captured images in zstd format and not gzip.

    FOG captures images by using a utility called partclone to read the disk blocks on the target computer. It then pipes those blocks through a compressor (zstd or gzip) before they are sent to the FOG server. The FOG server takes the compressed blocks from the target computer and writes them unaltered to the FOG server’s disk. So what’s written to the FOG server’s disk is a packed (compressed) binary file. I can’t see how two images would have enough duplicate blocks for dedup to be effective here.

    @Junkhacker @Quazz Do you know anything I’m missing here?

    TBH I wonder if the -B (block size) option for zstd would have an impact on the dedup’d image. But also it would require the FOG developers to have access to dedup storage (and the desire) to see if there were any improvements that could be made in this area.

    [donald@duckserver html]# zstdmt --help
    *** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
    Usage :
          zstdmt [args] [FILE(s)] [-o file]
    FILE    : a filename
              with no FILE, or when FILE is - , read standard input
    Arguments :
     -#     : # compression level (1-19, default: 3)
     -d     : decompression
     -D file: use `file` as Dictionary
     -o file: result stored into `file` (only if 1 input file)
     -f     : overwrite output without prompting and (de)compress links
    --rm    : remove source file(s) after successful de/compression
     -k     : preserve source file(s) (default)
     -h/-H  : display help/long help and exit
    Advanced arguments :
     -V     : display Version number and exit
     -v     : verbose mode; specify multiple times to increase verbosity
     -q     : suppress warnings; specify twice to suppress errors too
     -c     : force write to standard output, even if it is the console
     -l     : print information about zstd compressed files
    --exclude-compressed:  only compress files that are not previously compressed
    --ultra : enable levels beyond 19, up to 22 (requires more memory)
    --long[=#]: enable long distance matching with given window log (default: 27)
    --fast[=#]: switch to very fast compression levels (default: 1)
    --adapt : dynamically adapt compression level to I/O conditions
    --stream-size=# : optimize compression parameters for streaming input of given number of bytes
    --size-hint=# optimize compression parameters for streaming input of approximately this size
    --target-compressed-block-size=# : make compressed block near targeted size
     -T#    : spawns # compression threads (default: 1, 0==# cores)
     -B#    : select size of each job (default: 0==automatic)
    --rsyncable : compress using a rsync-friendly method (-B sets block size)
    --no-dictID : don't write dictID into header (dictionary compression)
    --[no-]check : integrity check (default: enabled)
    --[no-]compress-literals : force (un)compressed literals
     -r     : operate recursively on directories
    --output-dir-flat[=directory]: all resulting files stored into `directory`.
    --format=zstd : compress files to the .zst format (default)
    --test  : test compressed file integrity
    --[no-]sparse : sparse mode (default: enabled on file, disabled on stdout)
     -M#    : Set a memory usage limit for decompression
    --no-progress : do not display the progress bar
    --      : All arguments after "--" are treated as files
    Dictionary builder :
    --train ## : create a dictionary from a training set of files
    --train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]] : use the cover algorithm with optional args
    --train-fastcover[=k=#,d=#,f=#,steps=#,split=#,accel=#,shrink[=#]] : use the fast cover algorithm with optional args
    --train-legacy[=s=#] : use the legacy algorithm with selectivity (default: 9)
     -o file : `file` is dictionary name (default: dictionary)
    --maxdict=# : limit dictionary to specified size (default: 112640)
    --dictID=# : force dictionary ID to specified value (default: random)
    Benchmark arguments :
     -b#    : benchmark file(s), using # compression level (default: 3)
     -e#    : test all compression levels from -bX to # (default: 1)
     -i#    : minimum evaluation time in seconds (default: 3s)
     -B#    : cut file into independent blocks of size # (default: no block)
    --priority=rt : set process priority to real-time

  • After reading the posts I listed, and seeing what was in 1.5.8, I know that most of the programs should be capable. I am wondering how to “enable” it. We have 226 images using 5.4TB on an XFS partition on the FOG server:

    #df -h
    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/cl_fog2-images 28T 5.4T 22T 20% /images

    I used rsync to make a copy of that to a CentOS 8 VDO volume with compression and deduplication enabled:

    #vdostats --hu
    Device Size Used Available Use% Space saving%
    /dev/mapper/vdo1 6.0T 5.0T 1.0T 83% 2%
    #df -h
    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/vdo1 6.0T 5.2T 833G 87% /backup

    Compression seems to have saved a small amount, but as most of the images are based on the same “base” image, dedupe should have reduced it a great deal more. The images are compressed on the FOG server with default settings.
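
    One way to sanity-check that expectation is to measure how many aligned blocks two image files actually share. The sketch below hashes each file in fixed-size blocks and reports the overlap. The 4 KiB default is an assumption meant to mirror VDO's block size, and the function names are hypothetical. It only approximates what a block-level deduplicator would see.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed dedup granularity (4 KiB)


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return the set of SHA-256 digests of each aligned block in a file."""
    hashes = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.add(hashlib.sha256(block).digest())
    return hashes


def shared_block_ratio(path_a, path_b, block_size=BLOCK_SIZE):
    """Fraction of unique blocks in the smaller file that also occur in the other."""
    a = block_hashes(path_a, block_size)
    b = block_hashes(path_b, block_size)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

    Running shared_block_ratio() on the same partition file from two images built on the same base gives a rough number to post back here. It keeps one 32-byte digest per unique block in memory, so compare one pair of files at a time rather than the whole /images tree.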

  • Senior Developer

    @mfinn999 As far as I know we have all that in FOG 1.5.8 already. Though I have to say that it’s mostly @Junkhacker who’s pushed this forward and knows all the details. But as far as I am concerned I would say we have all the tools and command line options in 1.5.8.

    Can you see it’s not working as intended? Please provide evidence we can work on.