I am a fan of the ZFS file system because it has a number of features that are difficult to find in a regular filesystem (i.e. not one for expensive storage solutions).
The trickiest to use is probably data deduplication (dedup for short). Used the wrong way, it can hurt performance badly, so the usual advice on using it is “don’t use it” :-).
In particular, it is important to have plenty of memory available to avoid thrashing the disk. For example, according to the FreeBSD ZFS tuning guide, you should have at least 2GB of RAM per TB of storage to use deduplication properly (the calculation is a bit more complicated; you can check those two articles to get a more detailed overview of the issue).
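To get a feel for where a figure like that comes from, here is a rough back-of-the-envelope sketch. It assumes one dedup table entry per unique block and about 320 bytes of RAM per entry; that per-entry size is a commonly quoted approximation and not something taken from this post, and the function name is made up for illustration.

```python
def estimate_ddt_ram_bytes(pool_bytes, avg_block_bytes=128 * 1024, bytes_per_entry=320):
    """Rough worst-case RAM needed to keep the whole dedup table in memory.

    Assumes every block in the pool is unique (one DDT entry each) and that
    each in-core entry costs about `bytes_per_entry` bytes (an assumption).
    """
    n_blocks = pool_bytes // avg_block_bytes
    return n_blocks * bytes_per_entry


TIB = 1024 ** 4
GIB = 1024 ** 3

# 1 TiB of data stored in 128K blocks
print(f"{estimate_ddt_ram_bytes(TIB) / GIB:.2f} GiB")
# The same data in 8K blocks needs far more entries
print(f"{estimate_ddt_ram_bytes(TIB, avg_block_bytes=8 * 1024) / GIB:.2f} GiB")
```

The estimate is very sensitive to the average block size, which is why a single "GB per TB" figure can only be a rule of thumb.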
If after doing your tests (see here for a few ideas on what to check) you still want to use it, you will find that there aren’t many tools available to evaluate the status and effectiveness of deduplication. In fact, the only information you can get directly is the deduplication ratio of a whole pool (not even of a single filesystem), via zpool list. Everything else is hidden in the output of zdb.
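As a small illustration, the pool-wide ratio can be read from a script. This sketch assumes that zpool list -H -o name,dedupratio prints one tab-separated line per pool with a ratio such as 1.50x; the helper names are mine, and the exact output may differ between ZFS versions.

```python
import subprocess


def parse_dedup_ratios(zpool_output):
    """Turn 'name<TAB>1.50x' lines into a {pool_name: ratio} dict."""
    ratios = {}
    for line in zpool_output.splitlines():
        if not line.strip():
            continue
        name, ratio = line.split()
        ratios[name] = float(ratio.rstrip("x"))
    return ratios


def pool_dedup_ratios():
    """Run zpool and return the dedup ratio of every imported pool."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,dedupratio"],
        check=True, capture_output=True, text=True,
    )
    return parse_dedup_ratios(result.stdout)
```

Calling pool_dedup_ratios() on a machine with ZFS installed should return something like a dictionary mapping each pool name to its ratio; the parsing is kept separate so it can be checked without a pool at hand.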
In particular, there are two questions that are not easy to answer:

- Which files are being deduped?
- How much of a file is being deduped?

The first question is interesting because when we enable dedup on a filesystem, the files already in there won’t be touched and only new writes will be deduped. Similarly, when we disable the feature on a filesystem, the deduped files will remain so unless we copy them.
The second question is interesting when we have a filesystem with mixed content and we would like to understand which kinds of files dedup best.
Luckily zdb can give us enough insight that we can extract such information. We’ll do this in two steps: first we’ll extract the list of blocks for every file in a filesystem, and then we’ll get the list of blocks in a given pool.
To make things clearer, I will work through a practical example with real output. Note that the output of zdb is not stable and may change between revisions. The output in this post was obtained with the version of ZFS available on Ubuntu 16.04, YMMV.
So first of all let’s create a test filesystem
$ sudo dd if=/dev/zero of=/tmp/test.img bs=1M count=512
$ sudo zpool create test /tmp/test.img
$ sudo zfs create test/dedup
These three commands create a 512MB file, create a pool named test out of it, and create a filesystem named dedup inside that pool. By default it is mounted right away on /test/dedup.
We’ll then create a number of small files inside it: a regular file, a deduped file (but one whose content is not shared with any other file) and two deduped files with the same content. We can do this as follows (I am assuming we gave ourselves the right to write on /test/dedup as a regular user):
$ echo hello > /test/dedup/regular.txt
$ sudo zfs set dedup=on /test/dedup
$ echo hello > /test/dedup/undeduped.txt
$ echo world > /test/dedup/deduped1.txt
$ echo world > /test/dedup/deduped2.txt
Getting the blocks used by a file

For this we’ll use the output of
$ sudo zdb -ddddd test/dedup
(Yes, those are 5 ds!) This gives us quite verbose output about all the content of the given filesystem. In particular, for a plain file we get output like the following:
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        12    1    16K    512    512    512  100.00  ZFS plain file
                                        168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED
    dnode maxblkid: 0
    path    /regular.txt
    uid     1000
    gid     1000
    atime   Sun Dec 18 21:46:34 2016
    mtime   Sun Dec 18 21:46:34 2016
    ctime   Sun Dec 18 21:46:34 2016
    crtime  Sun Dec 18 21:46:34 2016
    gen     69708
    mode    100664
    size    6
    parent  4
    links   1
    pflags  40800000004
Indirect blocks:
               0 L0 0:2c00:200 200L/200P F=1 B=69708/69708

        segment [0000000000000000, 0000000000000200) size  512
We are mostly interested in the part after “Indirect blocks:”, which lists all of the blocks used by the file; in this case, since the file is very small, there is a single block of 512 bytes. On that line, the entry that matters is the first one, 0:2c00:200, which is the DVA (Data Virtual Address): the device and offset that identify the block of data, followed by the block’s size (offset and size are in hexadecimal, so 200 means 512 bytes). For the various files we have the following list of DVAs:
regular.txt     0:2c00:200
undeduped.txt   0:9600:200
deduped1.txt    0:c200:200
deduped2.txt    0:c200:200
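Collecting that list by hand does not scale, so here is a minimal sketch of how the zdb -ddddd output could be parsed. It relies only on the layout shown above (an “Object” header per object, a “path” line, and L0 lines carrying the DVA); the function name is hypothetical, only the first DVA on each line is kept, and other zdb versions may need different patterns.

```python
import re

# "path  /regular.txt" lines inside an object dump
PATH_RE = re.compile(r"^\s*path\s+(.+)$")
# "<offset> L0 <vdev>:<offset>:<size> ..." lines; L0 blocks hold the file data
L0_RE = re.compile(r"^\s*[0-9a-f]+\s+L0\s+([0-9a-f]+:[0-9a-f]+:[0-9a-f]+)\b")


def parse_file_dvas(zdb_text):
    """Return {path: [dva, ...]} for every plain file in a zdb -ddddd dump."""
    files = {}
    current = None   # path of the plain file being read, if any
    plain = False    # True while inside a "ZFS plain file" object
    for line in zdb_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Object "):
            plain, current = False, None      # a new object starts here
        elif stripped.endswith("ZFS plain file"):
            plain = True
        elif plain and (m := PATH_RE.match(line)):
            current = m.group(1).strip()
            files[current] = []
        elif current is not None and (m := L0_RE.match(line)):
            files[current].append(m.group(1))
    return files
```

Feeding it the captured output of sudo zdb -ddddd test/dedup should give a dictionary equivalent to the table above. Directories and other object types are skipped on purpose, since only file data is of interest here.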
You can see that if we only want to know which files share some of their blocks, we already have all the information we need. In the list above we can easily see that, as expected, deduped1.txt and deduped2.txt share the same block since they have the same content. We still don’t have a way to distinguish between regular.txt and undeduped.txt: they look the same, but the second in fact occupies an entry in the dedup index for no gain.
Getting the list of deduped blocks

With another invocation of zdb
$ sudo zdb -DDDDD test
(Yes, 5 Ds this time) we have, after an initial summary of the deduplicated blocks, the whole list of them. In our case the full output is the following:
DDT-sha256-zap-duplicate: 1 entries, size 3072 on disk, 8192 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2        1     512     512     512        2      1K      1K      1K

DDT-sha256-zap-duplicate contents:

index 19088446f454e refcnt 2 single DVA=<0:c200:200> [L0 deduplicated block] sha256 uncompressed LE contiguous unique single size=200L/200P birth=69713L/69713P fill=1 cksum=9088446f454e715d:ad00401e6c2eda38:46c61cd7df7ba60d:7172e03a4d3fe3c6

DDT-sha256-zap-unique: 1 entries, size 3072 on disk, 8192 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1        1     512     512     512        1     512     512     512

DDT-sha256-zap-unique contents:

index 164295552eb62 refcnt 1 single DVA=<0:9600:200> [L0 deduplicated block] sha256 uncompressed LE contiguous unique single size=200L/200P birth=69713L/69713P fill=1 cksum=64295552eb6223c5:ea60da3f466b170e:11a3a7b1a22f09bd:e09fc373baee012

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1        1     512     512     512        1     512     512     512
     2        1     512     512     512        2      1K      1K      1K
 Total        2      1K      1K      1K        3   1.50K   1.50K   1.50K

dedup = 1.50, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.50
We can see that we have a total of two deduped blocks: one which is used a single time (the content of undeduped.txt) and one which is used by two files (deduped1.txt and deduped2.txt). In the first case the relevant line has a “refcnt 1”, indicating that the block is not shared; in the second, “refcnt 2” means that the same block is used in two different places. There are no references to the block used by regular.txt: the file was written before we enabled deduplication, so its block is not in the index.
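Pulling the DVA and reference count out of those “index” lines takes a single regular expression. This is a sketch based only on the line format shown above (one DVA per entry, i.e. copies=1); the function name is made up for illustration, and other zdb versions may print these lines differently.

```python
import re

# Matches e.g. "index 19088446f454e refcnt 2 single DVA=<0:c200:200> ..."
DDT_ENTRY_RE = re.compile(
    r"index\s+[0-9a-f]+\s+refcnt\s+(\d+)\s+\S+\s+DVA=<([^>]+)>"
)


def parse_ddt(zdb_text):
    """Return {dva: refcnt} for every entry listed by zdb -DDDDD."""
    return {m.group(2): int(m.group(1)) for m in DDT_ENTRY_RE.finditer(zdb_text)}


sample = (
    "index 19088446f454e refcnt 2 single DVA=<0:c200:200> [L0 deduplicated block]\n"
    "index 164295552eb62 refcnt 1 single DVA=<0:9600:200> [L0 deduplicated block]\n"
)
print(parse_ddt(sample))
# → {'0:c200:200': 2, '0:9600:200': 1}
```

The histogram and summary lines do not contain “index ... DVA=<...>”, so they are ignored without any extra filtering.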
At this point we have all the information we need, and we can automate the whole process. I have written a simple tool in Python that does the following:

1. Scan the output of zdb -ddddd and get the list of DVAs for every file
2. Scan the output of zdb -DDDDD and get the list of DVAs in the DDT
3. For every file, check if its blocks appear in the DDT, and if they are unique to the file or shared with others
4. Print a list of all the files that have at least one block in the DDT, along with the number of blocks unique to the file and the number of blocks shared with others

You can find it on GitHub. For our test fs, the output will be the following:
$ sudo python3 zfs_find_dedups.py test/dedup
Scanning filesystem test/dedup to gather file and block list...
Scanning pool test to gather dedup block list...
List of files with dedup indexes:
0 1 /deduped1.txt
0 1 /deduped2.txt
1 0 /undeduped.txt
We can see that deduped1.txt and deduped2.txt share one block, while undeduped.txt is just wasting space in the DDT for nothing. regular.txt doesn’t appear since its only block is not present in the DDT.
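The matching step of the algorithm is short enough to sketch here. This is not the actual tool, just an illustration under one simplifying assumption: a block counts as “shared” when its DDT refcnt is greater than 1, which would also be true for a block repeated inside a single file. The function name and the hard-coded inputs (taken from the test filesystem above) are for illustration only.

```python
def summarize_dedup(file_dvas, ddt_refcounts):
    """Return (unique, shared, path) for every file with a block in the DDT.

    file_dvas:     {path: [dva, ...]} as gathered from zdb -ddddd
    ddt_refcounts: {dva: refcnt} as gathered from zdb -DDDDD
    """
    rows = []
    for path, dvas in sorted(file_dvas.items()):
        in_ddt = [dva for dva in dvas if dva in ddt_refcounts]
        if not in_ddt:
            continue  # written without dedup, nothing to report
        unique = sum(1 for dva in in_ddt if ddt_refcounts[dva] == 1)
        shared = len(in_ddt) - unique
        rows.append((unique, shared, path))
    return rows


file_dvas = {
    "/regular.txt": ["0:2c00:200"],
    "/undeduped.txt": ["0:9600:200"],
    "/deduped1.txt": ["0:c200:200"],
    "/deduped2.txt": ["0:c200:200"],
}
ddt_refcounts = {"0:c200:200": 2, "0:9600:200": 1}

for unique, shared, path in summarize_dedup(file_dvas, ddt_refcounts):
    print(unique, shared, path)
# → 0 1 /deduped1.txt
# → 0 1 /deduped2.txt
# → 1 0 /undeduped.txt
```

A stricter version could count how many distinct files reference each DVA instead of trusting refcnt, at the cost of a second pass over the file list.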
Don’t forget to delete the test fs :-)
$ sudo zfs destroy test/dedup
$ sudo zpool destroy test