Finding duplicate files with Bash

Someone recently asked, in #bash on Freenode, how to find duplicate files with Bash. Several options were suggested, and the user ended up installing and running “fdupes”. However, this sort of thing should be reasonably easy to do using “find” and a few pipes.

As a quick overview, what you want to achieve is to find all the files and list each one’s size and name, followed by a NUL ($'\0') separator to allow for wonky filenames. You start with sizes, as they are the quickest values to compare. You then keep only the files that have at least one other file of the same size, and discard the rest. Next you run some sort of checksum tool over the remaining candidates; I chose sha1sum. Again you discard the files with no duplicates. And that’s really all there is to it.
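
To make that concrete, here is a rough sketch of that pipeline, assuming a GNU userland and filenames without embedded newlines (the real script uses the NUL separators mentioned above to cope with arbitrary names):

#!/usr/bin/env bash
# Rough sketch of the approach described above, not the actual cdupes script.
# Assumes GNU find/sort/uniq/xargs and filenames without embedded newlines.

dir=${1:-.}

# 1. Print the size (space-padded to a fixed width) and path of every regular file.
find "$dir" -type f -printf '%18s %p\n' |
  # 2. Group by size; keep only files whose size occurs more than once.
  sort | uniq -D -w 18 |
  # 3. Checksum the remaining candidates (strip the size column first)...
  cut -c 20- |
  xargs -r -d '\n' sha1sum -- |
  # 4. ...and keep only files whose checksum occurs more than once.
  sort | uniq -D -w 40

Sorting the padded size strings keeps equal sizes next to each other, so uniq -D can print every file whose size, and later checksum, occurs more than once.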

Feeling bored one night, I decided to try this. As usual, the whole thing grew entirely out of proportion, and I ended up duplicating most of fdupes’ functionality in Bash. Still, the resulting script is surprisingly fast, looks nice, has a decent help menu and doesn’t rely on anything not found on virtually every GNU/Linux machine out there. It also supports a few things fdupes does not, such as (as of 2017-07-01) null-terminated output. Thus it found its way into my toolbox.

Let me present “cdupes”:

$ cdupes
cdupes: no directories specified

Description:
  Bash script that's functionally similar to the tool "fdupes"

Usage: cdupes [options] <dir> [dir] ...
Options:
  -0
    Null termination. Allows piping filenames with weird characters into other tools
  -5
    md5sum. Uses md5sum instead of the default sha1sum
  -A
    NoHidden. Excludes files whose names start with --> . <--
  -c
    Checksum. Show checksum of duplicate files
  -f
    Omit first match from each group. Useful with -m for cleanup scripts
  -l
    Hard link support. Do not consider hard linked files duplicates of each other
  -m
    Machine readable. No empty lines between groups
    Probably only useful in conjunction with -c or -f
  -n
    NoEmpty. Ignore empty files (size 0)
  -p
    Permissions. Don't consider files with different owner/group or permission bits as duplicates
  -r
    Recurse. For every directory given, follow subdirectories encountered within.
  -S
    Size. Show size of duplicate files
  -q
    Quiet. Hides progress indicators
  -Q
    Quiet errors. Errors will not be printed. Does not hide progress indicators

$ cdupes -r /tmp
Files: 21
Same size: 13
Checksum: 13

/tmp/.fp1
/tmp/fp3

/tmp/foo/arse
/tmp/oh/file-OICAwX
/tmp/oh/file-WUJvo7

Duplicate search exited with error status: 1

ERRORS:
find: ‘/tmp/nonono’: Permission denied
find: ‘/tmp/root’: Permission denied
find: ‘/tmp/.cathedral’: Permission denied
sha1sum: /tmp/.startup.lock: Permission denied
sha1sum: /tmp/dabba: Permission denied
sha1sum: /tmp/fabba: Permission denied

$ sudo cdupes -rcSq /tmp
2ab06f95377aecc42e5a0e85573a3e7e3efa0961 157286400 /tmp/.fp1
2ab06f95377aecc42e5a0e85573a3e7e3efa0961 157286400 /tmp/fp3
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/.startup.lock
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/dabba
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/fabba
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/foo/arse
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/test/file-3muc5s
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/test/file-8ZjpE9
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/root/file-ox8qJb
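
As a guess at how the -f, -m and -0 flags are meant to be combined (the directory below is only a placeholder), a cleanup one-liner could look something like this; swapping rm for echo gives a safe dry run first:

# Hypothetical usage: keep the first file of each duplicate group, delete the rest.
cdupes -r -f -m -0 /some/dir | xargs -0 -r rm --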

If you’re interested in trying my script, or including it in your own set of SysOp tools,
you can --> find it here <--
Have fun!
