Someone recently asked, in #bash on Freenode, how to find duplicate files with Bash. Several options were suggested, and the user ended up installing and running “fdupes”. However, this sort of thing should be reasonably easy to do using “find” and a few pipes.
As a quick overview: find all the files and list their sizes and names, with each record followed by a NUL ($'\0') separator to allow for wonky filenames. Start with the sizes, as they are the quickest values to compare. Keep only the files that share their size with at least one other file, and discard the rest. Then run some sort of checksum tool over what is left; I chose sha1sum. Again, discard the files with no duplicates. And that’s really all there is to it.
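To make those steps concrete, here is a minimal sketch of the size-then-checksum idea. It is not the actual cdupes source: the function name find_dupes and all variable names are made up for illustration, and it assumes Bash 4 or newer (associative arrays), GNU find (for -printf) and sha1sum. Unlike cdupes, it always recurses and has no options.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, NOT the cdupes source.
# Assumes Bash 4+ (associative arrays), GNU find (-printf) and sha1sum.
# find_dupes is a hypothetical name chosen for this example.

find_dupes() {
    local -A size_count=() sum_count=()
    local -a names=() sizes=() cand_names=() cand_sums=()
    local rec size sum i

    # Step 1: list every regular file as "size<TAB>name", NUL-terminated so
    # that names containing spaces, tabs or newlines survive intact.
    while IFS= read -r -d '' rec; do
        size=${rec%%$'\t'*}            # everything before the first tab
        names+=("${rec#*$'\t'}")       # everything after the first tab
        sizes+=("$size")
        size_count[$size]=$(( ${size_count[$size]:-0} + 1 ))
    done < <(find "$@" -type f -printf '%s\t%p\0')

    # Step 2: checksum only the files whose size is shared with another file.
    for i in "${!names[@]}"; do
        (( ${size_count[${sizes[i]}]} > 1 )) || continue
        # Reading from stdin sidesteps sha1sum's escaping of odd filenames.
        sum=$(sha1sum < "${names[i]}") || continue   # skip unreadable files
        sum=${sum%% *}                               # keep only the hash
        cand_names+=("${names[i]}")
        cand_sums+=("$sum")
        sum_count[$sum]=$(( ${sum_count[$sum]:-0} + 1 ))
    done

    # Step 3: print every checksum shared by two or more files as one group,
    # followed by an empty line.
    for sum in "${!sum_count[@]}"; do
        (( ${sum_count[$sum]} > 1 )) || continue
        for i in "${!cand_names[@]}"; do
            [[ ${cand_sums[i]} == "$sum" ]] && printf '%s\n' "${cand_names[i]}"
        done
        echo
    done
}
```

After sourcing this, something like find_dupes /tmp prints one group of identical files per paragraph. The order of the groups is not defined, and hard links to the same file are reported as duplicates of each other.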
Feeling bored one night, I decided to try this. As usual, the whole thing grew entirely out of proportion, and I ended up duplicating most of fdupes’ functionality in Bash. Still, the resulting script is surprisingly fast, looks nice, has a decent help menu and doesn’t rely on anything not found on virtually every GNU/Linux machine out there. It also supports a few things fdupes does not, such as (as of 2017-07-01) null-terminated output. Thus it found its way into my toolbox.
Let me present “cdupes”:
    $ cdupes
    cdupes: no directories specified
    Description: Bash script that's functionally similar to the tool "fdupes"
    Usage: cdupes [options]
    Options:
      -0  Null termination. Allows piping filenames with weird characters into other tools
      -5  md5sum. Uses md5sum instead of the default sha1sum
      -A  NoHidden. Excludes files which names start with --> . <--
      -c  Checksum. Show checksum of duplicate files
      -f  Omit first match from each group. Useful with -m for cleanup scripts
      -m  Machine readable. No empty lines between groups
          Probably only useful in conjunction with -c or -f
      -n  NoEmpty. Ignore empty files (size 0)
      -p  Permissions. Don't consider files with different owner/group or permission bits as duplicates
      -r  Recurse. For every directory given, follow subdirectories encountered within.
      -S  Size. Show size of duplicate files
      -q  Quiet. Hides progress indicators
      -Q  Quiet errors. Errors will not be printed. Does not hide progress indicators

    $ cdupes -r /tmp
    Files: 21
    Same size: 13
    Checksum: 13
    /tmp/.fp1
    /tmp/fp3

    /tmp/foo/arse
    /tmp/oh/file-OICAwX
    /tmp/oh/file-WUJvo7

    Duplicate search exited with error status: 1
    ERRORS:
    find: ‘/tmp/nonono’: Permission denied
    find: ‘/tmp/root’: Permission denied
    find: ‘/tmp/.cathedral’: Permission denied
    sha1sum: /tmp/.startup.lock: Permission denied
    sha1sum: /tmp/dabba: Permission denied
    sha1sum: /tmp/fabba: Permission denied

    $ sudo cdupes -rcSq /tmp
    2ab06f95377aecc42e5a0e85573a3e7e3efa0961 157286400 /tmp/.fp1
    2ab06f95377aecc42e5a0e85573a3e7e3efa0961 157286400 /tmp/fp3

    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/.startup.lock
    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/dabba
    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/fabba
    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/foo/arse
    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/test/file-3muc5s
    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/test/file-8ZjpE9
    da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/root/file-ox8qJb
If you’re interested in trying my script, or including it in your own set of SysOp tools, you can --> find it here <--