Weblog entry #8 for ajt

Finding duplicate files and Skein
Posted by ajt on Thu 18 Jun 2009 at 21:55

Some time ago I cobbled together a small application to find duplicate files. It uses a hashing function to help it decide whether identically sized files may be duplicates. The current iteration uses an implementation of the NIST SHA1 algorithm. Depending on the number and size of the files to hash, it can get quite slow. I picked SHA1 rather than MD5, which other duplicate finders tend to use, because it's supposed to be less susceptible to collisions and it's only marginally slower.
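For illustration, the size-then-hash approach might look something like the sketch below. It is written in Python rather than the Perl my tool uses, and it is not the actual code: the names file_digest and find_duplicates are made up for this example, and skipping symlinks is an assumption on my part.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Hash a file in chunks so large files do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, algorithm="sha1"):
    """Return lists of paths under root whose size and digest both match."""
    # Pass 1: group files by size, which is cheap because nothing is read.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption for this sketch: ignore symlinks and special files.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path, algorithm)].append(path)
        duplicates.extend(
            sorted(group) for group in by_digest.values() if len(group) > 1
        )
    return duplicates
```

Comparing sizes first means most files never get hashed at all, which matters because the hashing is where the time goes.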

It would be nice to let the user pick the algorithm to use, and have Perl dynamically load the required module on demand. Previously the Digest module would pick a much slower implementation of the SHA1 algorithm by default, so I've not added the feature.
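The idea of choosing the algorithm by name at run time can be sketched roughly as follows. Again this is Python for illustration only, not my Perl code: hashlib.new() stands in for the role that Digest->new("SHA-1") plays in Perl, and the helper names make_hasher and digest_bytes are hypothetical. Note that Python's standard hashlib does not ship Skein, so only the built-in algorithms can be selected this way.

```python
import hashlib


def make_hasher(name="sha1"):
    """Return a fresh hash object for the named algorithm.

    Raises ValueError with the list of valid names if the name is unknown.
    """
    if name not in hashlib.algorithms_available:
        choices = ", ".join(sorted(hashlib.algorithms_available))
        raise ValueError(f"unknown digest {name!r}; available: {choices}")
    return hashlib.new(name)


def digest_bytes(data, name="sha1"):
    """Hash a bytes object with the algorithm chosen by name."""
    h = make_hasher(name)
    h.update(data)
    return h.hexdigest()
```

The rest of the program only ever sees the generic update/hexdigest interface, so swapping SHA1 for something else becomes a matter of passing a different name.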

Recently I spotted on Bruce Schneier's blog that he has contributed to a new algorithm for the NIST competition to create SHA3, which is meant to replace the now ageing SHA1 and SHA2 algorithms. It's called Skein and there is already a Perl module for it. It's very fancy and apparently quite secure, but what matters from my perspective is that it's blazingly fast on a 64-bit system.

Digest has now been fixed (at my request - thanks to Gisle), so I may start the process of migrating my code to a generic, any-module interface.