war story: parallel(1) command

William Muriithi william.muriithi-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Jul 28 02:02:50 UTC 2013


> Sidebar: to find files that are duplicates in a large collection, you
> could compare each file with each other (n * (n-1) comparisons).  Or you
> could sort a list of the files themselves (not a normal operation) and
> compare adjacent files (n log n comparisons).  But the easiest (heuristic)
> way is to hash each file and reduce the problem to looking for duplicate
> hashes.
>
> Most developed hash programs seem to be for cryptographic hashes.  They
> are designed (even if they sometimes fail) to work in the face of
> adversaries, something I didn't need.  Oh well.
>
> Because of the stringent requirements on cryptographic hashes, I've always
> thought of them as slow.  md5sum is apparently one of the faster ones
> available as a Linux command.  Too bad it has been compromised for crypto work.
>

For your information, git handles this the same way: every file is hashed, and
only one copy is kept when any of them share a hash.  They use SHA-1, apparently
because it is more collision resistant (than MD5) without being too CPU
intensive.  A rough sketch of wiring the whole duplicate check together is below.

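For what it is worth, here is one way the whole duplicate check could be wired
together once the hashes exist.  This is only a rough sketch (hashes.txt is a
placeholder name, and the uniq flags are the GNU ones):

        find . -type f -print0 | xargs -0 -r md5sum > hashes.txt
        # Sort by hash so identical files land on adjacent lines, then
        # print only lines whose first 32 hex characters repeat, with a
        # blank line between each group of duplicates.
        # (Filenames containing newlines would still confuse this stage.)
        sort hashes.txt | uniq -w32 --all-repeated=separate
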
> To hash all the files, I could do:
>         find . -type f -print0 | xargs -0 -r md5sum
>
> Tricky bit one: the -print0 and -0 are used so that NUL is used as an arg
> delimiter by find and xargs so that no (other) characters in a filename
> cause problems (think whitespace).
>
> Tricky bit two: -r is used so that an empty list does not fire md5sum with
> no arguments.  I always try to use -r with xargs because its behaviour
> otherwise seems surprising and usually wrong.  In this case, it would not
> be a problem.
>
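
Both tricky bits are easy to see in isolation.  A quick illustration (the
file name is made up, and -r is the GNU --no-run-if-empty extension):

        # A name with a space survives the NUL-delimited hand-off intact:
        touch 'name with spaces'
        find . -name 'name with spaces' -print0 | xargs -0 md5sum

        # With no input at all, plain xargs still runs the command once;
        # -r suppresses that run entirely:
        printf '' | xargs echo ran       # prints "ran"
        printf '' | xargs -r echo ran    # prints nothing
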
> Anyway, I thought that this operation would be CPU-bound (I thought
> md5sum computation would take more time than reading the files from
> the disk).  So I wanted a way to exploit the four CPU cores on my
> machine.  After much thinking and googling, I came upon the parallel
> command which does exactly what I thought I needed.  And it is a great
> substitute for xargs:
>         find . -type f -print0 | parallel -0 -r md5sum
>
> This would run an md5sum on each file, arranging to have 4 going at once
> (it figures out how many processors are on the machine).
>
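
Side note: after reading this I had a quick look at the parallel man page,
and the job count can also be set by hand.  Treat these as a sketch rather
than something I have benchmarked:

        find . -type f -print0 | parallel -0 -j2 md5sum      # cap at two jobs
        find . -type f -print0 | parallel -0 -j200% md5sum   # two jobs per core
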
> When I did this, I soon found that there was plenty of CPU left over,
> so apparently md5sum is disk-bound on my machine (Core 2 Quad Q6600,
> 2.5" external drive connected via USB 3.0).
>
> This demonstrates again that assumptions about performance characteristics
> are often wrong.
>
Interesting, I would have guessed it's CPU bound too.  It goes a long way to
show that it doesn't help to buy a cutting-edge CPU these days unless it's for
energy efficiency.

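A rough way to double-check where the time goes is to read one large file with
and without hashing (bigfile is just a placeholder, and the page cache will
skew the second run unless the file is bigger than RAM or the caches are
dropped in between):

        time cat bigfile > /dev/null    # roughly the raw read cost
        time md5sum bigfile             # read cost plus hashing

If the two times are close, the disk is the bottleneck; if md5sum takes much
longer, the CPU is.
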
What filesystem is on the USB drive?  NTFS by any chance?  I have found that
git seems more responsive on Linux than on Windows.  Either the Windows port
sucks or NTFS is just too slow compared to ext4.

> I was previously unaware of the GNU parallel command.  Pretty neat.
>
Thanks for sharing this; it's the first time I have heard of it.  I appreciate
you sharing your discoveries.  I learned a couple of things from your ddrescue
post the other day.
> I guess not many people use it, at least on Fedora.  Evidence:
> <https://bugzilla.redhat.com/show_bug.cgi?id=988987>
>
William