war story: parallel(1) command
Lennart Sorensen
lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Mon Jul 29 21:47:43 UTC 2013
On Fri, Jul 26, 2013 at 11:03:39PM -0400, D. Hugh Redelmeier wrote:
> I wanted to hash all the files in a 700M filesystem (to efficiently find
> differences and similarities).
>
> Sidebar: to find files that are duplicates in a large collection, you
> could compare each file with each other (n * (n-1) comparisons). Or you
> could sort a list of the files themselves (not a normal operation) and
> compare adjacent files (n log n comparisons). But the easiest (heuristic)
> way is to hash each file and reduce the problem to looking for duplicate
> hashes.
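>
> For example, once the md5sum output is saved to a file (say hashes.txt,
> a made-up name), the duplicate hashes can be picked out by sorting and
> comparing only the 32 hex characters of the hash field:
> sort hashes.txt | uniq -w32 --all-repeated=separate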
>
> Most well-developed hash programs seem to be for cryptographic hashes. They
> are designed (even if they sometimes fail) to work in the face of
> adversaries, something I didn't need. Oh well.
>
> Because of the stringent requirements on cryptographic hashes, I've always
> thought of them as slow. md5sum is apparently one of the faster ones that
> has a Linux command. Too bad it has been compromised for crypto work.
>
> To hash all the files, I could do:
> find . -type f -print0 | xargs -0 -r md5sum
>
> Tricky bit one: the -print0 and -0 are used so that NUL is used as an arg
> delimiter by find and xargs so that no (other) characters in a filename
> cause problems (think whitespace).
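>
> A quick way to see the problem, with a made-up file name containing a
> space:
> touch 'two words'
> find . -type f | xargs md5sum         # md5sum gets "./two" and "words"
> find . -type f -print0 | xargs -0 md5sum      # gets "./two words"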
>
> Tricky bit two: -r is used so that an empty list does not fire md5sum with
> no arguments. I always try to use -r with xargs because its behaviour
> otherwise seems surprising and usually wrong. In this case, it would not
> be a problem.
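>
> The default behaviour shows up with an empty input list:
> echo -n | xargs md5sum        # still runs md5sum once; it hashes stdin
> echo -n | xargs -r md5sum     # runs nothing at all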
>
> Anyway, I thought that this operation would be CPU-bound (I thought
> md5sum computation would take more time than reading the files from
> the disk). So I wanted a way to exploit the four CPU cores on my
> machine. After much thinking and googling, I came upon the parallel
> command which does exactly what I thought I needed. And it is a great
> substitute for xargs:
> find . -type f -print0 | parallel -0 -r md5sum
>
> This would run an md5sum on each file, arranging to have 4 going at once
> (it figures out how many processors are on the machine).
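>
> The job count can also be set explicitly with parallel's -j option,
> e.g. only two at once:
> find . -type f -print0 | parallel -0 -j2 md5sum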
>
> When I did this, I soon found that there was plenty of CPU left over,
> so apparently md5sum is disk-bound on my machine (Core 2 Quad Q6600,
> 2.5" external drive connected via USB 3.0).
So not using parallel would probably be faster, since the disk head could
then avoid having to jump around between files.
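One way to check would be to time both pipelines on a cold cache,
dropping the page cache between runs (as root):
sync && echo 3 > /proc/sys/vm/drop_caches
time (find . -type f -print0 | xargs -0 -r md5sum > /dev/null)
sync && echo 3 > /proc/sys/vm/drop_caches
time (find . -type f -print0 | parallel -0 md5sum > /dev/null)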
> This demonstrates again that assumptions about performance characteristics
> are often wrong.
>
> I was previously unaware of the GNU parallel command. Pretty neat.
>
> I guess not many people use it, at least on Fedora. Evidence:
> <https://bugzilla.redhat.com/show_bug.cgi?id=988987>
--
Len Sorensen