war story: parallel(1) command

D. Hugh Redelmeier hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Sat Jul 27 03:03:39 UTC 2013


I wanted to hash all the files in a 700M filesystem (to efficiently find 
differences and similarities).

Sidebar: to find files that are duplicates in a large collection, you 
could compare each pair of files (n * (n-1) / 2 comparisons).  Or you 
could sort a list of the files themselves (not a normal operation) and 
compare adjacent files (n log n comparisons).  But the easiest (heuristic) 
way is to hash each file and reduce the problem to looking for duplicate 
hashes.
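The last step can be sketched in shell (a sketch, not the exact command I used; -w32 compares only the 32-hex-digit MD5 field and --all-repeated are GNU uniq extensions):

```shell
# Hash every file, sort so that identical digests land on adjacent
# lines, then keep only the lines whose digest is repeated.
find . -type f -print0 | xargs -0 -r md5sum | sort | uniq -w32 --all-repeated=separate
```

Anything this prints is a group of files with identical hashes, i.e. (almost certainly) duplicates.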

Most well-developed hash programs seem to be for cryptographic hashes.  
They are designed (even if they sometimes fail) to work in the face of 
adversaries, something I didn't need.  Oh well.

Because of the stringent requirements on cryptographic hashes, I've always 
thought of them as slow.  md5sum is apparently one of the faster ones 
available as a Linux command.  Too bad it has been compromised for 
crypto work.

To hash all the files, I could do:
	find . -type f -print0 | xargs -0 -r md5sum

Tricky bit one: the -print0 and -0 are used so that NUL is used as the 
argument delimiter by find and xargs, so that no (other) characters in a 
filename cause problems (think whitespace or newlines).

Tricky bit two: -r is used so that an empty list does not fire md5sum with 
no arguments.  I always try to use -r with xargs because its behaviour 
otherwise seems surprising and usually wrong.  In this case, it would not 
be a problem.
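The difference is easy to see with empty input (echo stands in for md5sum here):

```shell
# Without -r, GNU xargs runs the command once even when it reads no
# arguments at all:
printf '' | xargs echo ran        # prints "ran"
# With -r (--no-run-if-empty), it runs nothing:
printf '' | xargs -r echo ran     # prints nothing
```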

Anyway, I thought that this operation would be CPU-bound (I thought
md5sum computation would take more time than reading the files from
the disk).  So I wanted a way to exploit the four CPU cores on my
machine.  After much thinking and googling, I came upon the parallel
command which does exactly what I thought I needed.  And it is a great
substitute for xargs:
	find . -type f -print0 | parallel -0 -r md5sum

This would run md5sum on each file, arranging to have four running at 
once (parallel figures out how many processors are on the machine).
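For what it's worth, GNU xargs itself can also parallelize with -P (a sketch of an alternative, not what I actually ran; flag values here are illustrative):

```shell
# -P4 runs up to four md5sum processes at once; -n16 hands each
# invocation at most 16 filenames so the work is spread across them.
find . -type f -print0 | xargs -0 -r -P4 -n16 md5sum
```

parallel is still nicer here: it detects the core count itself and keeps each command's output from interleaving.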

When I did this, I soon found that there was plenty of CPU left over,
so apparently md5sum is disk-bound on my machine (Core 2 Quad Q6600,
2.5" external drive connected via USB 3.0).

This demonstrates again that assumptions about performance
characteristics are often wrong.

I was previously unaware of the GNU parallel command.  Pretty neat.

I guess not many people use it, at least on Fedora.  Evidence:
<https://bugzilla.redhat.com/show_bug.cgi?id=988987>
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists
