war story: parallel(1) command
Lennart Sorensen
lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Mon Jul 29 21:47:43 UTC 2013
On Fri, Jul 26, 2013 at 11:03:39PM -0400, D. Hugh Redelmeier wrote:
> I wanted to hash all the files in a 700M filesystem (to efficiently find
> differences and similarities).
>
> Sidebar: to find files that are duplicates in a large collection, you
> could compare each file with each other (n * (n-1) comparisons). Or you
> could sort a list of the files themselves (not a normal operation) and
> compare adjacent files (n log n comparisons). But the easiest (heuristic)
> way is to hash each file and reduce the problem to looking for duplicate
> hashes.
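>
> For example, once the md5sum output is saved to a file (say hashes.txt,
> a made-up name), the duplicate hashes can be picked out by sorting and
> comparing only the 32 hex characters of the hash field:
> sort hashes.txt | uniq -w32 --all-repeated=separate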
>
> Most well-developed hash programs seem to be for cryptographic hashes. They
> are designed (even if they sometimes fail) to work in the face of
> adversaries, something I didn't need. Oh well.
>
> Because of the stringent requirements on cryptographic hashes, I've always
> thought of them as slow. md5sum is apparently one of the faster ones that
> has a Linux command. Too bad it has been compromised for crypto work.
>
> To hash all the files, I could do:
> find . -type f -print0 | xargs -0 -r md5sum
>
> Tricky bit one: the -print0 and -0 are used so that NUL is used as an arg
> delimiter by find and xargs so that no (other) characters in a filename
> cause problems (think whitespace).
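>
> A quick way to see the problem, with a made-up file name containing a
> space:
> touch 'two words'
> find . -type f | xargs md5sum         # md5sum gets "./two" and "words"
> find . -type f -print0 | xargs -0 md5sum      # gets "./two words"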
>
> Tricky bit two: -r is used so that an empty list does not fire md5sum with
> no arguments. I always try to use -r with xargs because its behaviour
> otherwise seems surprising and usually wrong. In this case, it would not
> be a problem.
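>
> The default behaviour shows up with an empty input list:
> echo -n | xargs md5sum        # still runs md5sum once; it hashes stdin
> echo -n | xargs -r md5sum     # runs nothing at all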
>
> Anyway, I thought that this operation would be CPU-bound (I thought
> md5sum computation would take more time than reading the files from
> the disk). So I wanted a way to exploit the four CPU cores on my
> machine. After much thinking and googling, I came upon the parallel
> command which does exactly what I thought I needed. And it is a great
> substitute for xargs:
> find . -type f -print0 | parallel -0 -r md5sum
>
> This would run an md5sum on each file, arranging to have 4 going at once
> (it figures out how many processors are on the machine).
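>
> The job count can also be set explicitly with parallel's -j option,
> e.g. only two at once:
> find . -type f -print0 | parallel -0 -j2 md5sum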
>
> When I did this, I soon found that there was plenty of CPU left over,
> so apparently md5sum is disk-bound on my machine (Core 2 Quad Q6600,
> 2.5" external drive connected via USB 3.0).
So not using parallel would probably be faster, since the disk head could
then avoid having to jump around between files.
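One way to check would be to time both pipelines on a cold cache,
dropping the page cache between runs (as root):
sync && echo 3 > /proc/sys/vm/drop_caches
time (find . -type f -print0 | xargs -0 -r md5sum > /dev/null)
sync && echo 3 > /proc/sys/vm/drop_caches
time (find . -type f -print0 | parallel -0 md5sum > /dev/null)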
> This demonstrates again that assumptions about performance characteristics
> are often wrong.
>
> I was previously unaware of the GNU parallel command. Pretty neat.
>
> I guess not many people use it, at least on Fedora. Evidence:
> <https://bugzilla.redhat.com/show_bug.cgi?id=988987>
--
Len Sorensen