war story: parallel(1) command

Mauro Souza thoriumbr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Jul 28 01:43:36 UTC 2013


I was going to suggest crc32 instead of md5sum. But before saying that,
I ran a simple test, hashing a 64 MB video file:

time md5sum Paperman\ -\ Full\ Animated\ Short\ Film.mp4
1e6c3eaa0a7d62e162d8754f62935400  Paperman - Full Animated Short Film.mp4
real    0m0.150s
user    0m0.140s
sys    0m0.008s


time crc32 Paperman\ -\ Full\ Animated\ Short\ Film.mp4
39c84f09
real    0m0.160s
user    0m0.140s
sys    0m0.020s

time sha256sum Paperman\ -\ Full\ Animated\ Short\ Film.mp4
ae7130d77fc0d364637c6d512b90996902a3bd57072070f8ed465e79067af396  Paperman - Full Animated Short Film.mp4
real    0m0.420s
user    0m0.408s
sys    0m0.008s

I ran each command a few times so the file stayed in the cache. The first
crc32 run was about 4 times slower than md5sum; the second and all later
runs took almost the same time. I ran md5sum a couple of times too, with
similar results.

Once more, this demonstrates that performance assumptions are sometimes
wrong.

My explanation: reading the data is much slower than hashing it with md5
or crc32, so both are bound by I/O rather than CPU. But sha256 was indeed
slower.
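One way to check that explanation is to take the disk out of the picture entirely and hash data piped straight from memory (the 256 MB size here is arbitrary, just something my assumption picks as large enough to measure):

```shell
# Hash 256 MB of zero bytes from memory; no file I/O is involved.
# If sha256sum is noticeably slower here too, the hash itself is the
# bottleneck; if md5sum and crc32 match the file-based timings, the
# file reads were never the limiting factor in the first place.
time head -c 268435456 /dev/zero | md5sum
time head -c 268435456 /dev/zero | sha256sum
```

The same warm-cache caveat applies: the first run of anything pays for cold caches, so only repeated runs are comparable.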

I had never heard of parallel before; I may try it in my next scripts...


Mauro
http://mauro.limeiratem.com - registered Linux User: 294521
Scripture is both history, and a love letter from God.


2013/7/27 Christopher Browne <cbbrowne-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>

> I had a similarly pleasing experience with parallel a few months ago.
>
> Wanted to spread out database load for an import process.
>
> I:
>
> - Used split to break big files into 1000 tuple chunks
>
> - This gave me literally thousands of data files where there was filtering
> in addition to loading into DB.  (I'm renumbering various stuff, nicely
> handled via a small C program or two.)
>
> By using parallel, I could set up a series of concurrent processing
> streams covering both filtering and ultimately loading into DB.
>
> By having parallel restrict things to ~10 work processes, this could
> harness parallelism, as the servers do have multiple physical disks and
> CPUs.
>
> The restriction to 10 concurrent jobs kept it from bogging down.  It's
> obviously stupid to try to have 100 or 1000 processes fighting over CPUs.
>
> Prll looked like it might be easier to get installed on systems that might
> lack C compilers, but seemed a little more fragile otherwise, though that's
> a woefully vague impression.
>
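The workflow Christopher describes might be sketched like this; the file
names, the chunk prefix, and the ./filter program are placeholders for his
site-specific pieces, and the actual DB load step is omitted:

```shell
# Break the big input into 1000-line chunks named chunk_aa, chunk_ab, ...
# (big_import.dat and the chunk_ prefix are hypothetical names).
split -l 1000 big_import.dat chunk_

# Run the per-chunk filter step with at most 10 concurrent jobs, so the
# machine's disks and CPUs are used without 1000 processes fighting.
# {} is replaced by each chunk filename; ./filter stands in for the
# small C filtering program, and the .filtered output would then be
# loaded into the database.
parallel -j 10 './filter {} > {}.filtered' ::: chunk_*
```

Capping -j at roughly the number of spindles/cores is the point of the "~10 work processes" restriction above.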
