war story: parallel(1) command

Christopher Browne cbbrowne-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Wed Jul 31 14:51:52 UTC 2013


On Wed, Jul 31, 2013 at 9:24 AM, D. Hugh Redelmeier <hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org> wrote:
> | From: James Knott <james.knott-bJEeYj9oJeDQT0dZR+AlfA at public.gmane.org>
>
> | You're forgetting one key point.  A hash is a limited number of bits,
> | which means you cannot have a unique hash for every possible file.
> | Collisions, while not likely, are not impossible.
>
> The hashes we're talking about (long cryptographic hashes) make
> accidental collisions practically impossible.  Git, for example,
> assumes that.

I hope that comes with an "expect, but verify."

If it's a hard dependency, and there's no test, then your repository
might get destroyed if a (highly improbable) collision did take place.

It's tempting to say "no need to bother, [heat death of universe]...",
but depending on how bad it is to have a collision, it may be somewhat
important to check.
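
For a rough sense of scale, the birthday bound can be estimated in a few
lines.  This is only a sketch, not anything Git itself does; the object
count of a billion is an arbitrary assumption, and the 160-bit and 128-bit
widths are those of SHA-1 and MD5.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_items * (n_items - 1) / (2 * 2 ** hash_bits)
    # -expm1(-x) keeps precision when x is tiny, unlike 1 - exp(-x).
    return -math.expm1(-exponent)


if __name__ == "__main__":
    n = 10 ** 9  # assumed number of stored objects, purely illustrative
    print(f"160-bit hash: {collision_probability(n, 160):.3e}")
    print(f"128-bit hash: {collision_probability(n, 128):.3e}")
```

The estimate assumes uniformly distributed hash values and covers
accidental collisions only; it says nothing about deliberately
constructed ones.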

For Git, a collision would have pretty perverse effects; it would mean
two changes seem like they're the same, and they'll both be treated as
the parents of successor patches, which would be Mighty Destructive to
the repository, as it makes it stop making sense.  (Particularly if
you've been pruning out disconnected patches, so that it's pretty
certain that those that remain will be parents of something.)

I have a "don't care" case, myself; I have a script that I use to
purge mail out of my MH instance.  It pulls all the messages that
appear deleted (e.g., a message that gets a comma prepended to the
filename) as well as those that have gotten archived by the way I use
Maildir, and stows them all into an MH "Deleted" folder.

There are expected to be a great many duplicates, as, in order to
be careful not to lose things as they get refiled to apropos places, I
tend to keep copies around.

I have a step that deduplicates the messages, where I compare via MD5
checksums, and throw away the dupes before taking what's left over in
~/Mail/Deleted and stowing that into a compressed tarball on the
possibility of need for future reference.
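
The checksum-only deduplication step could look roughly like the sketch
below.  It is not the actual script; the function names are made up for
illustration, and it assumes each message is a plain file in a single
folder.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder: Path) -> list:
    """Return files whose MD5 matches an earlier file (in sorted order)."""
    seen = {}        # checksum -> first file seen with that checksum
    duplicates = []  # later files sharing a checksum with an earlier one
    for path in sorted(p for p in folder.iterdir() if p.is_file()):
        checksum = md5_of(path)
        if checksum in seen:
            duplicates.append(path)
        else:
            seen[checksum] = path
    return duplicates
```

The function only reports candidates; whether to unlink them is left to
the caller.  A colliding pair of different messages would be reported as
duplicates here, which is exactly the risk being accepted.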

It's *possible* that I could lose a few messages to collisions, but
it's certainly no disaster, as this was mail I was not really planning
to ever do anything with again.  So I accept the possibility of a few
losses here; I don't care.

If I were using this to dedupe, say, my photograph collection, I
wouldn't consider the checksum to be enough, as I don't want to
Perhaps, Randomly lose a few pictures rather magically.
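
For the photograph case, one approach is to treat the checksum as a
first-pass filter and confirm each match byte for byte before calling it
a duplicate.  A minimal sketch follows, assuming a flat folder of files;
the function names and the choice of SHA-256 are illustrative
assumptions, not something I actually run.

```python
import filecmp
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_duplicates(folder: Path) -> list:
    """Return (original, duplicate) pairs confirmed byte for byte."""
    candidates = {}  # digest -> distinct files seen with that digest
    confirmed = []
    for path in sorted(p for p in folder.iterdir() if p.is_file()):
        originals = candidates.setdefault(file_digest(path), [])
        for original in originals:
            # shallow=False forces a comparison of the file contents.
            if filecmp.cmp(original, path, shallow=False):
                confirmed.append((original, path))
                break
        else:
            # First sighting, or same digest but different bytes: keep it.
            originals.append(path)
    return confirmed
```

The extra comparison only runs when two digests already match, so the
cost over the checksum-only version is one additional read of each
duplicate pair.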

Mind you, if false duplicates seem to be nearly impossible, people
will be liable to have an excessive level of trust.  Until the
plane/train crashes, or some other such disaster, and they'll swing
back in the other direction...

I'm slightly surprised that SCMs aren't using UUIDs instead; they tend
to have more suitable uniqueness guarantees.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




