war story: parallel(1) command

Wed Jul 31 15:20:11 UTC 2013

After reading this interesting discussion, I guess I would use the 
following algorithm:

Create a table with filename, its size, and MD5. Just that only.

If MD5 is the same, when checking, then check also a stronger hash on 
file, its size and file name.

I am right? Or perhaps I missed something?

zb.

On 31/07/13 18:51, Christopher Browne wrote:
> On Wed, Jul 31, 2013 at 9:24 AM, D. Hugh Redelmeier <hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org> wrote:
>> | From: James Knott <james.knott-bJEeYj9oJeDQT0dZR+AlfA at public.gmane.org>
>>
>> | You're forgetting one key point.  A hash is a limited number of bits,
>> | which means you cannot have a unique hash for every possible file.
>> | Collisions, while not likely, are not impossible.
>>
>> The hashes we're talking about (long cryptographic hashes) make
>> accidental collisions practically impossible.  Git, for example,
>> assumes that.
> I hope that comes with an "expect, but verify."
>
> If it's a hard dependency, and there's no test, then your repository
> might get destroyed if a (highly improbable) collision did took place.
>
> It's tempting to say "no need to bother, [heat death of universe]...",
> but depending on how bad it is to have a collision, it may be somewhat
> important to check.
>
> For Git, a collision would have pretty perverse effects; it would mean
> two changes seem like they're the same, and they'll both be treated as
> the parents of successor patches, which would be Mighty Destructive to
> the repository, as it makes it stop making sense.  (Particularly if
> you've been pruning out disconnected patches, so that it's pretty
> certain that those that remain will be parents of something.)
>
> I have a "don't care" case, myself; I have a script that I use to
> purge mail out of my MH instance.  It pulls all the messages that
> appear deleted (e.g. - a message that gets a comma prepended to the
> filename) as well as those that have gotten archived by the way I use
> Maildir, and stows them all into an MH "Deleted" folder.
>
> There are expected to be a great deal of duplicates, as, in order to
> be careful not to lose things as they get refiled to apropos places, I
> tend to keep copies around.
>
> I have a step that deduplicates the messages, where I compare via MD5
> checksums, and throw away the dupes before taking what's left over in
> ~/Mail/Deleted and stowing that into a compressed tarball on the
> possibility of need for future reference.
>
> It's *possible* that I could lose a few messages to collisions, but
> it's certainly no disaster, as this was mail I was not really planning
> to ever do anything with again.  So I accept here the possibility of
> there being a few losses, don't care.
>
> If I were using this to dedupe, say, my photograph collection, I
> wouldn't consider the checksum to be enough, as I don't want to
> Perhaps, Randomly lose a few pictures rather magically.
>
> Mind you, if false duplicates seem to be nearly impossible, people
> will be liable to have an excessive level of trust.  Until the
> plane/train crashes, or some other such disaster, and they'll swing
> back in the other direction...
>
> I'm slightly surprised that SCMs aren't using UUIDs instead; they tend
> to have more suitable uniqueness guarantees.

--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists