understanding probability [was Re: war story: parallel(1) command]

D. Hugh Redelmeier hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Sat Aug 3 17:03:11 UTC 2013


| From: Jamon Camisso <jamon.camisso-H217xnMUJC0sA/PxXw9srA at public.gmane.org>

| Absolutely. We invented digital computers to externalize computational
| work for us as a society because it is faster and more efficient than
| doing it by hand or with a slide rule.

We invented the computer to do calculations that were not feasible
otherwise.

I think that it is more of a surprise that we now make them so
inexpensively that they can take on computations just for our
convenience or entertainment.

The first computers cost much more than houses.  Now the cheapest ones
are throw-away.

| As such, we ought to use computers to do work, like checking the results
| of algorithms that are good enough for most intents and purposes, but
| not all, even accounting for an outlier in a 2^128 space.

That logic is a bit iffy.

The phrase "Good enough", in our world, often applies to 90%
solutions.  The chance of two distinct files having the same MD5 sum
in a pool of n files is at most
	n * (n-1) / 2 ^ 128
(assuming MD5 is "good", i.e. its outputs look uniformly random; the
formula counts every ordered pair of files, so it slightly overstates
the true birthday probability).
I was testing 800 files.
    $ bc -l
    scale=50
    800 * (800 -1) / 2 ^ 128
    .00000000000000000000000000000000187843997261401543

That's qualitatively different from 10% (100%-90%).
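
For the curious, here is a small Python sketch (mine, not part of the
original thread) that computes both that pairwise bound and the exact
birthday probability, assuming the hash output is uniformly random:

    #!/usr/bin/env python3
    # Collision probability for n files under a 128-bit hash, assuming
    # the hash output is uniformly random.
    from fractions import Fraction

    def pair_bound(n, bits=128):
        # The n * (n-1) / 2^bits figure used above: an upper bound from
        # summing 1 / 2^bits over every ordered pair of files.
        return Fraction(n * (n - 1), 2**bits)

    def exact_birthday(n, bits=128):
        # P(at least one collision) = 1 - prod_{i=0..n-1} (1 - i/2^bits)
        space = 2**bits
        p_none = Fraction(1)
        for i in range(n):
            p_none *= Fraction(space - i, space)
        return 1 - p_none

    n = 800
    print(float(pair_bound(n)))      # ~1.9e-33, matches the bc output
    print(float(exact_birthday(n)))  # ~9.4e-34, about half the bound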

Do you know how your crypto works?  It uses industrial-grade primes:
numbers that are only very, very probably prime.
  <https://en.wikipedia.org/wiki/Industrial-grade_prime>

With these probabilistic methods, you get to turn a knob: the more work
you do, the more probably correct the result is.  That's actually a great
property.  But you can never get to 100%, and there is rarely a need to:
everything in the physical world is uncertain to a much greater extent.
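
Industrial-grade primes are typically found with a probabilistic test
such as Miller-Rabin, and the knob there is the number of test rounds.
Here is a rough Python sketch (mine, not from the thread); a composite
number survives any one round with probability at most 1/4, so the
error bound after k rounds is 4^-k:

    #!/usr/bin/env python3
    # Sketch of a Miller-Rabin probabilistic primality test.  The
    # `rounds` parameter is the knob: a composite slips through any one
    # round with probability at most 1/4, so the error bound is
    # 4**-rounds.
    import random

    def is_probable_prime(n, rounds=40):
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        # write n - 1 as d * 2^s with d odd
        d, s = n - 1, 0
        while d % 2 == 0:
            d //= 2
            s += 1
        for _ in range(rounds):          # <-- the knob
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(s - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False             # definitely composite
        return True                      # prime, error <= 4**-rounds

Forty rounds gives an error bound of 4^-40, roughly 10^-24: already far
smaller than anything you could measure, and you can keep turning the
knob if that bothers you.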

| For example, a forensic analysis of a filesystem would have to show with
| absolute certainty that any hash collision, however remotely improbable
| in any infinite number of universes' lifetimes, has been documented,
| files compared and accounted for.

If you are going in front of a jury, probabilities are not going to be
easy to deal with.  Yet that is just what DNA evidence involves.

But I'm not going to a jury trial over catching duplicate files on my
disk drive.

When it comes down to it, it's a matter of intuition and abstraction.
It is a lot harder to keep uncertainty in one's mind.  I would claim
that .00000000000000000000000000000000187843997261401543 isn't
uncertain in a practical sense.  So I'm willing to classify that as
impossible in my mind.

The probability of undetected read errors on a disk drive is much,
much higher, and hence eclipses this one:
<https://www.perform.csl.illinois.edu/Papers/USAN_papers/09ROZ01.pdf>

| Close enough is just not good enough, and as was originally pointed out,
| file I/O is the bottleneck, not CPU. So once a file is read, might as
| well account for even the remote possibilities with a few extra CPU loops.

Quibble: the work may be significantly greater.  For example, if every
file is duplicated once (a common case, not a worst case), you are
doubling the file I/O, which we agree is the bottleneck in my case.
Furthermore, the file I/O will no longer be purely sequential, so disk
seeking might become significant in this step.  And there is more
coding required.
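
To make the quibble concrete, here is a rough Python sketch of the
whole job (mine, not any actual script from this thread): group files
by size, then by MD5, and only if asked, reread and compare the bytes.
That verify pass is where the second full read of every duplicate
comes from, and those reads are scattered rather than sequential:

    #!/usr/bin/env python3
    # Rough sketch: find duplicate files by size, then by MD5, with an
    # optional byte-for-byte verification pass.
    import filecmp
    import hashlib
    import os
    import sys
    from collections import defaultdict
    from itertools import combinations

    def md5_of(path, bufsize=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root, verify=False):
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p):
                    by_size[os.path.getsize(p)].append(p)
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            by_hash = defaultdict(list)
            for p in paths:               # first full read of each file
                by_hash[md5_of(p)].append(p)
            for group in by_hash.values():
                if len(group) < 2:
                    continue
                if verify:                # second read: the extra I/O
                    for a, b in combinations(group, 2):
                        assert filecmp.cmp(a, b, shallow=False)
                yield group

    if __name__ == "__main__":
        for group in find_duplicates(sys.argv[1], verify=False):
            print(group)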

I will admit sympathy for your position.  I'm not really comfortable
with floating point arithmetic for a similar reason.  A large part of
the field of numerical analysis is about figuring out how error bounds
accumulate.  Much calculation depends on the fact that the actual
errors are not nearly as bad as the worst-case.
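
A toy example of the same thing (mine, not from the thread): adding
0.1 a million times accumulates visible rounding error, yet far less
than the worst-case bound for a million additions would allow.

    #!/usr/bin/env python3
    # Rounding error accumulates in a naive sum, but far less than the
    # worst case; math.fsum tracks the error and returns the correctly
    # rounded result.
    import math

    xs = [0.1] * 1_000_000
    naive = 0.0
    for x in xs:
        naive += x
    print(naive)          # roughly 100000.00000133288, not 100000.0
    print(math.fsum(xs))  # 100000.0
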
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




