Reply to post:

This storage startup dedupes what to do what? How?

Frumious Bandersnatch Silver badge

OK, so I'm going back on my promise to not write any more, but ...

1. Does git's normal input type lead to fewer collisions?

Your line of reasoning about the structure of git's input leading to fewer collisions is pure conjecture. There is no "ipso facto" about your conclusions. You say that subspacing the input space discards potential collisions, but you neglect to consider that it also removes many non-colliding inputs, too. For your argument to work you'd have to explain why subspacing preferentially removes more colliding inputs, proportionately speaking.

In fact, the input to hash algorithms are (in the class of) infinite strings, so the normal rules of logic when dealing with finite numbers simply don't apply. Half of an infinite space is still an infinite space. In those terms, it's hard to see how your counting argument can hold any water on this point.

2. Is assuming hashes are collision-free a "reasonable" assumption in a block-centric app?

I believe that it is, but you have to plug the block size and digest size into the formulas for the birthday paradox so that you get an acceptable collision rate for a given capacity/occupancy (ie, how many blocks you intend to store).

A simple analysis should be able to convince you that doubling the number of possible hash buckets (ie, adding one more bit to the digest) will more than halve the collision rate for practical occupancy levels (which is a minuscule fraction of the hash space). This takes it out of the realm of being a question of pure mathematics and turns it into an applied maths/engineering question. And that's why I'm saying that it's an entirely reasonable assumption to make if you pick the right hash size for the problem at hand.

The actual formula to determine the probability that there is no collision, with a hash of h bits and an occupancy level o is:

(2 h Po) / 2 ho.

where xPy is the Permutation function.

I submit, without proof, that as h increases (with o fixed), the P term will increase faster than twice the rate of increase in the other term, assuming that o is sufficiently small. (in fact, it should increase much faster). Thus we can make the collision probability (1 minus the above) arbitrarily small by making a linear increase in the number of hash bits.

Basing the assumption that a well-chosen digest algorithm will be (practically speaking) collision-free on the above is, I am sure, completely justified.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

SUBSCRIBE TO OUR WEEKLY TECH NEWSLETTER

Biting the hand that feeds IT © 1998–2020