Reply to post:

This storage startup dedupes what to do what? How?

Frumious Bandersnatch Silver badge

OK, this is my final post on the question, Vic. I promise.

I understand better now the point you're making about "entropy". You're saying that, eg, if I hashed all the natural numbers, represented in ascii, in a range (let's say they're zero-padded up to some minimum block length to make a fair comparison) that there will be fewer collisions than if the blocks all contained random 8-bit data. I think that a well-designed hash won't have significantly different collision rates over the two kinds of data for a given capacity range (ie, the number of messages to be stored).

And I'm, saying that, in the general case, [that the risk is "vanishingly small enough"] is simply not true

Mathematically speaking, you're absolutely right. Any hash can and will have collisions as proved by the counting argument/pigeonhole principle (or birthday paradox). All I'm saying is that in practice assuming that the same hash implies the same contents can be a reasonable engineering assumption. As evidenced by how git does it.

The assumptions that git makes on the actual hash size (160 bits) and expected number of commits (and, if I were to accept it, a factor to account for "entropy") aren't going to hold for a massive block-based de-dupe system, but you can plug the numbers into some formulas to find the expected collision rate and choose your digest size to make the risk "low enough" that it's not worth worrying about (eg, a mean time to collision of 50 years, if that's what you want).

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon


Biting the hand that feeds IT © 1998–2020