This storage startup dedupes what to do what? How?

Frumious Bandersnatch

I don't know what you're trying to say there, Vic. By your logic, git's source having less entropy should imply less entropy in the output hashes, which would in turn imply more collisions.

Anyway, entropy is irrelevant, since the point of a good (cryptographic) message digest is to decorrelate patterns in the input from those in the output hash. Entropy measured over the set of output hashes should be the same regardless of input (again, for a "good" hash algorithm).
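
To make that concrete, here's a rough sketch (mine, not anything from the article) that hashes a pile of deliberately low-entropy inputs and looks at how the output bits are distributed. SHA-256 from Python's hashlib is just my stand-in for a "good" digest, and the `one_bit_fraction` helper and the `record N` inputs are made up for illustration.

```python
import hashlib

def one_bit_fraction(messages):
    """Return the fraction of 1 bits across the SHA-256 digests of all messages."""
    ones = 0
    total = 0
    for msg in messages:
        digest = hashlib.sha256(msg).digest()
        # Count the set bits in this digest.
        ones += bin(int.from_bytes(digest, "big")).count("1")
        total += len(digest) * 8
    return ones / total

# Highly patterned, low-entropy inputs: only the counter changes.
patterned = [f"record {i}".encode() for i in range(2000)]
print(one_bit_fraction(patterned))
```

If the digest is doing its job, the printed fraction should sit very close to 0.5 even though the inputs are nearly identical to one another.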

I'm just making the point that you can't get away from the pigeonhole principle, but if you have vastly more holes than pigeons, you can put a bound on the probability of a collision, decide what level of risk is acceptable, and design your application on the assumption that this risk is "vanishingly small enough".
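
For anyone who wants numbers, the usual birthday approximation gives that bound: with n items and N possible hash values, the chance of at least one collision is roughly 1 - exp(-n(n-1)/(2N)). A minimal sketch follows, assuming a uniformly distributed hash. The function name and the example figures (a 160-bit digest and a trillion blocks) are my own picks, not anything from the article.

```python
import math

def collision_probability(n_items, n_slots):
    """Birthday approximation: chance that at least two of n_items
    uniformly hashed values land in the same one of n_slots."""
    exponent = n_items * (n_items - 1) / (2 * n_slots)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)

# Sanity check against the classic birthday problem: 23 people, 365 days.
print(collision_probability(23, 365))

# Hypothetical example: a trillion blocks under a 160-bit hash.
print(collision_probability(10**12, 2**160))
```

The second figure is the sort of thing you'd compare against, say, the odds of undetected hardware error when deciding whether the risk is small enough for your application.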

It's all a trade-off, like Trevor explained in the article and in his post above.

