Reply to post:

This storage startup dedupes what to do what? How?

Vic

I think that a well-designed hash won't have significantly different collision rates over the two kinds of data for a given capacity range (i.e., the number of messages to be stored).

But we already know that the hashes used in git were not specifically designed for text data; thus we can expect the data blocks that would cause a hash collision with our desired block to be spread randomly throughout the possible input space. If we now dramatically restrict that input space (by requiring our input to be text data), we have ipso facto discarded all those possible collisions from data blocks that were not text data - leaving a much reduced number of possible collisions, and hence a much reduced probability of collision.

Within the filesystem view, we cannot do that input selection; any possible block variant is permitted. Thus we cannot achieve that probability reduction, and the process is much less safe.
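As a toy illustration of the counting point above (all parameters are made up purely for demonstration: a deliberately tiny 8-bit digest over 2-byte blocks), one can count how many other blocks share a given block's digest, first over all possible blocks and then over printable-text blocks only:

```python
import hashlib

def toy_digest(block: bytes) -> int:
    # Deliberately tiny 8-bit digest so collisions are easy to count by brute force
    return hashlib.sha256(block).digest()[0]

target = b"hi"                      # an arbitrary reference block (itself printable text)
target_digest = toy_digest(target)

def count_colliders(blocks) -> int:
    # How many *other* blocks hash to the same digest as the reference block?
    return sum(1 for b in blocks if b != target and toy_digest(b) == target_digest)

all_blocks  = (bytes([a, b]) for a in range(256) for b in range(256))
text_blocks = (bytes([a, b]) for a in range(0x20, 0x7F) for b in range(0x20, 0x7F))

print("colliders among all 65,536 two-byte blocks:", count_colliders(all_blocks))      # roughly 256
print("colliders among the 9,025 printable-text blocks:", count_colliders(text_blocks))  # roughly 35
```

The toy scale is only to make the counting visible; a real dedupe digest would be 128-256 bits.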

Mathematically speaking, you're absolutely right.

I know.

All I'm saying is that, in practice, assuming that the same hash implies the same contents can be a reasonable engineering assumption.

And I'm saying that is not a reasonable assumption in the general case. I have outlined my reasoning above.

You can plug the numbers into some formulas to find the expected collision rate and choose your digest size to make the risk "low enough" that it's not worth worrying about.

For the general case, without subsequent block-checking? To rule out collisions entirely, the hash has to be at least as large as the block. And that is not exceptionally useful...
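For reference, the "formulas" alluded to above are the standard birthday-bound approximation. Here is a rough sketch (the block counts and digest sizes are purely illustrative, not taken from the thread) of how the expected collision risk scales with digest size for a given number of stored blocks:

```python
import math

def collision_probability(n_blocks: int, digest_bits: int) -> float:
    # Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(digest_bits + 1))
    exponent = -(n_blocks * (n_blocks - 1)) / (2.0 * 2.0 ** digest_bits)
    return -math.expm1(exponent)    # numerically stable for very small probabilities

# Illustrative numbers only: a billion stored blocks, deduplicated purely by digest.
print(collision_probability(10**9, 160))   # ~3e-31  (a SHA-1-sized digest)
print(collision_probability(10**9, 64))    # ~0.027  (a 64-bit digest: a few percent)

# Note: these are probabilities, not guarantees. To *guarantee* no collision without
# re-reading the block, the pigeonhole principle forces the digest to be at least as
# large as the block itself.
```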

Vic.
