Re: unique mathematical hash
> I must be missing something here. Surely a hash can only be unique if it's the same size as the block of data?
You're quite right that there are an infinite number of input documents, and only a finite number of hash values. What you are missing is probability.
With a well-designed hash function, the distribution of hashes over input documents is uniform and (apparently) random. The probability of finding two documents with the same hash is vanishingly small.
With a 256-bit hash, you would need to hash around 2^128 different documents before there is a reasonable chance (roughly 40%) that any two of your documents will have the same hash. (Google for "birthday paradox".)
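If you want to play with the numbers yourself, here is a minimal Python sketch of the usual birthday approximation, p ~= 1 - exp(-n(n-1) / (2 * 2^bits)). It assumes the hash outputs are uniformly distributed, and the function name collision_probability is just something I made up for illustration.

```python
import math


def collision_probability(num_docs: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_docs hashes collide.

    Assumes an ideal hash: every hash_bits-bit value is equally likely.
    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    exponent = -num_docs * (num_docs - 1) / (2 * 2**hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# About 0.39 once you reach 2^128 documents with a 256-bit hash.
print(collision_probability(2**128, 256))
# Still astronomically small after a trillion documents.
print(collision_probability(10**12, 256))
```

The same function shows why short hashes are a bad idea: try hash_bits=64 with a few billion documents.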
Even if you are hashing one trillion documents per second, it would be about 10^19 years before you had hashed 2^128 documents. Plus you need somewhere to store those 2^128 hashes to compare them.
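The arithmetic behind those figures is easy to check. This is only a rough sketch: the helper names are mine, a year is taken as 365.25 days, and the storage figure counts just the raw 32-byte hashes with no index overhead.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 3.16e7 seconds


def years_to_hash(num_docs: int, docs_per_second: int) -> float:
    """Years needed to hash num_docs documents at a constant rate."""
    return num_docs / docs_per_second / SECONDS_PER_YEAR


def storage_bytes(num_docs: int, hash_bits: int) -> int:
    """Bytes needed to keep one raw hash per document."""
    return num_docs * (hash_bits // 8)


# One trillion documents per second, 2^128 documents in total.
print(f"{years_to_hash(2**128, 10**12):.2e} years")
# 2^128 hashes of 32 bytes each is 2^133 bytes.
print(f"{storage_bytes(2**128, 256):.2e} bytes")
```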
So if you find two blocks in your storage array with the same hash, the probability is overwhelming that they *are* the same block. And if you are paranoid, you can always read them both and compare them before discarding the duplicate.
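For the paranoid variant, the idea can be sketched like this. It is a toy in-memory store, not how any particular storage array does it: the BlockStore class and its put method are hypothetical names, and SHA-256 via Python's hashlib is my choice of hash.

```python
import hashlib


class BlockStore:
    """Toy deduplicating store: one copy of each distinct block."""

    def __init__(self) -> None:
        self._blocks: dict[bytes, bytes] = {}  # digest -> block contents

    def put(self, block: bytes) -> bytes:
        """Store block if it is new and return its SHA-256 digest."""
        digest = hashlib.sha256(block).digest()
        existing = self._blocks.get(digest)
        if existing is None:
            # First time this hash is seen: keep the block.
            self._blocks[digest] = block
        elif existing != block:
            # Same hash, different bytes: a genuine collision.
            raise RuntimeError("hash collision detected")
        # Otherwise it is a verified duplicate and nothing is stored.
        return digest

    def __len__(self) -> int:
        return len(self._blocks)


store = BlockStore()
first = store.put(b"hello world")
second = store.put(b"hello world")  # duplicate, verified byte for byte
store.put(b"something else")
print(first == second, len(store))
```

The byte-for-byte comparison costs an extra read per duplicate, which is the price of not having to trust the hash at all.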
(With a broken hash function like MD5 or SHA-1, it's possible for a malicious user to intentionally submit two different blocks to your storage array which have the same hash.)