'Jarvis' brings AI to the Linux command line, without Iron Man

Frumious Bandersnatch

Re: For example, if a developer defined MD5 as a hash ...

For one thing, if it's a de-dupe problem, then it's much more efficient to use a hash of each file than to do a pairwise comparison of all files that could have the same contents. The problem of finding duplicates would be pretty intractable otherwise. Secondly, since the total number of duplicates will most likely be very small (compared to the full population) and the de-dupe step needs to be done only once, I can put up with a bit of extra overhead if it increases safety and finds me extra disk space.

As it happens, I actually use SHA-256 (using a tool similar to shatag in Debian), but notwithstanding that, I don't think that there's a problem using MD5 as a kind of heuristic to find identical files, so long as you have a second line of validation after it. In fact, you could use one or more different hashing functions as part of the validation step here, before you delete and create hard-linked copies...
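A rough sketch of that two-stage idea in Python, for illustration only. The function names (file_digest, find_duplicates) are made up for this example and this is not how shatag works; it assumes MD5 is available in your hashlib build. MD5 is used purely as a cheap first-pass filter, then each candidate group is re-checked with SHA-256 and a byte-for-byte comparison. It only reports duplicates; the delete-and-hard-link step is deliberately left out.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Pass 1 (heuristic): bucket regular files by MD5.
    candidates = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and not path.is_symlink():
            candidates[file_digest(path, "md5")].append(path)

    # Pass 2 (validation): only groups with 2+ files need a second look.
    groups = []
    for paths in candidates.values():
        if len(paths) < 2:
            continue
        # Re-bucket by a different hash function (SHA-256).
        confirmed = defaultdict(list)
        for path in paths:
            confirmed[file_digest(path, "sha256")].append(path)
        for same in confirmed.values():
            # Final check: compare actual bytes against the first file.
            if len(same) > 1 and all(
                filecmp.cmp(same[0], p, shallow=False) for p in same[1:]
            ):
                groups.append(same)
    return groups
```

Since most files never land in a multi-file MD5 bucket, the expensive second pass only touches the handful of likely duplicates, which is the "bit of extra overhead" trade-off described above.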
