Some points regarding dedup
@David Halko, compression is not encryption. I'd expect most crypto systems to produce different output given the same input block (even the UNIX passwd command throws in a salt to make things more difficult for an attacker).
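To illustrate the salt point, here's a minimal sketch (my own toy example, simulating salted storage with salted SHA-256 rather than real encryption): the same input block produces different stored bytes on every write, so a dedup engine would see them as unique.

```python
# Toy illustration: why salted/encrypted data defeats block dedup.
# Each write of the SAME plaintext gets a fresh random salt, so the
# stored bytes differ and dedup finds no match.
import hashlib
import os

def store_block(plaintext: bytes) -> bytes:
    salt = os.urandom(16)  # fresh salt per write, as with passwd
    return salt + hashlib.sha256(salt + plaintext).digest()

block = b"identical 4K block contents"
a = store_block(block)
b = store_block(block)
print(a != b)  # all but certainly True: same input, different stored bytes
```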
Re: Nick Stallman et al. regarding garbage collection. Yes, it's a sort of garbage collection. That's the point -- if deleting some files doesn't reduce the reference count to zero, then there's no garbage to be collected and reclaimed as free space. The point isn't that these devices don't know how to free space; it's that sometimes people delete files that turn out to be virtually all duplicates, so very little space gets freed, or they forget about the reclamation step and wonder why space isn't freed immediately.
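The reference-counting idea can be sketched in a few lines (class and method names are mine, not any real product's): deleting a file only frees space when a block's count actually hits zero.

```python
# Minimal sketch of a reference-counted dedup store. Deleting a
# duplicate decrements the count but frees nothing; only the last
# delete reclaims the block's space.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}     # hash -> block data
        self.refcount = {}   # hash -> number of references

    def write(self, data: bytes) -> str:
        h = hashlib.sha256(data).hexdigest()
        if h in self.blocks:
            self.refcount[h] += 1      # duplicate: no new space consumed
        else:
            self.blocks[h] = data
            self.refcount[h] = 1
        return h

    def delete(self, h: str) -> bool:
        self.refcount[h] -= 1
        if self.refcount[h] == 0:      # last reference gone:
            del self.blocks[h]         # space can actually be reclaimed
            del self.refcount[h]
            return True
        return False                   # still referenced elsewhere

store = DedupStore()
h1 = store.write(b"same data")
h2 = store.write(b"same data")         # duplicate of the first write
print(store.delete(h1))  # False: another reference keeps the block alive
print(store.delete(h2))  # True: count hit zero, space freed
```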
"the system must scan remaining data ... ? Why? If the reference count (to the file, block, or whatever you've deduped) is going from 1 to 0, you can delete the actual data, otherwise you need to keep it."
For newly written data, the system probably cannot dedupe in real time; the write speeds would become far too low. So when a reference count goes from 1 to 0, the block could still be a duplicate of some of that new data.
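A sketch of that check (my assumption of how a post-process design might work, not any vendor's actual code): new writes land in a staging log without being deduped, so a block whose count just hit zero can't be freed until it's compared against the staged data.

```python
# Sketch of post-process dedup: a zero-refcount block may still match
# a write made since the last dedup pass, so it isn't freed yet.
import hashlib

def can_free(block_hash: str, staged_writes: list[bytes]) -> bool:
    # Re-check the dead block against data written since the last
    # dedup pass; if anything staged matches, the block must survive.
    staged_hashes = {hashlib.sha256(d).hexdigest() for d in staged_writes}
    return block_hash not in staged_hashes

staged = [b"new block A", b"new block B"]     # not yet deduped
dead = hashlib.sha256(b"new block A").hexdigest()  # refcount just hit 0
print(can_free(dead, staged))  # False: a staged write still needs it
```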
"I'm similarly stumped by "reclamation is rarely run on a continuous basis on deduplication systems – instead, you either have to wait for the next scheduled process, or manually force it to start." What prevents reclamation happening as soon as the reference count hits zero?"
I'd guess the reference count list is too large to reasonably hold in RAM and process in anything like a timely manner, so reclamation is run as a sort of batch process. Also, again, blocks whose count hits 0 may have to be compared with new data first.
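Roughly what I mean by a batch pass (purely a sketch under my assumptions; function names are made up): the refcount table is streamed through in chunks rather than held in memory, and zero-count entries are collected for reclamation.

```python
# Sketch of a batch reclamation sweep over a large refcount table,
# processed in chunks rather than loaded into RAM all at once.
def reclaim_pass(refcount_table, chunk_size=1000):
    freed = []
    chunk = []
    for block_hash, count in refcount_table:   # streamed entry by entry
        chunk.append((block_hash, count))
        if len(chunk) >= chunk_size:
            freed.extend(h for h, c in chunk if c == 0)
            chunk.clear()
    freed.extend(h for h, c in chunk if c == 0)  # flush the tail
    return freed

table = [("a", 2), ("b", 0), ("c", 1), ("d", 0)]
print(reclaim_pass(table))  # ['b', 'd']
```

In a real product this sweep might well be kicked off on a schedule, which is why freed space doesn't show up immediately.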
"The thing is, freeing up space (and chasing dead references?) based on occupied percentage seems wrong to me. It should be scheduled to low usage hours either dynamically or hard-coded to a given time (wee hours in the morning?), pretty much one more task to be added to a "defrag tool"."
It's possible it is. I get the impression that some of these dedup products don't reclaim space on any sort of continuous basis; they run a reclaim step, and I wouldn't be at all surprised if on some of them it was like a cron job.
Anyway, in one sense this article states the obvious. But in another sense, it's easy to overlook the fact that on a dedupe system, deleting a bunch of files may not free up much space. I think this was quite a good article.