Dedupe isn't always the same...
It’s good to see Microsoft raising the visibility of the data deduplication opportunity. As your article indicates, there are some key issues that one need be concerned about when considering deduplication. As I see it, they can be grouped into three areas: performance, scalability and efficiency. Performance is critical because it can limit dedupe’s use cases. Fortunately, there are faster processors constantly coming to market that also have multi-core capabilities which are addressing the issue. There have been recent indexing and memory management advances that provide additional incremental scalability, which addresses my second point. We are seeing rampant data growth, so the more dedupe can scale the better it will be able to keep up with the growth. Efficiency is also important, and is an often overlooked characteristic. Deduplication solutions that only work with large granularity (ex. 128K) are markedly less effective at saving space than those that can efficiently handle small chunk sizes. Furthermore, the amount of RAM required to perform efficient deduplication can make the process extremely inefficient, particularly if a GB (or more) of RAM is required to deduplicate a TB of storage. Remember, in 2011 both components cost about the same!
There are many flavors of dedupe available in today’s marketplace. As we’ve often seen in the past, the open source community steps up to the plate when there’s a real need that is not being filled, so it should be no surprise to see developers building their own solutions. Dedupe solutions are differentiated based on how they address performance, scalability and efficiency requirements. When your readers can find all of these requirements addressed in a single offering, they are looking at a leading dedupe technology.