I don't think the author of this article should be allowed to write about Apache Hadoop -it's painful to read. I hope nobody actually believes a word this person says.
1. The only official release of Apache Hadoop comes from the Apache Software Foundation, the last of the 0.20 releases, 0.20.203 came out yesterday with lots of bug fixes from Yahoo! and Cloudera in it.
2. Any other so called "distribution" of Hadoop is not "a distribution" unless it is just the Apache release packaged for easy installation (as Thomas Koch does for debian) -it is a derivative work, containing code that is not in the Apache release.
3. Such derivative works can be open source (Cloudera) or closed source (EMC, IBM).
4. Any closed source derivative work forces the distributor to maintain their branch indefinitely.
5. Any derivative work forces the developer to test at the same scale as Y! and Facebook (thousands of machines, tens of PB of storage), or they cannot claim that it scales up.
6. Any closed source derivative work will only support bug fixes and patches at a rate determined by the closed source developer team, and provided at a cost determined by the price of that developer team.
7. Apache only provide support for the official apache release. If you use Cloudera or EMC: go talk to them about problems.
8. People who are not part of the Apache developer and user community do not get their needs addressed in the Apache releases, because we are unaware of them.
9. We, the apache developer team, have no need to take on random patches from developers of closed source derivative works unless we can see tangible benefits.
10. Finally, any derivative work that pulls out large amounts of the Hadoop codebase (e.g Brisk, EMC Enteprise HD) cannot call themselves a version of Hadoop. They are not. We, the apache community define the interfaces and what "100% compatible" means. When someone like EMC declare their derivative work is "certified 100% compatible", that is a meaningless statement. Only the official Apache Hadoop release is, implicitly 100% compatible with Apache Hadoop.
11. We reserve the right to change the semantics and interfaces to meet the community needs, on the schedule that suits the development community.
12. The rules of using the term "Hadoop" are defined in the Apache license, and it is not legal to say "a distribution of Hadoop" if it is in fact a derivative work. This is why Cloudera call their software "Cloudera’s Distribution including Apache Hadoop". EMC, Brisk and others are sailing close to the wind here.
13. The fact that Oracle are now subpoenaing Apache in the Oracle/Google lawsuit mean that the relationship between Oracle and Apache have reached a low point -even after Apache left the Java Community Program due to Oracle's unwillingness to meet its legal requirements to provide the Testing Compatibility Kit without imposing Field of Use restrictions.
14. Because of (#13), it's hard to see a team of Oracle developers being trusted or welcome in the Hadoop community. You can't serve subpoenas on the ASF and then say "we'd like to help develop a technology of yours that threatens our entire business model and margins". They won't be trusted.
I have a term for the EMC-style not-quite-Hadoop products that use the same interfaces but offer unknown semantics and a cost model on a par with the vendor's existing enterprise product line. It is "Enterprisey Hadoop". This is not Apache Hadoop supported in the Enterprise, it is some derivative work that pretends to be Hadoop but misses the point about affordable scalability through commodity hardware and an open source codebase.
SteveL, Apache Hadoop Committer. All comments are personal opinions only, etc.