Thursday, February 27, 2014

Out of the Labyrinth of Hadoop Release Versions

Hadoop's release versions can be very confusing for beginners. On several occasions, the Hadoop project broke from the usual release/branch convention, which is the source of most of the confusion. Renumbering the versions just made things more complicated.
The 1.x release series is a continuation of the 0.20 release series. Even though 0.20 was a branch, major features were still developed on that branch instead of trunk. Two key features were added: append and security. Eventually 0.20.205 was renumbered as 1.0. There is next to no functional difference between 0.20.205 and 1.0. This is just a renumbering.

I took the below graph from to visualize this mess.

0.23 is another branch from the main trunk; release 0.23.0 is dated Nov 11, 2011. It was the first major release in an 18-month time frame to contain all the major features committed to Hadoop. The major new features include HDFS federation and MapReduce 2, also known as YARN (Yet Another Resource Negotiator). YARN is a general resource management system for running distributed applications. After that, the 0.23 release series has had only a few point releases, all for bug fixes. The most recent one is 0.23.10, released on Dec 11, 2013.

The 2.x release series started with 2.0.0-alpha on May 23, 2012. The major differences between the 0.23.x and 2.x release series include, but are not limited to:
  1. Support for NameNode HA (added in 2.0.0-alpha)
  2. Support for running Hadoop on Microsoft Windows (added in 2.1.0-beta)
The 2.2 release (Oct 15, 2013) is the GA and current stable release of the 2.x series. Release 2.3.0 (Feb 20, 2014) is the most recent release.
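
To make the lineage above concrete, here is a minimal Python sketch that maps a version string to the release series it belongs to. The function name release_series and the series labels are my own illustrative choices, not anything defined by the Hadoop project, and it only covers the release lines discussed in this post.

```python
import re


def release_series(version: str) -> str:
    """Map a Hadoop version string to a release series label.

    Hypothetical helper: the labels are illustrative and only cover
    the release lines discussed in this post.
    """
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))

    # 0.20.x was continued as 1.x (0.20.205 was renumbered as 1.0).
    if (major, minor) == (0, 20) or major == 1:
        return "1.x lineage"
    # 0.23.x is a separate branch from trunk (HDFS federation, YARN).
    if (major, minor) == (0, 23):
        return "0.23.x"
    # 2.x adds NameNode HA and, later, Windows support.
    if major == 2:
        return "2.x"
    return "not covered here"


if __name__ == "__main__":
    for v in ["0.20.205", "1.0.0", "0.23.10", "2.0.0-alpha", "2.3.0"]:
        print(v, "->", release_series(v))
```

Only the leading major.minor pair is inspected, so suffixes such as -alpha or -beta do not affect the result.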

The discussion above covers release versions in the Hadoop core project. Hadoop core includes MapReduce and HDFS, but the Hadoop ecosystem is much bigger. It includes a family of ASF projects such as Hive, Pig, HBase, ZooKeeper, Sqoop, Oozie, and Avro, each with its own release schedule and version numbers. It is a big challenge for customers to integrate all these components and make sure that versions from different projects work together seamlessly. Commercial vendors such as Cloudera, Hortonworks, and MapR fill this gap by providing their own distributions that address the interoperability and compatibility issues. They also provide performance, management, and usability enhancements.

There is also the Apache Bigtop project, an open source distribution of the Hadoop stack. Bigtop is called "the Fedora of Hadoop", while Cloudera (CDH), Hortonworks (HDP), and MapR (M3/M5/M7) are seen as "the Red Hat of Hadoop". Bigtop is built from the same code base as its upstream projects. Compared to the commercial distributions, Bigtop tracks new versions of Hadoop-family projects more aggressively, whereas the commercial vendors focus more on stability and backward compatibility for their customers. Some projects, such as Hama and Giraph, are included in Bigtop but not in commercial distributions like CDH.

