Thursday, February 27, 2014

Out of the Labyrinth of Hadoop Release Versions

Hadoop's release versions can be very confusing for beginners. On several occasions, the Hadoop project broke from the release/branch convention, which is the source of most of the confusion. Renumbering of versions only made things more complicated.
The 1.x release series is a continuation of the 0.20 release series. Even though 0.20 was a branch, major features were still developed on that branch instead of trunk. Two key features were added: append and security. Eventually 0.20.205 was renumbered as 1.0. There is next to no functional difference between 0.20.205 and 1.0. This is just a renumbering.

The graph below helps visualize this mess.

The 0.23 release is another branch off the main trunk, dated Dec 10, 2011. It was the first major release in an 18-month time frame to contain all major features committed to Hadoop. The major new features include HDFS federation and MapReduce 2, also known as YARN (Yet Another Resource Negotiator). YARN is a general resource management system for running distributed applications. After that, the 0.23 series had a few point releases just for bug fixes. The most recent one is 0.23.10 on Dec 11, 2013.

The 2.x release series started with 2.0.0-alpha on May 23, 2012. The major differences between the 0.23.x and 2.x release series include, but are not limited to:
  1. Support for NameNode HA (added in 2.0.0-alpha)
  2. Support for running Hadoop on Microsoft Windows (added in 2.1.0-beta)
The 2.2 release (Oct 15, 2013) is the GA and current stable release of the 2.x series. Release 2.3.0 (Feb 20, 2014) is the most recent release.

The above discussion covers release versions of the Hadoop core project. Hadoop core includes MapReduce and HDFS, but the Hadoop ecosystem is much bigger. It includes a family of ASF projects such as Hive, Pig, HBase, ZooKeeper, Sqoop, Oozie, Avro, etc. Each project has its own release schedule and versions. It is a big challenge for customers to integrate all these components and make sure that versions from different projects work together seamlessly and flawlessly. Commercial vendors such as Cloudera, Hortonworks and MapR fill this gap and provide their own distributions to solve the interoperability and compatibility issues. They also provide performance, management, and usability enhancements. There is also the Apache BigTop project, which is the open source distribution of the Hadoop stack. BigTop is called "the Fedora of Hadoop", while Cloudera (CDH), Hortonworks (HDP), and MapR (M3/M5/M7) are seen as "the Red Hat of Hadoop". The BigTop project is built from the same code base as its upstream projects. Compared to the distributions from commercial vendors, BigTop tracks new versions of the Hadoop-family projects more aggressively. The commercial vendor distributions, on the other hand, focus more on stability and backward compatibility for their customers. Some projects, such as Hama and Giraph, are only included in BigTop and not in commercial distributions like CDH.

Wednesday, February 19, 2014

Déjà Vu: TokuMX is to MongoDB what InnoDB is to MySQL

While I was writing about how MongoDB lacks the granularity of document-level locking or even collection-level locking, I came across TokuMX. It is touted as an open source, high-performance distribution of MongoDB with a 20x performance improvement. On top of that, it corrects almost all of the major problems the developer community complains about in MongoDB.

Besides DB-level locking, the other major complaints about MongoDB are:
  1. Poor space management with fragmented data. You can run the "compact" command to rewrite and defragment all data in a collection, as well as all of the indexes on that collection. However, it is a slow process and introduces significant downtime, because the operation blocks all activity on the database. Prior to MongoDB 2.2, it was even worse -- all activity on the mongod was blocked. 
  2. Difficult to keep working set in the memory. If the search has to hit the disk, the performance degrades dramatically. Usually the idea is to buy more memory and/or use SSD, but that's just throwing money at the problem. Besides, because of the poor space management, the cost of SSD is going to be higher. 
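The "compact" operation mentioned in point 1 can be sketched as follows. This is a minimal illustration, not MongoDB's implementation; the database/collection names are made up, and the pymongo call is shown only in a comment since it needs a live mongod.

```python
# Minimal sketch of issuing MongoDB's "compact" command (names illustrative).
def compact_command(collection):
    """Build the command document a driver sends to mongod for compact."""
    return {"compact": collection}

print(compact_command("events"))  # → {'compact': 'events'}

# With a live server and pymongo installed, the equivalent would be:
#   from pymongo import MongoClient
#   db = MongoClient()["mydb"]
#   db.command("compact", "events")  # blocks other operations on "mydb" (2.2+)
```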
The root cause of both problems is the poor underlying storage-layer implementation. The storage layer is a linked list of BSON documents stored in data files, and all data files are memory-mapped into virtual memory by the OS. MongoDB just reads/writes to RAM and lets the OS take care of the rest. It relies on the kernel's page-caching algorithm, typically LRU, so it is possible for hot data to be evicted from memory in favor of less frequently accessed data. The LRU algorithm also does not prioritize things like indexes. After all, the OS does not know the underlying storage belongs to a database. In one of the presentations at MongoDB London 2013, "Understanding MongoDB storage for Performance and Data Safety", one of the touted pros was "No complex memory/disk code in MongoDB, huge win!". As a matter of fact, it is a huge win for MongoDB engineers who do not want to implement a state-of-the-art storage layer themselves. It is a huge loss for MongoDB customers.
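To see why a generic LRU page cache hurts a database, here is a toy sketch (not MongoDB code; the page names are made up). A single cold scan pushes the hot index pages out of a small LRU cache, exactly because the cache cannot tell an index page from a data page:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache illustrating the kernel page-cache behavior above."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, page):
        # Touching a page moves it to the most-recently-used end.
        if page in self.pages:
            self.pages.move_to_end(page)
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
            self.pages[page] = True

cache = LRUCache(capacity=4)
for page in ["index:a", "index:b", "hot:1", "hot:2"]:
    cache.access(page)

# A one-off scan of cold documents evicts every hot/index page,
# because LRU gives indexes no special priority.
for page in ["cold:1", "cold:2", "cold:3", "cold:4"]:
    cache.access(page)

print(sorted(cache.pages))  # → ['cold:1', 'cold:2', 'cold:3', 'cold:4']
```

A database-aware storage engine can pin or prioritize index pages instead; an OS page cache cannot.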

That brings back memories of MySQL. MySQL originally used MyISAM as its main storage engine. MyISAM has no transactional support, uses table-level locking, is prone to data corruption, and takes a long time to repair after a crash. MySQL 5.5 switched to InnoDB as the default storage engine, which has transactional support, row-level locking, and faster performance. Only then did MySQL become a desirable choice for many enterprises.

TokuMX is to MongoDB what InnoDB is to MySQL. It keeps the most appealing features of MongoDB, such as dynamic schemas, the JSON-style document format, rich document-based queries, sharding support, replication, etc. It also supports all the MongoDB drivers the same way, so it is a drop-in replacement for MongoDB in your application. The key change in TokuMX is that it replaces the underlying storage engine of MongoDB. It uses Fractal Tree indexes instead of B-tree indexes, delivering better and more consistent performance when the data size outgrows memory. It provides ACID transaction support. It has MVCC support and allows more concurrent reads/writes to the database with document-level locking. All data is compressed by default, saving storage space (up to 90%) without downtime. There is no data fragmentation, and there is no need for the "compact" command. The current 1.4 release is fully compatible with MongoDB 2.4, except that it lacks text indexes/search and geospatial indexes/queries.

It will be interesting to see how the relationship between MongoDB and TokuMX plays out in the future. 

Monday, February 17, 2014

MongoDB DB level locking

MongoDB's database-level locking is probably its most cited drawback. Before version 2.2, there was only one "global" lock per mongod instance. For example, if you have 5 databases on one mongod instance and one takes a write lock, the other 4 are unavailable for reads or writes. Anyone coming from an RDBMS background is going to be shocked by such an awful design, since every major RDBMS provides row-level lock granularity.

Beginning with version 2.2, MongoDB implements locks on a per-database basis for most read and write operations. This is certainly a step in the right direction. However, it is still desirable to implement write locks at a more granular level, such as the collection or document level. There is a good explanation on Stack Overflow about why DB-level locking is not as big a performance concern as it seems. There are many valid points in the post and I recommend reading it. But the author did put some spin on the issue --- understandable, since he works for MongoDB. You can see sentences like "with a properly designed schema and a typical workload, ...". The author also stressed that MongoDB locking works better than in an RDBMS because it uses lightweight latches. As a matter of fact, RDBMSs like Oracle have used latches for a long time. The author did admit that "This does mean that the writes to all documents within a single database get serialized.".

Take a look at the issue in the MongoDB issue tracker. 346 people have voted for this feature as of right now. It was created back in 2010 and it is still a problem today. It was originally said that the issue would be taken care of in version 2.4, but it was not even planned for 2.6. Now the word is that more granular locking will definitely be added in the 2.8 release. We will have to wait and see.

One hack you can use in the meantime is to move your collections into separate databases to simulate collection-level locking. Still, this solution sounds clumsy. Or you can add more shards to make write operations more scalable.
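The collections-into-databases workaround can be sketched like this. The naming scheme and helper below are purely illustrative, not part of any MongoDB API:

```python
def collection_to_db(base_db, collection):
    """Map each collection to its own database, e.g. ("app", "users") ->
    "app_users", so writes to different collections contend on different
    database-level locks. Illustrative naming scheme only."""
    return "%s_%s" % (base_db, collection)

print(collection_to_db("app", "users"))  # → app_users

# With pymongo (assumed installed), the application would then write
#   client[collection_to_db("app", "users")]["users"].insert(doc)
# instead of
#   client["app"]["users"].insert(doc)
```

The downside, as noted above, is clumsiness: cross-collection features that expect one database (e.g. shared credentials or stats) now span many.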