Monday, December 30, 2013

Unjustified MongoDB Complaints

I have done some research on MongoDB recently. I took their online course for MongoDB developers and read the book MongoDB: The Definitive Guide. I also attended their certification program, took the test, and became a Certified MongoDB Developer.

There are lots of heated discussions about the drawbacks of MongoDB in online forums. I am going to discuss some complaints that are not justified.

Complaint 1: MongoDB does not support joins. That's a typical trait of document and column-family NoSQL databases. If we allowed joins between collections, scalability would suffer. The key to MongoDB design is understanding the data usage pattern. We usually use "embedding" to store related data together in one document. Because these data are usually accessed together, there is no need for a join operation at runtime. A typical example is an order and its line items. Since the order and line item data are always accessed together, it makes sense to denormalize and store them in a single document. On the other hand, the order would reference a customer, which is probably a separate collection. That join has to be done on the client side: the client reads the customer_id from the order document, and then fetches the data from the customer collection separately as needed.
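A minimal sketch of both patterns, with plain Python dicts standing in for BSON documents and collections; the names (orders, customers, customer_id) and the sample data are illustrative assumptions, not a real driver API:

```python
# Embedding: line items live inside the order document, so reading an
# order needs no join at all. The customer, however, is a separate
# "collection" and has to be joined on the client side.

orders = {
    1001: {
        "_id": 1001,
        "customer_id": 42,
        # embedded line items: fetched together with the order, always
        "line_items": [
            {"sku": "A-100", "qty": 2, "price": 9.99},
            {"sku": "B-200", "qty": 1, "price": 24.50},
        ],
    }
}
customers = {42: {"_id": 42, "name": "Acme Corp"}}

def get_order_with_customer(order_id):
    """Client-side join: one read per collection, stitched together here."""
    order = dict(orders[order_id])
    order["customer"] = customers[order["customer_id"]]
    return order

full = get_order_with_customer(1001)
```

With a real driver this is simply two queries instead of one, issued by the application rather than the database.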

Complaint 2: MongoDB does not support transactions. First of all, let's clarify that MongoDB does support atomic operations on a single document, but it does not support atomic transactions across more than one document or more than one collection. There is a way to do a two-phase commit, as described in the MongoDB documentation, but boy, it is really convoluted. The idea is that in most cases, data that is accessed together should be stored in one document using "embedding", so we do not need cross-document transactions.
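To show roughly what that two-phase-commit pattern involves, here is a sketch modeled loosely on the account-transfer example in the MongoDB documentation, with Python dicts standing in for collections; the account documents, the transactions "collection", and the state values are assumptions for illustration, and each individual dict update stands in for a single-document atomic operation:

```python
# Two-phase commit sketch: a transfer between two account documents is
# broken into single-document steps, tracked by a transaction document
# whose "state" field records progress so a crashed transfer can be
# found and finished (or rolled back) later.

accounts = {"A": {"_id": "A", "balance": 100, "pendingTransactions": []},
            "B": {"_id": "B", "balance": 50, "pendingTransactions": []}}
transactions = {}

def transfer(txn_id, src, dst, amount):
    # 1. Record the intent in its own document, then mark it pending.
    transactions[txn_id] = {"_id": txn_id, "source": src, "dest": dst,
                            "amount": amount, "state": "initial"}
    transactions[txn_id]["state"] = "pending"
    # 2. Apply to each account separately, marking the txn on the account.
    accounts[src]["balance"] -= amount
    accounts[src]["pendingTransactions"].append(txn_id)
    accounts[dst]["balance"] += amount
    accounts[dst]["pendingTransactions"].append(txn_id)
    # 3. Commit: mark applied, clear pending markers, mark done.
    transactions[txn_id]["state"] = "applied"
    accounts[src]["pendingTransactions"].remove(txn_id)
    accounts[dst]["pendingTransactions"].remove(txn_id)
    transactions[txn_id]["state"] = "done"

transfer("t1", "A", "B", 30)
```

Every step is individually atomic, but the bookkeeping needed to make the whole sequence recoverable is exactly the convolution mentioned above, and it is why embedding is preferred whenever the data model allows it.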

Complaint 3: Map-Reduce in MongoDB is slow because it is single-threaded. That's old news. Since MongoDB 2.4, the SpiderMonkey JavaScript engine has been replaced by the V8 JavaScript engine, and there is no longer a global JavaScript lock, which means that multiple map-reduce threads can run concurrently. Also, MongoDB 2.2 introduced an aggregation framework that can replace many map-reduce jobs. The aggregation framework runs much faster than an equivalent map-reduce job.
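As an illustration of what such a replacement looks like, a pipeline such as `db.orders.aggregate([{"$group": {"_id": "$cust_id", "total": {"$sum": "$amount"}}}])` computes a grouped sum that would otherwise need a map and a reduce function. The sketch below simulates that computation in plain Python (the orders data and field names are assumptions for illustration):

```python
from collections import defaultdict

orders = [
    {"cust_id": "A", "amount": 50},
    {"cust_id": "B", "amount": 70},
    {"cust_id": "A", "amount": 25},
]

# What the $group/$sum stage does: one pass over the documents,
# accumulating a total per key (the server does this in native code,
# with no JavaScript involved).
totals = defaultdict(int)
for doc in orders:
    totals[doc["cust_id"]] += doc["amount"]

result = [{"_id": k, "total": v} for k, v in sorted(totals.items())]
```

Because the aggregation framework runs inside the server without invoking a JavaScript engine per document, it avoids most of the overhead that made map-reduce jobs slow.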

Complaint 4: The default write behavior is unsafe. Again, this is old news. In the past, MongoDB's default write concern was unacknowledged, which means MongoDB did not acknowledge the receipt of write operations. It was a fire-and-forget operation. But this default behavior has been changed. MongoDB has a new connection class named MongoClient, and the default write concern of the new MongoClient class is to acknowledge all write operations.
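A toy illustration, in plain Python with no driver involved, of why the old fire-and-forget default was dangerous: with an unacknowledged write, an error such as a duplicate-key violation is silently lost. The in-memory "collection" and its unique _id constraint are assumptions made for this sketch:

```python
collection = {}

class DuplicateKeyError(Exception):
    pass

def insert(doc, acknowledged=True):
    """Stand-in for a driver insert with a configurable write concern."""
    if doc["_id"] in collection:
        if acknowledged:
            raise DuplicateKeyError(doc["_id"])  # caller learns the write failed
        return  # fire-and-forget: the error simply vanishes
    collection[doc["_id"]] = doc

insert({"_id": 1, "v": "first"})
insert({"_id": 1, "v": "second"}, acknowledged=False)  # silently dropped

failed = False
try:
    insert({"_id": 1, "v": "third"})  # acknowledged: the failure is visible
except DuplicateKeyError:
    failed = True
```

The acknowledged default means applications now see this class of error by default instead of discovering missing data later.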

Thursday, November 28, 2013

Design considerations of MongoDB-Log4j appender

I recently implemented a MongoDB Log4j appender. The driver behind it is that we are considering expanding our WebLogic cluster to more nodes. The current logging is file-based, and as more and more nodes are added to the cluster, it is hard to check the logs when they are distributed across multiple machines. Using MongoDB gives me a central location to check the logs for the whole cluster. Besides, it is easier to query and analyze log data in MongoDB than in flat files.

There are a few key design considerations I would like to share:

  1. Choose the right collection type. You can choose a capped collection or a TTL collection. A capped collection is a fixed-size collection that supports high-throughput operations that insert, retrieve, and delete documents based on insertion order. When the size is exceeded, the oldest documents in the collection are removed automatically. A TTL collection makes it possible to store data in MongoDB and remove outdated documents automatically after a specified number of seconds or at a specific clock time. We chose the capped collection because our current file-based Log4j log rotation is based on file size.
  2. Be careful with the schema design. We should extract the relevant information, such as the timestamp, from the log data into individual fields of a JSON document. The timestamp should be stored as a Date type using the ISODate(String) constructor. This makes your documents smaller and easier to analyze.
  3. Choose the right write concern. If you choose the "unacknowledged" write concern, the DB write operation is asynchronous and returns very fast. The "acknowledged" write concern is much safer, but speed and throughput suffer. In the Log4j appender design, I decided to use the unacknowledged write concern for info/debug logging and the acknowledged write concern for warn/error logging.
  4. Consider using AsyncAppender. Even though the acknowledged/unacknowledged write concern mix achieves a good balance between data durability and write speed, it still slows down the system a bit (around 10% overhead in our case). You can consider attaching the MongoDB appender to an AsyncAppender.
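Point 1 above can be illustrated compactly: a capped collection behaves much like a fixed-size ring buffer, so a Python deque with a maxlen is a reasonable stand-in for a sketch (a real capped collection is capped by bytes rather than by document count; the 3-entry cap here is purely illustrative):

```python
from collections import deque

# Stand-in for a capped collection: bounded size, insertion order kept,
# oldest entries evicted automatically when the cap is exceeded.
capped_log = deque(maxlen=3)
for i in range(5):
    capped_log.append({"msg": f"event {i}"})

# the two oldest entries have been dropped without any explicit delete
```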
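For point 2 above, here is a sketch of turning a raw log line into a structured document, with the timestamp as a real date object rather than a string (the mongo shell's ISODate corresponds to a BSON Date, i.e. a datetime in Python). The log-line format is an assumed example:

```python
from datetime import datetime

raw = "2013-11-28 10:15:30 ERROR OrderService - payment gateway timeout"

# Split the line into its fields instead of storing one opaque string.
date_part, time_part, level, logger, _, message = raw.split(" ", 5)

doc = {
    # a native Date is smaller than the text form and supports range queries
    "timestamp": datetime.strptime(f"{date_part} {time_part}",
                                   "%Y-%m-%d %H:%M:%S"),
    "level": level,
    "logger": logger,
    "message": message,
}
```

With the level and timestamp as their own fields, queries like "all ERROR entries in the last hour" become simple indexed lookups.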
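Points 3 and 4 above can be sketched together: log events go through a queue drained by a background thread (the AsyncAppender idea), and the write concern is chosen per log level. The `db_write` stand-in, the level split, and the concern names are assumptions standing in for the real driver call:

```python
import queue
import threading

log_queue = queue.Queue()
stored = []  # stand-in for the MongoDB collection

def write_concern_for(level):
    # warn/error must not be lost; info/debug can trade safety for speed
    return "acknowledged" if level in ("WARN", "ERROR") else "unacknowledged"

def db_write(doc, concern):
    stored.append((doc, concern))  # stand-in for an insert with that concern

def worker():
    # background thread: the application thread only pays the cost of
    # enqueueing, never the cost of the database round trip
    while True:
        doc = log_queue.get()
        if doc is None:          # shutdown sentinel
            break
        db_write(doc, write_concern_for(doc["level"]))

t = threading.Thread(target=worker)
t.start()
log_queue.put({"level": "INFO", "msg": "user logged in"})
log_queue.put({"level": "ERROR", "msg": "db connection lost"})
log_queue.put(None)
t.join()
```

A single worker thread also preserves the insertion order of the log events, which matters when reading them back from a capped collection.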