Thursday, August 25, 2016

Common Pitfalls When Writing Spark Applications (Part 1: Logging Side Effect)

As a Big Data Solutions Architect, I have done quite a few spark application reviews for my customers. I have the opportunity to witness some common pitfalls when writing spark applications first-hand. I decided to write a blog series on the subject, helping the spark community to avoid these costly mistakes.

Spark 101: operations can be divided into transformations and actions. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action is invoked. This design enables Spark to optimize the RDD pipeline and run more efficiently. 

During one of my spark application review engagement, I noticed that the development team has done a great deal of logging through log4j API. There are many rdd.count() in the logging statements. Something like below"contract_devices:"+ contract_devices.count())

contract_devices is an RDD with long lineage. 

Even if the logger level is set to be warn or error, the action count on the RDD is still called. This is the logging side effect we need to watch out, because it will trigger all the previous transformation up to this RDD to be executed, including reading from HDFS at the very beginning. There are many such log statements throughout the codes, literally calling count() action on every RDDs in the long lineage. One can image how inefficient this becomes because the whole lineage is executed again and again. 

Instead, we should have written code like below:

if (logger.isInfoEnabled()) {"contract_devices:"+ contract_devices.count());

Because of the checking done by logger.isInfoEnabled(), the action on RDD will not be triggered when logger level is set properly. 

No comments: