Friday, February 10, 2017

The Gotcha Of Using the Spark Dependency in Cloudera's Maven Repository (Mac Users Only)

One night I set up a basic word count Spark application in the IntelliJ IDE on my MacBook. Usually I would specify the spark-core dependency in my Maven pom.xml as below:
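A typical spark-core dependency from Maven Central looks something like this (the 1.6.0 version and the Scala 2.10 artifact suffix here are illustrative; adjust them to your own setup):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.0</version>
</dependency>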



But that night I decided to use the Spark artifacts from Cloudera's Maven repository. It seemed like a good idea, because ultimately my Spark application is going to be deployed on a CDH cluster. Even though the CDH Spark distribution is mostly identical to the upstream open source Apache Spark project, it contains patches and other tweaks so that it works well with the other Hadoop components included in the Cloudera CDH distribution. I am a big believer that the build environment should be as identical as possible to the runtime environment. For details about how to set up the Cloudera Maven repository, please refer to Cloudera's documentation.
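In short, an entry along these lines in pom.xml should be enough to pull artifacts from it (the repository id is arbitrary; the URL is Cloudera's public repository):

<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>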

My spark-core dependency looked like the one below:
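It was roughly the following (the exact CDH patch version shown, 1.6.0-cdh5.9.0, is an approximation; check the repository for the precise string):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.0-cdh5.9.0</version>
</dependency>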



Then I ran into trouble. Within the IDE, the Maven compile and package goals ran fine, but running the application hit a problem. The IDE spit out the exception below:

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
......
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
... 41 more

If I used spark-submit to submit the packaged jar file to a running CDH cluster deployed in AWS (running RedHat 7), the word count application ran fine. The problem only happened when I tried to run the Spark application locally on my MacBook. WEIRD!

After some digging, I found the real cause. The "no snappyjava in java.library.path" error is actually caused by the snappy-java version used by Spark in CDH. If you check the spark 1.6.0-cdh5.9 Maven dependencies, the snappy-java version is 1.0.4.1. That version contains a bug, described here: https://github.com/xerial/snappy-java/issues/6, and it ONLY affects Mac OS. It is related to the Java call System.mapLibraryName(). If you call System.mapLibraryName("snappyjava"), it prepends "lib" to the string and chooses an extension based on the OS: .so on Linux and .dll on Windows. Mac OS supports multiple extensions, but mapLibraryName can only return one by design: on Java 6 it uses .jnilib, while Java 7+ uses .dylib instead. Version 1.0.4.1 of snappy-java only packages the file libsnappyjava.jnilib, hence the error on a Mac running Java 7+. This problem does not exist if you use the open source Spark artifacts, since their snappy-java version is 1.1.2.6 according to the Spark GitHub 1.6 branch. I chose the 1.6 branch to check because that's the same Spark version as what CDH 5.9 includes.
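You can see this for yourself with a one-liner (a minimal sketch; the class name is just for illustration):

// Prints what the JVM maps the base library name to on this platform.
// On Mac OS with Java 7+ this prints "libsnappyjava.dylib", while
// snappy-java 1.0.4.1 only ships libsnappyjava.jnilib, hence the
// UnsatisfiedLinkError above.
public class MapLibraryNameCheck {
    public static void main(String[] args) {
        System.out.println(System.mapLibraryName("snappyjava"));
    }
}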

How do we get around this issue? Of course, switching back to the Spark dependency in Maven Central would work, but as I said above, keeping the build and runtime environments identical is generally preferable.

Looking at the implementation at https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/SnappyLoader.java, I found that it checks the system property "org.xerial.snappy.lib.name" before calling mapLibraryName. Thus the easiest solution is to add -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib to the IDE Run configuration.
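In IntelliJ, that flag goes into the "VM options" field of the Run/Debug configuration. As an alternative sketch (assuming it executes before snappy-java loads its native library), the property can also be set programmatically at the top of the driver's main method; the class name below is hypothetical:

public class WordCountLocal {
    public static void main(String[] args) {
        // Equivalent to passing -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib
        // in the Run configuration; must run before snappy-java is first used.
        System.setProperty("org.xerial.snappy.lib.name", "libsnappyjava.jnilib");
        // ... build the SparkConf/SparkContext and run the word count as usual ...
    }
}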



Let's summarize. You only need the above workaround if you meet ALL of the conditions below:
  1. You are using a Mac, not Windows or Linux, for development.
  2. You are running the Spark application locally, e.g. from an IDE Run configuration.
  3. You use the Spark artifacts from Cloudera's Maven repository, not from the Maven Central repository.
  4. The Spark artifacts you use depend on a snappy-java version older than 1.0.5. If you use a CDH 5.x release (the latest is CDH 5.10 at the time of this writing), you fall into that category.
The good news is that Cloudera has decided to release Spark 2.x as a separate parcel. For how to install the Spark 2 parcel on a CDH cluster, please refer to the official online documentation from Cloudera. The snappy-java version included in the Spark 2.0 parcel is 1.1.2.4, so if you use Spark 2.0 you won't run into this issue. Here is the Spark 2 dependency in the Cloudera Maven repository:
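The exact version string depends on which Spark 2 release you installed; the 2.0.0.cloudera1 shown below is one example, so check the repository for the precise coordinates:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.0.cloudera1</version>
</dependency>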


Hope this blog post can save some time for those who run into the same issue.

