Tuesday, November 08, 2016

Hive on Spark Field Guide

Hive on Spark has become more and more popular. Hive has been in the Hadoop ecosystem for a long time. It is the most popular SQL-on-Hadoop option (not necessarily the best one), and it is especially useful in ETL use cases. However, when running on the MapReduce execution engine, it inherits the biggest drawback of the MapReduce framework -- slow performance. Hive on Spark allows end users to switch the execution engine from MapReduce to Spark without rewriting their Hive scripts.

Because Hive on Spark is quite new, there are still some usability/manageability issues. This blog is going to explore some problems I ran into in the field and provide solutions and workarounds.

There is an important distinction between Hive on Spark and Hive on MapReduce, and it usually confuses people using Hive on Spark for the first time. With Hive on Spark, assuming your Spark application runs in YARN, there is only one long-running YARN application per Hive user session. It is shared among all the queries submitted within the same Hive session. When the first query is executed in a Hive session (through Hue, beeline, or any other HiveServer2 client), a Hive on Spark YARN application is launched. Hive on Spark uses yarn-cluster mode, so the Spark driver runs inside the ApplicationMaster (AM). When the query finishes, Hive doesn't terminate this Spark application. Instead, the Spark application keeps running and is reused by subsequent queries submitted in the same Hive session, until the session is closed. Each Hive query translates to at least one Spark job within that Spark application. In contrast, when running Hive on MapReduce, each Hive query translates to a chain of one or more MapReduce jobs, depending on its complexity. As soon as the Hive query is done, all of its MapReduce jobs are finished and nothing is reused by the next Hive query.
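As a sketch of that session lifecycle (the JDBC URL and table name here are hypothetical, not from the original post):

```sql
-- Open one Hive session, e.g. with: beeline -u jdbc:hive2://hs2-host:10000
-- First query: launches the long-running Hive on Spark YARN application
SELECT count(*) FROM web_logs;
-- Second query: runs as another Spark job inside the SAME YARN application
SELECT status, count(*) FROM web_logs GROUP BY status;
-- The YARN application only terminates when the session is closed
!quit
```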

Below I started a beeline shell and executed two different Hive queries within it. The Cloudera Manager YARN applications page shows only one application running, at 10% complete. But don't be fooled: both queries have already completed. This Hive on Spark application keeps running until you quit your beeline shell.

If you go to the Spark History Server UI, you should find your application in the "Incomplete applications" list. Click that application's link, and you should see two Spark jobs in the Completed Jobs list.

As soon as you exit your beeline session, the Hive on Spark application finishes. The duration shown in the page below is how long the Hive session was kept open, not the duration of the queries that were executed.

Now here come the questions customers asked me in the field.

How can I find out which queries were submitted by other Hive users in the Cloudera Manager YARN application page?

When we run Hive on MapReduce, the Hive query string can be found easily through the CM YARN application page. For Hive on Spark, it cannot: one Hive on Spark application corresponds to many Hive queries. A workaround is to use the HiveServer2 Web UI, which shows all active sessions as well as active and past queries.

Hive on Spark applications stay alive forever, occupying almost all cluster resources so that no more jobs can run. What shall we do?

A Hive on Spark application is long-running by design, as explained at the beginning of this blog. The purpose of this design is that, with the AM and executors already launched, subsequent Hive queries can reuse them and run much faster.

However, if a user opens a beeline shell, submits a query, and leaves the shell open, the Hive on Spark application will hold YARN resources. For Hue users it is even worse. Because Hue is a web tool, it currently does not have an elegant way to close the Hive on Spark session on HiveServer2. On CDH 5.7, which comes with Hue 3.9, you will notice that the Hive on Spark session is kept open even after you log out of Hue. The end result is that the Hive on Spark YARN application runs forever, unnecessarily occupying YARN resources, even after users have long since logged out of Hue and closed their browsers. Very soon YARN runs out of CPU and memory and no more jobs can be submitted. The only way I have found to close the Hive on Spark session manually through Hue in CDH 5.7 is to execute "set hive.execution.engine=mr" in the Hive editor in Hue. This closes the Hive on Spark session immediately. Of course, you can then execute "set hive.execution.engine=spark" to switch back to the Spark engine.
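For reference, the engine-switch workaround in the Hue Hive editor is just these two statements:

```sql
-- Switching the engine away from Spark closes the current Hive on Spark
-- session (and its YARN application) immediately
set hive.execution.engine=mr;

-- Switch back later; a new Spark session is launched by the next query
set hive.execution.engine=spark;
```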

The Hue that ships with CDH 5.8 seems to address this issue: it adds a "close session" feature, which lets you close the Hive session manually. However, in my own testing this feature is still a little unstable. Sometimes the close session operation works fine, terminating the Hive on Spark YARN application for you. But sometimes it fails, throwing "Failed to close session, session handle may already be closed or timed out" in Hue, while the Hive on Spark session keeps running in YARN forever.

There are two things you can do to alleviate this issue.

1. Enable dynamic executor allocation for Hive on Spark. This allows Spark to add and remove executors for Hive jobs dynamically, based on the workload. In Cloudera Manager, spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.minExecutors are both set to 1 by default. This means every idle Hive on Spark session takes two containers: one for the ApplicationMaster/Spark driver and one for the minimum executor. The executor usually takes more YARN resources than the AM -- for example, we usually recommend 4-6 CPUs for an executor container, while the AM container only takes 1 CPU. To reduce the waste, it might be wise to reduce both the initial and minimum executor numbers from 1 to 0. Then, while the Hive on Spark session is idle, only the AM container holds YARN resources. Of course, this adds a delay to each query, since new executors have to be launched first, so you have to decide which configuration fits your scenario better.
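A sketch of those settings, which you can try per-session in beeline before making them the cluster default through the Hive service configuration in Cloudera Manager:

```sql
set spark.dynamicAllocation.enabled=true;
-- Start with no executors; they are added on demand when a query runs
set spark.dynamicAllocation.initialExecutors=0;
-- Let an idle session fall back to zero executors, keeping only the AM
set spark.dynamicAllocation.minExecutors=0;
```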

2. HiveServer2 has a configuration called hive.server2.idle.session.timeout. After the timeout, the Hive on Spark session is closed and all of its YARN resources are released. By default this value is set to 12 hours. We can reduce it to a smaller interval so that idle sessions are closed automatically once the timeout has passed. Please note that by default the idle time does not start counting while there is an active query running; the clock only starts when the last query result is returned, because hive.server2.idle.session.timeout_check_operation is set to true by default.
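As a hive-site.xml sketch (the 2-hour value is just an example, and the time-unit suffix assumes a Hive version that accepts it; otherwise specify the value in milliseconds):

```xml
<property>
  <name>hive.server2.idle.session.timeout</name>
  <!-- Close idle Hive sessions after 2 hours instead of the 12-hour default -->
  <value>2h</value>
</property>
```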

The combination of the above two methods helps preserve YARN resources and lets more users share the Hadoop cluster.