Saturday, February 20, 2016

5 Steps to get started running Spark on YARN with a Hadoop Cluster

Spark and Hadoop are the top topics these days.  The good news is Spark 1.5.2 runs on Hortonworks Data Platform (HDP) 2.3.

This blog will walk through the hello-world-style steps to run the Spark examples using YARN and HDP, as well as explore the tools necessary to monitor the examples executing in ‘cluster’ mode.

Before we get too far into the actual steps this blog assumes the following pre-conditions:
  • You have a Hadoop cluster with both Spark and Ambari installed and you have access to the standard Spark Examples.  As a point of reference, this blog was created on a 5 node cluster running HDP 2.3 with 1 master node and 4 data nodes.
  • Your Hadoop cluster has multiple nodes.  The example will run on the Hortonworks Sandbox, but the monitoring capabilities described in this blog are much better if you have more than one node.

Running the Spark examples in YARN:
    1. Verify Spark has been installed on your HDP cluster instance and is accessible in the services drop-down on the left side of the screen.

If Spark does not appear in the services list, click the Actions button, select Spark from the list of services as shown below, and click the Next button.

Follow the default instructions presented by Ambari, and then restart the other services when Ambari requests it.
At this point you should have a working Spark service running on HDP with YARN.

    2. Go into HDP Ambari and check which servers are running the Spark client

To run our example we need to identify the servers on which the Spark client is running.  You can do this by clicking on the ‘Spark’ service name and then selecting the ‘Spark Client’ option in the Summary section as shown below.

Once you have clicked on the Spark Client link, you will then see the hosts where the Spark Client is running.

In my sample, the Spark Client is running on server5.hdp.  Every cluster is different, so check your own host name and replace ‘server5.hdp’ with your actual Spark Client server name in all of the subsequent steps in this blog.

Before moving to the next step, open a terminal and ssh into the Spark client machine:
ssh hdfs@server5.hdp

    3. Go to the Spark client directory

Our next step is to ‘cd’ into the ‘spark-client’ directory.
cd /usr/hdp/current/spark-client

This directory contains all of the standard Spark distribution directories.

    4. Submit the SparkPi example over YARN

A good Spark example for checking out your Spark cluster instance is the SparkPi example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 4g \
   --executor-memory 2g \
   --executor-cores 4 \
   --queue default \
   lib/spark-examples*.jar \
   100

The example above is the standard SparkPi example on YARN with one important change; I modified the task count from 10 up to 100 in order to have more monitoring data available for review.
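For context, SparkPi estimates Pi with a Monte Carlo simulation: each task throws random points at the unit square and counts how many land inside the quarter circle. The same idea can be sketched locally in a few lines of awk (no Spark required; this is purely illustrative and the sample count is arbitrary):

```shell
# Monte Carlo estimate of Pi -- the same calculation SparkPi
# distributes across YARN executors, shrunk to a single awk process.
awk 'BEGIN {
  srand(42); n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1) inside++   # point landed inside the quarter circle
  }
  printf "Pi is roughly %f\n", 4 * inside / n
}'
```

With 100,000 samples the estimate typically lands within a few hundredths of 3.14159; SparkPi does the same thing, but spreads the sampling across the 100 tasks we requested.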

The key configurations to run a Spark job on a YARN cluster are:
  • master – Determines how to run the job.  Since this blog runs Spark on YARN, the ‘yarn’ value has been selected for the example above.  Other options include ‘mesos’ and ‘standalone’.
  • deploy-mode – We selected ‘cluster’ to run the above SparkPi example within the cluster.  To run the driver outside of the cluster, on the submitting machine, select the ‘client’ option.
  • driver-memory – The amount of memory available for the driver process.  In a YARN cluster Spark configuration, the Application Master runs the driver.
  • executor-memory – The amount of memory allocated to each executor process.
  • executor-cores – The number of cores allocated to each executor process.
  • queue – The YARN queue name on which this job will run.  If you have not already defined queues to your cluster, it is best to utilize the ‘default’ queue.
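For comparison, here is a sketch of the same submission in ‘client’ deploy mode, where the driver runs in your terminal session instead of inside a YARN container, so the driver's output prints directly to your console (memory and core settings are carried over from the example above; adjust for your cluster):

```shell
# SparkPi in client deploy mode: the driver runs locally, so its
# stdout (including the Pi estimate) appears in this terminal.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --deploy-mode client \
   --executor-memory 2g \
   --executor-cores 4 \
   --queue default \
   lib/spark-examples*.jar \
   100
```

Client mode is handy for quick checks like this; cluster mode is generally preferred for production jobs since the driver does not depend on your terminal session staying alive.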

    5. SparkPi example monitoring

The SparkPi example we submitted in the last step can be monitored while executing by opening the ResourceManager UI from Ambari -> YARN -> QuickLinks.

From the ResourceManager UI screen as shown below it is possible to see the current execution status and statistics.  For example, we see that the job is consuming 47.5% of the cluster (nothing else was running at the time) and that 3 containers are being used to run the job.  Just as with any other YARN application, we can click the Application ID or Tracking UI link associated with the job to get more information about the job’s progress.

Once the SparkPi job has completed execution, you will see a FinalStatus of ‘SUCCEEDED’ if everything was successful, or ‘FAILED’ if the job did not complete.
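In cluster mode the driver's stdout does not appear in your terminal, so if a run shows FAILED, or you simply want to read the driver's output, you can fetch the aggregated container logs from YARN once the application finishes. The application ID below is a placeholder; copy the real one from the ResourceManager UI:

```shell
# List recently finished applications to find the ID of our SparkPi run
yarn application -list -appStates FINISHED,FAILED

# Fetch the aggregated container logs (driver stdout included).
# application_1455900000000_0001 is a placeholder ID -- use your own.
yarn logs -applicationId application_1455900000000_0001 | less
```

Note that `yarn logs` only works after the application has finished and requires log aggregation to be enabled on the cluster (it is on by default in HDP).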

Clicking on the ‘History’ link associated with this job takes you to the Spark History screen.

Selecting the job’s description link will take you to the Spark Job Detail page.
The SparkPi example job is very simple, but for a real-world Spark application you would want to review this screen to better understand how the job allocated resources between the stages.  Then, for more fine-grained job results, click on the Completed Stages description ‘reduce at SparkPi.scala:36’.

We see in the example above that the SparkPi example utilized 2 of the 5 nodes defined in my test cluster, though it looks like most of the work was allocated to the server4.hdp instance.  Looking at the statistics, we see that each of the servers processed 50 tasks, but server5.hdp required a little more time; this could be attributable to the fact that the Spark client is also running on that node.  The good news is that the tooling exists with Spark and HDP to dig deep into your Spark jobs on YARN and to diagnose and tune as required.

You have now run your first Spark example on a YARN cluster with Ambari.  As you can see, the process is easy, so you are ready to move forward with your own applications.