APACHE SPARK

Apache Spark [4] is an open-source, parallel data-processing framework that complements Apache Hadoop, making it easy to develop fast, unified Big Data applications that combine batch, streaming, and interactive analytics on all your data.

SSH to hpcdata1 (the login node for the Hadoop cluster) to run Spark with Cloudera Hadoop (you need to request the Hadoop account explicitly). To run Spark stand-alone in the Rider cluster, first check the available versions, then load the module:

module spider spark

module load spark

Interactive Job

Run Spark in interactive shell mode. You will get the Scala prompt "scala>", with a SparkContext already available as sc:

spark-shell

Create data

scala> val data = 1 to 10000

Create an RDD (Resilient Distributed Dataset), a fault-tolerant collection of elements that can be operated on in parallel, based on that data. 

scala> val distData = sc.parallelize(data)
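
You can also check how many partitions Spark split the data into; the number depends on the cluster's default parallelism (partitions.size is standard RDD API):

scala> distData.partitions.size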

Use the filter transformation to select the values less than 10:

scala> distData.filter(_<10).collect()

output:

...

15/05/19 15:23:24 INFO DAGScheduler: Job 0 finished: collect at <console>:26, took 1.324873 s

res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
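
Transformations such as filter and map are lazy; Spark evaluates them only when an action like collect() or reduce() runs. As a further illustration (a hypothetical continuation of the same session), double every element and sum the results with a reduce action:

scala> distData.map(_ * 2).reduce(_ + _)

This should return 100010000, i.e. twice the sum 1 + 2 + ... + 10000.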

Exit

scala> exit


Batch Job (in hpcdata1 - Hadoop Cluster)

In this how-to [6], you’ll learn how to write, compile, and run a simple Spark program written in Scala on CDH 5. The full code for the example is hosted at https://github.com/sryza/simplesparkapp.
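
The program counts the occurrences of each word in the input file, keeps the words that appear at least a threshold number of times, and then counts the characters of those words, which is what produces the (character, count) pairs shown at the end of this section. The core logic looks roughly like the sketch below (a simplified sketch for orientation, not necessarily the exact repository source):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val threshold = args(1).toInt

    // split each line of the input file into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrences of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // keep only the words that occur at least 'threshold' times
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count the characters of the remaining words
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    println(charCounts.collect().mkString(", "))
  }
}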

Clone or copy the repository

git clone https://github.com/sryza/simplesparkapp.git 

or 

cp -r /home/sxg125/hadoop-projects/spark/simplesparkapp .

Change to the directory simplesparkapp:

cd /home/<user>/hadoop-projects/spark/simplesparkapp

Set the Maven path:

export PATH=$PATH:/home/share/apache-maven-3.2.2/bin/

Compile:

mvn package

output:

...

Building jar: /home/sxg125/hadoop-projects/spark/simplesparkapp/target/sparkwordcount-0.0.1-SNAPSHOT.jar

[INFO] ------------------------------------------------------------------------

[INFO] BUILD SUCCESS

[INFO] ------------------------------------------------------------------------

[INFO] Total time: 04:22 min

[INFO] Finished at: 2015-05-20T11:37:16-04:00

[INFO] Final Memory: 32M/874M

[INFO] ------------------------------------------------------------------------

The JAR is created in the target directory.

Running the Job

Create an input directory in HDFS (see above; e.g., hadoop fs -mkdir -p /user/<user>/projects/spark/input) and put inputfile.txt there:

hadoop fs -put data/inputfile.txt /user/<user>/projects/spark/input

Submit the job. The final argument 2 is the occurrence threshold that SparkWordCount takes, and "> output" redirects the console output to a local file named output:

spark-submit --class com.cloudera.sparkwordcount.SparkWordCount --master yarn   target/sparkwordcount-0.0.1-SNAPSHOT.jar /user/<user>/projects/spark/input/inputfile.txt 2 > output

View the output

cat output

output (each pair is a character and its count):

(p,2), (t,2), (b,1), (h,1), (n,4), (f,1), (v,1), (r,2), (l,1), (e,6), (a,4), (i,1), (u,1), (o,2), (c,1)