Install VirtualBox
$sudo apt-get install virtualbox
Network settings:
From the VirtualBox menu, choose Settings => Network
Select Adapter 1, choose "Bridged Adapter", and pick eth0 as the network name from the dropdown list
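With a bridged adapter the VM gets its own IP address on the local network; you can check which address was assigned from inside the guest (assuming the interface is named eth0, as selected above)
$ip addr show eth0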
Configure listening ports
1) Disable the firewall on the host
$sudo ufw status
$sudo ufw disable
Source: https://doc.ubuntu-fr.org/ufw
or
2) Open port 9000
$sudo ufw allow 9000
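You can verify that the rule was added by checking the firewall status again
$sudo ufw status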
Install Java 8
$sudo apt-add-repository ppa:webupd8team/java
$sudo apt-get update
$sudo apt-get install oracle-java8-installer
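You can check that the installation succeeded with
$java -version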
Install scala
$sudo apt-get install scala
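You can check the installed version with
$scala -version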
Download and install SPARK
http://spark.apache.org/downloads.html
Extract Spark
$tar -xvf spark...tgz
Copy the decompressed folder to the home directory
$mv spark... ~/
$cd ~/spark...
Configure Spark
$cp conf/spark-env.sh.template conf/spark-env.sh
Edit options
SPARK_LOCAL_IP=192.168.1.20
SPARK_MASTER_HOST=192.168.1.20
SPARK_MASTER_PORT=9000
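Optionally, spark-env.sh also accepts per-worker resource limits; a sketch, assuming each worker should use 2 cores and 2 GB of memory
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g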
$cp conf/slaves.template conf/slaves
Add the SSH username and IP address of each worker node to this file, one per line (see the SSH key sketch after the list), for example
duy@192.168.1.84
duy@192.168.1.50
...
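The start scripts log in to each worker over SSH without a password; a minimal sketch, assuming the worker addresses listed above
$ssh-keygen -t rsa
$ssh-copy-id duy@192.168.1.84
$ssh-copy-id duy@192.168.1.50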
Copy the configuration to every node
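For example, assuming the Spark folder and worker addresses used above, the folder can be copied with scp
$scp -r ~/spark-2.2.1-bin-hadoop2.7 duy@192.168.1.84:~/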
Export the Spark home directory as an environment variable in your shell profile
$emacs ~/.bashrc
export SPARK_PATH=~/spark-2...
or, better, define and export a custom function inside .bashrc to run Spark
###########################
function pyspark() {
  # Spark path (adjust to your installation)
  SPARK_PATH=~/spark-2.2.1-bin-hadoop2.7
  # For Python 3 users, uncomment the line below or you will get an error
  #export PYSPARK_PYTHON=python3
  # Run Spark interactively with a local master using 2 threads
  $SPARK_PATH/bin/pyspark --master local[2]
}
export -f pyspark
###########################
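Reload the shell configuration so the function becomes available
$source ~/.bashrc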
From the terminal, run Spark
$pyspark
One could also start the master with
$~/spark-version/sbin/start-master.sh
or start the master together with all the workers listed in conf/slaves
$~/spark-version/sbin/start-all.sh
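Once the master is running, you can check its status and the registered workers in the web UI, which listens on port 8080 by default (assuming the default has not been changed)
http://192.168.1.20:8080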
To run the examples using Spark
$~/spark-2.2.1-bin-hadoop2.7/bin/spark-submit ~/spark-2.2.1-bin-hadoop2.7/examples/src/main/python/pi.py 10
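To run the same example against the standalone cluster instead of locally, point spark-submit at the master URL configured in spark-env.sh above (a sketch based on that configuration)
$~/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --master spark://192.168.1.20:9000 ~/spark-2.2.1-bin-hadoop2.7/examples/src/main/python/pi.py 10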