Install VirtualBox
$sudo apt-get install virtualbox
Network settings:
From the VirtualBox menu, choose Settings => Network
Select Adapter 1, choose "Bridged Adapter", and pick eth0 as the network name from the dropdown list
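With a bridged adapter the VM gets its own IP address on the local network; you can check which address was assigned from inside the guest (assuming the interface is named eth0, as selected above)
$ip addr show eth0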
Configure listening ports
1) Disable the firewall on the host
$sudo ufw status
$sudo ufw disable
Source: https://doc.ubuntu-fr.org/ufw
or
2) Open port 9000
$sudo ufw allow 9000
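You can verify that the rule was added by checking the firewall status again
$sudo ufw status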
Install Java 8
$sudo apt-add-repository ppa:webupd8team/java
$sudo apt-get update
$sudo apt-get install oracle-java8-installer
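You can check that the installation succeeded with
$java -version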
Install scala
$sudo apt-get install scala
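You can check the installed version with
$scala -version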
Download and install SPARK
http://spark.apache.org/downloads.html
Extract Spark
$tar -xvf spark...tgz
Copy the decompressed folder to the home directory
$mv spark... ~/
$cd ~/spark...
Configure Spark
$cp conf/spark-env.sh.template conf/spark-env.sh
Edit options
SPARK_LOCAL_IP=192.168.1.20
SPARK_MASTER_HOST=192.168.1.20
SPARK_MASTER_PORT=9000
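Optionally, spark-env.sh also accepts per-worker resource limits; a sketch, assuming each worker should use 2 cores and 2 GB of memory
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g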
$cp conf/slaves.template conf/slaves
Add the SSH username and IP address of each worker node to this file, one per line (see the SSH key sketch after the list), for example
duy@192.168.1.84
duy@192.168.1.50
...
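The start scripts log in to each worker over SSH without a password; a minimal sketch, assuming the worker addresses listed above
$ssh-keygen -t rsa
$ssh-copy-id duy@192.168.1.84
$ssh-copy-id duy@192.168.1.50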
Copy the configuration to every node
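For example, assuming the Spark folder and worker addresses used above, the folder can be copied with scp
$scp -r ~/spark-2.2.1-bin-hadoop2.7 duy@192.168.1.84:~/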
Export the Spark home directory as an environment variable in your shell profile
$emacs ~/.bashrc
export SPARK_PATH=~/spark-2...
or, better, define and export a custom function inside .bashrc to run Spark
###########################
function pyspark() {
  # Spark path (adjust to your installation)
  SPARK_PATH=~/spark-2.2.1-bin-hadoop2.7
  # For Python 3 users, uncomment the line below or you will get an error
  #export PYSPARK_PYTHON=python3
  # Run Spark interactively with a local master using 2 threads
  $SPARK_PATH/bin/pyspark --master local[2]
}
export -f pyspark
###########################
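Reload the shell configuration so the function becomes available
$source ~/.bashrc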
From the terminal, run Spark
$pyspark
One could also start the master with
$~/spark-version/sbin/start-master.sh
or start the master together with all the workers listed in conf/slaves
$~/spark-version/sbin/start-all.sh
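Once the master is running, you can check its status and the registered workers in the web UI, which listens on port 8080 by default (assuming the default has not been changed)
http://192.168.1.20:8080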
To run the examples using Spark
$~/spark-2.2.1-bin-hadoop2.7/bin/spark-submit ~/spark-2.2.1-bin-hadoop2.7/examples/src/main/python/pi.py 10
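To run the same example against the standalone cluster instead of locally, point spark-submit at the master URL configured in spark-env.sh above (a sketch based on that configuration)
$~/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --master spark://192.168.1.20:9000 ~/spark-2.2.1-bin-hadoop2.7/examples/src/main/python/pi.py 10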