Prepare directory
mkdir /scratch/$USER/spark_py
cd /scratch/$USER/spark_py
Get Spark
Go to https://spark.apache.org/downloads.html and choose the desired version of Spark (pre-built for Hadoop).
Click the download link, choose a mirror, and copy the URL, for example: https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Download
wget <url from above>
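For example, with the Spark 2.4.5 URL shown above:
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz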
Unpack
mkdir sbin
tar -xvf spark-*.tgz -C sbin --strip 1
Get compute resources using Slurm
Note: make sure to request enough RAM
srun --cpus-per-task=2 --nodes 2 --mem=10GB --pty /bin/bash
(Take the script from -> here <- and copy it to your directory.)
source spark-prince-prepare.sh
Start Spark within the Slurm job environment
start_all
# later, you can stop the Spark cluster using the command stop_all
Spark UI
After you start Spark, the script will print the location of the log file.
Read this log file.
In it you will find the URL of the Spark web UI.
Use this <URL_WEB_UI> to set up a tunnel.
You can use PuTTY or a separate terminal window for port forwarding (see the sketch below).
Open http://localhost:8080/ in a browser.
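A minimal port-forwarding sketch, assuming the web UI listens on port 8080 of the compute node named in <URL_WEB_UI> and that you reach compute nodes through the cluster's login node (the hostnames below are placeholders, not confirmed by this page):

# run on your local machine; replace <node> with the host from <URL_WEB_UI>
# and <login-node> with your cluster's login host
ssh -L 8080:<node>:8080 <NetID>@<login-node>

Leave this session open while you browse the UI.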
Conda environment
module load anaconda3/2020.07
## python
conda create -p $(pwd)/cenv python=3.7
conda activate /scratch/<NetID>/spark_py/cenv
Install packages as needed
## for python
conda install -c conda-forge pyspark
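As an optional sanity check (not part of the original steps), you can print the installed PySpark version from the activated environment:

python -c "import pyspark; print(pyspark.__version__)"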
Example
import os
import pyspark
from pyspark import SparkContext

# connect to the Spark master started by start_all
sc = SparkContext(master=os.getenv('SPARK_URL'))

## test
nums = sc.parallelize([1, 2, 3, 4])
nums.take(1)
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print('%i ' % num)
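One way to run it: save the example as spark_test.py (a hypothetical name) and, inside the activated environment and the same Slurm session where Spark was started, execute:

python spark_test.py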
Conda environment
module load anaconda3/2020.07
## R
conda create -p $(pwd)/cenv r=4.1
conda activate /scratch/<NetID>/spark_py/cenv
Install packages as needed
## for R
conda install -c r r-sparklyr r-tidyverse
conda install -c conda-forge r-lahman r-nycflights13 ## for test example below
Example
library(sparklyr)
library(dplyr)

conf <- spark_config()
# conf$spark.dynamicAllocation.enabled <- "false"
conf$sparklyr.connect.cores.local <- Sys.getenv('SLURM_JOB_CPUS_PER_NODE')

# connect to the Spark master started by start_all, passing the config built above
sc <- spark_connect(master = Sys.getenv('SPARK_URL'), config = conf)

# copy test data sets into Spark
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

# simple query: flights that departed exactly 2 minutes late
flights_tbl %>% filter(dep_delay == 2)

spark_disconnect_all()
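As with the Python example, one way to run it: save it as spark_test.R (a hypothetical name) and, inside the activated environment and the Slurm session where Spark runs, execute:

Rscript spark_test.R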