Prepare directory
mkdir /scratch/$USER/spark_py
cd /scratch/$USER/spark_py
Get Spark
Go to https://spark.apache.org/downloads.html and choose the desired version of Spark (pre-built for Hadoop).
Click the download link, choose a mirror, and copy the URL, for example: https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Download
wget <url from above>
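For example, with the Spark 2.4.5 URL shown above:
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz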
Unpack
mkdir sbin
tar -xvf spark-*.tgz -C sbin --strip 1
Get compute resources using Slurm
Note: make sure to request enough RAM
srun --cpus-per-task=2 --nodes 2 --mem=10GB --pty /bin/bash
(Take the script from -> here <- and copy it to your directory.)
source spark-prince-prepare.sh
Start Spark within the Slurm job environment
start_all
# later, you can stop the Spark cluster using the command stop_all
Spark UI
After you start Spark, the script will print the location of the log file.
Read this log file.
In it you will find the URL of the Spark web UI.
Use this <URL_WEB_UI> to set up a tunnel.
You can use PuTTY or a separate terminal window for port forwarding (see the sketch below).
Open http://localhost:8080/ in a browser.
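A minimal port-forwarding sketch, assuming the web UI listens on port 8080 of the compute node named in <URL_WEB_UI> and that you reach compute nodes through the cluster's login node (the hostnames below are placeholders, not confirmed by this page):

# run on your local machine; replace <node> with the host from <URL_WEB_UI>
# and <login-node> with your cluster's login host
ssh -L 8080:<node>:8080 <NetID>@<login-node>

Leave this session open while you browse the UI.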
Conda environment
module load anaconda3/2020.07
## python
conda create -p $(pwd)/cenv python=3.7
conda activate /scratch/<NetID>/spark_py/cenv
Install packages as needed
## for python
conda install -c conda-forge pyspark
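As an optional sanity check (not part of the original steps), you can print the installed PySpark version from the activated environment:

python -c "import pyspark; print(pyspark.__version__)"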
Example
import os
import pyspark
from pyspark import SparkContext

# connect to the Spark master started by start_all
sc = SparkContext(master=os.getenv('SPARK_URL'))

## test
nums = sc.parallelize([1, 2, 3, 4])
nums.take(1)
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print('%i ' % num)
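One way to run it: save the example as spark_test.py (a hypothetical name) and, inside the activated environment and the same Slurm session where Spark was started, execute:

python spark_test.py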
Conda environment
module load anaconda3/2020.07
## R
conda create -p $(pwd)/cenv r=4.1
conda activate /scratch/<NetID>/spark_py/cenv
Install packages as needed
## for R
conda install -c r r-sparklyr r-tidyverse
conda install -c conda-forge r-lahman r-nycflights13 ## for test example below
Example
library(sparklyr)
library(dplyr)

conf <- spark_config()
# conf$spark.dynamicAllocation.enabled <- "false"
conf$sparklyr.connect.cores.local <- Sys.getenv('SLURM_JOB_CPUS_PER_NODE')

# connect to the Spark master started by start_all, passing the config built above
sc <- spark_connect(master = Sys.getenv('SPARK_URL'), config = conf)

# copy test data sets into Spark
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

# simple query: flights that departed exactly 2 minutes late
flights_tbl %>% filter(dep_delay == 2)

spark_disconnect_all()
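As with the Python example, one way to run it: save it as spark_test.R (a hypothetical name) and, inside the activated environment and the Slurm session where Spark runs, execute:

Rscript spark_test.R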