PySpark

Install PySpark:

https://www.youtube.com/watch?v=wqY3Go7p0BA&feature=youtu.be

https://sites.google.com/a/ku.th/big-data/home/spark

Test Spark ALS:

https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html

Recommendation engine (RE) job:

https://www.cpe.ku.ac.th/~cnc/recommendALS.tar.gz

Data set:

https://raw.githubusercontent.com/apache/spark/master/data/mllib/als/sample_movielens_ratings.txt
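A minimal sketch of fitting ALS on the sample file above, loosely following the example in the Spark 2.2.0 collaborative filtering guide. It is not the contents of recommendALS.tar.gz. The local file name, the 80/20 split, the seed, and the hyperparameter values are assumptions. Each line of the file has the form userId::movieId::rating::timestamp.

```python
def parse_rating(line):
    """Parse one 'userId::movieId::rating::timestamp' line into a dict."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return {
        "userId": int(user_id),
        "movieId": int(movie_id),
        "rating": float(rating),
        "timestamp": int(timestamp),
    }


def train_als(path="sample_movielens_ratings.txt"):
    """Fit ALS on the ratings file and return (model, test-set RMSE)."""
    # pyspark is imported here so parse_rating works without Spark installed.
    from pyspark.sql import Row, SparkSession
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("ALSExample").getOrCreate()
    rows = spark.sparkContext.textFile(path).map(lambda l: Row(**parse_rating(l)))
    ratings = spark.createDataFrame(rows)
    training, test = ratings.randomSplit([0.8, 0.2], seed=42)

    # coldStartStrategy="drop" (Spark 2.2+) discards NaN predictions for
    # users or movies that did not appear in the training split.
    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId",
              ratingCol="rating", coldStartStrategy="drop")
    model = als.fit(training)

    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    rmse = evaluator.evaluate(model.transform(test))
    return model, rmse


if __name__ == "__main__":
    _, rmse = train_als()
    print("Test RMSE = %.4f" % rmse)
```

Run it with spark-submit or plain python once the file has been downloaded next to the script.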

Larger data sets:

https://grouplens.org/datasets/movielens/100k/

https://www.kaggle.com/grouplens/movielens-20m-dataset

Implicit ALS:

https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe

Data set:

https://www.cpe.ku.ac.th/~cnc/usersha1-artmbid-artname-plays.tsv.zip

Jupyter notebook:

https://www.cpe.ku.ac.th/~cnc/ImpliciteALS.ipynb
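A rough sketch of implicit-feedback ALS using Spark's built-in ALS with implicitPrefs=True on the play-count file. It is not taken from the notebook or the article above, which may take a different route. It assumes the unzipped TSV has four tab-separated columns in the order the file name suggests (user hash, artist MBID, artist name, play count); the column names, alpha=40, and the other hyperparameters are illustrative. In the implicit formulation, a raw count r is treated as a confidence c = 1 + alpha * r rather than as a rating.

```python
def confidence(plays, alpha=40.0):
    """Confidence weight c = 1 + alpha * r for a raw play count r."""
    return 1.0 + alpha * plays


def train_implicit_als(path="usersha1-artmbid-artname-plays.tsv", alpha=40.0):
    """Fit implicit-feedback ALS on (user, artist, plays) triples."""
    # pyspark is imported here so confidence() works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("ImplicitALS").getOrCreate()

    # Assumed column order; quote="" disables quote handling so artist
    # names containing double quotes do not break parsing.
    plays = (spark.read.csv(path, sep="\t", quote="")
             .toDF("user", "artist_mbid", "artist_name", "plays")
             .selectExpr("user", "artist_name", "cast(plays as float) as plays")
             .dropna())

    # ALS needs numeric ids, so map the string keys to indices first.
    for src, dst in [("user", "userId"), ("artist_name", "artistId")]:
        plays = StringIndexer(inputCol=src, outputCol=dst).fit(plays).transform(plays)

    # Spark applies the confidence weighting internally, so the raw play
    # counts are passed as the rating column.
    als = ALS(implicitPrefs=True, alpha=alpha, rank=10, maxIter=10, regParam=0.1,
              userCol="userId", itemCol="artistId", ratingCol="plays")
    return als.fit(plays)


if __name__ == "__main__":
    model = train_implicit_als()
    model.recommendForAllUsers(5).show(5, truncate=False)
```

The confidence helper is only there to make the weighting explicit; a larger alpha makes observed plays count for more relative to unobserved pairs.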

Running Spark on multiple nodes:

https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b

https://github.com/ashishtam/apache-spark-multi-node-installation/blob/master/index.md

http://chennaihug.org/knowledgebase/spark-master-and-slaves-multi-node-installation/

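Launching a job on a Spark standalone cluster:

e.g. ./bin/spark-submit --master spark://master_ip:7077 --deploy-mode cluster --supervise --driver-memory 5g --num-executors 20 --executor-memory 4g --executor-cores 4 examples/src/main/python/pi.py 1000

Note: a standalone cluster does not support --deploy-mode cluster for Python applications, and --supervise only has an effect in cluster deploy mode. The Spark 2.0.2 spark-submit help also lists --num-executors as YARN-only. For a Python script such as pi.py, drop those three flags and cap resources with --total-executor-cores instead.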

https://spark.apache.org/docs/2.0.2/submitting-applications.html

https://spark.apache.org/docs/2.0.2/spark-standalone.html

See the spark-submit parameters:

https://www.alibabacloud.com/help/doc-detail/28124.htm

To run on the server from Python code:

from pyspark import SparkConf

conf = SparkConf().setAppName(app_name).setMaster('spark://sparkmaster_ip:7077')
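A short sketch of how that configuration might be used end to end. The helper names, the default app name, and the trivial summing job are hypothetical; sparkmaster_ip is the same placeholder as above and 7077 is the default standalone master port.

```python
def master_url(host, port=7077):
    """Build a standalone master URL; 7077 is the default master port."""
    return "spark://%s:%d" % (host, port)


def run_job(app_name="my_app", host="sparkmaster_ip"):
    """Connect to the standalone master and run a trivial job."""
    # pyspark is imported here so master_url works without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master_url(host))
    sc = SparkContext(conf=conf)
    try:
        # Sum 0..999 on the cluster as a simple connectivity check.
        return sc.parallelize(range(1000)).sum()
    finally:
        # Always release the executors, even if the job fails.
        sc.stop()


if __name__ == "__main__":
    print(run_job())
```

Setting the master in code pins the script to one cluster; leaving setMaster out and passing --master to spark-submit keeps the same script usable locally and on the cluster.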

SSH tunnel forwarding:

https://www.tecmint.com/create-ssh-tunneling-port-forwarding-in-linux/
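As a sketch of local port forwarding, the helper below builds an ssh -N -L command that would expose a port on the Spark master (for example the master web UI, which listens on 8080 by default) on the local machine. The user name, the gateway host, and the function names are hypothetical, and it assumes an ssh client is on the PATH.

```python
import subprocess


def tunnel_command(user, gateway, remote_port, local_port=None,
                   remote_host="localhost"):
    """Return the ssh argument list for a local port forward.

    remote_host is resolved from the gateway's point of view, so the
    default "localhost" means the gateway machine itself.
    """
    if local_port is None:
        local_port = remote_port
    forward = "%d:%s:%d" % (local_port, remote_host, remote_port)
    # -N: do not run a remote command; -L: forward local_port to remote_host
    return ["ssh", "-N", "-L", forward, "%s@%s" % (user, gateway)]


def open_tunnel(user, gateway, remote_port, local_port=None):
    """Start the tunnel in the background; call .terminate() to close it."""
    return subprocess.Popen(tunnel_command(user, gateway, remote_port, local_port))


if __name__ == "__main__":
    # Hypothetical user and host: forward the master web UI to localhost:8080.
    print(" ".join(tunnel_command("alice", "sparkmaster_ip", 8080)))
```

With the tunnel open, the master web UI would be reachable at http://localhost:8080 in a local browser.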
