Sources: http://spark.apache.org/, eclsummer2017.blogspot.com, cloudera.com, jaceklaskowski.gitbooks.io
Downloaded version: 1.4.0
Executed make-distribution.sh
Answered 'y' at the Java prompt
#After the above steps we can start bin/spark-shell and write Scala code examples
#Spark distributed configuration for HADOOP and YARN
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
#ADD to spark-env.sh (HADOOP_CONF_DIR is an environment variable, not a spark-defaults.conf property)
HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
#ADD to spark-defaults.conf
spark.master yarn
#Start Spark master and workers
$SPARK_HOME/sbin/start-all.sh
##spark-shell--------------Starts
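#0: Optional sanity check. sc is the SparkContext that spark-shell creates automatically; these calls just confirm the Spark version and which master the shell is actually using.
sc.version
sc.master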
#1: Year's max temperature finder.
a. Read a local text file and split each line on tabs.
val lines=sc.textFile("temperature.txt")
val records=lines.map(_.split("\t"))
b. Filter out invalid records (rec(0) is the year, rec(1) is the temperature).
val validRecs=records.filter(rec=>(rec(0)!="9999" && rec(1).matches("^[0-9]+$")))
c. Map the filtered records to (year, temperature) tuples.
val tuples = validRecs.map(rec=> (rec(0).toInt, rec(1).toInt))
d. Reduce by key to get the maximum temperature per year.
val maxTemps = tuples.reduceByKey((a, b) => Math.max(a, b))
e. Collect the output on the driver and print it (a bare foreach(println) would print on the executors when running on a cluster).
maxTemps.collect().foreach(println)
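The same job can be written as one chained expression (same assumptions: tab-separated lines, year in the first field, temperature in the second). Paste it with :paste in spark-shell since it spans several lines.
val maxTempsChained = sc.textFile("temperature.txt")
  .map(_.split("\t"))
  .filter(rec => rec(0) != "9999" && rec(1).matches("^[0-9]+$"))
  .map(rec => (rec(0).toInt, rec(1).toInt))
  .reduceByKey((a, b) => Math.max(a, b))
maxTempsChained.collect().foreach(println)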
#2: Read file from s3.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "id")
hadoopConf.set("fs.s3.awsSecretAccessKey", "secret-key")
val lines = sc.textFile("s3://<BUCKET>/<OBJECT>")
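#Quick sanity check that the object is readable (assumes the placeholder access key, secret key and bucket/object above are replaced with real values):
lines.take(5).foreach(println)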
#3: Read file from web(http).
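#sc.textFile cannot read http:// URLs directly, so one approach (a sketch, assuming the file fits in driver memory; the URL below is a hypothetical placeholder) is to fetch the file on the driver and parallelize its lines:
import scala.io.Source
val url = "http://example.com/some_text.txt"
val webLines = sc.parallelize(Source.fromURL(url).getLines().toList)
webLines.count()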
#4: Word counter.
val text = sc.textFile("some_text.txt")
val count = text.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
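#To inspect the result (assuming the word counts are small enough to collect to the driver; sortBy plus take(10) shows just the most frequent words):
count.collect().foreach(println)
count.sortBy(_._2, ascending = false).take(10).foreach(println)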
##spark-shell--------------Ends
Problems
#1: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
Soln (source: Stack Overflow):
Set value for SPARK_LOCAL_IP in <SPARK_DIR>/conf/spark-env.sh
SPARK_LOCAL_IP=127.0.0.1
#2: