https://www.amazon.com/s/ref=nb_sb_noss_2?url=node%3D154606011&field-keywords=hadoop+spark+interview
www.itversity.com/2018/04/19/setup-development-environment-big-data-hadoop-and-spark/
“Never drop the ball pursuing a goal just because of the time it takes. That time will pass by anyway.”
https://cloudxlab.com/blog/spark-interview-questions/
http://virtuslab.com/blog/spark-sql-hood-part-i/
https://www.coursera.org/learn/scala-spark-big-data
https://www.coursera.org/learn/scala-spark-big-data/home/week/1
https://www.reddit.com/r/apachespark/
https://habrahabr.ru/company/npl/blog/327556/
https://habrahabr.ru/company/jugru/blog/325070/
http://stackoverflow.com/questions/31872396/how-to-encode-categorical-features-in-apache-spark
http://www.jowanza.com/post/159128786624/a-gentle-intro-to-graph-analytics-with-graphframes
https://databricks.com/try-databricks
https://newcircle.com/s/post/1795/modern-spark-dataframe-dataset-tutorial
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
https://toree.incubator.apache.org/
val textFile = sc.textFile("README.md")
Find the line with the most words:
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Map-reduce (word count) example:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
Online classes
https://bigdatauniversity.com/courses/spark-rdd/
https://www.youtube.com/playlist?list=PLf0swTFhTI8rjBS9zJGReO1IWLf7Lpi7g
https://bigdatauniversity.com/courses/what-is-spark/
https://www.coursera.org/specializations/scala
https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x
Books
http://www.informit.com/store/practical-data-science-with-hadoop-and-spark-designing-9780134024141
http://blog.matthewrathbone.com/2017/01/13/spark-books.html
https://habrahabr.ru/company/retailrocket/blog/258543/
http://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/ UDF
https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey
http://blog.madhukaraphatak.com/categories/spark-two/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-sparksession.html
https://news.ycombinator.com/item?id=12889469
https://leanpub.com/powerjava Mark Watson
http://tekslate.com/spark-interview-questions/
http://blog.districtdatalabs.com/getting-started-with-spark-in-python
https://github.com/intel-analytics/BigDL
http://www.svds.com/building-a-prediction-engine-using-spark-kudu-and-impala/
http://feederio.com/bigdata-books
https://www.mapr.com/getting-started-apache-spark
http://www.sanfoundry.com/hadoop-questions-campus-interviews/
https://github.com/deanwampler/JustEnoughScalaForSpark
https://github.com/luisbelloch/data_processing_course/blob/master/README.md
http://www.jowanza.com/post/154680354949/a-gentle-intro-to-udafs-in-apache-spark
https://www.safaribooksonline.com/library/view/the-spark-video/9781491970355/
https://0x0fff.com/apache-spark-future
https://www.amazon.com/gp/product/B01AVJDE9I Apache Spark Scala Interview Questions
https://www.amazon.com/collection-Interview-Questions-Collection-Programming-ebook/dp/B01I02UDD8
http://beekeeperdata.com/posts/hadoop/2015/12/28/spark-java-tutorial.html
https://fossies.org/dox/spark-2.0.0/index.html
https://habrahabr.ru/post/316988/ Non-linear regression with spark
http://hortonworks.com/blog/try-apache-spark-2-1-zeppelin-hortonworks-data-cloud/
http://192.168.56.1:4040/jobs/
Sharing data between execution nodes:
broadcast variables - read-only, shipped once to each executor
accumulator variables - executors can only add to them; the driver reads the result (see the sketch below)
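A minimal sketch of both shared-variable types, assuming Spark 2.x (the variable names and file are illustrative, not from these notes):

val stopWords = sc.broadcast(Set("a", "an", "the"))   // read-only copy shipped once per executor
val emptyLines = sc.longAccumulator("emptyLines")     // executors add to it, the driver reads it

val words = sc.textFile("README.md").flatMap { line =>
  if (line.trim.isEmpty) emptyLines.add(1)
  line.split(" ").filter(w => !stopWords.value.contains(w))
}
words.count()                     // the action triggers the computation
println(emptyLines.value)         // reliable only after an action has run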
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // means reduceByKey((x, y) => x + y)
    wordCounts.saveAsTextFile(outputPath)
  }
}
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_ + _)
// val sum = numbers.reduceLeft((a: Int, b: Int) => a + b)  // same as above
println("The sum of the numbers one through five is " + sum)
(_+_) is Scala shorthand for an anonymous function that takes two parameters and returns the result of invoking the + operator on the first, passing the second as the argument.
Windows
Some Windows users may get this error when first trying to get a DataFrame working:
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
This error can be resolved by running your command prompt as administrator and then typing:
C:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive
This basically gives \tmp\hive 777 permissions. The error has more to do with Hadoop/Hive than with Spark itself. If you still get the error (really old versions of Windows, such as Windows 7, sometimes still have issues), you may need a different winutils file; here is a link to one that may work for you:
https://issues.apache.org/jira/browse/SPARK-10528
more general info regarding this error:
c:\DEV\spark-2.0.0-bin-hadoop2.7>bin\run-example.cmd SparkPi
c:\DEV\spark-2.0.0-bin-hadoop2.7>bin\spark-shell
scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line> edit history
:help [command] print this summary or command-specific help
:history [num] show the history (optional num is commands to show)
:h? <string> search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v] show the implicits in scope
:javap <path|class> disassemble a file or class name
:line <id>|<line> place line(s) at the end of history
:load <path> interpret lines in a file
:paste [-raw] [path] enter paste mode or paste a file
:power enable power user mode
:quit exit the interpreter
:replay [options] reset the repl and replay all previous commands
:require <path> add a jar to the classpath
:reset [options] reset the repl to its initial state, forgetting all session entries
:save <path> save replayable session to a file
:sh <command line> run a shell command (result is implicitly => List[String])
:settings <options> update compiler options, if possible; see reset
:silent disable/enable automatic printing of results
:type [-v] <expr> display the type of an expression without evaluating it
:kind [-v] <expr> display the kind of expression's type
:warnings show the suppressed warnings from the most recent line which had any
Windows 10
http://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows
http://yokekeong.com/running-apache-spark-with-sparklyr-and-r-in-windows/ RStudio
https://nerdsrule.co/2016/06/15/ipython-notebook-and-spark-setup-for-windows-10/ IPython
https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/
http://techgobi.blogspot.in/2016/08/configure-spark-on-windows-some-error.html
http://hadoopsupport.blogspot.com/2016/05/install-apache-spark-on-windows-10.html
https://www.youtube.com/watch?v=WlE7RNdtfwE
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
In Spark 2.0, DataFrame is just a type alias for Dataset[Row], so Datasets are the way to go. Under the hood RDDs are still used, so knowing how they work is not essential, but it is often useful.
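A small sketch of that relationship, assuming a Spark 2.x spark-shell session (implicits are imported automatically; the case class and file name are illustrative):

case class Word(word: String, count: Long)

val df = spark.read.text("README.md")            // DataFrame, i.e. Dataset[Row]
val words = df.as[String]                        // typed Dataset[String]
  .flatMap(_.split(" "))
  .groupBy("value").count()
  .withColumnRenamed("value", "word")
  .as[Word]                                      // typed Dataset[Word]
words.show(5)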
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-overview.html
https://www.reddit.com/r/apachespark/comments/5078lw/apache_spark_developer_certification/
http://www.slideshare.net/Newprolab/data-science-week-2016-rambler-co-apache-spark
http://www.dattamsha.com/2014/09/hadoop-mr-vs-spark-rdd-wordcount-program/
https://habrahabr.ru/company/jugru/blog/309776/
https://www.youtube.com/watch?v=I5JtvpyM14U
http://beekeeperdata.com/posts/hadoop/2015/12/14/spark-scala-tutorial.html
https://www.youtube.com/watch?v=OheiUl_uXwo
http://www.kdnuggets.com/2016/09/7-steps-mastering-apache-spark.html
On Windows using vagrant and VirtualBox
1) https://atlas.hashicorp.com/paulovn/boxes/spark-base64
2) https://github.com/alexholmes/vagrant-hadoop-spark-hive
c:\DEV\vagrant_boxes>ls
spark-examples
c:\DEV\vagrant_boxes>vagrant box add centos65 https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'centos65' (v0) for provider:
box: Downloading: https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box
box: Progress: 100% (Rate: 9596k/s, Estimated time remaining: --:--:--)
==> box: Successfully added box 'centos65' (v0) for 'virtualbox'!
c:\DEV\vagrant_boxes>ls
spark-examples
3) https://github.com/bethesdamd/spark-examples
c:\DEV\vagrant_boxes\spark-examples>vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'precise32' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: >= 0
==> default: Box file was not detected as metadata. Adding it directly...
==> default: Adding box 'precise32' (v0) for provider: virtualbox
default: Downloading: http://files.vagrantup.com/precise32.box
https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/
http://blog.madhukaraphatak.com/categories/spark-two/
https://spark-summit.org/2016/schedule/
https://habrahabr.ru/company/wrike/blog/304570/
https://github.com/bethesdamd/spark-examples
Real-time analytics
https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-18-Neumann-Alla.pdf
https://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656 Book
https://www.infoq.com/articles/apache-spark-introduction
https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
Online classes
http://bigdatauniversity.com/courses/spark-fundamentals/ Free training
https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
https://www.dataquest.io/mission/123
https://university.cloudera.com/content/cca175
https://dl.dropboxusercontent.com/u/8252984/spark.html
https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
http://technicaltidbit.blogspot.com/
SparkSandbox.scala
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

object SparkSandbox extends App {
  val conf = new SparkConf().setAppName("Spark Sandbox").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val dataSource = sc.makeRDD(List(
    Person("Mike", 28), Person("Adam", 31), Person("John", 30)))
  val output = dataSource.filter(person => person.age > 30)
  output.collect().foreach(println)   // an action is needed to actually run the lazy filter
  sc.stop()
}

$ sbt compile
The SparkContext:
is the main entry point for Spark functionality
tells Spark how to access cluster
is limited to one active instance per JVM
is automatically created in interactive shells: sc
Interactive shell:
Scala: spark-shell
Python: pyspark
RDDs are:
resilient
distributed (i.e. partitioned across the nodes of a cluster)
data sets
that are immutable
can be cached or temporarily stored on disk
and may depend on zero or more RDDs
Operations on RDDs:
transformations
create/extend the DAG
are evaluated lazily
return new RDDs rather than computed values
actions
trigger execution of the accumulated transformations
return value(s) to the driver (or write them to storage)
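A minimal illustration of this laziness, assuming a spark-shell session (the file name is illustrative):

val lines = sc.textFile("README.md")                  // transformation: nothing is read yet
val sparkLines = lines.filter(_.contains("Spark"))    // transformation: just extends the DAG
val n = sparkLines.count()                            // action: only now is the file read and the DAG executed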
Lineage refers to the directed acyclic graph (DAG) from the root or any persisted (intermediate) RDD, from which any partition can be reconstructed in case of failure
Examples of narrow (i.e. co-partitioned data) operations:
map, flatMap or mapValues
filter
sample
union
Examples of wide (i.e. non-co-partitioned data) operations:
groupByKey
reduceByKey
sortByKey
distinct
join
Shuffle combines data from all partitions to generate new key-value pairs: expensive operation
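A small sketch contrasting narrow and wide operations; toDebugString shows the extra stage introduced by the shuffle (the file name is illustrative):

val words = sc.textFile("README.md").flatMap(_.split(" "))   // narrow
val pairs = words.map(word => (word, 1))                     // narrow
val counts = pairs.reduceByKey(_ + _)                        // wide: requires a shuffle
println(counts.toDebugString)                                // lineage shows the ShuffledRDD / stage boundary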
Job is a sequence of transformations initiated by an action
A job consists of stages; stages consist of tasks
Stage boundaries are defined by shuffle dependencies
Tasks are serialized and distributed to executors
Caching
RDDs can be cached/persisted with:
cache(): deserialized in memory
persist():
MEMORY_ONLY: deserialized in JVM
MEMORY_ONLY_SER: serialized; space efficient but more CPU-intensive to read
MEMORY_AND_DISK: deserialized in JVM unless partition(s) do not fit in memory
MEMORY_AND_DISK_SER: serialized version of previous
DISK_ONLY: stored only on disk
OFF_HEAP: serialized in Tachyon with a shared pool of memory and less overhead from garbage collection
Objects are aged out of cache in a least-recently-used (LRU) fashion; manual removal can be done with unpersist()
Spark performs best when data fits in memory; best not to spill to disk unless absolutely necessary
Replicated persist() modes perform better after faults
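A minimal caching sketch, assuming Spark 1.6+/2.x (the storage level and file name are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("README.md")
logs.persist(StorageLevel.MEMORY_AND_DISK)    // spill to disk only if a partition does not fit in memory
logs.count()                                  // first action materializes the cache
logs.filter(_.contains("Spark")).count()      // reuses the cached partitions
logs.unpersist()                              // manual removal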
Serialization is critical for network performance and memory use:
Java (default): slow but flexible
Kryo: fast but not all data types are supported
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses( Array(classOf[MyClass1], classOf[MyClass2]) )
Tasks must be serializable (typically < 20 kB on master)
Variables and methods in local vs cluster mode behave differently due to closures and serialization
classes must be serializable (i.e. extend Serializable)
methods can be made functions (i.e. transferred from classes to objects)
hygienic closures can be created with shim functions
classes that depend on non-serializable classes can mark those fields with the @transient annotation
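One common fix, related to the hygienic-closure point above: copy the needed field into a local val so the closure captures only that value instead of the whole (non-serializable) enclosing instance. A hedged sketch; the class and field names are illustrative:

import org.apache.spark.rdd.RDD

class WordFilter(val keyword: String) {        // the class itself is not Serializable
  def matching(rdd: RDD[String]): RDD[String] = {
    val kw = keyword                           // copy the field into a local val
    rdd.filter(line => line.contains(kw))      // the closure captures only kw, not `this`
  }
}
// usage: new WordFilter("Spark").matching(sc.textFile("README.md")).count()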
http://spark.apache.org/docs/latest/quick-start.html
https://databaseline.wordpress.com/2016/04/25/shell-scripts-to-ease-spark-application-development/
https://databaseline.wordpress.com/2016/04/01/a-quickie-on-spark-actions-laziness-and-caching/
http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-using-apache-spark-and-sql/
Spark on Mac
Option #1 Download Spark with Hadoop support
http://spark.apache.org/downloads.html
cd ~/Downloads/spark-2.1.0-bin-hadoop2.7
./bin/spark-shell
Option #2: use brew. Disadvantage: it does not include the examples and MLlib
brew search spark
apache-spark
brew install apache-spark // version 2.1.0
ls /usr/local/Cellar/apache-spark/2.1.0/bin
find-spark-home pyspark spark-beeline spark-shell spark-submit
load-spark-env.sh run-example spark-class spark-sql sparkR
ls -l /usr/local/bin | grep spark
lrwxr-xr-x 1 39 Jan 31 14:01 sparkR -> ../Cellar/apache-spark/2.1.0/bin/sparkR
lrwxr-xr-x 1 45 Jan 31 14:01 spark-submit -> ../Cellar/apache-spark/2.1.0/bin/spark-submit
lrwxr-xr-x 1 42 Jan 31 14:01 spark-sql -> ../Cellar/apache-spark/2.1.0/bin/spark-sql
lrwxr-xr-x 1 44 Jan 31 14:01 spark-shell -> ../Cellar/apache-spark/2.1.0/bin/spark-shell
lrwxr-xr-x 1 44 Jan 31 14:01 spark-class -> ../Cellar/apache-spark/2.1.0/bin/spark-class
lrwxr-xr-x 1 46 Jan 31 14:01 spark-beeline -> ../Cellar/apache-spark/2.1.0/bin/spark-beeline
lrwxr-xr-x 1 44 Jan 31 14:01 run-example -> ../Cellar/apache-spark/2.1.0/bin/run-example
lrwxr-xr-x 1 40 Jan 31 14:01 pyspark -> ../Cellar/apache-spark/2.1.0/bin/pyspark
lrwxr-xr-x 1 50 Jan 31 14:01 load-spark-env.sh -> ../Cellar/apache-spark/2.1.0/bin/load-spark-env.sh
lrwxr-xr-x 1 48 Jan 31 14:01 find-spark-home -> ../Cellar/apache-spark/2.1.0/bin/find-spark-home
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
export PYTHONPATH=/usr/local/Cellar/apache-spark/2.1.0/libexec/python/:$PYTHONPATH
http://datafireball.com/2015/04/17/running-spark-locally-on-ios/
How to set $JAVA_HOME environment variable on Mac OSX.
http://include.aorcsik.com/2014/09/19/installing-java-on-mac-os-10-10-yosemite-beta/
Set the JAVA_HOME variable to the output of /usr/libexec/java_home: just export JAVA_HOME in ~/.bash_profile or ~/.profile.
$ vim .bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
$ source .bash_profile
$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
Why /usr/libexec/java_home?
This java_home utility returns the Java version specified in Java Preferences for the current user. For example:
/usr/libexec/java_home -V
Matching Java Virtual Machines (3):
    1.7.0_05, x86_64:         "Java SE 7"    /Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
    1.6.0_41-b02-445, x86_64: "Java SE 6"    /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_41-b02-445, i386:   "Java SE 6"    /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
HADOOP on Mac
http://amodernstory.com/2014/09/23/installing-hadoop-on-mac-osx-yosemite/
brew list
apache-spark gdbm htop maven mysql openssl readline scala tig
cctools git makedepend mongodb nvm pkg-config sbt sqlite xz
$ ls /usr/local/Cellar
apache-spark gdbm htop maven mysql openssl readline scala tig
cctools git makedepend mongodb nvm pkg-config sbt sqlite xz
spark-shell
Spark context Web UI available at http://17.114.22.50:4040
Spark context available as 'sc' (master = local[*], app id = local-1485900389662).
Spark session available as 'spark'.
scala>
pyspark
https://hadoopist.wordpress.com/category/apache-spark/page/2/?iframe=true&preview=true%2Ffeed%2F
[~/Downloads/SPARK/spark-2.0.0-bin-hadoop2.7/examples/src/main/scala/org/apache/spark/examples]
IPython+Spark
http://amodernstory.com/2015/03/05/installing-and-running-spark-with-python-notebook-on-mac/
https://www.codementor.io/spark/tutorial/spark-python-rdd-basics#/
http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
https://github.com/jadianes/spark-py-notebooks
http://blog.prabeeshk.com/blog/2015/04/07/self-contained-pyspark-application/
SPARK+CASSANDRA
http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/
http://rustyrazorblade.com/2015/08/migrating-from-mysql-to-cassandra-using-spark/
http://rustyrazorblade.com/2015/07/cassandra-pyspark-dataframes-revisted/
Installing Spark from sources
$ brew install maven
$ tar -xvzf spark-1.4.1.tgz
$ cd spark-1.4.1
$ mvn -DskipTests clean package    (takes > 15 min to produce the following output)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 31.458 s]
[INFO] Spark Launcher Project ............................. SUCCESS [ 28.369 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 8.593 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 4.597 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 3.599 s]
[INFO] Spark Project Core ................................. SUCCESS [02:41 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 12.435 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 36.629 s]
[INFO] Spark Project Streaming ............................ SUCCESS [01:15 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:05 min]
[INFO] Spark Project SQL .................................. SUCCESS [01:32 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:56 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 9.808 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:49 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 21.156 s]
[INFO] Spark Project Assembly ............................. SUCCESS [01:02 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [ 12.396 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 16.039 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 22.974 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 22.455 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 11.765 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [ 21.574 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:34 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 19.052 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:42 min
[INFO] Finished at: 2015-08-04T15:33:27-07:00
[INFO] Final Memory: 101M/2126M
[INFO] ------------------------------------------------------------------------
mlubinsky$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/04 15:41:34 INFO SecurityManager: Changing view acls to: mlubinsky
15/08/04 15:41:34 INFO SecurityManager: Changing modify acls to: mlubinsky
15/08/04 15:41:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mlubinsky); users with modify permissions: Set(mlubinsky)
15/08/04 15:41:35 INFO HttpServer: Starting HTTP Server
15/08/04 15:41:35 INFO Utils: Successfully started service 'HTTP class server' on port 55888.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.1
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/04 15:41:37 INFO SparkContext: Running Spark version 1.4.1
15/08/04 15:41:37 INFO SecurityManager: Changing view acls to: mlubinsky
15/08/04 15:41:37 INFO SecurityManager: Changing modify acls to: mlubinsky
15/08/04 15:41:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mlubinsky); users with modify permissions: Set(mlubinsky)
15/08/04 15:41:38 INFO Slf4jLogger: Slf4jLogger started
15/08/04 15:41:38 INFO Remoting: Starting remoting
15/08/04 15:41:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@17.247.157.92:55889]
15/08/04 15:41:38 INFO Utils: Successfully started service 'sparkDriver' on port 55889.
15/08/04 15:41:38 INFO SparkEnv: Registering MapOutputTracker
15/08/04 15:41:38 INFO SparkEnv: Registering BlockManagerMaster
15/08/04 15:41:38 INFO DiskBlockManager: Created local directory at /private/var/folders/fw/pvmhqrrn1flgsq36wkd3s7ww0000gn/T/spark-0368f07e-3826-4b15-8872-03554f95d58a/blockmgr-f502a6e1-db07-4cf5-98cc-8f967b0ff926
15/08/04 15:41:38 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/08/04 15:41:38 INFO HttpFileServer: HTTP File server directory is /private/var/folders/fw/pvmhqrrn1flgsq36wkd3s7ww0000gn/T/spark-0368f07e-3826-4b15-8872-03554f95d58a/httpd-878c8663-eb22-4787-bcf1-b472b5c014a1
15/08/04 15:41:38 INFO HttpServer: Starting HTTP Server
15/08/04 15:41:38 INFO Utils: Successfully started service 'HTTP file server' on port 55890.
15/08/04 15:41:38 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/04 15:41:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/04 15:41:38 INFO SparkUI: Started SparkUI at http://17.247.157.92:4040
15/08/04 15:41:38 INFO Executor: Starting executor ID driver on host localhost
15/08/04 15:41:38 INFO Executor: Using REPL class URI: http://17.247.157.92:55888
15/08/04 15:41:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55891.
15/08/04 15:41:38 INFO NettyBlockTransferService: Server created on 55891
15/08/04 15:41:38 INFO BlockManagerMaster: Trying to register BlockManager
15/08/04 15:41:38 INFO BlockManagerMasterEndpoint: Registering block manager localhost:55891 with 265.1 MB RAM, BlockManagerId(driver, localhost, 55891)
15/08/04 15:41:38 INFO BlockManagerMaster: Registered BlockManager
15/08/04 15:41:38 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/08/04 15:41:39 INFO SparkILoop: Created sql context..
SQL context available as sqlContext.
scala>
On my computer I see two identical folders:
./Documents/Downloads/SPARK/spark-1.4.1
./Downloads/SPARK/spark-1.4.1
brew info scala
scala: stable 2.11.5 (bottled)
http://www.scala-lang.org/
Not installed
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/scala.rb
==> Options
--with-docs
Also install library documentation
--with-src
Also install sources for IDE support
==> Caveats
To use with IntelliJ, set the Scala home to:
/usr/local/opt/scala/idea
http://thomaswdinsmore.com/2015/07/13/big-analytics-roundup-july-13-2015/
http://www.ibmbigdatahub.com/technology/hadoop-and-spark
SPARK on windows
http://nishutayaltech.blogspot.in/2015/04/how-to-run-apache-spark-on-windows7-in.html
http://ysinjab.com/2015/03/28/hello-spark/
SPARK
http://habrahabr.ru/post/274705/
https://www.supergloo.com/spark-tutorial/spark-tutorials-scala/
https://habrahabr.ru/company/retailrocket/blog/302828/
https://news.ycombinator.com/item?id=11126188 SPARK 2
http://fullstackml.com/2016/01/19/how-to-check-hypotheses-with-bootstrap-and-apache-spark/
http://www.teachingmachines.io/blog/2016/1/1/6ywzqtxq79ajny7xkf8jkd13k4t7bu
https://github.com/jpzk/cookiecutter-scala-spark
https://spark.apache.org/docs/latest/
https://dl.dropboxusercontent.com/u/8252984/spark.html#1 Installing Spark
http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
https://github.com/cloudera/spark-timeseries
https://medium.com/@rickynguyen/getting-started-with-spark-day-4-3ad5e5d2d5fd#.41t3wu2ar
https://medium.com/@rickynguyen/getting-started-with-spark-day-5-36b62a6d13bf#.1lliya1hf
http://habrahabr.ru/company/bitrix/blog/273279/
http://www.allitebooks.com/?s=spark BOOKS
https://www.youtube.com/watch?v=y8WSPyeF73g&feature=share
http://www.it-ebooks.org/book/oreilly/advanced_analytics_with_spark
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
http://habrahabr.ru/company/targetix/blog/266009/ SPARK
http://baweaver.com/blog/2015/06/21/intro-to-spark/
https://www.quora.com/How-do-I-learn-Apache-Spark
https://www.youtube.com/watch?v=RgBzVHe2D84
http://www.insightdataengineering.com/blog/Working-With-Apache-Spark.html
val textFile = sc.textFile("README.md")
val wordCounts =
textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
----------
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
Understand what an RDD is.
Understand the difference between a driver, master and executor.
Understand the lazy execution model, i.e. the difference between transformations and actions, and when actual work is done.
Understand how data is partitioned across executors: how the number of partitions and the partitioning key affect a job's running time, what the right number of partitions is for various jobs, and how custom partitioning functions can be defined.
Learn to inspect RDDs from the UI to see the number of partitions and how much fits in memory vs. disk.
Learn about shuffles: what a shuffle is, why it is expensive, which operations can cause shuffles, and how they can be avoided (e.g. use reduceByKey instead of groupByKey if you can; see the sketch below).
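A hedged sketch of the groupByKey vs reduceByKey point: reduceByKey combines values on each partition before the shuffle, so far less data crosses the network (the file name is illustrative):

val pairs = sc.textFile("README.md").flatMap(_.split(" ")).map(word => (word, 1))

// groupByKey ships every (word, 1) pair across the network, then sums on the reduce side
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first, shuffling only partial sums
val fast = pairs.reduceByKey(_ + _)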
http://habrahabr.ru/company/mlclass/blog/250811/
In Spark, a DataFrame is a distributed collection of data organized into named columns (see: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html). Conceptually, DataFrames are similar to tables in a relational database, except that they are partitioned across multiple nodes in a Spark cluster.
It’s important to understand that Spark does not actually load the socialdata collection into memory at this point. We’re only setting up to perform some analysis on that data; the actual data isn’t loaded into Spark until it is needed to perform some calculation later in the job. This allows Spark to perform the necessary column and partition pruning operations to optimize data access into Solr.
Every DataFrame has a schema. You can use the printSchema() function to get information about the fields available for the tweets DataFrame:
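A small sketch of that schema inspection, using a generic JSON source rather than the Solr-backed socialdata collection from the article (the path and field names are illustrative):

val tweets = spark.read.json("tweets.json")    // schema is inferred from the file
tweets.printSchema()                           // lists the named, typed columns
tweets.select("author", "text").show(5)        // only the selected columns are needed, so Spark can prune the rest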
http://habrahabr.ru/post/271375/
https://www.reddit.com/r/apachespark
http://www.zdnet.com/article/hadoop-and-spark-a-tale-of-two-cities/
We should think of Spark as superseding MapReduce and probably a couple of other projects (Pig; "classic" Mahout 0.9, superseded by MLlib; and Giraph, by GraphX).
Spark has no storage system of its own. Right off the bat, this means Spark has to be paired with something else to do any work. That's usually Hadoop-related storage like HDFS, or maybe HBase, but could also be things like Amazon S3. Spark has no security model of its own, so needs to lean on things like YARN / Hadoop mechanisms if that matters. There's no NoSQL-like engine in Spark like HBase / Cassandra, or interactive SQL like Drill /
Spark has a Map and a Reduce function like MapReduce, but it adds others like Filter, Join and Group-by, so it’s easier to develop for Spark. In fact, Spark provides for lots of instructions that are a higher level of abstraction than what MapReduce provided. You can think more about how you want the data processed, rather than about how to cajole MapReduce into doing what you want. This might not seem that important, until you look at this: MapReduce-Wordcount. This is the code to calculate a count of words in a text file, done in MapReduce (not Spark). It’s over 100 lines of code, and fairly unintuitive. The equivalent in Spark is found on this page: Spark Examples (look for the Word Count example). It’s four lines versus over 100.
Applying such an operation to an RDD produces a new RDD. As a rule, these are operations that transform the elements of the dataset in some way. Here is an incomplete list of the most common transformations, each of which returns a new dataset (RDD):
.map(function) - applies the function to every element of the dataset
.filter(function) - returns all elements of the dataset for which the function returned true
.distinct([numTasks]) - returns a dataset containing the unique elements of the source dataset
Set operations:
.union(otherDataset)
.intersection(otherDataset)
.cartesian(otherDataset) - the new dataset contains all possible pairs (A, B), where the first element belongs to the source dataset and the second to the argument dataset
Actions are used when the result needs to be materialized - typically to save data to disk or print part of it to the console. Here is a list of the most common actions that can be applied to an RDD:
.saveAsTextFile(path) - saves the data to a text file (in HDFS, on the local machine, or in any other supported file system; the full list is in the documentation)
.collect() - returns the elements of the dataset as an array. This is typically used when the dataset is already small (various filters and transformations have been applied) and you need to visualize it or analyze it further, for example with the Pandas package
.take(n) - returns the first n elements of the dataset as an array
.count() - returns the number of elements in the dataset
.reduce(function) - a familiar operation for anyone who knows MapReduce. The way this operation works implies that the function (which takes two arguments and returns one value) must be commutative and associative
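A quick Scala equivalent of a few of these transformations and actions (the sample data is illustrative):

val nums = sc.parallelize(List(5, 7, 1, 12, 10, 25, 7))
val doubled = nums.map(_ * 2)            // transformation
val big = doubled.filter(_ > 10)         // transformation
val uniq = big.distinct()                // transformation
println(uniq.count())                    // action
println(uniq.reduce(_ + _))              // action: + is commutative and associative
uniq.take(3).foreach(println)            // action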
These are the basics you need to know when working with the tool. Now let's get a bit of practice and show how to load data into Spark and do simple computations on it.
When starting Spark, the first thing to do is create a SparkContext (put simply, the object responsible for the lower-level operations against the cluster; see the documentation for details). When you launch spark-shell it is created automatically and is available right away as the sc object.
Data can be loaded into Spark in two ways:
a) Directly from the local program with the .parallelize(data) function:
localData = [5, 7, 1, 12, 10, 25]
ourFirstRDD = sc.parallelize(localData)
b) From supported storage systems (e.g. HDFS) with the .textFile(path) function:
ourSecondRDD = sc.textFile("path to some data on the cluster")
At this point it is worth noting one peculiarity of data storage in Spark and, at the same time, its most useful function, .cache() (to which Spark partly owes its popularity): it lets you cache data in RAM (subject to availability). This allows iterative computations to run in memory and thus avoid the I/O overhead. This is especially important for machine learning and graph computations, since most of those algorithms are iterative, from gradient methods to algorithms such as PageRank.
Once the data has been loaded into an RDD, we can apply to it the transformations and actions described above. For example:
Look at the first few elements:
for item in ourRDD.top(10):
    print item
Or load these elements straight into Pandas and work with a DataFrame:
import pandas as pd
pd.DataFrame(ourRDD.map(lambda x: x.split(";")[:]).top(10))
For example, let's compute the maximum and minimum elements of our dataset. As you might guess, this can be done with the .reduce() function:
localData = [5, 7, 1, 12, 10, 25]
ourRDD = sc.parallelize(localData)
print ourRDD.reduce(max)
print ourRDD.reduce(min)
So, we have covered the basic concepts needed to work with the tool. We did not cover SQL or working with <key, value> pairs (which is easy: first transform the RDD, e.g. with a map, to extract the key, and then use the built-in functions such as sortByKey, countByKey, join, and others).
Spark
http://www.infoq.com/articles/apache-spark-introduction
http://www.infoq.com/articles/apache-spark-sql
https://www.youtube.com/watch?v=65aV15uDKgA
https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
http://www.reddit.com/r/apachespark
http://baweaver.com/blog/2015/06/21/intro-to-spark/
http://sparkhub.databricks.com/
http://www.insightdataengineering.com/blog/Working-With-Apache-Spark.html
https://www.youtube.com/watch?v=FjhRkfAuU7I
http://www.infoq.com/presentations/apache-spark-future
http://www.infoq.com/presentations/spark-cassandra
http://en.wikipedia.org/wiki/Spark_(cluster_computing_framework)
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://horicky.blogspot.ru/2015/02/big-data-processing-in-spark.html
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
http://www.planetcassandra.org/blog/the-new-analytics-toolbox-with-apache-spark-going-beyond-hadoop/
Spark is only a computing engine, while Hadoop is a complete stack of storage, cluster management and computing tools, and Spark can run well on Hadoop. However, we do see many deployments that are not on Hadoop, including deployments on NoSQL stores (e.g. Cassandra) and deployments directly against cloud storage (e.g. Amazon S3, Databricks Cloud). In this sense Spark is reaching a broader audience than Hadoop users.
Most of the development activity in Apache Spark is now in the built-in libraries, including Spark SQL, Spark Streaming, MLlib and GraphX. Out of these, the most popular are Spark Streaming and Spark SQL
It is true that Spark uses a micro-batch execution model, but I don't think this is a problem in practice, because the batches can be as short as 0.5 seconds. In most streaming big-data applications, the latency to get the data in is much higher (e.g. sensors that send data every 10 seconds), or the analysis you want runs over a longer window (e.g. tracking events over the past 10 minutes), so it doesn't matter that processing takes a short amount of time.
The benefit of Spark's micro-batch model is that you get full fault-tolerance and "exactly-once" processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink and Storm don't provide this, requiring application developers to worry about missing data or to treat the streaming results as potentially incorrect. Again, that can be okay for some applications (e.g. just basic monitoring), but it makes it hard to write more complex applications and reason about their results
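A minimal sketch of the micro-batch model with the classic DStream API (the 1-second batch interval, host, and port are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))    // each micro-batch covers 1 second of data
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                    // per-batch word counts
ssc.start()
ssc.awaitTermination()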