https://www.amazon.com/s/ref=nb_sb_noss_2?url=node%3D154606011&field-keywords=hadoop+spark+interview
www.itversity.com/2018/04/19/setup-development-environment-big-data-hadoop-and-spark/
“Never drop the ball pursuing a goal just because of the time it takes. That time will pass by anyway.”
https://cloudxlab.com/blog/spark-interview-questions/
http://virtuslab.com/blog/spark-sql-hood-part-i/
https://www.coursera.org/learn/scala-spark-big-data
https://www.coursera.org/learn/scala-spark-big-data/home/week/1
https://www.reddit.com/r/apachespark/
https://habrahabr.ru/company/npl/blog/327556/
https://habrahabr.ru/company/jugru/blog/325070/
http://stackoverflow.com/questions/31872396/how-to-encode-categorical-features-in-apache-spark
http://www.jowanza.com/post/159128786624/a-gentle-intro-to-graph-analytics-with-graphframes
https://databricks.com/try-databricks
https://newcircle.com/s/post/1795/modern-spark-dataframe-dataset-tutorial
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
https://toree.incubator.apache.org/
val textFile = sc.textFile("README.md")
Find the line with the most words:
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Map-reduce (word count) example:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
Online classes
https://bigdatauniversity.com/courses/spark-rdd/
https://www.youtube.com/playlist?list=PLf0swTFhTI8rjBS9zJGReO1IWLf7Lpi7g
https://bigdatauniversity.com/courses/what-is-spark/
https://www.coursera.org/specializations/scala
https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x
Books
http://www.informit.com/store/practical-data-science-with-hadoop-and-spark-designing-9780134024141
http://blog.matthewrathbone.com/2017/01/13/spark-books.html
https://habrahabr.ru/company/retailrocket/blog/258543/
http://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/ UDF
https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey
http://blog.madhukaraphatak.com/categories/spark-two/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-sparksession.html
https://news.ycombinator.com/item?id=12889469
https://leanpub.com/powerjava Mark Watson
http://tekslate.com/spark-interview-questions/
http://blog.districtdatalabs.com/getting-started-with-spark-in-python
https://github.com/intel-analytics/BigDL
http://www.svds.com/building-a-prediction-engine-using-spark-kudu-and-impala/
http://feederio.com/bigdata-books
https://www.mapr.com/getting-started-apache-spark
http://www.sanfoundry.com/hadoop-questions-campus-interviews/
https://github.com/deanwampler/JustEnoughScalaForSpark
https://github.com/luisbelloch/data_processing_course/blob/master/README.md
http://www.jowanza.com/post/154680354949/a-gentle-intro-to-udafs-in-apache-spark
https://www.safaribooksonline.com/library/view/the-spark-video/9781491970355/
https://0x0fff.com/apache-spark-future
https://www.amazon.com/gp/product/B01AVJDE9I Apache Spark Scala Interview Questions
https://www.amazon.com/collection-Interview-Questions-Collection-Programming-ebook/dp/B01I02UDD8
http://beekeeperdata.com/posts/hadoop/2015/12/28/spark-java-tutorial.html
https://fossies.org/dox/spark-2.0.0/index.html
https://habrahabr.ru/post/316988/ Non-linear regression with spark
http://hortonworks.com/blog/try-apache-spark-2-1-zeppelin-hortonworks-data-cloud/
http://192.168.56.1:4040/jobs/
Sharing data between execution nodes:
broadcast variables - read-only, shipped once to each executor
accumulator variables - executors can only add to them; the driver reads the result (see the sketch below)
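A minimal sketch of both shared-variable types, assuming Spark 2.x (the variable names and file are illustrative, not from these notes):

val stopWords = sc.broadcast(Set("a", "an", "the"))   // read-only copy shipped once per executor
val emptyLines = sc.longAccumulator("emptyLines")     // executors add to it, the driver reads it

val words = sc.textFile("README.md").flatMap { line =>
  if (line.trim.isEmpty) emptyLines.add(1)
  line.split(" ").filter(w => !stopWords.value.contains(w))
}
words.count()                     // the action triggers the computation
println(emptyLines.value)         // reliable only after an action has run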
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // means reduceByKey((x, y) => x + y)
    wordCounts.saveAsTextFile(outputPath)
  }
}
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_ + _)
// val sum = numbers.reduceLeft((a: Int, b: Int) => a + b)  // same as above
println("The sum of the numbers one through five is " + sum)
(_+_) is Scala shorthand for an anonymous function that takes two parameters and returns the result of invoking the + operator on the first, passing the second as the argument.
Windows
Some Windows users may get this error when first trying to get a DataFrame working:
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
This error can be resolved by running your command prompt as administrator and then typing:
C:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive
This basically gives \tmp\hive 777 permissions. The error has more to do with Hadoop/Hive than with Spark itself. If you still get the error (really old versions of Windows, such as Windows 7, sometimes still have issues), you may need a different winutils file; here is a link to one that may work for you:
https://issues.apache.org/jira/browse/SPARK-10528
more general info regarding this error:
c:\DEV\spark-2.0.0-bin-hadoop2.7>bin\run-example.cmd SparkPi
c:\DEV\spark-2.0.0-bin-hadoop2.7>bin\spark-shell
scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line> edit history
:help [command] print this summary or command-specific help
:history [num] show the history (optional num is commands to show)
:h? <string> search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v] show the implicits in scope
:javap <path|class> disassemble a file or class name
:line <id>|<line> place line(s) at the end of history
:load <path> interpret lines in a file
:paste [-raw] [path] enter paste mode or paste a file
:power enable power user mode
:quit exit the interpreter
:replay [options] reset the repl and replay all previous commands
:require <path> add a jar to the classpath
:reset [options] reset the repl to its initial state, forgetting all session entries
:save <path> save replayable session to a file
:sh <command line> run a shell command (result is implicitly => List[String])
:settings <options> update compiler options, if possible; see reset
:silent disable/enable automatic printing of results
:type [-v] <expr> display the type of an expression without evaluating it
:kind [-v] <expr> display the kind of expression's type
:warnings show the suppressed warnings from the most recent line which had any
Windows 10
http://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows
http://yokekeong.com/running-apache-spark-with-sparklyr-and-r-in-windows/ RStudio
https://nerdsrule.co/2016/06/15/ipython-notebook-and-spark-setup-for-windows-10/ IPython
https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/
http://techgobi.blogspot.in/2016/08/configure-spark-on-windows-some-error.html
http://hadoopsupport.blogspot.com/2016/05/install-apache-spark-on-windows-10.html
https://www.youtube.com/watch?v=WlE7RNdtfwE
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
In Spark 2.0, DataFrame is just a type alias for Dataset[Row], so Datasets are the way to go. Under the hood RDDs are still used, so knowing how they work is not essential, but it is often useful.
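A small sketch of that relationship, assuming a Spark 2.x spark-shell session (implicits are imported automatically; the case class and file name are illustrative):

case class Word(word: String, count: Long)

val df = spark.read.text("README.md")            // DataFrame, i.e. Dataset[Row]
val words = df.as[String]                        // typed Dataset[String]
  .flatMap(_.split(" "))
  .groupBy("value").count()
  .withColumnRenamed("value", "word")
  .as[Word]                                      // typed Dataset[Word]
words.show(5)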
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-overview.html
https://www.reddit.com/r/apachespark/comments/5078lw/apache_spark_developer_certification/
http://www.slideshare.net/Newprolab/data-science-week-2016-rambler-co-apache-spark
http://www.dattamsha.com/2014/09/hadoop-mr-vs-spark-rdd-wordcount-program/
https://habrahabr.ru/company/jugru/blog/309776/
https://www.youtube.com/watch?v=I5JtvpyM14U
http://beekeeperdata.com/posts/hadoop/2015/12/14/spark-scala-tutorial.html
https://www.youtube.com/watch?v=OheiUl_uXwo
http://www.kdnuggets.com/2016/09/7-steps-mastering-apache-spark.html
On Windows using vagrant and VirtualBox
1) https://atlas.hashicorp.com/paulovn/boxes/spark-base64
2) https://github.com/alexholmes/vagrant-hadoop-spark-hive
c:\DEV\vagrant_boxes>ls
spark-examples
c:\DEV\vagrant_boxes>vagrant box add centos65 https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'centos65' (v0) for provider:
box: Downloading: https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box
box: Progress: 100% (Rate: 9596k/s, Estimated time remaining: --:--:--)
==> box: Successfully added box 'centos65' (v0) for 'virtualbox'!
c:\DEV\vagrant_boxes>ls
spark-examples
3) https://github.com/bethesdamd/spark-examples
c:\DEV\vagrant_boxes\spark-examples>vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'precise32' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: >= 0
==> default: Box file was not detected as metadata. Adding it directly...
==> default: Adding box 'precise32' (v0) for provider: virtualbox
default: Downloading: http://files.vagrantup.com/precise32.box
https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/
http://blog.madhukaraphatak.com/categories/spark-two/
https://spark-summit.org/2016/schedule/
https://habrahabr.ru/company/wrike/blog/304570/
https://github.com/bethesdamd/spark-examples
Real-time analytics
https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-18-Neumann-Alla.pdf
https://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656 Book
https://www.infoq.com/articles/apache-spark-introduction
https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
Online classes
http://bigdatauniversity.com/courses/spark-fundamentals/ Free training
https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
https://www.dataquest.io/mission/123
https://university.cloudera.com/content/cca175
https://dl.dropboxusercontent.com/u/8252984/spark.html
https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
http://technicaltidbit.blogspot.com/
SparkSandbox.scala
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

object SparkSandbox extends App {
  val conf = new SparkConf().setAppName("Spark Sandbox").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val dataSource = sc.makeRDD(List(
    Person("Mike", 28), Person("Adam", 31), Person("John", 30)))
  val output = dataSource.filter(person => person.age > 30)
  output.collect().foreach(println)   // an action is needed to actually run the lazy filter
  sc.stop()
}

$ sbt compile
The SparkContext:
is the main entry point for Spark functionality
tells Spark how to access cluster
is limited to one active instance per JVM
is automatically created in interactive shells: sc
Interactive shell:
Scala: spark-shell
Python: pyspark
RDDs are:
resilient
distributed (i.e. partitioned across the nodes of a cluster)
data sets
that are immutable
can be cached or temporarily stored on disk
and may depend on zero or more RDDs
Operations on RDDs:
transformations
create/extend the DAG
are evaluated lazily
return new RDDs rather than computed values
actions
trigger execution of the accumulated transformations
return value(s) to the driver (or write them to storage)
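A minimal illustration of this laziness, assuming a spark-shell session (the file name is illustrative):

val lines = sc.textFile("README.md")                  // transformation: nothing is read yet
val sparkLines = lines.filter(_.contains("Spark"))    // transformation: just extends the DAG
val n = sparkLines.count()                            // action: only now is the file read and the DAG executed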
Lineage refers to the directed acyclic graph (DAG) from the root or any persisted (intermediate) RDD, from which any partition can be reconstructed in case of failure
Examples of narrow (i.e. co-partitioned data) operations:
map, flatMap or mapValues
filter
sample
union
Examples of wide (i.e. non-co-partitioned data) operations:
groupByKey
reduceByKey
sortByKey
distinct
join
Shuffle combines data from all partitions to generate new key-value pairs: expensive operation
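A small sketch contrasting narrow and wide operations; toDebugString shows the extra stage introduced by the shuffle (the file name is illustrative):

val words = sc.textFile("README.md").flatMap(_.split(" "))   // narrow
val pairs = words.map(word => (word, 1))                     // narrow
val counts = pairs.reduceByKey(_ + _)                        // wide: requires a shuffle
println(counts.toDebugString)                                // lineage shows the ShuffledRDD / stage boundary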
Job is a sequence of transformations initiated by an action
A job consists of stages; stages consist of tasks
Stage boundaries are defined by shuffle dependencies
Tasks are serialized and distributed to executors
Caching
RDDs can be cached/persisted with:
cache(): deserialized in memory
persist():
MEMORY_ONLY: deserialized in JVM
MEMORY_ONLY_SER: serialized; space efficient but more CPU-intensive to read
MEMORY_AND_DISK: deserialized in JVM unless partition(s) do not fit in memory
MEMORY_AND_DISK_SER: serialized version of previous
DISK_ONLY: stored only on disk
OFF_HEAP: serialized in Tachyon with a shared pool of memory and less overhead from garbage collection
Objects are aged out of cache in a least-recently-used (LRU) fashion; manual removal can be done with unpersist()
Spark performs best when data fits in memory; best not to spill to disk unless absolutely necessary
Replicated persist() modes perform better after faults
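A minimal caching sketch, assuming Spark 1.6+/2.x (the storage level and file name are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("README.md")
logs.persist(StorageLevel.MEMORY_AND_DISK)    // spill to disk only if a partition does not fit in memory
logs.count()                                  // first action materializes the cache
logs.filter(_.contains("Spark")).count()      // reuses the cached partitions
logs.unpersist()                              // manual removal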
Serialization is critical for network performance and memory use:
Java (default): slow but flexible
Kryo: fast but not all data types are supported
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses( Array(classOf[MyClass1], classOf[MyClass2]) )
Tasks must be serializable (typically < 20 kB on master)
Variables and methods in local vs cluster mode behave differently due to closures and serialization
classes must be serializable (i.e. extend Serializable)
methods can be made functions (i.e. transferred from classes to objects)
hygienic closures can be created with shim functions
classes that depend on non-serializable classes can mark those fields with the @transient annotation
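One common fix, related to the hygienic-closure point above: copy the needed field into a local val so the closure captures only that value instead of the whole (non-serializable) enclosing instance. A hedged sketch; the class and field names are illustrative:

import org.apache.spark.rdd.RDD

class WordFilter(val keyword: String) {        // the class itself is not Serializable
  def matching(rdd: RDD[String]): RDD[String] = {
    val kw = keyword                           // copy the field into a local val
    rdd.filter(line => line.contains(kw))      // the closure captures only kw, not `this`
  }
}
// usage: new WordFilter("Spark").matching(sc.textFile("README.md")).count()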
http://spark.apache.org/docs/latest/quick-start.html
https://databaseline.wordpress.com/2016/04/25/shell-scripts-to-ease-spark-application-development/
https://databaseline.wordpress.com/2016/04/01/a-quickie-on-spark-actions-laziness-and-caching/
http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-using-apache-spark-and-sql/
Spark on Mac
Option #1 Download Spark with Hadoop support
http://spark.apache.org/downloads.html
cd ~/Downloads/spark-2.1.0-bin-hadoop2.7
./bin/spark-shell
Option #2: use brew. Disadvantage: it does not include the examples and MLlib
brew search spark
apache-spark
brew install apache-spark // version 2.1.0
ls /usr/local/Cellar/apache-spark/2.1.0/bin
find-spark-home pyspark spark-beeline spark-shell spark-submit
load-spark-env.sh run-example spark-class spark-sql sparkR
ls -l /usr/local/bin | grep spark
lrwxr-xr-x 1 39 Jan 31 14:01 sparkR -> ../Cellar/apache-spark/2.1.0/bin/sparkR
lrwxr-xr-x 1 45 Jan 31 14:01 spark-submit -> ../Cellar/apache-spark/2.1.0/bin/spark-submit
lrwxr-xr-x 1 42 Jan 31 14:01 spark-sql -> ../Cellar/apache-spark/2.1.0/bin/spark-sql
lrwxr-xr-x 1 44 Jan 31 14:01 spark-shell -> ../Cellar/apache-spark/2.1.0/bin/spark-shell
lrwxr-xr-x 1 44 Jan 31 14:01 spark-class -> ../Cellar/apache-spark/2.1.0/bin/spark-class
lrwxr-xr-x 1 46 Jan 31 14:01 spark-beeline -> ../Cellar/apache-spark/2.1.0/bin/spark-beeline
lrwxr-xr-x 1 44 Jan 31 14:01 run-example -> ../Cellar/apache-spark/2.1.0/bin/run-example
lrwxr-xr-x 1 40 Jan 31 14:01 pyspark -> ../Cellar/apache-spark/2.1.0/bin/pyspark
lrwxr-xr-x 1 50 Jan 31 14:01 load-spark-env.sh -> ../Cellar/apache-spark/2.1.0/bin/load-spark-env.sh
lrwxr-xr-x 1 48 Jan 31 14:01 find-spark-home -> ../Cellar/apache-spark/2.1.0/bin/find-spark-home
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
export PYTHONPATH=/usr/local/Cellar/apache-spark/2.1.0/libexec/python/:$PYTHONPATH
http://datafireball.com/2015/04/17/running-spark-locally-on-ios/
How to set $JAVA_HOME environment variable on Mac OSX.
http://include.aorcsik.com/2014/09/19/installing-java-on-mac-os-10-10-yosemite-beta/
Set the JAVA_HOME variable to the output of /usr/libexec/java_home: just export JAVA_HOME in ~/.bash_profile or ~/.profile.
$ vim .bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
$ source .bash_profile
$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
Why /usr/libexec/java_home?
This java_home utility returns the Java version specified in Java Preferences for the current user. For example:
/usr/libexec/java_home -V
Matching Java Virtual Machines (3):
    1.7.0_05, x86_64:         "Java SE 7"    /Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
    1.6.0_41-b02-445, x86_64: "Java SE 6"    /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_41-b02-445, i386:   "Java SE 6"    /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
HADOOP on Mac
http://amodernstory.com/2014/09/23/installing-hadoop-on-mac-osx-yosemite/
brew list
apache-spark gdbm htop maven mysql openssl readline scala tig
cctools git makedepend mongodb nvm pkg-config sbt sqlite xz
$ ls /usr/local/Cellar
apache-spark gdbm htop maven mysql openssl readline scala tig
cctools git makedepend mongodb nvm pkg-config sbt sqlite xz
spark-shell
Spark context Web UI available at http://17.114.22.50:4040
Spark context available as 'sc' (master = local[*], app id = local-1485900389662).
Spark session available as 'spark'.
scala>
pyspark
https://hadoopist.wordpress.com/category/apache-spark/page/2/?iframe=true&preview=true%2Ffeed%2F
[~/Downloads/SPARK/spark-2.0.0-bin-hadoop2.7/examples/src/main/scala/org/apache/spark/examples]
IPython+Spark
http://amodernstory.com/2015/03/05/installing-and-running-spark-with-python-notebook-on-mac/
https://www.codementor.io/spark/tutorial/spark-python-rdd-basics#/
http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
https://github.com/jadianes/spark-py-notebooks
http://blog.prabeeshk.com/blog/2015/04/07/self-contained-pyspark-application/
SPARK+CASSANDRA
http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/
http://rustyrazorblade.com/2015/08/migrating-from-mysql-to-cassandra-using-spark/
http://rustyrazorblade.com/2015/07/cassandra-pyspark-dataframes-revisted/
Installing Spark from sources
$ brew install maven
$ tar -xvzf spark-1.4.1.tgz
$ cd spark-1.4.1
$ mvn -DskipTests clean package    (takes > 15 min to produce the following output)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 31.458 s]
[INFO] Spark Launcher Project ............................. SUCCESS [ 28.369 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 8.593 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 4.597 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 3.599 s]
[INFO] Spark Project Core ................................. SUCCESS [02:41 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 12.435 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 36.629 s]
[INFO] Spark Project Streaming ............................ SUCCESS [01:15 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:05 min]
[INFO] Spark Project SQL .................................. SUCCESS [01:32 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:56 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 9.808 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:49 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 21.156 s]
[INFO] Spark Project Assembly ............................. SUCCESS [01:02 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [ 12.396 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 16.039 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 22.974 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 22.455 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 11.765 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [ 21.574 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:34 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 19.052 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:42 min
[INFO] Finished at: 2015-08-04T15:33:27-07:00
[INFO] Final Memory: 101M/2126M
[INFO] ------------------------------------------------------------------------
mlubinsky$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/04 15:41:34 INFO SecurityManager: Changing view acls to: mlubinsky
15/08/04 15:41:34 INFO SecurityManager: Changing modify acls to: mlubinsky
15/08/04 15:41:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mlubinsky); users with modify permissions: Set(mlubinsky)
15/08/04 15:41:35 INFO HttpServer: Starting HTTP Server
15/08/04 15:41:35 INFO Utils: Successfully started service 'HTTP class server' on port 55888.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.1
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/04 15:41:37 INFO SparkContext: Running Spark version 1.4.1
15/08/04 15:41:37 INFO SecurityManager: Changing view acls to: mlubinsky
15/08/04 15:41:37 INFO SecurityManager: Changing modify acls to: mlubinsky
15/08/04 15:41:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mlubinsky); users with modify permissions: Set(mlubinsky)
15/08/04 15:41:38 INFO Slf4jLogger: Slf4jLogger started
15/08/04 15:41:38 INFO Remoting: Starting remoting
15/08/04 15:41:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@17.247.157.92:55889]
15/08/04 15:41:38 INFO Utils: Successfully started service 'sparkDriver' on port 55889.
15/08/04 15:41:38 INFO SparkEnv: Registering MapOutputTracker
15/08/04 15:41:38 INFO SparkEnv: Registering BlockManagerMaster
15/08/04 15:41:38 INFO DiskBlockManager: Created local directory at /private/var/folders/fw/pvmhqrrn1flgsq36wkd3s7ww0000gn/T/spark-0368f07e-3826-4b15-8872-03554f95d58a/blockmgr-f502a6e1-db07-4cf5-98cc-8f967b0ff926
15/08/04 15:41:38 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/08/04 15:41:38 INFO HttpFileServer: HTTP File server directory is /private/var/folders/fw/pvmhqrrn1flgsq36wkd3s7ww0000gn/T/spark-0368f07e-3826-4b15-8872-03554f95d58a/httpd-878c8663-eb22-4787-bcf1-b472b5c014a1
15/08/04 15:41:38 INFO HttpServer: Starting HTTP Server
15/08/04 15:41:38 INFO Utils: Successfully started service 'HTTP file server' on port 55890.
15/08/04 15:41:38 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/04 15:41:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/04 15:41:38 INFO SparkUI: Started SparkUI at http://17.247.157.92:4040
15/08/04 15:41:38 INFO Executor: Starting executor ID driver on host localhost
15/08/04 15:41:38 INFO Executor: Using REPL class URI: http://17.247.157.92:55888
15/08/04 15:41:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55891.
15/08/04 15:41:38 INFO NettyBlockTransferService: Server created on 55891
15/08/04 15:41:38 INFO BlockManagerMaster: Trying to register BlockManager
15/08/04 15:41:38 INFO BlockManagerMasterEndpoint: Registering block manager localhost:55891 with 265.1 MB RAM, BlockManagerId(driver, localhost, 55891)
15/08/04 15:41:38 INFO BlockManagerMaster: Registered BlockManager
15/08/04 15:41:38 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/08/04 15:41:39 INFO SparkILoop: Created sql context..
SQL context available as sqlContext.
scala>
On my computer I see two identical folders:
./Documents/Downloads/SPARK/spark-1.4.1
./Downloads/SPARK/spark-1.4.1
brew info scala
scala: stable 2.11.5 (bottled)
http://www.scala-lang.org/
Not installed
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/scala.rb
==> Options
--with-docs
Also install library documentation
--with-src
Also install sources for IDE support
==> Caveats
To use with IntelliJ, set the Scala home to:
/usr/local/opt/scala/idea
http://thomaswdinsmore.com/2015/07/13/big-analytics-roundup-july-13-2015/
http://www.ibmbigdatahub.com/technology/hadoop-and-spark
SPARK on windows
http://nishutayaltech.blogspot.in/2015/04/how-to-run-apache-spark-on-windows7-in.html
http://ysinjab.com/2015/03/28/hello-spark/
SPARK
http://habrahabr.ru/post/274705/
https://www.supergloo.com/spark-tutorial/spark-tutorials-scala/
https://habrahabr.ru/company/retailrocket/blog/302828/
https://news.ycombinator.com/item?id=11126188 SPARK 2
http://fullstackml.com/2016/01/19/how-to-check-hypotheses-with-bootstrap-and-apache-spark/
http://www.teachingmachines.io/blog/2016/1/1/6ywzqtxq79ajny7xkf8jkd13k4t7bu
https://github.com/jpzk/cookiecutter-scala-spark
https://spark.apache.org/docs/latest/
https://dl.dropboxusercontent.com/u/8252984/spark.html#1 Installing Spark
http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
https://github.com/cloudera/spark-timeseries
https://medium.com/@rickynguyen/getting-started-with-spark-day-4-3ad5e5d2d5fd#.41t3wu2ar
https://medium.com/@rickynguyen/getting-started-with-spark-day-5-36b62a6d13bf#.1lliya1hf
http://habrahabr.ru/company/bitrix/blog/273279/
http://www.allitebooks.com/?s=spark BOOKS
https://www.youtube.com/watch?v=y8WSPyeF73g&feature=share
http://www.it-ebooks.org/book/oreilly/advanced_analytics_with_spark
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
http://habrahabr.ru/company/targetix/blog/266009/ SPARK
http://baweaver.com/blog/2015/06/21/intro-to-spark/
https://www.quora.com/How-do-I-learn-Apache-Spark
https://www.youtube.com/watch?v=RgBzVHe2D84
http://www.insightdataengineering.com/blog/Working-With-Apache-Spark.html
val textFile = sc.textFile("README.md")
val wordCounts =
textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
----------
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
Understand what an RDD is.
Understand the difference between a driver, master and executor.
Understand the lazy execution model, i.e. the difference between transformations and actions, and when actual work is done.
Understand how data is partitioned across executors: how the number of partitions and the partitioning key affect a job's running time, what the right number of partitions is for various jobs, and how custom partitioning functions can be defined.
Learn to inspect RDDs from the UI to see the number of partitions and how much fits in memory vs. disk.
Learn about shuffles: what a shuffle is, why it is expensive, which operations can cause shuffles, and how they can be avoided (e.g. use reduceByKey instead of groupByKey if you can; see the sketch below).
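A hedged sketch of the groupByKey vs reduceByKey point: reduceByKey combines values on each partition before the shuffle, so far less data crosses the network (the file name is illustrative):

val pairs = sc.textFile("README.md").flatMap(_.split(" ")).map(word => (word, 1))

// groupByKey ships every (word, 1) pair across the network, then sums on the reduce side
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first, shuffling only partial sums
val fast = pairs.reduceByKey(_ + _)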
http://habrahabr.ru/company/mlclass/blog/250811/
In Spark, a DataFrame is a distributed collection of data organized into named columns (see: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html). Conceptually, DataFrames are similar to tables in a relational database, except that they are partitioned across multiple nodes in a Spark cluster.
It’s important to understand that Spark does not actually load the socialdata collection into memory at this point. We’re only setting up to perform some analysis on that data; the actual data isn’t loaded into Spark until it is needed to perform some calculation later in the job. This allows Spark to perform the necessary column and partition pruning operations to optimize data access into Solr.
Every DataFrame has a schema. You can use the printSchema() function to get information about the fields available for the tweets DataFrame:
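A small sketch of that schema inspection, using a generic JSON source rather than the Solr-backed socialdata collection from the article (the path and field names are illustrative):

val tweets = spark.read.json("tweets.json")    // schema is inferred from the file
tweets.printSchema()                           // lists the named, typed columns
tweets.select("author", "text").show(5)        // only the selected columns are needed, so Spark can prune the rest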
http://habrahabr.ru/post/271375/
https://www.reddit.com/r/apachespark
http://www.zdnet.com/article/hadoop-and-spark-a-tale-of-two-cities/
We should think of Spark as superseding MapReduce and probably a couple of other projects (Pig; "classic" Mahout 0.9, superseded by MLlib; and Giraph, by GraphX).
Spark has no storage system of its own. Right off the bat, this means Spark has to be paired with something else to do any work. That's usually Hadoop-related storage like HDFS, or maybe HBase, but could also be things like Amazon S3. Spark has no security model of its own, so needs to lean on things like YARN / Hadoop mechanisms if that matters. There's no NoSQL-like engine in Spark like HBase / Cassandra, or interactive SQL like Drill /
Spark has a Map and a Reduce function like MapReduce, but it adds others like Filter, Join and Group-by, so it’s easier to develop for Spark. In fact, Spark provides for lots of instructions that are a higher level of abstraction than what MapReduce provided. You can think more about how you want the data processed, rather than about how to cajole MapReduce into doing what you want. This might not seem that important, until you look at this: MapReduce-Wordcount. This is the code to calculate a count of words in a text file, done in MapReduce (not Spark). It’s over 100 lines of code, and fairly unintuitive. The equivalent in Spark is found on this page: Spark Examples (look for the Word Count example). It’s four lines versus over 100.
Applying such an operation to an RDD produces a new RDD. As a rule, these are operations that transform the elements of the dataset in some way. Here is an incomplete list of the most common transformations, each of which returns a new dataset (RDD):
.map(function) - applies the function to every element of the dataset
.filter(function) - returns all elements of the dataset for which the function returned true
.distinct([numTasks]) - returns a dataset containing the unique elements of the source dataset
Set operations:
.union(otherDataset)
.intersection(otherDataset)
.cartesian(otherDataset) - the new dataset contains all possible pairs (A, B), where the first element belongs to the source dataset and the second to the argument dataset
Actions are used when the result needs to be materialized - typically to save data to disk or print part of it to the console. Here is a list of the most common actions that can be applied to an RDD:
.saveAsTextFile(path) - saves the data to a text file (in HDFS, on the local machine, or in any other supported file system; the full list is in the documentation)
.collect() - returns the elements of the dataset as an array. This is typically used when the dataset is already small (various filters and transformations have been applied) and you need to visualize it or analyze it further, for example with the Pandas package
.take(n) - returns the first n elements of the dataset as an array
.count() - returns the number of elements in the dataset
.reduce(function) - a familiar operation for anyone who knows MapReduce. The way this operation works implies that the function (which takes two arguments and returns one value) must be commutative and associative
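A quick Scala equivalent of a few of these transformations and actions (the sample data is illustrative):

val nums = sc.parallelize(List(5, 7, 1, 12, 10, 25, 7))
val doubled = nums.map(_ * 2)            // transformation
val big = doubled.filter(_ > 10)         // transformation
val uniq = big.distinct()                // transformation
println(uniq.count())                    // action
println(uniq.reduce(_ + _))              // action: + is commutative and associative
uniq.take(3).foreach(println)            // action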
These are the basics you need to know when working with the tool. Now let's get a bit of practice and show how to load data into Spark and do simple computations on it.
When starting Spark, the first thing to do is create a SparkContext (put simply, the object responsible for the lower-level operations against the cluster; see the documentation for details). When you launch spark-shell it is created automatically and is available right away as the sc object.
Data can be loaded into Spark in two ways:
a) Directly from the local program with the .parallelize(data) function:
localData = [5, 7, 1, 12, 10, 25]
ourFirstRDD = sc.parallelize(localData)
b) From supported storage systems (e.g. HDFS) with the .textFile(path) function:
ourSecondRDD = sc.textFile("path to some data on the cluster")
At this point it is worth noting one peculiarity of data storage in Spark and, at the same time, its most useful function, .cache() (to which Spark partly owes its popularity): it lets you cache data in RAM (subject to availability). This allows iterative computations to run in memory and thus avoid the I/O overhead. This is especially important for machine learning and graph computations, since most of those algorithms are iterative, from gradient methods to algorithms such as PageRank.
Once the data has been loaded into an RDD, we can apply to it the transformations and actions described above. For example:
Look at the first few elements:
for item in ourRDD.top(10):
    print item
Or load these elements straight into Pandas and work with a DataFrame:
import pandas as pd
pd.DataFrame(ourRDD.map(lambda x: x.split(";")[:]).top(10))
For example, let's compute the maximum and minimum elements of our dataset. As you might guess, this can be done with the .reduce() function:
localData = [5, 7, 1, 12, 10, 25]
ourRDD = sc.parallelize(localData)
print ourRDD.reduce(max)
print ourRDD.reduce(min)
So, we have covered the basic concepts needed to work with the tool. We did not cover SQL or working with <key, value> pairs (which is easy: first transform the RDD, e.g. with a map, to extract the key, and then use the built-in functions such as sortByKey, countByKey, join, and others).
Spark
http://www.infoq.com/articles/apache-spark-introduction
http://www.infoq.com/articles/apache-spark-sql
https://www.youtube.com/watch?v=65aV15uDKgA
https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
http://www.reddit.com/r/apachespark
http://baweaver.com/blog/2015/06/21/intro-to-spark/
http://sparkhub.databricks.com/
http://www.insightdataengineering.com/blog/Working-With-Apache-Spark.html
https://www.youtube.com/watch?v=FjhRkfAuU7I
http://www.infoq.com/presentations/apache-spark-future
http://www.infoq.com/presentations/spark-cassandra
http://en.wikipedia.org/wiki/Spark_(cluster_computing_framework)
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://horicky.blogspot.ru/2015/02/big-data-processing-in-spark.html
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
http://www.planetcassandra.org/blog/the-new-analytics-toolbox-with-apache-spark-going-beyond-hadoop/
Spark is only a computing engine, while Hadoop is a complete stack of storage, cluster management and computing tools, and Spark can run well on Hadoop. However, we do see many deployments that are not on Hadoop, including deployments on NoSQL stores (e.g. Cassandra) and deployments directly against cloud storage (e.g. Amazon S3, Databricks Cloud). In this sense Spark is reaching a broader audience than Hadoop users.
Most of the development activity in Apache Spark is now in the built-in libraries, including Spark SQL, Spark Streaming, MLlib and GraphX. Out of these, the most popular are Spark Streaming and Spark SQL
It is true that Spark uses a micro-batch execution model, but I don't think this is a problem in practice, because the batches can be as short as 0.5 seconds. In most streaming big-data applications, the latency to get the data in is much higher (e.g. sensors that send data every 10 seconds), or the analysis you want runs over a longer window (e.g. tracking events over the past 10 minutes), so it doesn't matter that processing takes a short amount of time.
The benefit of Spark's micro-batch model is that you get full fault-tolerance and "exactly-once" processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink and Storm don't provide this, requiring application developers to worry about missing data or to treat the streaming results as potentially incorrect. Again, that can be okay for some applications (e.g. just basic monitoring), but it makes it hard to write more complex applications and reason about their results
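A minimal sketch of the micro-batch model with the classic DStream API (the 1-second batch interval, host, and port are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))    // each micro-batch covers 1 second of data
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                    // per-batch word counts
ssc.start()
ssc.awaitTermination()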