Spark SQL 2.x Fundamentals and Cookbook

About the book

Apache Spark is one of the fastest growing technologies in the Big Data computing world. It supports multiple programming languages, including Java, Scala, Python and R. As a result, many existing and new frameworks and platforms have started to integrate with Spark, e.g. Hadoop, Cassandra, EMR etc. While creating Spark certification material, the HadoopExam technical team found that no suitable material or book was available for Spark SQL (version 2.x) that covers both the concepts and the use of its various features, which made creating the material difficult. They therefore decided to write a full-length book on Spark SQL, and this book is the outcome. In it, the technical team tries to cover both the fundamental concepts of the Spark SQL engine and its programming features. The book has 15 chapters and approximately 35 exercises, which together cover the programming aspects of Spark SQL. All the exercises in this book are written in Scala. However, the concepts remain the same even if you use a different programming language.


This is the second full-length book from and we welcome your feedback so that we can improve the quality of the book. Please send your feedback on or


The entire content of this book is owned by and using it or publishing it anywhere else, whether digitally on the web or in print for distribution, requires prior written permission from You may use the code or exercises in for your software development or in your software product (commercial as well as open source) without taking prior permission.

Source code and data: You can download the source code and data used in the exercises. The download link is given in the book.

About SparkSQL

Spark SQL is a module built on top of Spark Core. It introduces data abstractions called DataFrame and Dataset, which provide support for structured and semi-structured data. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames/Datasets in Scala, Java, or Python (Python has no Dataset API). It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server. Although DataFrames lack the compile-time type checking afforded by RDDs, as of Spark 2.0 the strongly typed Dataset is fully supported by Spark SQL as well. You will find more detail about Spark SQL in this book.
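To make the three access styles mentioned above concrete, here is a minimal sketch in Scala that runs the same query through the DataFrame DSL, through a SQL statement, and through the typed Dataset API. It is an illustration only, not an exercise from the book: the Person case class, the SparkSqlSketch object, its helper functions, the sample rows and the local master setting are all assumptions made for this example, and it assumes a Spark 2.x spark-sql dependency on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object SparkSqlSketch {

  // Hypothetical sample data; a real job would read from a file or a table.
  def samplePeople(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Ann", 34), Person("Bob", 17), Person("Cid", 52)).toDS()
  }

  // DataFrame DSL: columns are referenced by name, so a misspelled
  // column is only detected when the query is analyzed at runtime.
  def adultNamesDsl(df: DataFrame, minAge: Int): Seq[String] =
    df.filter(col("age") >= minAge).select("name")
      .collect().map(_.getString(0)).toSeq.sorted

  // SQL: register the DataFrame as a temporary view and query it with SQL text.
  def adultNamesSql(df: DataFrame, minAge: Int): Seq[String] = {
    df.createOrReplaceTempView("people")
    df.sparkSession.sql(s"SELECT name FROM people WHERE age >= $minAge")
      .collect().map(_.getString(0)).toSeq.sorted
  }

  // Typed Dataset: the lambda works on Person objects, so a misspelled
  // field is rejected by the Scala compiler.
  def adultNamesTyped(people: Dataset[Person], minAge: Int): Seq[String] =
    people.filter(_.age >= minAge).collect().map(_.name).toSeq.sorted

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on a single machine.
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()

    val people = samplePeople(spark)
    println(adultNamesDsl(people.toDF(), 18).mkString(", "))
    println(adultNamesSql(people.toDF(), 18).mkString(", "))
    println(adultNamesTyped(people, 18).mkString(", "))

    spark.stop()
  }
}
```

All three helpers return the same names. The difference lies in when mistakes surface: the DSL and SQL versions resolve column names at runtime, while the typed Dataset version is checked when the program is compiled.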