Spark SQL 2.x Fundamentals and Cookbook - Table of Contents

Chapter 1: Spark SQL Introduction

Introduction
Sample program using both SQL Query and API

Chapter 2: Catalyst Optimizer

Catalyst Optimizer Introduction
Objectives of Catalyst Optimizer
Catalyst Library
Internal Representation
Catalyst Tree
Four phases of Catalyst optimization

Chapter 3: Project Tungsten

Introduction to Project Tungsten
Code generation
CPU Bound operations

Chapter 4: Setting up Spark Environment

Compare free vs Paid Version
Part-2: Installing Ubuntu Linux on VMware
Part-3: Setting up the Spark 2.0 environment on Ubuntu

Chapter 5: SparkSQL Schema

Schema Inference
Explicitly assigning schema
Schema Inference using reflection
Explicitly creating schema using StructType and StructFields

Chapter 6: SparkSQL abstractions & Other Objects

About SparkSession
Submitting Spark applications
SparkConf object
Providing custom rules and optimization techniques
SparkSQL Row (Catalyst Row) object
Resilient Distributed Dataset
DataFrame
Dataset
DataFrame to Dataset conversion
Dataset and Type-safety
Dataset and Catalyst optimizer
Dataset and compile time type safety
Working with Dataset
Transient
Spark Case classes
Dataset vs RDD operations
Converting an RDD to Dataset
Local Datasets
Dataset and Project Tungsten
Dataset and Encoder

Chapter 7: DataFrameReader and DataFrameWriter

Assigning a schema while reading the data
Handling corrupted records in CSV/JSON files
Reading a text file as a whole
Setting the time zone for the data
Reading data from a JDBC data source
Filtering data at the source
Reading SparkSQL table as DataFrame
DataFrameWriter
Partitioning and bucketing
Bucketing
Data Compressions
Columns in Dataset

Chapter 8: SparkSQL and Hive Support

Spark SQL and Hive Query Support
Hive Metastore
Hive Support in SparkSQL
Hive Query support using SparkSQL

Chapter 9: SparkSQL and JSON

Read JSON data in Spark
Example of loading multiple JSON files
Explicitly assigning schema to loaded JSON Data
Loading JSON data and using SQL queries
Inferring the schema from data
SparkSQL using JSON data full example

Chapter 10: SparkSQL and Encoders

Implicit Objects
Encoders (Serialization and De-serialization)
Creating Encoders
Hands-on Exercise for SparkSQL Encoders

Chapter 11: Caching and Checkpointing

Dataset and Caching
SparkSQL and Caching
Checkpointing in SparkSQL
Types of Checkpoints
Caching (disk only) vs checkpointing
Performance Improvements
Other important points about checkpointing

Chapter 12: Dataset and Joins

Joins Introduction
Broadcast Join
SparkSQL and Hint

Chapter 13: RelationalGroupedDataset

RelationalGroupedDataset
Multi-dimensional aggregations
Dataset Aggregation API
Hands-on Exercises for Multi-Dimensional Operator

Chapter 14: SparkSQL Functions

Spark SQL Functions
Standard or User Defined Functions
UDF: User Defined Functions
Exercise for User Defined Function and User Defined Aggregate Functions
Aggregate functions
Collection functions
About the explode function
Date and Time Functions
Window Aggregate Functions
Non-aggregate functions
Sorting functions
String functions
More Window Function Examples
Examples of rank and dense_rank functions (Window function)
NTILE (Window) function
Cumulative Distribution

Chapter 15: Dataset Actions and Transformations

Dataset Partitioning
About the coalesce operator of Dataset
Dataset typed transformations
Actions on the Dataset

Chapter 16: Spark Certifications

Databricks Certifications
How to prepare for Databricks Spark Certifications
Cloudera Hadoop and Spark Developer Certifications
Hortonworks Spark Certification preparation material
MapR Spark Certifications
