Chapter-1: Spark SQL Introduction
- Introduction
- Sample program using both SQL Query and API
Chapter-2: Catalyst Optimizer
- Catalyst optimizer introduction
- Objectives of the Catalyst optimizer
- Catalyst Library
- Internal Representation
- Catalyst Tree
- Four phases of Catalyst optimization
- Analysis phase
- Logical optimization
- Physical planning
- Code Generation
Chapter 3: Project Tungsten
- Introduction to Project Tungsten
- Code generation
- CPU Bound operations
Chapter-4: Setting up Spark Environment
- Comparing free vs. paid versions
- Part-2: Installing Ubuntu Linux on VMware
- Part-3: Setting up the Spark 2.0 environment on Ubuntu
Chapter-5: SparkSQL Schema
- Schema Inference
- Explicitly assigning schema
- Schema Inference using reflection
- Explicitly creating schema using StructType and StructFields
Chapter 6: SparkSQL abstractions & Other Objects
- About SparkSession
- Submitting Spark applications
- SparkConf object
- Providing custom rules and optimization techniques
- SparkSQL Row (Catalyst Row) object
- Resilient Distributed Dataset
- DataFrame
- Dataset
- DataFrame to Dataset conversion
- Dataset and Type-safety
- Dataset and Catalyst optimizer
- Dataset and compile time type safety
- Working with Dataset
- Transient
- Spark Case classes
- Dataset vs RDD operations
- Converting an RDD to Dataset
- Local Datasets
- Dataset and Project Tungsten
- Dataset and Encoder
Chapter 7: DataFrameReader and DataFrameWriter
- Assigning a schema while reading the data
- Handling corrupted records in CSV/JSON files
- Reading a text file as a whole
- Setting the time zone for the data
- Reading Data from JDBC data source
- Filtering data at the source
- Reading SparkSQL table as DataFrame
- DataFrameWriter
- Partitioning and bucketing
- Bucketing
- Data Compressions
- Columns in Dataset
Chapter 8: SparkSQL and Hive Support
- Spark SQL and Hive Query Support
- Hive Metastore
- Hive Support in SparkSQL
- Hive Query support using SparkSQL
Chapter 9: SparkSQL and JSON
- Read JSON data in Spark
- Example of loading multiple JSON files
- Explicitly assigning schema to loaded JSON Data
- Loading JSON data and using SQL queries
- Infer the schema from Data
- SparkSQL using JSON data full example
Chapter 10: SparkSQL and Encoders
- Implicit Objects
- Encoders (Serialization and De-serialization)
- Creating Encoders
- Hands-on Exercise for SparkSQL Encoders
Chapter 11: Caching and Checkpointing
- Dataset and Caching
- SparkSQL and Caching
- Checkpointing in SparkSQL
- Types of Checkpoints
- Caching (disk only) vs. checkpointing
- Performance Improvements
- Other important points about checkpointing
Chapter-12: Dataset and Joins
- Joins Introduction
- Broadcast Join
- SparkSQL and Hint
Chapter-13: RelationalGroupedDataset
- RelationalGroupedDataset
- Multi-dimensional aggregations
- Dataset Aggregation API
- Hands-on Exercises for Multi-Dimensional Operator
Chapter-14: SparkSQL Functions
- Spark SQL Functions
- Standard or User Defined Functions
- UDF: User Defined Functions
- Exercise for User Defined Function and User Defined Aggregate Functions
- Aggregate functions
- Collection functions
- About the explode function
- Date and Time Functions
- Window Aggregate Functions
- Non-aggregate functions
- Sorting functions
- String functions
- More Window Function Examples
- Examples of rank and dense_rank functions (Window function)
- NTILE (Window) function
- Cumulative Distribution
Chapter-15: Dataset Actions and Transformations
- Dataset Partitioning
- About the coalesce operator of Dataset
- Dataset typed transformations
- Actions on the Dataset
Chapter-16: Spark Certifications
- Databricks Certifications
- How to prepare for Databricks Spark Certifications
- Cloudera Hadoop and Spark Developer Certifications
- Hortonworks Spark Certification preparation material
- MapR Spark Certifications