PySpark Professional Training : Including Hands-On Sessions
Syllabus
Module-1: Apache Spark Introduction
Spark v/s MapReduce
Why use Hadoop?
HDFS and YARN Intro
Module-2 : Spark and Hadoop Performance Difference
Introduction to Iterative algorithm
Reasons behind Spark's High Performance
RDD : Native Spark API intro
Module-3 : Spark Architecture
Cluster : Group of Computers
Spark Application Components
Driver
Executors
Cluster Manager
SparkSession
Transformation
Lazy Evaluation
Narrow Transformation
Wide Transformation
Module-4: Introduction to the Apache Spark DataFrame
DataFrame
DataFrame v/s Dataset
Sample API for DataFrame
Language Independent Catalyst Optimizer
Module-5A : Install VMware Workstation Player
Module-5B : Install Ubuntu Linux in VMware Player
Install Ubuntu Image
Install SSH server
Install PuTTY and connect to Linux OS
Module-5C : Install Apache Spark
Install Apache Spark
Start spark-shell
Start pyspark
Module-6: Understanding the Apache Spark RDD
About RDD
RDD v/s DataFrame v/s Dataset
RDD and Custom Partitioner concept
Module-7: Introduction to Apache Spark SQL's Catalyst Optimizer
What is the Catalyst optimizer?
Concepts of Tree and Rules
Various Phases of the Catalyst optimizer
Analysis
Logical optimization
Physical planning
Code Generation
Scala Features concepts
Predicate Pushdown
Constant Folding
Physical operator
Projection Pruning
Module-8 : Apache Spark DataFrame & Dataset API
Directed Acyclic Graph
DataFrame v/s Dataset
Explicit Schema for DataFrame/Dataset
Columns in DataFrame
Execution Path and Execution steps
Runtime Optimizations
Module-9: Working with Structured API
Schema, StructType and StructFields
Manual Schema Assignment
Creating and selecting columns
Module-10 : Working with Structured API
Creating Rows
expr and selectExpr
Basics of Literals
Hands on Exercise
Unique Rows
Explicitly Assigning a Schema
Working with columns and Rows
Sorting Data
Union of Rows
Limit
Repartition and Coalesce
Collecting Rows on the Driver
Module-11 : Working with Spark DataTypes and User Defined Function
Spark has its own DataTypes
Boolean Expression (True/False)
Defining Filters Serially
Working with Numerical Data
Module-12 : Working with Spark DataTypes and User Defined Function
Working with Character Data
Using Regular Expression
Dates and Timestamp
Module-13 : Working with Spark DataTypes and User Defined Function
Struct Data Type
Array Data Types
Explode Example
Map Types
User Defined Function
Module-14 : DataFrame Grouping and Aggregations
DataFrame and GroupBy operation
Understanding RelationalGroupedDataset
Basic Aggregation Operation
Working with Complex DataTypes
Module-15 : DataFrame Grouping and Aggregations
Understanding Window Functions
Grouping Sets
Pivot
Cubes
Module-16: PySpark and Joins
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join
Module-17: PySpark and Joins
Left Semi Join
Left Anti Join
Shuffle Join
Broadcast Join
Module-18A : Understand RC and ORC File Types
Module-18B: Read and Write Data + File Formats
Understanding the DataFrameReader
Various Data Read Modes
permissive, dropMalformed, failFast
Working with the DataFrameWriter
Save Modes
append, overwrite, ignore, errorIfExists
Hands-On Exercises
String, Date and Timestamp
Working with Field Separators
Generating and working with the file formats (Read and Write as well)
ORC File
Parquet File
JSON
CSV
Text
Module-19: Spark App on the Cluster
Spark Driver Process
Spark Executors
Cluster Manager
Various Execution Modes
Cluster Mode
Client Mode
Local Mode
Module-20: Spark App on the Cluster
Submit application and its Flow
Spark App Understanding in Depth
Application, Job, Stage and Task
Spark Shuffle
Tasks
Pipeline
Module-21 : Spark Advanced : Data Partitioning
What is Partitioning and why?
Data Partitioning example using Join (Hash Partitioning)
Understanding Partitioning through an Example: Getting Recommendations for a Customer
Understanding Partitioning Code using Spark-Scala
Operations that create a Partitioned RDD
Operations that benefit from Partitioning
Operations that affect Partitioning