PySpark Professional Training: Including Hands-On Sessions

Syllabus

Module-1: Apache Spark Introduction

- Spark v/s MapReduce
- Why use Hadoop?
- HDFS and YARN Intro

Module-2 : Spark and Hadoop Performance Difference

- Introduction to Iterative Algorithms
- Multiple Reasons behind Spark's High Performance
- RDD : Native Spark API Intro

Module-3 : Spark Architecture

- Cluster : Group of Computers
- Spark Application Components
  - Driver
  - Executors
  - Cluster Manager
- SparkSession
- Transformation
  - Lazy Evaluation
  - Narrow Transformation
  - Wide Transformation

Module-4: Apache Spark DataFrame Introduction

- DataFrame
- DataFrame v/s Dataset
- Sample API for DataFrame
- Language Independent Catalyst Optimizer

Module-5A : Install VMware Workstation Player

Module-5B : Install Ubuntu Linux in VMware Player

- Install Ubuntu Image
- Install SSH server
- Install PuTTY and connect to Linux OS

Module-5C : Install Apache Spark

- Install Apache Spark
- Start spark-shell
- Start pyspark

Module-6: Apache Spark Understanding RDD

- About RDD
- RDD v/s DataFrame v/s Dataset
- RDD and Custom Partitioner concept

Module-7: Introduction to Apache Spark SQL's Catalyst Optimizer

- What is the Catalyst optimizer
- Concepts of Trees and Rules
- Various Phases of the Catalyst optimizer
  - Analysis
  - Logical optimization
  - Physical planning
  - Code Generation
- Scala Features concepts
  - Predicate Pushdown
  - Constant Folding
  - Physical operator
  - Project Pruning

Module-8 : Apache Spark DataFrame & Dataset API

- Directed Acyclic Graph
- DataFrame v/s Dataset
- Explicit Schema for DataFrame/Dataset
- Columns in DataFrame
- Execution Path and Execution steps
- Runtime Optimizations

Module-9: Working with Structured API

- Schema, StructType and StructFields
- Manual Schema Assignment
- Creating and selecting columns

Module-10 : Working with Structured API

- Creating Rows
- expr and selectExpr
- Basics of Literals
- Hands on Exercise
  - Unique Rows
  - Explicitly Assign Schema
  - Working with columns and Rows
  - Sorting Data
  - Union of Rows
  - Limit
  - Repartition and Coalesce
  - Collecting Rows on the Driver

Module-11 : Working with Spark DataTypes and User Defined Function

- Spark has its own DataTypes
- Boolean Expression (True/False)
- Serially Define the filter
- Working with Numerical Data

Module-12 : Working with Spark DataTypes and User Defined Function

- Working with Character Data
- Using Regular Expressions
- Dates and Timestamps

Module-13 : Working with Spark DataTypes and User Defined Function

- Struct Data Type
- Array Data Types
- Explode Example
- Map Types
- User Defined Function

Module-14 : DataFrame Grouping and Aggregations

- DataFrame and GroupBy operation
- Understanding RelationalGroupedDataset
- Basic Aggregation Operation
- Working with Complex DataTypes

Module-15 : DataFrame Grouping and Aggregations

- Understanding Window Functions
- Grouping Sets
  - Pivot
  - Cubes

Module-16: PySpark and Joins

- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join

Module-17: PySpark and Joins

- Left Semi Join
- Left Anti Join
- Shuffle Join
- Broadcast Join

Module-18A : Understand RC and ORC File Types

Module-18B: Read and Write Data + File Formats

- Understanding the DataFrameReader
- Various Data Read Modes
  - Permissive, DropMalformed, FailFast
- Working with the DataFrameWriter
- Save Modes
  - Append, Overwrite, Ignore, ErrorIfExists
- HandsOn Exercises
- String, Date and Timestamp
- Working with Field Separators
- Generating and working with the file formats (read and write)
  - ORC File
  - Parquet File
  - JSON
  - CSV
  - Text

Module-19: Spark App on the Cluster

- Spark Driver Process
- Spark Executors
- Cluster Manager
- Various Execution Modes
  - Cluster Mode
  - Client Mode
  - Local Mode

Module-20: Spark App on the Cluster

- Submit application and its Flow
- Understanding the Spark App in Depth
- Application, Job, Stage and Task
- Spark Shuffle
- Tasks
- Pipeline

Module-21 : Spark Advanced : Data Partitioning

- What is Partitioning and why use it?
- Data Partitioning example using Join (Hash Partitioning)
- Understanding Partitioning using an example: getting recommendations for a customer
- Understanding Partitioning code using Spark-Scala
- Operations that create a partitioned RDD
- Operations that benefit from partitioning
- Operations that affect partitioning
