PySpark, Scala & Shell Scripting
Course Title: Data Engineering with PySpark, Scala, and Shell Scripting
**Course Overview:**
This online course gives participants a comprehensive grounding in data engineering with PySpark, Scala, and shell scripting. It covers essential concepts, tools, and techniques for efficient data processing and analysis in distributed computing environments.
**Week 1: Introduction to Data Engineering**
- Overview of data engineering
- Importance of data engineering in the data lifecycle
- Key tools and technologies in data engineering
**Week 2: Basics of PySpark**
- Introduction to PySpark
- Spark architecture and components
- RDDs (Resilient Distributed Datasets) and DataFrames
**Week 3: PySpark Transformations and Actions**
- Transformations and Actions in PySpark
- Narrow vs. wide transformations
- Caching and persistence in PySpark
**Week 4: PySpark SQL and DataFrames**
- Introduction to Spark SQL
- Working with DataFrames in PySpark
- SQL queries in PySpark
**Week 5: Introduction to Scala for Data Engineering**
- Basics of Scala programming language
- Functional programming in Scala
- Scala for distributed computing
**Week 6: Scala for Spark Programming**
- Scala syntax and features for Spark
- Creating Spark applications in Scala
- Hands-on exercises with Scala and Spark
**Week 7: PySpark Machine Learning Library (MLlib)**
- Overview of MLlib in PySpark
- Building machine learning pipelines
- Model training and evaluation with PySpark MLlib
**Week 8: Shell Scripting for Data Engineering**
- Introduction to Shell scripting
- Bash scripting fundamentals
- Practical examples of using Shell scripts for data engineering tasks
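A small example of the kind of task this week covers — counting records per category in a CSV with standard Unix tools (the file name, columns, and data are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail   # fail fast on errors: a common pipeline best practice

# Hypothetical input of the form: id,status,bytes
input=events.csv
printf 'id,status,bytes\n1,ok,120\n2,err,0\n3,ok,340\n' > "$input"

# Skip the header row, extract column 2, then count each distinct value.
tail -n +2 "$input" | cut -d, -f2 | sort | uniq -c | sort -rn
```

Chaining `tail`, `cut`, `sort`, and `uniq` like this is the shell analogue of a group-by aggregation.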
**Week 9: Data Engineering Best Practices**
- Code optimization techniques in PySpark and Scala
- Debugging and troubleshooting common issues
- Best practices for scalable and maintainable data engineering code
**Week 10: PySpark Streaming**
- Introduction to real-time data processing with PySpark Streaming
- Creating streaming applications
- Integration with external data sources and sinks
**Week 11: Advanced Topics in PySpark and Scala**
- Broadcast variables and accumulators in Spark
- Performance tuning and optimization
- Integration with cloud platforms (e.g., AWS, GCP, Azure)
**Week 12: Capstone Project**
- Apply knowledge gained throughout the course to design and implement a data engineering project
- Showcase and discuss projects in the final class session
**Prerequisites:**
- Basic understanding of data concepts and databases
- Familiarity with Python and programming concepts
- Access to a computing environment with PySpark and Scala installed
**Assessment:**
- Weekly quizzes
- Midterm project
- Final capstone project
**References:**
- "Learning Spark" by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
- "Programming Scala" by Dean Wampler and Alex Payne
- Online documentation and tutorials for PySpark and Scala
**Note:**
The syllabus is subject to adjustments based on the pace of the class and emerging developments in the field of data engineering with PySpark, Scala, and Shell scripting.