Postgraduate teaching:
Data Engineering for AI
Course Objectives
Intended Knowledge Outcomes
You will:
Learn fundamental notions of parallel data processing and scalability
Understand the challenges associated with processing different types of Big Data (batch, streaming, graph-structured)
Learn fundamental concepts in data analytics: Exploratory (EDA) and Predictive (Machine Learning) with case studies in different application domains
Learn to “stay on top” of cutting-edge algorithms and architectures for Scalable Data Engineering, by discovering and reading selected research literature and producing a critical analysis
Intended Skill Outcomes
Learn to use practical computation environments for Big Data: Spark (massively parallel data processing) on the Cloud, and analytics workflows, with applications to specific analysis goals in diverse application domains
Develop problem-solving skills that are specific to Big Data Analytics
About the Course
The aim of the module is to introduce students to the complex combination of data engineering technology and data science that makes it possible to extract valuable knowledge from “Big Data”. A number of technical challenges stem from the high volume, high diversity (heterogeneity of meaning and format), and variable quality of the data, and a distinction is made based on whether the data is stationary (resides in a data repository) or in motion (data streaming, as would be produced for instance by sensors).
The module emphasises the following aspects:
Distribution of data processing over a cluster of computing nodes hosted in a cloud environment, as a way to scale out computing resources as the size of the data to be processed increases. This includes current frameworks for massively parallel data processing, notably Spark, the most successful example of a cloud-based distributed programming platform, and possibly Dask.
Examples of algorithms that can be successfully parallelised and thus are able to take advantage of distributed data architectures
Models of computation that enable near-real-time analytics on data streams
Specialised data structures, specifically graphs. The module covers the basics of graph databases (Neo4J) as well as massively parallel graph algorithms, e.g., those implemented using the Pregel framework.
Examples of data science applications, including Machine Learning algorithms, that are enabled by Big Data technology.
Emphasis is also placed on the rapid pace of technology advances in this area, and cutting-edge further reading material is offered for in-depth learning and deep dives into specific topics.
Module Topics
Introduction to Data Science and Data Analytics. Scalability, efficiency of parallel processing.
Batch Big Data Processing (MapReduce)
Computing environments for Big Data Analytics and Machine Learning: Big Data platforms (Databricks / Cloudera), Spark
Data Stream processing: Overview of real time Event Processing and querying
Graph data processing: Example of algorithms for graph analytics, graph databases and query languages (GDBMS), massively parallel graph processing model (Pregel)
Detailed Lecture plan
Introduction to Data Science and Data Analytics. Scalability, efficiency of parallel processing.
The MapReduce paradigm
HDFS
Map Reduce. Example: matrix-vector multiplication and PageRank
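As an illustration only (not module code), matrix-vector multiplication maps naturally onto MapReduce: the map phase pairs each sparse matrix entry with the matching vector component, and the reduce phase sums the partial products by row. A minimal pure-Python sketch:

```python
from collections import defaultdict

def map_phase(matrix, vector):
    """Map: each matrix entry (i, j, m_ij) emits the pair (i, m_ij * v_j)."""
    for i, j, m_ij in matrix:
        yield i, m_ij * vector[j]

def reduce_phase(pairs):
    """Reduce: sum all partial products that share the same row key i."""
    sums = defaultdict(float)
    for i, partial in pairs:
        sums[i] += partial
    return dict(sums)

# Sparse matrix as (row, col, value) triples; dense vector as a list.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = [10.0, 20.0]
result = reduce_phase(map_phase(M, v))
# row 0: 1*10 + 2*20 = 50.0; row 1: 3*20 = 60.0
```

In a real MapReduce deployment the map outputs would be shuffled across nodes by key before the reduce phase; the same key-grouped summation is also the core of the PageRank iteration.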
Introduction to Spark in Python: Exploratory Data Analytics (EDA) and Predictive Data Analytics (PDA)
The Spark programming model: functional programming using RDD and DataFrames
Practical examples: Exploratory and Predictive data analytics using PySpark and Pandas
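The Spark programming model chains functional transformations (filter, map, reduce) over distributed collections. As a conceptual sketch only, using plain Python built-ins rather than a Spark cluster, the same pipeline shape on a hypothetical toy trips dataset:

```python
from functools import reduce

# Toy "trips" dataset: (distance_km, fare) tuples (illustrative values).
trips = [(2.0, 8.5), (0.0, 0.0), (5.5, 18.0), (1.2, 6.0)]

# Spark-style chain: drop zero-distance records, map each trip to its
# fare per km, then reduce to an average. In PySpark this would be a
# filter/map chain over a distributed RDD or DataFrame.
valid = list(filter(lambda t: t[0] > 0, trips))
per_km = list(map(lambda t: t[1] / t[0], valid))
avg = reduce(lambda a, b: a + b, per_km) / len(per_km)
```

The point of the analogy is that each transformation is a pure function, which is what lets Spark partition the collection and run the same code on every node.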
Data Stream processing: Overview of real time Event Processing and querying
Stream data processing: scalability problems
Introduction to Sampling and Filtering
Technology: Spark Streaming
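Sampling is one standard answer to the stream scalability problem: the stream is too large to store, so we keep a fixed-size uniform sample. A minimal sketch of classic reservoir sampling (a standard algorithm, not module-specific code):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of size k over a stream of
    unknown (and possibly unbounded) length, using O(k) memory."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, n)           # item survives with prob. k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
```

After processing n items, every item seen so far is in the reservoir with equal probability k/n, which is exactly the uniformity guarantee needed when the full stream cannot be retained.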
Graph data processing I: graph algorithms
Counting triangles
Community detection: the Girvan-Newman algorithm
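As a small illustration of graph analytics (a standard formulation, not module code), triangles in an undirected graph can be counted by intersecting the neighbour sets of each edge's endpoints; every common neighbour closes a triangle, and each triangle is found once per edge, so we divide by three:

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of edges,
    with each undirected edge listed exactly once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        # every common neighbour of u and v closes a triangle on edge (u, v)
        count += len(adj[u] & adj[v])
    return count // 3  # each triangle is counted once per each of its 3 edges

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n_tri = count_triangles(edges)  # the single triangle {0, 1, 2}
```

This edge-local structure is also what makes triangle counting a natural candidate for massive parallelisation: each edge's neighbour-set intersection can be computed independently.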
Graph data processing II: Graph databases and query languages (GDBMS)
The property graph model and the G-CORE query language
Neo4J: the Cypher graph query language
Lab Practical and signed-off checkpoint: the Neo4J Cypher challenge
Graph data processing III: The Pregel parallel graph processing model
PageRank computed a vertex at a time
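To make the vertex-centric idea concrete, here is a hedged pure-Python sketch of Pregel-style PageRank (illustrative only; real Pregel runs vertices in parallel with message passing between supersteps). In each superstep, every vertex splits its current rank evenly along its outgoing edges, then recomputes its rank from the messages it received:

```python
def pregel_pagerank(out_edges, supersteps=20, d=0.85):
    """Vertex-centric PageRank over a graph given as {vertex: [out-neighbours]}.
    Assumes every vertex has at least one outgoing edge (no dangling nodes)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # "Message passing": each vertex sends rank/out-degree to its targets.
        inbox = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t] += rank[v] / len(targets)
        # Each vertex updates its rank from its received messages.
        rank = {v: (1 - d) / n + d * inbox[v] for v in out_edges}
    return rank

# Tiny example graph: 0 -> 1, 1 -> 0, 2 -> 0; vertex 0 should rank highest.
graph = {0: [1], 1: [0], 2: [0]}
pr = pregel_pagerank(graph)
```

The key Pregel property this sketch mimics is that each vertex's update depends only on its own state and its incoming messages, so the supersteps can be distributed across a cluster.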
Assessment
The module is assessed through two pieces of coursework (90%), to be completed by the last day of class. A brief live one-on-one discussion of your work with the module leader / demonstrators is also required (10%).
(1) Spark programming assignment: you will manage and analyze fragments of the NYC Taxi rides dataset through a series of tasks (data cleaning, joining, aggregation, ...). You will need to show how your code performs on inputs at different scales (from 3M trips to 130M trips).
You will work on the Databricks or Cloudera Spark clusters on Microsoft Azure.
(2) Graph analytics assignment: you will use the Neo4J Graph Data Science library to implement simple analytics on the same Taxi datasets.