Postgraduate teaching:
Data Engineering for AI
Course Objectives
Intended Knowledge Outcomes
You will:
Learn fundamental notions of parallel data processing and scalability
Understand the challenges associated with processing different types of Big Data (batch, streaming, graph-structured)
Learn fundamental concepts in data analytics: Exploratory (EDA) and Predictive (Machine Learning) with case studies in different application domains
Learn to “stay on top” of cutting-edge algorithms and architectures for Scalable Data Engineering, by discovering and reading selected research literature and producing a critical analysis
Intended Skill Outcomes
Learn to use practical computation environments for Big Data: Spark (massively parallel data processing) on the Cloud, and analytics workflows, with applications to specific analysis goals in diverse application domains
Develop problem-solving skills that are specific to Big Data Analytics
About the Course
The aim of the module is to introduce students to the complex combination of data engineering technology and data science that makes it possible to extract valuable knowledge from “Big Data”. A number of technical challenges stem from the high volume, high diversity (heterogeneity of meaning and format), and variable quality of the data, and a distinction is made based on whether the data is stationary (resides in a data repository) or in motion (data streaming, as would be produced for instance by sensors).
The module emphasises the following aspects:
Distribution of data processing over a cluster of computing nodes hosted in a cloud environment, as a way to scale out computing resources as the size of the data to be processed increases. This includes current frameworks for massively parallel data processing, notably Spark, the most successful example of a cloud-based distributed programming platform, and possibly Dask.
Examples of algorithms that can be successfully parallelised and thus are able to take advantage of distributed data architectures
Models of computation that enable near-real-time analytics on data streams
Specialised data structures, specifically graphs. The module covers the basics of graph databases (Neo4J) as well as massively parallel graph algorithms, e.g., those implemented using the Pregel framework.
Examples of data science applications, including Machine Learning algorithms, that are enabled by Big Data technology.
Emphasis is also placed on the rapid pace of technology advances in this area, and cutting-edge further reading material is offered for in-depth learning and deep dives into specific topics.
Module Topics
Introduction to Data Science and Data Analytics. Scalability, efficiency of parallel processing.
Batch Big Data Processing (MapReduce)
Computing environments for Big Data Analytics and Machine Learning: Big Data platforms (Databricks / Cloudera), Spark
Data Stream processing: Overview of real time Event Processing and querying
Graph data processing: Example of algorithms for graph analytics, graph databases and query languages (GDBMS), massively parallel graph processing model (Pregel)
Detailed Lecture plan
Introduction to Data Science and Data Analytics. Scalability, efficiency of parallel processing.
The MapReduce paradigm
HDFS
Map Reduce. Example: matrix-vector multiplication and PageRank
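As an illustration only (not module code), matrix-vector multiplication maps naturally onto MapReduce: the map phase pairs each sparse matrix entry with the matching vector component, and the reduce phase sums the partial products by row. A minimal pure-Python sketch:

```python
from collections import defaultdict

def map_phase(matrix, vector):
    """Map: each matrix entry (i, j, m_ij) emits the pair (i, m_ij * v_j)."""
    for i, j, m_ij in matrix:
        yield i, m_ij * vector[j]

def reduce_phase(pairs):
    """Reduce: sum all partial products that share the same row key i."""
    sums = defaultdict(float)
    for i, partial in pairs:
        sums[i] += partial
    return dict(sums)

# Sparse matrix as (row, col, value) triples; dense vector as a list.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = [10.0, 20.0]
result = reduce_phase(map_phase(M, v))
# row 0: 1*10 + 2*20 = 50.0; row 1: 3*20 = 60.0
```

In a real MapReduce deployment the map outputs would be shuffled across nodes by key before the reduce phase; the same key-grouped summation is also the core of the PageRank iteration.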
Introduction to Spark in Python: Exploratory Data Analytics (EDA) and Predictive Data Analytics (PDA)
The Spark programming model: functional programming using RDD and DataFrames
Practical examples: Exploratory and Predictive data analytics using PySpark and Pandas
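The Spark programming model chains functional transformations (filter, map, reduce) over distributed collections. As a conceptual sketch only, using plain Python built-ins rather than a Spark cluster, the same pipeline shape on a hypothetical toy trips dataset:

```python
from functools import reduce

# Toy "trips" dataset: (distance_km, fare) tuples (illustrative values).
trips = [(2.0, 8.5), (0.0, 0.0), (5.5, 18.0), (1.2, 6.0)]

# Spark-style chain: drop zero-distance records, map each trip to its
# fare per km, then reduce to an average. In PySpark this would be a
# filter/map chain over a distributed RDD or DataFrame.
valid = list(filter(lambda t: t[0] > 0, trips))
per_km = list(map(lambda t: t[1] / t[0], valid))
avg = reduce(lambda a, b: a + b, per_km) / len(per_km)
```

The point of the analogy is that each transformation is a pure function, which is what lets Spark partition the collection and run the same code on every node.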
Data Stream processing: Overview of real time Event Processing and querying
Stream data processing: scalability problems
Introduction to Sampling and Filtering
Technology: Spark Streaming
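Sampling is one standard answer to the stream scalability problem: the stream is too large to store, so we keep a fixed-size uniform sample. A minimal sketch of classic reservoir sampling (a standard algorithm, not module-specific code):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of size k over a stream of
    unknown (and possibly unbounded) length, using O(k) memory."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, n)           # item survives with prob. k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
```

After processing n items, every item seen so far is in the reservoir with equal probability k/n, which is exactly the uniformity guarantee needed when the full stream cannot be retained.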
Graph data processing I: graph algorithms
Counting triangles
Community detection: the Girvan-Newman algorithm
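As a small illustration of graph analytics (a standard formulation, not module code), triangles in an undirected graph can be counted by intersecting the neighbour sets of each edge's endpoints; every common neighbour closes a triangle, and each triangle is found once per edge, so we divide by three:

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of edges,
    with each undirected edge listed exactly once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        # every common neighbour of u and v closes a triangle on edge (u, v)
        count += len(adj[u] & adj[v])
    return count // 3  # each triangle is counted once per each of its 3 edges

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n_tri = count_triangles(edges)  # the single triangle {0, 1, 2}
```

This edge-local structure is also what makes triangle counting a natural candidate for massive parallelisation: each edge's neighbour-set intersection can be computed independently.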
Graph data processing II: Graph databases and query languages (GDBMS)
The property graph model and the G-CORE query language
Neo4J: the Cypher graph query language
Lab Practical and signed-off checkpoint: the Neo4J Cypher challenge
Graph data processing III: The Pregel parallel graph processing model
PageRank computed a vertex at a time
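To make the vertex-centric idea concrete, here is a hedged pure-Python sketch of Pregel-style PageRank (illustrative only; real Pregel runs vertices in parallel with message passing between supersteps). In each superstep, every vertex splits its current rank evenly along its outgoing edges, then recomputes its rank from the messages it received:

```python
def pregel_pagerank(out_edges, supersteps=20, d=0.85):
    """Vertex-centric PageRank over a graph given as {vertex: [out-neighbours]}.
    Assumes every vertex has at least one outgoing edge (no dangling nodes)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # "Message passing": each vertex sends rank/out-degree to its targets.
        inbox = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t] += rank[v] / len(targets)
        # Each vertex updates its rank from its received messages.
        rank = {v: (1 - d) / n + d * inbox[v] for v in out_edges}
    return rank

# Tiny example graph: 0 -> 1, 1 -> 0, 2 -> 0; vertex 0 should rank highest.
graph = {0: [1], 1: [0], 2: [0]}
pr = pregel_pagerank(graph)
```

The key Pregel property this sketch mimics is that each vertex's update depends only on its own state and its incoming messages, so the supersteps can be distributed across a cluster.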
Assessment
The module is assessed through two pieces of coursework (90%), to be completed by the last day of class. A brief live one-on-one discussion of your work with the module leader / demonstrators is also required (10%).
(1) Spark programming assignment: you will manage and analyze fragments of the NYC Taxi rides dataset through a series of tasks (data cleaning, joining, aggregation, ...). You will need to show how your code performs on inputs at different scales (from 3M trips to 130M trips).
You will work on the Databricks or Cloudera Spark clusters on Microsoft Azure.
(2) Graph analytics assignment: you will use the Neo4J Graph Data Science library to implement simple analytics on the same Taxi datasets.