Postgraduate teaching:

Data Engineering for AI

Course Objectives

Intended Knowledge Outcomes

You will:

Learn to “stay on top” of cutting-edge algorithms and architectures for scalable data engineering by discovering and reading selected research literature and providing a critical analysis of it.

Intended Skill Outcomes

About the Course

The aim of the module is to introduce students to the complex combination of data engineering technology and data science that makes it possible to extract valuable knowledge from “Big Data”. The technical challenges stem from the high volume, high diversity (heterogeneity of meaning and format), and variable quality of the data. A further distinction is made between data that is stationary (residing in a data repository) and data that is in motion (streaming data, as produced for instance by sensors).
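To make the stationary versus in-motion distinction concrete, here is a minimal sketch in plain Python (not Spark). It assumes a hypothetical list of sensor readings: the batch function needs the whole dataset up front, while the streaming function updates its answer one record at a time and never stores the history. The function names and the sample values are illustrative only.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Stationary data: the full dataset is available, so one pass over it suffices."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Data in motion: emit an updated mean after each arriving value.

    Only the running count and the current mean are kept in memory.
    """
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update of the mean
        yield mean


# Hypothetical sensor readings, used here both as a stored batch and as a stream.
readings = [2.0, 4.0, 6.0, 8.0]
batch = batch_mean(readings)
running = list(streaming_mean(iter(readings)))
```

After the last element has arrived, the streaming result matches the batch result; the difference is that intermediate answers are available while the data is still being produced.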

The module emphasises the following aspects:

Emphasis is also placed on the rapid pace of technological advances in this area, and cutting-edge further reading is offered for in-depth study of specific topics.

Module Topics

Detailed Lecture plan

Assessment

The module is assessed through two pieces of coursework (90%), to be completed by the last day of class. A brief live one-on-one discussion of your work with the module leader or demonstrators is also required (10%).

(1) Spark programming assignment: you will manage and analyze fragments of the NYC Taxi rides dataset through a series of tasks (data cleaning, joining, aggregation, ...). You will need to show how your code performs on inputs at different scales (from 3M trips to 130M trips).

You will work on the Databricks or Cloudera Spark clusters on Microsoft Azure.
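As a rough illustration of the kinds of tasks named above (cleaning, joining, aggregation), the following sketch runs the same three steps on a handful of in-memory records using only the Python standard library. It is not Spark code and not the assignment solution: the field names (`trip_id`, `zone_id`, `distance_km`, `fare`), the zone lookup table, and the cleaning rules are all assumptions made for the example and do not reflect the real dataset schema.

```python
from collections import defaultdict

# Hypothetical trip records; the field names are illustrative, not the real schema.
trips = [
    {"trip_id": 1, "zone_id": 10, "distance_km": 2.5, "fare": 9.0},
    {"trip_id": 2, "zone_id": 10, "distance_km": -1.0, "fare": 5.0},  # invalid distance
    {"trip_id": 3, "zone_id": 20, "distance_km": 7.0, "fare": 21.0},
    {"trip_id": 4, "zone_id": 20, "distance_km": 3.0, "fare": None},  # missing fare
    {"trip_id": 5, "zone_id": 10, "distance_km": 4.0, "fare": 13.0},
]

# Hypothetical lookup table mapping zone ids to names.
zones = {10: "Midtown", 20: "Harlem"}


def clean(rows):
    """Cleaning: drop rows with a missing or non-positive fare or distance."""
    return [
        r for r in rows
        if r["fare"] is not None and r["fare"] > 0 and r["distance_km"] > 0
    ]


def join_zones(rows, zone_lookup):
    """Joining: inner join of trips with the zone lookup on zone_id."""
    return [
        {**r, "zone_name": zone_lookup[r["zone_id"]]}
        for r in rows
        if r["zone_id"] in zone_lookup
    ]


def mean_fare_by_zone(rows):
    """Aggregation: group by zone name and average the fare."""
    totals = defaultdict(lambda: [0.0, 0])  # zone_name -> [fare sum, trip count]
    for r in rows:
        acc = totals[r["zone_name"]]
        acc[0] += r["fare"]
        acc[1] += 1
    return {zone: total / n for zone, (total, n) in totals.items()}


result = mean_fare_by_zone(join_zones(clean(trips), zones))
```

On a Spark cluster the same logic would typically be expressed with DataFrame filters, joins, and grouped aggregations so that it can be distributed over millions of trips; the in-memory version here only shows the shape of the pipeline.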

(2) You will use the Neo4j Graph Data Science library to implement simple analytics on the same Taxi dataset.