This module will provide an overview of processing data at large scale and of parallel processing. It will introduce Hadoop and Spark and the use of parallel processing paradigms. It will also extend to data analytics (distributed ETL and machine learning, e.g. more advanced Hadoop and Spark), including: data querying using relational database management systems (RDBMS) and NoSQL databases; data processing pipelines; database design with RDBMS and NoSQL databases; and cloud computing.
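As a small taste of the Spark material, the following is a minimal sketch of a word count in PySpark. It assumes a local PySpark installation and a hypothetical input file "input.txt"; on a cluster, the path could equally be an HDFS URI (e.g. hdfs://...).

```python
# Minimal PySpark word count (a sketch; assumes pyspark is installed
# and that an input file "input.txt" exists -- the file name is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")               # one partition per file block
      .flatMap(lambda line: line.split())  # map: line -> words
      .map(lambda word: (word, 1))         # map: word -> (word, 1)
      .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word
)

for word, n in counts.take(10):            # pull a small sample to the driver
    print(word, n)

spark.stop()
```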
Understanding the key issues associated with distributed systems and their modern applications, including performance, reliability, security, scalability, and the complexity of data and computation.
Knowledge and skills to critically analyse various distributed computing techniques and approaches for their suitability to a specific application.
Ability to design and implement a distributed system application utilising modern distributed computing paradigms and technologies to achieve the required quality attributes.
Data models of unstructured and semi-structured data and querying with NoSQL databases (a querying sketch follows this list).
Parallel programming models, such as MapReduce, actors, reactive stream processing, and web services (a MapReduce sketch follows this list).
Infrastructures for large scale distributed data processing, such as Hadoop distributed file systems and NoSQL databases.
Service-oriented and cloud computing paradigms, such as IaaS, PaaS, and SaaS, including microservices and container technologies (a microservice sketch follows this list).
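To make the NoSQL querying item concrete, here is a minimal sketch using MongoDB via pymongo. The connection string, database, collection, and documents are all hypothetical, and a MongoDB server is assumed to be running locally.

```python
# Querying semi-structured documents in MongoDB (a sketch; the server URI,
# database name, and documents below are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
col = client["shop"]["products"]

# Documents need not share a schema: "tags" appears on only one of them.
col.insert_many([
    {"name": "kettle", "price": 25.0, "tags": ["kitchen"]},
    {"name": "mug", "price": 6.5},
])

# Filter query: products cheaper than 10.
for doc in col.find({"price": {"$lt": 10}}):
    print(doc["name"], doc["price"])

# Aggregation pipeline: average price across the collection.
avg = col.aggregate([{"$group": {"_id": None, "avgPrice": {"$avg": "$price"}}}])
print(list(avg))
```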
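The MapReduce model listed above can be sketched in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase folds each group. The toy word count below runs the map and reduce phases in parallel with multiprocessing; it illustrates the programming model only, not Hadoop itself.

```python
# A toy MapReduce word count in plain Python: map -> shuffle -> reduce.
# Illustrates the programming model; a real job would run on a cluster.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(item):
    """Reduce: sum the counts emitted for a single word."""
    word, counts = item
    return word, sum(counts)

if __name__ == "__main__":
    lines = ["big data big compute", "big data small data"]

    with Pool() as pool:
        # Map phase runs over the input lines in parallel.
        mapped = pool.map(map_phase, lines)

        # Shuffle: group all emitted values by key (word).
        groups = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                groups[word].append(count)

        # Reduce phase runs over the groups in parallel.
        for word, total in pool.map(reduce_phase, groups.items()):
            print(word, total)
```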
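For the microservices item, a service can be sketched as a small HTTP application. This minimal example uses Flask; the route and port are hypothetical, and in practice the service would be packaged into a container image and deployed on an IaaS or PaaS platform.

```python
# A minimal HTTP microservice (a sketch using Flask; the route and port
# are hypothetical). In practice this would be built into a container image.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Health-check endpoint, as commonly probed by container orchestrators.
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```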
Assessment:
A 15-minute recorded presentation and demonstration of individual coursework on the analysis and design of a distributed computing application.
Reading List