Big Data and Knowledge Management Systems (빅데이터 및 지식 관리 시스템)
Spring 2021 (Tue/Thu 11:00-12:15)
Instructor: Prof. SangKyun Cha (chask@snu.ac.kr)
TA: Joohyun Lee (jhlee@kdb.snu.ac.kr), Jaehwan Lim (amethyst@snu.ac.kr)
Course Overview
Summary
This course covers the modeling and management of various types of big data and knowledge from the perspective of the data-driven service life cycle. Students learn modern technologies for ingesting, storing, distributing, and processing big data in parallel and distributed cloud computing environments. In the final project, groups of students collect real-world big data and build a data science analysis using relational and graph databases. Students report their own solutions to data science problems and demonstrate the feasibility of those solutions.
Historical development of data models and data management systems
Structured data management and relational data model
Relational data storage, metadata management, and query processing
Transaction management and database recovery management
Spatio-temporal data, graph data, semi-structured and unstructured data, and knowledge structures
Parallel and distributed big data systems and complex analytics processing and machine learning on cloud infrastructure
Big data and model life cycle management
Grading scheme
Attendance 5%
Assignment 30%
Midterm exam 25%
Final exam 25%
Project 15%
Content
Introduction
Real-world data and knowledge structures, and their life cycle
Management of Structured and Unstructured Data
Query, Complex Analytics, and Machine Learning
Transactions, Concurrency Control, Logging, and Recovery
Hardware Development and Real-world Demands Driving Technology Paradigm Changes: Relational DBMS, In-Memory Platform, Ambient AI
First Order Logic, PROLOG
First Order Logic Fundamentals, Translating Knowledge into Logic
Resolution Algorithm, Unification
PROLOG Syntax, PROLOG Tutorial (Hanoi Tower, RDBMS, Graph)
Natural Language Processing in PROLOG
Debugging and Tracing in Prolog
Natural Language Analysis in Prolog
Frequent Graph Structures and Applications, Graph Storage and Operators - Neo4j
Neo4j in Depth
Spatial/Temporal Type in Neo4j
Relational DBMS
Relational Algebra and Query Language
Physical Tables: Row vs Column Stores, Indexes
Why We Need Virtual Tables (Views)
SQL: Data Definition Language, Data Manipulation Language, Data Control Language, Transactions: ACID Properties, Isolation Levels
Index as Redundant Structures for Fast Access: B+-Tree
Logical and Physical Query Plan, Relational Query Optimization
In-Memory Data Management Architecture, Column Store
Dictionary Encoding for Compression in Column Store
In-Memory Complex Analytical Query Processing
Transaction Processing, Concurrency Control - Locking and Latching, Multi-Version Concurrency Control (MVCC), Index Concurrency Control (with OLFIT)
Logging & Recovery, and Replication for High Availability
Distributed Systems for Scalability: Shared-Nothing Partitioning, Distributed Transaction and Query Processing, Distributed Computing for Cloud and Edge Systems
Data Cleansing, ETL (Extract-Transform-Load)
Distributed Workflow Management and Long Transactions