Big data platform architecture and management. Big data sources and types. Programming and management practices on distributed data systems and cloud systems. The MapReduce paradigm. Tools for working with structured and unstructured data on a large distributed file system. Batch and real-time streaming data processing. Working pipelines for big data processing from the data source. Common open-source and cloud-based visual data analytics platforms. Recommendation. Data lakes. Open data. Case studies.
This class has three core topics: 1) big data platform management, including the Hadoop ecosystem; 2) big data visualization and analytics tools; 3) some concepts of parallel and distributed computing.
The first priority is processing big data with the Hadoop ecosystem and its tools, which takes about half of the course time. The second is distributed computing programming, which we discuss along the way because the Hadoop file system relies on it. The third is the Elastic ecosystem, which includes a processing pipeline and visualization and can be connected to what we learn in the first part.
We will mostly learn by doing. There will NOT be much theory, but we will mention some of it, along with design principles and related concepts, as we go.
Students will practice the skill set and get used to the ecosystem so that they can adapt to similar tools in the future.
After finishing the course, students will have improved their problem-solving skills and their ability to learn new things.
Note that there will not be many TAs to help you with setup.
You will have to practice operating your VM and Linux yourself. You are welcome to help your friends too.
Most weeks you will have a hands-on lab or programming lab to finish, which may take around 1.5-2.0 hours depending on your skill. You may spend more time exploring solutions and studying them, though.
The main principle is to learn how things work and understand why, including their architecture, rather than just copying and pasting, since that won't improve your knowledge.
Learning strategy:
- In-class meetings are for lectures and labwork/assignment discussions.
- Students are required to manage their VMs themselves.
- Homework is due every 2-3 weeks, and quizzes using Quizizz are held every 3 weeks.
(programming assignments along with the VM setup)
0. What is big data? intro slide ecosystem <video lecture>
1. Introduction to HDFS and Hadoop ecosystem
Hadoop installation guide single node
Your task: Hadoop installation
HDFS commands cheat sheet (1) (2)
installation guide multinode <optional>
2. MapReduce Concepts and Wordcount program
<slide wordcount> <video lecture>
Your task: set up Hadoop (for example, from scratch) and run the Hadoop wordcount program
Example installation and runs (Java) (Video)
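To make the wordcount idea concrete before running it on Hadoop, here is a minimal sketch of the three MapReduce phases in plain Python. It runs in a single process with no Hadoop involved, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `wordcount`) are made up for illustration; they are not part of any Hadoop API. The course example itself is in Java.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def wordcount(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; on a cluster the lines would come from HDFS.
    sample = ["big data big ideas", "data lake"]
    print(wordcount(sample))
```

On a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; the sketch only shows the data flow.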
3. Data store examples on HDFS, Hive, HBase, Pig
Installation
Lecture
lecture hive and pig lecture-hbase
Video : hbase hive and pig
Tools:
parquet intro notebook1 notebook2
Hive SQL Command Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
HBase:
Pig:
4. Spark Ecosystem: PySpark, SparkML, Streaming with Spark, GraphFrames
spark RDD video lecture and demo
Current version is at official page.
Spark cluster setup (assume you have hdfs)
GraphX
SparkML
5. Messaging service with Kafka
(optional MQTT & Python)
Kafka cluster Installation guide
A fully running system at this point should have HDFS, Hive, HBase, and Kafka.
6. Elasticsearch ecosystem (ELK)
Elasticsearch, Filebeat, Logstash, Kibana
Their connectivity to Spark and Kafka
ELK Stack install or (installation guide)
Video lecture 2019 <filebeat logstash elastic kibana demo>
Elastic & Kibana cluster setup (video)
7. Certifications (potential)
- AWS Data Engineering / TBA
- NVIDIA Data Engineering Pipeline / TBA
8. Demos
9. Misc topics
TaskFlow
https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html
OpenSearch
https://opensearch.org/downloads.html
login: bigdata2024
password: bigdata1234
Before transferring any files or installing anything, don't forget to authenticate using KUWIN: http://login.ku.ac.th/
Assignments 50%
Quizzes, exam (in-class and/or take-home midterm/final) 50%