The goals of this class are to learn three things: 1) big data platform management, including the Hadoop ecosystem; 2) big data visualization and analytics tools; 3) basic concepts of parallel and distributed computing.
The first priority is processing big data with the Hadoop ecosystem and its tools, on which we will spend about half of the time. The second is distributed computing programming, which we discuss throughout, since HDFS relies on it. The third is the Elastic ecosystem, including processing pipelines and visualization, which connects back to the first part.
We will learn mostly by doing. There will NOT be much theory, but we will cover some concepts and design principles as we go.
Students will practice the skill set and get used to the ecosystem so that in the future they can adapt to similar technologies on their own.
I am confident that by the end of the course, students will have strengthened their problem-solving skills and their ability to learn new things.
Note: there will not be many TAs available to help you with setup.
You will have to operate your VM yourself.
Most weeks you will have a hands-on or programming lab to finish, which may take around 1.5-2.0 hours depending on your skill. You may spend more time exploring solutions and studying/understanding them, though.
The main principle is to learn how it works and understand why; not just copy and paste, since that won't improve your knowledge.
This class is not suitable for:
1) Those who do not want to learn by practicing, or who are not interested in system administration
2) Those who expect to learn a lot of theory.
3) Those who do not like programming. (We can practice, though, if your background is not strong.)
4) Those who expect individual help, e.g. with UNIX command-line usage, setup problems, or syntax errors, since we do not have enough TAs.
Learning strategy:
- In-class meetings are for lectures and assignment discussions.
- Students are required to manage their VMs themselves.
- Homework is due every 2-3 weeks, and quizzes (using Quizizz) every 3 weeks.
(programming assignments along with the VM setup)
0. What is big data? Intro slide: ecosystem <video lecture>
1. Introduction to HDFS and Hadoop ecosystem
Hadoop installation guide (single node)
HDFS commands cheat sheet (1) (2)
2. MapReduce Concepts and Wordcount program
<slide wordcount> <video lecture>
-- Running setup: install Hadoop either from scratch or via Cloudera, then run the Hadoop wordcount <slide>
Installation guide, multi-node <optional>
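The map, shuffle, and reduce phases of the wordcount program can be sketched in plain Python. This is a toy in-process simulation to show the dataflow, not the Hadoop API; the real job runs these phases in parallel across HDFS blocks:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts for one word.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Understanding this three-phase flow makes the real Java/streaming wordcount job much easier to read.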
3. Data store examples on HDFS: Hive, HBase, Pig
Installation
Lectures: Hive and Pig; HBase
Videos: HBase; Hive and Pig
Tools:
Parquet intro: notebook1, notebook2
Hive SQL Command Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
HBase:
Pig:
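HBase stores data as a sparse, sorted map: row key, then column family, then column qualifier, then value. A toy Python sketch of that layout (the table contents and names like "info" are invented examples, not a real schema):

```python
# Toy model of HBase's storage layout: a nested map keyed by
# row key, then column family, then column qualifier.
# The row keys and column names here are invented for illustration.
table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user001", "info", "name", "Alice")
put("user001", "info", "email", "alice@example.com")
put("user002", "info", "name", "Bob")

print(get("user001", "info", "name"))  # Alice
# HBase keeps rows sorted by row key; range scans over contiguous
# keys are the natural access pattern.
scan = sorted(table)
```

The sparseness matters: a row only stores the qualifiers it actually has, which is why HBase suits wide, irregular datasets that would waste space in a fixed-schema table.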
4. Spark Ecosystem: Pyspark, SparkML, Streaming with Spark, GraphFrame
Spark RDD video lecture and demo
The current version is documented at the official page.
Spark cluster setup (assume you have hdfs)
GraphX
SparkML
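Spark RDD transformations such as mapValues and reduceByKey are chained operations over key-value data. Without a cluster, the same dataflow can be sketched in plain Python (this is a simulation of the pattern, not the PySpark API; the sensor data is made up):

```python
from functools import reduce
from itertools import groupby

# Simulated RDD pipeline: (sensor, reading) pairs -> average per sensor.
# Rough PySpark equivalent:
#   rdd.mapValues(lambda v: (v, 1))
#      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
#      .mapValues(lambda s: s[0] / s[1])
data = [("s1", 10.0), ("s2", 4.0), ("s1", 20.0), ("s2", 6.0)]

# mapValues: pair each reading with a count of 1
pairs = [(k, (v, 1)) for k, v in data]

# reduceByKey: combine (sum, count) per key
grouped = groupby(sorted(pairs), key=lambda kv: kv[0])
sums = {k: reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                  (v for _, v in kvs))
        for k, kvs in grouped}

# mapValues: sum / count -> average
averages = {k: s / c for k, (s, c) in sums.items()}
print(averages)  # {'s1': 15.0, 's2': 5.0}
```

Computing (sum, count) pairs instead of averaging directly is the key trick: sums and counts combine associatively across partitions, while averages do not.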
5. Messaging service with Kafka
(optional MQTT & Python)
Kafka cluster installation guide
** A full running system at this point should have HDFS, Hive, HBase, and Kafka.
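Kafka decouples producers from consumers: producers append messages to a topic log, and each consumer reads at its own offset. A toy in-process sketch of that idea in plain Python (this illustrates the model only; it is not the Kafka client API, and the event data is invented):

```python
# Toy sketch of Kafka's core model: an append-only topic log,
# with each consumer tracking its own offset into it.
class Topic:
    def __init__(self):
        self.log = []                 # append-only message log

    def produce(self, message):
        self.log.append(message)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0               # each consumer keeps its own position

    def poll(self):
        # Return all messages published since the last poll.
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages

events = Topic()
events.produce({"sensor": "s1", "value": 21.5})
events.produce({"sensor": "s2", "value": 19.0})

dashboard = Consumer(events)
archiver = Consumer(events)
seen_dashboard = dashboard.poll()
seen_archiver = archiver.poll()
print(len(seen_dashboard))  # 2 -- each consumer sees every message
```

Because consuming never removes messages from the log, many independent consumers (a dashboard, an archiver, a Spark streaming job) can read the same topic at their own pace.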
6. Elasticsearch ecosystem (ELK)
Elasticsearch, Filebeat, Logstash, Kibana
Their connectivity to Spark and Kafka
ELK Stack install (or installation guide)
Video lecture 2019 <filebeat logstash elastic kibana demo>
Elastic & Kibana cluster setup (video)
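Elasticsearch's bulk API expects newline-delimited JSON: an action line, then the document itself, with a trailing newline at the end. A small sketch that builds such a payload in Python (the index name "logs" and the documents are example values; actually sending it would be an HTTP POST to the _bulk endpoint, which is not shown here):

```python
import json

def build_bulk_payload(index, docs):
    # Each document becomes two NDJSON lines: an action line
    # telling Elasticsearch what to do, then the document source.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"level": "INFO", "message": "service started"},
    {"level": "ERROR", "message": "disk full"},
]
payload = build_bulk_payload("logs", docs)
print(payload)
```

Tools like Filebeat and Logstash produce exactly this kind of bulk traffic under the hood, so recognizing the format helps when debugging an ingestion pipeline.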
7. Certifications (potential)
-AWS Data Engineering / TBA
-NVIDIA Data Engineering Pipeline / TBA
8. Demos
9. Misc topics
TaskFlow
https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html
Opensearch
https://opensearch.org/downloads.html
login: bigdata2024
password: bigdata1234
Before transferring or installing any files, don't forget to authenticate via KUWIN: http://login.ku.ac.th/
Homework: MapReduce
Assignments 60%
Quizzes, exams (in-class and/or take-home) 40%