2021/7/28: Links or files on this page might be lost. Please see the latest version of this course on the Teaching page.
This course serves as an introductory class for students who major in Management Information Systems and are interested in cybersecurity analysis. In this course, we focus on data analysis-related topics. Students should become familiar with the tools, algorithms, concepts, and execution environment needed to perform data analysis on cybersecurity data. Students need to learn to act as architects who solve security-related problems using data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in the class. We will introduce several security systems that implement data analysis algorithms to achieve their security goals.
Note that students should take Programming Language I and Programming Language II (i.e., two semesters of programming courses) before taking this class. The programming language used in this class is Python (however, we will NOT cover any Python language tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to implement Python scripts, so please bring your laptop to class.
Note this course is designed for students who are in their third or fourth year of college at the MIS department. If you have taken any advanced AI/ML/DM course, you may want to skip this course. It is an English-taught class, and it will be broadcast online.
2/22: We now have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post your question directly in the forum. TA and I will respond later.
2/22: For the students who want to enroll in this class, please come to the classroom with your enrollment document on 2/25 and 3/4. We will check how many seats are available for the additional students.
2/25: Here is the QR code for the undergraduate students; and the QR code for the graduate students. Due to COVID-19, NCCU requests that you MUST scan the QR code every time you attend the class.
3/25: #HW01 announced. Due 4/1 9 am UTC+8. The TA will send instructions for homework submission.
3/31: #HW02 announced. Due 4/15 9 am UTC+8.
4/15: #HW03 announced. Due 4/21 9 am UTC+8.
4/16: MIDTERM announced. Due 2021/04/29 (Thu) 09:00 UTC+8. Please follow the instructions in the colab file. You may ask questions in our class forum https://groups.google.com/g/nccu-ds4s. Note that you MUST finish the MIDTERM alone. You CANNOT share or discuss the MIDTERM with anyone else. We highly recommend that you finish the MIDTERM as early as possible. Do not wait until the last minute.
5/06: #HW04 announced. Due 5/20 9 am UTC+8.
5/18:
No class on 5/20 (University Anniversary).
Please take a look at K-means and Self-Organizing Map: [YouTube] on your own. We will NOT cover this lecture.
Due to COVID-19, our course will be purely online. Please do NOT go to the classroom. If you have any questions, please post them on the forum.
The format of the FINAL exam will be the same as the MIDTERM. You will have one week to finish it at home on your own. It starts on 6/17 and is due 6/24 9 am UTC+8.
#HW05 announced. Due 6/3 9 am UTC+8.
Please spend some time on your FINAL PROJECT PROPOSAL. You will need to upload a PDF document (#PJ01, at most 4 pages) that includes the Dataset (you want to use), Problem (you want to solve), Preprocess (you want to apply), Models (you want to use), and Expectation of your data analysis project. Due 6/3 9 am UTC+8; send the PDF file to the TA. You can freely choose the topic of your final project. We recommend using a dataset from Kaggle or Google Dataset. The dataset/problem does not have to be security-related; however, we strongly recommend choosing a security-related topic. If you have any questions, post them on the forum.
You will need to complete your FINAL PROJECT on colab (#PJ02) before the FINAL exam.
6/4:
#BN01: COVID-19. 4% of final points will be awarded. You may see the example made by Taiwan CDC [link]. Send your code/webpage to the TA before the FINAL (6/24 9 am UTC+8). Data Ref: https://github.com/CSSEGISandData/COVID-19
#HW06: Network Intrusion Detection. See the detailed description below. Due 6/17 9 am UTC+8.
6/17:
FINAL announced. Due 2021/06/24 (Thu) 09:00 UTC+8. Please follow the instructions in the colab file. You may ask questions in our class forum https://groups.google.com/g/nccu-ds4s. Note that you MUST finish the FINAL alone. You CANNOT share or discuss the FINAL with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.
Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: Yi-Xian Building, 5F
Online Video: https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA
The class will be broadcast live on YouTube every Thursday morning (09:10~12:00 UTC+8).
The VOD will be kept on the YouTube channel, and edited (shortened) versions will be uploaded later (but this is not guaranteed).
TA: Kelvin I. W. Kuok <108356041 at g.nccu.edu.tw>
Office Hours: By appointment.
QR codes: QR code for the undergraduate students; QR code for the graduate students
Understand the relationship of Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, machine learning algorithms for security application.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system and sequence analysis system.
Understand visualized machine learning tools: Orange
Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Data Visualization
Network Security Through Data Analysis, Michael Collins, O'Reilly Media, 2014.
Data-Driven Security: Analysis, Visualization and Dashboards, Jay Jacobs and Bob Rudis, Wiley, 2014.
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
02/25: Introduction to Cybersecurity
T01: Security Management
T02: Cyber Attack
T03: Data Analysis Environment
Python Data Science Packages [PythonDSPackages.ipynb]
GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide), [GPU-test.ipynb]
03/04: Supervised Learning: classification I
T04: Table-Based Data and Data Analysis Process
Model, Linear Regression, Gradient Descent
Training/Testing Process w/ Linear Regression
Cost Function: MSE, RMSE (for regression problem)
Feature Engineering: Normalization, (Dimension Reduction), Missing Values, (Unbalanced Data), Outlier-1
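The T04 items above (model, gradient descent, MSE/RMSE, normalization) can be tied together in a minimal sketch. It is only an illustration under stated assumptions: the toy data, the learning rate, the epoch count, and the helper names `min_max_normalize`, `mse`, and `fit_linear` are all hypothetical and are not taken from the class notebooks.

```python
# Fit y = w*x + b by batch gradient descent on the MSE cost.
# Data, learning rate, and epoch count are illustrative assumptions.

def min_max_normalize(values):
    """Scale values into [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mse(xs, ys, w, b):
    """Mean squared error of the line w*x + b on the samples (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_linear(xs, ys, lr=0.1, epochs=5000):
    """Return (w, b) after running plain batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # d(MSE)/dw
        grad_b = 2.0 / n * sum(errors)                             # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

raw_x = [10.0, 20.0, 30.0, 40.0, 50.0]
xs = min_max_normalize(raw_x)        # feature scaled into [0, 1]
ys = [3.0 * x + 1.0 for x in xs]     # noise-free toy target
w, b = fit_linear(xs, ys)
rmse = mse(xs, ys, w, b) ** 0.5      # RMSE is the square root of MSE
print(f"w={w:.3f} b={b:.3f} rmse={rmse:.6f}")
```

Normalizing the feature first keeps the gradient steps well-conditioned, which is why a fixed learning rate of 0.1 is enough for this toy problem.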
T05: Visualized Machine Learning Tool
Orange (class note)
Data Analysis Workflow
Evaluation: Accuracy, ROC curve, confusion matrix (for classification problem)
Testing process: k-fold validation
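Orange performs the evaluation steps above through its widgets. As a rough illustration of what happens underneath, the following sketch splits sample indices into k folds and computes a binary confusion matrix and accuracy. The function names and the toy labels are hypothetical, and the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; folds are contiguous and unshuffled."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def confusion_matrix(y_true, y_pred, positive=1):
    """Count TP/FP/TN/FN for a binary classification problem."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            cm["tp" if truth == positive else "fp"] += 1
        else:
            cm["fn" if truth == positive else "tn"] += 1
    return cm

def accuracy(cm):
    """Fraction of samples that were classified correctly."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())

folds = list(k_fold_indices(10, 3))
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy classifier output
cm = confusion_matrix(y_true, y_pred)
tpr = cm["tp"] / (cm["tp"] + cm["fn"])   # one point on the ROC curve (y axis)
fpr = cm["fp"] / (cm["fp"] + cm["tn"])   # one point on the ROC curve (x axis)
print(cm, accuracy(cm), tpr, fpr)
```

An ROC curve is obtained by repeating the TPR/FPR computation while sweeping the decision threshold of a scoring classifier; the sketch shows a single point only.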
03/11: Supervised Learning: classification II
03/18: (cont'd)
03/25: Unsupervised Learning: clustering
T08: Unsupervised Learning Algorithms
Distance Functions, Scales, Similarity
UPGMA (UPGMA Walkthrough by Dr. R. Edwards), k-means, (knn supervised).
Dimension Reduction: PCA & LDA
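A small sketch may help connect distance functions, similarity, and k-means. It is a simplified illustration, not the implementation used in class: the initial centroids are naively taken as the first k points (real implementations usually randomize this), and the toy points are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def k_means(points, k, iterations=20):
    """Return (centroids, labels); naive init uses the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    labels = [min(range(k), key=lambda i: euclidean(p, centroids[i])) for p in points]
    return centroids, labels

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
print(labels)
```

Because Euclidean distance is scale-sensitive, features with large ranges dominate the clustering unless they are normalized first, which is one reason scales are discussed together with distance functions.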
T09: Problematic Data
Unbalanced (Imbalanced) Data
Outlier-2 (PCA-based)
Overfit and Underfit, Regularization
04/01: T10: Static Analysis
Windows Portable File Format: Wiki, Microsoft
Try this putty.exe and
PE file parsing
pip install pefile
PE file usage example, Binary Hacking
Entropy (again!)
Digital Signature, Windows Signature
Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
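Entropy is a common static feature because packed or encrypted regions tend to look close to random. The sketch below computes Shannon entropy over raw bytes using only the standard library; it does not use pefile, and the function names and the 256-byte window are assumptions chosen for illustration. In practice the bytes would come from a file or a PE section.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # sum over byte values of p * log2(1/p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def windowed_entropy(data: bytes, window: int = 256):
    """Entropy of each consecutive, non-overlapping window of the input."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]

padding = b"\x00" * 256            # constant bytes, as in zero padding
uniform = bytes(range(256))        # every byte value exactly once
profile = windowed_entropy(padding + uniform)
print(profile)
```

A per-window profile like this can be fed to a classifier as a feature vector, or simply inspected for regions whose entropy is suspiciously close to 8 bits per byte.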
04/08: Dynamic Analysis
T11: System Call and API Call
Profiler: Cuckoo
S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. IEEE Symposium on Security and Privacy, 1996.
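The idea of profiling a process by its short system call sequences can be sketched as follows. This is a simplified variant that stores contiguous n-grams of a "normal" trace and reports the fraction of unseen n-grams in a new trace; it is written in the spirit of the paper above and is not a reproduction of its exact method. The traces are invented, and real traces would come from a profiler such as Cuckoo.

```python
def ngrams(trace, n):
    """All contiguous windows of length n in a system call trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def build_profile(normal_traces, n=3):
    """Collect every n-gram observed in the normal traces."""
    profile = set()
    for trace in normal_traces:
        profile.update(ngrams(trace, n))
    return profile

def mismatch_rate(trace, profile, n=3):
    """Fraction of n-grams in the trace that never appeared in the profile."""
    windows = ngrams(trace, n)
    if not windows:
        return 0.0
    unseen = sum(1 for w in windows if w not in profile)
    return unseen / len(windows)

normal = [
    ["open", "read", "mmap", "mmap", "open", "read", "close"],
    ["open", "read", "close"],
]
profile = build_profile(normal)
benign = ["open", "read", "mmap", "mmap", "open"]
suspicious = ["open", "read", "execve", "socket", "connect"]
print(mismatch_rate(benign, profile), mismatch_rate(suspicious, profile))
```

A detector would flag a trace when the mismatch rate exceeds a threshold; choosing that threshold and the window length n is a trade-off between false alarms and missed attacks.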
04/15: Trace and Log
T12: Network: Packet Capture, Netflow
PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
Some PCAP examples: https://www.netresec.com/?page=PcapFiles
Event: Syslog, auditd
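Before log events can be analyzed, they have to be turned into structured records. The sketch below parses lines in the classic BSD-style syslog layout ("Mon DD HH:MM:SS host process[pid]: message") and counts failed SSH logins per source address. The sample lines, host names, and addresses are made up, and real logs may use other layouts that this regular expression does not cover.

```python
import re
from collections import Counter

SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)
# Matches typical OpenSSH failure messages; assumed wording, not exhaustive.
FAILED_RE = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

def parse_line(line):
    """Return a dict of syslog fields, or None if the line does not match."""
    match = SYSLOG_RE.match(line)
    return match.groupdict() if match else None

def failed_logins_by_source(lines):
    """Count failed SSH password attempts per source address."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None or record["process"] != "sshd":
            continue
        hit = FAILED_RE.search(record["message"])
        if hit:
            counts[hit.group(2)] += 1
    return counts

sample = [
    "Apr 15 09:10:01 web01 sshd[1234]: Failed password for root from 10.0.0.5 port 4022 ssh2",
    "Apr 15 09:10:03 web01 sshd[1234]: Failed password for invalid user admin from 10.0.0.5 port 4023 ssh2",
    "Apr 15 09:10:07 web01 sshd[1301]: Accepted password for alice from 10.0.0.9 port 5011 ssh2",
    "Apr 15 09:11:00 web01 CRON[1400]: (root) CMD (run-parts /etc/cron.hourly)",
]
first = parse_line(sample[0])
failures = failed_logins_by_source(sample)
print(first["host"], first["process"], first["pid"], dict(failures))
```

Per-source counts like these are a simple example of table-based features that can later be passed to a classifier or an anomaly detector.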
04/22: Midterm. No class.
04/29: T13: Deep Learning Basics
Supervised Learning, Unsupervised Learning, Reinforcement Learning
Regression (Sequence Stack and Model, Activation Function, Loss Function, Optimizer)
[Wiki] Activation function, Loss Function
Optimizer: RUDER's blog [EN], RUDER's paper [EN]
Classification (Convolution, Sub-sampling)
Pooling: MaxPooling2D, AveragePooling2D
Structure: Dropout, Flatten, Softmax (see activation function)
Overfitting Explanation
Loss: sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
Data Set: MNIST (kaggle)
Another Example of MNIST, CNN and Image Classification.
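The difference between the two loss names listed above is only the label encoding. The following pure-Python sketch mimics the per-sample computation (it is not Keras code, and the logits are made-up numbers): softmax turns raw scores into probabilities, and both cross-entropy variants then yield the same value for the same sample.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy when the label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy when the label is an integer class index."""
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]     # hypothetical network outputs for 3 classes
probs = softmax(logits)
loss_one_hot = categorical_crossentropy([1, 0, 0], probs)
loss_sparse = sparse_categorical_crossentropy(0, probs)
print(probs, loss_one_hot, loss_sparse)
```

With integer labels such as the MNIST digits 0 to 9, the sparse variant avoids building one-hot vectors; with labels that are already one-hot encoded, the categorical variant applies.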
05/06: Latent Space
05/13: Latent Space II
Understanding Latent Space in Machine Learning [EN]
Application: PCA-based Missing Value Imputation & Anomaly Detection
Word Embedding, Language Model
Too many weights? Ridge and Lasso Regression [EN]
05/20: No class. University Anniversary, University Sport Day
K-means and Self-Organizing Map: [YouTube]
Sensitivity of a Model to Neural Network Hyperparameter
05/27: Text-based Analysis with Orange
Orange provides some Text Machine Learning Workflows, see Orange.
# Spam Mail Filter: Spambase Data Set
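Spambase describes its word-frequency attributes as the percentage of words in an e-mail that match a given word. A minimal sketch of that feature extraction, assuming this definition and treating any run of alphanumeric characters as a word, could look as follows. The vocabulary and the sample e-mail are hypothetical.

```python
import re

def word_freq_features(text, vocabulary):
    """Return {word: 100 * occurrences / total words} for each vocabulary word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = len(words)
    if total == 0:
        return {w: 0.0 for w in vocabulary}
    return {w: 100.0 * words.count(w) / total for w in vocabulary}

email = "Free money! Claim your FREE prize now"
features = word_freq_features(email, ["free", "money", "meeting"])
print(features)
```

Each e-mail becomes one row of such percentages, which is the table-based form that classifiers (in Orange or in Python) expect.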
06/03: Intrusion Detection
An Introduction to Intrusion Detection by Aurobindo Sundaram.
R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, 2002, supl. 27--30.
Additional material: E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
Security Datasets: Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.
06/10: Anomaly Detection
V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
A. Lakhina, M. Crovella and C. Diot, "Diagnosing Network-Wide Traffic Anomalies," ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 219--230, 2004.
R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning For Network Intrusion Detection," in Proc. IEEE Symposium on Security and Privacy, 2010, pp. 305-316.
Here is an example of Network Analysis by Orange. But the "Network" here is not "Computer Network".
06/17: Language Model
Revisit language model
Word Embeddings, Word2Vec [ref: EN, TC]
Use language model to classify security-related articles.
Transformer and BERT
Attack Lifecycle
Google: attack life cycle
06/24: Final exam.
FINAL announced. Due 2021/06/24 (Thu) 09:00 UTC+8. Please follow the instructions in the colab file. You may ask questions in our class forum https://groups.google.com/g/nccu-ds4s. Note that you MUST finish the FINAL alone. You CANNOT share or discuss the FINAL with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.
03/24: #HW01: Orange
Network Intrusion Detection
You need to include the following mechanisms by Orange (but not limited to)
feature engineering
at least 5 different models
k-fold validation
Evaluation index
Due before the class (04/01 09:00 UTC+8). Submit your classification result (a pdf file) to TA. TA will send you the submission instruction by email.
03/31: #HW02: PCA
Due 04/15 09:00 UTC+8. Send a copy of your colab file to TA.
04/14: #HW03: Static Analysis
Extract features from 40 malware samples and try to classify/cluster them into 4 groups.
Due 04/21 09:00 UTC+8. Send a copy of your colab file to TA.
05/06: #HW04: PE image and CNN
Given a set of malware PE files and their malware family labels, design a CNN deep neural network to classify these PE files into their families. Here is HW04. Copy it to your own colab and finish it.
Due 05/20 09:00 UTC+8. Send a copy of your colab file to TA.
05/18: #HW05
Based on your #HW04, try to design an NN-based classifier (it needs to include an autoencoder or PCA) to classify malware samples into the correct family. Answer the following questions in your colab file.
Q0: Your name, department, ID.
Q1: PCA or AutoEncoder? Which is better for data representation? Why?
Q2: What is your design of NN (dense/convolution/maxpooling/softmax)? Why did you choose it? What is your classification accuracy rate if PCA/AE is introduced?
Q3: Implement certain ML algorithms on the same dataset. Is ML better than NN (or not)? Why?
Due 06/03 09:00 UTC+8. Send a copy of your colab file to TA.
06/04: #BN01 & #HW06
#BN01: COVID-19
https://github.com/CSSEGISandData/COVID-19
4% of final points will be awarded.
You may see the example made by Taiwan CDC. [link]
Send your code/webpage to TA before FINAL (6/24 9 am UTC+8.)
#HW06: Network Intrusion Detection
Kaggle Data Set, use the training data only for training and only use testing data for testing.
Please design a neural network and include the following concepts.
feature engineering, dimension reduction, k-fold testing, validation.
Here is an example on Kaggle. Note that this example is not NN.
Due 6/17 9 am UTC+8.
Homework (40%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (20%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and uploaded GitHub code are required.
Midterm and Final (40%)
The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You MUST read it carefully before submitting your first homework. It lets you know exactly how you will be assessed, and it helps facilitate academic integrity.
Plagiarism is a serious breach of academic trust. In academic work, our words, ideas and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for ideas, writing, words, or programs from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone has an occasional problem and finishes their work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.