This course is an introductory class for students interested in cybersecurity analysis using machine learning methods. Students will become familiar with the tools, algorithms, concepts, and execution environments needed to perform data analysis on cybersecurity data. Students will learn to act as architects who solve security-related problems using data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.
Note that students should have taken programming courses beforehand, such as Programming Language I/II. The programming language used in this class is Python (however, we will NOT provide any Python language tutorial), and we will use TensorFlow and Keras for AI-based analysis. You MUST be comfortable writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to implement Python scripts.
Note this course is designed for students who are in their third or fourth year of college. If you have taken any advanced AI/ML/DM course, you may want to skip this course.
02/06: For students who want to enroll in this class, please come to the classroom directly. We will handle the enrollment process on 2/23.
-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond there.
-/-: For the students who want to enroll in this class, please come to the classroom with your enrollment document in the 2nd and 3rd weeks. We will check how many seats are available for the additional students.
English-speaking students MUST read this document before enrolling in this class.
02/16: We have a new classroom 學思樓 #040103 (Xue Si Building, 103 Room). Please go to the new classroom directly.
02/22: The midterm has been moved to 4/20 (i.e., the 10th week). There is no class that day.
02/23: Join Google Classroom by using your g.nccu.edu.tw account with ID qmgp6yj. The first homework (H-ZeroAccess) is announced. Due 2023.03.09 23:59.
03/22: The due date of H-NID has been extended to 3/27 (Monday). Please see the announcement in Google Classroom or contact the TA.
03/27: The homework due before the midterm is listed below. After the midterm, there will be only 2 more homework assignments. Cheer up.
H-PCA: due on 4/8
H-Static: due on 4/15
No class on 4/20
H-Dynamic: due on 4/22
Midterm: announced on 4/20; due on 4/26. (I hope you can finish it before the next class on 4/27.)
04/05:
The class on 4/6 (April 6) will be held as usual in classroom #040103. We will also provide live online streaming (https://www.youtube.com/@nccu.hsiaom/) and video on demand for students not in the classroom. You may come to the classroom, watch the live stream, or watch the VoD later.
04/27:
H-PEimageCNN homework announced. Due 5/10 23:59.
05/22: Final Project! We have swapped the dates of the Final Project demo and the Final Exam!
One or two students form a group. You MUST list all group members in every document and code file.
06/01: Proposal. Hand in a PDF proposal on June 1. It SHOULD include a title, the problem definition, the dataset used, and the experiment plan. We will review your dataset; please do not use a very commonly used dataset (since many works have already focused on them).
06/08: Final Exam! No class! File announced 06/07 at 18:00. Due 06/11 at 23:59.
06/15: Demo. Present your Colab file in class. No presentation file is needed. Hand in your modified Colab file before 6/17. Due to limited time, only some of the teams can demo their work. If you want to demo voluntarily, please contact the TA. If there are too many or too few volunteer teams, we may need to draw straws. Please be prepared!
Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: 學思樓 #040103 (Xue Si Building, 103 Room)
TA: Ms. HSIA, 111356021 at g.nccu.edu.tw
GitHub: [Tx] https://github.com/hsiaom26/DS4CS/ and [Mx] https://github.com/hsiaom26/DS4CS22
Homework: Google Classroom (You MUST have a @g.nccu.edu.tw account) #qmgp6yj.
Office Hours: By appointment.
Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, and machine learning algorithms for security applications.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
Understand visual machine learning tools such as Orange.
Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Language Model
Data Visualization
GUI-based Analysis Tool
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
Introduction to Cybersecurity
Supervised Learning (I)
CoLab [gpu-test.ipynb, VoD] [M02]
Python Data Science Packages [PythonDSPackages.ipynb]
GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide)
Supervised Learning (II)
Supervised Learning (III)
Unsupervised Learning: clustering
Unsupervised Learning Algorithms [VoD] [M07]
Distance Functions, Scales, Similarity [DM0.pptx]
UPGMA.ipynb (UPGMA Walkthrough by Dr. R. Edwards)
knn.ipynb (a supervised algorithm, self-reading material)
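As a small illustration of the distance and similarity functions covered in this unit, the sketch below computes Euclidean distance, cosine similarity, and Jaccard similarity in plain Python. The function names are illustrative assumptions and are not taken from the course notebooks or slides.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(s, t):
    """Size of the intersection divided by the size of the union of two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(s & t) / len(s | t)
```

Euclidean distance suits numeric features on comparable scales, cosine similarity ignores vector length, and Jaccard similarity works on set-valued features such as the set of API names a sample imports.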
Unsupervised Learning: dimension reduction
Program Analysis
[Hash], Digital Signature, Windows Signature, VirusTotal [VoD]
Ref: [Endianness]
PE file parsing [VoD]
Entropy (again!)
Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
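For the entropy topic in this unit, a minimal sketch of byte-level Shannon entropy is shown below. It assumes the input is the raw bytes of a file or of one PE section; the function name is an illustrative assumption, not part of the course code.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)  # how often each byte value occurs
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A value near 8 means the bytes look close to uniformly random, which in static analysis is often treated as a hint (not proof) that the content is compressed, packed, or encrypted.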
Network Trace and System Log
Network: Packet Capture, Netflow [T12, VoD] [M12]
PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
Some PCAP examples: https://www.netresec.com/?page=PcapFiles
Event: Syslog, auditd (we will use NN to analyze logs later)
Deep Learning Basics (T13byMIT: GitHub, Youtube)
Neural Network Structure and Regression [M13] [VoD][VoD]
[API Doc] Sequential Model, Sequential Class, Activation Function, Loss Function, Optimizer
[Wiki] Activation function
[Wiki] Loss Function
sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
Optimizer: RUDER's blog [EN], RUDER's paper [EN] [VoD]
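The two loss names listed in this unit differ only in how the label is encoded. The plain-Python sketch below shows the underlying math for a single sample; it is not the Keras implementation (which works on batches of tensors), and the function names and the probability vector are illustrative assumptions.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Loss for one sample whose label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def sparse_categorical_crossentropy(class_index, probs):
    """Loss for one sample whose label is an integer class index."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]  # hypothetical softmax output for 3 classes
loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_integer = sparse_categorical_crossentropy(1, probs)
```

Both calls score the same prediction against the same true class, so they give the same loss; choose the variant that matches how your labels are stored.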
Midterm. 4/20, 2023. No class.
Neural Network Structure and Convolution [M14]
Convolution: Conv2D, Conv1D, kernel size [VoD]
Pooling: MaxPooling2D, AveragePooling2D [VoD]
Flatten layer
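To make the convolution, pooling, and flatten steps listed above concrete, here is a minimal plain-Python sketch on nested lists. It assumes stride 1 with no padding for the convolution and non-overlapping windows for pooling; the function names are illustrative and this is not how Keras implements these layers. As in most deep learning frameworks, the "convolution" here does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; leftover rows and columns are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def flatten(feature_map):
    """Turn a 2D feature map into a 1D list, row by row."""
    return [value for row in feature_map for value in row]
```

With an H x W input and a k x k kernel, the convolution output is (H - k + 1) x (W - k + 1); pooling then divides each dimension by the window size, and flattening produces the vector that a dense layer consumes.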
Overfitting (T14: by Google, no VoD, self-learning material)
Validation dataset
Dropout (will be covered in a later VoD)
Ridge and Lasso (will be covered in a later VoD)
Latent Space
Search Google for "MNIST latent space" to see the latent space of handwritten digits [VoD]
Latent Space II [T16 no slides, see VoD][M16]
Activation Function Visualization [VoD]
Ref: EN A visual proof that neural nets can compute any function
Ref: EN ConvnetJS demo by karpathy
Ref: EN playground.tensorflow.org
Ref: C LeeMeng
Sensitivity of a Model to Neural Network Hyperparameter [jpg, VoD]
Text Machine Learning Workflows (Orange) [T17-1/M17(O) VoD only]
Example: Spam Mail Filter: Spambase Data Set
RNN + Language Model
Algorithms
M18-1 Understanding LSTM Networks (RNN, LSTM, GRU)
Ref: MIT 6.S191: Recurrent Neural Networks and Transformers [YouTube by MIT]
Example: Attack Lifecycle
Intrusion Detection + Language Model Codes
Paper Reading (Please download the papers and read them on your own. I highlight important concepts in the papers.)
An Introduction to Intrusion Detection by Aurobindo Sundaram.
R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, 2002, suppl., pp. 27-30.
Additional material (if you have time to read it): E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
Code Review (covered in the class)
Security Datasets for your information
Anomaly Detection
Paper Reading (Please download the papers and read them on your own. I highlight important concepts in the papers.)
V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A sense of self for Unix processes," in Proc. 1996 IEEE Symposium on Security and Privacy (S&P), May 1996, pp. 120-128.
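The Forrest et al. paper above profiles normal process behavior using short sequences of system calls. The sketch below is a simplified variant of that idea, assuming fixed-length sliding windows over a list of system call names; it is not the exact scheme from the paper, and the function names and window length are illustrative assumptions.

```python
def windows(trace, n):
    """All contiguous windows of length n in a sequence of system call names."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def build_profile(normal_traces, n=3):
    """Collect every window seen in traces of normal behavior."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, n))
    return profile

def anomaly_score(trace, profile, n=3):
    """Fraction of windows in the trace that never appeared in the profile."""
    seen = windows(trace, n)
    if not seen:
        return 0.0  # trace shorter than one window: nothing to judge
    unseen = sum(1 for w in seen if w not in profile)
    return unseen / len(seen)
```

A score of 0.0 means every window was observed during normal operation, while a score near 1.0 means the trace consists mostly of call sequences the profile has never seen.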
Code Review
Anomaly Detection
Novelty and Outlier Detection by scikit-learn
Anomaly Detection using Machine Learning by projectpro
Anomaly Detection by O'Reilly
Final exam.
Note that you MUST finish the FINAL alone. You MUST NOT share or discuss it with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.
Demo.
Homework: watch "K-means and Self-Organizing Map" [YouTube]
H-ZeroAccess (4%)
Copy and save this Colab file to your Google Drive and answer the questions.
Due 2023.03.09 (23:59 UTC+8)
Submit to Google Classroom using your @g.nccu.edu.tw account.
Upload your colab file (an .ipynb file) to the Google Classroom system; please do NOT send your colab link.
O-COVID19 (4%)
https://github.com/CSSEGISandData/COVID-19
Example by Taiwan CDC. [link]
Submit to Google Classroom; due 2023.03.16.
Upload a pdf file containing your Orange data analysis flow and screenshots of your analysis.
This is an open-ended homework; feel free to do any analysis on the COVID-19 data.
H-NID (5%)
Network Intrusion Detection
You need to use Orange to include the following mechanisms (but you are not limited to them):
feature engineering
at least 5 different models
k-fold validation
evaluation metrics
Due 2023.03.23.
H-PCA (6%)
Use everything you have learned to analyze the three datasets we have used before.
Due 2023.04.08
Submit your .ipynb file to Google Classroom.
H-Static (6%)
Extract features from 242 malware samples and try to classify/cluster them.
Due 2023.04.15.
H-Dynamic (6%)
Use T11 or M11 to analyze 272 malware samples.
Due 2023.04.22.
H-PEimageCNN
Given a set of malware PE files and their malware family labels, design a CNN (convolutional neural network) to classify these PE files into their families. H-PEimageCNN.ipynb
Due 2023.05.10 23:59.
Here is a SMS Spam Collection Data Set.
Design NN-based classifiers to determine if a message is spam or not.
First, you need a word embedder. Please use at least TWO embedders.
You may want to use fastText as your English word embedder.
Or you can use any of the embedders shown in class.
Second, after obtaining the vectors, you MUST implement at least THREE classifiers.
One classifier MUST be a conventional ML classifier, such as LR, SVM, Tree, ...
Another MUST be a neural network classifier.
The third model MUST contain an AutoEncoder or VAE.
Please show us which combination of embedder and classifier has better accuracy.
Homework (40%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (20%): students write an analysis program on a security-related dataset to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.
Midterm (20%)
Final (20%)
The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, which also helps support academic integrity.
Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally runs into problems and finishes work late, this class accepts late homework submissions with a penalty of 15% per day. We encourage you to complete your homework rather than abandon it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.