Data Science for Cybersecurity (2024)
This introductory course is for students interested in cybersecurity analysis using machine learning methods. Students will become familiar with the tools, algorithms, concepts, and execution environments needed to analyze cybersecurity data, and will learn to act as architects who solve security-related problems with data analysis algorithms and tools. The class covers related security concepts, data analysis theory, research papers, and background knowledge. We will also introduce several security systems that use data analysis algorithms to achieve their security goals.
Students should have taken programming courses beforehand, such as Programming Language I/II. The programming language used in this class is Python (we will NOT give any Python tutorial), and we will use TensorFlow and Keras for AI-based analysis. You MUST be comfortable writing programs, able to search for solutions in online documentation and on Stack Overflow, and able to debug on your own. This course REQUIRES students to write Python scripts for homework and projects.
If you have trouble with programming, first review the P01~P04 references in our GitHub. If that code (which we will use in class) is too difficult for you to follow, this course may not be suitable for you.
Note that this course is designed for students in their third or fourth year of college. If you have already taken an advanced AI/ML/DM course, you may want to skip this one.
Announcements (Spring 2024)
-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond later.
-/-: Students who want to enroll in this class should come to the classroom with their enrollment document in the 2nd and 3rd weeks. We will check how many seats are available for additional students.
02/22: Classroom and homework password!
The classroom will be changed to #250206 (Research Building 2F 研究大樓 2樓)
Students MUST attend the class on 2/29 to get their (hardcopy) password for homework submission. You are NOT allowed to collect the password for somebody else. Students who are not officially on the class enrollment list MUST get the password directly from Prof. Hsiao.
04/27:
For students attending the English Teaching Course: I am sorry that we cannot finish this week's videos by Friday, due to the earthquake interruption and the lecturer losing his voice. The videos will be online as soon as possible this weekend.
Class Info
Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: NCCU Research Building Room #250206 (changed from College of Commerce Building Room #260102)
TA: Mr. Lo, 112356021 at g.nccu.edu.tw
Homework Submission: http://hsiaom.nccu.edu.tw:8888/
YouTube Channel: https://www.youtube.com/@nccu.hsiaom/
For English-speaking students, please see the [DS4CS][24][En] playlist. It will be updated every Friday.
For Mandarin-speaking students, please see the [DS4CS][24] playlist. It will be updated every Thursday.
Office Hours: By appointment.
Course Objectives & Learning Outcomes
Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, and machine learning algorithms for security applications.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
Understand visualized machine learning tools: Orange.
Topics (Spring 2024)
Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Language Model
Data Visualization
GUI-based Analysis Tool
References (Spring 2024)
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
Schedule (Spring 2024)
Introduction (M01, M02)
Security Management
Malicious Software and Cyberattacks
CoLab and Orange
Datasets
Regression (M03)
Model and Data Table
Linear Regression (MSE, Gradient Descent)
Classification (M04, M05)
Logistic Regression (Cross-Entropy)
Support Vector Machine
Evaluation
Orange Workflow
Tree (M06)
Tree and Random Forest
Entropy, Information Gain, Gini, Chi-square, Variance
Clustering (M07)
Distance
K-means
Hierarchical Clustering
DBSCAN
Problematic Data (M08, M09)
Dimension Reduction, PCA
Problematic Data
Midterm (4/1-4/10)
Neural Network Basics (N01)
Convolution (N02)
LeNet
Static Analysis: Windows PE file and image analysis (D01)
Recurrent NN (N03)
Understanding LSTM Networks (N03-1)
LSTM, GRU, ResNet
Dynamic Analysis: Malware call and sequence analysis (D02)
Text classification with an RNN (N03-2)
Latent Space
Auto-Encoder (N04)
denoising, anomaly detection
convolutional, variational AE (N04-2)
Activation Function (N05)
activation
optimizing gradient-descent
Language Model (N06)
word2vec (cbow, skip-gram), fastText (supervised, unsupervised)
Transformer, Self-Attention, BERT
Language Model
Pre-train and fine-tune
LoRA
Packet Analysis (D03)
Anomaly Detection
One-class SVM
Self-Organizing Map
Advanced DL
Classification on imbalanced data (class weights, bias, early stopping)
Multi-modality, Multi-fusion
Final (6/3-6/12)
Project Demo (6/13)
=== Under Construction ===
TBA (6/20)
Assignment (Spring 2024)
You can find each homework Colab file and its corresponding data (in the data folder) in our GitHub.
H-ZeroAccess (5%)
H-NID (5%)
No homework template is provided. You need to use Orange to create a workflow with several models on the NID datasets and perform classification to evaluate these models.
Screenshot your workflow and parameters, paste them into your favorite editor, and convert the document to a pdf file. Put your student ID, name, and department on the first page and submit your pdf.
Network Intrusion Detection
H-calls4famCluster (6%)
H-PCA (6%)
No homework template is provided. You need to use Orange to create a workflow with several models on the calls4famCluster datasets.
You need to perform PCA before applying any model. Compare the results with those of the same models without PCA.
Screenshot your workflow and parameters, and convert the document to a pdf file. Put your student ID, name, and department on the first page and submit your pdf.
H-PEimage (6%)
Use a small set of PE binary samples to perform CNN classification.
H-calls4famRNN (6%)
Use the same dataset for RNN classification.
H-SMSSpam (O) (8%)
Use Orange and the above dataset to perform spam detection. Please screenshot your Orange workflow and results, and upload a single pdf file to the homework system.
Try different combinations of widgets to get the detection accuracy as high as possible, especially 'bag of word', 'document embedding', and 'preprocess text'. You may also preprocess the data with Python in advance.
=== Under Construction ===
H-LoRASpam (8%)
Project (10%)
Grading Policy
Homework (50%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (10%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and GitHub code are required.
Midterm (20%)
Final (20%)
The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, which also helps support academic integrity.
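As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name `course_total`, the dictionary keys, and the 0-100 scale for each component are assumptions made for this example; the syllabus only states the percentages.

```python
# Component weights in percent, taken from the grading policy above.
WEIGHTS = {"homework": 50, "project": 10, "midterm": 20, "final": 20}


def course_total(scores: dict[str, float]) -> float:
    """Return the weighted course total on a 0-100 scale.

    `scores` maps each component name to a score in [0, 100].
    This scale and the component names are assumptions for illustration.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    # Multiply each score by its percentage weight, then normalize by 100.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    example = {"homework": 90, "project": 80, "midterm": 70, "final": 60}
    print(course_total(example))
```

The weights sum to 100, so a student with full marks in every component gets a total of 100.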
Academic Integrity
Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting their source or authorship, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Cheating usually arises out of desperation, and everyone occasionally runs into problems and finishes their work late. This class therefore accepts late homework submissions, with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.
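The late penalty can be sketched in Python as follows. This is one possible reading, not the official rule: it assumes the 15% is deducted from the original score for each day late (not compounded), that the score never drops below zero, and that `days_late` is already a whole number of days. The function name `late_score` is hypothetical.

```python
def late_score(raw_score: float, days_late: int, penalty_per_day: float = 0.15) -> float:
    """Return the score after a flat per-day late deduction.

    Assumptions (not stated in the syllabus): the penalty is linear in the
    number of days late, applies to the raw score, and is floored at zero.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    # Fraction of the raw score that remains after the deduction.
    remaining = max(0.0, 1.0 - penalty_per_day * days_late)
    return raw_score * remaining


if __name__ == "__main__":
    for days in range(0, 8):
        print(days, round(late_score(80, days), 2))
```

Under these assumptions, a homework graded 80 and submitted two days late keeps 70% of its score, and a submission seven or more days late earns nothing.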