This course is an introductory class for students interested in cybersecurity analysis using machine learning methods. Students will become familiar with the tools, algorithms, concepts, and execution environment needed to perform data analysis on cybersecurity data. Students will learn to act as architects who solve security-related problems with data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.
Note that students should have taken programming courses beforehand, such as Programming Language I/II. The programming language used in this class is Python (however, we will NOT cover any Python language tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be comfortable writing programs, able to search for solutions in online documentation and on Stack Overflow, and able to debug on your own. This course REQUIRES students to implement Python scripts in homework and projects.
Note that this course is designed for MIS graduate students for Advanced Information System Development. The class runs for 16 weeks.
-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond there.
04/14: Midterm announced! Download midterm file here! (Opens a Colab page.) Due 2025/4/22 23:59.
06/11: Final announced. (https://colab.research.google.com/drive/1EyfwLYEjIn1xc-WnQHHhwtW2gNrfXuBU)
Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Wednesday D56 (13:10~16:00 UTC+8)
Course ID: 356813001, 356017001, 791020001
Classroom: NCCU College of Commerce Building Room #260310
TA: 111356509, 112356021 at g.nccu.edu.tw
Homework Submission: http://hsiaom.nccu.edu.tw:8888/
Office Hours: By appointment.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, and machine learning algorithms for security applications.
Understand the neural network structures and algorithms.
Understand how to use language models to analyze security-related data.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Sequence Analysis
Language Model
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
W1 (02/19): Regression (M03)
Model and Data Table
Linear Regression (MSE, Gradient Descent)
W2 (02/26): Classification (M04)
Logistic Regression (Cross-Entropy)
Support Vector Machine
Evaluation
W3 (03/05): Tree (M06)
Tree and Random Forest
Entropy, Information Gain, Gini, Chi-Square, Variance
W4 (03/12): Clustering (M07)
Distance
K-means
Hierarchical Clustering
DBSCAN
Malware Calls (f14s1940_callonly_tfds)
W5 (03/19): Problematic Data (M08, M09)
Dimension Reduction, PCA (M08)
W6 (03/26): Neural Network
(04/02): No class.
W7 (04/09): Recurrent NN (N03)
Static Analysis: Windows PE file and image analysis (D01)
Understanding LSTM Networks (N03-1)
LSTM, GRU, ResNet
Dynamic Analysis: Malware call and sequence analysis (D02)
Text classification with an RNN (N03-2)
W8 (04/16): Midterm, no class. (Take-home exam, due before 04/23.)
W9 (04/23): Latent Space
denoising, anomaly detection
activation
optimizing gradient-descent
W10 (04/30): Language Model (N06)
word2vec (cbow, skip-gram), fastText (supervised, unsupervised)
Transformer, Self-Attention, BERT
W11 (05/07): Language Model
Basic text classification (N06-2), Classify text with BERT (N06-3)
HuggingFace NLP Course (N06-4)
1. Transformer Models, 2. Using Transformers, 3. Fine-Tuning a Pretrained Model
W12 (05/14): Language Model and Others
(05/21): No class. University Anniversary.
W13 (05/28): Large Language Model
NLP Course, Diffusion Course
W14 (06/04): Anomaly Detection
Variational Autoencoder (N04-2)
V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
One-class SVM
Self-Organizing Map
W15 (06/11): Project Demo
W16 (06/18): Final, no class. (Take-home exam, due 06/18 at 23:59.)
Five homework assignments are expected. Announcement dates are:
03/05 H-NID
https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection
03/26 H-calls4famPlusPCA
Find the homework template in the above link. You may use any tools to analyze the data (and show/post the results in an ipynb file).
Finish it and upload your ipynb file to the homework system before April 8th, 23:59.
04/14 midterm announcement
Download midterm file here! Due 2025/4/22 23:59. (PE by CNN)
04/30
5/6 Announced. Due 5/21 23:59.
05/14 (5/16 announced)
HuggingFace H-calls4famLM
1) This homework is similar to H-calls4fam_TFDS_rnn, but here you must use a language model from Hugging Face to complete it. You may use any language model.
Make sure you print the classification accuracy obtained from the previous homework (i.e., rnn, 'skip-gram', 'cbow' or 'SBERT') and the result of your newly selected language model.
2) The second part of this homework is re-pre-training. You have to use MLM (masked language modeling, shown in the following URL) to improve the language model. Then perform classification again to show whether the results improve.
05/28, Announced 6/3
Due 06/18 23:59 (same as FINAL)
06/04 project announcement
A one-week sprint project.
Try to analyze a security-related dataset with a language model (and/or other models).
You should upload a PDF as the final report that contains:
title, goal, downstream task(s)
where we can find your complete code
dataset introduction
data preprocessing
model used
results
Note that you can output latent vectors from Python and use Orange for later analysis. If so, include a screenshot of the Orange workflow in your PDF report.
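One way to hand latent vectors over to Orange is a plain CSV file, which Orange's File widget can load. The following is only a minimal sketch using the Python standard library: the function name `save_latent_vectors`, the file name `latent_vectors.csv`, the column names `z0..zN` and `label`, and the random placeholder data are all assumptions for illustration, not part of the course materials. In a real project the vectors would come from your own encoder.

```python
import csv
import random


def save_latent_vectors(path, vectors, labels):
    """Write one row per sample: latent dimensions z0..zN followed by a label column."""
    if len(vectors) != len(labels):
        raise ValueError("vectors and labels must have the same length")
    dim = len(vectors[0])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header row so the columns arrive in Orange with readable names.
        writer.writerow([f"z{i}" for i in range(dim)] + ["label"])
        for vec, label in zip(vectors, labels):
            writer.writerow(list(vec) + [label])


# Placeholder data standing in for real encoder output (hypothetical).
random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
labels = ["benign", "malware"] * 3
save_latent_vectors("latent_vectors.csv", vectors, labels)
```

After loading the file in Orange, the label column would typically be marked as the target so that downstream widgets can color or group samples by class.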
06/11 Final announcement. https://colab.research.google.com/drive/1EyfwLYEjIn1xc-WnQHHhwtW2gNrfXuBU
Homework (50%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (10%): students need to write an analysis program on a security-related dataset to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code on GitHub are required.
Midterm (20%)
Final (20%)
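As a quick illustration of how the four weights above combine, here is a small sketch. The weights come from the list above; the function name `course_total`, the 0-100 score scale, and the example scores are assumptions for illustration only, and the sketch ignores any late penalties or other adjustments.

```python
# Component weights as listed in the grading breakdown above.
WEIGHTS = {"homework": 0.50, "project": 0.10, "midterm": 0.20, "final": 0.20}


def course_total(scores):
    """Return the weighted total for component scores given on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"homework": 90, "project": 80, "midterm": 70, "final": 85}
print(round(course_total(example), 2))  # 45 + 8 + 14 + 17 = 84.0
```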
The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, and it helps facilitate academic integrity.
Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting their source or authorship, you are plagiarizing. So here’s the bottom line: submit original work only, and credit any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally has a problem and finishes work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.