This course is an introductory class for students who major in Management Information Systems and are interested in cybersecurity analysis. It focuses on data analysis topics. Students should become familiar with the tools, algorithms, concepts, and execution environment needed to perform data analysis on cybersecurity data, and should learn to act as architects who solve security-related problems using data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.
Note that students should take Programming Language I and Programming Language II (i.e., two semesters of programming courses) before taking this class. The programming language used in this class is Python (however, we will NOT provide any Python tutorial), and we will use TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to implement Python scripts.
Note that this course is designed for students who are in their third or fourth year of college in the MIS department. If you have taken any advanced AI/ML/DM course, you may want to skip this course.
-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond later.
-/-: For the students who want to enroll in this class, please come to the classroom with your enrollment document in the 2nd and 3rd weeks. We will check how many seats are available for the additional students.
2/16: The first class will be broadcast live at 9:10 am (UTC+8, Taiwan time) on the YouTube channel: https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA. The syllabus, Security Management, and Malware will be covered (in English).
If you miss the syllabus, please watch the video Syllabus22 carefully.
2/23: To take this course, you MUST read this document first. It tells you which videos, code, homework, and additional materials you need to work with.
3/1: A Google Classroom (ufyyt6s) has been created for this class. Please use your '@g.nccu.edu.tw' account to sign in and submit your first homework -- H-ZeroAccess. Since it is our first time using Google Classroom, the due date is now March 8 (AoE). If you have any questions, please contact the TA (or send your homework to the TA via email in case something goes wrong).
3/3: Homework O-COVID19 announced; due 2022.Mar.13 (AoE). Please go to our Google Classroom. Remember to log in with your g.nccu.edu.tw account.
3/8: Since NCCU's Google Classroom only accepts g.nccu.edu.tw accounts, here are some instructions for getting a g.nccu.edu.tw account.
3/12: Homework H-NID announced; due 2022.Mar.24 (AoE). Please submit it via our Google Classroom.
4/5:
Homework H-StaticAnalysis_242variants announced; due 2022.April.18 (AoE). Please submit it via our Google Classroom.
Homework H-Dynamic announced; due 2022.April.22 (AoE). Data is here, and no code is provided (you may use M11 or T11).
4/17
Data Science for Cybersecurity, 2022@NCCU Midterm: 2022/04/14 (Thu) 09:00 to 2022/04/17 (Sun) 23:59, UTC+8.
Please see Google Classroom to get the Colab file (it will be announced at 9 am on 04/14), or Midterm File Here!
4/27
Due to the COVID-19 epidemic, all remaining classes will be held online. Students will NOT need to go to the classroom. Make sure you visit the class website once a week and submit your homework on Google Classroom.
The [Mandarin] classes will be broadcast live on our YouTube Channel (https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA) every Thursday from 9 a.m. to 12 p.m.
The [English] classes are pre-recorded and edited. You can find them in this playlist (https://www.youtube.com/playlist?list=PL_ExBw9oE-6iHE31yoeZ_5HgaCsgpyMtk).
5/11
Due to the COVID-19 epidemic, we will NOT have the final project demo. Please prepare your final project as follows.
1) A two-page proposal (DUE May 26) that specifies what kind of "AI-based security system" you would like to develop.
An introduction to the dataset you would like to use (if possible, find a publicly available dataset).
What preprocessing would you like to perform? Why?
What algorithms will you use to analyze the data? Why?
The expected results.
2) A final report (DUE June 13) that includes your code and the following analysis (including, but not limited to, the items below).
The description of your dataset (may include missing rate, mean/std.dev of each column, number of logs, correlation, etc.)
Insights into your dataset after you perform exploratory data analysis.
The preprocessing applied (for example, feature engineering, missing-value handling, data balancing, etc.)
The model design. Why such a design? Pros and cons.
Experiment process (e.g., training, testing, 10-fold or 5-fold cross-validation, validation, etc.)
Experiment result (e.g., confusion matrix, loss, etc.)
Insights from your results.
Your code.
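As a rough illustration of the dataset description and the k-fold experiment process listed above, the sketch below computes a per-column missing rate, mean, and standard deviation, and splits sample indices into k folds. It uses only the Python standard library. The toy column values, the use of None for missing entries, and the helper names (describe_column, k_fold_indices) are assumptions made for this example and are not part of the course materials.

```python
import random
import statistics


def describe_column(values):
    """Return missing rate, mean, and sample std. dev. of one column.

    Missing entries are represented by None (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    missing_rate = 1 - len(present) / len(values)
    mean = statistics.mean(present)
    std = statistics.stdev(present) if len(present) > 1 else 0.0
    return {"missing_rate": missing_rate, "mean": mean, "std": std}


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical column, e.g. bytes transferred per log entry.
column = [120, None, 80, 100, None, 100]
stats = describe_column(column)
print(stats)

for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), "training samples,", len(test_idx), "test samples")
```

Every sample appears in exactly one test fold, so each model is evaluated on data it was not trained on.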
5/12, 6/13
The final exam will be held online from June 13 to June 19.
The final exam Colab file is here.
Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: NCCU Commerce Building, Room 260313
Google Classroom: https://classroom.google.com/c/NDc3NTM5MTU3NDUx?cjc=ufyyt6s
YouTube: https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA
TA: 109356017 at g.nccu.edu.tw
GitHub: https://github.com/hsiaom26/DS4CS/ and https://github.com/hsiaom26/DS4CS22
Office Hours: By appointment.
QR codes: 306732001 (undergraduate, upper image), 356378001 (graduate, lower image)
Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, machine learning algorithms for security application.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
Understand a visual machine learning tool: Orange.
Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Data Visualization
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
Introduction to Cybersecurity
Supervised Learning (I)
CoLab [gpu-test.ipynb, VoD] [M02]
Python Data Science Packages [PythonDSPackages.ipynb]
GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide)
Supervised Learning (II)
Supervised Learning (III)
Unsupervised Learning: clustering
Unsupervised Learning Algorithms [VoD] [M07]
Distance Functions, Scales, Similarity [DM0.pptx]
UPGMA.ipynb (UPGMA Walkthrough by Dr. R. Edwards)
knn.ipynb (a supervised algorithm, self-reading material)
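To make the distance and similarity functions mentioned above concrete, here is a minimal sketch of four common measures written with the standard library only. The choice of these four measures and the toy inputs (two 2-D points and two sets of system-call names) are assumptions for illustration; the course slides may use different examples.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union of two sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


p, q = [0.0, 3.0], [4.0, 0.0]
print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_similarity(p, q))  # 0.0 (orthogonal vectors)

calls_a = {"open", "read", "close"}
calls_b = {"open", "write", "close"}
print(jaccard_similarity(calls_a, calls_b))  # 0.5
```

Distances suit numeric features, while set-based similarity such as Jaccard is often convenient for categorical features like sets of API or system-call names.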
Unsupervised Learning: dimension reduction
Program Analysis
[Hash], Digital Signature, Windows Signature, VirusTotal [VoD]
Ref: [Endianness]
PE file parsing [VoD]
Entropy (again!)
Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
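The hash and entropy items above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: SHA-256 is picked as the digest, entropy is computed over whole byte strings rather than per PE section, and the two test inputs are synthetic.

```python
import hashlib
import math
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Hex digest that can serve as a file identifier for lookups."""
    return hashlib.sha256(data).hexdigest()


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


uniform = bytes(range(256))  # every byte value exactly once
constant = b"A" * 256        # a single repeated byte

print(sha256_hex(constant))
print(byte_entropy(uniform))   # maximum entropy for byte data
print(byte_entropy(constant))  # minimum entropy
```

High entropy is commonly read as a hint of compressed or encrypted content, which is why it is a popular static feature, though it is not proof of packing on its own.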
Network Trace and System Log
Network: Packet Capture, Netflow [T12, VoD] [M12]
PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
Some PCAP examples: https://www.netresec.com/?page=PcapFiles
Event: Syslog, auditd (we will use NN to analyze logs later)
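As a small example of turning system logs into structured records before analysis, the sketch below parses lines in the traditional syslog layout (timestamp, host, process, optional PID, message) with a regular expression. The sample lines, host name, and field names are made up for illustration, and real logs may use other layouts that this pattern does not cover.

```python
import re
from collections import Counter

# Pattern for lines such as:
#   "Mar  1 09:10:01 web01 sshd[2211]: Failed password for root ..."
SYSLOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)


def parse_syslog_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = SYSLOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log excerpt (addresses come from a documentation range).
sample_log = [
    "Mar  1 09:10:01 web01 sshd[2211]: Failed password for root from 203.0.113.5 port 4022 ssh2",
    "Mar  1 09:10:03 web01 sshd[2211]: Failed password for admin from 203.0.113.5 port 4023 ssh2",
    "Mar  1 09:11:15 web01 CRON[2301]: (root) CMD (run-parts /etc/cron.hourly)",
    "not a syslog line",
]

records = [r for r in map(parse_syslog_line, sample_log) if r is not None]
failed_logins = Counter(
    r["process"] for r in records if "Failed password" in r["message"]
)
print(len(records), "parsed records")
print(failed_logins)
```

Once each line is a dictionary of fields, counts per process, host, or time window can be used as features for later models.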
Midterm. No class.
Deep Learning Basics (T13byMIT: GitHub, Youtube)
Neural Network Structure and Regression [M13] [VoD][VoD]
[API Doc] Sequential Model, Sequential Class, Activation Function, Loss Function, Optimizer
[Wiki] Activation function
[Wiki] Loss Function
sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
Optimizer: RUDER's blog [EN], RUDER's paper [EN] [VoD]
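The difference between the two cross-entropy variants listed above is only the label format. The sketch below reproduces the underlying formula for a single sample in plain Python to show that an integer label and its one-hot encoding give the same loss. It mirrors the math only; it is not the Keras implementation, which also handles batching and numerical details. The function names reuse the loss names for readability, and the probabilities are made up.

```python
import math


def categorical_crossentropy(one_hot, probs):
    """Cross-entropy when the true label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy when the true label is an integer class index."""
    return -math.log(probs[label])


def to_one_hot(label, num_classes):
    """Convert an integer class index into a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


# Hypothetical softmax output for a 3-class problem; true class is 1.
probs = [0.1, 0.7, 0.2]
label = 1

loss_sparse = sparse_categorical_crossentropy(label, probs)
loss_one_hot = categorical_crossentropy(to_one_hot(label, 3), probs)
print(loss_sparse, loss_one_hot)
```

In practice the choice follows the label format of the dataset: integer labels avoid building large one-hot arrays when there are many classes.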
Neural Network Structure and Convolution [M14]
Convolution: Conv2D, Conv1D, kernel size [VoD]
Pooling: MaxPooling2D, AveragePooling2D [VoD]
Flatten layer
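To show how convolution, pooling, and flattening change the shape of the data, here is a dependency-free sketch of the three operations on a single-channel image. It assumes no padding, stride 1 for the convolution, and non-overlapping pooling windows; the helper names, the 5x5 image, and the 2x2 kernel are made up, and this is not how Keras implements these layers internally.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def flatten(fmap):
    """Turn a 2-D feature map into a 1-D vector for a dense layer."""
    return [value for row in fmap for value in row]


image = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5x5 input
kernel = [[1, 0], [0, -1]]                                 # 2x2 kernel

feature_map = conv2d_valid(image, kernel)  # 4x4
pooled = max_pool2d(feature_map, size=2)   # 2x2
vector = flatten(pooled)                   # length 4

print(len(feature_map), "x", len(feature_map[0]))
print(len(pooled), "x", len(pooled[0]))
print(vector)
```

With no padding, an n x n input and a k x k kernel give an (n - k + 1) x (n - k + 1) feature map, which is why the 5x5 image shrinks to 4x4 before pooling halves it.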
Overfitting (T14: by Google, no VoD, self-learning material)
Validation dataset
Dropout (will be covered in a later VoD)
Ridge and Lasso (will be covered in a later VoD)
Latent Space
Google: "MNIST latent space" to see the latent space of handwritten digits [VoD]
Latent Space II [T16 no slides, see VoD][M16]
Activation Function Visualization [VoD]
Ref: EN A visual proof that neural nets can compute any function
Ref: EN ConvnetJS demo by karpathy
Ref: EN playground.tensorflow.org
Ref: C LeeMeng
Sensitivity of a Model to Neural Network Hyperparameter [jpg, VoD]
Text Machine Learning Workflows (Orange) [T17-1/M17(O) VoD only]
Example: Spam Mail Filter: Spambase Data Set
RNN + Language Model
MIT 6.S191: Recurrent Neural Networks and Transformers [YouTube by MIT]
Codes [T17-2&3, see VoD & VoD]
Word Embeddings, Word2Vec [ref: EN, TC]
Transformer and BERT
Example: Attack Lifecycle
Intrusion Detection
Paper Reading (Please download the papers and read them on your own. I highlight important concepts in the papers.)
An Introduction to Intrusion Detection by Aurobindo Sundaram.
R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, no. 4, 2002, pp. supl27-supl30.
Additional material (if you have time to read it): E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
Code Review (covered in the class)
Security Datasets for your information
Anomaly Detection
Paper Reading (Please download the papers and read them on your own. I highlight important concepts in the papers.)
V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A sense of self for Unix processes," in Proc. 1996 IEEE Symposium on Security and Privacy (S&P), May 1996, pp. 120-128.
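As a simplified illustration of sequence-based anomaly detection on system-call traces, the sketch below builds a profile of short call windows seen in normal traces and scores a new trace by the fraction of windows that were never seen. It is written in the spirit of the readings above but is not the exact algorithm of either paper; the window length of 3, the function names, and the toy traces are assumptions.

```python
def windows(sequence, n):
    """Return all contiguous windows of length n as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def build_profile(normal_traces, n=3):
    """Collect every window observed in the normal traces."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, n))
    return profile


def anomaly_score(trace, profile, n=3):
    """Fraction of windows in the trace that are absent from the profile."""
    trace_windows = windows(trace, n)
    if not trace_windows:
        return 0.0
    unseen = sum(1 for w in trace_windows if w not in profile)
    return unseen / len(trace_windows)


# Hypothetical traces of system-call names.
normal_traces = [
    ["open", "read", "mmap", "read", "close"],
    ["open", "read", "read", "close"],
]
profile = build_profile(normal_traces)

normal_like = ["open", "read", "mmap", "read", "close"]
suspicious = ["open", "read", "execve", "socket", "close"]

print(anomaly_score(normal_like, profile))
print(anomaly_score(suspicious, profile))
```

A threshold on the score then separates traces treated as normal from those flagged for inspection; where to set it is a trade-off between false alarms and missed detections.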
Code Review
Anomaly Detection
Novelty and Outlier Detection by scikit-learn
Anomaly Detection using Machine Learning by projectpro
Anomaly Detection by O'Reilly
Demo
No Class. Please finish your final project!
Homework: watch "K-means and Self-Organizing Map" [YouTube]
Final exam.
Note that you MUST finish the FINAL alone. You CANNOT share or discuss it with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.
H-ZeroAccess (4%)
Copy the Colab file to your Google Drive and answer the questions.
Due 2022.Mar.08 (AoE), extended from 2022.Mar.03.
Submit to Google Classroom using your @g.nccu.edu.tw account.
O-COVID19 (4%)
https://github.com/CSSEGISandData/COVID-19
Example by Taiwan CDC. [link]
Submit to Google Classroom; due 2022.Mar.13 (AoE).
Upload a PDF file containing your Orange data analysis flow and screenshots of your analysis.
This is open homework; feel free to do any analysis on COVID-19 data.
H-NID (5%)
Network Intrusion Detection
You need to include the following mechanisms using Orange (including, but not limited to):
feature engineering
at least 5 different models
k-fold validation
Evaluation index
Due 2022.Mar.24 (AoE). See Google Classroom.
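Although this homework is done in Orange, it may help to see what common evaluation indices compute. The sketch below derives confusion-matrix counts and then precision, recall, F1, and accuracy for a binary intrusion/normal task. The label encoding (1 = attack, 0 = normal), the toy predictions, and the function names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def evaluation_indices(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical ground truth and predictions (1 = attack, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = evaluation_indices(*counts)
print(counts)
print(metrics)
```

On imbalanced security data, accuracy alone can look good while attacks are missed, so precision and recall are usually reported alongside it.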
H-PCA (6%)
Use all you have learned to analyze three datasets we have used before.
Due 2022.Mar.31 (AoE)
Submit to Google Classroom with your .ipynb file (not a link to your colab).
H-Static (6%)
Extract features from 242 malware samples and try to classify/cluster them.
Due 2022.April.18 (AoE)
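One possible source of static features is the PE header itself. The sketch below reads a few header fields with the standard struct module, using little-endian byte order: the MZ magic, the offset of the PE signature stored at 0x3C, and the first fields of the COFF header. To stay self-contained and safe to run, it parses a synthetic header built in the code rather than a real sample; the function names and the chosen field values are assumptions, and a full parser handles many more fields and malformed files.

```python
import struct

MACHINE_NAMES = {0x014C: "i386", 0x8664: "x86-64"}


def parse_pe_header(data: bytes):
    """Extract a few header fields that can serve as static features."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew: 4-byte little-endian offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # COFF header starts right after the 4-byte signature.
    machine, num_sections, timestamp = struct.unpack_from(
        "<HHI", data, pe_offset + 4
    )
    return {
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "number_of_sections": num_sections,
        "timestamp": timestamp,
    }


def make_fake_pe(machine=0x8664, num_sections=3, timestamp=1650000000):
    """Build a minimal synthetic header for demonstration only."""
    dos_header = bytearray(0x40)
    dos_header[:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 0x40)  # PE header follows
    coff = struct.pack("<HHI", machine, num_sections, timestamp)
    return bytes(dos_header) + b"PE\x00\x00" + coff


features = parse_pe_header(make_fake_pe())
print(features)
```

Fields like the section count and timestamp can be combined with other static features (for example strings or entropy) before classification or clustering.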
H-Dynamic (6%)
Use T11 or M11 to analyze 272 malware samples.
Due 2022.April.22 (AoE)
H-PEimageCNN
Given a set of malware PE files and their malware family labels, please design a CNN deep neural network to classify these PE files into their families. H-PEimageCNN.ipynb
Due 2022.May.8 (AoE)
Here is an SMS Spam Collection Data Set.
Design NN-based classifiers to determine whether a message is spam. You may want to use fasttext as your English word embedder.
You MUST design TWO neural network models: one with an AutoEncoder (AE) and another without one. Show us which one is better.
You MUST apply ONE ML classifier, and show us whether the NN is better.
Homework (40%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (20%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.
Midterm (20%)
Final (20%)
The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, and it helps to support academic integrity.
Plagiarism is a serious breach of academic trust. In academic work, our words, ideas and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally has a problem and finishes their work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.