Data Science for Cybersecurity (2020)
2021/7/28: Links or files on this page might be lost. Please see the latest version of this course on the Teaching page.
This course serves as an introductory class for students who major in Management Information Systems and are interested in cybersecurity analysis. In this course, we focus on topics related to data analysis. Students should become familiar with the tools, algorithms, concepts, and execution environment needed to perform data analysis on cybersecurity data. Students need to learn to act as architects who solve security-related problems using data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will introduce several security systems that implement data analysis algorithms to achieve their security goals.
Note that students should take Programming Language I, Programming Language II (i.e., two semesters of programming courses), and Business Data Communication (or Computer Networks) before taking this class. The programming language used in this class is Python (but we will NOT give a Python language tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to find solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to write Python scripts in class, so please bring your laptop to class.
Note that this course is designed for students who are in their third or fourth year of college in the MIS department. If you have taken any advanced AI/ML/DM course, you may want to skip this course.
Announcements (Spring 2020)
2/4: Due to the outbreak of the novel coronavirus, the schedule of this class has been changed as follows. This class starts on March 2 (i.e., a 3-hour lecture will be given on this day), and please bring your enrollment document to class on March 9. ONLY MIS students, double-major students, and MIS minor students are allowed to enroll in this class, subject to the availability of seats in the classroom.
3/23: Homework upload and class roll call QR code. Here.
4/21: NO CLASS on 04/27. The mid-term exam will be announced on 04/27 at 19:00. You MUST send your .ipynb file (as a Colab share link) to the TA via email before 04/30 23:59. Late submission is allowed, but with a 20% per day penalty. You CANNOT discuss the midterm with anyone.
4/27: Midterm. Please copy this Colab file to your Google Drive, and finish it before the due date. Send your shared link to the TA (108356041 at g.nccu.edu.tw).
5/7: #PJ01: Term Project Proposal
Please upload a document that includes Dataset (you want to use), Problem (you want to solve), Preprocess (you want to apply), Models (you want to use), and Expectation.
Due 05/13 at 23:59. Upload your file here (use your student ID as filename).
6/15: Upload your final project today. Zip your presentation file, document, and code into a single .zip file. Upload your zipped file here. Do not forget to use your student ID as the file name.
6/15: The Final Exam will be announced at 9pm on 6/15. The due date is 6/23 23:59. Please upload here in PDF format. Do not forget to use your student ID as the file name. Download here [.docx].
Class Info
Instructor: Shun-Wen Hsiao, NCCU MIS Dept., hsiaom at nccu.edu.tw
Lectures: Monday D56 (13:10~16:00)
Classroom: #260102 (College of Commerce Building)
TA: 108356041 at g.nccu.edu.tw
Office Hours: Monday 16:00~17:00 (i.e., after class).
Course Objectives & Learning Outcomes
Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, machine learning algorithms for security application.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system and sequence analysis system.
Understand visualized machine learning tools: Orange
Topics (Spring 2020)
Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Data Visualization
References (Spring 2020)
Network Security Through Data Analysis, Michael Collins, O'Reilly Media, 2014.
Data-Driven Security: Analysis, Visualization and Dashboards, Jay Jacobs and Bob Rudis, Wiley, 2014.
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
Schedule (Spring 2020)
02/25: No class. Semester starts on 2nd March.
03/02: Introduction to Cybersecurity
T01: Security Management
T02: Cyber Attack
T03: Data Analysis Environment
Python Data Science Packages [PythonDSPackages.ipynb]
GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide), [GPU-test.ipynb]
03/09: Supervised Learning: classification I
T04: Table-Based Data and Data Analysis Process
Model, Linear Regression, Gradient Descent
Training/Testing Process w/ Linear Regression
Cost Function: MSE, RMSE (for regression problem)
Feature Engineering: Normalization, (Dimension Reduction), Missing Values, (Unbalanced Data), Outlier-1
T05: Visualized Machine Learning Tool
Orange (class note)
Data Analysis Workflow
Evaluation: Accuracy, ROC curve, confusion matrix (for classification problem)
Testing process: k-fold validation
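The linear regression, gradient descent, and MSE/RMSE items above can be sketched in plain Python. This is only a minimal illustration on made-up toy data, not the course notebook; the function names, learning rate, and epoch count are assumptions chosen for the example.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target variable."""
    return math.sqrt(mse(y_true, y_pred))

def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w*x + b by batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
preds = [w * x + b for x in xs]
print(round(w, 3), round(b, 3), round(rmse(ys, preds), 4))
```

On real data the features would normally be normalized first, since the usable learning rate depends on the scale of the inputs.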
03/16: Supervised Learning: classification II
T06: Supervised Learning Algorithms
(Linear Regression), Logistic Regression, SVM
Cost Function: Cross-Entropy
T07: Tree-based Classification
Entropy, Gini, Chi, Information Gain, Variance
Decision Tree
Random Forest (using Orange)
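The impurity measures listed under T07 can be written out in a few lines. The sketch below is a plain-Python illustration, assuming class labels are given as a simple list; the helper names and the toy labels are hypothetical and not taken from the course material.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Toy node with two balanced classes (hypothetical labels).
parent = ["malware"] * 4 + ["benign"] * 4
perfect_split = [["malware"] * 4, ["benign"] * 4]
useless_split = [["malware", "malware", "benign", "benign"],
                 ["malware", "malware", "benign", "benign"]]

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A decision tree learner picks, at each node, the split with the highest gain (or the lowest weighted impurity); a random forest repeats this on bootstrapped samples with random feature subsets.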
03/23: Unsupervised Learning: clustering
T08: Unsupervised Learning Algorithms
Distance Functions, Scales, Similarity
UPGMA (UPGMA Walkthrough by Dr. R. Edwards), k-means, k-NN.
Dimension Reduction: PCA & LDA
T09: Problematic Data
Unbalanced (Imbalanced) Data
Outlier-2 (PCA-based)
Overfit and Underfit, Regularization
03/30: T10: Static Analysis
Windows Portable Executable (PE) File Format: Wiki, Microsoft
Try this putty.exe and
PE file parsing
pip install pefile
PE file usage example, Binary Hacking
Entropy (again!)
Digital Signature, Windows Signature
Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
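For the "Entropy (again!)" item, a common static feature is the Shannon entropy of a file's bytes: values near 8 bits per byte often hint at packed or encrypted content, while plain code and text usually score lower. The sketch below is a standard-library illustration under that assumption; `byte_entropy` is a hypothetical helper name and the sample byte strings are made up.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Made-up inputs: constant bytes, a uniform byte distribution, a short text.
print(byte_entropy(b"\x00" * 1024))          # one repeated value
print(byte_entropy(bytes(range(256)) * 4))   # every byte value equally often
print(byte_entropy(b"AABB"))                 # two values, equally often
```

To apply it to the sample binary, one could read the file with `open(path, "rb").read()` and pass the bytes in. To my understanding, the pefile package also exposes a per-section `get_entropy()` method on the entries of `pe.sections`, which gives the same kind of measure section by section.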
04/06: Dynamic Analysis
T11: System Call and API Call
Profiler: Cuckoo
S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. IEEE Symposium on Security and Privacy, 1996.
04/13: Trace and Log
T12: Network: Packet Capture, Netflow
PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
Some PCAP examples: https://www.netresec.com/?page=PcapFiles
Event: Syslog, auditd
04/20:
04/27: (late) Midterm (20%). NO CLASS. The exam will be announced on 04/27 at 19:00. You MUST send your .ipynb file (as a Colab share link) to the TA via email before 04/30 23:59. Late submission is allowed, but with a 20% per day penalty.
05/04: Intrusion Detection
An Introduction to Intrusion Detection by Aurobindo Sundaram.
R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, 2002, supl. 27--30.
Additional material: E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
Security Datasets: Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.
05/11: Anomaly Detection
V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
A. Lakhina, M. Crovella and C. Diot, "Diagnosing Network-Wide Traffic Anomalies," ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 219--230, 2004.
R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning For Network Intrusion Detection," in Proc. IEEE Symposium on Security and Privacy, 2010, pp. 305-316.
05/18: Text-based Analysis with Orange
Orange provides some Text Machine Learning Workflows, see Orange.
# Spam Mail Filter: Spambase Data Set
Orange
Here is an example of Network Analysis by Orange.
Here is the COVID-19 data.
05/25: T13: Deep Learning Basics
Supervised Learning, Unsupervised Learning, Reinforcement Learning
Regression (Sequence Stack, Activation Function, Loss Function, Optimizer)
[Wiki] Activation function, Loss Function
Optimizer: Ruder's blog [EN], Ruder's paper [EN]
Classification (Convolution, Sub-sampling)
Pooling: MaxPooling2D, AveragePooling2D
Structure: Dropout, Flatten, Softmax (see activation function)
Loss: sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
Another Example of CNN and Image Classification.
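The two Keras loss names listed above differ only in how the true label is encoded: an integer class index versus a one-hot vector. The sketch below re-implements the underlying math for a single sample in plain Python to show that both give the same value. It is not the Keras implementation; the function bodies and the example logits are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy when the true label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy when the true label is an integer class index."""
    return -math.log(probs[label])

# Made-up logits for a 3-class problem; the true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss_sparse = sparse_categorical_crossentropy(0, probs)
loss_onehot = categorical_crossentropy([1, 0, 0], probs)
print(abs(loss_sparse - loss_onehot) < 1e-12)
```

In practice the choice follows the label format of the data set: integer labels avoid building large one-hot arrays when there are many classes.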
06/01: Latent Space
Activation Function Visualization [C]
Auto Encoder [keras.io] [Stanford] [tensorflow2.0]
Application: PCA-based Missing Value Imputation & Anomaly Detection
06/08: Latent Space II
K-means and Self-Organizing Map: [YouTube]
Word Embedding, Language Model
06/15: Project Presentation
06/22: Final
Lab/Assignment (Spring 2020)
03/09: #HW01: ZeroAccess
ZeroAccess is a cyber attack. We provide you some data about ZeroAccess incidents (longitude and latitude) and geographical information.
Please tell me: do UFOs cause ZeroAccess infections?
Due at the class (03/23).
03/09: #BN01: COVID-19
A bonus of 2% of the final points will be awarded.
Check out the latest COVID-19 data. See if you can mine/visualize/analyze the data. Bring your code to class on 03/23.
03/16: #HW02: Orange
Network Intrusion Detection on Kaggle
You need to include the following mechanisms using Orange (but you are not limited to them):
feature engineering
at least 5 different models
k-fold validation
Evaluation index
Please register a kaggle account.
Due at the class (03/30). Tell me your classification accuracy (CA).
03/23: #HW03: PCA&datasets
Due at the class (04/06).
04/06: #HW04: Static Analysis
Extract features from 40 malware samples and try to classify/cluster them into 4 groups.
Due at the class (04/13).
04/13: #HW05
Try to cluster these dynamic analysis profiles: hooklog419
Due at the class (04/20).
04/20: #HW06
Try KDDCup99 data. Here is my analysis. You may add any classifiers to analyze the data.
Read this first! http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Due 04/27 at 16:00.
04/27: #PJ01, #BN02
#PJ01: Term Project Proposal
Please upload a document that includes Dataset (you want to use), Problem (you want to solve), Preprocess (you want to apply), Models (you want to use), and Expectation.
Due 05/13 at 23:59. Upload your file here (use your student ID as filename).
#BN02: CORD-19-research-challenge
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
Up to 5% of the final points will be awarded, depending on quality.
Due at the class (05/18). Upload Here.
5/18: #HW07
Spam mail detector.
Please upload your spam mail and normal mail to the repository. Your assignment is to create a spam detector based on all the normal and spam emails uploaded by the students.
Due at the class (05/25). Upload Here.
5/25: #HW08
You are given a set of malware PE files and their malware family labels.
Please design a deep neural network to classify these PE files into their families.
Here is the HW08. Copy it to your own google drive and finish it.
Due at the class (06/01). Upload Here.
6/1: #HW09
Based on your #HW04 and #HW08 (#HW08 has been updated as well), try to design an NN-based classifier (it needs to include an autoencoder) to classify malware samples into the correct family.
Q1: PCA or AutoEncoder? Which is better for data representation?
Q2: What is your design of NN? Your classification accuracy rate?
Q3: Implement certain ML algorithms on the same dataset. Which is better, ML or NN?
Due at the class (06/08). Upload Here.
6/5: Due to a data set problem, #HW09 is now due at the class (06/15). The new data set is announced on 6/8 (please see the new #HW08 as well). Please spend some time on your final project first. Upload your #HW09 Here. #HW09 is our last homework.
6/15: Upload your final project today. See announcement.
Grading Policy
Homework (30%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Class Participation (10%): attendance, discussion. Students are expected to attend classes and participate in class discussions. It’s important that you attend and participate in class; our class meets only once a week, so missing one class represents a substantial portion of the semester. If there are special circumstances requiring you to be out of class, please email me/TA BEFORE class. You should come to class prepared and on time. You get ONE freebie absence. Your second absence is excusable in a dire emergency (e.g., illness, family emergency, flood, volcano, locusts, etc). A third absence can mean you fail the class.
Project (20%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.
Midterm and Final (40%)
The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You MUST read it carefully before submitting your first homework. It lets you know exactly how you will be assessed, and it helps facilitate academic integrity.
Academic Integrity
Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: original work only, and give credit for any ideas, writing, words, or programs from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone has an occasional problem and finishes their work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Any oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.