Advanced Network Technologies and Services Laboratory - Data Science for Cybersecurity 2022

Data Science for Cybersecurity (2022)

This course is an introductory class for students who major in Management Information Systems and are interested in cybersecurity analysis. It focuses on data analysis topics. Students should become familiar with the tools, algorithms, concepts, and execution environment needed to analyze cybersecurity data, and should learn to act as architects who solve security-related problems with data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will also introduce several security systems that use data analysis algorithms to achieve their security goals.

Note that students should complete Programming Language I and Programming Language II (i.e., two semesters of programming courses) before taking this class. The programming language used in this class is Python (however, we will NOT provide any Python tutorial), and we will use TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to implement Python scripts.

Note that this course is designed for third- and fourth-year undergraduate students in the MIS department. If you have taken any advanced AI/ML/DM course, you may want to skip this course.

Announcements (Spring 2022)

-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond later.
-/-: For the students who want to enroll in this class, please come to the classroom with your enrollment document in the 2nd and 3rd weeks. We will check how many seats are available for the additional students.
2/16: The first class will be streamed live at 9:10 am (UTC+8, Taiwan time) on the YouTube channel: https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA. The syllabus, Security Management, and Malware will be covered (in English).
- If you miss the syllabus, please watch the video Syllabus22 carefully.
2/23: To take this course, you MUST read this document first. It tells you which videos, code, homework, and additional materials you need to work with.
3/1: A Google Classroom (ufyyt6s) has been created for this class. Please sign in with your '@g.nccu.edu.tw' account to submit your first homework, H-ZeroAccess. Since this is our first time using Google Classroom, the due date is now March 8 (AoE). If you have any questions, please contact the TA (or email your homework to the TA in case something goes wrong).
3/3: Homework O-COVID19 announced; due 2022.Mar.13 (AoE). Please go to our Google Classroom. Remember to log in with your g.nccu.edu.tw account.
3/8: NCCU's Google Classroom only accepts g.nccu.edu.tw accounts. Here are some instructions to get a g.nccu.edu.tw account.
3/12: Homework H-NID announced; due 2022.Mar.24 (AoE). Please submit it via our Google Classroom.
4/5:
- Homework H-StaticAnalysis_242variants announced; due 2022.April.18 (AoE). Please submit it via our Google Classroom.
- Homework H-Dynamic announced; due 2022.April.22 (AoE). Data is here, and no code is provided (you may use M11 or T11).
4/17
- Data Science for Cybersecurity, 2022@NCCU Midterm, 2022/04/14 (Thur) 09:00 to 2022/04/17 (Sun) 23:59, UTC+8.
- Please get the Colab file from Google Classroom (it will be announced at 9 am on 04/14), or from Midterm File Here!
4/27
- Due to the COVID-19 epidemic, all remaining classes will be online. Students will NOT need to go to the classroom. Make sure you visit the class website once a week and submit your homework on Google Classroom.
- The [Mandarin] classes will be broadcast live on our YouTube Channel (https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA) every Thursday from 9 a.m. to 12 p.m.
- The [English] classes are pre-recorded and edited. You can find them in this playlist (https://www.youtube.com/playlist?list=PL_ExBw9oE-6iHE31yoeZ_5HgaCsgpyMtk).
5/11
- Due to the COVID-19 epidemic, we will NOT have the final project demo. Please prepare your final project as follows.
  - 1) A two-page proposal (DUE May 26) that specifies what kind of "AI-based security system" you would like to develop.
    - An introduction to the dataset you would like to use (if possible, find a publicly available dataset).
    - What preprocessing would you like to perform? Why?
    - What algorithms will you use to analyze the data? Why?
    - The expected results.
  - 2) A final report (DUE June 13) that includes your code and, at minimum, the following analysis.
    - A description of your dataset (may include missing rate, mean/std. dev. of each column, number of logs, correlation, etc.)
    - Insights into your dataset after you perform exploratory data analysis.
    - The preprocessing applied (for example, feature engineering, missing-value handling, data balancing, etc.)
    - The model design. Why such a design? Pros and cons.
    - Experiment process (e.g., training, testing, 10-fold or 5-fold validation, etc.)
    - Experiment results (e.g., confusion matrix, loss, etc.)
    - Insights from your results.
    - Your code.
5/12, 6/13
- The final exam will be held online from June 13 to June 19.
- The final exam Colab file is here.

Class Info

Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: NCCU Commerce Building, Room 260313
Google Classroom: https://classroom.google.com/c/NDc3NTM5MTU3NDUx?cjc=ufyyt6s
YouTube: https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA
TA: 109356017 at g.nccu.edu.tw
Forum: https://groups.google.com/g/nccu-ds4s
GitHub: https://github.com/hsiaom26/DS4CS/ and https://github.com/hsiaom26/DS4CS22
Office Hours: By appointment.
QR codes: 306732001 (undergraduate, upper image), 356378001 (graduate, lower image)

Course Objectives & Learning Outcomes

Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, machine learning algorithms for security application.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
Understand visualized machine learning tools: Orange

Topics (Spring 2022)

Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning

Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Data Visualization

References (Spring 2022)

Machine Learning for Cyber Security
Data Science for Cyber-Security
Awesome Machine Learning for Cyber Security
Python Data Science Handbook
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
https://machinelearningmastery.com


Schedule (Spring 2022)

Introduction to Cybersecurity
- Syllabus22
- Security Management [T01, VoD] [M01]
- Cyber Attack [T02, VoD] [M01]
Supervised Learning (I)
- CoLab [gpu-test.ipynb, VoD] [M02]
  - Google Colab
  - Python Data Science Packages [PythonDSPackages.ipynb]
  - GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide)
  - Google Dataset Search
- Model [T04] [M03]
  - Model [VoD], Linear Regression [VoD], Gradient Descent [VoD]
  - Exploratory Data Analysis [VoD], Pandas [VoD]
  - Training/Testing Process w/ Linear Regression
  - Cost Function: MSE, RMSE
  - Feature Engineering: Normalization, (Dimension Reduction), Missing Values, (Unbalanced Data), Outlier-1
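The MSE and RMSE cost functions listed above can be sketched in plain Python. This is a minimal illustration rather than course code: the function names `mse` and `rmse` and the sample values are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of MSE, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical targets and predictions (not from any course dataset).
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))   # squared errors are 1, 0, 4, averaged over 3
print(rmse(y_true, y_pred))
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the target variable.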
Supervised Learning (II)
- [T06] [M05]: Logistic Regression [VoD], Support Vector Machine [VoD]
  - Cost Function: Cross-Entropy
- Orange: Visualized Machine Learning Tool [VoD] [M04(O)]
  - Data Analysis Workflow
  - Evaluation: Accuracy, ROC curve, confusion matrix (for classification problem) [VoD]
  - Testing process: k-fold validation
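The k-fold validation process named above can be sketched as an index splitter. This is a minimal sketch under stated assumptions: the helper `k_fold_indices` is hypothetical, folds are contiguous and unshuffled, and the course itself runs this step through Orange rather than hand-written code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold validation.

    Folds are contiguous and unshuffled. The first n_samples % k folds
    get one extra sample, so every sample is tested exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Each sample index appears in exactly one test fold.
for train_idx, test_idx in k_fold_indices(10, 5):
    print("test:", test_idx, "train size:", len(train_idx))
```

In practice the data is usually shuffled (and often stratified by class) before splitting; that step is omitted here to keep the sketch short.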
Supervised Learning (III)
- Tree-based Classification [T07, DT.pptx] [M06]
  - Decision Tree, Entropy, Gini, Chi, Information Gain, Variance [VoD]
  - Revisit Supervised Learning and Random Forest (by Orange) [VoD]
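The split criteria listed for decision trees (entropy, Gini, information gain) can be sketched in a few lines. This is an illustrative sketch, not the course notebook: the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical labels: four malware and four benign samples.
parent = ["malware"] * 4 + ["benign"] * 4
perfect_split = [parent[:4], parent[4:]]      # each child holds one class
useless_split = [parent[::2], parent[1::2]]   # each child is still half/half
print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A tree learner picks the split with the highest gain: a split that separates the classes completely recovers all of the parent's entropy, while one that leaves each child as mixed as the parent gains nothing.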
Unsupervised Learning: clustering
- Unsupervised Learning Algorithms [VoD] [M07]
  - Distance Functions, Scales, Similarity [DM0.pptx]
  - UPGMA.ipynb (UPGMA Walkthrough by Dr. R. Edwards)
  - k-means.ipynb [VoD]
  - knn.ipynb (a supervised algorithm, self-reading material)
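The distance and similarity functions that clustering relies on can be sketched as follows. The choice of Euclidean distance, Manhattan distance, and cosine similarity is an assumption made for illustration; the slides may cover a different set, and the feature vectors are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: b points the same way as a but is twice as long.
a = [1.0, 0.0, 2.0]
b = [2.0, 0.0, 4.0]
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(cosine_similarity(a, b))
```

The two vectors are far apart by either distance but nearly identical by cosine similarity, which is why feature scaling matters for distance-based algorithms such as k-means.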
Unsupervised Learning: dimension reduction
- PCA & LDA [PCA.pptx, VoD] [M08]
- [T09] [M09]: Problematic Data [VoD]
  - Unbalanced (Imbalanced) Data
  - Outlier-2 (PCA-based)
  - Overfit and Underfit, Regularization (self-reading material)
Program Analysis
- Static Analysis [T10] [M10]
  - Windows Portable File Format: Wiki, Microsoft [VoD]
  - [Hash], Digital Signature, Windows Signature, VirusTotal [VoD]
    - Ref: [Endianness]
  - PE file parsing [VoD]
    - PE file usage example
    - Entropy (again!)
  - Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
- Dynamic Analysis [T11] [M11]
  - API Call Analysis, One-hot Encoding [VoD]
  - Time series and event sequence [VoD]
  - Advanced Techniques: PCA, UPGMA, DotMatrix [VoD]
  - Ref: S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. IEEE Symposium on Security and Privacy, 1996.
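The one-hot encoding step listed under Dynamic Analysis can be sketched for API call traces. This is a minimal sketch, not the T11/M11 code: the helpers `build_vocabulary` and `one_hot_encode` are hypothetical, and the traces are made-up examples that merely use real Windows API names.

```python
def build_vocabulary(sequences):
    """Map each distinct API name to an index, in sorted order."""
    names = sorted({call for seq in sequences for call in seq})
    return {name: i for i, name in enumerate(names)}

def one_hot_encode(sequence, vocab):
    """Encode a call sequence as a list of one-hot vectors."""
    encoded = []
    for call in sequence:
        vec = [0] * len(vocab)
        vec[vocab[call]] = 1  # exactly one position is set per call
        encoded.append(vec)
    return encoded

# Hypothetical API call traces from two samples.
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["RegOpenKeyExW", "CreateFileW", "CloseHandle"],
]
vocab = build_vocabulary(traces)
print(vocab)
print(one_hot_encode(traces[0], vocab))
```

Each trace becomes a matrix with one row per call and one column per distinct API name, a shape that sequence models and PCA can consume directly.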
Network Trace and System Log
- Network: Packet Capture, Netflow [T12, VoD] [M12]
  - PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
  - Some PCAP examples: https://www.netresec.com/?page=PcapFiles
- Event: Syslog, auditd (we will use NN to analyze logs later)
Midterm. No class.
Deep Learning Basics (T13byMIT: GitHub, Youtube)
- Neural Network Structure and Regression [M13] [VoD][VoD]
  - [API Doc] Sequential Model, Sequential Class, Activation Function, Loss Function, Optimizer
    - [Wiki] Activation function
    - [Wiki] Loss Function
      - sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
    - Optimizer: RUDER's blog [EN], RUDER's paper [EN] [VoD]
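The difference between `sparse_categorical_crossentropy` (integer labels) and `categorical_crossentropy` (one-hot labels) noted above comes down to the label format. The sketch below re-implements the per-sample math in plain Python for illustration only. These are not the Keras functions themselves, which also handle batching and numerical stability, and the probabilities are hypothetical.

```python
import math

def categorical_crossentropy(y_true_one_hot, y_pred):
    """Cross-entropy for one sample whose label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, y_pred) if t)

def sparse_categorical_crossentropy(y_true_index, y_pred):
    """Cross-entropy for one sample whose label is an integer class index."""
    return -math.log(y_pred[y_true_index])

# Hypothetical softmax output over three classes; the true class is index 1.
probs = [0.1, 0.7, 0.2]
print(categorical_crossentropy([0, 1, 0], probs))
print(sparse_categorical_crossentropy(1, probs))
```

Both calls compute the negative log of the probability assigned to the true class, so choosing between the two losses is a matter of how the labels are stored, not of what is optimized.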
Neural Network Structure and Convolution [M14]
- Convolution: Conv2D, Conv1D, kernel size [VoD]
  - Pooling: MaxPooling2D, AveragePooling2D [VoD]
  - Flatten layer
  - Example: [CNNImage] [VoD]
- Overfitting (T14: by Google, no VoD, self-learning material)
  - Validation dataset
  - Dropout (will be covered in a later VoD)
  - Ridge and Lasso (will be covered in a later VoD)
Latent Space
- T15 [VoD] [M15-1 by Google]
  - Example: Auto Encoder [by Stanford] [by tensorflow2.0]
  - AE Applications [VoD]:
    - Example: Convolutional Autoencoder & Image Denoising [M15-2 by Keras]
  - Ref: Variational Autoencoder ([TC] and [En])
- Google: "MNIST latent space" to see the latent space of handwriting digits [VoD]
- L1/L2, Dropout, Overfit [VoD]
  - Revisit T09 [VoD]
  - Ridge and Lasso Regression [En]
Latent Space II [T16 no slides, see VoD][M16]
- Activation Function Visualization [VoD]
  - Ref: EN A visual proof that neural nets can compute any function
  - Ref: C Activation Function Visualization
  - Ref: EN ConvnetJS demo by karpathy
  - Ref: EN playground.tensorflow.org
  - Ref: C LeeMeng
- Sensitivity of a Model to Neural Network Hyperparameter [jpg, VoD]
  - Ref: tf.keras.initializers
- Text Machine Learning Workflows (Orange) [T17-1/M17(O) VoD only]
  - Example: Spam Mail Filter: Spambase Data Set
RNN + Language Model
- MIT 6.S191: Recurrent Neural Networks and Transformers [YouTube by MIT]
  - Ref: Understanding LSTM Networks
- Codes [T17-2&3, see VoD & VoD]
  - Word Embeddings, Word2Vec [ref: EN, TC]
  - Keras Embedding Layer
  - Transformer and BERT
- Example: Attack Lifecycle
  - MITRE ATT&CK
Intrusion Detection
- Paper Reading (Please download the papers and read them on your own. I highlight important concepts in the papers.)
  - An Introduction to Intrusion Detection by Aurobindo Sundaram.
  - R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, 2002, pp. supl27-supl30.
  - Additional material (if you have time to read it): E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
- Code Review (covered in the class)
  - Word Embedding (Revisit)
  - TFDS: Tensorflow Dataset
  - Basic text classification (NN), Text classification with an RNN, Classify text with BERT
- Security Datasets for your information
  - Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.
Anomaly Detection
- Paper Reading (Please download the papers and read them on your own. I highlight important concepts in the papers.)
  - V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
  - S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A sense of self for Unix processes," in Proc. 1996 IEEE Symposium on Security and Privacy (S&P), May 1996, pp. 120-128.
- Code Review
  - Anomaly Detection
    - Classification on imbalanced data
    - Novelty and Outlier Detection by scikit-learn
    - Anomaly Detection using Machine Learning by projectpro
    - Anomaly Detection by O'Reilly
~~Demo~~
- No Class. Please finish your final project!
- Homework: watch "K-means and Self-Organizing Map" [YouTube]
Final exam.
- Note that you MUST finish the FINAL alone. You CANNOT share it or discuss it with anyone else. We highly recommend you finish the FINAL as early as possible. Do not wait until the last minute.

Lab/Assignment (Spring 2022)

H-ZeroAccess (4%)
- Copy the Colab file to your Google Drive and finish the questions.
- Due 2022.~~Mar.03~~ Mar. 08@ AoE
  - Submit to Google Classroom using your @g.nccu.edu.tw.
  - https://classroom.google.com/c/NDc3NTM5MTU3NDUx?cjc=ufyyt6s
O-COVID19 (4%)
- https://github.com/CSSEGISandData/COVID-19
  - Example by Taiwan CDC. [link]
- Data Mining COVID-19 with Orange [Blog] [YouTube]
- Submit to Google Classroom; due 2022.Mar.13 (AoE).
  - Upload a pdf file containing your Orange data analysis flow and screenshots of your analysis.
  - This is an open-ended homework; feel free to do any analysis on COVID-19 data.
H-NID (5%)
- Network Intrusion Detection
  - https://www.kaggle.com/sampadab17/network-intrusion-detection-using-python
- You need to include at least the following mechanisms, using Orange:
  - feature engineering
  - at least 5 different models
  - k-fold validation
  - Evaluation index
- Due 2022.Mar.24 (AoE). See Google Classroom.
H-PCA (6%)
- PCA&datasets
- Use all you have learned to analyze three datasets we have used before.
- Due 2022.Mar.31@ AoE
  - Submit to Google Classroom with your .ipynb file (not a link to your colab).
H-Static (6%)
- H-StaticAnalysis_242variants
- Extract features from 242 malware samples and try to classify/cluster them.
- Due 2022.April.18@ AoE
H-Dynamic (6%)
- Use T11 or M11 to analyze 272 malware samples.
- Due 2022.April.22@ AoE
H-PEimageCNN
- Given a set of malware PE files and their malware family labels, design a CNN deep neural network to classify these PE files into their families. H-PEimageCNN.ipynb
- Due 2022.May.8@ AoE
H-SMSSpam
- Here is an SMS Spam Collection Data Set.
- Design NN-based classifiers to determine whether a message is spam. You may want to use fastText as your English word embedder.
  - https://github.com/facebookresearch/fastText/
  - https://fasttext.cc/docs/en/python-module.html
- You MUST design TWO neural network models: one with an AutoEncoder and another without an AE. Show us which one is better.
- You MUST apply ONE ML classifier and show us whether the NN is better.

Grading Policy

Homework (40%): programming exercises and essays. You MUST read the ACADEMIC INTEGRITY section before taking this class.
Project (20%): students need to write an analysis program on a security-related dataset to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.
Midterm (20%)
Final (20%)

The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, and it helps facilitate academic integrity.

Academic Integrity

Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: submit original work only, and credit any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally has a problem and finishes their work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.
