Advanced Network Technologies and Services Laboratory - Data Science for Cybersecurity 2023

Data Science for Cybersecurity (2023)

This course is an introductory class for students interested in cybersecurity analysis using machine learning methods. Students will become familiar with the tools, algorithms, concepts, and execution environment needed to analyze cybersecurity data, and will learn to act as architects who solve security-related problems with data analysis algorithms and tools. The class covers related security concepts, data analysis theory, research papers, and background knowledge. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.

Note that students should have taken programming courses, such as Programming Language I/II, before this class. The programming language used in this class is Python (however, we will NOT cover any Python language tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to implement Python scripts.

Note this course is designed for students who are in their third or fourth year of college. If you have taken any advanced AI/ML/DM course, you may want to skip this course.

02/06: For students who want to enroll in this class, please come to the classroom directly. We will handle the enrollment process on 2/23.

Announcements (Spring 2023)

-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond later.
-/-: For the students who want to enroll in this class, please come to the classroom with your enrollment document in the 2nd and 3rd weeks. We will check how many seats are available for the additional students.
English-speaking students MUST read this document before enrolling in this class.
02/16: We have a new classroom: 學思樓 #040103 (Xue Si Building, Room 103). Please go to the new classroom directly.
02/22: The midterm has been moved to 4/20 (i.e., the 10th week). There is no class that day.
02/23: Join Google Classroom by using your g.nccu.edu.tw account with ID qmgp6yj. The first homework (H-ZeroAccess) is announced. Due 2023.03.09 23:59.
03/22: The due date of H-NID has been extended to 3/27 (Monday). Please see the announcement in Google Classroom or contact the TA.
03/27: The homework due before the midterm is listed below. After the midterm, there will be only two more homework assignments. Cheer up.
- H-PCA: due on 4/8
- H-Static: due on 4/15
- No class on 4/20
- H-Dynamic: due on 4/22
- Midterm: announced on 4/20; due on 4/26. (I hope you can finish it before the next class on 4/27.)
04/05:
- The class on 4/6 (April 6) will be held as usual in classroom #040103. We will also provide live online streaming (https://www.youtube.com/@nccu.hsiaom/) and video on demand for students not in the classroom. You may come to the classroom, watch the live stream, or watch the VoD later.
04/27:
- H-PEimageCNN homework announced. Due 5/10 23:59.
05/22: Final Project! We have swapped the dates of the Final Project demo and the Final Exam!
- One or two students form a group. You MUST list all group members in every document and code file.
- 06/01: Proposal. Hand in a PDF proposal on June 1. It SHOULD include a title, the problem definition, the dataset used, and the experiment plan. We will review your dataset; please do not use a very commonly used dataset (since many works have already focused on them).
- 06/08: Final Exam! No class! File announced 06/07 at 18:00. Due 06/11 at 23:59.
- 06/15: Demo. Present your Colab file in class. No presentation file is needed. Hand in your modified Colab file before 6/17. Due to limited time, only some of the teams can demo their work. If you want to demo voluntarily, please contact the TA. If too many or too few teams volunteer, we may need to draw straws. Please be prepared!

Class Info

Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: 學思樓 #040103 (Xue Si Building, Room 103)
TA: Ms. HSIA, 111356021 at g.nccu.edu.tw
Forum: https://groups.google.com/g/nccu-ds4s
GitHub: [Tx] https://github.com/hsiaom26/DS4CS/ and [Mx] https://github.com/hsiaom26/DS4CS22
Homework: Google Classroom (You MUST have a @g.nccu.edu.tw account) #qmgp6yj.
Office Hours: By appointment.

Course Objectives & Learning Outcomes

Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, and machine learning algorithms for security applications.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
Understand visualized machine learning tools: Orange.

Topics (Spring 2023)

Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning

Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Language Model
Data Visualization
GUI-based Analysis Tool

References (Spring 2023)

Machine Learning for Cyber Security
Data Science for Cyber-Security
Awesome Machine Learning for Cyber Security
Python Data Science Handbook
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
https://machinelearningmastery.com

DS4CS2023

Schedule (Spring 2023)

Introduction to Cybersecurity
- Syllabus 2023
- Security Management [T01, VoD] [M01]
- Cyber Attack [T02, VoD] [M01]
Supervised Learning (I)
- CoLab [gpu-test.ipynb, VoD] [M02]
  - Google Colab
  - Python Data Science Packages [PythonDSPackages.ipynb]
  - GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide)
  - Google Dataset Search
- Model [T04] [M03]
  - Model [VoD], Linear Regression [VoD], Gradient Descent [VoD]
  - Exploratory Data Analysis [VoD], Pandas [VoD]
  - Training/Testing Process w/ Linear Regression
  - Cost Function: MSE, RMSE
  - Feature Engineering: Normalization, (Dimension Reduction), Missing Values, (Unbalanced Data), Outlier-1
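The cost functions and the normalization step listed for this unit can be illustrated with a minimal pure-Python sketch. The function names and the sample numbers below are assumptions made for illustration only; they are not taken from the course notebooks, which may use NumPy or scikit-learn instead.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in the unit of the target."""
    return math.sqrt(mse(y_true, y_pred))

def min_max_normalize(values):
    """Rescale a feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative (made-up) targets and predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))    # (1 + 0 + 4) / 3
print(rmse(y_true, y_pred))
print(min_max_normalize([10, 20, 40]))
```

RMSE is often easier to interpret than MSE because it is expressed in the same unit as the predicted value.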
Supervised Learning (II)
- [T06] [M05]: Logistic Regression [VoD], Support Vector Machine [VoD]
  - Cost Function: Cross-Entropy
- Orange: Visualized Machine Learning Tool [VoD] [M04(O)]
  - Data Analysis Workflow
  - Evaluation: Accuracy, ROC curve, confusion matrix (for classification problem) [VoD]
  - Testing process: k-fold validation
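The evaluation items in this unit (accuracy, confusion matrix, k-fold validation) can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the folds are contiguous and unshuffled, and the labels are made-up binary values. Orange and scikit-learn provide their own, more complete implementations.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

def confusion_matrix(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classification problem."""
    cm = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            cm["TP" if t == positive else "FP"] += 1
        else:
            cm["FN" if t == positive else "TN"] += 1
    return cm

def accuracy(cm):
    """Fraction of samples that were classified correctly."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())

# Illustrative labels: 1 = attack, 0 = normal.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm))

for train, test in k_fold_indices(n_samples=5, k=2):
    print("train:", train, "test:", test)
```

In each round, a model is trained on the train indices and scored on the test indices; the k scores are then averaged.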
Supervised Learning (III)
- Tree-based Classification [T07, DT.pptx] [M06]
  - Decision Tree, Entropy, Gini, Chi, Information Gain, Variance [VoD]
  - Revisit Supervised Learning and Random Forest (by Orange) [VoD]
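The split criteria named in this unit (entropy, Gini, information gain) can be computed by hand. The sketch below is a minimal pure-Python version; the function names, the labels, and the example split are assumptions for illustration, not the course's own material.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Illustrative node with a balanced class mix, split into two children.
parent = ["malware"] * 4 + ["benign"] * 4
left = ["malware", "malware", "malware", "benign"]
right = ["malware", "benign", "benign", "benign"]

print(entropy(parent))   # 1.0 for a balanced two-class node
print(gini(parent))      # 0.5 for a balanced two-class node
print(information_gain(parent, [left, right]))
```

A decision tree learner evaluates candidate splits this way and keeps the one with the highest gain (or the lowest impurity).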
Unsupervised Learning: clustering
- Unsupervised Learning Algorithms [VoD] [M07]
  - Distance Functions, Scales, Similarity [DM0.pptx]
  - UPGMA.ipynb (UPGMA Walkthrough by Dr. R. Edwards)
  - k-means.ipynb [VoD]
  - knn.ipynb (a supervised algorithm, self-reading material)
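The distance and similarity functions in this unit, and the way k-means uses them, can be sketched as follows. This is a minimal pure-Python illustration; the function names and sample points are assumptions, and only the assignment step of k-means is shown, not the full iterative algorithm from k-means.ipynb.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_to_nearest(points, centroids):
    """k-means assignment step: index of the closest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
print(assign_to_nearest([(0, 0), (10, 10), (1, 1)], [(0, 0), (9, 9)]))
```

Because distances depend on feature scale, features are usually normalized before clustering.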
Unsupervised Learning: dimension reduction
- PCA & LDA [PCA.pptx, VoD] [M08]
- [T09] [M09]: Problematic Data [VoD]
  - Unbalanced (Imbalanced) Data
  - Outlier-2 (PCA-based)
  - Overfit and Underfit, Regularization (self-reading material)
Program Analysis
- Static Analysis [T10] [M10]
  - Windows Portable Executable (PE) File Format: Wiki, Microsoft [VoD]
  - [Hash], Digital Signature, Windows Signature, VirusTotal [VoD]
    - Ref: [Endianness]
  - PE file parsing [VoD]
    - PE file usage example
    - Entropy (again!)
  - Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
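Two of the static features in this unit, the file hash and the byte entropy, need only the Python standard library. The sketch below works on in-memory bytes; the function names are assumptions for illustration, and reading a real sample would use something like open(path, "rb").read() with a path of your own.

```python
import hashlib
import math
from collections import Counter

def sha256_hex(data: bytes) -> str:
    """SHA-256 digest as a hex string, usable as a lookup key for a sample."""
    return hashlib.sha256(data).hexdigest()

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

# Illustrative inputs rather than real PE files.
constant = b"\x00" * 1024          # one repeated byte value
uniform = bytes(range(256)) * 4    # every byte value equally often

print(sha256_hex(constant))
print(byte_entropy(constant))  # 0.0
print(byte_entropy(uniform))   # 8.0
```

Entropy close to 8 bits per byte is commonly treated as a hint that a section is compressed, packed, or encrypted, although it is only a heuristic.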
- Dynamic Analysis [T11] [M11]
  - API Call Analysis, One-hot Encoding [VoD]
  - Time series and event sequence [VoD]
  - Advanced Techniques: PCA, UPGMA, DotMatrix [VoD]
  - Ref: S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. IEEE Symposium on Security and Privacy, 1996.
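The one-hot encoding and event-sequence ideas in this unit can be sketched in plain Python. The sliding-window check below is a simplified n-gram variant in the spirit of the sequence-based approach in the Forrest et al. reference, not a reproduction of the paper's exact method. The API names, traces, and helper names are hypothetical.

```python
def one_hot_encode(sequence, vocabulary):
    """Turn each API call name into a 0/1 vector over a fixed vocabulary."""
    index = {name: i for i, name in enumerate(vocabulary)}
    encoded = []
    for name in sequence:
        row = [0] * len(vocabulary)
        row[index[name]] = 1
        encoded.append(row)
    return encoded

def sliding_windows(sequence, k):
    """Set of all length-k windows (n-grams) observed in a trace."""
    return {tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}

def mismatch_rate(trace, normal_windows, k):
    """Fraction of the trace's length-k windows never seen in normal behavior."""
    windows = [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]
    if not windows:
        return 0.0
    unseen = sum(1 for w in windows if w not in normal_windows)
    return unseen / len(windows)

# Hypothetical vocabulary and traces.
vocabulary = ["CreateFile", "ReadFile", "WriteFile", "CloseHandle"]
normal_trace = ["CreateFile", "ReadFile", "CloseHandle",
                "CreateFile", "WriteFile", "CloseHandle"]
test_trace = ["CreateFile", "WriteFile", "WriteFile", "CloseHandle"]

print(one_hot_encode(["ReadFile"], vocabulary))  # [[0, 1, 0, 0]]
profile = sliding_windows(normal_trace, k=2)
print(mismatch_rate(test_trace, profile, k=2))   # 1 of 3 windows is unseen
```

A trace whose mismatch rate exceeds a chosen threshold would be flagged as anomalous; the threshold itself has to be tuned on data.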
Network Trace and System Log
- Network: Packet Capture, Netflow [T12, VoD] [M12]
  - PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
  - Some PCAP examples: https://www.netresec.com/?page=PcapFiles
- Event: Syslog, auditd (we will use NN to analyze logs later)
Deep Learning Basics (T13byMIT: GitHub, Youtube)
- Neural Network Structure and Regression [M13] [VoD][VoD]
    - [API Doc] Sequential Model, Sequential Class, Activation Function, Loss Function, Optimizer
    - [Wiki] Activation function
    - [Wiki] Loss Function
      - sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
    - Optimizer: RUDER's blog [EN], RUDER's paper [EN] [VoD]
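The note on sparse_categorical_crossentropy (integer labels) versus categorical_crossentropy (one-hot labels) can be checked with a small worked example. The functions below are a pure-Python illustration of the underlying formula for a single sample. They are not the Keras implementations, which also handle batching and numerical clipping, and the probability values are made up.

```python
import math

def categorical_crossentropy(one_hot_label, probs):
    """Cross-entropy when the label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)

def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy when the label is an integer class index."""
    return -math.log(probs[class_index])

# Illustrative softmax output over three classes; the true class is index 1.
probs = [0.1, 0.7, 0.2]

loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)

print(loss_one_hot)
print(loss_sparse)
print(loss_one_hot == loss_sparse)  # True: same loss, different label encoding
```

The two losses measure the same quantity; the choice depends only on how the labels are stored.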
Midterm. 4/20, 2023. No class.
Neural Network Structure and Convolution [M14]
- Convolution: Conv2D, Conv1D, kernel size [VoD]
  - Pooling: MaxPooling2D, AveragePooling2D [VoD]
  - Flatten layer
  - Example: [CNNImage] [VoD]
- Overfitting (T14: by Google, no VoD, self-learning material)
  - Validation dataset
  - Dropout (will be covered in the later VoD)
  - Ridge and Lasso (will be covered in the later VoD)
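What the convolution, pooling, and flatten layers in this unit compute can be shown on a tiny input. The sketch below is a pure-Python illustration for a single channel with no padding and stride 1. The helper names and the 4x4 input are assumptions; the Keras layers Conv2D and MaxPooling2D additionally handle batches, channels, and learned kernels.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, w - size + 1, size)
        ]
        for i in range(0, h - size + 1, size)
    ]

def flatten(feature_map):
    """Concatenate the rows into one vector for a dense layer."""
    return [v for row in feature_map for v in row]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]

print(conv2d_valid(image, kernel))  # 3x3 output: kernel size 2 shrinks each side by 1
print(max_pool2d(image))            # [[6, 8], [14, 16]]
print(flatten(max_pool2d(image)))   # [6, 8, 14, 16]
```

With a kernel of size k and no padding, each spatial dimension shrinks by k - 1, which is why kernel size affects the shape of later layers.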
Latent Space
- T15, M15 [VoD] [oldM15-1 by Google]
  - Example: Auto Encoder [by Stanford] [by tensorflow2.0]
  - AE Applications [VoD]:
    - Example: Convolutional Autoencoder & Image Denoising [oldM15-2 by Keras]
  - Ref: Variational Autoencoder ([TC] and [En])
- Google: "MNIST latent space" to see the latent space of handwriting digits [VoD]
- L1/L2, Dropout, Overfit [VoD]
  - Revisit T09 [VoD]
  - Ridge and Lasso Regression [En]
Latent Space II [T16 no slides, see VoD][M16]
- Activation Function Visualization [VoD]
  - Ref: [EN] A visual proof that neural nets can compute any function
  - Ref: [C] Activation Function Visualization
  - Ref: [EN] ConvnetJS demo by karpathy
  - Ref: [EN] playground.tensorflow.org
  - Ref: [C] LeeMeng
- Sensitivity of a Model to Neural Network Hyperparameter [jpg, VoD]
  - Ref: tf.keras.initializers
- Text Machine Learning Workflows (Orange) [T17-1/M17(O) VoD only]
  - Example: Spam Mail Filter: Spambase Data Set
RNN + Language Model
- Algorithms
  - M18-1 Understanding LSTM Networks (RNN, LSTM, GRU)
    - Ref: MIT 6.S191: Recurrent Neural Networks and Transformers [YouTube by MIT]
  - M18-2 Attention and Self-Attention for NLP
- Codes [T17-2&3, see VoD & VoD] (will be covered next week)
  - Word Embeddings, Word2Vec [ref: EN, TC]
  - Basic text classification, Text classification with an RNN, Classify text with BERT
- Example: Attack Lifecycle
  - MITRE ATT&CK
Intrusion Detection + Language Model Codes
- Paper Reading (Please download the papers and read them on your own. I highlight the important concepts in each paper.)
  - An Introduction to Intrusion Detection by Aurobindo Sundaram.
  - R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, suppl., pp. 27-30, 2002.
  - Additional material (if you have time to read it): E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
- Code Review (covered in the class)
  - Word Embedding (Revisit)
  - TFDS: TensorFlow Datasets
  - Basic text classification (NN), Text classification with an RNN, Classify text with BERT
- Security Datasets for your information
  - Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.
Anomaly Detection
- Paper Reading (Please download the papers and read them on your own. I highlight the important concepts in each paper.)
  - V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
  - S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A sense of self for Unix processes," in Proc. 1996 IEEE Symposium on Security and Privacy (S&P), May 1996, pp. 120-128.
- Code Review
  - Anomaly Detection
    - Classification on imbalanced data
    - Novelty and Outlier Detection by scikit-learn
    - Anomaly Detection using Machine Learning by projectpro
    - Anomaly Detection by O'Reilly
Final exam.
- Note that you MUST finish the FINAL alone. You CANNOT share or discuss it with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.
Demo.
- Homework: watch "K-means and Self-Organizing Map" [YouTube]

Lab/Assignment (Spring 2023)

H-ZeroAccess (4%)
- Copy and save this Colab file to your Google Drive and finish the questions.
- Due 2023.03.09 (23:59 UTC+8)
  - Submit to Google Classroom using your @g.nccu.edu.tw account.
  - Upload your colab file (an .ipynb file) to the Google Classroom system; please do NOT send your colab link.
O-COVID19 (4%)
- https://github.com/CSSEGISandData/COVID-19
  - Example by Taiwan CDC. [link]
- Data Mining COVID-19 with Orange [Blog] [YouTube]
- Submit to Google Classroom; due 2023.03.16.
  - Upload a PDF file containing your Orange data analysis flow and screenshots of your analysis.
  - This is an open-ended homework; feel free to do any analysis on COVID-19 data.
H-NID (5%)
- Network Intrusion Detection
  - https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection
- You need to include at least the following mechanisms, using Orange:
  - feature engineering
  - at least 5 different models
  - k-fold validation
  - Evaluation index
- Due 2023.03.23.
H-PCA (6%)
- PCA&datasets
- Use all you have learned to analyze three datasets we have used before.
- Due 2023.04.08
  - Submit your .ipynb file to Google Classroom.
H-Static (6%)
- H-StaticAnalysis_242variants
- Extract features from 242 malware samples and try to classify/cluster them.
- Due 2023.04.15.
H-Dynamic (6%)
- Use T11 or M11 to analyze 272 malware samples.
- Due 2023.04.22.
H-PEimageCNN
- You are given a set of malware PE files and their malware family labels. Please design a CNN deep neural network to classify these PE files into their families. H-PEimageCNN.ipynb
- Due 2023.05.10 23:59.
H-SMSSpam
- Here is an SMS Spam Collection Data Set.
- Design NN-based classifiers to determine if a message is spam or not.
- First, you need a word embedder. Please use at least TWO embedders.
  - You may want to use fasttext as your English words embedder.
    - https://github.com/facebookresearch/fastText/
    - https://fasttext.cc/docs/en/python-module.html
  - Or you can use any of the embedders shown in the class
- Second, after obtaining the vectors, you MUST implement at least THREE classifiers.
  - One classifier MUST be a conventional ML classifier, such as LR, SVM, Tree, ...
  - Another one MUST be a neural network classifier.
  - The third model MUST contain an AutoEncoder or VAE.
- Please show us which combination of embedder and classifier has better accuracy.

Grading Policy

Homework (40%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (20%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.
Midterm (20%)
Final (20%)

The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, and it helps facilitate academic integrity.

Academic Integrity

Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally has a problem and finishes their work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.
