Advanced Network Technologies and Services Laboratory - Data Science for Cybersecurity 2021

Data Science for Cybersecurity (2021)

2021/7/28: Links or files on this page might be lost. Please see the latest version of this course on the Teaching page.

This course is an introductory class for students who major in Management Information Systems and are interested in cybersecurity analysis. It focuses on data analysis topics. Students should become familiar with the tools, algorithms, concepts, and execution environment needed to perform data analysis on cybersecurity data, and should learn to act as architects who solve security-related problems using data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.

Note that students should take Programming Language I and Programming Language II (i.e., two semesters of programming courses) before taking this class. The programming language used in this class is Python (however, we will NOT cover any Python language tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to implement Python scripts, so please bring your laptop to class.

Note that this course is designed for third- or fourth-year undergraduate students in the MIS department. If you have taken any advanced AI/ML/DM course, you may want to skip this course. It is an English-taught class, and it will be broadcast online.

Announcements (Spring 2021)

2/22: We now have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond later.
2/22: For the students who want to enroll in this class, please come to the classroom with your enrollment document on 2/25 and 3/4. We will check how many seats are available for the additional students.
2/25: Here is the QR code for the undergraduate students; and the QR code for the graduate students. Due to COVID-19, NCCU requests that you MUST scan the QR code every time you attend the class.
3/25: #HW01 announced. Due 4/1 9 am UTC+8. The TA will send instructions for homework submission.
3/31: #HW02 announced. Due 4/15 9 am UTC+8.
4/15: #HW03 announced. Due 4/21 9 am UTC+8.
4/16: MIDTERM announced. Due 2021/04/29 (Thu) 09:00 UTC+8. Please follow the instructions in the colab file. You may ask questions in our class forum https://groups.google.com/g/nccu-ds4s. Note that you MUST finish the MIDTERM alone. You CANNOT share/discuss the MIDTERM with anyone else. We highly recommend that you finish the MIDTERM as early as possible. Do not wait until the last minute.
5/06: #HW04 announced. Due 5/20 9 am UTC+8.
5/18:
- No class on 5/20 (University Anniversary).
- Please take a look at K-means and Self-Organizing Map: [YouTube] on your own. We will NOT cover this lecture.
- Due to COVID-19, our course will be purely online. Please do NOT go to the classroom. If you have any questions, please post them on the forum.
- The format of the FINAL exam will be the same as the MIDTERM. You will have one week to finish it at home on your own. It starts on 6/17 and is due 6/24 9 am UTC+8.
- #HW05 announced. Due 6/3 9 am UTC+8.
- Please spend some time on your FINAL PROJECT PROPOSAL. You will need to upload a PDF document (#PJ01, at most 4 pages) that includes the Dataset (you want to use), Problem (you want to solve), Preprocess (you want to apply), Models (you want to use), and Expectation of your data analysis project. Due 6/3 9 am UTC+8; send the PDF file to the TA. You can freely choose the topic of your final project. We recommend using a dataset from Kaggle or Google Dataset Search. It does not have to be a security-related dataset/problem; however, we strongly recommend choosing a security-related topic. If you have any questions, post them on the forum.
  - You will need to complete your FINAL PROJECT on colab (#PJ02) before the FINAL exam.
6/4:
- #BN01: COVID-19. 4% of the final points will be awarded. You may see the example made by Taiwan CDC [link]. Send your code/webpage to the TA before the FINAL (6/24 9 am UTC+8). Data Ref: https://github.com/CSSEGISandData/COVID-19
- #HW06: Network Intrusion Detection. See the detailed description below. Due 6/17 9 am UTC+8.
6/17:
- FINAL announced. Due 2021/06/24 (Thu) 09:00 UTC+8. Please follow the instructions in the colab file. You may ask questions in our class forum https://groups.google.com/g/nccu-ds4s. Note that you MUST finish the FINAL alone. You CANNOT share/discuss the FINAL with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.

Class Info

Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Thursday 234 (09:10~12:00 UTC+8)
Classroom: Yi-Xian Building, 5F
Online Video: https://www.youtube.com/channel/UCIIOuh-0H1Wrq75ozOVBaHA
- The class will be broadcast live on YouTube every Thursday morning (09:10~12:00 UTC+8).
- The VOD will be kept on the YouTube channel, and edited (shortened) versions will be uploaded later (but this is not guaranteed).
TA: Kelvin I. W. Kuok <108356041 at g.nccu.edu.tw>
Forum: https://groups.google.com/g/nccu-ds4s
GitHub: https://github.com/hsiaom26/DS4CS/
Office Hours: By appointment.
QR codes: QR code for the undergraduate students; QR code for the graduate students

Course Objectives & Learning Outcomes

Understand the relationship between Cybersecurity and Security Management.
Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Become familiar with the data analysis environment, GPU-based computation, and cloud computing.
Understand data analysis algorithms for security applications: distance functions, similarity functions, classification, clustering, and machine learning algorithms.
Understand the neural network structures and algorithms.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system and sequence analysis system.
Understand visualized machine learning tools: Orange

Topics (Spring 2021)

Security Management
Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning
Intrusion Detection System
Anomaly Detection System
Neural Network
Spam Mail Filter System
Sequence Analysis
Data Visualization

References (Spring 2021)

Network Security Through Data Analysis, Michael Collins, O'Reilly Media, 2014.
Data-Driven Security: Analysis, Visualization and Dashboards, Jay Jacobs and Bob Rudis, Wiley, 2014.
Machine Learning for Cyber Security
Data Science for Cyber-Security
Awesome Machine Learning for Cyber Security
Python Data Science Handbook
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
https://machinelearningmastery.com

Schedule (Spring 2021)

02/25: Introduction to Cybersecurity
- T01: Security Management
- T02: Cyber Attack
- T03: Data Analysis Environment
  - Jupyter Notebook & JupyterLab, Anaconda, Google Colab
  - Python Data Science Packages [PythonDSPackages.ipynb]
  - GPU Accelerated Computing with Python: NVIDIA CUDA Toolkit (Installation Guide), [GPU-test.ipynb]
  - Google Dataset Search
03/04: Supervised Learning: classification I
- T04: Table-Based Data and Data Analysis Process
  - Model, Linear Regression, Gradient Descent
  - Training/Testing Process w/ Linear Regression
  - Cost Function: MSE, RMSE (for regression problem)
  - Feature Engineering: Normalization, (Dimension Reduction), Missing Values, (Unbalanced Data), Outlier-1
- T05: Visualized Machine Learning Tool
  - Orange (class note)
  - Data Analysis Workflow
  - Evaluation: Accuracy, ROC curve, confusion matrix (for classification problem)
  - Testing process: k-fold validation
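As a rough illustration of the 03/04 topics (linear regression, gradient descent, and the MSE/RMSE cost functions), the following plain-Python sketch fits a line to toy data. The data points, learning rate, step count, and function names are assumptions made for this example and are not taken from the course notebooks.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training sample under the current model.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear(xs, ys)
predictions = [w * x + b for x in xs]
cost = mse(ys, predictions)
rmse = math.sqrt(cost)
print(f"w={w:.3f}, b={b:.3f}, MSE={cost:.6f}, RMSE={rmse:.6f}")
```

RMSE is simply the square root of MSE, which puts the error back in the same unit as the target. On real data, the cost should be reported on a held-out test set (or averaged over k folds) rather than on the training data as done here.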
03/11: Supervised Learning: classification II
- T06: Supervised Learning Algorithms
  - (Linear Regression), Logistic Regression, SVM
  - Cost Function: Cross-Entropy
- T07: Tree-based Classification
  - Entropy, Gini, Chi-Square, Information Gain, Variance
  - Decision Tree
  - Random Forest (using Orange)
03/18: (cont'd)
03/25: Unsupervised Learning: clustering
- T08: Unsupervised Learning Algorithms
  - Distance Functions, Scales, Similarity
  - UPGMA (UPGMA Walkthrough by Dr. R. Edwards), k-means, (k-NN, supervised).
  - Dimension Reduction: PCA & LDA
- T09: Problematic Data
  - Unbalanced (Imbalanced) Data
  - Outlier-2 (PCA-based)
  - Overfit and Underfit, Regularization
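The "Distance Functions, Scales, Similarity" item in T08 can be made concrete with a small sketch. The functions below are common textbook definitions written in plain Python; the function names and sample vectors are assumptions for illustration, and `min_max_scale` assumes the column is not constant.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def min_max_scale(column):
    """Rescale one feature column to [0, 1]; assumes max(column) > min(column)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q), manhattan(p, q))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(min_max_scale([10.0, 20.0, 30.0]))
```

Scaling matters because a feature measured in large units (e.g., bytes transferred) would otherwise dominate a distance computed together with a small-valued feature (e.g., number of failed logins), which in turn affects clustering algorithms such as k-means and UPGMA.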
04/01: T10: Static Analysis
- Windows Portable Executable (PE) File Format: Wiki, Microsoft
  - Try this putty.exe and
- PE file parsing
  - pip install pefile
  - PE file usage example, Binary Hacking
  - Entropy (again!)
- Digital Signature, Windows Signature
- Ref: A. Shalaginov et al. "Machine Learning Aided Static Malware Analysis: A Survey and Tutorial," in Cyber Threat Intelligence, August 2018. [Springer] or [Researchgate]
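Since byte entropy comes up again in static analysis, here is a minimal sketch of how Shannon entropy over raw bytes could be computed. It uses only the standard library and takes any `bytes` object, however it was obtained (for example, the contents of a file or of one PE section). The function name and the sample inputs are assumptions for illustration.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (between 0.0 and 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1 / p) over every byte value that occurs in the data.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A single repeated byte carries no information.
low = byte_entropy(b"A" * 1024)
# All 256 byte values, equally frequent, reach the maximum of 8 bits per byte.
high = byte_entropy(bytes(range(256)) * 4)
print(low, high)
```

Packed or encrypted regions of an executable tend to have entropy close to the 8-bit maximum, while plain code and text are usually lower, which is why entropy is often used as a feature in static malware analysis.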
04/08: Dynamic Analysis
- T11: System Call and API Call
- Profiler: Cuckoo
- S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. IEEE Symposium on Security and Privacy, 1996.
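The idea of profiling a process by its short system call sequences can be sketched as follows. This is a simplified variant in the spirit of Forrest et al., using contiguous sliding windows rather than the exact method in the paper; the function names, the window length of 3, and the toy traces are assumptions for illustration.

```python
def ngrams(trace, n):
    """All contiguous windows of length n over a list of system call names."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def build_profile(normal_traces, n=3):
    """Collect every window observed during normal behavior."""
    profile = set()
    for trace in normal_traces:
        profile.update(ngrams(trace, n))
    return profile


def mismatch_rate(trace, profile, n=3):
    """Fraction of windows in a trace that never appeared in the normal profile."""
    windows = ngrams(trace, n)
    if not windows:
        return 0.0
    unseen = sum(1 for w in windows if w not in profile)
    return unseen / len(windows)


# Toy traces (hypothetical data, not real captures).
normal = ["open", "read", "mmap", "mmap", "open", "getrlimit", "mmap", "close"]
suspect = ["open", "read", "mmap", "open", "open", "getrlimit", "mmap", "close"]

profile = build_profile([normal])
rate = mismatch_rate(suspect, profile)
print(f"{len(profile)} normal windows, mismatch rate = {rate:.2f}")
```

A trace whose mismatch rate exceeds a chosen threshold would be flagged as anomalous. In practice, the profile would be built from many traces collected with a profiler such as Cuckoo, and the window length and threshold would need tuning.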
04/15: Trace and Log
- T12: Network: Packet Capture, Netflow
  - PCAP https://wiki.wireshark.org/SampleCaptures#Sample_Captures
  - Some PCAP examples: https://www.netresec.com/?page=PcapFiles
- Event: Syslog, auditd
04/22: Midterm. No class.
04/29: T13: Deep Learning Basics
- Supervised Learning, Unsupervised Learning, ~~Reinforcement Learning~~
- Regression (Sequence Stack and Model, Activation Function, Loss Function, Optimizer)
  - [Wiki] Activation function, Loss Function
  - Optimizer: RUDER's blog [EN], RUDER's paper [EN]
- Classification (Convolution, Sub-sampling)
  - Convolution: Conv2D, Conv1D, kernel size
  - Pooling: MaxPooling2D, AveragePooling2D
  - Structure: Dropout, Flatten, Softmax (see activation function)
    - Overfitting Explanation
  - Loss: sparse_categorical_crossentropy (integer), categorical_crossentropy (one-hot)
- Data Set: MNIST (kaggle)
- Another Example of MNIST, CNN and Image Classification.
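The two loss names listed above differ only in how the labels are encoded. The following plain-Python sketch illustrates the underlying math for a single sample; it is not the Keras implementation (which also handles batches and numerical clipping), and the function names and probability values are assumptions for illustration.

```python
import math


def categorical_crossentropy(one_hot, probs):
    """Cross-entropy for one sample whose label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy for one sample whose label is an integer class index."""
    return -math.log(probs[label])


# Softmax output of a hypothetical 3-class classifier for one sample.
probs = [0.1, 0.7, 0.2]

label = 1                  # integer encoding of class 1
one_hot = [0.0, 1.0, 0.0]  # one-hot encoding of the same class

loss_sparse = sparse_categorical_crossentropy(label, probs)
loss_one_hot = categorical_crossentropy(one_hot, probs)
print(loss_sparse, loss_one_hot)
```

Both functions return the negative log of the probability assigned to the true class, so the choice between the two losses depends only on whether the labels are stored as integers or as one-hot vectors.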
05/06: Latent Space
- Activation Function Visualization [C]
- Auto Encoder [keras.io] [Stanford] [tensorflow2.0]
  - Building Autoencoders in Keras [EN]
  - Example: Convolutional Autoencoder For Image Denoising [EN]
  - Ref: Variational Autoencoder (VAE [TC]), Autoencoder App [TC]
05/13: Latent Space II
- Understanding Latent Space in Machine Learning [EN]
- Application: PCA-based Missing Value Imputation & Anomaly Detection
- Word Embedding, Language Model
- Too many weights? Ridge and Lasso Regression [EN]
05/20: No class. University Anniversary, University Sport Day
- K-means and Self-Organizing Map: [YouTube]
- Sensitivity of a Model to Neural Network Hyperparameter
05/27: Text-based Analysis with Orange
- Orange provides some Text Machine Learning Workflows, see Orange.
  - # Spam Mail Filter: Spambase Data Set
06/03: Intrusion Detection
- An Introduction to Intrusion Detection by Aurobindo Sundaram.
- R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, 2002, suppl., pp. 27-30.
- Additional material: E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.
- Security Datasets: Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.
06/10: Anomaly Detection
- V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
- A. Lakhina, M. Crovella and C. Diot, "Diagnosing Network-Wide Traffic Anomalies," ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 219-230, 2004.
- R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning For Network Intrusion Detection," in Proc. IEEE Symposium on Security and Privacy, 2010, pp. 305-316.
- Here is an example of Network Analysis by Orange. But the "Network" here is not "Computer Network".
06/17: Language Model
- Revisit language model
  - Word Embeddings, Word2Vec [ref: EN, TC]
  - Keras Embedding Layer
- Use language model to classify security-related articles.
  - Transformer and BERT
- Attack Lifecycle
  - Google: attack life cycle
  - MITRE ATT&CK
06/24: Final exam.
- FINAL announced. Due 2021/06/24 (Thu) 09:00 UTC+8. Please follow the instructions in the colab file. You may ask questions in our class forum https://groups.google.com/g/nccu-ds4s. Note that you MUST finish the FINAL alone. You CANNOT share/discuss the FINAL with anyone else. We highly recommend that you finish the FINAL as early as possible. Do not wait until the last minute.

Lab/Assignment (Spring 2021)

03/24: #HW01: Orange
- Network Intrusion Detection
  - https://www.kaggle.com/sampadab17/network-intrusion-detection-using-python
- You need to include the following mechanisms by Orange (but not limited to)
  - feature engineering
  - at least 5 different models
  - k-fold validation
  - evaluation metrics
- Due before the class (04/01 09:00 UTC+8). Submit your classification result (a PDF file) to the TA. The TA will send you the submission instructions by email.
03/31: #HW02: PCA
- PCA&datasets
- Due 04/15 09:00 UTC+8. Send a copy of your colab file to TA.
04/14: #HW03: Static Analysis
- Extract features from 40 malware samples and try to classify/cluster them into 4 groups.
- Due 04/21 09:00 UTC+8. Send a copy of your colab file to TA.
05/06: #HW04: PE image and CNN
- Given a set of malware PE files and their malware family labels, design a CNN deep neural network to classify these PE files into their families. Here is HW04. Copy it to your own colab and finish it.
- Due 05/20 09:00 UTC+8. Send a copy of your colab file to TA.
05/18: #HW05
- Based on your #HW04, design an NN-based classifier (it must include an autoencoder or PCA) to classify malware samples into the correct family. Answer the following questions in your colab file.
  - Q0: Your name, department, ID.
  - Q1: PCA or AutoEncoder? Which is better for data representation? Why?
  - Q2: What is your NN design (dense/convolution/maxpooling/softmax)? Why did you choose it? What is your classification accuracy if PCA/AE is introduced?
  - Q3: Implement certain ML algorithms on the same dataset. Is ML better than NN (or not)? Why?
- Due 06/03 09:00 UTC+8. Send a copy of your colab file to TA.
06/04: #BN01 & #HW06
- #BN01: COVID-19
  - Data Ref: https://github.com/CSSEGISandData/COVID-19
  - 4% of the final points will be awarded.
  - You may see the example made by Taiwan CDC. [link]
  - Send your code/webpage to the TA before the FINAL (6/24 9 am UTC+8).
- #HW06: Network Intrusion Detection
  - Use the Kaggle Data Set. Use the training data only for training and the testing data only for testing.
  - Please design a neural network and include the following concepts.
    - feature engineering, dimension reduction, k-fold testing, validation.
  - Here is an example on Kaggle. Note that this example is not NN.
  - Due 6/17 9 am UTC+8.

Grading Policy

Homework (40%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.
Project (20%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.
Midterm and Final (40%)

The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You MUST read it carefully before submitting your first homework. It tells you exactly how you will be assessed, and it helps to facilitate academic integrity.

Academic Integrity

Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting the source or authorship of those words, ideas, and programs, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally runs into problems and finishes their work late, this class accepts late homework submissions, but with a 15% per-day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.
