Advanced Network Technologies and Services Laboratory

Advanced Information System Development: Data Science for Cybersecurity (2025)

This course is an introductory class for students interested in cybersecurity analysis using machine learning methods. Students will become familiar with the tools, algorithms, concepts, and execution environment needed to analyze cybersecurity data, and will learn to act as architects who solve security-related problems with data analysis algorithms and tools. The class covers related security concepts, data analysis theories, research papers, and background knowledge. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.

Note that students should have taken programming courses beforehand, such as Programming Language I/II. The programming language used in this class is Python (however, we will NOT cover any Python language tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be familiar with writing programs, be able to search for solutions in online documentation and on Stack Overflow, and debug on your own. This course REQUIRES students to write Python scripts for homework and projects.

Note that this course is designed for MIS graduate students taking Advanced Information System Development. The class runs for 16 weeks.

Announcements (Spring 2025)

-/-: We have a forum (https://groups.google.com/g/nccu-ds4s) for this class. If you have any questions (e.g., enrollment, homework, project, ...), please post them directly in the forum. The TA and I will respond there.
04/14: Midterm announced! Download the midterm file here! (Opens a Colab page.) Due 2025/4/22 23:59.
06/11: Final announced. (https://colab.research.google.com/drive/1EyfwLYEjIn1xc-WnQHHhwtW2gNrfXuBU)

Class Info

Instructor: Shun-Wen Hsiao, NCCU MIS Dept., <hsiaom at nccu.edu.tw>
Lectures: Wednesday D56 (13:10~16:00 UTC+8)
Course ID: 356813001, 356017001, 791020001
Classroom: NCCU College of Commerce Building Room #260310
TA: 111356509, 112356021 at g.nccu.edu.tw
Forum: https://groups.google.com/g/nccu-ds4s
GitHub: https://github.com/hsiaom26/DS4CS-24
Homework Submission: http://hsiaom.nccu.edu.tw:8888/
Office Hours: By appointment.

Course Objectives & Learning Outcomes

Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.
Understand the concept of static analysis and dynamic analysis.
Understand the data analysis algorithms: distance function, similarity function, classification, clustering, and machine learning algorithms for security applications.
Understand the neural network structures and algorithms.
Understand the use of language models to analyze security-related data.
Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system, and sequence analysis system.
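
As a small illustration of the distance and similarity functions named in the objectives above, here is a minimal sketch in plain Python. The function names and the sample vectors are hypothetical and are not taken from the course materials; the zero-vector and empty-set conventions are choices made for this sketch only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(s, t):
    """Overlap of two sets: size of intersection divided by size of union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(s & t) / len(s | t)


# Hypothetical per-sample counts of three system calls for two programs.
sample_a = [12, 0, 3]
sample_b = [10, 1, 4]
print(euclidean_distance(sample_a, sample_b))
print(cosine_similarity(sample_a, sample_b))
print(jaccard_similarity({"open", "read"}, {"read", "write"}))
```

Euclidean distance is sensitive to the overall magnitude of the counts, while cosine similarity compares only their proportions, so the two can rank the same pair of samples differently.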

Topics (Spring 2025)

Data Analysis Environment
Static Analysis
Dynamic Analysis
Network Trace
Linux System Log
Supervised Learning
Unsupervised Learning

Intrusion Detection System
Anomaly Detection System
Neural Network
Sequence Analysis
Language Model

References (Spring 2025)

Machine Learning for Cyber Security
Data Science for Cyber-Security
Awesome Machine Learning for Cyber Security
Python Data Science Handbook
Malware Data Science: Attack Detection and Attribution, Joshua Saxe and Hillary Sanders, No Starch Press, Nov. 2018.
Python for Data Analysis, Wes McKinney, O'Reilly Media, October 2012.
https://machinelearningmastery.com

Schedule (Spring 2025)

W1 (02/19): Regression (M03)
- Model and Data Table
- Linear Regression (MSE, Gradient Descent)
W2 (02/26): Classification (M04)
- Logistic Regression (Cross-Entropy)
- Support Vector Machine
- Evaluation
W3 (03/05): Tree (M06)
- Tree and Random Forest
- Entropy, Information Gain, Gini, Chi, Variance
W4 (03/12): Clustering (M07)
- Distance
- K-means
- Hierarchical Clustering
- DBSCAN
- Malware Calls (f14s1940_callonly_tfds)
W5 (03/19): Problematic Data (M08, M09)
- Dimension Reduction, PCA (M08)
- Problematic Data (M09)
W6 (03/26): Neural Network
- Basics (N01)
- Convolution (N02)
(04/02): No class.
W7 (04/09): Recurrent NN (N03)
- Static Analysis: Windows PE file and image analysis (D01)
- Understanding LSTM Networks (N03-1)
  - LSTM, GRU, ResNet
- Dynamic Analysis: Malware call and sequence analysis (D02)
- Text classification with an RNN (N03-2)
W8 (04/16): Midterm, no class. (Take home exam, due before 04/23.)
W9 (04/23): Latent Space
- Auto-Encoder (N04)
  - denoising, anomaly detection
  - convolutional, variational AE (N04-2)
- Activation Function (N05)
  - activation
  - optimizing gradient-descent
W10 (04/30): Language Model (N06)
- word2vec (cbow, skip-gram), fastText (supervised, unsupervised)
- Transformer, Self-Attention, BERT
W11 (05/07): Language Model
- Basic text classification (N06-2), Classify text with BERT (N06-3)
- HuggingFace NLP Course (N06-4)
  - 1. Transformer Models, 2. Using Transformers, 3. Fine-Tuning a Pretrained Model
  - 7-3. Fine-tuning a masked language model
- Packet Analysis (D03)
W12 (05/14): Language Model and Others
- Security Management
(05/21): No class. University Anniversary.
W13 (05/28): Large Language Model
- NLP Course, Diffusion Course
  - https://huggingface.co/learn/nlp-course/
  - https://huggingface.co/learn/diffusion-course/
W14 (06/04): Anomaly Detection
- Variational Autoencoder (N04-2)
- V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.
- Novelty and Outlier Detection
  - One-class SVM
- Self-Organizing Map
W15 (06/11): Project Demo
- Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.
W16 (06/18): Final, no class. (Take home exam, due 06/18 at 23:59)

Assignment (Spring 2025)

- Five homework assignments are expected. The announcement dates are as follows.
- 03/05 H-NID
  - https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection
- 03/26 H-calls4famPlusPCA
  - Find the homework template in the above link. You may use any tools to analyze the data (and show/post the results in ipynb).
  - Finish it and upload your ipynb file to the homework system before April 8th 23:59.
- 04/14 midterm announcement
  - Download midterm file here! Due 2025/4/22 23:59. (PE by CNN)
- 04/30
  - 5/6 Announced. Due 5/21 23:59.
  - H-calls4fam_TFDS_rnn
- 05/14 (5/16 announced)
  - HuggingFace H-calls4famLM
    - 1) This homework is similar to H-calls4fam_TFDS_rnn, but here you must use a language model from Hugging Face. You may use any language model.
    - Make sure you print the classification accuracy obtained in the previous homework (i.e., rnn, 'skip-gram', 'cbow', or 'SBERT') and the result of your newly selected language model.
    - 2) The second part of this homework is re-pre-training. You have to use MLM (masked language modeling, shown at the following URL) to improve the language model, and then perform classification again to show whether the results improve.
      - https://huggingface.co/learn/llm-course/en/chapter7/3
- 05/28, Announced 6/3
  - H-LM_ATT&CK
  - Due 06/18 23:59 (same as FINAL)
- 06/04 project announcement
  - A one-week sprint project.
  - Try to analyze a security-related dataset with a language model (and/or other models).
  - You should upload a PDF as the final report that contains
    - title, goal, downstream task(s)
    - where we can find your complete code
    - dataset introduction
    - data preprocessing
    - model used
    - results
  - Note that you can output latent vectors from Python and use Orange for later analysis. If so, include a screenshot of the Orange workflow in your PDF report.
- 06/11 Final announcement. https://colab.research.google.com/drive/1EyfwLYEjIn1xc-WnQHHhwtW2gNrfXuBU
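
To illustrate the masking step behind the MLM re-pre-training part of H-calls4famLM, here is a minimal sketch in plain Python. It is not the homework template: the function name, the API-call tokens, and the 15% default rate are assumptions made for illustration, and the sketch always substitutes "[MASK]" rather than using BERT's mixed replacement scheme. In practice, the Hugging Face chapter linked in that homework handles this step with DataCollatorForLanguageModeling.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens so a model can be trained to predict them.

    Returns (masked, labels): masked is the input sequence with some tokens
    replaced by MASK_TOKEN; labels holds the original token at each masked
    position and None everywhere else (positions the loss would ignore).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            labels[i] = tok
    return masked, labels


# Hypothetical API-call sequence standing in for one malware trace.
calls = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtClose",
         "NtCreateKey", "NtSetValueKey", "NtClose"]
masked, labels = mask_tokens(calls, mask_prob=0.3, seed=42)
print(masked)
print(labels)
```

During re-pre-training, the model sees the masked sequence and is scored only on how well it recovers the tokens stored in labels, which is how it adapts to the vocabulary of call traces before the classification step is repeated.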

Grading Policy

Homework (50%): programming exercises and essays. You MUST read the ACADEMIC INTEGRITY section before taking this class.
Project (10%): students write an analysis program on a security-related dataset to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code on GitHub are required.
Midterm (20%)
Final (20%)

The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You SHOULD read it carefully before submitting your first homework. It tells you exactly how you will be assessed, which also helps support academic integrity.

Academic Integrity

Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting their source or authorship, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.
Since cheating usually arises out of desperation, and everyone occasionally runs into problems and finishes work late, this class accepts late homework submissions with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.
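
One possible reading of the 15% per day late penalty is sketched below. This page does not say how the penalty is computed, so the details are assumptions: the deduction is taken from the score the submission earns, it grows linearly with the number of late days, and the result never drops below zero. The function name is hypothetical.

```python
def late_score(raw_score, days_late, penalty_per_day=0.15):
    """Score after a linear late penalty (an assumed reading of the policy)."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    # Assumption: each late day removes 15% of the earned score, floored at 0.
    factor = max(0.0, 1.0 - penalty_per_day * days_late)
    return raw_score * factor


for days in range(0, 8):
    print(days, round(late_score(90, days), 1))
```

Under this reading, a submission keeps some credit through the sixth late day and earns nothing from the seventh day onward.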
