Data Science for Cybersecurity (2020)

2021/7/28: Links or files on this page might be lost. Please see the latest version of this course on the Teaching page.

This course serves as a gateway class for students who major in Management Information Systems and are interested in cybersecurity analysis. It focuses on data-analysis topics. Students should become familiar with the tools, algorithms, concepts, and execution environments needed to analyze cybersecurity data, and should learn to act as architects who solve security-related problems with data analysis algorithms and tools. Related security concepts, data analysis theories, research papers, and background knowledge will be covered in class. We will also introduce several security systems that implement data analysis algorithms to achieve their security goals.

Note that students should take Programming Language I and Programming Language II (i.e., two semesters of programming courses) and Business Data Communication (or Computer Networks) before taking this class. The programming language used in this class is Python (but we will NOT cover a Python tutorial), and we will leverage TensorFlow and Keras for AI-based analysis. You MUST be comfortable writing programs, able to find solutions in online documentation and on Stack Overflow, and able to debug on your own. This course REQUIRES students to write Python scripts in class, so please bring your laptop.

Note that this course is designed for students in their third or fourth year of college in the MIS department. If you have taken any advanced AI/ML/DM course, you may want to skip this course.

Announcements (Spring 2020)

  • 2/4: Due to the outbreak of the novel coronavirus, the schedule of this class has been changed as follows. This class starts on March 2 (i.e., a 3-hour lecture will be given on that day); please bring your enrollment document to the class on March 9. ONLY MIS students, double-major students, and MIS minor students are allowed to enroll in this class, subject to the availability of seats in the classroom.

  • 3/23: Homework upload and class roll call QR code. Here.

  • 4/21: NO CLASS on 04/27. The mid-term exam will be announced on 04/27 at 19:00. You MUST send your .ipynb file (as a Colab share link) to the TA via email before 04/30 23:59. Late submission is allowed, but with a 20% per day penalty. You CANNOT discuss the midterm with anyone.

  • 4/27: Midterm. Please copy this Colab file to your Google Drive, and finish it before the due date. Send your shared link to the TA (108356041 at g.nccu.edu.tw).

  • 5/7: #PJ01: Term Project Proposal

      • Please upload a document that includes Dataset (you want to use), Problem (you want to solve), Preprocess (you want to apply), Models (you want to use), and Expectation.

      • Due 05/13 at 23:59. Upload your file here (use your student ID as filename).

  • 6/15: Upload your final project today. Zip your presentation file, document, and code into a single .zip file. Upload your zipped file here. Do not forget to use your student ID as the file name.

  • 6/15: The Final Exam will be announced at 9pm on 6/15. The due date is 6/23 23:59. Please upload here in PDF format. Do not forget to use your student ID as the file name. Download here [.docx].

Class Info

  • Instructor: Shun-Wen Hsiao, NCCU MIS Dept., hsiaom at nccu.edu.tw

  • Lectures: Monday D56 (13:10~16:00)

  • Classroom: #260102 (College of Commerce Building)

  • TA: 108356041 at g.nccu.edu.tw

  • Office Hours: Monday 16:00~17:00 (i.e., after class).

Course Objectives & Learning Outcomes

  • Understand the relationship between Cybersecurity and Security Management.

  • Understand the concept of detection, the profiling subject, profiling techniques, misuse detection, and anomaly detection.

  • Understand the concept of static analysis and dynamic analysis.

  • Become familiar with the data analysis environment, GPU-based computation, and cloud computing.

  • Understand data analysis algorithms: distance functions, similarity functions, classification, clustering, and machine learning algorithms for security applications.

  • Understand the neural network structures and algorithms.

  • Understand the operation of security-related information systems from the perspective of the data-driven system: intrusion detection system, anomaly detection system, spam mail filter system and sequence analysis system.

  • Understand visualized machine learning tools: Orange

Topics (Spring 2020)

  • Security Management

  • Data Analysis Environment

  • Static Analysis

  • Dynamic Analysis

  • Network Trace

  • Linux System Log

  • Supervised Learning

  • Unsupervised Learning

  • Intrusion Detection System

  • Anomaly Detection System

  • Neural Network

  • Spam Mail Filter System

  • Sequence Analysis

  • Data Visualization

Schedule (Spring 2020)

  1. 02/25: No class. Semester starts on 2nd March.

  2. 03/02: Introduction to Cybersecurity

  3. 03/09: Supervised Learning: classification I

    • T04: Table-Based Data and Data Analysis Process

      • Model, Linear Regression, Gradient Descent

      • Training/Testing Process w/ Linear Regression

      • Cost Function: MSE, RMSE (for regression problem)

      • Feature Engineering: Normalization, (Dimension Reduction), Missing Values, (Unbalanced Data), Outlier-1

    • T05: Visualized Machine Learning Tool

      • Orange (class note)

      • Data Analysis Workflow

      • Evaluation: Accuracy, ROC curve, confusion matrix (for classification problem)

      • Testing process: k-fold validation

  4. 03/16: Supervised Learning: classification II

    • T06: Supervised Learning Algorithms

      • (Linear Regression), Logistic Regression, SVM

      • Cost Function: Cross-Entropy

    • T07: Tree-based Classification

      • Entropy, Gini, Chi-square, Information Gain, Variance

      • Decision Tree

      • Random Forest (using Orange)

  5. 03/23: Unsupervised Learning: clustering

    • T08: Unsupervised Learning Algorithms

    • T09: Problematic Data

      • Unbalanced (Imbalanced) Data

      • Outlier-2 (PCA-based)

      • Overfitting and Underfitting, Regularization

  6. 03/30: T10: Static Analysis

  7. 04/06: Dynamic Analysis

    • T11: System Call and API Call

    • Profiler: Cuckoo

    • S. Forrest, S. A. Hofmeyr, A. Somayaji and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. IEEE Symposium on Security and Privacy, 1996.

  8. 04/13: Trace and Log

  9. 04/20:

  10. 04/27: (late) Midterm (20%). NO CLASS. The exam will be announced on 04/27 at 19:00. You MUST send your .ipynb file (as a Colab share link) to the TA via email before 04/30 23:59. Late submission is allowed, but with a 20% per day penalty.

  11. 05/04: Intrusion Detection

    • An Introduction to Intrusion Detection by Aurobindo Sundaram.

    • R. A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," Computer, vol. 35, no. 4, supl., pp. 27--30, 2002.

    • Additional material: E. Hodo et al., "Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey," arXiv:1701.02145 [cs.CR], Jan 2017.

    • Security Datasets: Awesome Machine Learning for Cyber Security, SecRepo.com - Samples of Security Related Data, vizsec.org, NSL_KDD.

  12. 05/11: Anomaly Detection

    • V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, July 2009.

    • A. Lakhina, M. Crovella and C. Diot, "Diagnosing Network-Wide Traffic Anomalies," ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 219--230, 2004.

    • R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning For Network Intrusion Detection," in Proc. IEEE Symposium on Security and Privacy, 2010, pp. 305-316.

  13. 05/18: Text-based Analysis with Orange

  14. 05/25: T13: Deep Learning Basics

  15. 06/01: Latent Space

  16. 06/08: Latent Space II

    • K-means and Self-Organizing Map: [YouTube]

    • Word Embedding, Language Model

  17. 06/15: Project Presentation

  18. 06/22: Final
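
As a quick reference for the cost functions named under T04 and T06 in the schedule above (MSE and RMSE for regression, cross-entropy for classification), here is a minimal sketch in plain Python. It is not taken from the course materials: the function names and the `eps` clipping value are illustrative assumptions, and in practice a library implementation would normally be used instead.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy for labels in {0, 1} and predicted
    probabilities in [0, 1]. Probabilities are clipped by eps (an assumed
    value) so that log(0) is never evaluated."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


if __name__ == "__main__":
    print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

A perfect prediction gives an MSE of zero, and a classifier that outputs 0.5 for every sample has a binary cross-entropy of ln 2 regardless of the labels, which is a useful baseline when reading training logs.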

Lab/Assignment (Spring 2020)

  1. 03/09: #HW01: ZeroAccess

    • ZeroAccess is a cyber attack. We provide some data about ZeroAccess incidents (longitude and latitude) along with geographical information.

    • Please tell me: do UFOs cause ZeroAccess infections?

    • Due at the class (03/23).

  2. 03/09: #BN01: COVID-19

  3. 03/16: #HW02: Orange

  4. 03/23: #HW03: PCA&datasets

    • Due at the class (04/06).

  5. 04/06: #HW04: Static Analysis

    • Extract features from 40 malware samples and try to classify/cluster them into 4 groups.

    • Due at the class (04/13).

  6. 04/13: #HW05

    • Try to cluster these dynamic analysis profiles: hooklog419

    • Due at the class (04/20).

  7. 04/20: #HW06

  8. 04/27: #PJ01, #BN02

    • #PJ01: Term Project Proposal

      • Please upload a document that includes Dataset (you want to use), Problem (you want to solve), Preprocess (you want to apply), Models (you want to use), and Expectation.

      • Due 05/13 at 23:59. Upload your file here (use your student ID as filename).

    • #BN02: CORD-19-research-challenge

  9. 5/18: #HW07

    • Spam mail detector.

    • Please upload your spam mail and normal mail to the repository. Your assignment is to create a spam detector based on all the normal and spam emails uploaded by the students.

    • Due at the class (05/25). Upload Here.

  10. 5/25: #HW08

    • You are given a set of malware PE files and their malware family labels.

    • Please design a deep neural network to classify these PE files into their families.

    • Here is the HW08. Copy it to your own Google Drive and finish it.

    • Due at the class (06/01). Upload Here.

  11. 6/1: #HW09

    • Based on your #HW04 and #HW08 (#HW08 has been updated as well), try to design an NN-based classifier (it must include an autoencoder) that classifies malware samples into the correct family.

      • Q1: PCA or AutoEncoder? Which is better for data representation?

      • Q2: What is your NN design? What is your classification accuracy?

      • Q3: Implement some ML algorithms on the same dataset. Which is better, ML or NN?

    • Due at the class (06/08). Upload Here.

    • 6/5: Due to the data set problem, #HW09 will now be due at the class (06/15). The new data set will be announced on 6/8 (please see the new #HW08 as well). Please spend some time on your final project first. Upload your #HW09 Here. #HW09 is our last homework.

    • 6/15: Upload your final project today. See announcement.

Grading Policy

  • Homework (30%): programming exercises and essays. You MUST see the ACADEMIC INTEGRITY section before taking this class.

  • Class Participation (10%): attendance, discussion. Students are expected to attend classes and participate in class discussions. It’s important that you attend and participate in class; our class meets only once a week, so missing one class represents a substantial portion of the semester. If there are special circumstances requiring you to be out of class, please email me/TA BEFORE class. You should come to class prepared and on time. You get ONE freebie absence. Your second absence is excusable in a dire emergency (e.g., illness, family emergency, flood, volcano, locusts, etc). A third absence can mean you fail the class.

  • Project (20%): students need to write an analysis program on a security-related data set to demonstrate their understanding of security issues and their data analysis skills. A proposal, a report, a presentation, and code uploaded to GitHub are required.

  • Midterm and Final (40%)
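
The weights above can be combined as in the sketch below. This is an illustration, not the instructor's official formula. It assumes every component is scored on a 0 to 100 scale and that the 40% exam share is split evenly between midterm and final, which is inferred from the schedule listing the midterm at 20%. The names `WEIGHTS` and `final_score` are hypothetical.

```python
# Component weights from the grading policy. The even midterm/final split
# is an assumption inferred from "Midterm (20%)" in the schedule.
WEIGHTS = {
    "homework": 0.30,
    "participation": 0.10,
    "project": 0.20,
    "midterm": 0.20,
    "final": 0.20,
}


def final_score(scores):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "homework": 85,
        "participation": 100,
        "project": 90,
        "midterm": 70,
        "final": 80,
    }
    print(final_score(example))
```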


The Problem Solving Through Inquiry and Data Analysis rubric can be found here. You MUST read it carefully before submitting your first homework. It tells you exactly how you will be assessed and helps uphold academic integrity.

Academic Integrity

  • Plagiarism is a serious breach of academic trust. In academic work, our words, ideas, and programs are the value of our work, so turning in someone else’s work as if it were your own is a form of theft. When you use someone else’s words, ideas, or programs without crediting their source or authorship, you are plagiarizing. So here’s the bottom line: submit original work only, and give credit for any ideas, writing, words, or programs that come from someone other than you. Plagiarized work will automatically receive a “0” or “F” for the assignment.

  • Since cheating usually arises out of desperation, and everyone has an occasional problem and finishes their work late, this class accepts late homework submissions, but with a 15% per day penalty. We encourage you to complete your homework rather than drop it. Oral discussion with classmates, the TA, and the lecturer is welcome, but you MUST NOT share any of your code in any form.
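
To make the late penalty concrete, here is a small sketch of one possible reading of the rule. The page gives only the rates (15% per day for homework, 20% per day for the midterm), so the rest is assumed: the penalty is taken as a fraction of the earned score, any part of a day counts as a full day, and the score never drops below zero. The function name `late_score` is hypothetical.

```python
import math


def late_score(raw_score, days_late, penalty_per_day=0.15):
    """Score after a linear per-day late penalty.

    Assumptions (not stated on this page): partial days round up to a
    full day, and the result is floored at zero.
    """
    days = max(0, math.ceil(days_late))
    factor = max(0.0, 1.0 - penalty_per_day * days)
    return raw_score * factor


if __name__ == "__main__":
    print(late_score(80, 0))        # on time, no penalty
    print(late_score(80, 2))        # homework, two days late
    print(late_score(80, 2, 0.20))  # midterm rate, two days late
```

Under these assumptions, a homework submission earns nothing once it is seven or more days late, since seven days at 15% exceeds 100%.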