It is common knowledge that the vast majority of thefts in businesses such as retail are committed by employees, and the cyber world is no exception. That is why Insider Threat Analysis is a vital mission for most organizations. However, insider threats are challenging to study or teach due to a lack of meaningful data: employee personnel records contain private information, and many of their electronic footprints contain confidential organizational data. Real data is therefore not easy to obtain.
On the other hand, synthetic data may not be realistic enough to train a Machine Learning model. To address the data availability issue, the Software Engineering Institute developed a high-quality Insider Threat Test Dataset [2] that can be used for education, research, and testing. The dataset has six versions of progressive complexity and sample size. In addition, the synthetic dataset attempts to address the various factors that come into play when evaluating risky behaviors; many of these risks are described in the diagram [3] on the right.
The dataset is designed to be suitable for a variety of Insider Threat Detection tasks. Because ground-truth labels are provided in the "answers" folder, that information can be merged onto the activity data to produce a single labeled dataset. One such method is described in a Towards Data Science article by Dennis Chow titled "Insider Threat Detection with AI Using Tensorflow and RapidMiner Studio." [4]
In addition, more common unsupervised learning algorithms can be applied. The files can be combined to increase dimensionality, categorical fields can be encoded, and the textual data contained in email.csv can be tokenized, as sketched below.
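As a minimal illustration of the tokenization step (the "content" column name and the file path are assumptions about the email.csv schema, not confirmed here):

```python
# Minimal tokenization sketch; column name "content" and path are assumptions.
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

emails = pd.read_csv("data/email.csv")
tokenizer = Tokenizer(num_words=10000)                 # keep the 10k most frequent tokens
tokenizer.fit_on_texts(emails["content"].astype(str))
sequences = tokenizer.texts_to_sequences(emails["content"].astype(str))
```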
The diagram on the left [5] illustrates supervised, semi-supervised, and unsupervised model architectures.
I have created multiple Anaconda environments for data exploration:
A PySpark environment
An H2O environment
A Tensorflow/Keras environment
I have loaded my data into a Pandas DataFrame using PySpark
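A minimal sketch of that load step, assuming the CERT CSV files sit under a local data/ directory:

```python
# Read one CERT activity file with PySpark, then convert to Pandas for exploration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insider-threat-eda").getOrCreate()

# Path is illustrative; email.csv is one of the CERT activity files.
email_sdf = spark.read.csv("data/email.csv", header=True, inferSchema=True)
email_pdf = email_sdf.toPandas()   # hand off to a Pandas DataFrame
```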
I am interested in exploring the use of genetic algorithms [6], which have proven effective for false positive reduction. Code will be uploaded to GitHub and can be accessed at the link that follows:
The dataset I selected separates the activity data (email, web browser, and removable device) from the true positive data, so the data has to be merged and labeled to denote true positives and true negatives. Because of the size of the data, I chose a subset to streamline processing. I selected Apache Spark for its much faster performance compared to Pandas and to create a scalable data pipeline.
I have selected a specific scenario to test my model on for simplicity
I have joined the different types of activity true negatives with the true positives to create a balanced set
I have explored, transformed, and encoded my data to prepare for model implementation
I have split my data into train, test and validation sets
Added a utility function provided by Tensorflow.org to prepare the data
Selected my labels and features
Created a feature layer
Created and compiled the binary model
Ran the model and observed the outcome (a condensed sketch of these steps follows below)
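A condensed, hedged sketch of those steps, adapted from the tensorflow.org structured-data tutorial; the feature names ("hour", "activity") and label name ("insider") are illustrative assumptions rather than the exact CERT schema:

```python
# Sketch of the Keras binary-classification pipeline described above.
import tensorflow as tf
from tensorflow import feature_column

def df_to_dataset(dataframe, label_col="insider", shuffle=True, batch_size=32):
    """Utility from the tensorflow.org tutorial: wrap a DataFrame in a tf.data.Dataset."""
    df = dataframe.copy()
    labels = df.pop(label_col)
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(df))
    return ds.batch(batch_size)

# Feature layer built from a numeric column and a one-hot encoded categorical column.
feature_columns = [
    feature_column.numeric_column("hour"),
    feature_column.indicator_column(
        feature_column.categorical_column_with_vocabulary_list(
            "activity", ["email", "http", "device"])),
]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

# Simple binary classifier with a sigmoid output.
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_ds = df_to_dataset(train_df); val_ds = df_to_dataset(val_df, shuffle=False)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```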
Please see the screenshot below:
My best performing model was only about 60% accurate on balanced data, which I attribute to the fact that the data is synthetic and carries little meaningful signal.
I then moved on to compare several other algorithms, namely:
Logistic Regression
Naive Bayes
K-Nearest Neighbors
Decision Tree
Random Forest
On the left is the plot comparing the ROC curves.
As we can see, Decision Tree and Random Forest produced the best ROC curves, though they likely overfit.
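A sketch of how such a comparison could be produced with scikit-learn, assuming the X_train/X_test/y_train/y_test split from the earlier step:

```python
# Fit each candidate model and overlay its ROC curve with an AUC label.
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.2f})")

plt.plot([0, 1], [0, 1], "k--")                # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```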
My goal for the next delivery is to streamline some of the data processing in Apache Spark and run the model through an unsupervised technique such as Isolation Forest.
In this step I have:
Loaded a random sample of each of my data types into a PySpark DataFrame
Created an aggregate true positives set in bash and loaded it into a PySpark DataFrame
Joined all true negative and true positive data into a unified unbalanced sample with a 90/10 distribution
Transformed and encoded the data in PySpark and Pandas (see the labeling sketch below)
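A hedged sketch of the labeling join in PySpark; activity_df, answers_df, and the shared "id" column are illustrative assumptions about how the merged data is organized:

```python
# Label events that appear in the "answers" data as true positives (insider = 1).
from pyspark.sql import functions as F

labeled = (
    activity_df
    .join(answers_df.select("id").withColumn("insider", F.lit(1)),
          on="id", how="left")
    .fillna(0, subset=["insider"])   # events absent from "answers" are true negatives
)

# Downsample negatives toward a roughly 90/10 negative-to-positive distribution.
pos = labeled.filter("insider = 1")
neg = labeled.filter("insider = 0")
fraction = (9 * pos.count()) / neg.count()
sample = pos.unionByName(neg.sample(fraction=fraction, seed=42))
```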
My next goal is to use that information to research a good model for the unsupervised learning portion of my project.
As tree algorithms have performed well on the supervised data, I will look at the Isolation Forest model for unsupervised anomaly detection.
Isolation Forest is based on the Decision Tree algorithm. It isolates outliers by randomly selecting a feature from the given set of features and then randomly selecting a split value between the minimum and maximum values of that feature. This random partitioning produces noticeably shorter tree paths for anomalous data points, distinguishing them from the rest of the data.
The main idea behind the Isolation Forest algorithm for anomaly detection is that fewer partitions are needed to isolate an outlier, as demonstrated in the diagram on the left [10]: it takes fewer partitions to isolate the outlier x0 than the inlier xi.
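A toy example makes the intuition concrete: scikit-learn's score_samples returns lower values for points that are easier to isolate. The data below is synthetic and purely illustrative:

```python
# The far-away point x0 receives a much lower (more anomalous) score than an inlier xi.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
cluster = rng.normal(0, 1, size=(500, 2))   # dense inlier cluster
x0 = np.array([[8.0, 8.0]])                 # isolated outlier
xi = cluster[:1]                            # a point inside the cluster

iso = IsolationForest(random_state=0).fit(cluster)
print(iso.score_samples(x0), iso.score_samples(xi))  # x0 scores far lower
```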
Below is an outline of the Python implementation steps [11]:
Since my goal in this exercise is to utilize an unsupervised algorithm, I first drop the label column
Instantiate the model, fit it, and predict (with contamination set to 10%)
Create an outlier list that can be exported for further analysis
Count the number of records in each class (anomaly vs. non-anomaly); these steps are sketched below
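A minimal sketch of those steps, assuming the merged data lives in a Pandas DataFrame df with the supervised label in a column named "insider" (the label name is an assumption):

```python
# Unsupervised Isolation Forest run over the merged, encoded data.
from sklearn.ensemble import IsolationForest

X = df.drop(columns=["insider"])       # drop the label for the unsupervised run

iso = IsolationForest(contamination=0.10, random_state=42)
df["anomaly"] = iso.fit_predict(X)     # -1 = anomaly, 1 = non-anomaly

outliers = df[df["anomaly"] == -1]     # list that can be exported for further analysis
print(df["anomaly"].value_counts())    # record count per class
```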
As a result, I get 6,479 anomalous events and 58,309 non-anomalous events. In the real world, anomalous events are not as common and typically account for less than one percent of the data, making it possible for analysts to examine them (keeping a human in the loop).
Code and sample data are available on GitHub.
Below is an outline of the Python plotting steps:
To visualize my model's performance, I scale my data and use PCA to reduce it to a 3-D set
I use Matplotlib to plot my dataset on a 3-D axis, as sketched below
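A sketch of the plotting steps, reusing X and the "anomaly" column from the Isolation Forest step above:

```python
# Scale, reduce to three principal components, and plot anomalies as red crosses.
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X_scaled)

mask = (df["anomaly"] == 1).to_numpy()   # 1 = non-anomaly, -1 = anomaly
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_3d[mask, 0], X_3d[mask, 1], X_3d[mask, 2], s=4, label="normal")
ax.scatter(X_3d[~mask, 0], X_3d[~mask, 1], X_3d[~mask, 2],
           c="red", marker="x", label="anomaly")
ax.legend()
plt.show()
```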
Though my data is synthetic and my scatter plot is unusually geometric, my model clearly identifies visual outliers, denoted by red crosses (please see the figure below).
In summary, I have explored various supervised and unsupervised algorithms using the CERT synthetic dataset. My goal was to create a scalable solution using Apache Spark. I was able to transform and encode the data for analysis using PySpark but encountered issues performing the model implementation steps, so I completed the analysis by converting the Spark DataFrame to a Pandas DataFrame. I was satisfied with the Deep Learning and Isolation Forest model performance. My future goal is to create a fully scalable solution built exclusively on Apache Spark, using Spark MLlib and iForest. I am also interested in utilizing a Deep Learning autoencoder (encode and decode) for false positive and noise reduction to minimize human involvement.
Code can be found on my GitHub.