It is common knowledge that the vast majority of thefts in businesses such as retail are committed by employees, and the cyber world is no exception. That is why Insider Threat Analysis is a vital mission for most organizations. However, insider threats are challenging to study or teach due to a lack of meaningful data: employee personnel records contain private information, and many of their electronic footprints contain confidential organizational data. Real data is therefore not easy to obtain.
On the other hand, synthetic data may not be realistic enough to train a Machine Learning model. To address the data availability issue, the Software Engineering Institute developed a high-quality Insider Threat Test Dataset [2] that can be used for education, research, and testing. The dataset has six versions of progressive complexity and sample size. In addition, the synthetic dataset attempts to address the various factors that come into play when evaluating risky behaviors; many of these risks are described in the diagram [3] on the right.
The dataset is designed to be suitable for a variety of Insider Threat Detection tasks. Because ground-truth labels are provided in the "answers" folder, that information can be merged onto the activity data to produce a single labeled dataset. One such method is described in a Towards Data Science article by Dennis Chow titled "Insider Threat Detection with AI Using Tensorflow and RapidMiner Studio." [4]
In addition, more common unsupervised learning algorithms can be applied. The files can be combined to increase dimensionality, categorical fields can be encoded, and the textual data contained in email.csv can be tokenized, as sketched below.
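As a minimal illustration of the tokenization step (the "content" column name and the file path are assumptions about the email.csv schema, not confirmed here):

```python
# Minimal tokenization sketch; column name "content" and path are assumptions.
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

emails = pd.read_csv("data/email.csv")
tokenizer = Tokenizer(num_words=10000)                 # keep the 10k most frequent tokens
tokenizer.fit_on_texts(emails["content"].astype(str))
sequences = tokenizer.texts_to_sequences(emails["content"].astype(str))
```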
The diagram on the left [5] illustrates supervised, semi-supervised, and unsupervised model architectures.
I have created multiple Anaconda environments for data exploration:
A PySpark environment
An H2O environment
A Tensorflow/Keras environment
I have loaded my data into a Pandas DataFrame using PySpark
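A minimal sketch of that load step, assuming the CERT CSV files sit under a local data/ directory:

```python
# Read one CERT activity file with PySpark, then convert to Pandas for exploration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insider-threat-eda").getOrCreate()

# Path is illustrative; email.csv is one of the CERT activity files.
email_sdf = spark.read.csv("data/email.csv", header=True, inferSchema=True)
email_pdf = email_sdf.toPandas()   # hand off to a Pandas DataFrame
```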
I am interested in exploring the use of genetic algorithms [6], which have proven effective for false positive reduction. Code will be uploaded to GitHub and can be accessed at the link that follows:
The dataset I selected separates the activity data (email, web browser, and removable device) from the true positive data, so the data has to be merged and labeled to denote true positives and true negatives. Because of the size of the data, I chose a subset to streamline processing. I selected Apache Spark for its much faster performance compared to Pandas and to create a scalable data pipeline.
I have selected a specific scenario to test my model on for simplicity
I have joined the different types of activity true negatives with the true positives to create a balanced set
I have explored, transformed, and encoded my data to prepare for model implementation
I have split my data into train, test and validation sets
Added a utility function provided by Tensorflow.org to prepare the data
Selected my labels and features
Created a feature layer
Created and compiled the binary model
Ran the model and observed the outcome (a condensed sketch of these steps follows below)
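A condensed, hedged sketch of those steps, adapted from the tensorflow.org structured-data tutorial; the feature names ("hour", "activity") and label name ("insider") are illustrative assumptions rather than the exact CERT schema:

```python
# Sketch of the Keras binary-classification pipeline described above.
import tensorflow as tf
from tensorflow import feature_column

def df_to_dataset(dataframe, label_col="insider", shuffle=True, batch_size=32):
    """Utility from the tensorflow.org tutorial: wrap a DataFrame in a tf.data.Dataset."""
    df = dataframe.copy()
    labels = df.pop(label_col)
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(df))
    return ds.batch(batch_size)

# Feature layer built from a numeric column and a one-hot encoded categorical column.
feature_columns = [
    feature_column.numeric_column("hour"),
    feature_column.indicator_column(
        feature_column.categorical_column_with_vocabulary_list(
            "activity", ["email", "http", "device"])),
]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

# Simple binary classifier with a sigmoid output.
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_ds = df_to_dataset(train_df); val_ds = df_to_dataset(val_df, shuffle=False)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```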
Please see the screenshot below:
My best performing model was only about 60% accurate on balanced data, which I attribute to the fact that the data is synthetic and carries little meaningful signal.
I then moved on to compare several other algorithms, namely:
Logistic Regression
Naive Bayes
K-Nearest Neighbors
Decision Tree
Random Forest
On the left is the plot comparing the ROC curves.
As we can see, Decision Tree and Random Forest produced the best ROC curves, though they likely overfit.
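A sketch of how such a comparison could be produced with scikit-learn, assuming the X_train/X_test/y_train/y_test split from the earlier step:

```python
# Fit each candidate model and overlay its ROC curve with an AUC label.
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.2f})")

plt.plot([0, 1], [0, 1], "k--")                # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```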
My goal for the next delivery is to streamline some of the data processing in Apache Spark and run the model through an unsupervised technique such as Isolation Forest.
In this step I have:
Loaded a random sample of each of my data types into a PySpark DataFrame
Created an aggregate true positives set in bash and loaded it into a PySpark DataFrame
Joined all true negative and true positive data into a unified unbalanced sample with a 90/10 distribution
Transformed and encoded the data in PySpark and Pandas (see the labeling sketch below)
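A hedged sketch of the labeling join in PySpark; activity_df, answers_df, and the shared "id" column are illustrative assumptions about how the merged data is organized:

```python
# Label events that appear in the "answers" data as true positives (insider = 1).
from pyspark.sql import functions as F

labeled = (
    activity_df
    .join(answers_df.select("id").withColumn("insider", F.lit(1)),
          on="id", how="left")
    .fillna(0, subset=["insider"])   # events absent from "answers" are true negatives
)

# Downsample negatives toward a roughly 90/10 negative-to-positive distribution.
pos = labeled.filter("insider = 1")
neg = labeled.filter("insider = 0")
fraction = (9 * pos.count()) / neg.count()
sample = pos.unionByName(neg.sample(fraction=fraction, seed=42))
```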
My next goal is to use that information to research a good model for the unsupervised learning portion of my project.
As tree algorithms have performed well on the supervised data, I will look at the Isolation Forest model for unsupervised anomaly detection.
Isolation Forest is based on the Decision Tree algorithm. It isolates outliers by randomly selecting a feature from the given set of features and then randomly selecting a split value between the minimum and maximum values of that feature. This random partitioning produces noticeably shorter tree paths for anomalous data points, distinguishing them from the rest of the data.
The main idea behind the Isolation Forest algorithm for anomaly detection is that fewer partitions are needed to isolate an outlier, as demonstrated in the diagram on the left [10]: it takes fewer partitions to isolate the outlier x0 than the inlier xi.
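A toy example makes the intuition concrete: scikit-learn's score_samples returns lower values for points that are easier to isolate. The data below is synthetic and purely illustrative:

```python
# The far-away point x0 receives a much lower (more anomalous) score than an inlier xi.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
cluster = rng.normal(0, 1, size=(500, 2))   # dense inlier cluster
x0 = np.array([[8.0, 8.0]])                 # isolated outlier
xi = cluster[:1]                            # a point inside the cluster

iso = IsolationForest(random_state=0).fit(cluster)
print(iso.score_samples(x0), iso.score_samples(xi))  # x0 scores far lower
```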
Below is an outline of the Python implementation steps [11]:
Since my goal in this exercise is to utilize an unsupervised algorithm, I first drop the label column
Instantiate the model, fit it, and predict (with contamination set to 10%)
Create an outlier list that can be exported for further analysis
Count the number of records in each class (anomaly vs. non-anomaly); these steps are sketched below
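A minimal sketch of those steps, assuming the merged data lives in a Pandas DataFrame df with the supervised label in a column named "insider" (the label name is an assumption):

```python
# Unsupervised Isolation Forest run over the merged, encoded data.
from sklearn.ensemble import IsolationForest

X = df.drop(columns=["insider"])       # drop the label for the unsupervised run

iso = IsolationForest(contamination=0.10, random_state=42)
df["anomaly"] = iso.fit_predict(X)     # -1 = anomaly, 1 = non-anomaly

outliers = df[df["anomaly"] == -1]     # list that can be exported for further analysis
print(df["anomaly"].value_counts())    # record count per class
```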
As a result, I get 6,479 anomalous events and 58,309 non-anomalous events. In the real world, anomalous events are not as common and typically account for less than one percent of the data, making it possible for analysts to examine them (keeping a human in the loop).
Code and sample data are available on GitHub.
Below is an outline of the Python plotting steps:
To visualize my model's performance, I scale my data and use PCA to reduce it to a 3-D set
I use Matplotlib to plot my dataset on a 3-D axis, as sketched below
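A sketch of the plotting steps, reusing X and the "anomaly" column from the Isolation Forest step above:

```python
# Scale, reduce to three principal components, and plot anomalies as red crosses.
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X_scaled)

mask = (df["anomaly"] == 1).to_numpy()   # 1 = non-anomaly, -1 = anomaly
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_3d[mask, 0], X_3d[mask, 1], X_3d[mask, 2], s=4, label="normal")
ax.scatter(X_3d[~mask, 0], X_3d[~mask, 1], X_3d[~mask, 2],
           c="red", marker="x", label="anomaly")
ax.legend()
plt.show()
```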
Though my data is synthetic and my scatter plot is unusually geometric, my model clearly identifies visual outliers, denoted by red crosses (please see the figure below).
In summary, I have explored various supervised and unsupervised algorithms using the CERT synthetic dataset. My goal was to create a scalable solution using Apache Spark. I was able to transform and encode the data for analysis using PySpark but encountered issues performing the model implementation steps, so I completed the analysis by converting the Spark DataFrame to a Pandas DataFrame. I was satisfied with the Deep Learning and Isolation Forest model performance. My future goal is to create a fully scalable solution built exclusively on Apache Spark, using Spark MLlib and iForest. I am also interested in utilizing a Deep Learning autoencoder (encode and decode) for false positive and noise reduction to minimize human involvement.
Code can be found on my GitHub.