(3/27/19)
JSON to PCAP Script: https://github.com/H21lab/json2pcap
Convert the PCAP to JSON using the following command:
tshark -r input.pcap -T json -x > output.json
Then follow the script's instructions. In theory you can edit the JSON however you want (as long as it keeps the expected structure), and the script will convert it back to a PCAP for you.
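A hypothetical sketch of that edit step, assuming the newer tshark JSON layout where "frame_raw" is a list whose first element is the whole packet as a hex string (older tshark versions emit a bare string instead):

import json

# load the tshark dump produced above
with open("output.json") as f:
    packets = json.load(f)

# patch the first byte of the first packet to 0xff (purely illustrative)
frame_raw = packets[0]["_source"]["layers"]["frame_raw"]
frame_raw[0] = "ff" + frame_raw[0][2:]

# write it back out for json2pcap
with open("edited.json", "w") as f:
    json.dump(packets, f)
# then run json2pcap on edited.json per the repo's README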
(3/29/19) KEY FUNCTIONS IN NETZOB
Data Alignment
Parallel Data Alignment
JSON Serialization
Automata Factories?
- Chained States Automata Factory
- One State Automata Factory
Field Split Aligned (takes in an array of hex values and splits it among the fields)
Cluster By Key Field
Cluster By Size
Field Operations - Merge fields, etc.
Field Resetter - resets fields to their original format
Field Split Delimiter
Find Key Fields
Entropy Measurement
Problems and Attempted Mitigations
netzob.Common.NetzobException.NetzobImportException: Error while importing data from source PCAP: This pcap cannot be imported since the layer 2 is not supported (127)
-- one attempted mitigation was editing the library to add IEEE 802.11 protocol importing capabilities, but that breaks too many functions
-- another attempt is to edit the PCAP by converting it to JSON, systematically changing the JSON, and translating it back to PCAP with an Ethernet header using the following script: https://www.h21lab.com/tools/json-to-pcap. Some problems still come up when converting back to PCAP.
A lot of the necessary functions in Netzob deal with hex. The next step will be to try the functions on hex dumps of the captured packets.
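A minimal sketch of that idea, assuming Netzob's RawMessage/Symbol API, which would sidestep the layer-2 check in PCAPImporter entirely (the hex strings are placeholders):

from netzob.all import RawMessage, Symbol

# one hex string per captured packet, e.g. from a tshark/Wireshark hex dump
hex_dumps = ["0200616263", "0200646566"]
messages = [RawMessage(bytes.fromhex(h)) for h in hex_dumps]
symbol = Symbol(messages=messages)
print(symbol)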
(4.23.19)
Some interesting papers on profiling traffic with Machine Learning
Machine Learning based Traffic Classification using Low Level Features and Statistical Analysis: https://pdfs.semanticscholar.org/f835/befe987b7c266123ba7ccbc80f0c42f07322.pdf
A Survey of Machine Learning Based Packet Classification: https://pdfs.semanticscholar.org/c4cf/fd36e1958849e032202c8ae3d0c9a9098962.pdf
(4.25.19) (Explanation of Tutorial for Netzob)
Symbol(messages=<messages read in from the pcap>) — wraps the input as a list of messages of the Netzob-defined class 'AbstractMessage'
Data - content of the message - object
messageType
Date - int
Source - optional string
Destination - optional string
visualizationFunctions
Metadata
semanticTags
Format.splitDelimiter(symbol, ASCII('#')) — split the symbolized messages on the '#' sign (the delimiter may not be '#' in your protocol; use whatever delimiter you can see in the messages)
Symbol._str_debug() — shows the symbol structure (for each field as split by the delimiter)
Example: field-0 has
- Data (Raw=b'CMDidentify', ((0, 88))) —> the interval is the field's size in bits, so this field can be up to 88 bits (11 bytes, the length of 'CMDidentify'). Across the messages, field-0's size tops out at 88, 56, 64, 104, 80, or 48 bits (11, 7, 8, 13, 10, or 6 bytes).
Symbol — printing the symbol shows a simplified structure of the messages split by the delimiter
Format.clusterByKeyField(symbol, symbol.fields[0]) — cluster the messages by the terms in field-0. Printing the result shows how many keys the key field contains and their values.
Format.splitAligned(sym.fields[2], doInternalSlick=True) — split the specified field according to the variations in the message bytes. Relies on a sequence alignment algorithm: it aligns the field you want analyzed and tries to determine static and dynamic sub-fields for each key value.
RelationFinder.findOnSymbol(sym) — for each Key in the KeyField, it will find a relationship between some fields (if it exists)
Example:
SizeRelation, between 'value' of:
Field
[[b'\n', b'\t']]
and 'size' of:
Field-Field-Field
[[b'aStrongP', b'myPass'], [b'wd', b'wd'], [b' ', b'!']]
This means a relation was found between the value of the first field (e.g. '\n') and the size of the next three fields ('aStrongP', 'wd', ' ').
The values are paired because both come from messages associated with the same key.
Format.resetFormat(symbol) — if you want to undo all the formatting to analyze the patterns
Format.mergeFields(f1, f2) — if you know a certain part of the data has a fixed length, you can merge two fields that the splitAligned algorithm split apart
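A condensed sketch of the steps above, assuming the Netzob 1.0 API as shown in its tutorial (the pcap name, delimiter, and field indexes are placeholders for whatever capture is being analyzed):

from netzob.all import *

messages = PCAPImporter.readFile("target.pcap").values()
symbol = Symbol(messages=messages)

Format.splitDelimiter(symbol, ASCII("#"))   # split on the delimiter
print(symbol._str_debug())                  # inspect the field structure

# one Symbol per distinct value of the key field (field-0 here)
symbols = Format.clusterByKeyField(symbol, symbol.fields[0])
for key, sym in symbols.items():
    Format.splitAligned(sym.fields[2], doInternalSlick=True)
    print(RelationFinder.findOnSymbol(sym))  # e.g. SizeRelation hits

# undo the formatting, or merge fields that were split too eagerly
Format.resetFormat(symbol)
# Format.mergeFields(symbol.fields[1], symbol.fields[2])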
(5.8.19) Random Forest Classifier with Intel and Parrot Data
Select Attributes:
frame.time_relative (the amount of time that has passed in the entire flight, in seconds)
frame.time_delta (the amount of time that elapsed between the current packet and the last packet, in seconds)
frame.len (the length of the packet as an integer).
All other features were ignored for a simplistic model, but may be added later if they are found to be important.
The combined and randomized Parrot and Intel drone data was split into a training set (60%, or 379,264 examples) and a test set (40%, or 252,842 examples). The RFC performed well with 100 trees and a tree depth of 2.
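A sketch of this classifier, assuming the packets were exported to a CSV with a label column (the file and column names are assumptions; the 100 trees, depth of 2, and 60/40 split are from above):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("drone_packets.csv")
X = df[["frame.time_relative", "frame.time_delta", "frame.len"]]
y = df["label"]   # 0 = Parrot, 1 = Intel

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))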
(10/3/19)
Explanation of the GINI INDEX for decision trees: https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
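For reference (the standard definition, not taken from the article): the Gini impurity of a node is 1 - sum(p_k^2) over the class proportions p_k, and a decision tree picks the split that lowers it the most. A quick sketch:

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["parrot"] * 5 + ["intel"] * 5))   # 0.5, maximally impure
print(gini(["parrot"] * 10))                  # 0.0, pure node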
(10/4/19)
Basics of Machine Learning
ANATOMY OF A LEARNING ALGORITHM
Generally consists of 3 parts
Loss function
Optimization criterion based on the loss function (a cost function for example)
Optimization routine leveraging training data to find a solution to the optimization criterion.
The optimization routine is typically gradient descent or stochastic gradient descent.
Gradient descent proceeds in epochs.
Gradient descent is sensitive to the choice of the learning rate.
It is slow for large datasets.
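A toy sketch of all three parts on made-up data: squared error as the loss, mean squared error over the training set as the optimization criterion, and plain gradient descent as the routine:

# fit y = w*x to made-up data that is roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

w, lr, epochs = 0.0, 0.01, 200
for _ in range(epochs):   # gradient descent proceeds in epochs
    # d/dw of the cost (1/n) * sum((w*x - y)^2) is (2/n) * sum((w*x - y) * x)
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad        # step against the gradient
print(w)   # converges near 2.0; too large a learning rate would diverge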
BASIC PRACTICE
Feature Engineering Techniques
ONE-HOT ENCODING
Transforming categorical features (like colors) into vectors. If you have something like red, green, blue, then you can transform them into vectors red = [1,0,0], green = [0,1,0], blue = [0,0,1].
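A quick sketch of that mapping (the category order is arbitrary):

categories = ["red", "green", "blue"]

def one_hot(value):
    return [1 if value == c else 0 for c in categories]

print(one_hot("green"))   # [0, 1, 0]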
BINNING
Let feature j = 4 represent age. We can apply binning by replacing this feature with the corresponding bins: add three new bins, "age_bin1", "age_bin2", "age_bin3", at indexes j = 123, j = 124, j = 125. If the feature x_i(4) = 7 and we assign bin2 to cover all ages from 6-10 yrs old, then j = 124 is set to 1. This can help the algorithm learn with fewer examples.
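A sketch of the age example; the 6-10 bin is from the notes above, the other bin edges are assumptions:

bins = [(1, 5), (6, 10), (11, 15)]   # age_bin1, age_bin2, age_bin3

def bin_age(age):
    # indicator vector saying which bin the age falls into
    return [1 if lo <= age <= hi else 0 for lo, hi in bins]

print(bin_age(7))   # [0, 1, 0] -> "age_bin2" fires, as in the notes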
NORMALIZATION
May help with features that have vastly different ranges.
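A quick sketch of one common choice, min-max normalization onto [0, 1] (the sample values are made-up packet lengths):

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([60, 1500, 342]))   # [0.0, 1.0, 0.195...]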
EVALUATING PERFORMANCE
Most widely used metrics and tools to assess classification models:
Confusion matrix
Accuracy
Cost-sensitive accuracy
Precision / recall
Area under the ROC curve
A confusion matrix (consisting of True Positives, False Positives, True Negatives, and False Negatives) is used to compute two other performance metrics → precision and recall
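The standard formulas, computed from the four counts (the numbers here are made up):

tp, fp, tn, fn = 80, 10, 95, 15

precision = tp / (tp + fp)   # of the predicted positives, how many were real
recall    = tp / (tp + fn)   # of the real positives, how many were caught
accuracy  = (tp + tn) / (tp + fp + tn + fn)
print(precision, recall, accuracy)   # 0.888..., 0.842..., 0.875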
Basic Algorithm Excerpt from Machine Learning based Traffic Classification using Low Level Features and Statistical Analysis (4.23.19):
1. Traffic collection.
2. Check the size of the pcap captures.
3. Convert the raw pcap into a processed CSV file. This can be done in many ways; the paper uses Tshark and an automated shell script and code (see the sketch after this list).
4. Label the data.
5. Apply the classifier.
6. Select attributes.
7. Visualize all attributes.
8. Apply the classification algorithm on the data set.
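For step 3, one way to get the CSV with tshark alone (a sketch; the field list mirrors the features used later in these notes, and the file names are placeholders):
tshark -r input.pcap -T fields -e frame.time_relative -e frame.time_delta -e frame.len -E header=y -E separator=, > input.csv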
(11/11/19)
Current Data: https://docs.google.com/spreadsheets/d/19ILJka1em-t-JHN5REilpu0Clzt_oRHtlFxVHySHhNA/edit#gid=0
Performance of simple Random Forest Classifier on different sets of data
Trial 1:
Data: Viper_1, Viper_2, Viper_3, Viper_4 --- Parrot_1, Parrot_2, Parrot_3, Parrot_4
Features: frame.time_rel, frame.length, data.length
Data Size: 87,721 samples
Train Size: 52,632
Test Size: 35,089
Trial 2:
Data: Viper_1, Viper_2, Viper_3, Viper_4 --- Intel_1, Intel_2, Intel_3, Intel_4
Features: frame.time_rel, frame.length, data.length
Data Size: 318,228 samples
Train Size: 190,936
Test Size: 127,292
Trial 3:
Data: Parrot_1, Parrot_2, Parrot_3, Parrot_4 --- Intel_1, Intel_2, Intel_3, Intel_4
Features: frame.time_rel, frame.length, data.length
Data Size: 253,849 samples
Train Size: 152,309
Test Size: 101,540