(3/27/19)
JSON to PCAP Script: https://github.com/H21lab/json2pcap
Convert the PCAP to JSON using the following command:
tshark -r input.pcap -T json -x > output.json
Then follow the script's instructions. In theory you can edit the JSON however you want (as long as it keeps the expected structure), and the script will convert it back to a PCAP for you.
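A hypothetical sketch of that edit step, assuming the newer tshark JSON layout where "frame_raw" is a list whose first element is the whole packet as a hex string (older tshark versions emit a bare string instead):

import json

# load the tshark dump produced above
with open("output.json") as f:
    packets = json.load(f)

# patch the first byte of the first packet to 0xff (purely illustrative)
frame_raw = packets[0]["_source"]["layers"]["frame_raw"]
frame_raw[0] = "ff" + frame_raw[0][2:]

# write it back out for json2pcap
with open("edited.json", "w") as f:
    json.dump(packets, f)
# then run json2pcap on edited.json per the repo's README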
(3/29/19) KEY FUNCTIONS IN NETZOB
Data Alignment
Parallel Data Alignment
JSON Serialization
Automata Factories?
- Chained States Automata Factory
- One State Automata Factory
Field Split Aligned (takes in an array of hex values and splits it among the fields)
Cluster By Key Field
Cluster By Size
Field Operations - Merge fields, etc.
Field Resetter - resets fields to their original format
Field Split Delimiter
Find Key Fields
Entropy Measurement
Problems and Attempted Mitigations
netzob.Common.NetzobException.NetzobImportException: Error while importing data from source PCAP: This pcap cannot be imported since the layer 2 is not supported (127)
-- one attempted mitigation was editing the library to add IEEE 802.11 protocol importing capabilities, but that breaks too many functions
-- another attempt is to edit the PCAP by converting it to JSON, systematically changing the JSON, and translating it back to PCAP with an Ethernet header using the following script: https://www.h21lab.com/tools/json-to-pcap. Some problems still come up when converting back to PCAP.
A lot of the necessary functions in Netzob deal with hex. The next step will be to try the functions on hex dumps of the captured packets.
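A minimal sketch of that idea, assuming Netzob's RawMessage/Symbol API, which would sidestep the layer-2 check in PCAPImporter entirely (the hex strings are placeholders):

from netzob.all import RawMessage, Symbol

# one hex string per captured packet, e.g. from a tshark/Wireshark hex dump
hex_dumps = ["0200616263", "0200646566"]
messages = [RawMessage(bytes.fromhex(h)) for h in hex_dumps]
symbol = Symbol(messages=messages)
print(symbol)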
(4.23.19)
Some interesting papers on profiling traffic with Machine Learning
Machine Learning based Traffic Classification using Low Level Features and Statistical Analysis: https://pdfs.semanticscholar.org/f835/befe987b7c266123ba7ccbc80f0c42f07322.pdf
A Survey of Machine Learning Based Packet Classification: https://pdfs.semanticscholar.org/c4cf/fd36e1958849e032202c8ae3d0c9a9098962.pdf
(4.25.19) (Explanation of Tutorial for Netzob)
Symbol(messages=<messages read in from the pcap>) — wraps the input as a list of messages of the Netzob-defined class 'AbstractMessage'
Data - content of the message - object
messageType
Date - int
Source - optional string
Destination - optional string
visualizationFunctions
Metadata
semanticTags
Format.splitDelimiter(symbol, ASCII('#')) — split the symbolized messages on the '#' sign (the delimiter may not be '#' in your protocol; use whatever delimiter you can see in the messages)
Symbol._str_debug() — shows the symbol structure (for each field as split by the delimiter)
Example: field-0 has
- Data (Raw=b'CMDidentify', ((0, 88))) —> the interval is the field's size in bits, so this field can be up to 88 bits (11 bytes, the length of 'CMDidentify'). Across the messages, field-0's size tops out at 88, 56, 64, 104, 80, or 48 bits (11, 7, 8, 13, 10, or 6 bytes).
Symbol — printing the symbol shows a simplified structure of the messages split by the delimiter
Format.clusterByKeyField(symbol, symbol.fields[0]) — cluster the messages by the terms in field-0. Printing the result shows how many keys the key field contains and their values.
Format.splitAligned(sym.fields[2], doInternalSlick=True) — split the specified field according to the variations in the message bytes. Relies on a sequence alignment algorithm: it aligns the field you want analyzed and tries to determine static and dynamic sub-fields for each key value.
RelationFinder.findOnSymbol(sym) — for each Key in the KeyField, it will find a relationship between some fields (if it exists)
Example:
SizeRelation, between 'value' of:
Field
[[b'\n', b'\t']]
and 'size' of:
Field-Field-Field
[[b'aStrongP', b'myPass'], [b'wd', b'wd'], [b' ', b'!']]
This means a relation was found between the value of the first field (e.g. '\n') and the size of the next three fields ('aStrongP', 'wd', ' ').
The values are paired because both come from messages associated with the same key.
Format.resetFormat(symbol) — if you want to undo all the formatting to analyze the patterns
Format.mergeFields(f1, f2) — if you know a certain part of the data has a fixed length, you can merge two fields that the splitAligned algorithm split apart
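A condensed sketch of the steps above, assuming the Netzob 1.0 API as shown in its tutorial (the pcap name, delimiter, and field indexes are placeholders for whatever capture is being analyzed):

from netzob.all import *

messages = PCAPImporter.readFile("target.pcap").values()
symbol = Symbol(messages=messages)

Format.splitDelimiter(symbol, ASCII("#"))   # split on the delimiter
print(symbol._str_debug())                  # inspect the field structure

# one Symbol per distinct value of the key field (field-0 here)
symbols = Format.clusterByKeyField(symbol, symbol.fields[0])
for key, sym in symbols.items():
    Format.splitAligned(sym.fields[2], doInternalSlick=True)
    print(RelationFinder.findOnSymbol(sym))  # e.g. SizeRelation hits

# undo the formatting, or merge fields that were split too eagerly
Format.resetFormat(symbol)
# Format.mergeFields(symbol.fields[1], symbol.fields[2])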
(5.8.19) Random Forest Classifier with Intel and Parrot Data
Select Attributes:
frame.time_relative (the amount of time that has passed in the entire flight, in seconds)
frame.time_delta (the amount of time that elapsed between the current packet and the last packet, in seconds)
frame.len (the length of the packet as an integer).
All other features were ignored for a simplistic model, but may be added later if they are found to be important.
The combined and randomized Parrot and Intel drone data was split into a training set (60%, or 379,264 examples) and a test set (40%, or 252,842 examples). The RFC performed well with 100 trees and a tree depth of 2.
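A sketch of this classifier, assuming the packets were exported to a CSV with a label column (the file and column names are assumptions; the 100 trees, depth of 2, and 60/40 split are from above):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("drone_packets.csv")
X = df[["frame.time_relative", "frame.time_delta", "frame.len"]]
y = df["label"]   # 0 = Parrot, 1 = Intel

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))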
(10/3/19)
Explanation of the GINI INDEX for decision trees: https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
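For reference (the standard definition, not taken from the article): the Gini impurity of a node is 1 - sum(p_k^2) over the class proportions p_k, and a decision tree picks the split that lowers it the most. A quick sketch:

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["parrot"] * 5 + ["intel"] * 5))   # 0.5, maximally impure
print(gini(["parrot"] * 10))                  # 0.0, pure node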
(10/4/19)
Basics of Machine Learning
ANATOMY OF A LEARNING ALGORITHM
Generally consists of 3 parts
Loss function
Optimization criterion based on the loss function (a cost function for example)
Optimization routine leveraging training data to find a solution to the optimization criterion.
The optimization routine is typically gradient descent or stochastic gradient descent.
Gradient descent proceeds in epochs.
Gradient descent is sensitive to the choice of the learning rate.
It is slow for large datasets.
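A toy sketch of all three parts on made-up data: squared error as the loss, mean squared error over the training set as the optimization criterion, and plain gradient descent as the routine:

# fit y = w*x to made-up data that is roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

w, lr, epochs = 0.0, 0.01, 200
for _ in range(epochs):   # gradient descent proceeds in epochs
    # d/dw of the cost (1/n) * sum((w*x - y)^2) is (2/n) * sum((w*x - y) * x)
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad        # step against the gradient
print(w)   # converges near 2.0; too large a learning rate would diverge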
BASIC PRACTICE
Feature Engineering Techniques
ONE-HOT ENCODING
Transforming categorical features (like colors) into vectors. If you have something like red, green, blue, then you can transform them into vectors red = [1,0,0], green = [0,1,0], blue = [0,0,1].
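A quick sketch of that mapping (the category order is arbitrary):

categories = ["red", "green", "blue"]

def one_hot(value):
    return [1 if value == c else 0 for c in categories]

print(one_hot("green"))   # [0, 1, 0]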
BINNING
Let feature j = 4 represent age. We can apply binning by replacing this feature with the corresponding bins: add three new bins, "age_bin1", "age_bin2", "age_bin3", at indexes j = 123, j = 124, j = 125. If the feature x_i(4) = 7 and we assign bin2 to cover all ages from 6-10 yrs old, then j = 124 is set to 1. This can help the algorithm learn with fewer examples.
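A sketch of the age example; the 6-10 bin is from the notes above, the other bin edges are assumptions:

bins = [(1, 5), (6, 10), (11, 15)]   # age_bin1, age_bin2, age_bin3

def bin_age(age):
    # indicator vector saying which bin the age falls into
    return [1 if lo <= age <= hi else 0 for lo, hi in bins]

print(bin_age(7))   # [0, 1, 0] -> "age_bin2" fires, as in the notes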
NORMALIZATION
May help with features that have vastly different ranges.
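A quick sketch of one common choice, min-max normalization onto [0, 1] (the sample values are made-up packet lengths):

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([60, 1500, 342]))   # [0.0, 1.0, 0.195...]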
EVALUATING PERFORMANCE
Most widely used metrics and tools to assess classification models:
Confusion matrix
Accuracy
Cost-sensitive accuracy
Precision / recall
Area under the ROC curve
A confusion matrix (consisting of True Positives, False Positives, True Negatives, and False Negatives) is used to compute two other performance metrics → precision and recall
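The standard formulas, computed from the four counts (the numbers here are made up):

tp, fp, tn, fn = 80, 10, 95, 15

precision = tp / (tp + fp)   # of the predicted positives, how many were real
recall    = tp / (tp + fn)   # of the real positives, how many were caught
accuracy  = (tp + tn) / (tp + fp + tn + fn)
print(precision, recall, accuracy)   # 0.888..., 0.842..., 0.875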
Basic Algorithm Excerpt from Machine Learning based Traffic Classification using Low Level Features and Statistical Analysis (4.23.19):
1. Traffic collection.
2. Check the size of the pcap captures.
3. Convert the raw pcap into a processed CSV file. This can be done in many ways; the paper uses Tshark and an automated shell script and code (see the sketch after this list).
4. Label the data.
5. Apply the classifier.
6. Select attributes.
7. Visualize all attributes.
8. Apply the classification algorithm on the data set.
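For step 3, one way to get the CSV with tshark alone (a sketch; the field list mirrors the features used later in these notes, and the file names are placeholders):
tshark -r input.pcap -T fields -e frame.time_relative -e frame.time_delta -e frame.len -E header=y -E separator=, > input.csv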
(11/11/19)
Current Data: https://docs.google.com/spreadsheets/d/19ILJka1em-t-JHN5REilpu0Clzt_oRHtlFxVHySHhNA/edit#gid=0
Performance of simple Random Forest Classifier on different sets of data
Trial 1:
Data: Viper_1, Viper_2, Viper_3, Viper_4 --- Parrot_1, Parrot_2, Parrot_3, Parrot_4
Features: frame.time_rel, frame.length, data.length
Data Size: 87,721 samples
Train Size: 52,632
Test Size: 35,089
Trial 2:
Data: Viper_1, Viper_2, Viper_3, Viper_4 --- Intel_1, Intel_2, Intel_3, Intel_4
Features: frame.time_rel, frame.length, data.length
Data Size: 318,228 samples
Train Size: 190,936
Test Size: 127,292
Trial 3:
Data: Parrot_1, Parrot_2, Parrot_3, Parrot_4 --- Intel_1, Intel_2, Intel_3, Intel_4
Features: frame.time_rel, frame.length, data.length
Data Size: 253,849 samples
Train Size: 152,309
Test Size: 101,540