Analyzing BGP AS_Path Behaviour using Python

Analyzing BGP AS_Path Behaviour using Python

Introduction

In this article, I will illustrate how BGP data, especially BGP AS-Path, can be analyzed using Python. The data collection pipeline is out of the scope of this article. However, collecting BGP data using BMP [1] can be easily done using OpenBMP [2] and Kafka.

For this article, I will use the paper written by Blazakis et al. [3] as a baseline and present Python code to analyze BGP AS-Path.

Dataset

The dataset is randomly created for demonstration. But suppose that this data was collected over a period of time using the pipeline mentioned above. Here's what it looks like. Obviously, if BGP data was collected using BMP, it will have lot more information that seen here but this will suffice for now.

Let's calculate AS-Path length for each entry and create a new column AS_Path_Len. This is easy to calculate using Pandas apply method.

Edit Distance

As mentioned in the paper, Edit Distance (aka Levenshtein Distance) is the number of edits required (add, delete, change) to make two strings equal. For example, when comparing strings s1 = 'He ate an apple.' with s2 = 'He ate an orange.', there will be 1 operation required to convert s1 to s2 (or vice versa). A similar technique could be used to calculate the Edit Distance when comparing AS-Paths for a specific prefix. However, a modification is required to _not_ treat AS Numbers as characters, rather treat them as entities. For example, when comparing '300 200 100' and '300 400 200 100', it takes 1 operation to make two AS-Paths equal. The authors of the paper called it AS-Path Edit Distance (AED).

So when we receive a new BGP Update message for prefix 1.1.1.1/32, we should calculate its AED to identify the anomaly. But compare this new data with which previous observation?

The authors used the first observation recorded as the "base" value to calculate AED but I am going to use the most frequently advertised AS-Path for that particular prefix as the base. If you think about that, it seems logical that the probability of that being a "legit" AS-Path is pretty high.

The most frequent AS-Path for 1.1.1.1/32 is:

Now, to calculate AED for prefix 1.1.1.1/32, I will use AS-Path "300 200 100" as the base value.

Anomaly Detection

Suppose a new BGP Update for prefix 1.1.1.1/32 is received with AS-Path 300 400 1456 100. To calculate its AED, I will compare this AS-Path with the base AS-Path value for prefix 1.1.1.1/32. Visually, we can determine that its AED will be 2. Let's calculate AED in Python.

This means AS-Path for new update is two off as compared to the base value. This is an anomaly but looking at the AS-Path, it also means that ASN 100 might have a moved to a new upstream ASN 1456.

However, consider the new update for prefix originated from ASN 500 and its AS-Path is 300 400 1456 500. Without factoring Origin AS, the AED will be 3; not much higher than previous case. But if we refactor the code to include the condition of Origin AS and if it doesn't match previous most frequent observations, we add a MAX_AED factor to the total.

AED 7 is quite high and should definitely be considered an anomaly.

Conclusion

AED proves to be a great metric for anomaly detection when analyzing BGP AS-Path. Of course, it has its flaws. For example, AS-Path 300 200 200 200 200 100 100 will generate an AED of 4 but looking at it, its just AS-Path prepending. Also, I have not considered Anycast prefixes here.

Further reading

[1] BMP RFC7854 - https://tools.ietf.org/html/rfc7854

[2] OpenBMP - BMP Collector - www.openbmp.org

[3] Analyzing BGP ASPATH Behaviour in the Internet - https://www.merit.edu/wp-content/uploads/2016/01/Analyzing_BGP_ASPATH.pdf

[4] Wikipedia - https://en.wikipedia.org/wiki/Edit_distance