Tuple Space Explosion: A Denial-of-Service Attack Against a Software Packet Classifier

Below we present the essence of our paper that appeared at CoNEXT 2019:

L. Csikor, D. M. Divakaran, M. S. Kang, A. Kőrösi, B. Sonkoly, D. Haja, D.P. Pezaros, S. Schmid, G. Rétvári, "Tuple Space Explosion: A Denial-of-Service Attack Against a Software Packet Classifier", In proc. of ACM CoNEXT'19, Orlando, FL, USA, 2019, https://dl.acm.org/citation.cfm?doid=3359989.3365431.

Furthermore, parts of this work, preliminary results, and follow-up works have appeared at several venues before and after the paper:

  • L. Csikor, V. Ujawane, and D. M. Divakaran, "On the Feasibility and Enhancement of the Tuple Space Explosion Attack against Open vSwitch", arXiv:2011.09107 [cs.CR], 2020.

  • L. Csikor, V. Ujawane, and D. M. Divakaran, “The Discrepancy of the Megaflow Cache in OVS, Final Episode,” in OVS+OVN Conference, https://www.youtube.com/watch?v=DSC3m-Bww64, 2020.

  • L. Csikor, M. S. Kang, and D. M. Divakaran, “The Discrepancy of the Megaflow Cache in OVS, Part II.” in OVS+OVN Conference, Red Hat, Westford, MA, 2019, https://bit.ly/35fFoKD.

  • L. Csikor and G. Rétvári, “The Discrepancy of the Megaflow Cache in OVS,” in Open vSwitch Fall Conference, Club Auto Sport, Santa Clara, CA, 2018, https://bit.ly/34idgVQ.

  • L. Csikor, C. Rothenberg, D. P. Pezaros, S. Schmid, L. Toka, and G. Rétvári, “Policy Injection: A Cloud Dataplane DoS Attack,” in Proceedings of the ACM SIGCOMM 2018 Conference on Posters and Demos, Budapest, Hungary, 2018, pp. 147-149, https://dl.acm.org/citation.cfm?id=3234250.


...and any further updates regarding the mitigation of the revealed attack will be published here.

Summary

Packet classification is one of the fundamental building blocks of various security primitives and thus it needs to be highly efficient and available. In this paper, we evaluate whether the de facto packet classification algorithm (i.e., Tuple Space Search scheme, TSS) used in many popular software networking stacks, e.g., Open vSwitch, VPP, HyperSwitch, is robust against low-rate denial-of-service (DoS) attacks.

We present the Tuple Space Explosion (TSE) attack that exploits the fundamental space/time complexity of the TSS algorithm. We demonstrate that the TSE attack can degrade the switch performance to as low as 12% of its full capacity with a very low packet rate (i.e., 0.7 Mbps) when the target packet classifier contains only simple policies, e.g., "allow a few flows but drop all others".

We also show that if the adversary has partial knowledge of the installed classification policies, she can virtually bring down the packet classifier with the same low attack rate.

The TSE attack, in general, does not require any specific attack traffic pattern; it merely sends attack packets with randomly chosen IP headers and arbitrary message contents. This makes it particularly hard to build a signature of our attack traffic for detection. Since the TSE attack exploits the fundamental complexity characteristics of the TSS algorithm, there unfortunately seems to be no complete mitigation of the problem. We thus suggest, as a long-term solution, using other packet classification algorithms (e.g., hierarchical tries, HaRP, HyperCuts) that are not vulnerable to the TSE attack.

As a short-term solution, we propose MFCGuard, a monitoring system that carefully manages the entries in the tuple space to keep packet classification fast for the packets that are eventually accepted by the system.

Background

Switching Stacks for Virtualization

Enterprises increasingly offload business-critical workloads to the public cloud to benefit from low infrastructure costs, high availability, and flexible resource provisioning. Reliable and efficient service provisioning heavily depends on the ability to efficiently switch traffic between the tenants' workloads and the outside world.

In this paper, we use OVS as our running example, but the presented vulnerabilities might affect other TSS-based software switches (e.g., VPP [1], HyperSwitch [2], GSwitch [3]). OVS [4] is an open-source, multilayer, production-quality software switch that enables massive network automation through programmatic extensions [5]. It can be managed remotely through standardized control plane protocols [6, 7]. The OVS flow table describes, at a high level, the packet processing behavior to be implemented by the switching logic. Due to its flexibility, generality, and community support, OVS has been extensively used in cloud deployments [5].

The flow table of an OVS switch is an ordered set of flows, where each flow is a pair of (1) a wildcard rule, operating on specific protocol header fields (e.g., IP source address, ports) and designating packets that belong to the flow, and (2) an action, a set of packet processing primitives to be applied to packets matching the flow rule; e.g., “forward to port”, or “drop”.

Two flows in the flow table are said to overlap if there is a packet header that matches both. In this case, the matching flow that occurs first in the flow table takes precedence. For instance, in the sample ACL in Fig. 1, a packet with source IP address 10.0.0.1, source port 34521, and destination port 443 matches both the second and the last flow entry, with the former overriding the latter by its higher priority. Accordingly, we define several use cases according to the number of header fields the allow rules match on, besides the single deny rule: the use case of destination port only (Dp) contains flow rule #1; source port and destination port (SpDp) contains flow rules #1 and #3; source IP and destination port (SipDp) contains flow rules #1 and #2; and all three fields combined (SipSpDp) contains all flow rules.
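To make the first-match semantics concrete, here is a minimal, self-contained Python sketch of an order-dependent classifier over a hypothetical ACL resembling Fig. 1 (the concrete rule values are our illustrative assumptions, not the exact figure):

```python
# First-match classification over an order-dependent ACL. The concrete
# rule values are illustrative assumptions (Fig. 1 is not reproduced
# here); None means the field is wildcarded.

ACL = [
    # (src_ip, src_port, dst_port, action) in priority order
    (None,       None,    80, "allow"),  # rule #1: dst port only (Dp)
    ("10.0.0.1", None,   443, "allow"),  # rule #2: src IP + dst port
    (None,      34521,    80, "allow"),  # rule #3: src port + dst port
    (None,       None,  None, "drop"),   # last rule: deny everything else
]

def classify(src_ip, src_port, dst_port):
    """Return the action of the first (highest-priority) matching rule."""
    hdr = (src_ip, src_port, dst_port)
    for *fields, action in ACL:
        if all(r is None or r == h for r, h in zip(fields, hdr)):
            return action
    return "drop"  # unreachable here thanks to the catch-all deny rule

# The packet from the running example matches rule #2 and the deny-all
# rule; the earlier (higher-priority) match wins, so it is allowed.
print(classify("10.0.0.1", 34521, 443))  # -> allow
```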

In contrast, a flow table in which all rules are disjoint is order-independent, because every packet has at most one matching rule and all rules have equal priority (i.e., order is irrelevant). In general, this makes packet classification much simpler [8].

Most software switches (if not all) support order-dependent flow tables despite the performance benefit of order-independent tables, because of the flexibility of the former. In virtualized environments (e.g., where multiple tenants share a single software switch for access control), users with varying networking knowledge configure the flow rules in the switches. The greater flexibility of order-dependent tables supports rule wildcarding and flow priorities, which allow complex packet processing logic to be described concisely.

TSS for Fast Packet Classification

To cut down the prohibitive cost of packet classification, OVS adopts the well-known fast path/slow path separation principle [9]. The fast path comprises two layers of flow caches, and the slow path implements a complete representation of the flow table serving as a fallback when the fast path cannot decide on the fate of a packet. Only the first packet of each flow is subjected to full-blown flow-table processing, i.e., slow path, and the resulting flow-specific rules and actions are then registered in the flow caches; the rest of the flow’s packets take the fast path. This amortizes the cost of packet classification over subsequent packets of a flow, contributing to increased performance without loss of expressiveness and generality.

Within the fast path, the microflow cache implements a per-transport-connection exact-match store where lookup occurs over all header fields, while the megaflow cache (MFC) bundles multiple microflows into a single megaflow to impose common processing on the entire bundle. In this design, the microflow cache merely serves as "short-term" memory and is often exhausted even in normal operation (by default, it contains only a couple of hundred entries). The lookup algorithm in the MFC relies on the TSS scheme [10], the prevailing packet classifier used to implement ACLs in other hypervisor switches as well (e.g., VPP [1], HyperSwitch [2], GSwitch [3]). The MFC generally saves on cache entries by letting a single megaflow cover, say, all incoming HTTP connections regardless of the source TCP port (i.e., with the TCP source port wildcarded). In a nutshell, this is done by collecting the entries matching on the same set of header bits into a hash table in which masked packet headers can be found fast. Masks and their associated hash tables are then searched sequentially until the first matching entry is found.
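To illustrate the data structure (this is our own minimal Python sketch, not the actual OVS implementation), TSS keeps one exact-match hash table per mask and scans the masks sequentially:

```python
# Minimal Tuple Space Search sketch: one exact-match hash table per mask.
# A header is modeled as a (src_ip, src_port, dst_port) tuple and a mask
# as a same-shape tuple of booleans selecting the fields it matches on.

class TupleSpace:
    def __init__(self):
        # mask -> {masked_header: action}
        self.tables = {}

    @staticmethod
    def _apply(mask, hdr):
        # Zero out wildcarded fields so equal masked headers hash together.
        return tuple(f if m else None for m, f in zip(mask, hdr))

    def insert(self, mask, hdr, action):
        self.tables.setdefault(mask, {})[self._apply(mask, hdr)] = action

    def lookup(self, hdr):
        # Sequential search over all masks: this linear scan is exactly
        # what the TSE attack inflates by spawning many distinct masks.
        for mask, table in self.tables.items():
            action = table.get(self._apply(mask, hdr))
            if action is not None:
                return action
        return None  # miss: fall back to the slow path

ts = TupleSpace()
ts.insert((False, False, True), ("1.2.3.4", 1000, 80), "allow")  # dport-only mask
ts.insert((True, False, True), ("10.0.0.1", 7, 443), "allow")    # srcIP+dport mask
print(ts.lookup(("9.9.9.9", 555, 80)))  # "allow": hit in the first table
print(len(ts.tables))                   # 2 masks -> 2 tables to scan
```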

Note that the TSS implementation in OVS does not know about flow priorities; thus the slow path ensures that MFC entries are all disjoint, which makes packet classification simpler yet introduces a costly worst case (i.e., an exhaustive linear search over all the different masks). Correspondingly, as long as the number of masks is kept in a reasonable range (e.g., a couple of hundred masks), packet processing in the fast path is close to line rate. This property of the TSS is the very logic we aim to exploit.

We show that even with a random packet sequence, the TSS implementation in OVS can be driven to produce thousands of masks, significantly increasing the lookup time of each packet and eventually bringing the whole packet processing to its knees.
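A toy demonstration with the TupleSpace sketch above shows how disjoint megaflows over different field subsets multiply the number of masks (OVS can be driven much further, to thousands):

```python
# Disjoint entries over every non-empty subset of the three modeled
# fields yield 7 distinct masks; every lookup must scan all of them in
# the worst case. (Builds on the TupleSpace class defined above.)
import itertools
import random

ts = TupleSpace()
for mask in itertools.product([False, True], repeat=3):
    if any(mask):
        for _ in range(5):  # a few disjoint entries per field subset
            hdr = (f"10.0.0.{random.randrange(256)}",
                   random.randrange(65536), random.randrange(65536))
            ts.insert(mask, hdr, "drop")
print(len(ts.tables))  # 7 masks already; each one lengthens the scan
```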

Threat model

We consider a general virtualized computing environment, where a targeted software switch is used for packet processing and basic network operations. This includes a typical multi-tenant cloud infrastructure whereby tenants lease resources in the cloud to deliver public services. Tenants may use cloud management system (CMS) APIs to set up their access-control list (ACL) rules in the underlying software switch to access-control, redirect, or log accesses to different resources [11, 12, 13, 14]. We assume that the internal algorithms of the data plane fabric are fully known to adversaries.

The attacker's goal is to send attack packets to the virtual switch that, when subjected to the implemented ACL, exhaust the underlying resources and deny access to the rest of the users. The adversary only needs the capability of crafting and sending IP packets with arbitrary legitimate headers without being filtered on the way, e.g., by her upstream or transit ISPs. Note that we do not require any privilege on the target switch for an effective DoS attack. However, having some partial internal state of the target switch (e.g., the installed ACL rules) can further improve the efficiency of the attack (i.e., fewer attack packets are required). We do not consider volumetric DoS attacks that congest the target's network bandwidth with attack traffic.

Two variants of our Tuple Space Explosion attack

In a nutshell, to spawn the highest number of MFC masks with the smallest effort, the crafted packet sequence needs to be tailored to the installed flow rules (see the paper for details). Correspondingly, awareness of the ACL itself is key to the efficiency of the TSE attack. Thus, we present two different approaches to the TSE attack, each posing different requirements and targets for the attacker.

In order to explain the main differences between them, and their practical targets, we need to understand a key abstraction in a cloud environment: the per-tenant virtual switches that tenants configure to set up their ACLs. Tenants perceive these virtualized resources as their own physical switches; however, the switches are only logically separated, and all of them are implemented and managed by the same software switch instance. Therefore, all workloads that happen to be scheduled to the same hypervisor inherently share the switching fabric as well (e.g., the MFC).

Co-located TSE

We build on top of this abstraction: the attacker has leased resources in the cloud, which inherently makes her capable of installing ACLs into her own virtual switch. Then, the shared MFC can easily be populated with new masks by targeting these known ACLs. However, co-location comes at a price: only the workloads of tenants that happen to be scheduled to the same hypervisor are affected.
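As a rough illustration of the workflow (a toy sketch under stated assumptions, not the exact packet sequence derived in the paper; the bridge name "br0", IPs, and ports are hypothetical), the co-located attacker installs overlapping wildcard rules in her own virtual switch and then sends crafted packets; because the slow path must keep MFC entries disjoint, it keeps emitting megaflows with fresh masks:

```python
# Hypothetical Co-located TSE sketch. Bridge name, IPs, and ports are
# illustrative assumptions. Requires scapy and root privileges.
import random
import subprocess
from scapy.all import IP, TCP, send

ACL = [
    "priority=300,tcp,nw_src=10.0.0.1,tp_src=34521,tp_dst=443,actions=output:1",
    "priority=200,tcp,nw_src=10.0.0.1,tp_dst=443,actions=output:1",
    "priority=100,tcp,tp_dst=443,actions=output:1",
    "priority=1,actions=drop",  # deny everything else
]
for rule in ACL:
    subprocess.run(["ovs-ofctl", "add-flow", "br0", rule], check=True)

for _ in range(50000):
    # Every header is legitimate; only the combination of values matters
    # for which disjoint megaflow (and mask) the slow path installs.
    pkt = (IP(src=f"10.0.{random.randrange(256)}.{random.randrange(1, 255)}",
              dst="10.0.0.2")
           / TCP(sport=random.randrange(1024, 65536),
                 dport=random.randrange(1024, 65536)))
    send(pkt, verbose=False)
```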

General TSE

In this approach, we alleviate the restrictions of the Co-located TSE: we consider the case when the attacker has neither resources in the cloud nor knowledge about any ACLs. Here, we investigate how much more effort the attacker needs in order to achieve the same efficiency as with the Co-located TSE.


Denial-of-Service

Co-located TSE

As can be seen in the figure (note the log scale on the y axis), the more MFC masks are generated, the worse the throughput becomes due to the significantly increased packet processing time. We evaluated the impact with several hardware offloading features to see whether the more efficient hardware implementation also suffers from this phenomenon; for TCP connections, hardware offloading preserves the high throughput for hundreds of masks, and the performance only starts to drop afterwards. UDP traffic, on the other hand, cannot benefit from hardware offloading.

Without offloading (GRO OFF), or in the case of UDP, the throughput already halves after reaching only a handful of masks (Dp = 17 masks). A couple of hundred masks (SpDp = 260, SipDp = 516) already brings the overall throughput down to 0.2 Gbps from the nominal 10 Gbps. Mounting thousands of masks (SipSpDp = 8200) results in a complete denial of service, where new flows cannot even be established.

While hardware offloading can indeed help mitigate the effect, after thousands of masks the performance also drops below 1 Gbps.

Observe the effect on the flow completion time of 1GB of TCP traffic when hardware offloading is switched off (on the secondary y axis).

General TSE

While with the Co-located TSE we can precisely reach the targeted number of MFC masks and fully bring down the packet classifier, with the General TSE (i.e., when we are not aware of the ACL rules) we can only send random packets with random contents.
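Such a random packet sequence is trivial to generate; the following hedged sketch (the target address is a placeholder from the documentation range, not from our testbed) sends packets with randomly chosen yet legitimate headers and arbitrary payloads:

```python
# General TSE sketch: no knowledge of the ACL, so send packets with
# random headers and arbitrary contents. Requires scapy and root.
import random
from scapy.all import IP, UDP, Raw, send

TARGET = "192.0.2.1"  # placeholder target address (TEST-NET-1)

def rand_ip():
    return ".".join(str(random.randrange(1, 255)) for _ in range(4))

for _ in range(50000):
    pkt = (IP(src=rand_ip(), dst=TARGET)
           / UDP(sport=random.randrange(1, 65536),
                 dport=random.randrange(1, 65536))
           / Raw(bytes(random.randrange(256) for _ in range(16))))
    send(pkt, verbose=False)
```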

Correspondingly, we calculated the expected number of MFC masks that can be achieved with a given number of random packets. Observe the figure on the left-hand side, where the x axis shows the number of different random packets, while the y axis depicts the expected (E) and measured (M) number of MFC masks (averaged over 100 runs).

Observe that the more header fields the ACL matches on, the more MFC masks can be spawned with the same number of random packets. In particular, the maximum attainable number of MFC masks (with 50,000 packets) is approx. 16, 121, 122, and 581 in the case of Dp, SpDp, SipDp, and SipSpDp, respectively. In terms of service degradation, these results mean that the General TSE can reduce the full capacity with GRO OFF to 52% (97% with GRO ON, 88% with FHO, 60% with UDP), 12% (96% with GRO ON, 87% with FHO, 15.8% with UDP), and 1% (73.5% with GRO ON, 25.5% with FHO, 3.25% with UDP), respectively.

Mitigation

Several immediate yet impractical remedies might help: (i) deploying or offloading the ACL implementation to a different hypervisor switch (e.g., [1, 2, 3]) or (ii) to a high-performing gateway appliance (e.g., [15]), (iii) switching the MFC off completely, or (iv) enabling advanced flow caching via the DPDK-based OVS datapath [16]. However, each of the above has a corresponding disadvantage: for (i), other implementations might suffer from the same attack (e.g., [1]); for (ii), gateway appliances do not help against attacks initiated within the data center; for (iii), the MFC has been the biggest performance improvement in OVS so far [17]; and for (iv), the feature that may prevent the attack is only available in select datapaths.

MFCGuard

As a more customized mitigation technique, we developed MFCGuard, which monitors and modifies the MFC: if the number of masks exceeds a certain threshold, it looks for patterns corresponding to a possible TSE attack every 10 seconds (in line with the MFC expiration period) and wipes out the corresponding entries. We observed that monitoring (and modifying) the MFC has no measurable performance overhead.

In essence, removing an entry from the MFC means that matching packets will be processed in the slow path again. Since the slow path would spawn the same MFC entries again, the idea behind MFCGuard is to constantly keep those entries out of the fast path. In practice, however, we observed that once an MFC entry is deleted, it will never be spawned again, i.e., matching packets will always be processed by the slow path. Such undesired, unexpected, and undocumented behavior [18] can have serious performance penalties, although MFC entries can be manually re-injected.

There are several requirements MFCGuard needs to meet: (i) entries covering the useful traffic should never be deleted; furthermore, (ii) depending on the available resources, we can only remove select flows from the MFC, balancing the maximum performance of the fast path against the increased resource utilization of the slow path, both impacting the overall quality of the running services. Due to requirement (i), MFCGuard only removes entries with a drop action! With this simple yet important rule, only adversarial packets are subjected to the slow path, keeping the fast path accelerated for the useful traffic flows (cf. Alg. 2 in §11.4). In our current implementation, we have therefore focused on (i), and we will keep you updated about (ii) here.
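The following minimal MFCGuard-style sketch illustrates requirement (i) only (it is our simplified illustration, not the exact implementation: the entry threshold is an assumption and the dump parsing may need adjusting to the deployed OVS version); it periodically deletes datapath flows whose action is "drop" using the standard ovs-dpctl tool:

```python
# MFCGuard-style sketch: every 10 seconds, if the datapath holds too
# many entries, delete the megaflows whose action is "drop", pushing
# only adversarial traffic back to the slow path.
import subprocess
import time

ENTRY_THRESHOLD = 100  # illustrative threshold, an assumption

def dump_flows():
    out = subprocess.run(["ovs-dpctl", "dump-flows"],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line.strip()]

while True:
    entries = dump_flows()
    if len(entries) > ENTRY_THRESHOLD:  # rough proxy for mask pressure
        for entry in entries:
            # Format: <flow key>, packets:N, bytes:N, used:..., actions:...
            key = entry.split(", packets:")[0]
            actions = entry.rsplit("actions:", 1)[-1].strip()
            if actions == "drop":
                # Matching packets fall back to the slow path from now on.
                subprocess.run(["ovs-dpctl", "del-flow", key], check=True)
    time.sleep(10)  # aligned with the default MFC expiration period
```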

We evaluated the efficiency of MFCGuard in all use cases (by deleting all drop entries) and observed that once the MFC is "cleaned", the performance of the victim's traffic goes back to its baseline. As the slow path becomes much more involved, we also evaluated the system's load in such cases.

Results are depicted on the left-hand side, where the x and y axes show the attack rate and the corresponding CPU usage of the slow path daemon (ovs-vswitchd), respectively. It can be seen that as long as the attack rate is less than 1,000 pps (< 1 Mbps), the slow path consumes only 15% of the CPU; note that this packet rate can already be enough to bring down OVS.

However, when the packet rate is 10,000 pps, the CPU load jumps up to ≈ 80% (in the case of the General TSE, this rate would be enough to degrade the full capacity to 10%). We can conclude that our current MFCGuard implementation is already capable of efficiently mitigating both TSE attacks as long as the attack rate is low. If the attack rate goes well above 10,000 pps, the attack becomes a volumetric one, for which multiple detection and handling solutions exist (e.g., detecting the excessive packet rate and over-provisioning).

Further updates

Below, we summarize some approaches to alleviate the issue, either as mitigation or as prevention techniques. We are already working on some of them, while others have been collected from "collaborators" who raised new approaches while we disseminated the work.

Sophisticated MFCGuard

Currently, MFCGuard simply removes all "bad-looking" entries from the Megaflow Cache; however, as discussed above, this can put much more pressure on the CPU if too many flows are pushed up to the slow path. A more sophisticated algorithm would look for the exact attack flows by applying the findings of the TSE attack itself, i.e., from the flow table it is possible to infer which masks are the result of a possible TSE attack.

Possible disadvantage: During the General TSE attack, where not just the packets necessary to spawn the relevant MFC masks and hashes are sent but also additional, higher-entropy random packets, the masks can look completely different, and identifying which packets are part of the TSE attack becomes much more difficult.

Prioritize

Masks and associated hash tables corresponding to the benign traffic should be prioritized during the linear search process. This would mean that even if the tuple space is blown up, benign traffic would hit an entry within a(n upper-)bounded number of steps during packet classification. Such masks (and associated hash tables) should be derived from the flow table. For instance, with two allow rules matching on the TCP destination port and the source IP address, respectively, these masks would be 80/ffff and 10.0.0.1/ffffffff.
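A sketch of this idea on top of the TupleSpace class from the Background section (the mask encoding follows that earlier sketch and the concrete priority masks are illustrative assumptions): benign masks are scanned first, so benign traffic is classified within a bounded number of steps regardless of how many attack-spawned masks sit behind them.

```python
# Benign-first mask scan (builds on the TupleSpace sketch above).
PRIORITY_MASKS = [
    (False, False, True),   # e.g., allow rule on the TCP destination port
    (True,  False, False),  # e.g., allow rule on the source IP address
]

def prioritized_lookup(ts, hdr):
    for mask in PRIORITY_MASKS:  # bounded, benign-first scan
        table = ts.tables.get(mask)
        if table is not None:
            action = table.get(ts._apply(mask, hdr))
            if action is not None:
                return action
    return ts.lookup(hdr)  # miss: fall back to the full linear scan
```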

Possible disadvantage: The number of allow rules can grow to a point where the above-mentioned upper bound scales up accordingly, eventually reintroducing the issue (i.e., when the number of allow rules reaches well above 100).

Dynamically set MFC mask timeouts

Generally, each mask and entry in the MFC has a timeout of 10 seconds, i.e., an entry is kept warm for 10 seconds; if there are no more matching packets, the entry is freed up to make space for more relevant flows. Here, we propose that whenever a new MFC mask (and hash table) is created, it should start with a timeout of 1 second. If there is a hit for the same mask, its timeout is increased to 2 seconds, and the process repeats for each particular mask until the maximum timeout (e.g., 10 seconds) is reached; see the sketch below.
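A minimal sketch of this timeout schedule (the doubling progression 1 s, 2 s, 4 s, ... is one possible interpretation of the proposal above):

```python
# Dynamic mask timeout sketch: a fresh mask expires quickly; each hit
# extends its lease, capped at the regular MFC expiration period.
import time

MAX_TIMEOUT = 10.0  # the regular 10-second MFC timeout

class MaskTimer:
    def __init__(self):
        self.timeout = 1.0  # a freshly created mask expires fast
        self.expires_at = time.time() + self.timeout

    def on_hit(self):
        # Doubling schedule: 1 s -> 2 s -> 4 s -> 8 s -> 10 s (capped).
        self.timeout = min(self.timeout * 2, MAX_TIMEOUT)
        self.expires_at = time.time() + self.timeout

    def expired(self):
        return time.time() >= self.expires_at
```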

Possible disadvantage: It only makes the attack harder in the beginning! If the attacker keeps sending the same traffic at the same rate (or at a higher rate if needed), then s/he can reach the 10-second timeout quite easily within a short time frame (depending on the rate required for the specific packet sequence, this can be within tens of seconds or a couple of minutes).


Interested in some hands-on experience?

Check out our repo at: https://github.com/cslev/ovsdos

Multimedia materials

Please find below some recorded presentations disseminating the details at different venues.

ACM SIGCOMM 2018 (Demo)

This video summarizes the very essence of our findings at the beginning of the project. The material was submitted as the compulsory supplementary material for our ACM SIGCOMM demo submission.

L. Csikor, C. Rothenberg, D. P. Pezaros, S. Schmid, L. Toka, and G. Rétvári, “Policy Injection: A Cloud Dataplane DoS Attack,” in Proceedings of the ACM SIGCOMM 2018 Conference on Posters and Demos, Budapest, Hungary, 2018, pp. 147-149, https://dl.acm.org/citation.cfm?id=3234250.

OVS Fall Conference 2018

This talk @ OvS Con summarizes the Co-located TSE attack.

L. Csikor and G. Rétvári, “The Discrepancy of the Megaflow Cache in OVS,” in Open vSwitch Fall Conference, Club Auto Sport, Santa Clara, CA, 2018, https://bit.ly/34idgVQ.

OVS Fall Conference 2019

This second talk @ OvS Con summarizes the General TSE attack.

L. Csikor, M. S. Kang, and D. M. Divakaran, “The Discrepancy of the Megaflow Cache in OVS, Part II.” in OVS+OVN Conference, Red Hat, Westford, MA, 2019, https://bit.ly/35fFoKD.

OVS Fall Conference 2020

This is our final talk @ OvS Con about OVS with DPDK and the TSE attack.

L. Csikor, V. Ujawane, and D. M. Divakaran, “The Discrepancy of the Megaflow Cache in OVS, Final Episode,” in OVS+OVN Conference, https://www.youtube.com/watch?v=DSC3m-Bww64, 2020.

References

[1] FD.io, “VPP - Vector Packet Processing,” https://docs.fd.io/vpp/19.01/index.html.

[2] K. K. Ram, A. L. Cox, M. Chadha, and S. Rixner, “Hyper-Switch: A Scalable Software Virtual Switching Architecture,” in Usenix ATC, 2013, p. 12.

[3] M. Varvello, R. Laufer, F. Zhang, and T. Lakshman, “Multi-Layer Packet Classification with Graphics Processing Units,” in Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies - CoNEXT ’14. Sydney, Australia: ACM Press, 2014, pp. 109–120. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2674005.2674990

[4] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J. Stringer, P. Shelar, K. Amidon, and M. Casado, “The design and implementation of Open vSwitch,” in NSDI, 2015, pp. 117–130.

[5] A Linux Foundation Collaborative Project, “Production Quality, Multilayer Open Virtual Switch,” http://www.openvswitch.org/, Accessed: June 2019.

[6] OpenFlow Switch Specifications v.1.4.0, The Open Networking Foundation, 2013.

[7] B. Pfaff and B. Davie, “The Open vSwitch database management protocol,” RFC 7047, 2013.

[8] K. Kogan et al., “SAX-PAC: scalable and expressive packet classification,” in SIGCOMM, 2014, pp. 15–26.

[9] P. Newman, G. Minshall, and T. L. Lyon, “IP switching – ATM under IP,” IEEE/ACM Trans. Netw., vol. 6, no. 2, pp. 117–129, 1998.

[10] V. Srinivasan, S. Suri, and G. Varghese, “Packet classification using tuple space search,” in SIGCOMM, 1999, pp. 135–146.

[11] Cloud Native Computing Foundation, “Network Policies,” https://kubernetes.io/docs/concepts/services-networking/network-policies.

[12] Istio, “Authentication Policy,” 2018, https://istio.io/docs/reference/config/istio.authentication.v1alpha1.

[13] Istio, “Traffic Routing,” 2018, https://istio.io/docs/reference/config/istio.networking.v1alpha3.

[14] Istio, “Ingress Controller,” 2018, https://istio.io/docs/tasks/traffic-management/ingress.html.

[15] Amazon Web Services, Elastic Load Balancing features. https://aws.amazon.com/elasticloadbalancing/features/#Details_for_Elastic_Load_Balancing_Products, Accessed in Jun 2019.

[16] DPDK. Membership Library. https://doc.dpdk.org/guides/prog_guide/member_lib.html.

[17] J. Pettit, “Accelerating Open vSwitch to Ludicrous Speed,” blog post, Network Heresy: Tales of the network reformation, 2014. http://networkheresy.com/accelerating-open-vswitch-to-ludicrous-speed/.

[18] B. Pfaff, “[ovs-discuss] ovs-dpctl del-flow works strange,” mailing list archive, https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048887.html, June 2019.

Levente Csikor, Dinil Mon Divakaran, Min Suk Kang

NUS-Singtel Cyber Security Research & Development Lab, National University of Singapore
Last update: 01/08/2022