Abstracts

Building a Secure Internet: RPKI’s Stumbling Speedrun to the Top

Haya Schulmann, Goethe Universitaet Frankfurt; Niklas Vogel, Goethe Universitaet Frankfurt; Donika Mirdita, Technische Universitaet Darmstadt; Jens Friess, Technische Universitaet Darmstadt; Michael Waidner, Technische Universitaet Darmstadt

NDSS 2024

The presentation is based on a number of research publications that appeared at NDSS, USENIX Security , ACM CCS and SIGCOMM.

In this talk we will focus on Internet routing with Border Gateway Protocol (BGP) – the mechanism that controls how data is routed from the sending to the receiving network. Invented in the late 1980s and is still in use today, BGP is a particularly important example of how the Internet is developing, and of the key challenges of Internet security. BGP is inherently insecure, but back then this was not considered a problem. This changed completely with the commercialization of the Internet in the 2000s. Disruptions to routing pose severe risks to any online activity. In the early 2010s the Resource Public Key Infrastructure (RPKI) was invented as a mechanism to secure routing information. Following the Internet’s design philosophy, RPKI started as an experimental technology, and it is still not widely deployed.

The US was one of the first countries to recognize the threat that vulnerable Internet routing introduces to national security, and made securing Internet routing one of its strategic priorities in cyberspace. RPKI was named the key mechanism for securing the nation’s Internet routing – amplifying the need for quickly improving the maturity of RPKI.

In this talk we will provide an overview of the current state of RPKI deployment, examine hurdles to its broader adoption, propose a novel, resilient RPKI architecture utilizing mechanisms from distributed computing, and conclude with steps forward for securing the Internet infrastructure, from a scientific as well as from a national, political perspective.

2SYN: Congestion-Aware Multihoming

Kfir Toledo, Technion and IBM Research; Isaac Keslassy, Technion

NOMS 2025

When sending flows to arbitrary destinations, current multihoming routers adopt simple congestion-oblivious mechanisms. Therefore, they cannot avoid congested paths.

In this talk, we introduce 2SYN, the first congestion-aware multihoming algorithm that works for any destination.

We explain how it dynamically selects a preferred path for new connections, even given previously-unseen destinations. We further demonstrate that it can be easily implemented in Linux. Finally, in a real-world experiment with either LTE or a wired link, we show how 2SYN dynamically adapts to the quality of the connection and outperforms alternative approaches. Thus, 2SYN helps companies better manage their networks by leveraging their multihoming capabilities.

GenSync - A Framework for Reconciling Remote Data

Xingyu Chen, Novak Boškov, Ari Trachtenberg, David Starobinski, Boston University

IEEE Computer 2023

Consistent replication of data among remote devices is an important primitive for a wide range of applications, including dissemination of transactions in distributed ledgers, file sharing, backend maintenance of cloud storage, offloading of IoT storage, and gossip-based status updates in wireless sensor networks. To date, each community (and sometimes even different groups within the same community!) produces its own home-grown implementations of this primitive, resulting in the reimplementation of similar protocols from the literature.

Gensync is an umbrella framework for unifying and simplifying the use of various data reconciliation protocols from the recent literature. Provided as an open-source C++ middleware library, it can be used as an Application Programming Interface for applications that require the reconciliation primitive. It can also be utilized as a research tool for experimentally comparing and benchmarking different reconciliation solutions, in situ, within an existing application.

We will present this framework and describe its use in a number of applications from our own research. In the process, we will also highlight an updated comparison of the different reconciliation protocols that have thusfar been implemented as part of the framework.

The core works underpinning this talk are taken from:

* Boškov, Novak, Ari Trachtenberg, and David Starobinski. "Enabling Cost-Benefit Analysis of Data Sync Protocols." Computer 56.10 (2023): 62-71.

* Boškov, Novak, Ari Trachtenberg, and David Starobinski. "GenSync: A new framework for benchmarking and optimizing reconciliation of data." IEEE Transactions on Network and Service Management 19.4 (2022): 4408-4423.”

Joint PIR-Coding for Secure Retrieval Over Wireless Channels

Or Elimelech, Ben-Gurion University of the Negev; Asaf Cohen, Ben-Gurion University of the Negev

arXiv

In this paper, we address the problem of Private Information Retrieval (PIR) over a public Additive White Gaussian Noise (AWGN) channel.

In such a setup, the server's responses are visible to other servers.

Thus, a curious server can listen to the other responses, compromising the user's privacy.

Indeed, previous works on PIR over a shared medium assumed the servers cannot instantaneously listen to other responses.

To address this gap, we present a novel randomized lattice -- PIR coding scheme that jointly codes for privacy, channel noise, and secure from curious servers that may listen to other responses.

We demonstrate that a positive PIR rate is achievable even in cases where the channel to the curious server is stronger than the channel to the user.

Routing for Large ML Models

Ofir Cohen, Hebrew University of Jerusalem; Jose Yallouz, Technion; Michael Schapira, Hebrew University of Jerusalem; Shahar Belkar, Huawei Network IO Innovation Lab; Tal Mizrahi, Technion

Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training process exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are

routed across the network. We present an algorithmic framework for quantifying network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically optimizing routing with respect to this global metric.

Nearly Optimal Local Algorithms for Constructing Sparse Spanners of Clusterable Graphs

Reut Levi, Reichman University; Moti Medina, Bar-Ilan University; and Omer Tubul, Bar-Ilan University

APPROX/RANDOM 2024

In this paper, we study the problem of locally constructing a sparse spanning subgraph (LSSG), introduced by Levi, Ron, and Rubinfeld (ALGO'20). In this problem, the goal is to locally decide for each e ∈ E if it is in G' where G' is a connected subgraph of G (determined only by G and the randomness of the algorithm). We provide an LSSG that receives as a parameter a lower bound, ϕ, on the conductance of G whose query complexity is Õ(√n/ϕ²). This is almost optimal when ϕ is a constant since Ω(√n) queries are necessary even when G is an expander. Furthermore, this improves the state of the art of Õ(n^{2/3}) queries for ϕ = Ω(1/n^{1/12}).

We then extend our result for (k, ϕ_in, ϕ_out)-clusterable graphs and provide an algorithm whose query complexity is Õ(√n + ϕ_out n) for constant k and ϕ_in. This bound is almost optimal when ϕ_out = O(1/√n).

Blank Space: Adaptive Causal Coding for Streaming Communications Over Multi-Hop Networks

Adina Waxman, Faculty of ECE Technion; Shai Ginzach, Rafael; Aviel Glam, Rafael; Alejandro Cohen, Faculty of ECE Technion

arXiv

In this work, we introduce Blank Space AC-RLNC (BS), a novel Adaptive and Causal Network Coding (AC-RLNC) solution designed to mitigate the triplet trade-off between throughput-delay-efficiency in multi-hop networks. BS leverages the network’s physical limitations considering the bottleneck from each node to the destination. In particular, BS introduces a light-computational re-encoding algorithm, called Network AC-RLNC (NET), implemented independently at intermediate nodes. NET adaptively adjusts the Forward Error Correction (FEC) rates and schedules idle periods. It incorporates two distinct suspension mechanisms: 1) Blank Space Period, accounting for the forward-channels bottleneck, and 2) No-New No-FEC approach, based on data availability. The experimental results achieve significant improvements in resource efficiency, demonstrating a 20% reduction in channel usage compared to baseline RLNC solutions. Notably, these efficiency gains are achieved while maintaining competitive throughput and delay performance, ensuring improved resource utilization does not compromise network performance.

(Submitted to ISIT 2025)

The Cost of Retrying: Operation and Exploitation of Retry Mechanisms in Cloud Applications

Jhonatan Tavori, Tel-Aviv University; Anat Bremler-Barr, Tel-Aviv University; Hanoch Levy, Tel-Aviv University; Ofek Lavi, Tel-Aviv University

coNEXT 2023

Microservice-based cloud application design has moved away from traditional monolithic architectures by breaking applications into smaller, independently deployable services with well-defined interfaces. Modern cloud applications commonly adopt this framework. Retry mechanisms are frequently used in cloud microservices architectures as a method for recovering from transient errors, including network failures and service overloading.

This research examines the operation of cloud retry mechanisms under deliberate DDoS attacks and their effects on application performance and operational costs. We further focus on the economic aspects and demonstrate that enabling such mechanisms improperly might be counterproductive and expose the system to substantial, quadratic economic damage in the presence of attacks.

We explore the impact of retry mechanisms on systems with distinct auto-scaling strategies, study an analytic model for two services operating in tandem, and highlight the resource inefficiencies and rising operational costs caused by excessive retries in poorly coordinated services. Several attack scenarios are also discussed.

This talk is based on an extended abstract published in ACM SIGCOMM CoNEXT 2023.

Non-uniformity is All You Need: Efficient and Timely Encrypted Traffic Classification With ECHO

Shilo Daum, HUJI; Tal Shapira, TAU; Anat Bremler-Barr, TAU; David Hay, Princeton

With 95% of Internet traffic now encrypted, an effective approach to classifying this traffic is crucial for network security and management. This paper introduces ECHO-a novel optimization process for ML/DL-based encrypted traffic classification that can significantly improve many suggested classification schemes. ECHO targets both classification time and memory utilization and incorporates two innovative techniques. The first component, HO (Hyperparameter Optimization of binnings), aims at creating efficient traffic representations. While previous research often uses representations that map packet sizes and packet arrival times to fixed-sized bins, we show that non-uniform binnings are significantly more efficient. These non-uniform binnings are derived by employing a hyperparameter optimization algorithm in the training stage. HO significantly improves accuracy given a required representation size, or, equivalently, achieves comparable accuracy using smaller representations. Then, we introduce EC (Early Classification of traffic), which enables faster classification using a cascade of classifiers adapted for different exit times, where classification is based on the level of confidence. EC reduces the average classification latency by up to 90%. Remarkably, this method not only maintains classification accuracy but also, in certain cases, improves it. Using three publicly available datasets, we demonstrate that the combined method, Early Classification with Hyperparameter Optimization (ECHO), leads to a significant improvement in classification efficiency.

The Observer Effect in Computer Networks

Tal Mizrahi, Technion; Michael Schapira, HUJI; Yoram Moses, Technion

ANRW 2024

Network measurement involves an inherent tradeoff between accuracy and overhead; higher accuracy typically comes at the expense of greater measurement overhead (measurement frequency, number of probe packets, etc.). Capturing the “right” balance between these two desiderata – high accuracy and low overhead – is a key challenge. However, the manner in which accuracy and overhead are traded off is specific to the measurement method, rendering apples-to-apples comparisons difficult. To address this, we put forth a novel analytical framework for quantifying the accuracy-overhead tradeoff for network measurements. Our framework, inspired by the observer effect in modern physics, introduces the notion of a network observer factor, which formally captures the relation between measurement accuracy and overhead. Using our “network observer framework”, measurement methods for the same task can be characterized in terms of their network observer factors, allowing for apples to apples comparisons. We illustrate the usefulness of our approach by showing how it can be applied to various application domains and validate its conclusions through experimental evaluation.

Broadcast Approach Meets Network Coding for Ultra-Reliable and Low-Latency Data Streaming

Ofek Cohen, Technion; Dr Alejandro Cohen, Technion

For data streaming applications, existing solutions are not yet able to close the gap between high data rates and low delay. This work considers the problem of data streaming under mixed delay constraints over a single communication channel with delayed feedback. We propose a novel layered adaptive causal random linear network coding (LAC-RLNC) approach with forward error correction. LAC-RLNC is a \emph{variable-to-variable coding} scheme, i.e., variable recovered information data at the receiver over variable short block length and rate is proposed. Specifically, for data streaming with base and enhancement layers of content, we characterize a high dimensional throughput-delay trade-off managed by the adaptive causal layering coding scheme. The base layer is designed to satisfy the strict delay constraints, as it contains the data needed to allow the streaming service. Then, the sender can manage the throughput-delay trade-off of the second layer by adjusting the retransmission rate a priori and posterior as the enhancement layer, which contains the remaining data to augment the streaming service's quality, is with the relaxed delay constraints.

We provide numerical evidence that the layered network coding strategy significantly boosts performance. Specifically, our results indicate that LAC-RLNC, when compared to the traditional non-layered method, achieves up to a twofold reduction in the mean and max in-order delivery delay for the base layer, nearing the theoretical lower limit, while maintaining similar results in the enhancement layer, and a slight improvement in throughput. These findings are corroborated by our analytical work, which produces bounds aligning with simulation data and facilitates the management of the throughput-delay trade-off.

Cloud Data Geolocation in Three-Tier Architectures

Aviram Zilberman, Cyber@BGU and Jerusalem College of Technology, Adi Offer, Cyber@BGU; Asaf Shabtai, Cyber@BGU; Yuval Elovici, Cyber@BGU; Rami Puzis, Cyber@BGU; Andikan Otung, Fujitsu; Hideki Mitsunobu, Fujitsu;

Cloud data geolocation has become increasingly important due to privacy regulations that require strict control over the storage and processing of client data. Traditional methods for geolocating cloud data mainly focus on validating the data location claims of major cloud service providers such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure. However, as these providers now comply with strict regulatory standards such as the General Data Protection Regulation (GDPR), traditional geolocation methods have become less relevant. In contrast, third-party applications built on these cloud services may not adhere to the same regulatory frameworks, and conventional geolocation methods do not account for the complexities of their three-tier architectures. As a result, users of these applications lack reliable ways to verify data geolocation and compliance with privacy regulations.

This paper is the first to define the problem of data geolocation in three-tier architectures and proposes a novel method for identifying the exact data center in which cloud-based applications store data files. The proposed method includes four algorithm variants that analyze data retrieval times. Evaluations conducted in a real-world environment on various cloud-based applications demonstrate the algorithm's ability to geolocate files, achieving a success rate of 94%. These findings underscore the proposed method's effectiveness in addressing cloud data geolocation challenges within complex three-tier network environments.

A Flushing Attack on the DNS Cache.

Yehuda Afek and Anat Bremler-Barr, Tel-Aviv University; Shoham Danino, Reichman University; Yuval Shavitt, Tel-Aviv University

USENIX Security 2024

A severe vulnerability in the DNS resolver's cache is exposed here, introducing a new type of attack, termed DNS CacheFlush. This attack poses a significant threat as it can easily disrupt a resolver's ability to provide service to its clients.

DNS resolver software incorporates various mechanisms to safeguard its cache. However, we have identified a tricky path to bypass these safeguards, allowing a high-rate flood of malicious but seemingly existent domain name resolutions to thrash the benign DNS cache. The resulting attack has a high amplification factor, where with a low rate attack it produces a continuous high rate resource records insertions into the resolver cache. This prevents benign request resolutions from surviving in the DNS LRU cache long enough for subsequent requests to be resolved directly from the cache. Thus leading to repeated cache misses for most benign domains, resulting in a substantial delay in the DNS service. The attack rate amplification factor is high enough to even flush out popular benign domains that are requested at a high frequency (∼ 100/1sec). Moreover, the attack packets introduce additional processing overhead and all together the attack easily denies service from the resolver's legitimate clients.

In our experiments we observed 95.7% cache miss rate for a domain queried once per second under 8,000 qps attack on a resolver with 100MB cache. Even on a resolver with 2GB cache size we observed a drop of 88.3% in the resolver benign traffic throughput.

A result of this study is a recommendation to deny and drop any authoritative replies that contain many server names, e.g., a long referral response, or a long CNAME chain, before the resolver starts any processing of such a response.

Batch-Schedule-Execute: On Optimizing Concurrent Deterministic Scheduling for Blockchains

Yaron Hay, Technion; Roy Friedman, Technion

SRDS 2024

Executing smart contracts is a compute and storage-intensive task, which currently dominates modern blockchain's performance. Given that computers are becoming increasingly multicore, concurrency is an attractive approach to improve programs' execution runtime. A unique challenge of blockchains is that all replicas (miners or validators) must execute all smart contracts in the same logical order to maintain the semantics of State Machine Replication (SMR).

In this work, we study the maximal level of parallelism attainable when focusing on the conflict graph between transactions packaged in the same block. This exposes a performance vulnerability that block creators may exploit against existing blockchain concurrency solutions, which rely on a total ordering phase for maintaining consistency amongst all replicas. To facilitate the formal aspects of our study, we develop a novel generic framework for Active State Machine Replication (ASMR) that is strictly serializable. We introduce the concept of graph scheduling and the definition of the minimal latency scheduling problem, which we prove to be NP-hard. We show that the restricted version of this problem for homogeneous transactions is equivalent to the classic Graph Vertex Coloring Problem, yet show that the heterogeneous case is more complex. We discuss the practical implications of these results.

We add additional results developed since the original publication of the paper that strengthen its results and implications.

CertainSync: Rateless Set Reconciliation with Certainty

Tomer Keniagin, Technion; Eitan Yaakobi, Technion; Ori Rottenstreich, Technion

Set reconciliation is a fundamental task in distributed systems, particularly in blockchain networks, where it enables the synchronization of transaction pools among peers and facilitates block dissemination. Existing traditional set reconciliation schemes are either statistical, providing success probability as a function of the communication overhead and the size of the symmetric difference, or require parametrization and estimation of the size of the symmetric difference, which can be prone to error. In this paper, we present CertainSync, a novel reconciliation framework that, to the best of our knowledge, is the first to guarantee successful set reconciliation without any parametrization or estimators in use. The framework is rateless and adapts to the unknown symmetric difference size. The set reconciliation is guaranteed to be completed successfully whenever the communication overhead reaches a lower bound derived from the symmetric difference size and the universe size. Our framework is based on recent constructions of Invertible Bloom Lookup Tables (IBLTs) ensuring successful element listing as long as the number of elements is bounded. We provide a theoretical analysis to prove the certainty in the set reconciliation for multiple constructions. The approach is also validated by simulations, showing the ability to synchronize sets with efficient communication costs while maintaining reconciliation guarantees compared to other baseline schemes for set reconciliation. To further improve communication overhead for large universes as blockchain networks, CertainSync is extended with a universe reduction technique to minimize communication overhead. We compare and validate the extended framework UniverseReduceSync against the basic CertainSync framework through simulations using real blockchain transaction hash data from the Ethereum blockchain network. The results illustrate a trade-off between improved communication costs and maintaining reconciliation guarantees without relying on parametrization or estimators, offering a comprehensive solution for set reconciliation in diverse scenarios.

Based on a paper accepted to ACM SIGMETRICS 2025.

Towards offloading encrypted traffic classification onto a SmartNIC

Daniel Gilkarov, Ariel University; Ran Dubin, Ariel University

Traditional network traffic analysis usually involves transferring packets from the network interface card (NIC) to the CPU for classification using traditional Deep Packet Inspection (DPI), advanced machine learning, or deep learning flow classification. However, fully encrypted network traffic, such as Encrypted Service Name Indicator (E-SNI) and DNS over HTTPS (DoH), renders traditional DPI solutions obsolete. The advanced analysis solutions currently in use rely on Artificial Intelligence (AI), which consumes more data and adds complexity, ultimately reducing the overall efficiency of the analysis. The increased data consumption for the AI algorithms increases the required resources, power consumption, and rack space. This paper proposes a new ultrafast network traffic classification method using a SmartNIC. Our solution performs comprehensive traffic analysis directly on the SmartNIC, eliminating the need for CPU-level intervention. This approach reduces the need for additional compute instances by efficiently utilizing the SmartNIC as a traffic analysis solution. These advancements are a significant step forward in enhancing the efficiency of network traffic classification. The solution achieves a notable analysis throughput of 56 Gbps (Gigabit per second), and 100K flows per second with minimal packet loss using 7 BlueField2 ARM cores. We provide a detailed analysis of limitations and ways to improve the analysis throughput.

A Parallel Algorithm and Scalable Architecture for Routing in Benes Networks

Rami Zecharia, Tel Aviv University; Yuval Shavitt, Tel Aviv University

INFOCOM 2024

Beneš/CLOS networks are widely used in data-centers, NoCs, multiprocessors, and parallel systems. Beneš networks have recently gained interest in the field of optical circuit switching, driven by low-cost, high-speed 2×2 MZI switches in silicon photonics.

Sequential routing algorithms for Beneš networks have a time complexity of O(Nlog2N), while parallel algorithms, developed to satisfy the stringent timing requirements of high-performance switching networks, achieve O((log2N)2). However, both approaches incur high connectivity complexity of O(N2log2N) (number of wires connecting processing elements), limiting their scalability to high-radix optical switches. Adaptive algorithms, which establish paths upon demand rather than computing routing for an entire input permutation as in parallel or sequential algorithms, suffer from low utilization preventing their practical deployment.

We developed three routing algorithms for Beneš networks: parallel and sequential algorithms integrated with a unique and scalable hardware architecture, and an adaptive algorithm that guarantees path establishment when one exists.

• Parallel routing algorithm having a connectivity complexity of O(N2) is proposed, a O(log2N) reduction in wiring, which enables scalability to high-radix optical switches. The lower connectivity complexity is achieved by trading-off routing utilization and completion time. The proposed parallel algorithm offers ~95% routing utilization while completing within processing time comparable to the fastest parallel algorithm

• Sequential routing algorithm using the same reduced connectivity complexity of O(N2) by using the same hardware proposed for the parallel algorithm, achieves 100% routing with a time complexity of O(N3/5) with success probability of 99.9999%

• Adaptive algorithm that improves utilization by 47%-88%, increases throughput by 20%-30%, and reduces latency by 60%-70% compared to best prior work, while maintaining the same time complexity as prior work, thereby enabling practical deployment

X-ISA: A Novel Instruction Set Architecture for Switch Dataplanes

Roy Zemach, Xsight Labs; Eli Elmoalem, Xsight Labs; Gal Malach, Xsight Labs

This work introduces the X-Switch Instruction Set Architecture (X-ISA), a fully programmable architecture designed to enhance network programmability. X-ISA offers a robust foundation for software-driven packet processing, supporting both current and emerging networking protocols and applications.

Traditional pipeline-based switches either use fixed-function stages optimized for specific tasks or programmable stages confined to preassigned resources. Both approaches limit flexibility in supporting new protocols and scalability. In contrast, X-ISA provides a native programmable run-to-completion approach with complete decoupling between the program and the resources it uses. For example, it supports parallel operations and dynamic resource allocation without code changes. This approach improves execution latency and enhances scalability while reducing code development and maintenance time.

X-ISA is publicly available under the Mozilla Public License version 2 (MPLv2) and is supported by the X-Switch family of devices from Xsight Labs.

Revolutionizing Datacenter Networks via Reconfigurable Topologies

Chen Avin, BGU; Stefan Schmid, TU Berlin

arXiv

With the popularity of cloud computing and data-intensive applications such as machine learning, datacenter networks have become a critical infrastructure for our digital society. Given the explosive growth of datacenter traffic and the slowdown of Moore's law, significant efforts have been made to improve datacenter network performance over the last decade. A particularly innovative solution is reconfigurable datacenter networks (RDCNs): datacenter networks whose topologies dynamically change over time, in either a demand-oblivious or a demand-aware manner. Such dynamic topologies are enabled by recent optical switching technologies and stand in stark contrast to state-of-the-art datacenter network topologies, which are fixed and oblivious to the actual traffic demand. In particular, reconfigurable demand-aware and 'self-adjusting' datacenter networks are motivated empirically by the significant spatial and temporal structures observed in datacenter communication traffic. This paper presents an overview of reconfigurable datacenter networks. In particular, we discuss the motivation for such reconfigurable architectures, review the technological enablers, and present a taxonomy that classifies the design space into two dimensions: static vs. dynamic and demand-oblivious vs. demand-aware. We further present a formal model and discuss related research challenges. Our article comes with complementary video interviews in which three leading experts, Manya Ghobadi, Amin Vahdat, and George Papen, share with us their perspectives on reconfigurable datacenter networks.

(To be published in the Communication of the ACM June 2025)

Competitive Distributed Caching in Data-center Network Switches

Elioz Geller, BGU; Chen Avin, BGU; Gabriel Scalosub, BGU

Modern data-center networks (DCNs) are based on the paradigm of Software-Defined Networking (SDN), where an SDN controller manages the forwarding rules available at the switches.

In large DCN deployments, one must often handle huge forwarding policies containing hundreds of millions of forwarding rules.

Due to the limited memory available at the switches, it is impossible to store the entire policy in every switch, and packets that do not have a corresponding rule at the switch are forwarded to the SDN controller for further handling, which significantly affects performance in terms of both latency and throughput.

In this work, we view the forwarding tables at the switches as {\em caches}, where the controller dynamically determines which rules to store at each switch so as to optimize system performance, depending on the underlying traffic.

We present several online algorithms, and provide upper bounds on their competitive ratio.

We further perform an extensive evaluation study were we consider the performance of our proposed algorithms, and heuristics, based on the insights attained from our analytical study.

Our results show that our proposed algorithms exhibit as much as 10x improved performance compared to SoTA solutions, while supporting significantly higher insertion rates.

How to Reduce the Delay in All-Reduce with Scarce In-network Aggregation

Raz Segal, BGU; Chen Avin, BGU; Gabriel Scalosub, BGU

Modern machine learning (ML) and big data (BD) applications rely on distributed computations executed over HPC clusters and datacenters networks. The distributed computing process critically depends on intermediate \emph{Reduce} tasks where partial results generated by numerous workers need to be efficiently combined. Minimizing the overall task completion time in these scenarios is essential for the performance of such applications. In-network aggregation (INA) has emerged as a promising approach to accelerate these processes by offloading aggregation tasks from hosts to network devices. However, deploying and maintaining many INA devices can be costly and might not be universally available, necessitating careful resource allocation strategies.

In this paper, we investigate the problem of optimizing the Reduce primitives in distributed computing by placing a limited number of INA switches in the network. We focus on tree-based network topologies where we consider the Delay-Minimization with Bounded In-network Computing (D-BIC) problem and explore fundamental bounds on the achievable task completion time. We first show that different standard optimization metrics, i.e., utilization, congestion, and delay, can lead to different placement strategies. We then turn to study upper bounds on the delay of Reduce operations, a metric that was not studied before in this context. We introduce the Path-Minimization with Bounded In-network Computing (P-BIC) problem, which implies an upper bound on the delay. We then present an algorithm that optimally solves P-BIC, which yields an $h(T)$-approximation for the best possible delay, and further displays significant performance improvements compared to alternative approaches. Our theoretical analysis and simulation results highlight the effectiveness of our proposed solution in achieving near-optimal task completion times for distributed aggregation tasks, thus improving the efficiency of distributed ML and BD applications.

Fast Real-World Classification of ECH-Enabled Applications

Yizhak Kahana, Allot; Yaakov Stein, Allot; Oren Glickman, Bar Ilan University

Encrypted Client Hello (ECH) reduces information leakage in encrypted packet connections, thereby complicating traditional Deep Packet Inspection (DPI) methods. While various machine learning approaches have been proposed for classifying ECH network connections, prior research typically requires lengthy observation times and focuses on a limited set of applications in constrained environments. This study demonstrates a highly accurate classification mechanism capable of identifying a large number of applications in real-world traffic within one second. To train our models we created a novel dataset derived from operational network traffic, ensuring relevance and diversity in training and evaluation. Furthermore, we propose a novel hierarchical approach that leverages protocol-specific knowledge, enabling scalable and efficient classification in operational settings. Our findings highlight the effectiveness of leveraging short Per Packet Information (PPI) sequences and domain-specific features to overcome the challenges posed by ECH, achieving performance superior to that of existing methods.

Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

Benjamin Fuhrer, NVIDIA; Yuval Shpigelman, NVIDIA; Chen Tessler, NVIDIA Technion; Shie Mannor, NVIDIA Technion; Gal Chechik, NVIDIA Bar-Ilan University; Eitan Zahavi, NVIDIA; Gal Dalal, NVIDIA

CCGrid 2023

As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we offer a solution to this problem by building a computationally-light solution based on a recent reinforcement learning CC algorithm. We reduce the inference time of RL-CC by x500 by distilling its complex neural network into decision trees. This transformation enables real-time inference within the μ-sec decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested over a large range of number of flows. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.

Scalable, Near-Optimal Virtual Network Embedding

Oleg Kolosov, Technion; Gala Yadgar, Technion; David Breitgand, IBM; Dean Lorenz, IBM; Rasoul Behravesh, Fondazione Bruno Kessler; Danny Raz, Technion & Google

Virtual Network Embedding (VNE) is central to network virtualization. Given a set of requests for instances of virtual networks, VNE is required to map a maximum number of these requests to an underlying physical network, subject to capacity constraints. Each node in a virtual network instance represents a virtualized function and each link represents communication between the two functions. A solution to VNE maps each virtual function to a physical node and each virtual link to a path in the physical network. In the offline setting, all requests are known in advance, which allows the solution to schedule mappings for all requests. In the online setting, requests for instances of virtual networks arrive one by one and each request should be either embedded or rejected, individually. VNE is NP-hard in either setting. In this talk, we overview our recent work on online and offline VNE. We present two novel algorithms PRANOS (INFOCOM’24) and TANTO (INFOCOM’25) for the offline VNE setting, and OLIVE (ICDCS’25) for the online VNE setting. These solutions combine high scalability with near-optimality, as demonstrated through extensive evaluations. We show that our solutions can efficiently handle large realistic topologies, while supporting demands that are larger than previously reported results by several orders of magnitude, in terms of the number of users and the request rate.

Page updated

Google Sites

Report abuse