Abstract.
Current scientific applications, such as collaborative research experiments in biology and instrument observations in astronomy, can generate terabytes (TB) of data daily. This Scientific Big Data (SBD) is transferred to processing and storage sites, which may be located anywhere from hundreds of meters to thousands of kilometers away. SBD is transferred over traditional, shared networks. Despite the availability of high-speed, low-latency links, adding more bandwidth is not a solution: shared networks do not meet the requirements for transmitting SBD on schedule, and transfer protocols impose further restrictions that dramatically reduce throughput. The overall result is frustrating for both scientists and network architects: SBD is exchanged using flash drives and hard drives, even when the nodes are interconnected and only a few blocks apart.
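To illustrate why extra bandwidth alone does not help, the following minimal sketch evaluates the well-known Mathis et al. bound on steady-state TCP throughput, rate <= (MSS / RTT) * (C / sqrt(p)). The choice of this model, the function name, and the numeric values are assumptions made for illustration only; the proposal does not name a specific throughput model.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on loss-limited TCP throughput, in bits per second.

    Implements rate <= (MSS / RTT) * (C / sqrt(p)) from the Mathis et al. model.
    C = sqrt(3/2) corresponds to periodic losses without delayed ACKs.
    All parameter values used with this function are illustrative assumptions.
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical long-distance path: 1460-byte MSS, 50 ms RTT, 0.01% packet loss.
    bound = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.050, loss_rate=1e-4)
    print(f"Loss-limited bound: {bound / 1e6:.1f} Mb/s")
```

Under these assumed values the bound is about 28.6 Mb/s regardless of the link capacity, which is consistent with the observation that packet loss and round-trip time, rather than raw bandwidth, limit long-distance transfers.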
This proposal tackles the problem of transferring SBD through a traditional network infrastructure. The problem is not trivial: the efficient exchange of SBD in a shared network must account for limited network resources, the network dynamics, and the time-varying nature of massive scientific data, and it must also harmonize the coexistence of different types of traffic. Thus, the main goal of this proposal is to provide a network design methodology that transforms a traditional general-purpose network into an advanced cyberinfrastructure suitable for exchanging SBD traffic at a sustained high throughput. The specific aims of this proposal are: (i) exploit Science Demilitarized Zone (Science DMZ) design patterns to create network architectures, equipped with specialized devices, capable of efficiently transferring SBD; (ii) design traffic management and flow control algorithms, based on mathematical models and online network status data, to dynamically provide high-throughput exchange of SBD; and (iii) simulate the network architectures and the traffic management and flow control algorithms for SBD, and prototype them in a testbed. To achieve these goals, this proposal relies on two key innovations: the novel Science DMZ network design paradigm and data-plane programmable switch technology. The Science DMZ paradigm provides guidelines for redesigning a section of the network to support SBD traffic exclusively, and it adds specialized network devices, high-throughput Data Transfer Nodes (DTNs), and monitoring tools to achieve high performance. In addition, data-plane programmable switches are cutting-edge network devices that provide tools for engineering network protocols, managing network resources, and monitoring the network state.
The proposed methodology is as follows. The Science DMZ design patterns will be used to create shared data paths, which will carry different types of data traffic, and friction-free data paths, which will transfer the packet-loss-sensitive SBD. The friction-free data paths will connect the specialized DTNs, which will exchange only SBD using high-performance communication software. DTNs will be optimally located in the network topology to minimize packet losses, thereby improving end-to-end throughput. Open-source programmable switch technology will be used to efficiently allocate network bandwidth and buffers. Programmable switches will also be used to design network protocols that either manage network flows or supply information to the DTNs so that they can manage end-to-end transmissions. In addition, network performance will be measured directly at the programmable switches so that traffic managers can use online information about the network state. Data sources at the DTNs will also be modeled to supplement the network state with prior information, while the network itself will be modeled mathematically using queueing theory. These models will be used to formulate optimization problems for designing shared data paths with optimal Quality of Service (QoS) for each traffic class and friction-free data paths with minimal packet losses. The designed architectures will be simulated using Mininet and Riverbed Modeler. A testbed network will be prototyped to validate the simulations in a controlled scenario.
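As a rough illustration of how a queueing model can relate buffer allocation to packet loss, the sketch below treats a single switch output port as an M/M/1/K queue and searches for the smallest capacity that meets a target loss probability. The M/M/1/K model, the function names, and the numeric values are assumptions chosen for simplicity; the proposal only states that queueing theory will be used, and real SBD traffic is unlikely to follow Poisson arrivals.

```python
import math


def mm1k_loss_probability(arrival_rate, service_rate, capacity):
    """Probability that an arriving packet is dropped in an M/M/1/K queue.

    capacity is K, the maximum number of packets in the system
    (the one in service plus those waiting in the buffer).
    """
    if arrival_rate <= 0 or service_rate <= 0 or capacity < 1:
        raise ValueError("rates must be positive and capacity >= 1")
    rho = arrival_rate / service_rate  # offered load
    if math.isclose(rho, 1.0):
        return 1.0 / (capacity + 1)
    return (1 - rho) * rho**capacity / (1 - rho ** (capacity + 1))


def min_capacity_for_loss(arrival_rate, service_rate, target_loss,
                          max_capacity=100_000):
    """Smallest K whose loss probability is at most target_loss, or None.

    Restricted to offered load below 1: on an overloaded port, adding
    buffer space alone cannot push the loss arbitrarily low.
    """
    if arrival_rate >= service_rate:
        raise ValueError("offered load must be below 1")
    for k in range(1, max_capacity + 1):
        if mm1k_loss_probability(arrival_rate, service_rate, k) <= target_loss:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical port: 80% offered load, target loss of 0.1%.
    k = min_capacity_for_loss(arrival_rate=0.8, service_rate=1.0,
                              target_loss=1e-3)
    print(f"Smallest capacity meeting the target: {k} packets")
```

A model of this kind could serve as one constraint in the optimization problems mentioned above, for example by bounding the loss probability of each friction-free data path while buffers are shared among traffic classes.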
The expected results of this project are: (i) novel methodologies for designing cyberinfrastructures capable of exchanging SBD; (ii) novel network protocols with minimal packet losses; (iii) new mathematical models for massive scientific data sources; (iv) novel packet parsing and processing programs for programmable switches; (v) presentation of the network design methodology to stakeholders, such as the ALMA and ESO astronomical observatories and REUNA, the data network connecting universities in Chile, with the aim of designing a Science DMZ network in Chile; and (vi) dissemination of the key findings to the general public through mass media.
Publications.
Funding agency: FONDECYT.
Program: FONDECYT Regular 2020.
Grant number: 1201495.
Funding period: April 2020 – March 2023.
PI: Jorge E. Pezoa.