Advanced Coding Techniques for Fast Failure Recovery in Distributed Storage Systems

Contact me at sonhoang.dau@rmit.edu.au if you are interested in this project.

Project description

A recent report by Cisco [1] estimated that the amount of data stored in data centres worldwide will grow more than fivefold, from 171 exabytes in 2015 to 915 exabytes by 2020. Substantially more advanced science and technology must be developed to cope with this staggering growth in the demand for data storage.

Distributed storage systems form an essential component of every contemporary data centre, guaranteeing data availability, scalability, and cost-effectiveness. Just like a RAID-based system, in which a number of physical disks are combined into a single storage unit [2], a distributed storage system is formed by networking a large number of storage servers (a.k.a. storage nodes), which are located at different racks within a single data centre or even across geographically dispersed data centres around the globe. In order to avoid data loss and to increase data availability, each stored file is split into several data blocks, which are then replicated or, more often, transformed into coded blocks using an erasure code, e.g. a Reed-Solomon code [3]. These blocks are then distributed to different storage nodes across the system. A user can recover the file by retrieving the coded blocks stored at any suitable set of storage nodes.
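As a concrete illustration of this encode-then-recover principle, the self-contained Python sketch below implements a tiny Reed-Solomon (polynomial-evaluation) code. For readability it works over the prime field GF(257) rather than the GF(2^8) arithmetic used in production systems, and the parameters k = 10 and n = 14 are chosen only to mirror a common RS(14,10) layout; it is a pedagogical sketch, not the code of any deployed system.

```python
# Minimal sketch of Reed-Solomon-coded storage: split a file into k data
# symbols, encode them into n > k coded symbols, and rebuild any lost symbol
# from any k surviving ones. All parameters here are illustrative assumptions.

P = 257  # a small prime field; real deployments typically use GF(2^8)


def lagrange_eval(points, x):
    """Evaluate, at x, the unique degree-(len(points)-1) polynomial
    passing through the given (xi, yi) pairs, with arithmetic mod P."""
    result = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        result = (result + yj * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return result


def encode(data, n):
    """Systematically encode k data symbols into n coded symbols:
    coded symbol i is the value of the data polynomial at point i."""
    data_points = list(enumerate(data, start=1))   # (1, d1), ..., (k, dk)
    return [(x, lagrange_eval(data_points, x)) for x in range(1, n + 1)]


def repair(surviving, lost_x, k):
    """Rebuild the coded symbol at point lost_x from any k surviving symbols."""
    return lagrange_eval(surviving[:k], lost_x)


# Example: k = 10 data symbols, n = 14 coded symbols (10 data + 4 parity).
data = [12, 7, 255, 0, 42, 99, 3, 128, 64, 200]
coded = encode(data, n=14)

# Suppose node 3 fails; recover its symbol from 10 of the 13 remaining nodes.
lost_x, lost_y = coded[2]
survivors = [p for p in coded if p[0] != lost_x]
assert repair(survivors, lost_x, k=10) == lost_y
```

Note that the conventional repair shown here contacts k = 10 helpers and downloads a full symbol from each, which is exactly the cost this project seeks to reduce.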

In modern distributed storage systems, which consist of thousands of inexpensive and unreliable storage devices, failure has become the norm rather than the exception, as stated in a study from Google [4]. A failure can be temporary, caused by software bugs and upgrades, loss of network connectivity, power outages, or other non-disk hardware faults [5], in which case the data is not affected. A failure can also be permanent, most commonly a disk failure, in which case the data stored in the node containing the failed disk is permanently lost. For instance, in a 3000-node data analytics cluster at Facebook where petabytes of information are stored, it is quite common to have 20-40 node failures per day that trigger repair jobs [6].

Figure 1. Examples of storage systems using Reed-Solomon codes.

Replication and erasure codes guard the stored data against software and hardware failures by adding an appropriate level of redundancy. Compared with replication, erasure codes are a favourable choice due to their lower storage overhead, their much higher read/write throughputs [7] thanks to parallelism, and their longer mean time to failure [8]. Indeed, erasure codes are currently employed by a number of prominent companies, such as Google, Facebook, Baidu, Yahoo, Backblaze, Amazon, and Microsoft, to protect their storage systems (see the figure above).
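To make the overhead comparison concrete, the short sketch below contrasts 3-way replication with an RS(14,10) erasure code (the layout cited later in this description for Facebook's f4); the 1 GB file size is an illustrative assumption, not a figure from any of the cited systems.

```python
# Back-of-the-envelope comparison of storage overhead and failure tolerance
# for 3-way replication vs. an RS(14,10) code (illustrative numbers only).
file_gb = 1.0

rep_copies = 3
rep_stored_gb = file_gb * rep_copies        # 3.0 GB on disk
rep_failures_tolerated = rep_copies - 1     # survives the loss of any 2 copies

k, n = 10, 14                               # 10 data blocks + 4 parity blocks
rs_stored_gb = file_gb * n / k              # 1.4 GB on disk
rs_failures_tolerated = n - k               # survives the loss of any 4 blocks

print(f"3-way replication: {rep_stored_gb:.1f} GB stored, tolerates {rep_failures_tolerated} failures")
print(f"RS(14,10):         {rs_stored_gb:.1f} GB stored, tolerates {rs_failures_tolerated} failures")
```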

In the presence of failures, the two most critical system operations are degraded reads of temporarily unavailable data and recovery from node failures, according to a recent study by researchers from Microsoft, Johns Hopkins University, and the University of Tennessee [9]. Both operations must reproduce the data stored in an unavailable/failed node from the data stored in other available nodes, which is often an extremely resource-consuming process. For example, 2.56 GB must be accessed and downloaded in order to reconstruct a single 256 MB data block encoded by a Reed-Solomon code in Facebook's f4 storage system [10]. This creates a considerable source of overhead in disk input/output (I/O) and network bandwidth, and significantly slows down the recovery process, given that node unavailability becomes more frequent as the system expands. In fact, it was observed [11] that in a data-warehouse cluster in production at Facebook, a median of nearly 200 terabytes of data is transferred across racks every day just to repair Reed-Solomon-coded data, which amounts to 10%-20% of the total network traffic in the cluster.
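The 2.56 GB figure follows directly from how conventional Reed-Solomon repair works: rebuilding one lost block requires downloading k full blocks from surviving nodes. The snippet below simply reproduces that arithmetic for the RS(14,10), 256 MB-block layout cited above.

```python
# Conventional repair of one Reed-Solomon-coded block: the decoder downloads
# k whole blocks from surviving nodes. Parameters match the f4 example above.
block_mb = 256          # size of one coded block
k = 10                  # data blocks per stripe in RS(14,10)

downloaded_mb = k * block_mb
print(f"{downloaded_mb} MB (= {downloaded_mb / 1000:.2f} GB) downloaded "
      f"to repair a single {block_mb} MB block")
```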

The overarching aim of this project is to investigate advanced coding techniques that help speed up and secure the failure recovery process of distributed storage systems. The recovery rate has a direct impact on data availability: as emphasized in a recent study from Google [5], a 10% reduction in recovery time can result in a considerable 19% reduction in data unavailability. Apart from data availability, this project also aims to incorporate data confidentiality and integrity into the storage and recovery processes, as a complement to traditional cryptographic methods. Indeed, data availability, confidentiality, and integrity form the most essential information security aspects of a cloud storage service (see [12], Ch. 4). More specifically, the project objectives are:

  1. to design new repair schemes for erasure codes that

    • reduce the repair bandwidth, i.e. the amount of data to be downloaded from the available storage nodes during the failure recovery process;

    • reduce the repair I/O (the disk input/output), i.e. the amount of data to be accessed at the available storage nodes during the recovery process; note that the repair I/O can exceed the repair bandwidth (see the sketch after this list);

  2. to evaluate the benefits of the new repair schemes via statistical models and via implementations on open-source storage platforms (e.g. the Hadoop Distributed File System);

  3. to enhance data confidentiality and integrity levels of the underlying erasure codes.
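To make the distinction between the two metrics in objective 1 concrete, the toy accounting sketch below (all numbers are hypothetical, and no real repair scheme is modelled) tallies the bytes each helper node reads from disk versus the bytes it actually transmits after local processing. Because a helper may have to read its entire block in order to compute the small quantity it sends, the repair I/O can exceed the repair bandwidth.

```python
# Toy accounting of one repair operation (hypothetical numbers only).
# Each helper reads data from its local disk, computes a small function of it
# locally, and sends only that result to the replacement node.
MiB = 2**20

helpers = [
    # "disk_read": bytes read locally; "sent": bytes transmitted over the network
    {"disk_read": 256 * MiB, "sent": 64 * MiB},
    {"disk_read": 256 * MiB, "sent": 64 * MiB},
    {"disk_read": 256 * MiB, "sent": 64 * MiB},
]

repair_bandwidth = sum(h["sent"] for h in helpers)      # objective 1, first bullet
repair_io = sum(h["disk_read"] for h in helpers)        # objective 1, second bullet

print(f"repair bandwidth: {repair_bandwidth // MiB} MiB")
print(f"repair I/O:       {repair_io // MiB} MiB")      # exceeds the bandwidth here
```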

The project addresses fundamental theoretical questions about the structure of erasure codes, in particular Reed-Solomon codes, with respect to their repair capabilities and limitations, and it tackles the practical problem of improving the recovery performance of distributed storage systems. We focus on reducing the repair bandwidth and the disk I/O, the two most constrained resources during the recovery process. In fact, as reported by researchers from Purdue University and AT&T [13], network transfer and disk read constitute more than 98% of the total reconstruction time (network transfer alone accounts for at least 80%) in the Quantcast File System. It was also observed in another study [9] that disk reads always take at least nine times longer than the associated computation. The proposed project builds upon my recent research [14]–[19], in which we obtained a noticeable 30% reduction in the repair bandwidth of the Reed-Solomon code currently employed by Facebook's f4 storage system.
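Putting the figures already quoted in this description together gives a rough sense of the payoff; the calculation below is only this back-of-the-envelope check, using the 2.56 GB conventional-repair download and the 30% reduction mentioned above, and does not describe how the reduction is achieved.

```python
# Back-of-the-envelope check using only figures quoted in this description.
conventional_gb = 2.56        # download per repaired 256 MB block (f4 example)
reduction = 0.30              # repair-bandwidth reduction reported in [14]-[19]

improved_gb = conventional_gb * (1 - reduction)
print(f"~{improved_gb:.2f} GB downloaded per repaired block after the reduction")
```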

References

[1] “Cisco Global Cloud Index: Forecast and Methodology, 2015–2020,” 2016, available online at https://goo.gl/Fb16wJ.

[2] D. Patterson, G. A. Gibson, and R. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1988.

[3] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300–304, 1960.

[4] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), pp. 29–43, 2003.

[5] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, “Availability in globally distributed storage systems,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[6] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, “XORing Elephants: Novel Erasure Codes for Big Data,” Proceedings of the VLDB Endowment, vol. 6, no. 5, 2013.

[7] R. Li, Z. Zhang, K. Zheng, and A. Wang, “Progress Report: Bringing Erasure Coding to Apache Hadoop,” 2016, available online at https://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/.

[8] H. Weatherspoon and J. Kubiatowicz, “Erasure coding vs. replication: A quantitative comparison,” in Proceedings of the International Workshop on Peer-to-Peer Systems (IPTPS), 2002.

[9] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang, “Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads,” in Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST), 2012.

[10] S. Muralidhar et al., “f4: Facebook’s warm BLOB storage system,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 383–398, 2014.

[11] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, “A ‘Hitchhiker’s’ guide to fast and efficient data reconstruction in erasure-coded data centers,” in Proceedings of the ACM SIGCOMM Conference, pp. 331–342, 2014.

[12] T. Mather, S. Kumaraswamy, and S. Latif, Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance, O'Reilly Media, Inc., 2009.

[13] S. Mitra, R. Panta, M.-R. Ra, and S. Bagchi, “Partial-parallel-repair (PPR): A distributed technique for repairing erasure coded storage,” in Proceedings of the European Conference on Computer Systems (EuroSys), Article no. 30, 2016.

[14] H. Dau, I. Duursma, H. M. Kiah, and O. Milenkovic, “Repairing Reed-Solomon codes with multiple erasures,” IEEE Transactions on Information Theory, vol. 64, no. 10, pp. 6567–6582, 2018.

[15] H. Dau, I. Duursma, H. M. Kiah, and O. Milenkovic, “Repairing Reed-Solomon codes with two erasures,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 351–355, 2017.

[16] I. Duursma and H. Dau, “Low bandwidth repair of the RS(10,4) Reed-Solomon code,” invited paper, Information Theory and Applications Workshop (ITA), San Diego, California, 2017, available at http://ita.ucsd.edu/workshop/17/files/paper/paper_3783.pdf.

[17] H. Dau and O. Milenkovic, “Optimal repair schemes for some families of full-length Reed-Solomon codes,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 346–350, 2017.

[18] S. H. Dau, I. Duursma, and H. Chu, “On the I/O costs of some repair schemes for full-length Reed-Solomon codes,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 1700–1704, 2018.

[19] H. Dau and E. Viterbo, “Repair schemes with optimal I/O costs for full-length Reed-Solomon codes with two parities,” in Proceedings of the IEEE Information Theory Workshop (ITW), 2018.