The Design and Implementation of a Log-Structured File System (Mendel Rosenblum and John K. Ousterhout, SOSP 1991)
- The log is the final home of data.
- The most difficult challenge in the design of a log-structured file system is to ensure there are always large extents of free space available for writing new data.
- The problem that a log-structured file system solves is that it can utilize as high as 65%~75% of a disk's raw bandwidth while Unix file system can only utilize 5~10% of a disk's raw bandwidth for writing new data. The rest of time is spent seeking. What is the disk bandwidth utilization for ext, btrfs and xfs?
- File system design principals: file system design is governed by two general forces: technology, which provides a set of basic building blocks and workload, which determines a set of operations that must be carried out efficiently.
- Larger file caches alter the workload presented to the disk by absorbing a greater fraction of the read requests. This could also be true because of the introduction of flash-based SSDs.
- A clever way to identify which checkpoint is effective and most recent: store the checkpoint time as the last block of the checkpoint region. So, if the checkpoint fails, the time will not be updated. During reboot, the system reads both checkpoint regions and uses the one with the most recent time.
* People have always been saying that processors are improved at a much faster speed than other components in computers. As a result, that makes systems become unbalanced. What are the criteria that define a "balanced" system? [investigate each component and decide which is the bottleneck?]
* File caches can be used to cache reads and writes. How to balance between them?
- A Trace-Driven Analysis of the Unix 4.2 BSD File System (SOSP 1985, John Ousterhout)
- A case for Redundant Arrays of Inexpensive Disks. (SIGMOD 1988)
- Measurements of a Distributed File System (John Ousterhout, SOSP 1991)
- A Fast File System for Unix (1984)
- Disk Scheduling revisited (1990, Chen Peter)
- Log Files: An extended File service exploiting write-once storage (1987)
The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms
The authors demonstrated that by adding prefetching into cache replacement algorithms, it will affect the hit ratio of different algorithms differently: some are improved significantly while others may not.
The take-home message from this paper is that we should take prefetching into consideration when designing cache replacment algorithms.
More take-home messages:
1. hit ratio is not a good indicator for final performance while the number of disk requests gives a more accurate predication. A higher hit ratio does not always lead to few disk requests.
2. Prefetching achieves its most benefit when accesses are sequential and of small sizes. If accesses are large and sequential, prefetching brings little improvements.
This is because when prefetching is not enabled, for sequential and small size I/Os, it does not provide much opportunities for small I/Os to be clustered into larger I/Os. When prefetching is enabled,
it allows more larger I/Os and more in-flight I/Os which provides more opportunities for being clustered into larger I/Os, thus improving the efficiency of disks.
3. database applications are dominated by random I/Os.
1. A study of integrated prefetching and caching strategies. (SIGMETRICS, 1995)
2. Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling. (TOCS, 1996)
3. A low-overhead, high-performance unified buffer management schme that exploits sequential and looping references. (OSDI 2000)
4. The Multi-queues replacement algorithm for second-level buffer caches. (ATC 2001)
Multi-Dimensional Storage Virtualization
The goal of stonehenge is to provide virtual disks with guaranteed attributes in multi-dimensions, including availability (degree of replication), bandwidth (IOPS), capacity, delay and elasticity.
The main idea focuses on how to guarantee disk bandwidth using the virtual clock algorithm. delay guarantee is converted into bandwidth guarantee. The virtual Clock scheduling can give
QoS guarantee in terms of bandwidth while it may not be able to utilize disk bandwidth efficiently. To improve the disk utilization, the authors proposed to maintain two queues, one is QoS
queue, maintained by the virtual clock algorithm. The second is the utilization queue, ordered by the CSCAN order. When deciding which request to service next, it first check whether
servicing the head of request in the utilization queue will violate the finish time of the head request in the QoS queue. It will dispatch the head request in the utilization queue if no such
violation. Otherwise, it will dispatch the head request in the QoS queue. Another important admission control algorithm is also described.
Though the paper claims that it supports multi-dimensional virtualization, it relies on bandwidth virtualization essentially.
1. More than an interface: SCSI and ATA, (FAST '03)
2. Performance evaluation of two new disk scheduling algorithms for real-time systems. (UM-CS-1990-007)
3. Dynamic locality improvement techniques for increasing effective storage performance. (Winsdor Hsu, UCB/CSD-03-1223)
4. Facade: virtual Storage devices with performance guarantees. (FAST '03)
5. Resource overbooking and application profiling in shared hosting platforms. (OSDI '02)
Reducing impact of data fragmentation caused by in-line deduplication
The main idea is to selectively rewrite duplicate blocks during the write process so that the fragmentation for the latest version of the backup can be reduced. Duplicates for previous versions of the backup are removed by some background process. Essentially, the fragmentation of the latest version of the backup is reduced at the expense of previous versions of the backup. It uses a utility metric to decide whether to rewrite a duplicate block or not. The utility metric for a duplicated block is measured by the extent of overlap between the disk context and the stream context of that block. If the overlap is high, then not to rewrite this block. If the overlap is low, then re-write this duplicate block. Simulation is done to demonstrate the effectiveness of their approach, without no real-system evaluation.
RADOS: A Scalable, Reliable Storage Services for Petabyte-scale Storage Clusters
The main idea seems to be that by agreeing on the global cluster map, each client node and data node knows where to read/write and replicate data, thus achieving high scalability without a central metadata node. (SAN and NAS are not scalable because of ...)
In order to leave low-level block allocation and security enforcement to individual storage node, we need to expose a high-level interface, not block-level interface to upper layers. That is why we decide to expose an object-level interface. Then filesystem and block device layers are built based on object storage interfaces.
* Ursa minor: versatile cluster-based storage, (FAST '05)
* FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. (W. J. Bolosky, OSDI '02)
* Wide-area cooperative storage with CFS, (M. F. Kaashoek, R. Morris, and I. Stoica, SOSP '01)
* Self-* storage: Brick-based storage with automated administration. (G. R. Ganger, CMU-CS-03-178)
* A cost-effective, high-bandwidth storage architecture, (G. A. Gibson, E. Riedel, ASPLOS '98)
* Glacier: Highly durable, decentralized storage despite massive correlated failures, (P. Druschel, NSDI '05)
* The design and evaluation of network raid protocols, (IBM Research, 04)
* Reo: A generic raid engine and optimizer, (FAST '07)
* Boxwood: Abstractions as the foundation for storage infrastructure, (Lidong Zhou, OSDI '04)
* Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility, (P. Druschel, SOSP '01)
* Chord: A scalable peer-to-peer lookup service for internet applications, (I. Stoica, R. Morris, and M. F. Kaashoek, H. Balakrishnan. SIGCOMM '01)
* Managing scalability in object storage systems for HPC linux clusters, (G. Gibson, MASS '04)
* The HP AutoRAID hierarchical storage system, (J. Wilkes, SOSP '05)
* Kybos: self-management for distributed brick-based storage, (T. M. Wong, IBM research report, 2005)
* The design and implementation of AQuA: an adaptive quality of service aware object-based storage device, (S. A. Brandt, MASS '06)
* Reliability mechanisms for very large storage systems, (E. L. Miller, D. Long, S. A. Brandt, MASS '03)
DRAM Errors in the Wild: A Large-Scale Field Study (Google)
This paper analysed the DRAM correctable and uncorrectable errors in Google production clusters and over 2 years. It made the following observations
1. The incidence of memory errors is much higher than previously reported
2. There are strong correlation between correctable errors, between correctable errors and uncorrectable errors.
3. The incidences of correctable errors increases as the DIMM ages.
4. No strong correlation between temperature and memory error is observed.
5. Memory errors are strongly correlated with the utilization of the system
Efficient QoS for Multi-Tiered Storage Systems
this paper argues that if a workload has a better SSD hit ratio, it should get its IOPS improved by that amount. The IOPS improved should not be shared with other workloads: other workloads should not get the free lunch from that workload.
1. reward scheduling for qos in cloud application (IEEE on Cluster, Cloud, and Grid computing, 2012)
2. Application-sensitive qos scheduling in storage servers. (ACM on Parallelism in Algorithms and architecture, 2012)
3. PARDA: proportional allocation of resources for distributed storage access. (FAST 09)
4. pclock: an arrival curve based approach for qos in shared storage systems. (SIGMETRICS 2007)
5. mclock: handling throughput variability for hypervisor IO scheduling. (OSDI 10)
6. Iterposed proportional sharing for a storage service utility. (SIGMETRICS 2004)
7. Triage: performance differentiation for storage systems using adaptive control. (2005)
8. FIOS: a fair, efficient flash i/o scheduler. (FAST 2012)
9. Efficient guaranteed disk request scheduling with fahrrad. (Eurosys 2008)
10. High THroughput Disk Scheduling with Fair Bandwidth Distribution. (IEEE Trans. on computers)
26. Zygaria: Storage Performance as managed resource. (2006)
27. Storage performance virtualization via throughput and latancy control. (Trans. Storage 2006)
Decentralized DeDuplication in SAN Cluster File Systems
* A quite good reading
* A central index is kept and each host is responsible for duplicate elimination for files it has locks. No cross-server communication is needed
* deduplication is optional and out of band for performance purpose
A Low-bandwidth Network File System
Primary Data Deduplication - Large Scale Study and System Design
1. Decentralized deduplication in SAN cluster file systems. (USENIX ATC 2009)
2. Measurement and analysis of large-scale network file system workloads. (USENIX ATC 2008)