Research

Rapid Prototyping and Evaluation of Intelligence Functions of Active Storage Devices [pdf][slides]

Active storage devices further improve their performance by executing "intelligence functions," such as prefetching and data deduplication, in addition to handling the usual I/O requests they receive. Significant research has been devoted to developing effective intelligence functions for active storage devices. However, setting up a suitable experimental platform to evaluate each new intelligence function usually requires laborious and time-consuming effort. Moreover, it is difficult to make such prototypes available to other researchers and users to gain valuable experience and feedback.

To overcome these difficulties, we propose IOLab, a virtual machine (VM) based platform for evaluating intelligence functions of active storage devices. The VM-based structure of IOLab enables the evaluation of new (and existing) intelligence functions for different types of OSes and active storage devices with little additional effort. IOLab also supports real-time execution of intelligence functions, giving users the opportunity to experience the latest intelligence functions without waiting for their deployment in commercial products. Using a set of interesting case studies, we demonstrate the utility of IOLab, whose performance overhead is negligible apart from the VM's inherent virtualization overhead.


Fig. 1. The overview of a system with IOLab.

Fig. 1 depicts a structural overview of a system with IOLab, which consists of an application layer, a host OS layer, and a block device layer.

Application layer

Two applications run on the host OS: a virtual machine monitor (VMM) and IOLab.

  • VMM: IOLab uses I/O requests generated by real applications as its input I/O trace. Instead of executing a target application directly on the host OS, IOLab runs it on the guest OS managed by the VMM. This design allows IOLab to effectively separate the implementation of an intelligence function from a specific OS. IOLab is designed to support a VMM that: (1) uses full virtualization mode, (2) uses a file-backed virtual disk image (VDI), and (3) accesses the VDI using file I/O system calls.

  • IOLab: intercepts the file-level I/O requests from the VM to the VDI file that would otherwise be sent directly to the host OS. The intercepted file I/Os carry the block-level access information for the VDI. An intelligence function runs inside IOLab, analyzing every I/O request it receives to extract useful information. Based on the extracted information, the intelligence function can modify the original I/O requests or create new ones, all of which are gathered and reordered according to their priorities before being sent to the host OS, as sketched below.
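As a concrete illustration, here is a minimal C sketch of this gather-and-reorder path. All names (io_req, iolab_submit, iolab_dispatch) and the fixed-size queue are our own illustration, not IOLab's actual implementation.

    /* Sketch of an IOLab-style gather-and-reorder dispatch path. */
    #include <stdlib.h>

    struct io_req {
        long long lba;      /* block address inside the VDI */
        size_t    len;      /* request length in bytes */
        int       prio;     /* smaller value = dispatched earlier */
        int       is_write;
    };

    #define QCAP 1024
    static struct io_req queue[QCAP];
    static int qlen;

    /* Called for every file-level I/O the VM issues to the VDI file; an
     * intelligence function may rewrite the request or enqueue extra
     * (e.g., prefetch) requests before dispatch. */
    void iolab_submit(struct io_req r)
    {
        if (qlen < QCAP)
            queue[qlen++] = r;
    }

    static int by_prio(const void *a, const void *b)
    {
        return ((const struct io_req *)a)->prio -
               ((const struct io_req *)b)->prio;
    }

    /* Reorder the gathered requests by priority, then hand them to the
     * host OS (represented here by a callback). */
    void iolab_dispatch(void (*send_to_host)(const struct io_req *))
    {
        qsort(queue, qlen, sizeof(queue[0]), by_prio);
        for (int i = 0; i < qlen; i++)
            send_to_host(&queue[i]);
        qlen = 0;
    }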

Host OS layer

The host OS receives I/O requests from IOLab and passes them to the associated component block device while keeping the received I/O stream as unmodified as possible. Linux Fedora 14 x64 (2.6.37 kernel) with the EXT4 file system is used as the host OS.

Block device layer

To express various types of active storage devices, IOLab uses a set of commodity block devices, which we call "component block devices," connected to the host machine via the device drivers of the host OS. For example, using two component block devices (an HDD and an SSD), IOLab can express four types of active storage devices: an HDD, an SSD, an HDD with a flash cache, and an SSD+HDD hybrid drive.

Case Study 1: Application Prefetcher (please refer to the paper for more case studies, including OS booting and hybrid HDDs)

The application prefetcher optimizes application launch performance by prefetching all the data blocks necessary for starting the application, in an optimized fashion, just before its launch process begins. An application prefetcher has been included in Windows XP and its subsequent versions, which we call the Windows prefetcher.

The execution of the application prefetcher involves the following phases.

  • Learning phase: monitors and logs all the I/O requests generated during the launch of each target application to determine the set of data blocks necessary for its launch.

  • Post-processing phase: creates an application launch sequence using the information obtained from the learning phase. It reorders the data blocks using a predefined sort key (e.g., inode number and file offset) and stores the resulting sequence in a reserved system folder (e.g., C:\WINDOWS\PREFETCH for the Windows prefetcher); a sketch of this sorting step follows this list.

  • Prefetching phase: upon detecting a new launch of a target application that has an application launch sequence file, it (1) immediately pauses the launch process, (2) fetches the data blocks in the order specified in the sequence file, and (3) resumes the launch process.
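The post-processing phase is essentially a sort over the logged launch entries. Below is a minimal C sketch assuming a simple (inode, offset, length) record; the struct and the on-disk format are illustrative, not the Windows prefetcher's actual format.

    /* Sketch of the post-processing phase: sort logged launch blocks by
     * the predefined key (inode number, then file offset) and persist
     * the result for the prefetching phase. */
    #include <stdio.h>
    #include <stdlib.h>

    struct launch_entry {
        unsigned long inode;   /* file identity */
        unsigned long offset;  /* offset within the file */
        unsigned int  nblocks; /* run length in blocks */
    };

    static int by_inode_offset(const void *a, const void *b)
    {
        const struct launch_entry *x = a, *y = b;
        if (x->inode != y->inode)
            return x->inode < y->inode ? -1 : 1;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    void write_launch_sequence(struct launch_entry *seq, size_t n,
                               const char *path)
    {
        qsort(seq, n, sizeof(seq[0]), by_inode_offset);
        FILE *f = fopen(path, "w");
        if (!f)
            return;
        for (size_t i = 0; i < n; i++)
            fprintf(f, "%lu %lu %u\n",
                    seq[i].inode, seq[i].offset, seq[i].nblocks);
        fclose(f);
    }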


Fig. 2. Disk block accesses of the Windows prefetcher and the IOLab prefetcher (OS: Windows XP; application: Word 2007; device: WD6400AAKS).

To see how the Windows prefetcher works in practice, we captured the output I/O requests from the block cache of IOLab while launching MS Office Word 2007 on the HDD. To ensure a cold-start scenario, we flushed the page caches of both the guest OS and the host OS before launching the application. Fig. 2(a) shows the disk block accesses with the Windows prefetcher disabled; in this case, the launch time was measured to be 8.3s. Once we enabled the Windows prefetcher, the launch time was reduced to 5.7s (Fig. 2(b)), clearly showing its benefit.
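On the Linux host, one standard way to flush the page cache is the kernel's drop_caches interface, sketched below. This is a well-known kernel facility rather than the paper's stated mechanism, and the guest OS cache must be flushed separately (e.g., by restarting the VM).

    /* Drop clean page-cache, dentry, and inode entries on a Linux host
     * to force a cold start; dirty pages are synced first. */
    #include <stdio.h>
    #include <unistd.h>

    int drop_host_page_cache(void)
    {
        sync();  /* write back dirty pages first */
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f)
            return -1;  /* requires root */
        fputs("3", f);  /* 3 = page cache + dentries and inodes */
        fclose(f);
        return 0;
    }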

Despite the performance improvement, Fig. 2(b) shows that a considerable number of random block accesses remain even with the Windows prefetcher, which is far from what we expected. As Windows is not open source, it is difficult to analyze this behavior in detail. Instead, we decided to implement our own application prefetcher on IOLab.

We configured IOLab to perform application prefetching as described above; we call this the IOLab prefetcher. Given below is a detailed account of how we configured it.

  • Launch sequence creation: existing methods can be used to automatically and accurately mine the block correlation information. For simplicity, in this experiment we manually captured the I/O request sequence generated during the application launch.

  • Application launch detection: the VM does not inform IOLab when the target application is launched. Hence, we configured IOLab to initiate an I/O dispatcher as soon as 3 consecutive block requests from the VM match the captured sequence (see the sketch after this list).

  • Application prefetcher generation: the above-created I/O dispatcher is configured to fetch the blocks of the captured sequence in the sorted order of their LBAs.

  • Launch control: we set the priority of the I/O dispatcher higher than that of the VM to pause the launch process of the target application running on the Windows OS. Upon completion of the prefetching, IOLab resumes the target application by responding to the I/O requests from the VM.
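A minimal C sketch of the launch-detection rule above: the dispatcher starts once 3 consecutive block requests from the VM match the head of the captured sequence. The function name and the hard-coded LBAs are illustrative.

    /* Launch detection: trigger after MATCH_WINDOW consecutive matches
     * against the head of the captured launch sequence. */
    #define MATCH_WINDOW 3

    static const long long captured[] = { 102400, 102408, 102416 /* , ... */ };
    static int matched;  /* consecutive matches so far */

    /* Feed every incoming block request from the VM; returns 1 when the
     * prefetching I/O dispatcher should be started. */
    int on_vm_request(long long lba)
    {
        if (lba == captured[matched]) {
            if (++matched == MATCH_WINDOW) {
                matched = 0;
                return 1;  /* launch detected */
            }
        } else {
            matched = 0;   /* streak broken */
        }
        return 0;
    }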

We conducted experiments on the WD6400AAKS HDD and plotted the resulting disk accesses in Fig. 2(c). The dashed box in Fig. 2(c) shows that the IOLab prefetcher made the LBAs of the I/O requests monotonically increasing, as intended. As a result, the application launch time was reduced to 4.4s (a 23% improvement over the Windows prefetcher). However, some I/O requests still occurred after the prefetcher completed because they were not included in the captured I/O request sequence. The prefetch hit ratio, obtained by dividing the number of I/O requests IOLab prefetched by the total number of I/O requests from the VM, was 95.4%.

A Hybrid PRAM and STT-RAM Cache Architecture for Extending the Lifetime of PRAM Caches [pdf]

To extend the lifetime of phase change RAM (PRAM) caches, we propose a hybrid cache architecture that integrates a relatively small spin transfer torque RAM (STT-RAM) write buffer with a PRAM cache. Our hybrid cache mitigates the endurance limitation of the PRAM cache by judiciously redirecting the write traffic from the upper memory layer to the STT-RAM write buffer. We demonstrate through simulation that the proposed hybrid cache outperforms existing write-traffic reduction schemes with the same area overhead. Moreover, our approach is orthogonal to the existing schemes, so it can be used in combination with them, providing an effective way of investing die area for cache lifetime extension.


Fig. 1. The structure of the proposed hybrid cache.

Fig. 1 depicts the structure of the hybrid cache, where a cache set (the dashed box) consists of n PRAM cache blocks and an STT-RAM write buffer. The PRAM cache partition follows the same organization as a conventional n-way set-associative cache, except that each cache block has an owner bit (o-bit), which denotes whether its data is stored in the PRAM cache block or in the STT-RAM write buffer.

As n PRAM cache blocks share a single STT-RAM write buffer, a conflict occurs when a write operation targets a PRAM cache block that does not own the STT-RAM write buffer. We call this conflict a "write buffer miss." To reduce write buffer misses, the ownership of the STT-RAM write buffer should be changed dynamically according to the write access pattern.

We propose a threshold-based ownership management policy to detect a cache block causing successive write buffer misses.

We attach a miss counter to each STT-RAM write buffer to keep track of the number of write buffer misses incurred by the last write-accessed cache block. The miss counter consists of two variables: CB_last and N_miss. The former denotes the last write-accessed cache block that hit in the cache, and the latter the number of successive write buffer misses. CB_last acquires ownership of the STT-RAM write buffer when N_miss reaches a predefined threshold N_th, as sketched below.
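A minimal C sketch of this policy for one cache set; the variable names mirror the text (CB_last, N_miss, N_th), but the hardware bookkeeping is simplified and the threshold value is an example, not the paper's.

    /* Threshold-based ownership management for one cache set. */
    #define N_TH 2  /* example threshold value */

    struct cache_set_state {
        int owner;    /* PRAM block currently owning the STT-RAM buffer */
        int cb_last;  /* CB_last: last write-accessed block that hit */
        int n_miss;   /* N_miss: successive write buffer misses */
    };

    /* Called on a cache write hit to block `blk`; returns 1 if the write
     * is absorbed by the STT-RAM write buffer, 0 on a write buffer miss
     * (the write then goes to the PRAM block itself). */
    int on_write_hit(struct cache_set_state *s, int blk)
    {
        if (blk == s->owner)
            return 1;                 /* write buffer hit */

        if (blk == s->cb_last) {      /* successive miss by same block */
            if (++s->n_miss >= N_TH) {
                s->owner = blk;       /* transfer buffer ownership */
                s->n_miss = 0;
            }
        } else {
            s->cb_last = blk;         /* new streak */
            s->n_miss = 1;
        }
        return 0;
    }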


Fig. 2. The flowchart of the proposed hybrid cache.

Fig. 2 shows the flowchart of the hybrid cache.


Fig. 3. Write traffic breakdown (y-axis is normalized to DW).

    • DW: differential write scheme

    • DI: data inverting scheme

    • HC: hybrid cache (proposed)

    • n_lower: write traffic due to cache block loads from the main memory

    • n_upper: write traffic due to cache block flushes from the L1 data cache

    • n_swap: write traffic due to swap operations from STT-RAM write buffers

    • n_wb: write traffic redirected to the STT-RAM write buffer partition

Fig. 3 shows the write traffic breakdown of the simulation results, where "Total" represents the accumulated bit changes over all 11 applications. For "Total," the bit change reductions of the PRAM cache partition were 22% for DI and 42% for HC. The total amount of bit changes of HC, including n_wb, was 99% of DW, showing that HC preserves the amount of the original write traffic. The cache areas reported by the simulator were 4.99 mm2 for DW, 6.02 mm2 for DI, and 6.07 mm2 for HC.
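For reference, the bit-change accounting behind DW and DI can be sketched as follows, under their common definitions: DW flips only the bits that differ, and DI additionally stores a unit inverted when that flips fewer bits. This is an illustration of the named schemes, not the simulator's exact model; the +1 charges the per-unit flag bit pessimistically.

    #include <stdint.h>

    /* Bit changes of a differential write (DW) of one 64-bit word. */
    static inline int dw_flips(uint64_t old_w, uint64_t new_w)
    {
        return __builtin_popcountll(old_w ^ new_w);
    }

    /* Bit changes with data inverting (DI) on top of DW, for an
     * inverting unit size of 64 bits. */
    static inline int di_flips(uint64_t old_w, uint64_t new_w)
    {
        int plain    = __builtin_popcountll(old_w ^ new_w);
        int inverted = __builtin_popcountll(old_w ^ ~new_w) + 1;
        return plain <= inverted ? plain : inverted;
    }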


Fig. 4. Normalized bit changes and lifetimes of various cache configurations.

    • DW(c: capacity): We changed the capacity of the baseline cache with DW (c=4MB, 8MB, 16MB).

    • DI(n: inverting unit size): We tested a set of inverting unit sizes for DI (n=64, 32, 16, 8, 4, 2).

    • HC+DI(n): DI(n) is applied together with HC (n=0 means HC without DI).

One important characteristic of the write traffic reduction schemes discussed here is that they are orthogonal to each other, allowing us to configure HC+DI(n). Fig. 4 shows that HC+DI(n) outperforms other cache configurations with the same die size.

For example, HC+DI(4) achieves a 39% longer lifetime than DI(2) with only a 4% larger die area.

Exploiting SSD parallelism to accelerate application launch on SSDs [pdf]

An optimised application prefetcher that accelerates application launch on solid-state drives (SSDs) by exploiting SSD parallelism is proposed. The proposed prefetcher was implemented on the Linux OS and achieved a 37% reduction of prefetcher execution time, which corresponds to an 18% reduction of application launch time.


Fig. 1: Proposed two-phase prefetcher (mi: metadata, di: normal data, ci: CPU computation).

In Fig. 1, the following dependencies are assumed: m1 must be fetched before d1, d2, and d3; m4 before d4 and d5.
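A minimal C sketch of the two-phase idea under such dependencies: metadata is resolved first (serially, since each di cannot be issued before its mi), and then all data reads are queued at once so the device's command queue stays deep. The struct and names are our illustration; POSIX_FADV_WILLNEED is used here because it initiates readahead without blocking, though whether the paper uses this exact call is an assumption.

    #include <fcntl.h>
    #include <sys/types.h>

    struct data_req { int fd; off_t off; size_t len; };

    /* Phase 2: queue every dependent data block without waiting for
     * completions, keeping many requests outstanding on the SSD.
     * Phase 1 (opening files / reading metadata mi) must run first. */
    void prefetch_data_phase(const struct data_req *reqs, int n)
    {
        for (int i = 0; i < n; i++)
            posix_fadvise(reqs[i].fd, reqs[i].off, reqs[i].len,
                          POSIX_FADV_WILLNEED);  /* async readahead hint */
    }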


Fig. 2: Visualization of SSD queue depth (application: Eclipse).

Fig. 2 shows that the two-phase prefetcher increased the effective queue depth to 30.6, quite close to the theoretical maximum of 31 (SATA-2 specification).


Fig. 3: Measured prefetcher execution time and application launch time.

Fig. 3 shows that the proposed two-phase prefetcher reduced the prefetcher execution time by 37% compared with the baseline prefetcher. As a result, the two-phase prefetcher reduced the application launch time by 18% compared with the cold-start launch time.

FAST: Quick Application Launch on Solid-State Drives [pdf][slides]

Application launch performance is of great importance to system platform developers and vendors, as it greatly affects user satisfaction. The single most effective way to improve application launch performance is to replace a hard disk drive (HDD) with a solid-state drive (SSD), which has recently become affordable and popular. A natural question is then whether or not to replace traditional HDD-aware application launchers with a new SSD-aware optimizer. We address this question by analyzing the inefficiency of HDD-aware application launchers on SSDs and then proposing a new SSD-aware application prefetching scheme, called the Fast Application STarter (FAST). The key idea of FAST is to overlap the computation (CPU) time with the SSD access (I/O) time during an application launch. FAST is composed of a set of user-level components and system debugging tools provided by the Linux OS (operating system). In addition, FAST uses a system-call wrapper to automatically detect application launches, so it can be easily deployed on any recent Linux version without kernel recompilation. We implemented FAST on a desktop PC with an SSD running the Linux 2.6.32 OS and evaluated it by launching a set of widely used applications, demonstrating an average 28% reduction of application launch time as compared with a PC without a prefetcher.
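A minimal sketch of the kind of system-call wrapper the abstract mentions: an LD_PRELOAD library interposing on open(2) so application launches can be detected without kernel changes. Whether FAST hooks exactly this call, and how it matches paths, are assumptions here.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <sys/types.h>

    static int (*real_open)(const char *, int, ...);

    int open(const char *path, int flags, ...)
    {
        mode_t mode = 0;
        if (flags & O_CREAT) {  /* mode argument exists only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }
        if (!real_open)
            real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

        /* A launch detector would match `path` against known application
         * binaries here and start the prefetcher for that application. */
        return real_open(path, flags, mode);
    }

Built with gcc -shared -fPIC wrapper.c -o wrapper.so -ldl and activated via LD_PRELOAD=./wrapper.so, such a wrapper runs on stock Linux, consistent with the no-recompilation deployment property claimed above.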


Fig. 1: CPU and SSD usage for various launch scenarios.


Fig. 2: Measured application launch time for each launch scenario.

Endurance-Aware Design of Phase Change Memory Caches [pdf]

Phase change memory (PCM) is one of the most promising technologies among emerging non-volatile random access memories. Implementing a cache memory using PCM has the potential to provide many benefits, such as high density, non-volatility, low leakage power, and high immunity to soft errors. However, its disadvantages, such as high write latency, high write energy, and limited write endurance, prevent it from being used as a drop-in replacement for an SRAM cache. In this work, we studied a set of techniques to design an energy- and endurance-aware PCM cache. We also modeled the timing, energy, endurance, and area of PCM caches and integrated these models into a PCM cache simulator to evaluate the techniques. Experiments show that our PCM cache design can achieve an 8% energy saving and a lifetime of 3.8 years, compared with a baseline PCM cache having a lifetime of less than an hour.


Fig. 1: Proposed PCM cache design.


Fig. 2: Proposed write traffic reduction schemes.


Fig. 3: The effect of the data inverting scheme on lifetime and energy consumption (BL: baseline cache, RBW: read-before-write scheme, DIn: data inverting scheme with a sub-block size of n).


Fig. 4: Heat maps of the PCM cache that visualize the effect of wear-leveling schemes.

Improving Application Launch Times with Hybrid Disks [pdf]

We introduce a novel block allocation method for a hybrid disk comprising a hard disk drive and flash memory, aiming at reducing application launch time. While conventional block allocation methods pin the entire application launch sequence on the flash memory, our method pins only the subsequences that most effectively reduce the seek time. This enables us to achieve a significant application launch time reduction with only a small capacity of flash memory. We model the latency of a hybrid disk, analyze the behavior of application launch sequences, and formulate the choice of the optimal subsequences to be pinned as an integer linear programming (ILP) problem, sketched below. Experiments show that pinning 26% of the application launch sequences into flash memory reduces hard disk access times and application launch times by 48% and 31%, respectively.
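Schematically, the pinning problem can be written as a knapsack-style ILP. The symbols below are ours for illustration (x_i: whether subsequence i is pinned; t_i: the seek-time saving from pinning it; b_i: its size in blocks; B: the flash capacity budget), not necessarily the paper's exact formulation:

    \max \sum_{i} t_i x_i
    \quad \text{subject to} \quad \sum_{i} b_i x_i \le B,
    \qquad x_i \in \{0, 1\}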


Fig. 1: Disk head movement comparison (application: Evolution). For (b), (c), and (d), 10% of the application launch sequence is allowed to be pinned to the flash cache. (FCFP: First-Come-First-Pinned, SCF: Small-Chunks-First)