(Abstract) In N-body programs, the trajectories of simulated particles exhibit chaotic behavior when errors exist in the initial conditions or occur during computation. It was believed that the global properties (e.g., total energy) of simulated particles are unlikely to be affected by a small number of such errors. In this paper, we present a quantitative analysis of the impact of transient faults in GPU devices on a global property of simulated particles. We experimentally show that a single-bit error in non-control data can change the final total energy of a large-scale N-body program with ~2.1% probability. We also find that the corrupted total energy values have certain biases (e.g., the values do not follow a normal distribution), which can be used to reduce the expected number of re-executions. We also present a data error detection technique for N-body programs that utilizes two types of properties that hold in the simulated physical models. The presented technique, together with an existing redundancy-based technique, covers most data errors (e.g., >97.5%) with a small performance overhead (e.g., 2.3%).
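A detector of this kind can exploit a physical invariant of the simulation, e.g., conservation of total energy. The following is a minimal sketch, not the paper's implementation: it computes the total energy of an N-body state and flags a suspected silent data corruption when the energy drifts too far from a reference value. The function names and the tolerance are illustrative assumptions.

```python
# Hypothetical physics-property-based error check for an N-body step:
# total energy should stay near its initial value, so a large relative
# drift suggests corrupted particle data.

def total_energy(masses, positions, velocities, G=1.0):
    """Kinetic plus pairwise gravitational potential energy."""
    kinetic = sum(0.5 * m * sum(v * v for v in vel)
                  for m, vel in zip(masses, velocities))
    potential = 0.0
    n = len(masses)
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((a - b) ** 2
                       for a, b in zip(positions[i], positions[j])) ** 0.5
            potential -= G * masses[i] * masses[j] / dist
    return kinetic + potential

def energy_check(e_ref, e_now, rel_tol=1e-3):
    """Return True if the current energy is within tolerance of the reference."""
    return abs(e_now - e_ref) <= rel_tol * max(abs(e_ref), 1e-30)
```

In practice the tolerance must be set above the integrator's normal numerical drift so that only anomalous jumps are flagged.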
K. S. Yim, Z. Kalbarczyk, and R. K. Iyer, "Pluggable Watchdog: Transparent Failure Detection for MPI Programs," In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 489-500, May 2013. (Acceptance Ratio: 21.8% = 108/494) [Paper] [Slide] [IEEExplore] (Adaptive techniques and more data described in the slide appendix are from my dissertation.)
(Abstract) This paper presents a framework and techniques that can detect various types of runtime errors and failures in MPI programs. The presented framework offloads its detection techniques to an external device (e.g., an extension card). By learning the normal behavioral and semantic execution patterns of the monitored parallel threads, the presented external error detectors can detect errors and failures accurately and quickly. This architecture allows us to use powerful detectors without directly consuming the computing power of the monitored system. The hardware separation between the monitored and monitoring systems offers an extra advantage in terms of system reliability. We have prototyped our system on a parallel computer by using an FPGA-based PCI extension card as the monitoring device. We have conducted fault injection experiments to evaluate the presented techniques using eight MPI-based parallel programs. The techniques cover ~98.5% of faults on average. The average performance overhead is 1.8% for the techniques that detect crash and hang failures and 6.6% for the techniques that detect SDC failures.
(Abstract) This paper presents a fault-tolerant, programmable voter architecture for software-implemented N-tuple modular redundant (NMR) computer systems. Software NMR is a cost-efficient solution for high-performance, mission-critical computer systems because it can be built on top of commercial off-the-shelf (COTS) devices. Due to the large volume and randomness of voting data, a software NMR system requires a programmable voter. Our experiments show that voting software executing on a general-purpose processor has time-of-check-to-time-of-use (TOCTTOU) vulnerabilities and is unable to tolerate long-duration faults. In order to address these two problems, we present a special-purpose voter processor and its embedded software architecture. The processor has a set of new instructions and hardware modules that the software uses to accelerate the execution of the voting software and to address the two identified reliability problems. We have implemented the presented system on an FPGA platform. Our evaluation shows that the presented system reduces the execution time of error detection codes (commonly used in voting software) by 14% and their code size by 56%. Our fault injection experiments validate that the presented system removes the TOCTTOU vulnerabilities and recovers from both transient and long-duration faults. This is achieved by adding 0.7% extra hardware to a baseline processor.
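The core voting operation in an NMR system is a majority decision over replica outputs. The sketch below models only that logic, not the paper's voter processor or its TOCTTOU protections; all names are illustrative. It returns both the winning value and whether a strict majority agreed, so the caller can distinguish a clean vote from an unrecoverable disagreement.

```python
from collections import Counter

def nmr_vote(replica_outputs):
    """Majority vote over N replica outputs.

    Returns (winner, agreed): 'agreed' is True only when more than half
    of the replicas produced the winning value.
    """
    counts = Counter(replica_outputs)
    value, votes = counts.most_common(1)[0]
    return value, votes > len(replica_outputs) // 2
```

Note that a real voter must operate on a private snapshot of the replica outputs; re-reading shared buffers after the comparison is precisely the TOCTTOU window the paper identifies.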
(Abstract) The high performance and relatively low cost of GPU-based platforms provide an attractive alternative for general-purpose high-performance computing (HPC). However, emerging HPC applications usually have stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on the GPU. This SDC ratio is significantly higher than that measured in CPU programs (<2.3%). In order to tolerate SDC errors, customized error detectors are strategically placed in the source code of target GPU programs so as to minimize performance impact and error propagation and to maximize recoverability. The presented HAUBERK technique is deployed in seven HPC benchmark programs and evaluated using fault injection. The results show a high average error detection coverage (~87%) with a small performance overhead (~15%).
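To illustrate the general idea of placing a lightweight detector at a kernel's output (this is only a sketch of the concept, not HAUBERK itself), the example below checks a SAXPY-style computation against a cheap algebraic invariant: the sum of the outputs must equal a value predictable from the inputs. All function names are hypothetical.

```python
# Illustrative output detector for a "kernel": compare a cheap checksum
# of the result against the value predicted from a linear invariant.

def saxpy(a, xs, ys):
    """The monitored computation: out[i] = a * x[i] + y[i]."""
    return [a * x + y for x, y in zip(xs, ys)]

def checksum(values):
    return sum(values)

def run_with_detector(a, xs, ys, rel_tol=1e-9):
    out = saxpy(a, xs, ys)
    # SAXPY is linear, so sum(out) must equal a * sum(xs) + sum(ys);
    # a mismatch suggests a silent data corruption in the output.
    predicted = a * sum(xs) + sum(ys)
    if abs(checksum(out) - predicted) > rel_tol * max(1.0, abs(predicted)):
        raise RuntimeError("suspected silent data corruption")
    return out
```

Such invariant checks are far cheaper than full re-execution, which is why placement (which kernels, which outputs) dominates the coverage/overhead trade-off.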
Flash Memory and Nonvolatile RAM Management and Optimization
(Abstract) A hybrid storage architecture is an efficient way to optimize the I/O performance, cost, and power consumption of storage systems. Thanks to advances in semiconductor and optical storage technology, its application area is expanding. It stores data on the most appropriate medium by considering the I/O locality of the data. Data management across heterogeneous storage media is important, but it has traditionally been done manually by system users. This paper presents an automatic management technique for a hybrid storage architecture. A novel software layer is defined in the kernel between the virtual and physical file systems. The proposed layer is a variant of stackable file systems, but it is able to move files between heterogeneous physical file systems. By utilizing semantic information (e.g., file type and owner process), the proposed system optimizes I/O performance without any manual control. Also, as the proposed system concatenates the storage space of the physical file systems, its space overhead is negligible. Specific characteristics of the proposed system are analyzed through performance evaluation.
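The placement decision driven by semantic information can be sketched as a simple policy function. The example below is a toy model under assumed store names ("fast" and "bulk") and an illustrative extension-to-class mapping; the paper's actual policy also considers attributes such as the owner process.

```python
# Toy semantic file-placement policy: frequently rewritten file types
# go to the fast store, everything else to the bulk store.
# The extension set and store names are illustrative assumptions.

HOT_EXTENSIONS = {".db", ".log", ".tmp"}

def pick_store(filename):
    """Return which physical file system should hold this file."""
    dot = filename.rfind(".")
    ext = filename[dot:] if dot != -1 else ""
    return "fast" if ext in HOT_EXTENSIONS else "bulk"
```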
L.-z. Han, Y. Ryu, and K. S. Yim, "CATA: A Garbage Collection Scheme for Flash Memory File Systems," Lecture Notes in Computer Science (LNCS), 4159:103-112, September 2006. (SCIE, Impact Factor: 1.21) [Publisher]
(Abstract) Semiconductor scientists and engineers ideally want non-volatile memory devices that are both fast and cheap. In practice, no single device satisfies this desire because a faster device is expensive and a cheaper one is slow. Therefore, in this paper, we use heterogeneous non-volatile memories and construct an efficient hierarchy from them. First, a small RAM device (e.g., MRAM, FRAM, or PRAM) is used as a write buffer for flash memory devices. Since the buffer is faster and has no erase operation, writes complete quickly in the buffer, keeping the write latency short. Also, if a write is requested for data already stored in the buffer, the write is processed directly in the buffer, saving one write operation to flash storage. Second, we use multiple types of flash memory (e.g., SLC and MLC flash) in order to reduce the overall storage cost. Specifically, write requests are classified into two types, hot and cold, where hot data is likely to be modified in the near future. Only hot data is stored in the faster SLC flash, while cold data is kept in the slower MLC or NOR flash. The evaluation results show that the proposed hierarchy effectively improves the access time of flash memory storage in a cost-effective manner, thanks to the locality of memory accesses.
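The write path of such a hierarchy can be modeled in a few lines. The sketch below is a toy model under assumptions of my own (the class name, the eviction rule, and the hot/cold threshold are illustrative): rewrites of buffered pages are absorbed in the RAM buffer, and on eviction, pages that were rewritten while buffered are treated as hot and sent to SLC, while never-rewritten pages go to MLC.

```python
# Toy model of the heterogeneous write hierarchy: a small RAM write
# buffer in front of SLC and MLC flash stores (modeled as dicts).

class HybridWriteBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = {}          # page -> (data, rewrite_count)
        self.slc, self.mlc = {}, {}
        self.flash_writes = 0     # writes that actually reached flash

    def write(self, page, data):
        if page in self.buffer:                 # hit: absorb the rewrite in RAM
            self.buffer[page] = (data, self.buffer[page][1] + 1)
            return
        if len(self.buffer) >= self.capacity:
            self._evict()
        self.buffer[page] = (data, 0)

    def _evict(self):
        # Evict the least-rewritten (coldest) page first.
        page, (data, rewrites) = min(self.buffer.items(),
                                     key=lambda kv: kv[1][1])
        del self.buffer[page]
        # Hot pages (rewritten while buffered) go to SLC, cold ones to MLC.
        (self.slc if rewrites > 0 else self.mlc)[page] = data
        self.flash_writes += 1
```

The `flash_writes` counter makes the benefit visible: rewrites that hit the buffer never reach flash at all.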
(Abstract) Flash memory based embedded computing systems are becoming increasingly prevalent. These systems typically have to provide an instant start-up. However, we observe that mounting a file system on flash memory takes 1 to 25 seconds, depending mainly on the flash capacity. Since flash chip capacity doubles every year, this mounting time will soon become the dominant contributor to system start-up delay. Therefore, in this paper, we present instant mounting techniques for flash file systems that store the in-memory file system metadata to flash memory when unmounting the file system and quickly reload the stored metadata when mounting it. These metadata snapshot techniques are developed specifically for NOR- and NAND-type flash memories while overcoming their physical constraints. The proposed techniques check the validity of the stored snapshot and use the proposed fast crash recovery techniques when the snapshot is invalid. Based on the experimental results, the proposed techniques reduce the flash mounting time by about two orders of magnitude compared with the existing de facto standard flash file system.
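The snapshot-then-validate flow can be sketched abstractly as follows. This is a simplified model under my own assumptions (JSON serialization, a CRC32 validity check, and a `full_scan` fallback callable are illustrative stand-ins for the paper's flash-specific formats and recovery techniques).

```python
import json
import zlib

# Simplified model of snapshot-based instant mounting: the in-memory
# metadata is serialized with a checksum at unmount time; at mount time
# a valid snapshot is loaded directly, and an invalid one (e.g. after a
# crash) falls back to a full media scan.

def make_snapshot(metadata):
    """Serialize metadata and prepend a CRC32 for validity checking."""
    blob = json.dumps(metadata, sort_keys=True).encode()
    return zlib.crc32(blob).to_bytes(4, "big") + blob

def mount(snapshot, full_scan):
    """Load metadata from the snapshot, or rebuild it by scanning."""
    if snapshot is not None and len(snapshot) >= 4:
        crc, blob = snapshot[:4], snapshot[4:]
        if zlib.crc32(blob).to_bytes(4, "big") == crc:
            return json.loads(blob)           # instant path
    return full_scan()                        # crash-recovery path
```

The speedup comes from the instant path: loading one serialized blob replaces scanning every block of the medium to rebuild the metadata.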
(Abstract) Flash memory based SmartMedia Cards are becoming increasingly popular as data storage for mobile consumer electronics. Since flash memory is an order of magnitude more expensive than magnetic disks, data compression can be used effectively in managing flash memory based storage systems. However, managing compressed data in flash memory is challenging because flash memory supports only page-based I/O. For example, when the size of the compressed data is smaller than the page size, internal fragmentation occurs, which seriously degrades the effectiveness of compression. In this paper, we develop a flash compression layer (FCL) for SmartMedia Card systems. FCL stores several small compressed pages in one physical page by using a write buffer. Based on a prototype implementation and simulation studies, we show that the proposed system expands the effective storage capacity of flash memory to more than 140% of its original size and significantly improves the write bandwidth.
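The packing mechanism can be sketched as follows. This is only a model of the idea, not FCL itself: compressed logical pages accumulate in a RAM buffer and are flushed as one physical page once the next chunk would no longer fit, avoiding per-page internal fragmentation. The page size and class names are illustrative.

```python
import zlib

PAGE_SIZE = 512  # illustrative physical flash page size

class PackingBuffer:
    """Pack several compressed logical pages into one physical page."""

    def __init__(self):
        self.pending = []            # (logical_page, compressed bytes)
        self.physical_pages = []     # flushed physical page images

    def write(self, logical_page, data):
        chunk = zlib.compress(data)
        used = sum(len(c) for _, c in self.pending)
        if used + len(chunk) > PAGE_SIZE:
            self.flush()             # current physical page is full
        self.pending.append((logical_page, chunk))

    def flush(self):
        if self.pending:
            self.physical_pages.append(list(self.pending))
            self.pending.clear()
```

Without packing, each compressed chunk would still occupy a full 512-byte page; with it, several chunks share one page, which is where the >100% effective capacity comes from.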
(Abstract) Wireless sensor networks consist of a large number of tiny sensor nodes that collect various types of physical data. These sensors are typically energy-limited, and low-power operation is an important design constraint. In this paper, we propose a novel routing and reporting scheme based on the sample-data similarities commonly observed in sensed data. Built on reliable transport protocols, the proposed scheme takes advantage of the spatial and temporal similarities of the sensed data, reducing both the number of sensor nodes that are asked to report data and the frequency of those reports. Experimental results show that the proposed scheme can significantly reduce the communication energy consumption of a wireless sensor network while incurring only a small degradation in sensing accuracy.
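The temporal-similarity part of such a scheme can be illustrated with a simple suppression rule (a sketch of the general idea, not the paper's protocol): a node skips a report when the new sample is within a tolerance of the value it last transmitted, and the sink reuses the previous value. The threshold is an assumed parameter.

```python
# Illustrative temporal-similarity suppression: transmit only samples
# that differ meaningfully from the last reported value.

def should_report(last_reported, sample, tolerance):
    return last_reported is None or abs(sample - last_reported) >= tolerance

def filter_reports(samples, tolerance=0.5):
    """Return the subset of samples a node would actually transmit."""
    sent, last = [], None
    for s in samples:
        if should_report(last, s, tolerance):
            sent.append(s)
            last = s
    return sent
```

The tolerance directly trades sensing accuracy for transmission energy, which matches the accuracy/energy trade-off reported in the abstract.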
(Abstract) In a wireless sensor network, sensor devices are connected by unreliable radio channels, so reliable packet delivery is an important design challenge. Existing sensor-to-base reliable transport mechanisms, however, depend on a centralized manager node, incurring a large control overhead to synchronize reporting frequencies. In this paper, we present a decentralized reliable transport (DRT) protocol with two novel decentralized reliability control schemes. First, we propose an independent reporting scheme in which each sensor node makes reporting decisions stochastically. Second, we describe a cooperative reporting scheme in which every sensor node implicitly cooperates with its neighbors to achieve uniform reporting. In the reporting step, DRT uses a reliable MAC channel that is specifically optimized to reduce energy dissipation. Experimental results show that DRT reliably satisfies the desired delivery rate in a decentralized manner while significantly reducing the energy consumption of the radio device and the communication time.
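The independent reporting scheme can be modeled in a few lines (an illustrative model of the stochastic idea, not DRT's actual protocol): each node transmits with probability equal to the desired reporting rate, with no coordinator, so the expected fraction of reporting nodes matches the target without any synchronization traffic.

```python
import random

# Illustrative model of decentralized stochastic reporting: each node
# independently decides to transmit with probability `rate`.

def decides_to_report(rate, rng=random):
    return rng.random() < rate

def round_of_reports(n_nodes, rate, seed=0):
    """Count how many of n_nodes transmit in one reporting round."""
    rng = random.Random(seed)
    return sum(decides_to_report(rate, rng) for _ in range(n_nodes))
```

By the law of large numbers, the achieved reporting fraction concentrates around the target rate as the network grows, which is why no central manager is needed.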
A full version of this paper, written in Korean, appears in Proc. Humantech Thesis Prize (HTP), Feb. 2002.
(Abstract) This paper discusses a host-independent network system in which the network interface card is utilized efficiently. By offloading protocol stack processing from the host system, the proposed system improves the communication speed by 11-36% under heavy network and CPU loads.