Reference Papers:
Cloud computing offers services on demand over the Internet through diverse models and layers of abstraction. The dynamic nature of the cloud gives rise to unforeseen obstacles and failures. Fault tolerance is the capacity of a system to continue operating correctly in the face of sudden hardware or software faults, and it is a central concern for ensuring the availability and consistency of critical services and the execution of scientific applications. Failure identification is important in scientific applications in order to make the system robust, reliable, and executable without delay, and various failure detection methods exist to increase the fault tolerance capability of such applications. This review paper surveys the techniques used for fault tolerance in scientific applications and summarizes the current state of the art of existing fault-tolerant techniques.
Cloud computing is gaining increasing popularity, and the number of user applications and interactions with Cloud resources has grown considerably, making Cloud services more susceptible to failure. Fault tolerance is therefore an important property for achieving reliability, availability, and the required quality of service. Several studies have addressed Cloud fault tolerance and proposed solutions, including checkpointing and replication, that focus on the failure of a single Cloud resource, namely virtual machines, and on virtualization issues. Such solutions are insufficient given the complexity of the Cloud. Unlike previous work, we propose a solution that takes into consideration not only virtual machines but also physical machines. Our solution is a hybrid fault tolerance strategy that inherits the good features and overcomes the limitations of traditional fault tolerance strategies. We evaluate the efficiency of our strategy compared to replication and checkpoint strategies using CloudSim and show that it leads to better results.
Fault tolerance is a major challenge that must be considered to ensure good performance of cloud computing systems. In this paper, the problem of tolerating faults in cloud computing systems is addressed so that failures can be avoided in the presence of faults while the monetary profit of the cloud is maintained. A framework is proposed to achieve a reliable platform for cloud applications, along with an algorithm for selecting the most suitable fault tolerance technique and another for selecting the most reliable virtual machines for performing customers' requests.
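As a rough illustration of the two selection steps this abstract mentions, the sketch below chooses a fault tolerance technique per request and ranks virtual machines by an estimated reliability. The thresholds, the reliability estimate, and the technique names are illustrative assumptions; the paper's actual algorithms are not reproduced here.

# Hypothetical sketch of the two selection steps: choosing a fault tolerance
# technique per request and ranking VMs by reliability. The thresholds and the
# reliability formula are illustrative assumptions, not the paper's algorithm.

def choose_ft_technique(task_length, criticality,
                        length_threshold=10_000, criticality_threshold=0.8):
    """Pick replication for critical work, checkpointing for long jobs."""
    if criticality >= criticality_threshold:
        return "replication"
    if task_length >= length_threshold:
        return "checkpointing"
    return "retry"

def most_reliable_vms(vms, k):
    """Return the k VMs with the highest observed success ratio.

    Each VM is a dict like {"id": "vm-1", "successes": 42, "failures": 3}.
    """
    def reliability(vm):
        total = vm["successes"] + vm["failures"]
        return vm["successes"] / total if total else 0.0
    return sorted(vms, key=reliability, reverse=True)[:k]

if __name__ == "__main__":
    vms = [{"id": "vm-1", "successes": 42, "failures": 3},
           {"id": "vm-2", "successes": 17, "failures": 9},
           {"id": "vm-3", "successes": 30, "failures": 1}]
    print(choose_ft_technique(task_length=25_000, criticality=0.4))
    print([vm["id"] for vm in most_reliable_vms(vms, k=2)])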
Large scale data management systems utilize State Machine Replication to provide fault tolerance and to enhance performance. Fault-tolerant protocols are extensively used in the distributed database infrastructure of large enterprises such as Google, Amazon, and Facebook, as well as in permissioned blockchain systems like IBM's Hyperledger Fabric. However, in spite of years of intensive research, existing fault-tolerant protocols do not adequately address all the characteristics of distributed system applications. In particular, hybrid cloud environments consisting of private and public clouds are widely used by enterprises, yet fault-tolerant protocols have not been adapted for such environments. In this paper, we introduce SeeMoRe, a hybrid State Machine Replication protocol that handles both crash and malicious failures in a public/private cloud environment. SeeMoRe considers a private cloud consisting of nonmalicious nodes (either correct or crash-faulty) and a public cloud with both Byzantine faulty and correct nodes. SeeMoRe has three different modes which can be used depending on the private cloud load and the communication latency between the public and the private cloud. We also introduce a dynamic mode switching technique to transition from one mode to another. Furthermore, we evaluate SeeMoRe using a series of benchmarks. The experiments reveal that SeeMoRe's performance is close to that of state-of-the-art crash fault-tolerant protocols while tolerating malicious failures.
Cloud computing is a novel technology in the field of distributed computing, and its usage is increasing rapidly. To serve customers and businesses satisfactorily, faults occurring in datacenters and servers must be detected and predicted efficiently so that mechanisms to tolerate the resulting failures can be launched. A failure in one hosted datacenter may propagate to other datacenters and make the situation worse. To prevent such situations, one can predict a failure proliferating throughout the cloud computing system and launch mechanisms to deal with it proactively. One way to predict failures is to train a machine to predict them on the basis of messages or logs passed between various components of the cloud. During training, the machine can identify message patterns that relate to datacenter failures; later, it can check whether a given group of message logs follows such patterns. Moreover, each cloud server can be described by a state indicating whether it is running properly or facing some failure. Parameters such as CPU usage and memory usage can be maintained for each server. Using these parameters, we can add a layer of detection in which a decision tree built on these parameters classifies whether the supplied values indicate a failure state or a proper state.
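The decision-tree layer described here lends itself to a small sketch: classify a server as being in a failure state or a proper state from resource metrics. The feature set and toy training data below are assumptions for demonstration only; in practice the tree would be trained on monitored server logs.

# Illustrative sketch of the decision-tree detection layer: classify a server
# as "failure" or "proper" from resource metrics. The features and the toy
# training data are assumptions for demonstration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [cpu_usage_percent, memory_usage_percent, io_wait_percent]
X_train = [
    [35, 40, 2], [50, 55, 5], [45, 60, 3],      # proper operation
    [97, 92, 40], [99, 95, 55], [95, 98, 60],   # failure-prone state
]
y_train = ["proper", "proper", "proper", "failure", "failure", "failure"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify the current readings of a server.
current_metrics = [[96, 90, 35]]
print(clf.predict(current_metrics))  # -> ['failure']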
Distributed systems have swiftly evolved from networks of personal computers to clusters and then to grids, moving on to the era of cloud computing and now to the Internet of Things (IoT). With these rapid enhancements, the scale and complexity of systems providing cloud computing services have also increased tremendously. The major challenge faced by cloud service providers today is to provide an efficient, cost-effective, and reliable solution for seamless delivery of services to users. To achieve this, the research community is constantly working on related issues such as scheduling, power consumption, high availability, customer retention, resource provisioning, reliability, and minimizing the probability of failures. Reliability of service is an important parameter: with a large number of components in the cloud, failures are becoming a norm rather than an exception while delivering services to users. This emphasizes the need to develop fault-tolerant schemes for the cloud environment that deliver the required level of reliability. In this work, we propose a novel fault detection and mitigation approach whose novelty lies in detecting faults based on the running status of the job. The detection algorithm periodically monitors the progress of jobs on virtual machines (VMs) and reports a job stalled by a failed VM to the fault tolerant manager (FTM). This not only reduces resource wastage but also ensures timely delivery of services, avoiding penalties due to service level agreement (SLA) violations. The proposed approach is validated using the CloudSim simulator, and the performance analysis reveals its effectiveness.
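A minimal sketch of the progress-based detection idea follows: poll each job's progress periodically and report it to the fault tolerant manager when no progress is observed for several consecutive checks. Names such as get_progress and FaultTolerantManager are hypothetical placeholders, not the paper's API.

# Sketch of progress-based stall detection. All names and thresholds are
# illustrative assumptions.
import time

STALL_CHECKS = 3          # consecutive checks with no progress before reporting
POLL_INTERVAL_SEC = 30    # how often the monitor samples job progress

class FaultTolerantManager:
    def report_stalled(self, job_id, vm_id):
        print(f"Job {job_id} on {vm_id} appears stalled; triggering recovery")

def monitor(jobs, get_progress, ftm):
    """jobs: dict job_id -> vm_id; get_progress(job_id) -> float in [0, 1]."""
    last_progress = {job_id: -1.0 for job_id in jobs}
    stalled_count = {job_id: 0 for job_id in jobs}
    while jobs:
        for job_id, vm_id in list(jobs.items()):
            progress = get_progress(job_id)
            if progress >= 1.0:
                del jobs[job_id]                  # job finished
            elif progress <= last_progress[job_id]:
                stalled_count[job_id] += 1
                if stalled_count[job_id] >= STALL_CHECKS:
                    ftm.report_stalled(job_id, vm_id)
                    del jobs[job_id]
            else:
                stalled_count[job_id] = 0
                last_progress[job_id] = progress
        time.sleep(POLL_INTERVAL_SEC)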
Fault tolerance is the ability of a system to respond swiftly to an unexpected failure. Failures in a cloud computing environment are normal rather than exceptional, but fault detection and system recovery in a real-time cloud system is a crucial issue. To deal with this problem and to minimize the risk of failure, an optimal fault tolerance mechanism was previously introduced in which fault tolerance is achieved through the combination of a Cloud Master, compute nodes, a cloud load balancer, a selection mechanism, and a cloud fault handler. In this paper, we propose an optimized fault tolerance approach in which a model tolerates faults based on the reliability of each compute node (virtual machine) and replaces a node if its performance is not optimal. Preliminary tests of our algorithm indicate that the rate of increase in pass rate exceeds the rate of decrease in failure rate, and the approach also supports forward and backward recovery using diverse software tools. Our results are demonstrated through experimental validation, laying a foundation for a fully fault-tolerant IaaS cloud environment and suggesting good performance of our model compared to existing approaches.
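A simplified sketch in the spirit of this reliability-driven replacement follows: each compute node's reliability rises on a passed check and falls on a failed one, and a node falling below a minimum reliability is replaced. The adjustment factors and threshold are illustrative assumptions rather than the paper's exact parameters.

# Sketch of reliability-based node replacement; factors and threshold are
# illustrative assumptions.

class ComputeNode:
    def __init__(self, node_id, reliability=1.0):
        self.node_id = node_id
        self.reliability = reliability

    def record_result(self, passed, up_factor=0.02, down_factor=0.1):
        """Raise reliability on a pass, lower it on a failure."""
        if passed:
            self.reliability = min(1.0, self.reliability + up_factor)
        else:
            self.reliability = max(0.0, self.reliability - down_factor)

def replace_if_unreliable(node, spawn_node, min_reliability=0.7):
    """Replace the node when its reliability falls below the threshold."""
    return spawn_node() if node.reliability < min_reliability else node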
Cyclic Redundancy Check (CRC) is used to detect errors in digital data during generation, transmission, storage, and processing. CRCs are widely used because they are simple to implement in binary hardware, easy to analyze mathematically, and particularly effective at detecting the common errors caused by noise in communication channels. Hardware implementations of CRC computation commonly rely on Linear Feedback Shift Registers (LFSRs). An LFSR processes bits serially, one message bit per clock cycle, but for high-speed data communication a serial implementation is significantly too slow and introduces delay. In this research, a hardware architecture is proposed for parallel CRC computation; the architecture is not dependent on a particular polynomial. After its functionality was tested using ModelSim, it was implemented on an Altera DE1 FPGA (Field Programmable Gate Array) board and analyzed using the Quartus II, TimeQuest Timing Analyzer, and PowerPlay Power Analyzer tools. The design took 2771 LEs (Logic Elements), uses 102 pins, and consumed 120.68 mW of power. The functionality test and FPGA implementation showed that the CRC was computed in a single clock pulse at a frequency of 23.71 MHz, giving a throughput of 1.656 Gbps, and the design can be reconfigured externally for a different polynomial at any time. The focus of the research is to present an efficient, high-throughput, and compact design for parallel CRC hardware that alleviates the flaws and challenges of existing CRC checkers and suits next-generation high-speed communication.
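The serial-versus-parallel contrast can be illustrated in software: a bit-at-a-time CRC mirrors the LFSR, whereas a table-driven CRC consumes a whole byte per step. The sketch below is generic over the polynomial (the proposed hardware is polynomial-independent); the CRC-32 polynomial is used only as an example, and no bit reflection or final XOR is applied.

# Generic table-driven (byte-at-a-time) CRC as a software analogue of parallel
# CRC hardware. Polynomial and width are configurable; CRC-32 is an example.

def make_crc_table(poly, width=32):
    top_bit = 1 << (width - 1)
    mask = (1 << width) - 1
    table = []
    for byte in range(256):
        crc = byte << (width - 8)
        for _ in range(8):                      # 8 serial LFSR steps, precomputed
            crc = ((crc << 1) ^ poly) if (crc & top_bit) else (crc << 1)
        table.append(crc & mask)
    return table

def crc_bytewise(data, poly=0x04C11DB7, width=32, init=0xFFFFFFFF):
    """Process one byte (8 message bits) per loop iteration."""
    mask = (1 << width) - 1
    table = make_crc_table(poly, width)
    crc = init
    for byte in data:
        index = ((crc >> (width - 8)) ^ byte) & 0xFF
        crc = ((crc << 8) ^ table[index]) & mask
    return crc

print(hex(crc_bytewise(b"hello world")))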
Nowadays, cloud computing is used in a variety of fields, including storage, computation, and education. It has emerged from a number of earlier technologies such as utility computing, grid computing, and cluster computing, and it offers advantages such as on-demand access, resource pooling, and device independence. It also faces challenges such as security, workflow management, and fault tolerance. Here, a novel model (HAFTRC) is proposed that provides highly adaptive fault tolerance in real-time cloud computing. The model computes the reliability of each virtual machine on the basis of cloudlets, MIPS, RAM, bandwidth, and related metrics. The virtual machine with the highest reliability is chosen as the winning virtual machine; if two virtual machines end up with the same reliability, the winner is chosen based on the priority assigned to them.
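An illustrative sketch of the winner-selection step follows: score each virtual machine from its MIPS, RAM, and bandwidth, pick the one with the highest reliability, and break ties by the assigned priority. The weights are assumptions; HAFTRC's exact reliability formula is not reproduced here.

# Sketch of reliability-based VM selection with a priority tie-break.
# The weighting scheme is an illustrative assumption.

def reliability_score(vm, w_mips=0.5, w_ram=0.3, w_bw=0.2):
    return w_mips * vm["mips"] + w_ram * vm["ram"] + w_bw * vm["bandwidth"]

def select_winner(vms):
    # Sort key (reliability, priority): a higher priority wins ties.
    return max(vms, key=lambda vm: (reliability_score(vm), vm["priority"]))

vms = [
    {"id": "vm-A", "mips": 1000, "ram": 2048, "bandwidth": 500, "priority": 1},
    {"id": "vm-B", "mips": 1200, "ram": 1024, "bandwidth": 700, "priority": 2},
]
print(select_winner(vms)["id"])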
Data integrity verification is becoming a major challenge in cloud storage that cannot be ignored. This paper proposes an optimized variant of the CRC (Cyclic Redundancy Check) verification algorithm based on HDFS to improve the efficiency of data integrity verification in cloud storage, building on a study of the CRC checksum algorithm and the data integrity verification mechanism of HDFS. A new method is formulated to derive the optimized form and to accelerate the algorithm by examining the characteristics of checksum generation and checking. Moreover, the method optimizes the code to improve computational efficiency in line with the data integrity verification mechanism of HDFS. A data integrity verification system based on Hadoop is designed to validate the proposed method. Experimental results demonstrate that the proposed HDFS-based CRC algorithm improves calculation efficiency and overall utilization of system resources and outperforms existing models in terms of accuracy and time.
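The per-chunk integrity check that HDFS-style storage performs can be sketched briefly: a checksum is stored for every fixed-size chunk of a block and recomputed on read. In the sketch below, zlib.crc32 stands in for the optimized CRC variant the paper develops, and the 512-byte chunk size mirrors HDFS's default.

# Sketch of per-chunk checksum verification; zlib.crc32 is a stand-in for the
# paper's optimized CRC.
import zlib

CHUNK_SIZE = 512

def chunk_checksums(data):
    return [zlib.crc32(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)]

def verify(data, stored_checksums):
    """Return the indices of chunks whose checksum no longer matches."""
    current = chunk_checksums(data)
    return [i for i, (a, b) in enumerate(zip(current, stored_checksums)) if a != b]

block = bytes(2048)
sums = chunk_checksums(block)
corrupted = bytearray(block)
corrupted[600] ^= 0xFF                    # flip a byte in the second chunk
print(verify(bytes(corrupted), sums))     # -> [1]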
Additional check bits, commonly attached to a message's input data, are normally used to minimize errors during data transmission. The receiving system implements a checking algorithm to determine whether an error occurred in the received data; the algorithm corrects a corrupted bit and recovers the original message. An enhanced error detection and correction code (EEDC) is presented to better detect and correct corrupted transmitted bits. It addresses the limitations of cyclic redundancy checking (CRC), Hamming codes, and other checksum techniques: it reduces the length of the redundancy bits required by CRC, the overhead of interspersing redundancy bits in a typical Hamming code, and the system resources, such as processor time and bandwidth, consumed by checksum techniques. The design was synthesized and simulated using the Xilinx Spartan 6 (XC7Z020-2CLG4841) FPGA. Results show that the resource utilization of the designed memory architecture using EEDC is lower than that of the CRC, Hamming, and checksum algorithms.
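For reference, the sketch below shows single-error correction with the classic Hamming(7,4) code that EEDC is compared against: four data bits are protected by three parity bits, and the syndrome points at the corrupted position.

# Hamming(7,4) encode/decode with single-bit error correction.

def hamming74_encode(d):                      # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):                      # c = received 7-bit codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3                # 1-indexed error position, 0 = none
    if pos:
        c = c[:]
        c[pos - 1] ^= 1                       # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]           # recovered data bits

code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1                                  # corrupt one bit in transit
print(hamming74_decode(code))                 # -> [1, 0, 1, 1]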
Cloud computing software systems are customizable through various configurations based on users' needs. The performance of such software systems varies depending on the selected configurations and the interactions among them. Most users of a software system that offers several hundred configurations use only a few common ones, and the failure of such important or commonly used configurations severely degrades performance. However, we can improve the performance, reliability, and fault tolerance of cloud computing software systems by identifying the important or commonly used configurations and enabling them with suitable fault-tolerant schemes. In this paper we propose a technique to identify the very important and frequently used configurations that play a vital role in software systems and then provide them with fault tolerance, improving the performance and fault tolerance of configurable cloud software systems.
As cloud storage systems increase in scale, hard drive failures are becoming more frequent, which raises reliability issues. In addition to traditional reactive fault tolerance, proactive fault tolerance is used to improve a system's reliability. However, there are few studies that analyze the reliability of proactive cloud storage systems, and they typically assume an exponential distribution for drive failures. This paper presents closed-form equations for estimating the number of data-loss events in proactive cloud storage systems using RAID-5, RAID-6, 2-way replication, and 3-way replication mechanisms within a given time period. The equations model the impact of proactive fault tolerance, operational failures, failure restorations, latent block defects, and drive scrubbing on system reliability, and use time-based Weibull distributions to represent these processes instead of homogeneous Poisson processes. We also design a Monte-Carlo simulation method to simulate the running of proactive cloud storage systems. The proposed equations closely match the time-consuming Monte-Carlo simulations, using parameters obtained from the analysis of field data. These equations allow designers to efficiently estimate system reliability under varying parameters, facilitating cloud storage system design.
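A much-simplified Monte-Carlo sketch in the spirit of these simulations follows: drive lifetimes are drawn from a Weibull distribution instead of an exponential one, and a RAID-5 group loses data if a second drive fails while the first is being rebuilt. The shape, scale, and rebuild-time parameters are illustrative, and drive renewal, scrubbing, and latent block defects are deliberately omitted.

# Simplified Monte-Carlo estimate of RAID-5 data-loss probability with
# Weibull-distributed drive lifetimes. All parameters are illustrative.
import random

SHAPE, SCALE_HOURS = 1.2, 1.0e6     # Weibull shape/scale for drive lifetime
REBUILD_HOURS = 24.0                # time to rebuild a failed drive
MISSION_HOURS = 5 * 8760            # 5-year observation window
DRIVES_PER_GROUP = 8

def one_run(rng):
    lifetimes = sorted(rng.weibullvariate(SCALE_HOURS, SHAPE)
                       for _ in range(DRIVES_PER_GROUP))
    first, second = lifetimes[0], lifetimes[1]
    # Data loss if the first failure happens within the window and the second
    # failure lands inside the rebuild interval of the first.
    return first <= MISSION_HOURS and (second - first) <= REBUILD_HOURS

def estimate_loss_probability(runs=200_000, seed=1):
    rng = random.Random(seed)
    losses = sum(one_run(rng) for _ in range(runs))
    return losses / runs

print(estimate_loss_probability())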
Cloud computing technology has become an integral trend in the Information Technology market. The reliance of cloud computing on virtualization and the Internet exposes it to various types of failures, so reliability and availability have become crucial issues. To ensure cloud reliability and availability, a fault tolerance strategy must be developed and implemented. Most early fault tolerance strategies focused on using only one method to tolerate faults. This paper presents an adaptive framework to cope with the problem of fault tolerance in cloud computing environments. The framework employs both replication and checkpointing in order to obtain a reliable platform for carrying out customer requests, and its algorithm determines the most appropriate fault tolerance method for each selected virtual machine. Simulation experiments are carried out to evaluate the framework's performance, and the results show that the proposed framework improves the performance of the cloud in terms of throughput, overhead, monetary cost, and availability.
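The checkpointing half of such a framework can be sketched in a few lines: a job persists its state at intervals and, after a simulated VM failure, resumes from the last saved state instead of restarting from zero. The file name, interval, and workload are arbitrary choices for illustration.

# Bare-bones checkpoint/restart sketch; names and intervals are illustrative.
import json, os

CHECKPOINT_FILE = "job_checkpoint.json"
CHECKPOINT_EVERY = 1000              # iterations between checkpoints

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"iteration": 0, "partial_sum": 0}

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def run_job(total_iterations=10_000):
    state = load_checkpoint()        # resume after a failure, or start fresh
    for i in range(state["iteration"], total_iterations):
        state["partial_sum"] += i    # stand-in for real work
        state["iteration"] = i + 1
        if state["iteration"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)
    return state["partial_sum"]

print(run_job())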
The need for a robust, fault-tolerant data center can hardly be overemphasized, especially now that big data traffic, the Internet of Things, and other on-demand Internet applications are on the increase. The rate at which data are transferred across the Internet is a real concern for data center developers, and the emergence of ubiquitous computing has further increased traffic, because computing now occurs on any device, in any location, and in any format. These issues complicate the management of the cloud data centers used to store, transfer, and analyze data across the cloud; as a result, data center network devices become prone to failures, which directly impacts performance. Several researchers have proposed solutions, though these are not sufficient to fully mitigate the issues. We argue that the architectural design of the data center network is the bedrock of a fault-tolerant, reliable, robust, and congestion-free network. This paper, which extends our previous work based on an improved version of the fat tree (called Z-node), proposes a hybrid fat tree design and compares it with a single fat tree design for client-to-server communication patterns such as HTTP and email applications. The simulation results, obtained under different device failures and traffic rate patterns, show that the hybrid fat tree design performs better than the single fat tree design and is therefore better suited to the transfer and analysis of big data in cloud data center networks.
Cloud computing is rapidly taking over long-established, conventional systems as a more efficient, reliable, elastic, cost-effective, and scalable alternative for business computing. However, this introduces a continuous sense of uncertainty into the computational process. The cloud presents numerous challenges, one of them being Quality of Service (QoS) management. Because the cloud is dynamic, uncertainty can arise from factors such as resource availability, unmatched predictions, unreliable data, or unexpected faults in the system. Existing models fail to provide an efficient mechanism that detects and measures the uncertainty that has occurred and then rectifies it. To determine the effectiveness of the proposed model, parameters such as availability and reliability are considered in order to render reliable and seamless cloud service.
A justifiably trustworthy provisioning of cloud services can only be ensured if reliability, availability, and other dependability attributes are assessed accordingly. We present a structured approach for deriving fault injection campaigns from a failure space model of the system. Fault injection experiments are selected based on criteria of coverage, efficiency and maximality of the faultload. The resulting campaign is enacted automatically and shows the performance impact of the tested worst case non-failure scenarios. We demonstrate the feasibility of our approach with a fault tolerant deployment of an OpenStack cloud infrastructure.
The increasing popularity of Cloud computing as an attractive alternative to classic information processing systems has increased the importance of its correct and continuous operation, even in the presence of faulty components. In this paper, we introduce an innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds. We propose a comprehensive high-level approach that hides the implementation details of fault tolerance techniques from application developers and users by means of a dedicated service layer. In particular, the service layer allows the user to specify and apply the desired level of fault tolerance without requiring knowledge of the fault tolerance techniques available in the envisioned Cloud or of their implementations.