Cloud computing is a new paradigm of transparent distributed systems. It manages resources on a large scale in a cost-effective and location-independent manner. Since cloud computing is used in an increasingly broad spectrum of applications, fault-free services are required. A cloud is more effective and reliable when it is more fault tolerant and more adaptable to demand. Fault tolerance permits a system to continue operating even in a faulty environment. It ensures system reliability [7] by improving the fault detection and recovery mechanisms [54]. Fault tolerance in the cloud is either reactive or proactive. Reactive fault tolerance performs error recovery after faults are detected [59]. Proactive fault tolerance, on the other hand, prevents faults by predicting them beforehand [60].
Fault-tolerant techniques in a cloud environment are evaluated against various parameters such as throughput, performance, availability, usability, response time, scalability, reliability, security and service level agreement (SLA) [61].
· Throughput: Throughput is an important metric for characterizing the performance of different fault-tolerant techniques. It measures the amount of data sent and received per unit time.
· Response Time: The response time is the sum of the input time, the processing time and the transmission time.
· Scalability: Services are scaled up or down, horizontally or vertically, depending on the requirements of clients, so that the availability of resources improves smoothly.
· Availability: Availability is denoted by A(t). It is closely related to the reliability of the system. The mean time to failure (MTTF) is defined in Equation 2.1 [62], and availability is expressed in terms of MTTF and the mean time between failures (MTBF).
$$\mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt \tag{2.1}$$
· Usability: Usability is the degree to which users are satisfied with the available resources and can use them properly to achieve a goal with effectiveness and efficiency.
· Reliability: The system delivers correct or acceptable results within the deadline. The reliability of a system depends on the application running smoothly. It is denoted by R(t) and defined in Equation 2.2 [62].
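$$R(t) = \int_{t}^{\infty} f(\tau)\,d\tau \tag{2.2}$$
where $f(\tau)$ is the probability density function of the time to failure.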
· Overhead: Overhead arises from the movement of cloudlets and from inter-system communication. It should be minimized for effective fault tolerance in a cloud environment.
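As a rough illustration of how reliability, MTTF and availability relate, the following Python sketch assumes a constant failure rate, so that R(t) = exp(-λt), and uses the standard relations MTTF = ∫ R(t) dt and A = MTTF / (MTTF + MTTR), where MTTF + MTTR is taken as the MTBF. The failure rate, the repair time (MTTR) and all function names are hypothetical values chosen for the example, not figures from the cited works.

```python
import math

def reliability(t, failure_rate):
    """R(t) = exp(-lambda * t) under a constant failure rate (assumed model)."""
    return math.exp(-failure_rate * t)

def mttf_numeric(failure_rate, horizon, steps=100_000):
    """Approximate MTTF = integral of R(t) dt over [0, horizon] (trapezoidal rule)."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(horizon, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt

def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR); the denominator is the MTBF in this convention."""
    return mttf / (mttf + mttr)

failure_rate = 0.001  # hypothetical: one failure per 1000 hours on average
mttf = mttf_numeric(failure_rate, horizon=20_000.0)
availability = steady_state_availability(mttf, mttr=10.0)  # hypothetical 10-hour repair
print(f"MTTF ~ {mttf:.1f} h, availability ~ {availability:.4f}")
```

For the exponential model the numerical MTTF agrees with the closed form 1/λ, which gives a quick sanity check on the integration.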
Reactive fault tolerance performs error recovery after faults are detected. It reduces the effect of a failure on a system once the failure has occurred [11]. Effective reactive fault tolerance techniques include:
· Checkpointing: A checkpoint is a snapshot of the full state of a process. After a failure, the system resumes from the most recent checkpoint rather than from the initial state, as shown in Figure 2.5 [63].
Figure 2.5: Checkpointing and rollback technique [16].
· Replication: Failed tasks are re-executed by replicas on different resources. This technique uses a primary virtual machine and a replica (or secondary) virtual machine. When a cloudlet fails to execute on the primary virtual machine, the replica re-executes the cloudlet from the initial state. Its overhead exceeds one hundred percent [11] [31].
· Job migration: If hosts, VMs or PEs fail, the affected jobs should be migrated to new entities. In the event of a resource failure, the jobs are migrated to a new virtual machine.
· SGuard: This technique is based on a backward (rollback) recovery mechanism [11].
· Retry: The failed work is re-executed on the same resources in real time. If a cloudlet fails or is canceled, it is resubmitted [64].
· Task Resubmission: Failed tasks are resubmitted either to the same machine or to another machine [64].
· Backward Recovery: It is a rollback technique that restarts processing from a prior state. Rolling back requires extra time [64].
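The checkpointing and rollback idea from the list above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the mechanism of any cited system: the task simply sums the integers 1..n, checkpoints are kept in memory, and a single fault is injected at a chosen step. The class name `CheckpointedTask` and its parameters are hypothetical.

```python
import copy

class CheckpointedTask:
    """Toy task that sums 1..n and can save and restore its state."""

    def __init__(self, n):
        self.n = n
        self.state = {"next": 1, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a snapshot of the full task state (in memory in this sketch).
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the most recent snapshot instead of restarting from step 1.
        self.state = copy.deepcopy(self._checkpoint)

    def run(self, checkpoint_every, fail_at=None):
        """Run to completion; `fail_at` injects one simulated fault at that step."""
        steps_executed = 0
        while self.state["next"] <= self.n:
            step = self.state["next"]
            if step == fail_at:
                fail_at = None    # the simulated fault happens only once
                self.rollback()   # reactive recovery: resume from the checkpoint
                continue
            self.state["total"] += step
            self.state["next"] = step + 1
            steps_executed += 1
            if step % checkpoint_every == 0:
                self.checkpoint()
        return self.state["total"], steps_executed

task = CheckpointedTask(10)
total, steps = task.run(checkpoint_every=4, fail_at=7)
print(total, steps)
```

In this run the fault at step 7 discards only the work done since the checkpoint taken after step 4, so 12 steps are executed in total; restarting from the initial state would have cost 16.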
Proactive fault tolerance prevents faults preemptively and replaces suspect components. Several such techniques are given in [50] [65]:
· Software Rejuvenation: Tasks or systems are restarted from the initial step. The system is rebooted so that it begins each time with a fresh state [66].
· Self-healing: When instances of a cloud application run on multiple virtual machines, the failure of an instance is handled automatically [65].
· Preemptive Migration: Applications have a feedback-loop mechanism that continuously monitors and resolves faults. It proactively replaces suspect components [65] [67].
· Forward Fault Recovery: It is a scheme that can proceed forward even after a fault has occurred. The fault is detected later by a duplex system and recovered by re-execution, or it is detected and recovered by triple modular redundancy (TMR) [68]. Other proactive mechanisms use triple modular redundancy, error correcting codes, and single error correction and double error detection (SEC-DED).
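Triple modular redundancy, mentioned in the list above, can be illustrated with a small majority voter. The sketch below is a generic illustration, not the design from [68]: the three "replicas" are plain Python functions, one of which is deliberately faulty, and the helper names `tmr_vote` and `run_with_tmr` are hypothetical.

```python
from collections import Counter

def tmr_vote(results):
    """Return the value produced by at least two of three replicas.

    Raises ValueError when the input is not three results or no majority exists.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value

def run_with_tmr(replicas, x):
    """Execute the same computation on three replicas and vote on the outputs."""
    return tmr_vote([replica(x) for replica in replicas])

def healthy(x):
    return x * x

def faulty(x):
    return x * x + 1  # hypothetical replica that returns a corrupted result

masked = run_with_tmr([healthy, faulty, healthy], 6)
print(masked)
```

Because the voter returns the majority value, a single faulty replica is masked and execution proceeds without rolling back, which is what makes this a forward recovery scheme.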
The CoW-PC algorithm minimizes the checkpointing overhead by placing the checkpoints in memory. Whether a virtual machine is considered successful or failed depends on an adaptive reliability calculation [69]. Xia used a CRC-based technique in cloud storage to verify data integrity. A comprehensive survey of fault tolerance in cloud computing is given in [1]; it emphasizes significant concepts, architectural details and techniques. M. Amoon et al. [7] used fault-tolerant algorithm selection to detect and prevent faults when responding to customer requests. They observed the overhead of the replication and checkpointing techniques for an increasing number of customers. M. Azaiez et al. [2] proposed a hybrid fault-tolerant model that combines checkpointing and replication. B. Mohammed et al. [70] proposed an integrated virtualized failover strategy that manages faults reactively: faults are detected and recovered using checkpointing. However, the overhead of checkpointing can degrade the performance of a system. Jhawar et al. [71] implemented a fault-tolerant system that consists of a replication manager and a fault detection and recovery manager. They used gossip and heartbeat algorithms to detect faults. S. Rajesh et al. [72] proposed a technique that improves reliability. It has forward and backward recovery mechanisms, calculates the reliability of each node and takes decisions based on that reliability.
Jain describes a fault detection and tolerant system (FDTS). It uses heartbeat and gossip algorithms to detect whether the application is working smoothly. J. Liu et al. [73] evaluated a proactive fault tolerance methodology against five related methods in terms of overall overheads such as network resource consumption, transmission and total execution time. K. Nivitha et al. [74] developed a dynamic fault monitoring algorithm for virtual machines.
Two types of techniques ensure fault tolerance in the cloud: (i) reactive fault tolerance and (ii) proactive fault tolerance. Reactive fault tolerance is a policy that handles a fault after it has occurred, using techniques such as checkpointing, replication, retry and task resubmission. Proactive fault tolerance prevents faults by predicting them beforehand, using techniques such as software rejuvenation, load balancing, preemptive migration and self-healing. It is a forward recovery mechanism [1], and it saves more time and power than reactive fault tolerance [6] [10].
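Heartbeat-based detection, as used by several of the systems surveyed above, can be sketched as a timeout check on the last message received from each node. The following Python example is a minimal illustration under assumptions of its own: the class `HeartbeatMonitor`, the node names and the 3-second timeout are hypothetical, timestamps are passed in explicitly so the run is deterministic, and gossip dissemination is not modelled.

```python
class HeartbeatMonitor:
    """Suspects a node when no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Record the arrival time of the latest heartbeat from `node`.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("vm-1", now=0.0)
monitor.heartbeat("vm-2", now=0.0)
monitor.heartbeat("vm-1", now=2.0)   # vm-1 keeps reporting
monitor.heartbeat("vm-1", now=4.0)
failed = monitor.suspected(now=5.0)  # vm-2 has been silent for 5 seconds
print(failed)
```

A short timeout detects failures sooner but risks false suspicions on a slow network, so in practice the value is tuned to the expected message delay.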
Table 2.1: Comparison of reactive and proactive fault tolerance
| Comparison | Reactive Fault Tolerance | Proactive Fault Tolerance |
| --- | --- | --- |
| Definition | Faults are detected and corrected after they occur. | Faults are prevented beforehand. |
| Recovery mechanism | Backward recovery | Forward recovery |
| Time complexity | Higher | Lower than reactive fault tolerance |
| Error detection | More than two steps | Fewer than two steps |
| Hardware | Needs more hardware | Needs less hardware |
| Undetectable errors | Less | Probability is 1 – , r=17,18….64 |
| Power cost | More than proactive fault tolerance | Less than reactive fault tolerance |
| Overhead | More | Less than reactive fault tolerance |
| Existing techniques | Checkpointing, replication, retry and task resubmission | Software rejuvenation, load balancing, preemptive migration and self-healing |
Bartholomew and Oscar used a CRC64 optimization that improves verification reliability. However, network bandwidth suffers when the volume of transmitted data increases greatly. Kumar and Raj proposed that reliability depends on the probability of detecting errors in time. E. Abdelfattah et al. [59] proposed a technique that executes failed tasks on the most reliable node; a reject message is sent back if a task cannot be recovered. R. Buyya et al. [14] proposed a scheme that works when the demand of cloud users varies across scalable and virtualized entities. They explain the relationships among entities and events and compare the performance of federated and non-federated networks.
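CRC-based integrity verification of stored data can be sketched as follows. This is an illustrative stand-in rather than the scheme of any cited work: Python's standard library provides CRC-32 through `zlib.crc32` but no CRC64, so CRC-32 is used here, and the record layout and the helper names `store` and `verify` are hypothetical.

```python
import zlib

def checksum(data: bytes) -> int:
    """CRC-32 of `data` (a stand-in for the CRC64 variant discussed in the text)."""
    return zlib.crc32(data) & 0xFFFFFFFF

def store(data: bytes) -> dict:
    # Keep the checksum alongside the payload, as a storage service might.
    return {"payload": data, "crc": checksum(data)}

def verify(record: dict) -> bool:
    # Recompute the CRC and compare it with the stored value.
    return checksum(record["payload"]) == record["crc"]

record = store(b"cloud block 42")
intact = verify(record)

# Simulate corruption: the payload changes but the stored CRC does not.
corrupted = dict(record, payload=b"cloud block 43")
detected = not verify(corrupted)
print(intact, detected)
```

A wider CRC lowers the chance that a random corruption goes unnoticed, at the cost of more checksum bytes to store and transmit.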
Enhancing availability of replicated data
Data replication is useful in a database system for two reasons: it can improve performance and increase the reliability of information. By accessing the copy at the nearest site, expensive remote access can be avoided. By storing critical data at multiple locations, the data may still be available even if some machines are down. Availability and consistency are competing goals in the management of replicated data. It is desirable to have high data availability while the database remains consistent in the users' view. On the other hand, correct schemes that provide high availability may suffer performance penalties. Thus, when designing a replica management protocol, it is important to take all three aspects (availability, consistency and performance) into account. My research interests in this area include the design of efficient replica control protocols that provide high data availability, together with related theoretical aspects.
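The tension between availability and consistency can be made concrete with a small calculation. The sketch below assumes that each site is up independently with the same probability p, and it compares two simple access rules chosen only for illustration: reading from any one available copy, and writing to all copies. Neither rule is claimed to be the protocol studied here, and the value p = 0.9 is hypothetical.

```python
def read_any_availability(p, n):
    """Probability that at least one of n replicas is up (read from any copy)."""
    return 1.0 - (1.0 - p) ** n

def write_all_availability(p, n):
    """Probability that all n replicas are up (every copy updated together)."""
    return p ** n

p = 0.9  # hypothetical per-site availability
for n in (1, 2, 3):
    print(n, round(read_any_availability(p, n), 4), round(write_all_availability(p, n), 4))
```

With three replicas, read availability rises to about 0.999 while write-all availability falls to about 0.729, which is why replica control protocols look for a better balance between the two.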