An effective fault tolerance system is a smoothly running mechanism that supports cloud services between the cloud provider and cloud users without hazard to clients. Faults are the root of the problem, as shown in Figure 2.1; a fault is the real reason for a failure. An error is an indication or symptom of a fault, and it refers to the difference between the actual output and the expected output. A failure means the external behavior is incorrect, and a single error can cause various failures that damage the system. In simple words, faults produce errors, and errors are the reasons for failures [16]. Using keywords such as "fault tolerance in cloud" and other related keywords, different research works have been collected from popular research databases.
Figure 2.1: Faults, errors and failures relationship [18].
Faults cause errors, and a single fault may produce multiple errors. The resulting unexpected behavior creates a system failure, so a single fault may be the cause of a system failure [31].
Cloud computing is a distributed network that enables convenient, ubiquitous, on-demand access to a shared pool of configurable computing resources such as servers, services, networks, storage and applications [32]. Cloud computing comprises various features such as on-demand access, cost reduction, minimal management effort, scalability and device or location independence. It has four deployment models: (i) the public cloud model, (ii) the private cloud model, (iii) the hybrid cloud model and (iv) the community cloud model. Fault tolerance techniques are applied to detect and recover from faults in these deployment models. The cloud provides three service models: (i) Software-as-a-Service (SaaS), (ii) Platform-as-a-Service (PaaS) and (iii) Infrastructure-as-a-Service (IaaS). Different faults occur in the cloud platform and obstruct the delivery of the best services to cloud users [10].
There are several characteristics of cloud computing such as device and location independence, on-demand services, scalability, guaranteed quality of service, virtualization, security, multi-tenancy and fault tolerance [33].
· Scalability and on-demand services: Cloud users are given on-demand access to resources and services over the cloud. Horizontal and vertical scaling goes up or down depending on the number of cloud users [33].
· User-centric interface: The interfaces of the cloud do not depend on the location of cloud users. They can be accessed through well-established interfaces such as internet browsers and web services [33].
· Guaranteed Quality of Service: Cloud computing promises quality of service for users through assured performance, processing capacity, bandwidth and memory size [33].
· Autonomous system: Cloud users can reconfigure and combine software according to their requirements [33].
· Cost: No capital expenditure or upfront investment is essential in the cloud. Cloud computing follows a pay-as-you-go model, and services are billed on the basis of what is needed.
· Virtualization: Utilization of resources is improved by virtualizing and sharing the servers and storage devices [33].
· Multi-tenancy: A large number of users share the resources, which allows a centralized management mechanism [34].
· Loose coupling: The resources are loosely coupled, so the functionality of one resource barely affects the operation of other resources.
· Reliable Delivery: Reliable delivery of information between resources is needed, as in the TCP/IP architecture. Cloud organizations may also use private network protocols [35].
· High Security: Security builds on the above characteristics. To execute loosely coupled jobs, high security is needed so that the system stays protected even when parts of the cloud are damaged.
The different types of services offered by cloud providers can be grouped into three categories [36].
· Software as a Service (SaaS): In this model, a complete application is delivered on demand to the cloud users. Clients need not invest upfront before using the applications; they use the software on a subscription basis or follow the pay-as-you-go model. Examples include Google, Salesforce and Microsoft [9].
· Platform as a Service (PaaS): In this model, integrated development environment tools are provided so that users can develop their own business applications. A cloud developer builds the application and runs it in the cloud environment to serve customers. A predefined configuration of the operating system and application server is delivered to cloud users. For example, Force.com and Google's App Engine are provided as platforms [37].
· Infrastructure as a Service (IaaS): Providing virtualized resources to run applications is called Infrastructure as a Service (IaaS). The resources include virtual servers, hosts, machines, storage and computing capacity. Cloud users deploy their own applications on the cloud infrastructure; examples include Amazon and GoGrid [38].
The public cloud model, private cloud model, hybrid cloud model and community cloud model are the deployment models of cloud computing, and each has a distinct structure.
Public Cloud: Cloud users connect to the internet and take access to the cloud space directly. It offers high availability of cloud resources and follows the pay-as-you-go model. The advantages of the public cloud are cost effectiveness, reliability, high scalability, flexibility, utility-style costing and location independence. The disadvantages are low security, limited customizability and publicly shared resources. The clients share the same virtual infrastructure with limited configuration, security safeguards and availability adjustments [4].
Private Cloud: A cloud with specific and limited access within an organization or a particular group is called a private cloud. Its main advantages are higher security and privacy, more control, improved reliability and cost and energy efficiency. However, scaling is more costly, the area of operation is restricted, and it is more expensive than the public cloud [4].
Hybrid Cloud: The hybrid cloud consists of public and private clouds. Scalability, security, cost efficiency and flexibility are provided to customers. This cloud is partially public and partially private [4].
Community Cloud: When organizations with common requirements share systems and services, the result is called a community cloud. It reduces costs among the associated organizations and shares files, resources and infrastructure among organizations within a specific community [4].
Several underlying technologies make cloud computing more flexible, more reliable and able to offer many types of services [5]. These cloud computing technologies are itemized below.
Ø Distributed system
Ø Grid computing
Ø Virtualization technology
Ø Client-Server model
Ø Utility computing
A distributed system is a collection of self-determining components placed on different machines to achieve common goals; the components exchange messages with each other in an orderly way. It is a distributed network that resolves large tasks by distributing them across the system. In short, cloud computing is a transparent distributed system that handles the processing, storage and management of data. Some popular examples of distributed systems are Facebook, the World Wide Web (WWW) and ATM networks [39].
Grid computing is a form of distributed computing in which a collection of computers in different locations are connected with each other to accomplish a common objective. These computing resources belong to different networks and different geographic regions. Grid computing breaks complex tasks into smaller pieces, which are circulated for processing to complete the common objective [10].
Virtualization is a well-known technology that shares a single physical instance of applications or resources among several organizations or tenants (customers). When many customers request access to cloud resources, virtualization provides each of them a pointer to specific virtual resources. Every virtual machine is distinct from the physical machine: virtual machines deliver an environment that is logically separated from the underlying hardware. The virtual machines are created on a host and are managed by software or firmware known as a hypervisor [16].
The client-server application is a well-known application model that distributes tasks or loads between the suppliers of a resource, called servers, and the clients. The clients make requests and the servers send responses: when a client sends a request for a service through the internet, the server accepts the request, processes it and delivers the result to the client. The server shares its resources with the clients, but the clients do not share any of theirs; examples include e-mail and the World Wide Web [40].
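As a concrete illustration of this request-response pattern, the following minimal sketch (in Python, using only the standard socket and threading libraries; the address, port and message contents are illustrative assumptions) runs a tiny echo-style server and a client in one process:

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9090        # illustrative address and port

def server():
    # The server owns the shared resource (here, a simple echo service).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _ = srv.accept()         # wait for one client request
        with conn:
            request = conn.recv(1024)                 # receive the request
            conn.sendall(b"response to " + request)   # send the response

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)                        # give the server time to start

# The client sends a request and waits for the server's response.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"request")
    print(cli.recv(1024).decode())     # prints: response to request
```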
Utility computing is built on the traditional pay-as-you-use model. It offers computing assets on demand as a metered service. Cloud computing and grid computing are based on the utility computing concept [14].
Cloud computing consists of several elements, such as clients, cloudlets, the datacenter broker, the cloud information service (CIS) and datacenters (virtual hosts, virtual machines and processing elements (PEs)), as shown in Figure 2.2 [41] [42].
Figure 2.2: Basic components of cloud architecture [41] [42].
· Clients: These are typically the computers, mobile phones and thin clients which are used by the end users. End users use these devices to access information on the cloud [6].
· Cloud Information Service (CIS): Firstly, the datacenter is registered with the cloud information service, and the information about the cloud components is stored in a CIS table.
· Datacenter Broker: The datacenter broker acts as a coordinator between Software-as-a-Service (SaaS) providers and cloud providers. The main responsibility of the broker is to collect the available resources and provide quality of service to the clients of the cloud system. Secondly, the broker queries the CIS for the available resources; thirdly, the CIS sends an acknowledgement to the broker about the available resources of the cloud [14].
· Datacenter: Fourthly, the broker connects to the datacenter. A datacenter is a collection of virtualized hosts, virtual machines, processing elements, virtual networks and virtual storage. A datacenter consists of the x86 architecture, an operating system, a virtual machine monitor (VMM), a host list, memory, bandwidth and storage [43].
· Host: A host consists of multiple virtual machines, and its capacity is defined in million instructions per second (MIPS). The processing elements process the cloudlets, and the results are sent back to the broker [44].
· Virtual Machine: The VMs are allocated to a host using the best-fit mechanism. The parameters of a VM are processing capacity, usually measured in million instructions per second (MIPS); memory size in megabytes (MB); storage size in terabytes (TB); and communication bandwidth in megabits per second (Mbps) [2].
· Cloudlet: A cloudlet is an application which consists of millions of instructions (it is also known as a task, such as social networking, content delivery or a business application). These cloudlets are executed by processing elements (PEs). The parameters of a cloudlet are the cloudlet Id, user Id, length (in million instructions), number of PEs, and input and output size (in MB). Sixthly, the cloudlets are submitted to the virtual machines and are executed by the PEs [45]. A minimal sketch of this workflow is given below.
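The following is a hypothetical, heavily simplified Python model of the registration-query-submission workflow just described; the class and method names are illustrative assumptions and do not correspond to a real simulator API:

```python
# Hypothetical, simplified model of the CIS/broker/datacenter interactions;
# names are illustrative, not a real simulator API.

class CIS:
    def __init__(self):
        self.table = []                    # registry of datacenters

    def register(self, datacenter):        # (1) datacenter registers with CIS
        self.table.append(datacenter)

    def query(self):                       # (2)-(3) broker asks; CIS answers
        return list(self.table)

class Datacenter:
    def __init__(self, name, vm_mips):
        self.name = name
        self.vm_mips = vm_mips             # MIPS capacity of each VM

class Broker:
    def __init__(self, cis):
        self.cis = cis

    def submit(self, cloudlet_lengths):
        dc = self.cis.query()[0]           # (4) broker connects to datacenter
        for i, length in enumerate(cloudlet_lengths):
            mips = dc.vm_mips[i % len(dc.vm_mips)]   # (5)-(6) cloudlet -> VM/PE
            print(f"cloudlet {i}: {length} MI on {dc.name} "
                  f"-> {length / mips:.2f} s")       # time = length / MIPS

cis = CIS()
cis.register(Datacenter("dc0", vm_mips=[1000, 2000]))
Broker(cis).submit([4000, 6000, 8000])
```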
Different types of faults may occur in cloud computing; for example, transient and intermittent faults may occur in the processing elements (PEs) of virtual machines (VMs) [8]. These types of faults may also occur in hosts and datacenters [7]. Faults usually occur in processing elements and memory modules. Unavailability of a host, datacenter or VM may occur due to a full disk or a disk error [46]. Permanent faults are physical damage to a host, PEs or memory [47].
Figure 2.3: Single bit, multiple bits and burst bits errors [48].
Data must be transmitted with acceptable accuracy but may be corrupted during transmission. For this reason, an error detection mechanism is necessary to receive data free of single-bit, multiple-bit and burst-bit errors [15]. A single-bit error occurs whenever a single bit of data is altered, either from 1 to 0 or from 0 to 1, as shown in Figure 2.3 (a). Multiple-bit errors occur when bits at different, non-consecutive positions are changed, as shown in Figure 2.3 (b). Whenever consecutive multiple bits are erroneous (e.g. B5, B4, B3 and B2), the result is called a burst-bit error, as shown in Figure 2.3 (c) [20].
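The three error types can be illustrated with a few lines of Python; the 8-bit word and the flipped positions are illustrative, not the figure's exact data:

```python
# Illustrative demonstration of the three error types on an 8-bit word.

def flip(bits, positions):
    """Flip the bits at the given positions (0 = least significant)."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(bits)]

data = [0, 1, 0, 1, 1, 0, 1, 0]
single   = flip(data, {3})           # single-bit error: one position altered
multiple = flip(data, {1, 6})        # multiple-bit error: isolated positions
burst    = flip(data, {2, 3, 4, 5})  # burst error: consecutive bits B5..B2

for name, word in [("original", data), ("single", single),
                   ("multiple", multiple), ("burst", burst)]:
    print(f"{name:8s} {word}")
```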
Figure 2.4: Different faults occurred in cloud computing [49] [50].
Faults occur in different environments of cloud computing, which are mainly divided into three categories: (i) hardware faults, (ii) communication faults and (iii) software faults, as shown in Figure 2.4. Faults hamper cloud users from acquiring the best services.
The hardware faults are transient faults, intermittent faults, disk-full faults, processing element (PE) faults, memory faults and storage faults [51] [52]. Transient faults may affect network connectivity, availability of services and response time. The virtual machine may be down or over-utilized, hardware might be damaged, and an acknowledgement might time out.
The following hardware faults occur in the cloud environment:
1. Transient faults may cause errors such as:
Ø Single bit
Ø Multiple bits
Ø Burst bits
2. Intermittent faults occur in a system suddenly for a short period of time. They may cause malfunctioning problems, so they should be solved by replacing the suspect components.
3. Permanent faults are physical damage to the:
Ø Machine
Ø Processor elements (PEs)
Ø Memory
4. Physical machine fault – A physical machine has multiple virtual machines, processor elements, memory, etc. Whenever permanent faults occur, they hamper the machine from giving proper services.
5. Disk full – This is one kind of omission fault: when the disk is full, no more data can be written.
6. Machine fault – Permanent faults occur in the memory or processor elements of a machine, which may bring the service down so that it cannot work smoothly.
7. Processor element fault – The processing capacity of a processor element is measured in million instructions per second (MIPS). A processor element may become too busy while cloudlets are executing.
8. Memory fault – Single-bit, multiple-bit and burst-bit errors occur in memory and require re-writing when caused by transient faults.
Siddiqui [46] has proposed a single-bit error detection scheme based on hardware fault tolerance for huge data in cloud computing. This scheme uses a concurrent error detection (CED) mechanism that is able to detect hardware faults. D. Mittal et al. [53] proposed an algorithm that uses the heartbeat protocol to detect hardware and software faults. M. K. Gokhroo et al. [54] proposed fault detection and mitigation using two fault detection time algorithms (fault-detection-time algorithm 1 and algorithm 2); these two algorithms identify faults and correct them.
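As an illustration of the heartbeat idea used in these works, the following minimal Python sketch flags a node as suspect when its periodic heartbeat message stops arriving; the timeout value and node names are illustrative assumptions, not parameters from the cited papers:

```python
import time

# A minimal heartbeat-style fault detector, sketched for illustration.

TIMEOUT = 2.0                      # seconds without a heartbeat => suspect
last_beat = {}                     # node id -> time of last heartbeat

def heartbeat(node):
    """Called whenever a node sends its periodic 'I am alive' message."""
    last_beat[node] = time.monotonic()

def detect_faults():
    """Return nodes whose heartbeat has not arrived within the timeout."""
    now = time.monotonic()
    return [n for n, t in last_beat.items() if now - t > TIMEOUT]

heartbeat("vm-0")
heartbeat("vm-1")
time.sleep(2.1)                    # simulate vm-1 going silent
heartbeat("vm-0")                  # vm-0 keeps beating
print(detect_faults())             # ['vm-1'] is suspected as faulty
```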
Software faults include software state transition faults, early and late timing faults, timing overhead, protocol incompatibilities, data faults, logical faults, numerical exceptions, operating system faults, link timeout faults, user-defined exceptions and unhandled exception faults [16] [51] [55].
Different types of software faults occur in the cloud computing environment:
1. Data faults – Different types of faults occur in the raw data. These errors are single-bit, multiple-bit and burst-bit errors.
2. Unexpected input – Whenever data are taken as input from different sources, different types of unexpected errors occur in cloud computing.
3. Logical faults – Logical faults signify the explicit or implicit effect of physical errors on the performance of a system; the system cannot detect them in any specific way.
4. Operating system fault – The operating system may be corrupted or suffer from different types of problems, such as malware, viruses and spyware. Such faults change the state of the control flow and the direction of the data flow.
5. Numerical exception – A numerical exception occurs whenever some mathematical operation works on an infinite value. Sometimes the sign of a value is changed, creating unexpected events.
6. User-defined exception – In software, different types of faults are recovered by mechanisms such as the try-catch-throw method, blocking mechanisms, N-version programming and self-checking methods.
7. Application faults – An application fault is defined as the unexpected behavior of an application, where the expected output cannot be found.
Chinnaiah [55] proposed an algorithm that achieves reliability for depth-critical configurations of a software system; misconfiguration causes unpredictable behavior and performance anomalies in software systems. A. Ledmi et al. [56] discussed optimizing fault tolerance in distributed systems. M. A. Rouf et al. [12] discussed state-of-the-art techniques to combat soft errors, broadly categorized into three types: (i) software-based schemes, (ii) hardware-based schemes, and (iii) hardware/software co-design schemes. S. Jaswal et al. [57] proposed a trust model for fault tolerance in the cloud environment; a trusted model in the cloud is mostly accessed on demand and supports building protected communication in cloud computing.
The communication faults are sending and receiving omission faults, early or late timing faults, packet corruption faults and packet loss faults. Transient faults may cause single-bit, multiple-bit and burst-bit errors [58].
Some communication faults are given below [20]:
1. Omission faults – A one-directional fault that arises from a denial-of-service attack or a full disk is called an omission fault.
2. Packet corruption – Single bits, multiple bits or burst bits are corrupted in packets while data are transmitted through the channel, changing the control flow and data flow. Packet corruption mainly occurs because of noise and temperature [15].
3. Packet loss – Whenever one or more packets of data do not reach the destination from the source, packet loss is identified due to network errors [15].
4. Timeout fault – Early and late faults occur as timeout faults: the acknowledgement does not arrive in time and a session timeout appears on a link [13].
5. Network congestion – Whenever network traffic exceeds the capacity and the bandwidth is insufficient, a network congestion error occurs in the system [28].
6. Protocol incompatibilities – Policy violation problems occur between two heterogeneous networks [28].
A cloud computing system is a new paradigm of transparent distributed system. It handles resources on a larger scale in a cost-effective and location-independent manner. Since the use of cloud computing is increasing across a broad spectrum of applications, fault-free services are required. Cloud computing is more effective and reliable when it is more fault-tolerant and more adaptable to demand. Fault tolerance is an effective measure that permits a system to continue operation even in faulty environments; it ensures system reliability [7] by improving the fault detection and recovery mechanisms [54]. To ensure fault tolerance in the cloud, there are reactive and proactive approaches: reactive fault tolerance requires error recovery after faults are detected [59], whereas proactive fault tolerance prevents faults by predicting them beforehand [60].
A fault tolerance technique considers various parameters in the cloud environment, such as throughput, performance, availability, usability, response time, scalability, reliability, security and the service level agreement (SLA) [61].
· Throughput – Throughput is an important metric for defining the performance of different fault tolerant techniques. It measures how much data is successfully sent and received in a given time.
· Response Time – The total of the input time, the processing time and the transmission time through the media channel is called the response time.
· Scalability – The services scale up or down depending on the requirements of clients; scaling can be increased horizontally or vertically. In this way, the availability of resources is improved smoothly.
· Availability – Availability is denoted by A(t) and is proportional to the reliability of a system. It is derived from the mean time to failure (MTTF), given in Equation 2.1 [62], and the mean time between failures (MTBF):

MTTF = \int_{0}^{\infty} R(t) \, dt    (2.1)
· Usability – The user-level satisfaction obtained from the available resources and their proper utilization to achieve a goal with effectiveness and efficiency is called usability.
· Reliability – A system that achieves correct or acceptable results within the deadline is called reliable; the reliability of a system determines whether applications run smoothly. It is denoted by R(t) and defined in Equation 2.2 [62], where \lambda(\tau) is the failure rate; for a constant failure rate \lambda this reduces to R(t) = e^{-\lambda t} (a numerical sketch follows this list):

R(t) = e^{-\int_{0}^{t} \lambda(\tau) \, d\tau}    (2.2)
· Overhead – Overhead can be associated with cloudlet movements and inter-system communication. It should be minimized for effective fault tolerance in the cloud environment.
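To make Equations 2.1 and 2.2 concrete, the following Python sketch assumes a constant failure rate (so R(t) = e^{-\lambda t} and MTTF = 1/\lambda) and uses the standard steady-state relation A = MTTF / (MTTF + MTTR); the failure rate and repair time are illustrative values, not data from the cited works:

```python
import math

# Numerical sketch of Equations 2.1 and 2.2 for a constant failure rate;
# the rate and repair time below are illustrative assumptions.

lam = 0.001          # failure rate (failures per hour), assumed constant
mttr = 8.0           # mean time to repair, in hours (assumed)

def reliability(t):
    """R(t) = exp(-lambda * t) for a constant failure rate (Eq. 2.2)."""
    return math.exp(-lam * t)

# MTTF = integral of R(t) dt from 0 to infinity = 1/lambda (Eq. 2.1)
mttf = 1.0 / lam
availability = mttf / (mttf + mttr)   # steady-state availability

print(f"R(100 h)     = {reliability(100):.4f}")   # ~0.9048
print(f"MTTF         = {mttf:.0f} h")             # 1000 h
print(f"Availability = {availability:.4f}")       # ~0.9921
```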
Reactive fault tolerance needs error recovery after faults are noticed. It reduces the effect of a failure on a system after the failure has occurred [11]. Several effective reactive fault tolerance techniques are listed below:
· Checkpointing: A checkpoint is a snapshot of the full state of the process. The failed system is restarted from the most recent checkpoint rather than from the initial state, as shown in Figure 2.5 [63] (a minimal sketch follows this list).
Figure 2.5: Checkpointing rollback technique [16].
· Replication: Failed tasks are re-executed by replicas on different resources. This technique has a primary virtual machine and a replica (or secondary) virtual machine. When a cloudlet fails to execute on the primary virtual machine, the replica re-executes the cloudlet from the initial state. It needs more than one hundred percent overhead [11] [31].
· Job migration: If the hosts, VMs or PEs fail, the jobs should be migrated to new entities. On a resource failure, the jobs are migrated to a new virtual machine.
· SGuard: This technique is based on a backward or rollback recovery mechanism [11].
· Retry: Re-executing the failed work on the same resources in real time is called retry. If the cloudlet fails or is cancelled, it is resubmitted [64].
· Task Resubmission: The failed tasks are submitted either to the same machine or to another machine [64].
· Backward Recovery: This is a rollback technique that restarts processing from a prior state. It needs extra time for rolling back [64].
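The checkpointing idea from the list above can be sketched in a few lines of Python; the task, the checkpoint interval and the injected fault are illustrative assumptions, not the algorithm of any cited work:

```python
import copy

# Minimal checkpoint/rollback sketch: snapshot the full process state
# periodically and, on a fault, resume from the last snapshot instead
# of restarting from the initial state.

state = {"step": 0, "partial_sum": 0}
checkpoint = copy.deepcopy(state)          # snapshot of the full state

def run(fail_at=None):
    global state, checkpoint
    while state["step"] < 10:
        if state["step"] == fail_at:
            fail_at = None                      # fail only once
            state = copy.deepcopy(checkpoint)   # rollback to checkpoint
            print("fault! rolled back to step", state["step"])
            continue
        state["partial_sum"] += state["step"]
        state["step"] += 1
        if state["step"] % 3 == 0:         # periodic checkpointing
            checkpoint = copy.deepcopy(state)

run(fail_at=7)
print(state)                               # {'step': 10, 'partial_sum': 45}
```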
Proactive fault tolerance prevents faults proactively and replaces suspect components. There are several techniques of this kind [50] [65]:
· Software Rejuvenation: The failed tasks or systems are restarted from the initial step. This is known as a system reboot, and each time the system begins with a fresh state [66].
· Self-healing: In self-healing proactive fault tolerance, failures of instances of a cloud application running on multiple virtual machines are handled automatically [65].
· Preemptive Migration: Applications have a feedback-loop mechanism that constantly monitors and resolves faults, which is called preemptive migration. It proactively replaces suspect components [65] [67] (a minimal sketch follows this list).
· Forward Fault Recovery: This is a scheme that can proceed forward even when a fault occurs. The fault is detected later by a duplex system and recovered by re-execution, or detected and recovered by triple modular redundancy (TMR) [68]. Other proactive mechanisms are TMR voters, error correcting codes, double-bit error detection and single-bit error detection.
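The preemptive-migration feedback loop from the list above can be sketched as follows; the risk scores, the threshold and the host names are illustrative assumptions rather than a real prediction model:

```python
# Sketch of a preemptive-migration feedback loop: hosts whose predicted
# failure risk exceeds a threshold have their VMs moved before failing.

THRESHOLD = 0.7

hosts = {"host-A": {"risk": 0.2, "vms": ["vm-1", "vm-2"]},
         "host-B": {"risk": 0.9, "vms": ["vm-3"]}}       # suspect component

def monitor_and_migrate(hosts):
    healthy = [h for h, info in hosts.items() if info["risk"] < THRESHOLD]
    for name, info in hosts.items():
        if info["risk"] >= THRESHOLD and info["vms"]:
            # move VMs to the least-loaded healthy host
            target = min(healthy, key=lambda h: len(hosts[h]["vms"]))
            print(f"preemptively migrating {info['vms']} "
                  f"from {name} to {target}")
            hosts[target]["vms"] += info["vms"]
            info["vms"] = []

monitor_and_migrate(hosts)     # vm-3 moves off host-B before it fails
```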
The CoW-PC algorithm minimizes checkpointing overhead by placing the checkpoints in memory; the success or failure status of a virtual machine depends on an adaptive reliability calculation [69]. Xia used a CRC-based technique in cloud storage for verification of data integrity. A comprehensive summary of fault tolerance in cloud computing is given in [1]; it emphasizes different significant concepts, architectural details and techniques. M. Amoon et al. [7] used a fault tolerance algorithm selection approach to detect and prevent faults while responding to customer requests; they observe the overhead of the replication and checkpointing techniques for an increasing number of customers. M. Azaiez et al. [2] propose a hybrid fault tolerance model that consists of checkpointing and replication techniques. B. Mohammed et al. [70] proposed an integrated virtualized failover strategy that manages faults reactively: the faults are detected and recovered using the checkpointing technique, although the overhead of checkpointing can degrade the performance of a system. R. Jhawar et al. [71] implement a fault tolerance management system that consists of a replication manager and a fault detection and recovery manager; they use the gossip and heartbeat algorithms to detect faults. S. Rajesh et al. [72] propose a technique that improves reliability: it has forward and backward recovery mechanisms, calculates the reliability of a node and takes decisions based on that reliability.
Jain explains a method that uses fault detection and tolerant systems (FDTS). This technique uses heartbeat and gossip algorithms to detect whether the application is working smoothly or not. J. Liu et al. [73] evaluated a proactive fault tolerance methodology against five related methods in terms of the overall overheads, such as network resource consumption, transmission and total execution time. K. Nivitha et al. [74] developed a dynamic fault monitoring algorithm for virtual machines.
Fault tolerance is an effective measure that enables a system to continue operation even in a faulty environment. To ensure fault tolerance in the cloud, two types of technique are available: (i) reactive fault tolerance and (ii) proactive fault tolerance. Reactive fault tolerance is a policy that detects a fault after it has occurred, using techniques such as checkpointing, replication, retry and task resubmission. Proactive fault tolerance prevents faults by predicting them beforehand, using techniques such as software rejuvenation, load balancing, preemptive migration and self-healing. Proactive fault tolerance is a forward recovery mechanism [1]: it prevents faults by predicting them, so more time and power are saved with proactive fault tolerance [6] [10]. The two approaches are compared in Table 2.1.
Table 2.1: Comparison of reactive and proactive fault tolerance
| Comparison | Reactive Fault Tolerance | Proactive Fault Tolerance |
| --- | --- | --- |
| Definition | Detects and corrects faults after they have occurred. | Prevents faults beforehand. |
| Recovery mechanism | Backward recovery | Forward recovery |
| Time complexity | More | Less |
| Error detection | More than two steps | Less than two steps |
| Hardware | Needs more hardware | Needs less hardware |
| Undetectable errors | Fewer | Bursts longer than r bits are detected with probability 1 − 2^{-r}, r = 17, 18, …, 64 |
| Power cost | More than proactive fault tolerance | Less than reactive fault tolerance |
| Overhead | More | Less |
| Existing techniques | Checkpointing, replication, retry and task resubmission, etc. | Software rejuvenation, load balancing, preemptive migration and self-healing, etc. |
| Example | CRC, checksum and parity checking techniques, etc. | Effective fault tolerance (EFT) and aggressive fault tolerance (AFT) |
Bartholomew and Oscar utilize a CRC64 optimization which improves verification reliability; however, the network bandwidth suffers when the amount of transmitted data increases greatly. Kumar and Raj proposed that reliability depends on the probability of error detection capability within time. E. Abdelfattah et al. [59] proposed a technique in which failed tasks are executed by the most reliable node; a reject message is sent back if a task cannot be recovered. R. Buyya et al. [14] proposed a scheme that works when the demand of cloud users varies over scalable and virtualized entities; they explain the relationships among entities and events and compare the performance of federated and non-federated networks.
There are different types of existing error detection schemes, such as parity checking, checksum and cyclic redundancy check (CRC) [19].
Parity checking is generally used for single-bit error detection; either even parity or odd parity can be used. The transmitted codeword is N + R bits, where R is the parity bit and N is the number of main data bits. The receiver accepts the data bits if the parity bit is valid; otherwise they are rejected.
Suppose the k data bits of a message are D_1, D_2, \dots, D_k, appended with a parity bit P, so that n = k + 1 is the length of the codeword. Parallel XOR operations generate the parity P = D_1 \oplus D_2 \oplus \dots \oplus D_k [19].
Parity check bits are calculated to generate a single bit for each row and each column, as shown in Figure 2.6; every row and column gets its own parity bit. The row-wise streams are appended and sent to the receiver. After reception, the data are checked using the two-dimensional parity [19] (a small sketch follows Figure 2.6). The limitations of this scheme are given below:
Ø When two bits in one data unit are in error and two bits at the same positions in another data unit are also in error, the parity checker is unable to detect the error.
Ø It is unable to detect some errors of four bits or more (e.g. when the erroneous bits form a rectangle).
Figure 2.6: Two-dimensional parity checking.
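The two-dimensional parity computation of Figure 2.6 can be sketched in Python as follows; the 3 × 4 data block is an illustrative example, not the figure's exact data:

```python
# Two-dimensional even parity: one parity bit per row and per column.
from functools import reduce

def xor_bits(bits):
    return reduce(lambda a, b: a ^ b, bits)

rows = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]

row_parity = [xor_bits(r) for r in rows]              # one bit per row
col_parity = [xor_bits(col) for col in zip(*rows)]    # one bit per column

# Sender appends the parity row and column; the receiver recomputes both
# and compares. A mismatch localizes a single-bit error at (row, column).
print("row parity:", row_parity)      # [1, 0, 0]
print("col parity:", col_parity)      # [0, 0, 0, 1]
```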
The checksum method is based on binary addition of all the sub-partitions (blocks) of the data. The summation of the data blocks is calculated, and then the one's complement of the summation is computed. This one's complement is appended to the end of the data unit as a redundancy field, which is defined as the checksum. The data, extended with the redundancy field, are sent to the receiver, as given in Equation 2.3. On the receiver side, the summation of all blocks, including the checksum, is computed, followed by the one's complement of this result. The pattern is accepted if the one's complement of the summation is zero; otherwise it is sent for re-execution, as given in Equation 2.4 [19].

Suppose the data consist of n binary bits divided into k segments (blocks) m_1, m_2, \dots, m_k of the original message m. Let S be the summation of all blocks, C = \bar{S} the one's complement of S (the checksum), T the generated codeword, and S' the summation of all blocks after the checksum is appended.

S = \sum_{i=1}^{k} m_i    (2.3)

The one's complement of the summation of the blocks is C = \bar{S}, and the generated codeword T is the n-bit binary stream appended with C. On the receiving side, let m'_i denote the i-th block of the received codeword, including the checksum block:

S' = \sum_{i=1}^{k+1} m'_i    (2.4)

The result is accepted if \bar{S'} = 0; otherwise it is rejected (Equation 2.4). The time complexity of the checksum is O(n), linear in the number of data bits [22].
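A minimal Python sketch of Equations 2.3 and 2.4, assuming 8-bit blocks and illustrative sample data, shows the sender computing the checksum and the receiver accepting or rejecting:

```python
# One's complement checksum over 8-bit blocks (Equations 2.3 and 2.4);
# the block size and sample data are illustrative assumptions.

MASK = 0xFF                                   # 8-bit blocks

def ones_complement_sum(blocks):
    s = 0
    for b in blocks:
        s += b
        s = (s & MASK) + (s >> 8)             # wrap the carry around
    return s

def make_checksum(blocks):
    return MASK ^ ones_complement_sum(blocks)     # C = one's complement of S

data = [0b10110011, 0b01100010, 0b00001111]
codeword = data + [make_checksum(data)]       # append checksum C

# Receiver: sum all blocks including C; complement must be zero to accept.
check = MASK ^ ones_complement_sum(codeword)
print("accept" if check == 0 else "reject")   # accept

codeword[1] ^= 0b00000100                     # inject a single-bit error
check = MASK ^ ones_complement_sum(codeword)
print("accept" if check == 0 else "reject")   # reject
```

The wrap-around of the carry implements one's complement addition, which is what makes the receiver-side complement come out to zero for an intact codeword.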
The CRC error detection technique is mainly based on binary division, as shown in Figure 2.7. The polynomial divisor G(x) operates by binary division on the original data: a sequence of redundant extra zeros is appended to the data, and the remainder of the division, called the CRC, is added in their place. If the length of the polynomial divisor G(x) is n bits, then (n − 1) extra zeros are appended to the original data. On the receiver side, the received codeword is divided by the same polynomial divisor G(x); the codeword is accepted if the remainder is zero, otherwise resending or re-execution is requested, as shown in Figure 2.8 [19].
Figure 2.7: Cyclic redundancy checking technique [19] [58].
Let G(x) be the CRC polynomial divisor. A codeword of n bits w_{n-1} w_{n-2} \dots w_0, with maximum polynomial degree n − 1, corresponds to the polynomial W(x) = w_{n-1}x^{n-1} + w_{n-2}x^{n-2} + \dots + w_1 x + w_0, where the w_i are the coefficients and x is the variable [75]:

W(x) = \sum_{i=0}^{n-1} w_i x^i    (2.5)
Figure 2.8: CRC error detection technique using binary division [19] [58].
On the receiving side, if the received codeword is not divisible by the polynomial G(x), the transmitted codeword T(x) has been changed by error bits E(x). R(x) denotes the received data in Equation 2.6, so the check can be expressed in a simple way [75]:

R(x) = T(x) + E(x)    (2.6)

If the error E(x) is zero, then the data are received error-free:

R(x) = T(x) + 0 = T(x)
The binary division is performed twice, once at the sender over the n data bits and once at the receiver over the codeword with its r appended CRC bits:

Complexity = O(2 \times n + r)    (2.7)

So the time complexity of CRC is higher than that of the other techniques, as given in Equation 2.7 [22] [58].
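The mod-2 binary division of Figures 2.7 and 2.8 can be sketched in Python; the generator G(x) = x^3 + x + 1 and the six data bits are a textbook-style illustration, not values from the source:

```python
# Bitwise CRC as mod-2 binary division (Figures 2.7 and 2.8).

def crc_remainder(bits, divisor):
    """Append len(divisor)-1 zeros and return the mod-2 division remainder."""
    bits = bits + [0] * (len(divisor) - 1)
    for i in range(len(bits) - len(divisor) + 1):
        if bits[i] == 1:                   # divide only when leading bit is 1
            for j, d in enumerate(divisor):
                bits[i + j] ^= d           # mod-2 subtraction is XOR
    return bits[-(len(divisor) - 1):]

data = [1, 0, 1, 1, 0, 1]
gen  = [1, 0, 1, 1]                        # G(x) = x^3 + x + 1
crc  = crc_remainder(data, gen)
codeword = data + crc                      # T(x) = data plus CRC bits

# Receiver divides the codeword by G(x); a zero remainder means accept.
print(crc_remainder(codeword, gen))        # [0, 0, 0] -> accept
codeword[2] ^= 1                           # inject a single-bit error
print(crc_remainder(codeword, gen))        # non-zero -> request resend
```

Because mod-2 subtraction is just XOR, the same routine serves both the sender (computing the CRC) and the receiver (checking for a zero remainder).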
Hamming code is a mechanism for error detection and correction. When huge amounts of data are transmitted from the sender to the receiver, errors may occur. Redundant extra bits are generated and added to the information bits of the data, which ensures error-free data transmission from source to destination [19].
The number of redundant bits is calculated from the formula 2^r \geq m + r + 1, where r is the number of redundant bits and m is the number of data bits. For example, if the number of data bits is 8, the number of redundant bits is calculated from 2^4 \geq 8 + 4 + 1, so the number of redundant bits is 4 [19].
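The calculation of r from 2^r ≥ m + r + 1 can be checked directly; the following small Python sketch reproduces the worked example (m = 8 gives r = 4):

```python
# Smallest r satisfying the Hamming condition 2**r >= m + r + 1.

def redundant_bits(m):
    r = 0
    while 2 ** r < m + r + 1:
        r += 1
    return r

for m in (4, 8, 16, 32):
    print(f"m = {m:2d} data bits -> r = {redundant_bits(m)} redundant bits")
# m =  4 -> r = 3;  m =  8 -> r = 4;  m = 16 -> r = 5;  m = 32 -> r = 6
```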
Many researchers are working on fault detection and recovery mechanisms, and their functionalities and architectures for cloud systems have been discussed above. Each work has complications that the researchers have not fully addressed. The major complications are noted as follows:
1) The hybrid fault tolerance model is limited to detecting and correcting errors only after they have occurred. Its reliability depends on finishing tasks within the deadline, and the calculation of reliability is rigidly defined [19].
2) The reactive fault tolerance technique has limitations: (i) it wastes more resources, and (ii) the failed tasks must be rescheduled on an available VM [75].
3) The CRC is unable to guarantee detection of burst errors longer than the maximum degree r of the divisor (r = 16, 17, …, 64) [19].
4) For such bursts, the probability of detection by CRC is 1 − 2^{-r}, so the probability of an undetectable error is 2^{-r} [19] [75].
5) The checksum technique only detects single-bit and multiple-bit errors; burst bits are not detected [19].
6) The CRC and checksum techniques [19]:
Ø CRC uses a complex shift register, and checksum uses a binary adder circuit [19].
Ø Detect the errors only after multiple steps.
Ø Are reactive fault tolerance techniques.
We have identified the following investigation gaps and challenges from the above existing works. One challenge we have worked on is that CRC-16 improves reliability by detecting at most 16 bits of burst errors [19]; whenever a burst error is longer than 16 bits, it is detected only with probability 1 − 2^{-r}, where r = 17, 18, …, 64, so the probability of an undetectable error is 2^{-r}.
a) A best-fitting packetizing mechanism needs to be implemented.
b) The rearrangement of erroneous packets is difficult to maintain.
c) The overhead increases as more packets are erroneous.
d) We have implemented the proposed EFT architecture by extending the datacenter broker.
e) Aggressive fault tolerance (AFT) is used to tolerate faults and recover from them using a smart decision agent.
f) Aggressive fault detection is applied to detect errors using the heartbeat algorithm.
We have developed data packetizing (DP) and data unpacketizing (DUP) algorithms that achieve best-fit packetizing. The DP algorithm packetizes data perfectly without wasting memory. Every packet has a unique sequence number, and the window size helps to arrange the packets using these sequence numbers. More erroneous packets create more overhead; to address this problem, we have analyzed the overhead, and our proposed overhead analysis performs better than existing techniques.

To implement the EFT scheme by extending the datacenter broker, we have proposed the effective fault tolerance mechanism algorithm. CRC and checksum can detect errors only after processing: data are appended with an error detection code, sent, received by the server, and then processed to find any errors. Because they detect faults after the faults have occurred, CRC and checksum are reactive fault tolerance techniques.

Finally, we evaluate the performance and analyze the results of effective fault tolerance (EFT) and aggressive fault tolerance (AFT) in cloud computing. The EFT scheme can detect and correct errors only in the communication network layer and is compared with the existing CRC and checksum techniques. The AFT scheme, in contrast, can detect and correct errors in the hardware, software and communication network layers and is compared with the existing checkpointing, replication and resubmission techniques.