Cloud Research Lab Bangladesh - Aggressive Fault Tolerance in Cloud Computing using Smart Decision Agent

Cloud Research Lab Bangladesh

Aggressive Fault Tolerance in Cloud Computing using Smart Decision Agent.

Response Letter for AFT Paper

Paper Link: Aggressive Fault Tolerance in Cloud Computing using Smart Decision Agent

Reference Papers:

Effect of Fault Tolerance in the Field of Cloud Computing

2. A review paper on fault tolerance in cloud computing.

With the immense growth of internet and its users, Cloud computing, with its incredible possibilities in ease, Quality of service and on-interest administrations, has turned into a guaranteeing figuring stage for both business and non-business computation customers. It is an adoptable technology as it provides integration of software and resources which are dynamically scalable. The dynamic environment of cloud results in various unexpected faults and failures. The ability of a system to react gracefully to an unexpected equipment or programming malfunction is known as fault tolerance. In order to achieve robustness and dependability in cloud computing, failure should be assessed and handled effectively. Various fault detection methods and architectural models have been proposed to increase fault tolerance ability of cloud. The objective of this paper is to propose an algorithm using Artificial Neural Network for fault detection which will overcome the gaps of previously implemented algorithms and provide a fault tolerant model.

3. Fault Tolerance in Distributed Systems: A Survey

Distributed systems can be homogeneous (cluster), or heterogeneous such as Grid, Cloud and P2P. Several problems can occur in these types of systems, such as quality of service (QoS), resource selection, load balancing and fault tolerance. Fault tolerance is a main subject regarding the design of distributed systems. When a hardware or software failure occurs in the system, it causes a failure and we call it, in this case, a fault. Moreover, in order to allow the system to continue its functionalities, even in the presence of these faults, they must find techniques, which tolerate failure; the goal of these techniques is to detect and to correct these errors. In this paper, we introduce at first an overview of the basic concepts of distributed systems and their failures types, then we present, in a detailed manner, the different techniques that tolerate fault, used to identify and to correct faults in different kinds of systems such as: cluster, grid computing, Cloud and P2P systems.

4. Detecting and mitigating faults in cloud computing environment

Distributed Systems have swiftly evolved from network of personal computers to cluster and then to grid, moving on to the era of cloud computing and now the latest one as Inter- net of things (IoT). With these rapid enhancements, the scale and complexity of systems providing cloud computing services have also increased tremendously. The major challenge faced by cloud service providers today is to provide an efficient, cost-effective, and reliable solution for seamless delivery of services to users. To achieve this research community is constantly working hard on different related issues like scheduling, power consumption, high availability, customer retention, resource provisioning, reliability and minimizing the probability of failures, etc. Reliability of service is an important parameter. With a large number of components in the cloud, the probability of failures is becoming a norm rather than an exception while delivering services to users. This emphasizes the need to develop fault tolerant schemes for cloud environment to deliver the required level of reliability. In this work, we have proposed a novel fault detection and mitigation approach. The novelty of approach lies in the method of detecting the fault based on running status of the job. The detection algorithm periodically monitors the progress of job on virtual machines (VMs) and reports the stalled job due to failed VM to fault tolerant manager (FTM). This not only reduces the resources wastage but ensures timely delivery of services to avoid any penalty due to service level agreement (SLA) violation. The validation of the proposed approach is done using CloudSim simulator.The performance analysis reveals the effectiveness of the proposed approach.

5. Adaptive Framework for Reliable Cloud Computing Environment

Cloud computing technology has become an integral trend in the market of Information Technology. Cloud computing virtualization and its Internet-based lead to various types of failures to occur and thus the need for reliability and availability has become a crucial issue. To ensure cloud reliability and availability, a fault tolerance strategy should be developed and implemented. Most of the early fault tolerant strategies focused on using only one method to tolerate faults. This paper presents an adaptive framework to cope with the problem of fault tolerance in cloud computing environments. The framework employs both replication and checkpointing methods in order to obtain a reliable platform for carrying out customer requests. Also, the algorithm determines the most appropriate fault tolerance method for each selected virtual machine. Simulation experiments are carried out to evaluate the framework’s performance. The results of the experiments show that the proposed framework improves the performance of the cloud in terms of throughput, overheads, monetary cost and availability.

6. Using proactive faulttolerance approach to enhance cloud service reliability

The large-scale utilization of cloud computing services for hosting industrial/enterprise applications has led to the emergence of cloud service reliability as an important issue for both cloud service providers and users. To enhance cloud service reliability, two types of fault tolerance schemes, reactive and proactive, have been proposed. Existing schemes rarely consider the problem of coordination among multiple virtual machines (VMs) that jointly complete a parallel application. Without VM coordination, the parallel application execution results will be incorrect. To overcome this problem, we first propose an initial virtual cluster allocation algorithm according to the VM characteristics to reduce the total network resource consumption and total energy consumption in the data center. Then, we model CPU temperature to anticipate a deteriorating physical machine (PM). We migrate VMs from a detected deteriorating PM to some optimal PMs. Finally, the selection of the optimal target PMs is modeled as an optimization problem that is solved using an improved particle swarm optimization algorithm. We evaluate our approach against five related approaches in terms of the overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications. Experimental results demonstrate the efficiency and effectiveness of our approach.

7. Comparison of Various Fault Tolerance Techniques for Scientific Workflows in Cloud Computing

Cloud computing has become a developing platform for personal or individual computing. For enhancing capabilities of cloud, fault tolerance is a very essential task. Fault tolerance is one of the major issue among all the issues of cloud computing. The principle advantages of having fault tolerance in cloud computing incorporate lower cost, failure recovery and enhanced performance metrics. In this paper, we have reviewed various techniques for optimising the fault tolerance in scientific workflows. The fault tolerance strategy reduces the cost, scheduling overhead and scalability etc. It is impossible to satisfy all the requirements for effective fault tolerance as it violates the service level agreement.

8. Review on Fault Tolerance Techniques in Cloud Computing.

9. A Survey on Fault Tolerant Techniques and Issues in Recent Generation Processors.

Processors in embedded systems, in microcomputers and even in servers are highly induced by soft errors because of miniaturization of VLSI circuits and reduction of voltage levels. Recent literature finds that the most of the system’s downtime is caused by soft errors. The soft errors influence the pipeline and hence affect the program’s data flows and control flows. In the modern multicore era, the densities of transistors are increasing, consequently the vulnerability of processors to soft errors is also increasing. However, due to the pressure of performance improvement and price reduction, the reliability issue is often ignored in the recent generations of processors. Error detection mechanism is important because erroneous execution can be catastrophic for safety critical application. In this article, we have reviewed different types of fault tolerant and fault-injection techniques to understand the reliability issues in modern microprocessors.

10. A systematic review of fault tolerance solutions for communication errors in open source cloud computing

Cloud systems, as any other system, must be reliable. This means that the system should respond correctly in presence of failures, which are quite probable in a distributed, largely independent, system as cloud systems are. Thus, it is important that cloud systems become fault tolerant, ensuring safe recovery from failures. Since failures in clouds may come from several different sources, although a major role comes from communication failures, the techniques that can be applied to assure reliability are also very different. This survey presents a systematic review of solutions to provide fault tolerance in open source clouds. Our goal with this review is to provide to cloud managers a guided approach to choose a solution for a given problem or system.

11. A reactive fault tolerance approach for cloud computing

Reliability is a critical requirement for any system. To achieve high reliability the fault tolerance must be accomplished. Fault tolerance refers to the task must be executed even in occurring the fault. Cloud computing has emerged that grants users with access to remote computing resources. Although the current development of the cloud computing technology there are more challenges and chances of errors occur during execution. In this paper, the proposed model tolerates the faults by using replication and resubmission techniques. Then it decides which the best virtual machine depending on the reliability assessments. Then it reschedules the task once the failure occurs to the highest reliability processing node instead of replicating this task to all available nodes. Additionally, we compare our proposed model with another model that used replication and resubmission without any improvement. And we evaluate the experiments by a CloudSim simulator. We conclude that the proposed model can provide comparable performance with the traditional replication and resubmission techniques.

12. A fault tolerance manager with distributed coordinated checkpoints for automatic recovery

Components for High Performance Computing are continuously increasing to achieve more performance and satisfy scientific application users demands. To reduce the Mean Time To Repair in these systems and increment high availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most used technique to take checkpoints in parallel applications implemented using Message Passing Interface is the coordinated checkpoints. In this paper a Fault Tolerance Manager (FTM) for coordinated checkpoint files is presented, to provide users automatic recovery from failures when losing computing nodes. This proposal makes the configuration of FT simpler and transparent for users without knowledge of their application implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. It takes advantage of node local storage to save checkpoints, and it distributes copies of them along all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. This work is based on RADIC, a wellknown architecture to provide fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results shows the benefits of the presented approach in a private cluster and a well-known cloud computing environment, Amazon EC2.

13. Hybrid fault tolerance model for cloud dependability.

Cloud computing is gaining an increasing popularity and the number of user’s applications and interactions with Cloud resources have grown considerably making Cloud services more susceptible to failure. Therefore, fault tolerance is an important property in order to achieve reliability, availability and required quality of service. Several studies were interested in Cloud fault tolerance issues. These studies proposed solutions including checkpoint and replication that are focused on failure of a single Cloud resource, namely virtual machines, and virtualization issues. Such solutions are insufficient given Cloud complexity. Unlike previous work, we propose a solution that takes into consideration not only virtual machines but also physical machines. Our solution is a hybrid fault tolerance strategy that inherits good features and overcome the limitations of traditional fault tolerance strategies. We evaluate the efficiency of our fault tolerance strategy compared to replication and checkpoint strategies by using CloudSim and we prove that our strategy leads to better results.

14. An Integrated Virtualized Strategy for Fault Tolerance in Cloud Computing Environment

Cloud fault tolerance allows the cloud an ability to keep performing its functions correctly even if faults are occurring in the system. This becomes an important property that enables a complete system to continue functioning in the event of one or more faults for high availability of virtual machines or in life critical systems. A fault-tolerant design may allow the system to possibly function at a reduced level, rather than failing completely. As a major concern in guaranteeing availability and reliability of critical services or application execution in cloud environment, cloud fault tolerance research focuses on detection and recovery strategies. However, in order to minimize impacts and anticipate failures to proactively handle them, a model called an Integrated Virtualized Failover strategy (IVFS) was introduced where fault tolerance was realized using redundancy, checkpoint/replay and fault manager. In this paper, we critically analyze this model and proposed a model that tolerate faults based on the reliability of each computing node or virtual machine, removing these from the availability list if the performance is not optimal. The results of our algorithm presents an increase in pass rates and considers forward/backward recovery using diverse software tools. Our simulation results suggest a good performance compared to current existing models. The results are demonstrated through experimental validation with a critical analysis, laying the foundation for a fully fault tolerant IaaS Cloud environment.

15. Fault Detection and Prediction in Cloud Computing .

Cloud computing is a new technology in distributed computing. Usage of Cloud computing is increasing quickly day by day. In order to help the customers and businesses agreeably, fault occurring in datacenters and servers must be detected and predicted efficiently in order to launch mechanisms to bear the failures occurred. Failure in one of the hosted datacenters may broadcast to other datacenters and make the situation of poorer quality. In order to prevent such circumstances, one can predict a failure flourishing throughout the cloud computing system and launch mechanisms to deal with it proactively.

16. Trust and fault tolerance models in cloud computing: A review

Cloud Computing has been considered as a future technology of internet. This is all because of sharing of IT resources, feature of scalability, flexibility and higher levels of automation. With this the rapid growth, Cloud Computing has bought concerns of security trust. Various trust issues of Cloud have been addressed by a combination of frameworks, standards and related technologies. Sometimes, consumers avoid a specific technology whenever it shows no ability to cope with their security demands. This type of loss can occur in computing platforms such as Cloud platforms and mobile platforms. Also, the concept of fault tolerance that helps in working of a system even when some of the functionalities are not working with full efficiency. Along with Trust, Fault tolerance is also a vital issue in Cloud computing platforms and applications. This feature enables any system to continue its operation at a reduced level, rather than completely failing in delivery output, especially when some subcomponent of the system malfunctions unexpectedly. This paper represents various trust and fault tolerance models existing in cloud environment along with its existing challenges.

17. Auto-scaling web applications in clouds: A taxonomy and survey.

Cloud computing is the emerging paradigm for offering computing resources and applications as subscription-oriented services on a pay-as-you-go basis. one of its features, called elasticity, which allows users to dynamically acquire and release the right amount of computing resources according to their needs is continuously attracting web application providers to move their applications into clouds. To efficiently utilizing elasticity of clouds, it is vital to automatically and timely provision and deprovision cloud resources, since over-provisioning leads to resource wastage and extra monetary cost, while under-provisioning causes performance degradation and violation of service level agreement (SLA). This mechanism of dynamically acquiring or releasing resources to meet QoS requirements is called auto-scaling. However, designing and implementing an efficient general purpose auto-scaling system for web applications is a challenging task due to various factors, such as dynamic workload characteristics, diverse application resource requirements, and complex cloud resource and pricing models. In this paper, we aim to comprehensively analyze the challenges in the implementation of an auto-scaler in clouds and review the developements for researchers that are new to this field. We presnt a taxonomy regarding the various challenges and key properties of auto-scaling web applications.We compare the existing works and map them to the taxonomy to discuss their strength and weakness. Based on the analysis, we also propose promising future directions that can be pursued by researchers to improve the state-of-the-art. Lorido-Botran et al. [Lorido-Botran et al. 2014] have already written a survey about this topic. However, their focus is on resource estimation techniques while omitting other important challenges such as oscillation mitigation, and resource planning. Different from them, our work provides comprehensive discussions about all the major challenges in the topic and it also introduces new developments after their work. The rest of the paper is organized as follows. In Section 2, we describe our definition of the auto-scaling problem for web applications and list its major challenges that need to be addressed when trying to implement one. After that, we present a taxonomy regarding the existing auto-scaling systems. From Section 4 to Section 12, we introduce and compare how the existing auto-scaling systems tackle the listed challenges. After that, in Section 13, we discuss the gaps of the current solutions and present some promising future research directions. Finally, we summarize the findings and conclude the paper.

18. SEDC-Based Hardware-Level Fault Tolerance and Fault Secure Checker Design for Big Data and Cloud Computing.

Fault tolerance is of great importance for big data systems. Although several software-based application-level techniques exist for fault security in big data systems, there is a potential research space at the hardware level. Big data needs to be processed inexpensively and efficiently, for which traditional hardware architectures are, although adequate, not optimum for this purpose. In this paper, we propose a hardware-level fault tolerance scheme for big data and cloud computing that can be used with the existing software-level fault tolerance for improving the overall performance of the systems.The proposed scheme uses the concurrent error detection (CED) method to detect hardware-level faults, with the help of Scalable Error Detecting Codes (SEDC) and its checker. SEDC is an all unidirectional error detection (AUED) technique capable of detecting multiple unidirectional errors. The SEDC scheme exploits data segmentation and parallel encoding features for assigning code words. Consequently, the SEDC scheme can be scaled to any binary data length “n” with constant latency and less complexity, compared to other AUED schemes, hencemaking it a perfect candidate for use in big data processing hardware.We also present a novel area, delay, and power efficient, scalable fault secure checker design based on SEDC. In order to show the effectiveness of our scheme, we (1) compared the cost of hardwarebased fault tolerance with an existing software-based fault tolerance technique used in HDFS and (2) compared the performance of the proposed checker in terms of area, speed, and power dissipation with the famous Berger code and m-out-of-2m code checkers. The experimental results show that (1) the proposed SEDC-based hardware-level fault tolerance scheme significantly reduces the average cost associated with software-based fault tolerance in a big data application, and (2) the proposed fault secure checker outperforms the state-of-the-art checkers in terms of area, delay, and power dissipation.

19. A preliminary fault taxonomy for multi-tenant SaaS systems.

Multi-tenancy is the key feature for every Software as a Service (SaaS), as it enables multiple customers, so-called tenants, to transparently share a system’s resources reducing costs. Tenants can customize a system according to their particular needs, however, such a high level of complexity may open possibilities for a failure. In addition, there is a lack of a reference architecture for such applications and once the implementations differ significantly, ensuring that all executions flows have been verified without impacting the working features for other tenants is a complex task. The clear understanding of the possible faults is fundamental for the identification, tolerance and definition of appropriate testing techniques. This paper presents a preliminary fault taxonomy for multi-tenant cloud applications considering their foundational features. A literature review previously carried out, a survey with practitioners and analysis of some applications were performed to achieve this classification. In addition, an e-commerce called MtShop was developed for a case study. The expressiveness of the proposed taxonomy is illustrated with critical faults identified in the MtShop through the automated and parallel testing. We conclude with the benefits that our taxonomy can bring to testing, prediction and regression testing activity of multi-tenant cloud applications.

20. Fault tolerant software systems using software configurations for cloud computing.

Customizable software systems consist of a large number of different, critical, non-critical and interdependent configurations. Reliability and performance of configurable system depend on successful completion of communication or interactions among its configurations. Most of the time users of configurable systems very often use critical configurations than non-critical configurations. Failure of critical configurations will have severe impact on system reliability and performance. We can overcome this problem by identifying critical configurations that play a vital role, then provide a suitable fault tolerant candidate to each critical configuration. In this article we have proposed an algorithm that identifies optimal fault tolerant candidate for every critical configuration of a software system. We have also proposed two schemes to classify configurations into critical and non-critical configurations based on: 1) Frequency of configuration interactions (IFrFT), 2) Characteristics and frequency of interactions (ChIFrFT). These schemes have played very important role in achieving reliability and fault tolerance of a software system in a cost effective manner. The percentage of successful interactions of IFrFT and ChIFrFT are 25 and 40% higher than that of the NoFT scheme. In NoFT scheme none of the configurations are supported by fault tolerance candidates. Performance of IFrFT, ChIFrFT, and NoFT schemes are tested using a file structure system.

21. The Effects of Latency , Bandwidth , and Packet Loss on Cloud-Based Gaming Services.

Network bandwidth increases make the concept of cloud-based gaming services a promising alternative to traditional gaming platforms. Cloud-based gaming services do this by processing and rendering the game in a cloud server, receiving control input from the client and streaming the rendered game back to the client akin to video streaming. Network latency presents a challenge cloud-based gaming services must overcome to provide a comparable experience to traditional gaming. Measuring the effects of latency on key factors, such as quality of experience and player performance, can help understand the capabilities of the current generation of cloud-based gaming services. We conduct a cloud-based gaming service user study, surveying user’s subjective quality of experience and measuring their in-game performance and conduct experiments that measure cloud-based gaming services’ network characteristics. Analysis of results shows a significant decrease in both quality of experience and player performance as latency increases, but latency has little effect on the frame rate or average throughput of cloud-based gaming services.

22. AURA: Recovering from transient failures in cloud deployments.

In this work, we propose AURA, a cloud deployment tool used to deploy applications over providers that tend to present transient failures. The complexity of modern cloud environments imparts an error-prone behavior during the deployment phase of an application, something that hinders automation and magnifies costs both in terms of time and money. To overcome this challenge, we propose AURA, a framework that formulates an application deployment as a Directed Acyclic Graph traversal and re-executes the parts of the graph that failed. AURA achieves to execute any deployment script that updates filesystem related resources in an idempotent manner through the adoption of a layered filesystem technique. In our demonstration, we allow users to describe, deploy and monitor applications through a comprehensive UI and showcase AURA’s ability to overcome transient failures, even in the most unstable environments.

23. Modeling and simulation of scalable cloud computing environments and the cloudsim toolkit: Challenges and opportunities.

Cloud computing aims to power the next generation data centers and enables application service providers to lease data center capabilities for deploying applications depending on user QoS (Quality of Service) requirements. Cloud applications have different composition, configuration, and deployment requirements. Quantifying the performance of resource allocation policies and application scheduling algorithms at finer details in Cloud computing environments for different application and service models under varying load, energy performance (power consumption, heat dissipation), and system size is a challenging problem to tackle. To simplify this process, in this paper we propose CloudSim: an extensible simulation toolkit that enables modelling and simulation of Cloud computing environments. The CloudSim toolkit supports modelling and creation of one or more virtual machines (VMs) on a simulated node of a Data Center, jobs, and their mapping to suitable VMs. It also allows simulation of multiple Data Centers to enable a study on federation and associated policies for migration of VMs for reliability and automatic scaling of applications.

24. Fault Diagnosis for Uncertain Cloud Environment through Fault Injection Mechanism.

Cloud is the epitome of business computing as it is expeditiously taking over the long-established and conventional systems as a more efficient, reliable, elastic, cost-effective and scalable panacea. However, this introduces a continuous sense of uncertainty into the computational process. Cloud presents numerous challenges, with one of them being the Quality of Service management (QoS). Cloud is dynamic; therefore uncertainty can arise from various factors such as resource availability, unmatched predictions, unreliable data or unexpected faults in the system. The existing models fail to provide an efficient mechanism that detect, measure the uncertainty that has occurred and then rectify it. To determine the potency of the model proposed, parameters like Availability and Reliability are considered to render reliable and seamless cloud service.

25. Fault tolerance management in cloud computing: A system-level perspective.

The increasing popularity of Cloud computing as an attractive alternative to classic information processing systems has increased the importance of its correct and continuous operation even in the presence of faulty components. In this paper, we introduce an innovative, system-level, modular per- spective on creating and managing fault tolerance in Clouds. We propose a comprehensive high-level approach to shading the implementation details of the fault tolerance techniques to application developers and users by means of a dedicated service layer. In particular, the service layer allows the user to specify and apply the desired level of fault tolerance, and does not require knowledge about the fault tolerance techniques that are available in the envisioned Cloud and their implementations.

26. Improving Fault Tolerance in Virtual Ma- chine Based Cloud Infrastructure.

Cloud Computing is a style of Computing where service is provided across the internet using different models. Fault tolerance is a major concern to guarantee availability and reliability of critical services as well as application execution. In this project work, we propose a model to analyze how system tolerates the faults and make decision on the basics of reliability of the processing nodes, i.e. Virtual machines. If a virtual machine manages to produce a correct result within the time limit, its reliability increases, and if it fails to produce the result within time or correct result, its reliability decreases. If the node continues to fail, it is removed, and a new node is added. There is also a minimum reliability level. If any processing node does not achieve that level, the system will perform backward recovery or safety measures. The proposed technique is based on the execution of design diverse variants on multiple virtual machines, and assigning reliability to the results produced by variants. The virtual machine instances can be of same type or of different types. The system provides both the forward and backward recovery mechanism, but main focus is on forward recovery.

27. Cloud simulation under fault constraints.

As Cloud Computing offers support to more and more complex applications, the need to verify and validate computing models under fault constraints becomes more important, aiming to ensure applications performance. Doing this experimental validation in the early development phase and with small costs require a cloud simulation tool. An extensible framework for Cloud simulation and modeling is CloudSim. This paper proposes a new module for CloudSim consisting of a fault injector based on a specification language. The aim is to assist simulation to be more realistic and includes concrete conditions and constraints. The impact is on testing Cloud applications and help test fault tolerant applications by specifying defect patterns and failing components. The evaluation of the fault injection module is done by measuring the behavior and performance of a tool based on CloudSim, named CloudAnalyst. Several metrics are determined and measured for experimental validation, and conclusions are drawn.

28. A multi-objective optimization method of initial virtual machine fault-tolerant placement for star topological data centers of cloud systems.

Virtualization is the most important technology in the unified resource layer of cloud computing systems. Static placement and dynamic management are two types of Virtual Machine (VM) management methods. VM dynamic management is based on the structure of the initial VM placement, and this initial structure will affect the efficiency of VM dynamic management. When a VM fails, cloud applications deployed on the faulty VM will crash if fault tolerance is not considered. In this study, a model of initial VM fault-tolerant placement for star topological data centers of cloud systems is built on the basis of multiple factors, including the service-level agreement violation rate, resource remaining rate, power consumption rate, failure rate, and fault tolerance cost. Then, a heuristic ant colony algorithm is proposed to solve the model. The service-providing VMs are placed by the ant colony algorithms, and the redundant VMs are placed by the conventional heuristic algorithms. The experimental results obtained from the simulation, real cluster, and fault injection experiments show that the proposed method can achieve better VM fault-tolerant placement solution than that of the traditional first fit or best fit descending method.

29. Architecture-Based Reliability-Sensitive Criticality Measure for Fault-Tolerance Cloud Applications.

The widespread adoption of service computing allows software to be developed by outsourcing open cloud services (i.e., SOAP-based or RESTful Web APIs) through mashup or service composition techniques. Fault tolerance for the purpose of assuring the stable execution for cloud-based software (or CBS) application has attracted great attention in coping with a loosely coupled CBS operating under dynamic and uncertain running environments. It is too expensive to rent massively redundant cloud services for CBS fault tolerance application. To reduce budget but guarantee the effectiveness of CBS fault tolerance, identifying critical components within a CBS composite system is of significant importance. We integrate CBS composite system architecture analysis and reliability sensitivity analysis approaches and propose an Architecture-based Reliability-sensitive Criticality Measure (or ARCMeas) method in this paper. We verify ARCMeas application through a cost-effective fault tolerance CBS by presenting a particle swarm optimization (PSO)-based cost-effective fault tolerance strategy determination (or PSO-CFTD) algorithm. Experimental results suggest the effectiveness of the approach.

30. A resilient agent-based architecture for efficient usage of transient servers in cloud computing.

Unused resources are being exploited by cloud computing providers, which are offering transient servers without availability guarantees. Spot instances are transient servers offered by Amazon AWS, with rules that define prices according to supply and demand. These instances will run for as long as the current price is lower than the maximum bid price given by users. Spot instances have been increasingly used for executing computation and memory intensive applications. By using dynamic fault tolerant mechanisms and appropriate strategies, users can effectively use spot instances to run applications at a cheaper price. This paper presents a resilient agent-based cloud computing architecture. For an efficient usage of transient servers, the architecture combines machine learning and a statistical model to predict instance survival times, refine fault tolerance parameters and reduce total execution time. We evaluate our strategies and the experiments demonstrate high levels of accuracy, reaching a 94% survival prediction success rate, which indicates that the model can be effectively used to define execution strategies to prevent failures at revocation events under realistic working conditions.

31. The Future for Adaptive Software Development in Cloud Computing Environment Using Multi Agent System.

Cloud computing system provides large-scale infrastructure for high performance computing. The convergence of interests between multi-agent systems that need reliable infrastructures and cloud computing systems that need intelligent software with dynamic, flexible and autonomous behavior can result in new systems and applications. This paper presents a proposed system using the intelligent multi-agent system in cloud computing .When it comes to developing the way of thought in a given environment, it is essential to think on a critical level in order to reach a state of understanding and operating on a level that best suits the needs of the said environment through evaluating the status of a situation and working to reach a decision or take an action. The work presented aims to reach this state of smart decision-making that is essential to software developers when they are laying the foundations of the software internalization. The proposed system pioneers led the use of a multi-agent system in cloud computing, as there was no such initiative before in the development of software. The system proposed introduces a means for the autonomous decisiontaking and critical thinking that is necessary to develop and evaluate the social behavior, universality and adaptive roles that the XML rational agent which is responsible for the cloud reservoir is capable of.

32.Enabling scalable and fault-tolerant multi-agent systems by utilizing cloud-native computing.

Multi-agent systems (MAS) represent a distributed computing paradigm well suited to tackle today’s challenges in the field of the Internet of Things (IoT). Both share many similarities such as the interconnection of distributed devices and their cooperation. The combination of MAS and IoT would allow the transfer of the experience gained in MAS research to the broader range of IoT applications. The key enabler for utilizing MAS in the IoT is the ability to build large-scale and fault-tolerant MASs since IoT concepts comprise possibly thousands or even millions of devices. However, well known multi-agent platforms (MAP), e. g., Java Agent DE-velopment Framework (JADE), are not able to deal with these challenges. To this aim, we present a cloud-native Multi-Agent Platform (cloneMAP) as a modern MAP based on cloud-computing techniques to enable scalability and fault-tolerance. A microservice architecture is used to implement it in a distributed way utilizing the opensource container orchestration system Kubernetes. Thereby, bottlenecks and single-points of failure are conceptually avoided. A comparison with JADE via relevant performance metrics indicates the massively improved scalability. Furthermore, the implementation of a large-scale use case verifies cloneMAP’s suitability for IoT applications. This leads to the conclusion that cloneMAP extends the range of possible MAS applications and enables the integration with IoT concepts.

33. The intelligent agent-based information security model for cloud.

Today’s era is the era of cloud computing and agent-based processing. Data security and integrity are achieved by information security systems, which ensure the continuity of business and protect organizations against potential risks. Information security systems are used to estimate the risks and search the place of the occurrence of the risks. It should also be able to measure the risk consequences associated with cloud organizations. The cloud organizations must analyse the information system processes and they should develop their own information systems based on the analysis. This paper proposes a comprehensive Agent-Based Information Security framework for Cloud Computing. We have considered risk assessment methods for calculating consequences by focussing on potential threats, assets, vulnerabilities, and their associated measures. A decision system for the organizations is created by taking the help of intelligent (smart) and software agents that are used to fetch and group the relevant information used in a framework that decides against threats based on information provided by the security agents. We have used a fuzzy inference system based upon fuzzy set theory for creating a decision system.

34. Host Fault Injection Using Various Distribution Functions.

Nowadays, people are connected to the cloud, i.e., back end servers to implement various tasks such as storage of important data, running applications of higher compatibility and so on. While performing such tasks, a number of hosts within the system may encounter various faults resulting in their downfall. Once a system failure occurs, it has an effect on the execution of the tasks performed by the failed components such as hosts or virtual machines. Subsequently, establishing a requirement for a cloud climate that helps in running various hosts or virtual machines effectively in a cloud computing framework independent of the fault or failure happening inside the parts of the framework.

35. Software Fault Injection Campaign Generation for Cloud Infrastructures.

Software fault injection (SFI) is a versatile tool for dependability assessment. In this approach, various types of system failure causes, namely faults or defects [1], are artificially inserted (“injected”) into a running instance of the system. An behavioural investigation can show if the fault tolerance mechanisms of the system reacts in the intended way. While numerous SFI approaches have been proposed and implemented in the past decades [2], SFI seems to remain more of a research topic rather than a commonplace software development tool. Reasons for SFI’s lack of practical application may be usability issues, the challenge of finding an adequate and representative fault load [3], or the difficulty of answering the question of when and where to inject faults in order to achieve meaningful results. We present a methodology for generating fault injection campaigns, which comprises multiple steps: First, a dependability model of the system is constructed (see Section II). Second, a fault injection campaign satisfying desirable criteria is generated from it (see Section III). Third, the fault injection campaign is conducted in an automated and orchestrated way. If the campaign succeeds, this asserts that the system is as dependable as specified in the initial model. If not, the experiment results can help to pin-point the weak parts of the architecture. We demonstrate the applicability of our approach with an OpenStack-based scenario, which is described in Section IV.

Page updated

Google Sites

Report abuse