Cloud SaaS Availability Insight

Introduction

The availability of a Cloud SaaS product is determined by the following:

a. Underlying Cloud IaaS\SaaS Infrastructure

b. SaaS Application deployment strategy

c. SaaS Application design

Service Level Agreement (SLAs)

The service model is governed by the delivery of a service from a provider to a client, and it requires a defined agreement between the two.

A service-level agreement (SLA) is a commitment between a service provider and a client. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user.

Availability is defined with an SLA uptime percentage, which translates into a budget of allowed downtime per period (illustrated below).
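
To make the percentage concrete, here is a minimal Python sketch (the 99.9% and 99.99% figures are illustrative, not committed values) that converts an uptime percentage into allowed downtime per 30-day month:

    def allowed_downtime_minutes(uptime_percent, period_hours=30 * 24):
        # Downtime budget (in minutes) for one period at the given uptime percentage.
        return period_hours * 60 * (1 - uptime_percent / 100.0)

    print(allowed_downtime_minutes(99.9))   # ~43.2 minutes per 30-day month
    print(allowed_downtime_minutes(99.99))  # ~4.3 minutes per 30-day month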

Cloud IaaS\SaaS Infrastructure

The Cloud IaaS\SaaS infrastructure is provided by public cloud vendors (AWS, Azure, etc.) or by an organization's private cloud.

The AWS public Cloud is referenced in this document.

AWS SLAs

AWS IaaS EC2

Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the Amazon Web Services (AWS) cloud.

Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster.

EC2 Reference

AWS Regions and Availability Zones

Amazon EC2 is hosted in multiple locations world-wide. These locations are composed of regions and Availability Zones.

Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones.

Region and Availability Zones

Each region is completely independent. Each Availability Zone consists of one or more isolated data centers, but the Availability Zones in a region are connected through low-latency links.

Regions

Each Amazon EC2 region is designed to be completely isolated from the other Amazon EC2 regions. This achieves the greatest possible fault tolerance and stability.

Availability Zones

If you distribute your instances across multiple Availability Zones and one instance fails, you can design your application so that an instance in another Availability Zone can handle requests.
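
A minimal sketch (assuming Python with boto3 and configured AWS credentials; the region name is illustrative) that lists the available Regions and the Availability Zones of one Region:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region name is illustrative

    # List all Regions visible to this account.
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
    print("Regions:", regions)

    # List the Availability Zones of the Region the client is bound to.
    zones = ec2.describe_availability_zones()["AvailabilityZones"]
    for z in zones:
        print(z["ZoneName"], z["State"])  # e.g. us-east-1a available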

AWS SaaS ELB

Elastic Load Balancing automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, IP addresses, and Lambda functions. It can handle the varying load of your application traffic in a single Availability Zone or across multiple Availability Zones.

  • Elastic Load Balancing automatically distributes incoming traffic across multiple Availability Zones and ensures that only healthy targets receive traffic.

  • Application Load Balancer is best suited for load balancing of HTTP and HTTPS traffic

  • Network Load Balancer is best suited for load balancing of TCP traffic (connection termination and certificate handling happen at the service node)

  • Elastic Load Balancing can also load balance across a Region, routing traffic to healthy targets in different Availability Zones.

  • Elastic Load Balancing is capable of handling rapid changes in network traffic patterns. Additionally, deep integration with Auto Scaling ensures sufficient application capacity to meet varying levels of application load without requiring manual intervention.

  • Network Load Balancers use active and passive health checks to determine whether a target is available to handle requests. By default, each load balancer node routes requests only to the healthy targets in its Availability Zone. (Typically a TCP port check is used.)

  • ELB Reference
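
As a sketch of how health checks drive target selection (assuming boto3; the VPC ID, names, and /health path are placeholders), an Application Load Balancer target group can be created with an HTTP health check:

    import boto3

    elbv2 = boto3.client("elbv2")

    resp = elbv2.create_target_group(
        Name="app-targets",                 # hypothetical name
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0123456789abcdef0",      # placeholder
        TargetType="instance",
        HealthCheckProtocol="HTTP",
        HealthCheckPath="/health",          # the application must expose this endpoint
        HealthCheckIntervalSeconds=30,
        HealthyThresholdCount=2,
        UnhealthyThresholdCount=2,
    )
    print(resp["TargetGroups"][0]["TargetGroupArn"])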

AWS SaaS ASG

Amazon EC2 Auto Scaling helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application.

  • Your EC2 instances are organized into groups so that they can be treated as a logical unit for the purposes of scaling and management. When you create a group, you can specify its minimum, maximum, and desired number of EC2 instances.

  • Amazon EC2 Auto Scaling provides several ways for you to scale your Auto Scaling groups. For example, you can configure a group to scale based on the occurrence of specified conditions (dynamic scaling) or on a schedule.

  • The health status of an Auto Scaling instance is either healthy or unhealthy.

  • Amazon EC2 Auto Scaling periodically performs health checks on the instances in your Auto Scaling group and identifies any instances that are unhealthy. After Amazon EC2 Auto Scaling marks an instance as unhealthy, it is scheduled for replacement.

  • If you have custom health checks, you can send the information from your health checks to Amazon EC2 Auto Scaling so that Amazon EC2 Auto Scaling can use this information.

  • ASG Health Checks

  • EC2 Auto Scaling
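
A sketch of an Auto Scaling group with minimum, maximum, and desired capacity, spread across two Availability Zones and using ELB health checks (assuming boto3; the launch template, subnets, and target group ARN are placeholders):

    import boto3

    asg = boto3.client("autoscaling")

    asg.create_auto_scaling_group(
        AutoScalingGroupName="app-asg",                       # hypothetical name
        LaunchTemplate={"LaunchTemplateName": "app-lt"},      # placeholder launch template
        MinSize=2,
        MaxSize=6,
        DesiredCapacity=2,
        VPCZoneIdentifier="subnet-aaa,subnet-bbb",            # subnets in two AZs (placeholders)
        TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-targets/abc123"],  # placeholder
        HealthCheckType="ELB",            # replace instances the load balancer marks unhealthy
        HealthCheckGracePeriod=120,
    )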

AWS SaaS Connection Draining

  • When you enable Connection Draining on a load balancer, any back-end instances that you deregister will complete requests that are in progress before deregistration. Likewise, if a back-end instance fails health checks, the load balancer will not send any new requests to the unhealthy instance but will allow existing requests to complete.

  • This means that you can perform maintenance such as deploying software upgrades or replacing back-end instances without impacting your customers’ experience.

  • Connection Draining is also integrated with Auto Scaling, making it even easier to manage the capacity behind your load balancer. When Connection Draining is enabled, Auto Scaling will wait for outstanding requests to complete before terminating instances.

  • ELB Connection Draining – Remove Instances From Service With Care
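
A sketch of enabling Connection Draining on a Classic Load Balancer (assuming boto3; the load balancer name is a placeholder). On ALB/NLB target groups the equivalent setting is the deregistration delay:

    import boto3

    elb = boto3.client("elb")  # Classic Load Balancer API

    elb.modify_load_balancer_attributes(
        LoadBalancerName="app-clb",  # hypothetical name
        LoadBalancerAttributes={
            # Deregistering or unhealthy instances get up to 300 seconds to finish
            # in-flight requests before being removed from service.
            "ConnectionDraining": {"Enabled": True, "Timeout": 300}
        },
    )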

Increase the Availability of Your Application on Amazon EC2

    • Suppose that you start out running your app or website on two EC2 instances.

    • Use Elastic Load Balancing to distribute incoming traffic for your application across these EC2 instances. This increases the availability of your application.

    • Placing your instances in multiple Availability Zones also improves the fault tolerance of your application.

    • If one Availability Zone experiences an outage, traffic is routed to the other Availability Zone.

    • You can use Amazon EC2 Auto Scaling to maintain a minimum number of running instances for your application at all times. Amazon EC2 Auto Scaling can detect when your instance or application is unhealthy and replace it automatically to maintain the availability of your application.

    • You can also use Amazon EC2 Auto Scaling to scale your Amazon EC2 capacity up or down automatically based on demand, using criteria that you specify.

AWS SaaS S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

    • This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

  • Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web.

  • S3 Reference
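
A minimal sketch of the S3 interface (assuming boto3; the bucket and key are placeholders) for storing and retrieving an object:

    import boto3

    s3 = boto3.client("s3")

    # Store an object.
    s3.put_object(Bucket="example-bucket", Key="backups/config.json", Body=b'{"k": "v"}')

    # Retrieve it again.
    obj = s3.get_object(Bucket="example-bucket", Key="backups/config.json")
    print(obj["Body"].read())  # b'{"k": "v"}'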

AWS SaaS SQS

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.

    • SQS eliminates the complexity and overhead associated with managing and operating message oriented middleware, and empowers developers to focus on differentiating work.

    • Using SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available.

Standard Queue - For Event Throughput

    • Unlimited Throughput – Standard queues support a nearly unlimited number of transactions per second (TPS) per action.

    • At-Least-Once Delivery – A message is delivered at least once, but occasionally more than one copy of a message is delivered, so consumers should be idempotent.

    • Best-Effort Ordering – Occasionally, messages might be delivered in an order different from which they were sent.

FIFO Queue - For Event Ordering

    • High Throughput – By default, FIFO queues support up to 3,000 messages per second with batching. To request a limit increase, file a support request. FIFO queues support up to 300 messages per second (300 send, receive, or delete operations per second) without batching.

    • Exactly-Once Processing – A message is delivered once and remains available until a consumer processes and deletes it. Duplicates aren't introduced into the queue.

    • First-In-First-Out Delivery – The order in which messages are sent and received is strictly preserved.
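
A sketch of FIFO queue usage (assuming boto3; the queue URL is a placeholder). MessageGroupId preserves ordering within a group and MessageDeduplicationId prevents duplicate enqueues:

    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/events.fifo"  # placeholder

    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody='{"event": "user_signup"}',
        MessageGroupId="user-events",          # ordering is preserved per group
        MessageDeduplicationId="signup-42",    # duplicates with the same id are dropped
    )

    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        print("processing", msg["Body"])                       # stand-in for real work
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])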

AWS SaaS DynamoDB

Amazon DynamoDB is a NoSQL database that supports key-value and document data models, and enables developers to build modern, serverless applications that can start small and scale globally to support petabytes of data and tens of millions of read and write requests per second. DynamoDB is designed to run high-performance, internet-scale applications that would overburden traditional relational databases.

    • Built-in support for ACID transactions

    • On-demand backups and point-in-time recovery

    • Encryption at rest

  • DynamoDB global tables replicate your data automatically across your choice of AWS Regions

  • DynamoDB Reference
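
A minimal sketch (assuming boto3 and an existing table; the table and key names are placeholders) of writing an item and reading it back with a strongly consistent read:

    import boto3

    table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

    # Write an item.
    table.put_item(Item={"order_id": "o-1001", "status": "PLACED"})

    # Strongly consistent read returns the latest committed value.
    resp = table.get_item(Key={"order_id": "o-1001"}, ConsistentRead=True)
    print(resp.get("Item"))  # {'order_id': 'o-1001', 'status': 'PLACED'}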

AWS API Gateway

Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.

    • API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.

  • API Gateway also helps protect your existing services by enforcing throttling rules to ensure that your backend can withstand unpredictable spikes in traffic.

    • Supports burst-limit, rate-limit and quota management

    • Supports Quota limit consumption based on API response parsing

    • Supports usage plans with API key attachment

  • API Gateway Reference

  • API Gateway Docs
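
A sketch of a usage plan that enforces rate, burst, and quota limits with an API key attached (assuming boto3; the API, stage, and key IDs are placeholders):

    import boto3

    apigw = boto3.client("apigateway")

    plan = apigw.create_usage_plan(
        name="standard-tier",                              # hypothetical plan
        throttle={"rateLimit": 100.0, "burstLimit": 200},  # steady-state and burst limits
        quota={"limit": 100000, "period": "MONTH"},        # monthly request quota
        apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],  # placeholders
    )

    apigw.create_usage_plan_key(
        usagePlanId=plan["id"],
        keyId="k1l2m3n4o5",   # existing API key id (placeholder)
        keyType="API_KEY",
    )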

AWS Downtime - What does it mean?

AWS is deployed as distributed, independent Regions and Availability Zones.

Downtime in AWS is always contextual: it is specific to an AWS service in a particular Region\AZ.

Saying that "AWS is down" is therefore an incorrect statement.

Case-Study: April, 2011 AWS EC2 downtime in specific Region\AZ

  • Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region

    • The impact to running instances was limited to the affected Availability Zone. As a result, many users who wrote their applications to take advantage of multiple Availability Zones did not have significant availability impact as a result of this event.

  • Some customers’ applications (or critical components of the application like the database) are deployed in only a single Availability Zone, while others have instances spread across Availability Zones but still have critical, single points of failure in a single Availability Zone.

  • We will look to provide customers with better tools to create multi-AZ applications that can support the loss of an entire Availability Zone without impacting application availability.

  • In this event, some customers were seriously impacted, and yet others had resources that were impacted but saw nearly no impact on their applications.

Key Takeaway

  • Design application deployment with redundancy across AZs (possibly Regions)

  • Design without Single Point of Failure

High Availability

High availability is a fundamental requirement when building software solutions in a cloud environment.

Traditionally, high availability has been a very costly affair, but with AWS one can leverage a number of services to achieve high availability, or potentially an “always available” posture.

Availability is a chain of responsibility spanning the infrastructure components and the application.

Use Multiple Availability Zones

Use Multiple Regions (cross-Region load distribution is typically handled at the DNS layer)

    • In the event of a regional failure, switching between individual deployments in different Regions can be done using DNS fail-over routing

No Single Point of Failure

    • Consider backend components (database or compute).

    • Use Active-Standby or Active-Active

Build Loose Coupling

    • The failure of one component does not bring the whole system down

  • The application should be built from individual small modules, a.k.a. microservices. Each module should be a black box, and modules should be fairly independent.

  • Use queues to pass messages between these micro-services.

    • Use dynamic service-discovery to communicate between components

    • Load-balance component instances for transparent consumption & fault tolerance

Implement Elasticity

  • Plan for failures of any individual component of the overall system

    • Use ASG to quickly start new EC2 nodes under high load so that the overall application does not go down

    • Use Bootstrapping to quickly build a new environment

Resilience and Fault Tolerance

    • ELB\ASG and health checks (Consul or otherwise); a health-check endpoint sketch follows this list

    • Frontend node: ELB ensures requests go only to healthy nodes

    • ASG purges a faulty node based on its health check and spawns a new node

    • Server node: the server cluster can share a feedback loop so that traffic skips unhealthy nodes

    • Raise an alert for a consistently unhealthy state

    • AWS AZ or a region going down is a Fault and not a Disaster

    • AWS AZ goes down: other AZs ensure no downtime; ASG auto-spawns replacement capacity

    • AWS Region goes down: DNS routing to another Region (and its AZs) ensures no downtime; ASG auto-spawns replacement capacity
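
The health-check sketch referenced above: a minimal /health endpoint (Python standard library only) that an ELB target group, an ASG "ELB" health check, or a Consul HTTP check could probe. A real service would verify its dependencies inside dependencies_ok():

    from http.server import BaseHTTPRequestHandler, HTTPServer

    def dependencies_ok():
        return True  # stand-in for real checks (DB reachable, queue depth sane, ...)

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health" and dependencies_ok():
                self.send_response(200)
                body = b"ok"
            else:
                self.send_response(503)  # unhealthy: LB stops routing, ASG may replace the node
                body = b"unhealthy"
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()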

Business Continuity and Disaster Recovery

    • Consul KV config\metadata backup in an object store (S3); see the sketch after this list

    • If the entire VPC deployment goes down:

    • Bring up a new VPC via the auto-build Jenkins pipeline

    • Restore config\metadata (Consul\etc) from backup

    • Minimize downtime to achieve committed SLA (e.g. 99.9%)
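
A sketch of the Consul KV backup step (assuming Python with the requests package and boto3, a local Consul agent, and a placeholder S3 bucket):

    import json
    import requests
    import boto3

    CONSUL = "http://localhost:8500"

    # Export the whole KV tree (values are returned base64-encoded by Consul).
    kv_dump = requests.get(f"{CONSUL}/v1/kv/", params={"recurse": "true"}).json()

    # Store the dump outside the deployment so it survives a full VPC loss.
    boto3.client("s3").put_object(
        Bucket="example-dr-backups",              # placeholder bucket
        Key="consul/kv-backup.json",
        Body=json.dumps(kv_dump).encode(),
    )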

Scalability (Horizontal\Auto)

    • Horizontal Vs Vertical scalability

    • Leverage Horizontal scaling

    • Define scale-up\down conditions for each micro-service

    • Leverage the auto-scale service (ASG) to auto scale up on demand; a scaling-policy sketch follows this list

    • Define scale down with safe state & connection-draining

    • Leverage ASG & connection-draining to auto scale-down on-demand

    • Ensure conservative scale-down to prevent flapping
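
The scaling-policy sketch referenced above (assuming boto3; the group name is a placeholder): a target-tracking policy on average CPU, which scales in conservatively and so helps prevent flapping:

    import boto3

    asg = boto3.client("autoscaling")

    asg.put_scaling_policy(
        AutoScalingGroupName="app-asg",            # hypothetical group
        PolicyName="cpu-target-60",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 60.0,                   # add capacity above ~60% CPU, remove below
        },
    )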

SaaS Application Deployment pattern

Rolling Deployment

A rolling deployment works by scaling up with new-version nodes and then scaling down the old nodes.

This pattern requires safe-state scale-down by connection draining.

The rollback is not transparent; the workflow depends on the phase of the deployment:

If failure occurs during scale-up with the new-version nodes, the new nodes need to be scaled down.

If failure occurs during scale-down of the old nodes, a fresh rolling deployment with the old version is required.

Blue-Green Deployment

Blue-green deployments are a pattern whereby we reduce downtime during production deployments by having two production environments ("blue" and "green"), as identical as possible.

At any time one of them, let's say blue for the example, is live. As you prepare a new release of your software you do your final stage of testing in the green environment. Once the software is working in the green environment, you switch the DNS so that all incoming requests go to the green environment - the blue one is now idle.

This requires connection draining during the DNS switch.

In a blue-green deployment model, the production environment changes with each release:

    • Downtime: Reduce deployment downtime

    • Staging: when blue is active, green becomes the staging environment for the next deployment.

    • Rollback: we deploy to blue and make it active. Then a problem is discovered. Since green still runs the old code, we can roll back easily.

    • Disaster recovery: after deploying to blue and we're satisfied that it is stable, we can deploy the new release to green too. This gives us a standby environment ready in case of disaster.
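
A sketch of the DNS cutover (assuming boto3 and Route 53; the hosted zone ID and hostnames are placeholders). Rollback is the same call pointed back at the blue endpoint:

    import boto3

    route53 = boto3.client("route53")

    def point_production_at(target_dns_name):
        route53.change_resource_record_sets(
            HostedZoneId="Z0000000000000",   # placeholder hosted zone
            ChangeBatch={
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "TTL": 60,           # a short TTL keeps the switch (and rollback) fast
                        "ResourceRecords": [{"Value": target_dns_name}],
                    },
                }]
            },
        )

    point_production_at("green-lb.example.com")   # cutover; use the blue LB name to roll back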

Canary Deployment

Canary deployments are a pattern for rolling out releases to a subset of users or servers. The idea is to first deploy the change to a small subset of servers, test it, and then roll the change out to the rest of the servers. The canary deployment serves as an early warning indicator with less impact on downtime: if the canary deployment fails, the rest of the servers aren't impacted.

The basic steps of a canary deployment are:

    • Deploy to one or more canary servers.

    • Test, or wait until satisfied.

    • Deploy to the remaining servers.

Capacity Planning

Capacity planning refers to planning the resources needed to scale up the deployment.

An AWS AZ might also reach its capacity limit at some point, so multi-AZ or multi-Region capacity is worth planning in advance.

AWS EC2 offers the option to reserve capacity, which prevents capacity-planning surprises.

Another aspect of capacity planning is any on-prem, datacenter-based appliance servers, which must be procured in advance to cover potential future capacity needs.

SaaS Application Design

Service Discovery

SOA requires microservices to communicate with each other. Because microservice instances are dynamic in nature, dynamic discovery is required.

This requirement is fulfilled by each service registering itself at boot time and performing service discovery when communicating with a destination service.

Tools such as Consul and Kubernetes provide the service-discovery feature.

Register a service

Discover a service

Refer Consul
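
A sketch of registration and discovery against a local Consul agent (assuming the Python requests package; the service name, port, and check URL are illustrative):

    import requests

    CONSUL = "http://localhost:8500"

    # Register at boot time, attaching an HTTP health check.
    requests.put(f"{CONSUL}/v1/agent/service/register", json={
        "Name": "billing",
        "Port": 8080,
        "Check": {"HTTP": "http://localhost:8080/health", "Interval": "10s"},
    })

    # Discover only instances whose health check is passing.
    nodes = requests.get(f"{CONSUL}/v1/health/service/billing", params={"passing": "true"}).json()
    for entry in nodes:
        svc = entry["Service"]
        print(svc["Address"] or entry["Node"]["Address"], svc["Port"])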

Another approach is to front the instances of a microservice with a load balancer; the load balancer then becomes the point of invocation for the source service.

The load-balancer configuration is generated dynamically from service discovery.

Policy Trail

  • Applications and deployment require bootstrapping configuration (a.k.a. policy)

    • Manual policy push leads to human error

    • Policy Trail is a method to store, review and push policy

    • Storing policy in a GitHub repository in JSON format achieves this workflow

Consistency

Strict consistency: For any incoming write operation, once a write is acknowledged to the client, the following holds true:

The updated value is visible on read from any node.

The update is protected from node failure with redundancy.

Eventual consistency: Weakens the above conditions by adding the word “eventually” and adds the condition “provided there are no permanent failures”.

Clearly, Strict Consistency is better because the user is guaranteed to always see the latest data, and data is protected as soon as it is written.

So, why don’t we always make systems Strictly Consistent? First – because under some scenarios, the implementation of Strict Consistency can significantly impact performance (Latency and Throughput).

Second – Strict Consistency isn’t always required and Eventual Consistency may suffice in some use cases. For example, in a shopping cart, say an item addition happened and the datacenter failed, it is OK for customers to add that item again. Eventual Consistency would be sufficient.

However, you wouldn’t want this happening to your bank account with a deposit you just made. It simply cannot vanish because a node failed somewhere in the distributed system. Strict Consistency is required.

Performance

Performance evaluation of an SOA requires a thoughtful approach, very different from the traditional one.

The key measures are baselining and ongoing monitoring-based performance.

  • Baselining refers to the performance capacity of a deployment of a defined size

    • Live performance is SLI\SLA-based monitoring of key attributes

    • Scalability ensures that performance is maintained within the committed SLAs

    • The performance of an SOA is monitored live, rather than only being tested

The deployment heartbeat is another key measurement of a deployment.

This is achieved by a periodic (~10-second) external connection that measures\pushes the pulse in the form of monitoring stats.

Monitoring Pipeline

SOA monitoring requires stats from both the application and the system.

This is achieved with a monitoring pipeline:

- Statsd

- Graphite

- Grafana

- Alerts

  - Threshold on a Graphite metric

  - Consul event-based alerts

- Delivery

  - PagerDuty

  - MS Teams
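
A sketch of how a micro-service pushes application stats into this pipeline (assuming the Python statsd package and a local statsd daemon that forwards to Graphite; the metric names are illustrative):

    import statsd

    metrics = statsd.StatsClient("localhost", 8125, prefix="billing")  # prefix is illustrative

    metrics.incr("requests")            # counter: aggregated and graphed in Grafana
    metrics.timing("latency_ms", 42)    # timer: latency distribution per flush interval
    metrics.gauge("queue_depth", 3)     # gauge: current value, useful for threshold alerts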

Logging Pipeline

The application\system logs need to be streamed and stored outside the deployment for debugging and processing.

The logging pipeline refers to this workflow.

Deployment Artifact Pipeline

The deployment artifact pipeline refers to packaging the source code and dependencies into a deployment artifact.

Typically, the pipeline is multi-level.

VM based pipeline

  • Source code & dependencies are packaged as a Debian package

  • The Debian package is baked into an AMI using the Packer and Ansible tools

  • Terraform deploys the AMI artifact

Container based pipeline

    • Source code & dependencies are packaged as a Debian package

    • The Debian package is packaged into a Docker container

    • Kubernetes deploys the Docker container artifact

Continuous Integration Pipeline

Continuous integration is the process of automatically building and testing your software on a regular basis. It involves building the software and running full unit and integration tests for every commit.

The steps include the following:

  • Code merge to master post PR

  • Auto trigger of Jenkins to generate debian package

  • Auto trigger of Jenkins to generate AMI with latest debian

  • Auto trigger of CI pipeline

  • Auto Trigger Terraform module for AMI deployment in CI VPC

  • Auto Trigger CI automation execution

  • Generate and publish CI report

Continuous Delivery & Deployment Pipeline

Continuous Delivery is a logical step forward from continuous integration towards the availability of release-candidate code\binaries.

Continuous Deployment is the actual delivery\deployment of features and fixes to the customer as soon as they are ready. It can be automated to trigger from continuous delivery.

The steps include the following:

  • Promote CI AMIs and deploy to Stage VPC

  • Auto trigger Smoke automation on Stage VPC

  • Observe Stage VPC for 3 days for success

  • Promote Stage AMIs and deploy to Prod Stand-by VPC (a.k.a. Blue-Green VPC)

  • Auto trigger Smoke automation on Prod Stand-by VPC

    • Auto DNS switch for Prod

    • Observe the Prod VPC for failures; roll back with a DNS switch on failure

Raising the Bar

a. Micro-services auto-packaged as AMI & Debian

b. Code Static Analysis

c. Continuous Delivery and Integration

d. Sanity Automation

e. Policy Trail for config changes

f. Cloud best-practices using ELB\ASG for fault-tolerance and eventual consistency

g. Zero production surprises: enforce a workflow that deploys only QA-signed-off AMIs (post integration validation).

h. Remote micro-service logging with infrastructure, a.k.a. Logging as a Service

i. Metric based monitoring using Grafana

j. Alerts infrastructure using Graphite threshold and delivery as PagerDuty\Other incidents and email.

k. Disaster Recovery with consul kv config\metadata backup in S3

l. Defined common libraries for cloud (AWS) and utilities, segregated for use across products.

Drawbacks of SaaS

o Robustness:

• SaaS software may not be as robust (functionality wise) as traditional software applications due to browser limitations. Consider Google Doc & Microsoft Office.

o Privacy

• Having all of a user’s data sit in the cloud raises security & privacy concerns, and SaaS providers are frequently the target of hacking exploits.

o Security

• Attack detection, malicious code detection

o Reliability:

• In the rare event of a SaaS provider going down, a wide range of dependent clients could be affected. For example, when the Amazon EC2 service went down in April 2011, it took down Foursquare, Reddit, Quora, and other well-known applications that run on it.

References