RPO: The recovery point objective (RPO) is the acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).
RTO: The recovery time objective (RTO) is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.
S3
Objects redundantly stored on multiple devices across multiple facilities within a Region, designed to provide a durability of 99.999999999%
Data protection with versioning, MFA, bucket policies, and IAM
Cross-region replication enables automatic, asynchronous copying of objects across buckets in different AWS Regions
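A minimal boto3 sketch of the protections above (versioning plus cross-region replication). The bucket names, destination bucket ARN, and replication role ARN are placeholders, and the destination bucket is assumed to already exist, with versioning enabled, in the other Region:

```python
import boto3

# Hypothetical names used only for illustration.
SOURCE_BUCKET = "my-source-bucket"
DEST_BUCKET_ARN = "arn:aws:s3:::my-dr-bucket"               # bucket in another Region
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"

s3 = boto3.client("s3")

# Versioning must be enabled on both buckets before replication can be configured.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Asynchronously copy every new object to the bucket in the other Region.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "dr-replication",
                "Status": "Enabled",
                "Prefix": "",                                # replicate all objects
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)
```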
S3 Glacier
Designed for the same durability as Amazon S3
An inventory of all archives in each of your vaults is maintained for disaster recovery or occasional reconciliation purposes
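A hedged boto3 sketch of requesting that vault inventory; "my-vault" is a placeholder vault name, and the result is fetched later with get_job_output once Glacier finishes the job (typically several hours):

```python
import boto3

glacier = boto3.client("glacier")

# Ask S3 Glacier to prepare the latest inventory of a vault.
job = glacier.initiate_job(
    accountId="-",                                   # "-" means the calling account
    vaultName="my-vault",
    jobParameters={"Type": "inventory-retrieval"},
)
print(job["jobId"])  # poll describe_job, then call get_job_output when complete
```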
EBS
Create point-in-time volume snapshots
Copy snapshots across Regions and accounts
Snapshots are stored in Amazon S3, taking advantage of Amazon S3's durability and availability
Volumes are replicated across multiple servers in an Availability Zone
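A small boto3 sketch of the snapshot-and-copy flow above; the volume ID and Region names are placeholders:

```python
import boto3

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder volume ID

ec2 = boto3.client("ec2", region_name="us-east-1")      # source Region
ec2_dr = boto3.client("ec2", region_name="us-west-2")   # DR Region

# Point-in-time snapshot of the volume (stored durably in Amazon S3).
snap = ec2.create_snapshot(VolumeId=VOLUME_ID, Description="nightly DR snapshot")

# Wait for the snapshot to complete, then copy it to the DR Region.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
copy = ec2_dr.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snap["SnapshotId"],
    Description="cross-Region copy for DR",
)
print(copy["SnapshotId"])
```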
Snowball
Using Snowball helps eliminate challenges that can be encountered with large-scale data transfers, such as high network costs, long transfer times, and security concerns
Snowball devices can help retrieve large data sets (>10 TB) much more quickly than over a high-speed internet connection.
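A rough back-of-the-envelope calculation (the figures are assumptions, and real links rarely sustain full line rate) showing why shipping a device can beat the network for data sets of this size:

```python
# Estimate how long moving a data set over a network link would take.
data_tb = 100                 # data set size in terabytes (assumed)
link_mbps = 1000              # sustained throughput of a 1 Gbps link (assumed)

data_megabits = data_tb * 8 * 1_000_000
seconds = data_megabits / link_mbps
print(f"~{seconds / 86_400:.1f} days")   # ~9.3 days at full line rate
```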
EFS
Amazon EFS File Sync can be used to sync files from on-premises or in-cloud file systems to Amazon EFS at speeds of up to 5x faster than standard Linux copy tools.
EC2
When you need to launch new Amazon EC2 instances, either when scaling to provide greater availability or as part of disaster recovery, use custom AMIs to help save time and effort.
You can arrange for automatic recovery of an EC2 instance when a system status check of the underlying hardware fails.
The instance will be rebooted (on new hardware if necessary) but will retain its instance ID, IP address, Elastic IP addresses, Amazon EBS volume attachments, and other configuration details.
For the recovery to be complete, you’ll need to make sure that the instance automatically starts up any services or applications as part of its initialization process.
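A possible boto3 sketch of the automatic-recovery setup above, implemented as a CloudWatch alarm on the system status check with the EC2 recover action; the instance ID, Region, and alarm thresholds are placeholders:

```python
import boto3

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder instance ID
REGION = "us-east-1"                  # placeholder Region

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Recover the instance automatically when the system status check fails.
cloudwatch.put_metric_alarm(
    AlarmName=f"auto-recover-{INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
)
```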
Custom Container Images
Because containers package your application, its dependencies, and its configuration, deploying additional containers is faster and easier than building replacement servers from scratch.
RDS
Snapshot data and save it in a separate Region
Can share a manual snapshot with up to 20 other AWS accounts
Combine Read Replicas with multi-AZ deployments (dependent on database engine)
Read Replicas can be promoted to become primary database instances in the event of primary database instance failure
Automatic backups available
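A hedged boto3 sketch of three of the actions above (cross-Region snapshot copy, snapshot sharing, and Read Replica promotion); all identifiers, account numbers, and Region names are placeholders:

```python
import boto3

SOURCE_REGION = "us-east-1"
DR_REGION = "us-west-2"
SNAPSHOT_ARN = "arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snap"

# Copy a manual snapshot into the DR Region.
rds_dr = boto3.client("rds", region_name=DR_REGION)
rds_dr.copy_db_snapshot(
    SourceDBSnapshotIdentifier=SNAPSHOT_ARN,
    TargetDBSnapshotIdentifier="mydb-snap-dr",
    SourceRegion=SOURCE_REGION,
)

# Share the manual snapshot with another AWS account.
rds = boto3.client("rds", region_name=SOURCE_REGION)
rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="mydb-snap",
    AttributeName="restore",
    ValuesToAdd=["210987654321"],
)

# Promote a Read Replica to a standalone primary if the original primary fails.
rds.promote_read_replica(DBInstanceIdentifier="mydb-replica")
```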
DynamoDB
Back up full tables to other Regions or to Amazon S3 within seconds
Point-in-time recovery enables you to continuously back up tables for up to 35 days
Initiate backups with a single click in the console or a single API call
Build multi-region, multi-master tables with global tables
Global tables are replicated across Regions
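A minimal boto3 sketch of enabling point-in-time recovery and taking an on-demand backup; the table name is a placeholder:

```python
import boto3

TABLE_NAME = "orders"   # placeholder table name

dynamodb = boto3.client("dynamodb")

# Enable continuous backups with point-in-time recovery (restore window up to 35 days).
dynamodb.update_continuous_backups(
    TableName=TABLE_NAME,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Take an on-demand backup with a single API call.
backup = dynamodb.create_backup(TableName=TABLE_NAME, BackupName=f"{TABLE_NAME}-dr")
print(backup["BackupDetails"]["BackupArn"])
```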
CloudFormation
Model your entire infrastructure in a text file, allowing for fast and consistent redeployment of failed/lost infrastructure
No need to perform manual actions or write custom scripts
Rolls back changes automatically in the event of an error
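A small boto3 sketch of redeploying infrastructure from a stored template; the template file and stack name are placeholders, and OnFailure="ROLLBACK" requests the automatic rollback behavior noted above:

```python
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-west-2")

# "dr-core-network.yaml" is a placeholder template describing the recovery VPC,
# subnets, load balancer, and similar core infrastructure.
with open("dr-core-network.yaml") as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName="dr-core-network",
    TemplateBody=template_body,
    OnFailure="ROLLBACK",           # undo partially created resources on error
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the stack is fully created (or the create fails and rolls back).
cloudformation.get_waiter("stack_create_complete").wait(StackName="dr-core-network")
```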
Elastic Beanstalk
Quickly redeploy your entire stack in a few clicks
Roll back to a previous version of your application if your updated version fails
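A minimal boto3 sketch of rolling an environment back to a previously deployed application version; the environment name and version label are placeholders:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Redeploy a known-good application version to the environment.
eb.update_environment(
    EnvironmentName="my-app-prod",   # placeholder environment
    VersionLabel="v42",              # placeholder previously deployed version
)
```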
OpsWorks
Automatic host replacement
Combine it with AWS CloudFormation in the recovery phase
Provision a new stack from the stored configuration quickly enough to support the defined RTO
Backup and Restore
Disaster recovery method with the longest RPO/RTO
For lower priority use cases
Primarily use Amazon S3 and AWS Storage Gateway
Preparation phase:
Take backups of current systems
Store backups in Amazon S3
Document the procedure to restore from backup on AWS:
Know which AMI to use; build your own as needed
Know how to restore system from backups
Know how to switch to new system
Know how to configure the deployment
In case of disaster:
Retrieve backups from Amazon S3
Bring up required infrastructure:
Amazon EC2 instances with prepared AMIs, ELB, etc.
Use AWS CloudFormation to automate deployment of core networking
Restore system from backup
Switch over to the new system:
Adjust DNS records to point to AWS
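For that final DNS step, a hedged Route 53 sketch; the hosted zone ID, record name, and the load balancer endpoint created during recovery are placeholders:

```python
import boto3

HOSTED_ZONE_ID = "Z1234567890ABC"                                # placeholder
RECORD_NAME = "app.example.com"                                  # placeholder
RECOVERED_ENDPOINT = "dr-alb-123456.us-west-2.elb.amazonaws.com" # placeholder

route53 = boto3.client("route53")

# Repoint the application's DNS record at the environment restored in AWS.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over to AWS after restoring from backup",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": RECOVERED_ENDPOINT}],
                },
            }
        ],
    },
)
```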
Storage Gateway
Connects an on-premises software appliance (or the AWS Storage Gateway Hardware Appliance) with cloud-based storage to provide seamless, highly secure integration between your on-premises IT environment and the AWS storage infrastructure
Supports industry-standard storage protocols that work with your existing applications
Integrated with Amazon CloudWatch, AWS CloudTrail, AWS KMS, IAM, and more
Virtual tape library (VTL): virtual tapes stored in Amazon S3 or Amazon S3 Glacier
Gateway-cached volumes: store primary data in Amazon S3 and retain frequently accessed data locally, giving substantial cost savings and low-latency access to the working set
Gateway-stored volumes: stores primary data locally and asynchronously backs up point-in-time snapshots of this data to Amazon S3
Pilot Light
Low cost, but an RPO/RTO of tens of minutes
Best for core application services
Based on maintaining a replicated but scaled-down, not-running copy of your infrastructure that your application can fail over to once it is activated
Preparation phase:
Set up Amazon EC2 instances to replicate or mirror data
Ensure that you have all supporting custom software packages available in AWS
Create and maintain AMIs of key servers where fast recovery is required
Regularly run these servers, test them, and apply any software updates and configuration changes
Consider automating the provisioning of AWS resources
In case of disaster:
Automatically bring up resources around the replicated core data set
Scale the system as needed to handle current production traffic
Switch over to the new system:
Adjust DNS records to point to AWS
Objectives:
RTO: As long as it takes to detect a need for disaster recovery and automatically scale up the replacement system
RPO: Depends on replication type
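For the "bring up resources" and "scale the system" steps in the pilot light procedure above, one possible approach is to keep an Auto Scaling group at (or near) zero capacity around the replicated data set and raise it during a disaster; the group name and sizes below are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# "pilot-light-web" is a placeholder Auto Scaling group kept at minimal capacity
# during normal operation; raising its size brings the application tier online.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="pilot-light-web",
    MinSize=2,
    DesiredCapacity=4,
    MaxSize=10,
)
```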
Warm Standby (low-capacity standby)
More expensive, but an RPO/RTO of minutes
Best for business-critical services
Can take some production traffic at any time, not just during disaster recovery
Cost footprint smaller than full disaster recovery
Preparation:
Like pilot light, but components are active 24/7
Not scaled for production traffic
Best practice: Continuous testing with a statistical subset of production traffic to a disaster recovery site
In case of disaster:
Immediately fail over the most critical production load:
Adjust DNS records to point to AWS
Scale the system automatically to handle all production load
Objectives:
RTO: For critical load, as long as it takes to fail over. For all other load, as long as it takes to scale further.
RPO: Depends on replication type
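One way to implement both the "statistical subset of production traffic" best practice and the DNS failover step above is a pair of Route 53 weighted records whose weights are shifted during a disaster; the hosted zone ID, record names, endpoints, and weights are placeholders:

```python
import boto3

HOSTED_ZONE_ID = "Z1234567890ABC"   # placeholder hosted zone

route53 = boto3.client("route53")

def set_weights(primary_weight, standby_weight):
    """Split traffic between the primary site and the warm standby in AWS."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": "primary-site",
                        "Weight": primary_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "primary.example.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": "warm-standby",
                        "Weight": standby_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "standby.us-west-2.example.com"}],
                    },
                },
            ]
        },
    )

set_weights(95, 5)     # normal operation: small test slice to the standby
# set_weights(0, 100)  # disaster: send all traffic to the warm standby
```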
Multi-Site
Most expensive, but a real-time RPO/RTO
Best for achieving as close to 100% availability as possible
Can take all production load at any moment
Preparation:
Similar to low-capacity standby
Fully scaled for production, scaling in/out with production load
In case of disaster:
Immediately fail over all production load
Objectives:
RTO: As long as it takes to fail over
RPO: Depends on replication type
Start Simple
Check for software licensing issues
Practice "Game Day" Exercises