RPO: The recovery point objective (RPO) is the acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).
RTO: The recovery time objective (RTO) is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.
S3
Objects redundantly stored on multiple devices across multiple facilities within a Region, designed to provide a durability of 99.999999999%
Data protection with versioning, MFA, bucket policies, and IAM
Cross-region replication enables automatic, asynchronous copying of objects across buckets in different AWS Regions
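A minimal boto3 sketch of the protections above (versioning plus cross-region replication). The bucket names, destination bucket ARN, and replication role ARN are placeholders, and the destination bucket is assumed to already exist, with versioning enabled, in the other Region:

```python
import boto3

# Hypothetical names used only for illustration.
SOURCE_BUCKET = "my-source-bucket"
DEST_BUCKET_ARN = "arn:aws:s3:::my-dr-bucket"               # bucket in another Region
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"

s3 = boto3.client("s3")

# Versioning must be enabled on both buckets before replication can be configured.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Asynchronously copy every new object to the bucket in the other Region.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "dr-replication",
                "Status": "Enabled",
                "Prefix": "",                                # replicate all objects
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)
```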
S3 Glacier
Designed for the same durability as Amazon S3
An inventory of all archives in each of your vaults is maintained for disaster recovery or occasional reconciliation purposes
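A hedged boto3 sketch of requesting that vault inventory; "my-vault" is a placeholder vault name, and the result is fetched later with get_job_output once Glacier finishes the job (typically several hours):

```python
import boto3

glacier = boto3.client("glacier")

# Ask S3 Glacier to prepare the latest inventory of a vault.
job = glacier.initiate_job(
    accountId="-",                                   # "-" means the calling account
    vaultName="my-vault",
    jobParameters={"Type": "inventory-retrieval"},
)
print(job["jobId"])  # poll describe_job, then call get_job_output when complete
```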
EBS
Create point-in-time volume snapshots
Copy snapshots across Regions and accounts
Snapshots are stored in Amazon S3, taking advantage of Amazon S3's durability and availability
Volumes are replicated across multiple servers in an Availability Zone
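A small boto3 sketch of the snapshot-and-copy flow above; the volume ID and Region names are placeholders:

```python
import boto3

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder volume ID

ec2 = boto3.client("ec2", region_name="us-east-1")      # source Region
ec2_dr = boto3.client("ec2", region_name="us-west-2")   # DR Region

# Point-in-time snapshot of the volume (stored durably in Amazon S3).
snap = ec2.create_snapshot(VolumeId=VOLUME_ID, Description="nightly DR snapshot")

# Wait for the snapshot to complete, then copy it to the DR Region.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
copy = ec2_dr.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snap["SnapshotId"],
    Description="cross-Region copy for DR",
)
print(copy["SnapshotId"])
```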
Snowball
Using Snowball helps eliminate challenges that can be encountered with large-scale data transfers, such as high network costs, long transfer times, and security concerns
Snowball devices can help retrieve large data sets (>10 TB) much more quickly than over a high-speed internet connection.
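A rough back-of-the-envelope calculation (the figures are assumptions, and real links rarely sustain full line rate) showing why shipping a device can beat the network for data sets of this size:

```python
# Estimate how long moving a data set over a network link would take.
data_tb = 100                 # data set size in terabytes (assumed)
link_mbps = 1000              # sustained throughput of a 1 Gbps link (assumed)

data_megabits = data_tb * 8 * 1_000_000
seconds = data_megabits / link_mbps
print(f"~{seconds / 86_400:.1f} days")   # ~9.3 days at full line rate
```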
EFS
Amazon EFS File Sync can be used to sync files from on-premises or in-cloud file systems to Amazon EFS at speeds of up to 5x faster than standard Linux copy tools.
EC2
When you need to launch new Amazon EC2 instances, either when scaling to provide greater availability or as part of disaster recovery, use custom AMIs to help save time and effort.
You can arrange for automatic recovery of an EC2 instance when a system status check of the underlying hardware fails.
The instance will be rebooted (on new hardware if necessary) but will retain its instance ID, IP address, Elastic IP addresses, Amazon EBS volume attachments, and other configuration details.
For the recovery to be complete, you’ll need to make sure that the instance automatically starts up any services or applications as part of its initialization process.
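A possible boto3 sketch of the automatic-recovery setup above, implemented as a CloudWatch alarm on the system status check with the EC2 recover action; the instance ID, Region, and alarm thresholds are placeholders:

```python
import boto3

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder instance ID
REGION = "us-east-1"                  # placeholder Region

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Recover the instance automatically when the system status check fails.
cloudwatch.put_metric_alarm(
    AlarmName=f"auto-recover-{INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
)
```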
Custom Container Images
Because containers package your application, its dependencies, and its configuration, deploying additional containers is faster and easier than building replacement servers from scratch.
RDS
Snapshot data and save it in a separate Region
Can share a manual snapshot with up to 20 other AWS accounts
Combine Read Replicas with multi-AZ deployments (dependent on database engine)
Read Replicas can be promoted to become primary database instances in the event of primary database instance failure
Automatic backups available
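A hedged boto3 sketch of three of the actions above (cross-Region snapshot copy, snapshot sharing, and Read Replica promotion); all identifiers, account numbers, and Region names are placeholders:

```python
import boto3

SOURCE_REGION = "us-east-1"
DR_REGION = "us-west-2"
SNAPSHOT_ARN = "arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snap"

# Copy a manual snapshot into the DR Region.
rds_dr = boto3.client("rds", region_name=DR_REGION)
rds_dr.copy_db_snapshot(
    SourceDBSnapshotIdentifier=SNAPSHOT_ARN,
    TargetDBSnapshotIdentifier="mydb-snap-dr",
    SourceRegion=SOURCE_REGION,
)

# Share the manual snapshot with another AWS account.
rds = boto3.client("rds", region_name=SOURCE_REGION)
rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="mydb-snap",
    AttributeName="restore",
    ValuesToAdd=["210987654321"],
)

# Promote a Read Replica to a standalone primary if the original primary fails.
rds.promote_read_replica(DBInstanceIdentifier="mydb-replica")
```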
DynamoDB
Back up full tables to other Regions or to Amazon S3 within seconds
Point-in-time recovery enables you to continuously back up tables for up to 35 days
Initiate backups with a single click in the console or a single API call
Build multi-region, multi-master tables with global tables
Global tables are replicated across Regions
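A minimal boto3 sketch of enabling point-in-time recovery and taking an on-demand backup; the table name is a placeholder:

```python
import boto3

TABLE_NAME = "orders"   # placeholder table name

dynamodb = boto3.client("dynamodb")

# Enable continuous backups with point-in-time recovery (restore window up to 35 days).
dynamodb.update_continuous_backups(
    TableName=TABLE_NAME,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Take an on-demand backup with a single API call.
backup = dynamodb.create_backup(TableName=TABLE_NAME, BackupName=f"{TABLE_NAME}-dr")
print(backup["BackupDetails"]["BackupArn"])
```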
CloudFormation
Model your entire infrastructure in a text file, allowing for fast and consistent redeployment of failed/lost infrastructure
No need to perform manual actions or write custom scripts
Rolls back changes automatically in the event of an error
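A small boto3 sketch of redeploying infrastructure from a stored template; the template file and stack name are placeholders, and OnFailure="ROLLBACK" requests the automatic rollback behavior noted above:

```python
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-west-2")

# "dr-core-network.yaml" is a placeholder template describing the recovery VPC,
# subnets, load balancer, and similar core infrastructure.
with open("dr-core-network.yaml") as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName="dr-core-network",
    TemplateBody=template_body,
    OnFailure="ROLLBACK",           # undo partially created resources on error
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the stack is fully created (or the create fails and rolls back).
cloudformation.get_waiter("stack_create_complete").wait(StackName="dr-core-network")
```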
Elastic Beanstalk
Quickly redeploy your entire stack in a few clicks
Roll back to a previous version of your application if your updated version fails
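A minimal boto3 sketch of rolling an environment back to a previously deployed application version; the environment name and version label are placeholders:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Redeploy a known-good application version to the environment.
eb.update_environment(
    EnvironmentName="my-app-prod",   # placeholder environment
    VersionLabel="v42",              # placeholder previously deployed version
)
```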
OpsWorks
Automatic host replacement
Combine it with AWS CloudFormation in the recovery phase
Provision a new stack from the stored configuration quickly enough to support the defined RTO
Backup and Restore
Disaster recovery method with the longest RPO/RTO
For lower priority use cases
Primarily use Amazon S3 and AWS Storage Gateway
Preparation phase:
Take backups of current systems
Store backups in Amazon S3
Document the procedure to restore from backup on AWS:
Know which AMI to use; build your own as needed
Know how to restore system from backups
Know how to switch to new system
Know how to configure the deployment
In case of disaster:
Retrieve backups from Amazon S3
Bring up required infrastructure:
Amazon EC2 instances with prepared AMIs, ELB, etc.
Use AWS CloudFormation to automate deployment of core networking
Restore system from backup
Switch over to the new system:
Adjust DNS records to point to AWS
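For that final DNS step, a hedged Route 53 sketch; the hosted zone ID, record name, and the load balancer endpoint created during recovery are placeholders:

```python
import boto3

HOSTED_ZONE_ID = "Z1234567890ABC"                                # placeholder
RECORD_NAME = "app.example.com"                                  # placeholder
RECOVERED_ENDPOINT = "dr-alb-123456.us-west-2.elb.amazonaws.com" # placeholder

route53 = boto3.client("route53")

# Repoint the application's DNS record at the environment restored in AWS.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over to AWS after restoring from backup",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": RECOVERED_ENDPOINT}],
                },
            }
        ],
    },
)
```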
Storage Gateway
Connects an on-premises software appliance (or the AWS Storage Gateway Hardware Appliance) with cloud-based storage to provide seamless, highly secure integration between your on-premises IT environment and the AWS storage infrastructure
Supports industry-standard storage protocols that work with your existing applications
Integrated with Amazon CloudWatch, AWS CloudTrail, AWS KMS, IAM, and more
Virtual tape library (VTL): virtual tapes stored in Amazon S3 or Amazon S3 Glacier
Gateway-cached volumes: store primary data in Amazon S3 and retain frequently accessed data locally, giving substantial cost savings and low-latency access to the working set
Gateway-stored volumes: stores primary data locally and asynchronously backs up point-in-time snapshots of this data to Amazon S3
Pilot Light
Low cost, but an RPO/RTO of tens of minutes
Best for core application services
Based on maintaining a replicated but scaled-down, not-running copy of your infrastructure that your application can fail over to once it is activated
Preparation phase:
Set up Amazon EC2 instances to replicate or mirror data
Ensure that you have all supporting custom software packages available in AWS
Create and maintain AMIs of key servers where fast recovery is required
Regularly run these servers, test them, and apply any software updates and configuration changes
Consider automating the provisioning of AWS resources
In case of disaster:
Automatically bring up resources around the replicated core data set
Scale the system as needed to handle current production traffic
Switch over to the new system:
Adjust DNS records to point to AWS
Objectives:
RTO: As long as it takes to detect a need for disaster recovery and automatically scale up the replacement system
RPO: Depends on replication type
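For the "bring up resources" and "scale the system" steps in the pilot light procedure above, one possible approach is to keep an Auto Scaling group at (or near) zero capacity around the replicated data set and raise it during a disaster; the group name and sizes below are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# "pilot-light-web" is a placeholder Auto Scaling group kept at minimal capacity
# during normal operation; raising its size brings the application tier online.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="pilot-light-web",
    MinSize=2,
    DesiredCapacity=4,
    MaxSize=10,
)
```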
Warm Standby (low-capacity standby)
More expensive, but an RPO/RTO of minutes
Best for business-critical services
Can take some production traffic at any time, not just during disaster recovery
Cost footprint smaller than full disaster recovery
Preparation:
Like pilot light, but components are active 24/7
Not scaled for production traffic
Best practice: Continuous testing with a statistical subset of production traffic to a disaster recovery site
In case of disaster:
Immediately fail over the most critical production load:
Adjust DNS records to point to AWS
Scale the system automatically to handle all production load
Objectives:
RTO: For critical load, as long as it takes to fail over. For all other load, as long as it takes to scale further.
RPO: Depends on replication type
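One way to implement both the "statistical subset of production traffic" best practice and the DNS failover step above is a pair of Route 53 weighted records whose weights are shifted during a disaster; the hosted zone ID, record names, endpoints, and weights are placeholders:

```python
import boto3

HOSTED_ZONE_ID = "Z1234567890ABC"   # placeholder hosted zone

route53 = boto3.client("route53")

def set_weights(primary_weight, standby_weight):
    """Split traffic between the primary site and the warm standby in AWS."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": "primary-site",
                        "Weight": primary_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "primary.example.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": "warm-standby",
                        "Weight": standby_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "standby.us-west-2.example.com"}],
                    },
                },
            ]
        },
    )

set_weights(95, 5)     # normal operation: small test slice to the standby
# set_weights(0, 100)  # disaster: send all traffic to the warm standby
```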
Multi-Site
Most expensive, but a real-time RPO/RTO
Best for achieving as close to 100% availability as possible
Can take all production load at any moment
Preparation:
Similar to low-capacity standby
Fully scaled for production, scaling in/out with production load
In case of disaster:
Immediately fail over all production load
Objectives:
RTO: As long as it takes to fail over
RPO: Depends on replication type
Start Simple
Check for software licensing issues
Practice "Game Day" Exercises