AWS disaster recovery architecture

Even though data may be replicated between Regions, we must still back up the data as part of DR. The recovery point objective (RPO) determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
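The relationship between recovery points and acceptable data loss can be sketched in a few lines of Python. This is only an illustration: the function names, the hourly backup schedule, and the one-hour RPO are assumptions made for the example, not values from any AWS service.

```python
from datetime import datetime, timedelta


def data_loss_window(recovery_points, interruption):
    """Return the time between the latest usable recovery point and the interruption."""
    usable = [point for point in recovery_points if point <= interruption]
    if not usable:
        raise ValueError("no recovery point exists before the interruption")
    return interruption - max(usable)


def meets_rpo(recovery_points, interruption, rpo):
    """True if the data lost at the interruption stays within the RPO."""
    return data_loss_window(recovery_points, interruption) <= rpo


# Hypothetical hourly backups (00:00 through 05:00) and an outage at 05:40.
backups = [datetime(2022, 1, 1, hour) for hour in range(6)]
outage = datetime(2022, 1, 1, 5, 40)

print(data_loss_window(backups, outage))                # 0:40:00
print(meets_rpo(backups, outage, timedelta(hours=1)))   # True
```

With a 30-minute RPO the same schedule would fail the check, which is the kind of gap that argues for more frequent recovery points or continuous replication.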

For workloads on existing physical or virtual data centers or private clouds, CloudEndure Disaster Recovery is an option. Instead of creating individual Amazon Elastic Compute Cloud (Amazon EC2) instances, create worker nodes using an Amazon EC2 Auto Scaling group.

Note: Amazon Redshift may also relocate clusters in non-AZ failure situations, such as when issues in the current AZ prevent optimal cluster operation, or to improve service availability. When Amazon Redshift relocates a cluster to a new AZ, the new cluster has the same endpoint as the original cluster.

In part I of this series, we introduced a disaster recovery (DR) concept that uses managed services through a single AWS Region strategy. Now let's learn about the pilot light and warm standby strategies. Both strategies replicate data from the primary Region to data resources in the recovery Region, such as Amazon Relational Database Service (Amazon RDS) DB instances or Amazon DynamoDB tables. The recovery Region environment includes support infrastructure such as Amazon Virtual Private Cloud (Amazon VPC) with subnets and routing configured, Elastic Load Balancing, and Amazon EC2 Auto Scaling groups. In Figure 3, we show how active/passive works.

Increases in RTO and RPO are fine, as long as business objectives can be met. Therefore, you must choose RTO and RPO objectives that provide appropriate value for your workload.

As lead solutions architect for the AWS Well-Architected Reliability pillar, I help customers build resilient workloads on AWS.

Deploying your data nodes into three AZs with Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) can improve the availability of your domain and increase your workload's tolerance for AZ failures. Parts II and III of this series will show you how to implement this service in a multi-Region DR deployment. The single-Region, multi-AZ strategy replicates workloads across multiple AZs and continuously backs up your data to another Region with point-in-time recovery, so your application is safe even if all AZs within your source Region fail.

Recovery Time Objective (RTO) is defined by the organization. For RTO and RPO, lower numbers represent less downtime and data loss. To validate the implementation, regularly test failover to DR to ensure that RTO and RPO are met.

Take automatic, incremental snapshots of your data periodically with Amazon Redshift and save them to Amazon S3. Figure 2 shows an EC2 Auto Scaling group that is configured, but it has no deployed EC2 instances. Fully automatic failover should be used with caution.

To select the best strategy, you must analyze benefits and risks with the business owner of a workload, as informed by engineering/IT. RPO is the maximum acceptable amount of time since the last data recovery point. The pilot light and warm standby strategies both offer a good balance of benefits and cost, as shown in Figure 1.

What does static stability mean with regard to a multi-Region disaster recovery (DR) plan? Service validation tests provide metrics on the function and correctness of your API operations.

My subsequent posts shared details on the backup and restore, pilot light, and warm standby active/passive strategies. You can also use these for Multi-AZ strategies or hybrid (on-premises workload/cloud recovery) strategies. Figure 4 shows an active/active strategy where two or more Regions are actively accepting requests and data is replicated between them. In Figure 6, Amazon Aurora global database replicates data to a local read-only cluster in the recovery Region. Backups are created in the same Region as their source and are also copied to another Region. The parameter value can be set via the AWS Management Console as shown in Figure 4.

Brent Kim is an Advisory Consultant within the AWS ProServe SDT Advisory group, and has been with AWS for 3 years.
When you use the best practices provided in the AWS Well-Architected Reliability Pillar whitepaper to design your DR strategy, your workloads can remain available despite disaster events. This gives you the most effective protection from disasters of any scope of impact.

Having backups and redundant workload components in place is the start of your DR strategy. Implement a strategy to meet your RTO and RPO objectives, taking workload locations into account.

Each DR strategy will be detailed in future blog posts; the following sections summarize each strategy. Choose a strategy such as backup and restore, active/passive (pilot light or warm standby), or active/active, based on your RTO and RPO needs. DR strategies involve trade-offs between RTO/RPO and costs.

Ultimately, any event that prevents a workload or system from fulfilling its business objectives in its primary location is classified as a disaster. Physical separation between AZs significantly reduces the risk of a single event impacting more than one AZ. When you deploy the data nodes across three AZs with one replica enabled, shards are distributed across the three AZs.

CloudEndure Disaster Recovery, available through AWS Marketplace, enables organizations to set up automated disaster recovery. AWS Config continuously monitors and records your AWS resource configurations.

The data resources in the recovery Region are ready to serve requests. A warm standby deployment may run more than the minimum, but it is always smaller than the full production deployment, for cost savings. In the case of disaster events that wipe out or corrupt your data, backups let you rewind to a last known good state.

2022, Amazon Web Services, Inc. or its affiliates.

If a disaster event occurs and the active Region cannot support workload operation, then the passive site becomes the recovery site (recovery Region). With AWS Global Accelerator, you can likewise use endpoint health checks for automatic routing, or set the percentage of traffic sent to each endpoint using traffic dials. With Application Recovery Controller, you can create Route 53 health checks that do not actually check health, but instead act as on/off switches that you have full control over.

AWS offers resources and services to build a DR strategy that meets your business needs. Amazon Relational Database Service (Amazon RDS) handles failovers automatically so you can resume database operations as quickly as possible. The RTO determines what is considered an acceptable time window when service is unavailable. AWS CloudFormation is also a powerful tool for making these updates.
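The idea of a health check that acts as an operator-controlled on/off switch can be modeled in plain Python. The sketch below is a toy model only: it does not call Application Recovery Controller or Route 53, and the class name, function name, and Region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoutingSwitch:
    """Toy stand-in for a health check that reports whatever the operator sets."""
    region: str
    on: bool


def choose_region(primary: RoutingSwitch, recovery: RoutingSwitch) -> str:
    """Send traffic to the primary Region while its switch is on, else fail over."""
    if primary.on:
        return primary.region
    if recovery.on:
        return recovery.region
    raise RuntimeError("both switches are off; no Region can receive traffic")


# Hypothetical Regions; normal operation routes to the primary.
primary = RoutingSwitch("us-east-1", on=True)
recovery = RoutingSwitch("us-west-2", on=False)
print(choose_region(primary, recovery))  # us-east-1

# Manual failover: the operator flips the switches.
primary.on, recovery.on = False, True
print(choose_region(primary, recovery))  # us-west-2
```

Because the switch state is set by a person or a runbook rather than inferred from probes, failover happens only when someone decides it should, which fits the caution about fully automatic failover.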

Backup and restore: Back up your data and applications using point-in-time backups into the DR Region. In part two, we introduce a multi-Region backup and restore approach.

DR is a crucial part of your Business Continuity Plan. Disaster events pose a threat to your workload availability, but by using AWS Cloud services you can mitigate or remove these threats. Recovery Point Objective (RPO) is defined by the organization.

If you have a complex or critical recovery path, you still need to regularly execute that failure in production. A pattern to avoid is developing recovery paths that are rarely executed. You can use AWS resources like Amazon EventBridge to build serverless automation, which will reduce RTO by improving detection and recovery. Amazon ElastiCache continually monitors the state of the primary node.

A pilot light in a furnace provides a quick way to light the furnace burners that then provide heat.

In this 3-part blog series, we filter through the 200+ AWS services and focus on those that have specific features to assist you in building multi-Region applications.

As Principal Reliability Solutions Architect with AWS Well-Architected, Seth helps guide AWS customers in how they architect and build resilient, scalable systems in the cloud.
As required for all active/passive strategies, pilot light and warm standby both require a means to route traffic to the primary Region, and then fail over to the recovery Region when recovering from a disaster. Therefore, if you're designing a DR strategy to withstand events such as power outages, flooding, and other localized disruptions, then using a Multi-AZ DR strategy within an AWS Region can provide the protection you need.

Warm standby (RPO in seconds, RTO in minutes): The warm standby strategy deploys a functional stack, but at reduced capacity. Warm standby only requires you to scale up (everything is already deployed and running). In the pilot light strategy, basic infrastructure elements such as Elastic Load Balancing and Amazon EC2 Auto Scaling are in place, as shown in Figure 6. Databases and object storage are always on. RTO for these strategies is different. In addition to replication, both strategies require you to create a continuous backup in the recovery Region.

The single Region/multi-AZ strategy safeguards your workloads against a disaster that disrupts an Amazon data center by replicating workloads across multiple AZs in the same Region. Replication alone does not protect against data corruption or destruction unless your solution also includes options for point-in-time recovery. Although there are ways to work around this, we are focusing on cluster relocation.

Based on configured health checks, AWS services such as Elastic Load Balancing and AWS Auto Scaling can distribute load to healthy Availability Zones, while services such as Amazon Route 53 and AWS Global Accelerator can route load to healthy AWS Regions.

By first understanding business requirements for your workload, you can choose an appropriate DR strategy. With this approach, you can deploy a DR solution in multiple Regions, but it will be associated with longer RPO/RTO. The workload operates from a single site (in this case an AWS Region) and all requests are handled from this active Region.

Manage configuration drift at the DR site. Using CloudFormation parameters and conditional logic, you can create a single template that can create either active stacks (primary Region) or passive stacks (recovery Region).

In the next post, we will discuss a multi-Region warm standby strategy for the same application stack illustrated in this post.

Seth joined Amazon in 2005 where soon after, he helped develop the technology that would become Prime Video.
Instead of using Route 53 and DNS records, you can also use AWS Global Accelerator to implement failover. All requests are now switched to be routed to the recovery Region in a process called failover. What if the very tools that we rely on for failover are themselves impacted by a DR event? The Availability and Beyond whitepaper discusses the concept of static stability for improving resilience.

This blog shows you how AWS managed services automatically fail over between AZs without interruption when experiencing a localized disaster, and how backups to a separate Region ensure data protection. This distribution helps prevent cluster downtime if an AZ experiences a service disruption, which helps ensure high availability of the service and application. Note: You can add a layer of protection to your backups through AWS Backup Vault Lock and S3 Object Lock.

In my first blog post of this series, I introduced you to four strategies for disaster recovery (DR). Then we explored the backup and restore strategy. For tighter RTO/RPO objectives, the data is maintained live, and the infrastructure is fully or partially deployed in the recovery site before failover. Live data means the data stores and databases are up-to-date (or nearly up-to-date) with the active Region and ready to service read operations.

To automate the process, you can use the AWS CLI to update the stack and change the ActiveOrPassive value.
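As a rough sketch of that automation step, the Python below assembles an `aws cloudformation update-stack` command that changes only the ActiveOrPassive parameter. It builds the argument list without running it. The stack name is hypothetical, and a real stack with other parameters or IAM resources would need those parameters and a `--capabilities` flag added.

```python
import shlex


def update_stack_command(stack_name, role):
    """Build the AWS CLI arguments that switch a stack between active and passive."""
    if role not in ("active", "passive"):
        raise ValueError("role must be 'active' or 'passive'")
    return [
        "aws", "cloudformation", "update-stack",
        "--stack-name", stack_name,
        "--use-previous-template",  # keep the deployed template, change parameters only
        "--parameters", f"ParameterKey=ActiveOrPassive,ParameterValue={role}",
    ]


# "my-recovery-stack" is a placeholder name for illustration.
command = update_stack_command("my-recovery-stack", "active")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it, assuming the AWS CLI is installed and credentials for the recovery Region are configured.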

For Region failover, in addition to data recovery from backup, you must also be able to restore your infrastructure in the recovery Region. When configuration drift is detected, you can use AWS Systems Manager Automation to fix it and raise alarms. Join the Auto Scaling group to a cluster, and the group will automatically replace any terminated or failed nodes if an AZ fails.

In Part I, we'll discuss the single AWS Region/multi-Availability Zone (AZ) DR strategy. In this blog post, we share a reference architecture that uses a multi-Region active/passive strategy to implement a hot standby strategy for disaster recovery (DR).

Define recovery objectives for downtime and data loss. From left to right, the graphic shows how DR strategies incur differing RTO and RPO. Figure 2 categorizes DR strategies as either active/passive or active/active.

Like a pilot light in a furnace that cannot heat your house until triggered, a pilot light strategy cannot process requests until it is triggered to deploy the remaining infrastructure. Because warm standby deploys a functional stack to the recovery Region, it is easier to test Region readiness using synthetic transactions. When the time comes for recovery, the system is scaled up quickly to handle the production load. If the passive stack is deployed to the recovery Region at full capacity, however, then this strategy is known as hot standby.

Multi-site active/active (RTO potentially zero): Your workload is deployed to, and actively serving traffic from, two or more AWS Regions. Amazon DynamoDB global tables are an excellent choice for multi-site active/active because a table in any Region can be written to, and the data is propagated to all other Regions, usually within a second. This protects against disasters caused by human action or technical software failures.
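The distinction between the active/passive variants comes down to how much capacity the recovery Region runs before failover. A minimal sketch, assuming capacity is counted in EC2 instances and using a hypothetical function name:

```python
def classify_passive_stack(recovery_instances, production_instances):
    """Name the active/passive strategy implied by the recovery Region's capacity."""
    if production_instances <= 0:
        raise ValueError("production capacity must be positive")
    if recovery_instances < 0:
        raise ValueError("recovery capacity cannot be negative")
    if recovery_instances == 0:
        # Infrastructure is configured but no instances are deployed.
        return "pilot light"
    if recovery_instances < production_instances:
        # A functional stack running at reduced capacity.
        return "warm standby"
    # The passive stack already runs at full production capacity.
    return "hot standby"


print(classify_passive_stack(0, 10))   # pilot light
print(classify_passive_stack(2, 10))   # warm standby
print(classify_passive_stack(10, 10))  # hot standby
```

Moving down the list lowers RTO because less has to be deployed or scaled at failover time, at the price of paying for more idle capacity.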

As always for DR, data is also backed up in case it needs to be restored to fix accidental deletion or corruption. When you deploy across three AZs, Amazon OpenSearch Service distributes master nodes equally across all three AZs.

Figure 2 shows the four strategies for DR that are highlighted in the DR whitepaper. These strategies enable you to prepare for and recover from a disaster. RPO for these strategies is similar, since they share a common data strategy. Failover consists of re-routing requests away from a Region that cannot serve them. Set RTO and RPO based on business needs.

This example architecture refers to an application that processes payment transactions and has been modernized with AMS. With AWS Global Accelerator, customer traffic is onboarded at the closest of over 200 edge locations and travels over the AWS network to the endpoints you configure. Previously, I introduced you to four strategies for disaster recovery (DR) on AWS.

If you don't frequently test this failover, capacity that might have been sufficient when you last tested may no longer be sufficient. Between these two strategies, you have a choice of optimizing for RTO or for cost. The CloudFormation template lets you specify active or passive for the parameter ActiveOrPassive, which determines whether zero or non-zero EC2 instances will be deployed.

The primary DB instance is synchronously replicated across AZs to a standby replica. A replacement read replica is then created and provisioned in the same AZ as the failed primary. But as with all DR strategies, backups (like the Aurora DB cluster snapshot in Figure 6) are also necessary.

This 3-part blog series discusses disaster recovery (DR) strategies that you can implement to ensure your data is safe and that your workload stays available during a disaster. Failover re-directs production traffic from the primary Region (where you have determined the workload can no longer run) to the recovery Region.
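One way the ActiveOrPassive parameter could drive instance count is through a CloudFormation condition. The sketch below builds such a template fragment as a Python dictionary and prints it as JSON. It is not the template from the original post: the condition name `IsActive`, the resource name `WebServerGroup`, and the capacity numbers are assumptions, and the launch template and subnet properties a deployable Auto Scaling group needs are left out.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "ActiveOrPassive": {
            "Type": "String",
            "AllowedValues": ["active", "passive"],
            "Default": "passive",
        }
    },
    "Conditions": {
        # True only when the stack is created or updated as the active stack.
        "IsActive": {"Fn::Equals": [{"Ref": "ActiveOrPassive"}, "active"]}
    },
    "Resources": {
        "WebServerGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "MinSize": "0",
                "MaxSize": "4",  # assumed ceiling for the example
                # Non-zero instances when active, zero when passive.
                "DesiredCapacity": {"Fn::If": ["IsActive", "2", "0"]},
                # LaunchTemplate and VPCZoneIdentifier omitted for brevity.
            },
        }
    },
}


def desired_capacity(role):
    """Evaluate the Fn::If above by hand, for illustration only."""
    group = template["Resources"]["WebServerGroup"]["Properties"]
    _condition, if_true, if_false = group["DesiredCapacity"]["Fn::If"]
    return if_true if role == "active" else if_false


print(json.dumps(template, indent=2))
print(desired_capacity("passive"))  # 0
print(desired_capacity("active"))   # 2
```

Keeping both roles in one template means the primary and recovery Regions are built from the same definition, which limits configuration drift between them.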