What
The SaaS environment had services distributed across multiple Availability Zones (AZs) in an Active / Active configuration to provide resilience to AZ failures. The database relied on an Active / Passive configuration, so the active node was in one AZ only. We found that performance dropped significantly when requests went from the Application Server in one AZ to the Database Server in another. The difference compared with an App Server co-located with the DB was stark: 2.4 seconds to render a page vs. 8 seconds when the traffic traversed AZs.
So What
Latency between AZs was low: real-world network round trips took 2ms when the App Server and DB Server were co-located, and 4-5ms when they were not. We employed our standard performance investigation framework and, together with a team member's network tracing skills, found that the problem was the number of round trips needed to build a page: approximately 2000. The App Server was making 2000 DB calls to render the page, so the difference between 2ms and 4-5ms per call, multiplied across all of those calls, was causing the inconsistent and poor performance.
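As a rough check on that reasoning, the sketch below multiplies the per-call latency by the number of round trips. It assumes the DB calls run strictly one after another, which the trace summary above does not confirm, and the function and variable names are illustrative only. The inputs are rounded, so treat this as an order-of-magnitude check on the extra time, not a reconstruction of the measured page times.

```python
# Back-of-the-envelope model: time spent waiting on the network grows
# linearly with the number of sequential DB round trips.
ROUND_TRIPS = 2000  # approximate DB calls per page, from the network trace


def latency_cost_seconds(round_trip_ms: float, round_trips: int = ROUND_TRIPS) -> float:
    """Network wait in seconds, assuming the calls are made one after another."""
    return round_trips * round_trip_ms / 1000.0


# Extra wait when each round trip costs 4-5 ms (cross-AZ) instead of 2 ms (same AZ).
extra_low = latency_cost_seconds(4.0) - latency_cost_seconds(2.0)
extra_high = latency_cost_seconds(5.0) - latency_cost_seconds(2.0)

# Extra page render time actually observed: 8 s cross-AZ vs 2.4 s co-located.
observed_extra = 8.0 - 2.4

print(f"predicted extra: {extra_low:.1f}-{extra_high:.1f} s, "
      f"observed extra: {observed_extra:.1f} s")
```

Under these assumptions the predicted extra wait of 4 to 6 seconds brackets the observed extra 5.6 seconds, which is consistent with round-trip count, not raw latency, being the dominant factor.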
Our AWS deployment model was broken. We relied on multi-AZ Active / Active for high availability, yet any App Server outside the database's AZ delivered unacceptable performance.
Now What
This prompted a quick re-think of our architecture. We had been testing and refining our automated DR process, which used scripting to stand up a complete new environment from backup. It had beaten all our expectations for duration, creating new AWS instances and bringing up the DR environment in under 10 minutes. We modified these scripts, decomposing them into two components:
1. to define new instances from backup
2. to bring up a new Production environment
The Production application was duplicated to maintain HA capabilities, with a complete HA Active set in one AZ and another complete Passive set in a second AZ, with its machines turned off to save cost.
Our patching process used the first script to rebuild the passive machines after patching. Our DR and Database fail-over process used the second script to move processing between AZs.
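The split between the two components can be sketched as a small in-memory model. This is not our actual scripting and makes no AWS calls: the class names, function names, role names and AZ labels are all hypothetical, and stopping the old Active set during fail-over is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    name: str
    az: str
    running: bool = False  # instances are defined powered off


@dataclass
class Environment:
    az: str
    instances: List[Instance] = field(default_factory=list)

    @property
    def active(self) -> bool:
        # An environment counts as active only if every instance is running.
        return bool(self.instances) and all(i.running for i in self.instances)


def define_instances_from_backup(az: str, roles: List[str]) -> Environment:
    """Component 1: (re)create the instances from backup, left powered off."""
    return Environment(az, [Instance(f"{role}-{az}", az) for role in roles])


def bring_up_production(target: Environment,
                        current: Optional[Environment] = None) -> Environment:
    """Component 2: make the target set the running Production environment."""
    if current is not None:
        for inst in current.instances:  # assumption: old set is powered off
            inst.running = False
    for inst in target.instances:
        inst.running = True
    return target


ROLES = ["app-1", "app-2", "db"]  # hypothetical role names

# Initial state: Active set in one AZ, Passive set defined but off in another.
active = bring_up_production(define_instances_from_backup("az-a", ROLES))
passive = define_instances_from_backup("az-b", ROLES)

# Patching: only component 1 runs, rebuilding the passive machines.
passive = define_instances_from_backup("az-b", ROLES)

# DR / DB fail-over: only component 2 runs, moving processing between AZs.
active, passive = bring_up_production(passive, current=active), active
```

In this model, fail-over skips the instance-definition step entirely because the Passive set already exists, so only the start-up work remains.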
The result was that we maintained HA at the same level as before and reduced the DR fail-over time from 10 minutes to 3 minutes, all for very little additional ongoing cost or effort.