For the last couple of weeks I have been going over articles and videos in the Amazon Builders' Library. They cover useful patterns that Amazon uses to build and operate software. Below are the key points I captured while going through the material.
Reliability, constant work, and a good cup of coffee – Link
- Amazon systems strive to solve problems using reliable constant work patterns. These work patterns have three key features:
- One, they don’t scale up or slow down with load or stress.
- Two, they don’t have modes, which means they do the same operations in all conditions.
- Three, if they have any variation, it’s to do less work in times of stress so they can perform better when you need them most.
- Not many problems can be efficiently solved using constant work patterns.
- For example, if you’re running a large website that requires 100 web servers at peak, you could choose to always run 100 web servers. This certainly reduces a source of variance in the system, and is in the spirit of the constant work design pattern, but it’s also wasteful. For web servers, scaling elastically can be a better fit because the savings are large. It’s not unusual to require half as many web servers off peak as during the peak.
- Based on the examples given in the post, a constant work pattern is suitable for use cases where system reliability, stability, and self-healing are the primary concerns, and where it is acceptable for the system to do some wasteful work and cost more. These concerns are essential for systems that others build their own systems on top of; control plane systems fall under this category. The example in the post is a system that applies configuration changes to foundational AWS components like AWS Network Load Balancer. The solution can be designed using either a push-based or a pull-based approach; the pull-based constant work approach lends itself to a simpler and more reliable design (see the sketch after this list).
- Although not mentioned in the post, the constant work that the system does should be idempotent in nature.
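A minimal sketch of what such a pull-based constant-work loop could look like. The endpoint, polling interval, and `apply_config` helper are hypothetical stand-ins, not something described in the post:

```python
import json
import time
import urllib.request

POLL_INTERVAL_SECONDS = 10  # fixed cadence: the loop does the same work every cycle
CONFIG_URL = "https://example.com/config.json"  # hypothetical config location (e.g. an S3 object)


def fetch_full_config() -> dict:
    """Always fetch the complete configuration, not a delta, so every cycle is identical."""
    with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
        return json.loads(resp.read())


def apply_config(config: dict) -> None:
    """Hypothetical idempotent apply: writing the same config twice has no extra effect."""
    print("applied config version", config.get("version"))


while True:
    started = time.monotonic()
    try:
        apply_config(fetch_full_config())
    except Exception:
        # On failure, keep the last applied configuration and try again next cycle.
        pass
    # Sleep for the remainder of the interval so the cadence stays constant under load or failure.
    time.sleep(max(0.0, POLL_INTERVAL_SECONDS - (time.monotonic() - started)))
```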
Making retries safe with idempotent APIs – Link
- Retrying a failed request is a simple strategy that can help you build resilient systems. You can retry requests that fail as a result of network IO issues, server-side fault, or service rate limiting. But, to safely retry you need to ensure the operation you are retrying is idempotent. An idempotent operation is one where a request can be retransmitted or retried with no additional side effects, a property that is very beneficial in distributed systems.
- To determine whether a request is a duplicate of an earlier request, Amazon uses a unique caller-provided client request identifier. Requests from the same caller with the same client request identifier can be considered duplicates and dealt with accordingly. If the caller does not provide a client token, then the server that handles the request generates one and returns it in the response. Any future request with the same parameters that uses the previous client token is considered a duplicate, and a semantically equivalent response is returned.
- The process that combines recording the idempotent token and all mutating operations related to servicing the request must meet the properties for an atomic, consistent, isolated, and durable (ACID) operation. An ACID server-side operation needs to be an “all or nothing” or atomic process. This ensures that we avoid situations where we could potentially record the idempotent token and fail to create some resources or, alternatively, create the resources and fail to record the idempotent token.
- If the caller sends the same client token but changes the request parameters, then a validation error is returned in the response. To support this deep validation, AWS services also store the parameters used to make the initial request along with the client request identifier (a minimal sketch of the token flow follows).
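A minimal sketch of the server-side bookkeeping this implies, using a plain in-memory dict as a stand-in; a real service would record the token and the mutation in a single ACID transaction against a durable store, as the article stresses:

```python
import uuid

# token -> (request_params, response); stands in for a durable, transactional store
_idempotency_store: dict[str, tuple[dict, dict]] = {}


def create_resource(params: dict, client_token: str | None = None) -> dict:
    # If the caller did not supply a token, generate one and return it in the response.
    token = client_token or str(uuid.uuid4())

    if token in _idempotency_store:
        original_params, original_response = _idempotency_store[token]
        if original_params != params:
            # Same token, different parameters: deep validation fails.
            raise ValueError("IdempotentParameterMismatch")
        # Duplicate request: return a semantically equivalent response, with no new side effects.
        return original_response

    # In a real service, this mutation and the token record would be committed atomically.
    response = {"resource_id": str(uuid.uuid4()), "client_token": token}
    _idempotency_store[token] = (params, response)
    return response
```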
Fairness in multi-tenant systems – Link
I still need to read this article.
Avoiding overload in distributed systems by putting the smaller service in control – Link
- Within AWS, a common pattern is to split the system into services that are responsible for executing customer requests (the data plane), and services that are responsible for managing and vending customer configuration (the control plane). The size of the data plane fleet exceeds the size of the control plane fleet, frequently by a factor of 100 or more.
- When building such architectures at Amazon, one of the decisions we carefully consider is the direction of API calls. Should the larger data plane fleet call the control plane fleet? Or should the smaller control plane fleet call the data plane fleet? For many systems, having the data plane fleet call the control plane tends to be the simpler of the two approaches. In such systems, the control plane exposes APIs that can be used to retrieve configuration updates and to push operational state. The data plane servers can then call those APIs either periodically or on demand. Each data plane server initiates the API request, so the control plane does not need to keep track of the data plane fleet. For each request, a control plane server responds with updated configuration (usually by querying its durable data store), and then forgets about it. To provide fault tolerance and horizontal scalability, control plane servers are placed behind a load balancer like Application Load Balancer. This is one of the simpler types of distributed systems to build.
- The biggest challenge with this architecture is scale mismatch. The control plane fleet is badly outnumbered by the data plane fleet. At AWS, we looked to our storage services for help. When it comes to serving content at scale, the service we commonly use at Amazon is Amazon Simple Storage Service (S3). Instead of exposing APIs directly to the data plane, the control plane can periodically write updated configuration into an Amazon S3 bucket. Data plane servers then poll this bucket for updated configuration and cache it locally. Similarly, to stay up to date on the data plane’s operational state, the control plane can poll an Amazon S3 bucket into which data plane servers periodically write that information.
- Despite its popularity, this architecture does not work in all cases. In systems with a large amount of dynamic configuration it might be impractical to continuously recompute the entire configuration and write it into Amazon S3. An example of such a system is Amazon EC2, where a large number of data plane servers need to know about the constantly changing configuration of instances that they need to run, their VPCs, IAM roles, and Amazon Elastic Block Store (EBS) volumes.
- At Amazon, when faced with such scenarios, we look for other approaches where the small fleet can be in control of the pace at which requests flow through the system. One such approach is to reverse the flow of API calls and have the smaller control plane fleet push configuration changes to the larger data plane fleet. To implement this approach, you have to solve three main challenges:
- The control plane needs to keep an up-to-date inventory of all available data plane servers, so that it can push configuration updates to each one.
- At any point in time, some data plane servers will be unreachable, and the control plane needs to deal with that.
- The control plane needs a mechanism to make sure that every data plane server receives configuration updates from at least one control plane server and a mechanism to ensure that no control plane server is responsible for too many data plane servers.
- The article suggests two ways to implement control plane -> data plane API flow.
- Using consistent hashing and a shared data store like DynamoDB (a rough sketch of this approach follows the list)
- The data plane initiates the connection to the control plane, but then the control plane takes over and calls the data plane APIs. This is implemented using an HTTP server push mechanism.
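A rough sketch of the first approach: a consistent hash ring decides which control plane worker owns which data plane server. The node names are made up, and the assignment here lives in memory, whereas the article has the fleets coordinate through a shared data store like DynamoDB:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Maps each data plane server to a control plane worker; adding or removing a
    worker only moves the servers adjacent to it on the ring."""

    def __init__(self, workers: list[str], vnodes: int = 100):
        # Place several virtual nodes per worker on the ring for a more even spread.
        self._ring = sorted((_hash(f"{w}-{i}"), w) for w in workers for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def owner(self, data_plane_server: str) -> str:
        idx = bisect.bisect(self._keys, _hash(data_plane_server)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["control-1", "control-2", "control-3"])
# The control plane worker responsible for pushing configuration to this host:
print(ring.owner("data-plane-host-042"))
```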
Building dashboards for operational visibility – Link
- At Amazon, we refer to the creation, usage, and ongoing maintenance of these dashboards as dashboarding. Dashboarding has evolved into a first-class activity because it’s as critical to the success of our services as other day-to-day software delivery and operational activities, such as designing, coding, building, testing, deploying, and scaling our services.
- The data that each audience wants to see can vary significantly from dashboard to dashboard. We have learned to focus on the intended audience when we design dashboards. We decide what data goes into each dashboard based on who will use the dashboard and why they will use the dashboard. You’ve probably heard that at Amazon we work backwards from the customer. Dashboard creation is a good example of this. We build dashboards based on the needs of the expected users and their specific requirements.
- Amazon teams create two types of dashboards
- High level dashboards: These include:
- Customer experience dashboards
- System-level dashboards
- Service instance dashboards
- Service audit dashboards
- Capacity planning and forecasting dashboards
- Low level dashboards: These include:
- Microservice-specific dashboards
- Infrastructure dashboards
- Dependency dashboards
- We have adopted a design convention that standardizes the layout of data in a dashboard. Dashboards render from top to bottom, and users tend to interpret the initially rendered graphs (visible when the dashboard loads) as the most important. So, our design convention advises placing the most important data at the top of the dashboard. We have found that aggregated/summary availability graphs and end-to-end latency percentile graphs are typically the most important graphs for web services.
- Lay out graphs for the expected minimum display resolution.
- For dashboards that display date and time data, we make sure that the related time zone is visible in the dashboard. For dashboards that are used concurrently by operators in different time zones, we default to one time zone (UTC) that all users can relate to. This way users can communicate with each other using a single time zone, saving them the time and effort of doing excessive mental time zone translations.
- Maintaining and updating dashboards is ingrained in our development process.
- Review dashboards during the code review process
- Maintain dashboards as code
- When we conduct a post-mortem for an unexpected operational event, our teams review whether improvements to dashboards (and automated alarming) could have preempted the event, identified the root cause faster, or reduced the mean time to recovery.
Automating safe, hands-off deployments – Link
- Amazon follows an "everything as code" philosophy. Source code includes application code, static assets, tests, infrastructure, configuration, and the application’s underlying operating system (OS). These different types of code assets are kept in different repositories, and each has its own deployment pipeline. These pipelines have rollback mechanisms built in as well.
- A typical microservice might have an application code pipeline, an infrastructure pipeline, an OS patching pipeline, a configuration/feature flags pipeline, and an operator tools pipeline. Having multiple pipelines for the same microservice helps us deploy changes to production faster. Application code changes that fail integration tests and block the application pipeline don’t affect other pipelines. For example, they don’t block infrastructure code changes from reaching production in the infrastructure pipeline.
- Code reviewers use a code review checklist to keep reviews consistent. The checklist includes items such as checking whether any changes are required to the monitoring setup and whether the change can be rolled back without any issues.
- Before deploying to production, the pipeline deploys and validates changes in multiple pre-production environments, for example, alpha, beta, and gamma. Alpha and beta validate that the latest code functions as expected by running functional API tests and end-to-end integration tests. Gamma validates that the code is both functional and that it can be safely deployed to production. Gamma is as production-like as possible, including the same deployment configuration, the same monitoring and alarms, and the same continuous canary testing as production. Gamma is also deployed in multiple AWS Regions to catch any potential impact from regional differences.
- AWS teams also test for API backward compatibility. This is done in the Gamma pre-production environment, where continuous canary testing runs. They add one node/VM/container that runs the latest code alongside the currently deployed code. This helps them uncover issues that might happen if the latest code writes data in a format that the current code can’t parse.
- Production deployment is done in multiple stages each limited to a small portion of AWS infrastructure. Teams split regional deployments into even smaller-scoped deployments by deploying to individual Availability Zones or to their service’s individual internal shards (called cells) in their pipeline, to further limit the scope of potential impact from a failed production deployment.
Instrumenting distributed systems for operational visibility – Link
- We find that when we reduce high percentile latency in a system, our efforts have the side effect of reducing median latency. By contrast, we find that when we reduce median latency, this reduces high percentile latency less often.
- Instrumentation requires developers to write more code. The two common patterns Amazon uses for instrumentation are:
- Common shared libraries are instrumented so that consumers of the libraries get visibility into how the library is operating
- Use structured log-based metric reporting
- Amazon writes all the metrics data (timers and counters) to a log file. Aggregate metrics are computed by another system that processes these log files.
- All services emit two types of data – request data and debugging data
- Request data is a single structured log entry per request that contains properties about the request: who made it, what it was for, counters of how often things happened, and timers of how long things took (see the sketch at the end of this section).
- Debugging data includes unstructured or loosely structured data of whatever debugging lines the application emits
- These two types of data are written to two separate log files
- Log requests even when they are rejected early due to validation, authentication, throttling, or any other reason.
- Plan for scenarios where the service fails to log, for example because the disk fills up.
- You should log availability and latency of all the dependencies.
- Log different latency metrics depending on status code and size. Errors are often fast—like access denied, throttling, and validation error responses. If clients start getting throttled at a high rate, it might make latency look deceptively good. To avoid this metric pollution, we log a separate timer for successful responses, and focus on that metric in our dashboards and alarms instead of using a generic Time metric.
- Consider rate limiting how often a logger will log. This helps when a service starts logging a large volume of errors.
- Offload serialization and log flushing to a separate thread.
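A minimal sketch of what one structured request-log entry per request could look like. The field names and JSON format are illustrative; the post does not specify Amazon's internal log format:

```python
import json
import time


class RequestLog:
    """Accumulates properties, counters, and timers for one request, then emits a
    single structured log line that a downstream system can aggregate into metrics."""

    def __init__(self, operation: str):
        self.entry = {"operation": operation, "counters": {}, "timers_ms": {}}

    def prop(self, key: str, value) -> None:
        self.entry[key] = value

    def count(self, name: str, delta: int = 1) -> None:
        self.entry["counters"][name] = self.entry["counters"].get(name, 0) + delta

    def time(self, name: str, start: float) -> None:
        self.entry["timers_ms"][name] = round((time.monotonic() - start) * 1000, 2)

    def emit(self) -> None:
        print(json.dumps(self.entry))  # in practice: append to the request-data log file


log = RequestLog("GetWidget")
log.prop("client_id", "client-123")
start = time.monotonic()
# ... call a dependency here ...
log.time("DependencyLatency", start)
log.count("CacheMiss")
log.emit()
```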
Leader Election in Distributed Systems – Link
- Leader election is the simple idea of giving one thing (a process, host, thread, object, or human) in a distributed system some special powers. Those special powers could include the ability to assign work, the ability to modify a piece of data, or even the responsibility of handling all requests in the system. This pattern simplifies distributed systems, but it has its own downsides: the leader becomes a single point of failure and a single point of scaling, both in terms of data size and request rate.
- Amazon uses sharding to scale the number of leaders. Each item of data still belongs to a single leader, but the whole system contains many leaders. This is the fundamental design approach behind Amazon DynamoDB (DynamoDB), Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and many other systems at Amazon.
- To keep working through failures, Amazon distributed systems don’t depend on there being a single leader. In distributed systems, it’s not possible to guarantee that there is exactly one leader in the system. Instead, there is mostly one leader, and there can be either zero leaders or two leaders during failures. To handle two leaders, the processing system needs to be idempotent.
- Consider that slow networking, timeouts, retries, and garbage collection pauses can cause the remaining lease time to expire before the code expects it to.
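A minimal sketch of the defensive check this implies: re-verify the remaining lease time immediately before doing leader-only work, with a safety margin for pauses and slow networking. The lease object and margin value are illustrative, not from the article:

```python
import time

LEASE_SAFETY_MARGIN_SECONDS = 2.0  # headroom for GC pauses, slow networking, retries


class Lease:
    def __init__(self, expires_at: float):
        self.expires_at = expires_at  # monotonic deadline recorded when the lease was granted

    def remaining(self) -> float:
        return self.expires_at - time.monotonic()


def do_leader_work(lease: Lease) -> None:
    # Check right before the critical action, not only when the lease was acquired.
    if lease.remaining() < LEASE_SAFETY_MARGIN_SECONDS:
        raise RuntimeError("Lease too close to expiry; refusing to act as leader")
    # ... perform the leader-only mutation here, ideally idempotently ...
```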
Ensuring rollback safety during deployments – Link
- Amazon uses a two-phase deployment technique to ensure safe rollback. The two-phase technique helps with rolling deployments, where servers can temporarily be in different states. It has two phases – a prepare phase and an activate phase.
- We generally follow a specific order in the context of serialization. That is, readers go before writers while rolling forward whereas writers go before readers while rolling backward.
- Amazon uses a testing strategy called upgrade-downgrade testing to explicitly verify that a software change is safe to roll forward and backward.
Video: Beyond five 9s: Lessons from our highest available data planes
- Poison tasters: run a local copy of the service to check that a change will not cause a failure
- Data formats and schema matter
- Precompute and generate lookup tables to avoid dynamic queries. Use the AWS EMR service to generate these lookup tables
- Under high load, consider sacrificing low-priority work such as logging and metering.
Going faster with continuous delivery – Link
- Controlling when software is released: To control the safety of our software releases, we’ve built mechanisms that allow us to control the speed at which changes move through our pipeline. We use metrics, time windows, and safety checks to control when our software is released.
Timeouts, retries, and backoff with jitter – Link
- If errors are caused by load, retries can be ineffective if all clients retry at the same time. To avoid this problem, we employ jitter. This is a random amount of time before making or retrying a request to help prevent large bursts by spreading out the arrival rate.
- All remote calls should have a timeout. There are two timeouts – connection timeout and request timeout.
- A good practice for choosing a timeout for calls within an AWS Region is to start with the latency metrics of the downstream service. So at Amazon, when one service calls another, we choose an acceptable rate of false timeouts (such as 0.1%). Then, we look at the corresponding latency percentile on the downstream service (p99.9 in this example).
- Instead of retrying immediately and aggressively, the client waits some amount of time between tries. The most common pattern is an exponential backoff, where the wait time is increased exponentially after every attempt. Exponential backoff can lead to very long backoff times, because exponential functions grow quickly. To avoid retrying for too long, implementations typically cap their backoff to a maximum value.
- In a microservices world where requests fan out to multiple services, it is important to consider whether you should retry at the lowest layer of the stack or at the highest layer.
- You should retry idempotent operations only. Consider adding an idempotency key in your API so that you can safely make non-idempotent operations idempotent.
- Knowing which failures are worth retrying. HTTP provides a clear distinction between client and server errors. It indicates that client errors should not be retried with the same request because they aren’t going to succeed later, while server errors may succeed on subsequent tries.
- Jitter adds some amount of randomness to the backoff to spread the retries around in time. Jitter isn’t only for retries. Operational experience has taught us that the traffic to our services, including both control-planes and data-planes, tends to spike a lot. These spikes of traffic can be very short, and are often hidden by aggregated metrics. When building systems, we consider adding some jitter to all timers, periodic jobs, and other delayed work. This helps spread out spikes of work, and makes it easier for downstream services to scale for a workload.
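Putting the backoff and jitter points together, here is a minimal sketch of capped exponential backoff with full jitter; the cap, base delay, and retry limit are arbitrary example values:

```python
import random
import time


def call_with_retries(operation, max_attempts: int = 3,
                      base_delay: float = 0.1, max_delay: float = 5.0):
    """Retry an idempotent operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Exponential backoff, capped, with full jitter: sleep a random time in [0, cap].
            backoff_cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff_cap))
```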
Video: Architecting and operating resilient serverless systems at scale – Link
- Using load shedding you can maintain goodput.
- Throughput is the total number of requests per second that is being sent to the server.
- Goodput is the subset of the throughput that is handled without errors and with low enough latency for the client to make use of the response
I have not completely watched this video. Some of the material was already covered in other posts.
Video: Amazon’s approach to high-availability deployment – Link
I still need to watch this video.
Workload isolation using shuffle-sharding – Link
- Amazon uses a technique called shuffle sharding to build a 100 percent uptime DNS service – Route 53. DNS services are a frequent target of DDoS attacks.
- Shuffle sharding is a sharding strategy to reduce the blast radius of an outage and better isolate tenants. It is useful in multi-tenant architectures. The advantages of shuffle sharding are:
- An outage only affects a subset of tenants
- A misbehaving tenant will affect only its shard instances. Due to the low overlap of instances between different tenants, it’s statistically quite likely that any other tenant will run on different instances or only a subset of instances will match the affected ones.
- In the example covered in the post with 8 workers and 8 customers, shuffle sharding is 7 times better than regular sharding.
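A small sketch that makes the 8-worker example concrete: each customer is deterministically mapped to one of the C(8, 2) = 28 possible two-worker shuffle shards, so two customers rarely share their whole shard. The worker names are made up:

```python
import hashlib
from itertools import combinations

WORKERS = [f"worker-{i}" for i in range(8)]
SHARD_SIZE = 2
ALL_SHARDS = list(combinations(WORKERS, SHARD_SIZE))  # C(8, 2) = 28 possible shuffle shards


def shuffle_shard(customer_id: str) -> tuple[str, str]:
    """Deterministically map a customer to one of the 28 two-worker shuffle shards."""
    digest = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
    return ALL_SHARDS[digest % len(ALL_SHARDS)]


print(shuffle_shard("customer-a"))
print(shuffle_shard("customer-b"))
# With regular sharding into 4 fixed shards of 2 workers, a poisoned shard hits 1 in 4 customers.
# With shuffle sharding, another customer shares the exact same pair only 1 time in 28,
# which is where the "7 times better" figure comes from (28 / 4 = 7).
```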
Using load shedding to avoid overload – Link
- You need to perform two types of load tests: first, one where the server fleet grows as the load increases, and second, one with a fixed fleet size.
- If, in an overload test, a service’s availability degrades rapidly to zero as throughput increases, that’s a good sign that the service is in need of additional load shedding mechanisms. The ideal load test result is for goodput to plateau when the service is close to being fully utilized, and to remain flat even when more throughput is applied.
- When using load shedding measures to maintain goodput it is important you understand the cost of rejecting a request. There are times when the cost of dropping a request is more than holding on to the request. In such cases, it is better to slow down rejected requests to match the latency of successful responses
- When clients time out, the server still does the work. This work is wasted.
- One way to avoid this wasted work is for clients to include timeout hints in each request, which tell the server how long they’re willing to wait. The server can evaluate these hints and drop doomed requests at little cost.
- Send the remaining time along with the request so downstream layers can also drop doomed work
- Another way to shed load in the right way is to drop tasks that have been sitting in the queue for too long. You no longer need to follow FIFO order; LIFO works better in these situations.
- When we pull work off queues, we use that opportunity to examine how long the work was sitting on the queue. At a minimum, we try to record that duration in our service metrics. In addition to bounding the size of queues, we’ve found it’s extremely important to place an upper bound on the amount of time that an incoming request sits on a queue, and we throw it out if it’s too old. This frees up the server to work on newer requests that have a greater chance of succeeding. As an extreme version of this approach, we look for ways to use a last in, first out (LIFO) queue instead, if the protocol supports it.
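A minimal sketch of this idea: record an absolute deadline when a request is enqueued and throw it away at dequeue time if it has already waited too long or its caller has certainly timed out. The queue bound and field names are illustrative:

```python
import time
from collections import deque

MAX_QUEUE_TIME_SECONDS = 1.0       # upper bound on how long a request may sit on the queue
queue: deque = deque(maxlen=1000)  # bounding the queue size alone is not enough


def enqueue(request: dict, client_timeout: float) -> None:
    request["enqueued_at"] = time.monotonic()
    request["deadline"] = time.monotonic() + client_timeout  # timeout hint from the client
    queue.append(request)


def next_request() -> dict | None:
    while queue:
        request = queue.popleft()
        waited = time.monotonic() - request["enqueued_at"]
        # Drop requests that waited too long or whose caller has almost certainly timed out;
        # in a real system, record the wait duration as a metric before dropping.
        if waited > MAX_QUEUE_TIME_SECONDS or time.monotonic() > request["deadline"]:
            continue
        return request
    return None
```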
Video: Amazon’s approach to security during development – Link
I still need to watch this video.
Caching challenges and strategies – Link
- The rate of change of the source data, as well as the cache policy for refreshing data, will determine how inconsistent the data tends to be. These two are related to each other. For example, relatively static or slow-changing data can be cached for longer periods of time.
- There are generally two types of cache – a local (or on-box) cache and an external cache.
- Local caches are simple to implement. They can be implemented using a hash table. Despite their simplicity, local caches can lead to cache coherence and cold start issues.
- External caches solve both of the local cache issues – cache coherence and cold starts – because the cached data lives in a separate, centralized fleet. However, external caches make the architecture more complex and require more operational work to keep their uptime high.
- We prefer to either use the external cache in conjunction with an in-memory cache that we can fall back to if the external cache becomes unavailable, or to use load shedding and cap the maximum rate of requests sent to the downstream service.
- With an external cache you have to consider that the format of cached objects will evolve over time. You need to ensure that updated software can always read data that a previous version of the software wrote, and that older versions can gracefully handle seeing new formats/fields.
- There are two approaches to caching – inline caching and side caching
- Inline caches, or read-through/write-through caches, embed cache management into the main data access API, making cache management an implementation detail of that API.
- Side caches, in contrast, are generic object stores that the application code manipulates directly: it checks for cached objects before making downstream calls and puts objects into the cache after those calls are completed.
- The primary benefit of an inline cache is a uniform API model for clients. Caching can be added, removed, or tweaked without any changes to client logic. An inline cache also pulls cache management logic out of application code, thus eliminating a source of potential bugs.
- Another pattern we use to improve resiliency when downstream services are unavailable is to use two TTLs: a soft TTL and a hard TTL. The client will attempt to refresh cached items based on the soft TTL, but if the downstream service is unavailable or otherwise doesn’t respond to the request, then the existing cache data will continue to be used until the hard TTL is reached (a minimal sketch of this follows the list).
- Thundering herd is a situation in which many clients make requests that need the same uncached downstream resource at approximately the same time. To remedy this issue we use request coalescing, where the servers or external cache ensure that only one pending request is out for uncached resources.
- Be deliberate and empirical in the choice of cache size, expiration policy, and eviction policy.
- Consider the security aspects of maintaining cached data, including encryption, transport security when communicating with an external caching fleet, and the impact of cache poisoning attacks and side-channel attacks.
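A minimal sketch of the soft-TTL/hard-TTL pattern mentioned earlier in this list. The TTL values, the in-memory dict, and the `fetch_from_source` helper are illustrative placeholders:

```python
import time

SOFT_TTL_SECONDS = 60    # try to refresh after this
HARD_TTL_SECONDS = 3600  # never serve data older than this
_cache: dict[str, tuple[float, object]] = {}  # key -> (fetched_at, value)


def fetch_from_source(key: str):
    """Hypothetical call to the downstream service; may raise when it is unavailable."""
    raise NotImplementedError


def get(key: str):
    fetched_at, value = _cache.get(key, (0.0, None))
    age = time.monotonic() - fetched_at

    if value is not None and age < SOFT_TTL_SECONDS:
        return value  # still fresh enough

    try:
        value = fetch_from_source(key)
        _cache[key] = (time.monotonic(), value)
        return value
    except Exception:
        # Downstream unavailable: keep serving the stale value until the hard TTL.
        if value is not None and age < HARD_TTL_SECONDS:
            return value
        raise
```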
Avoiding insurmountable queue backlogs – Link
- When queues are introduced for durability, it is easy to miss the tradeoff that causes such high processing latency in the face of a backlog. The hidden risk with asynchronous systems is dealing with large backlogs.
- Fairness in a multitenant system is important so that no customer’s workload impacts another customer. A common way that AWS implements fairness is by setting per-customer rate-based limits, with some flexibility for bursting. In many of our systems, for example in SQS itself, we increase per-customer limits as customers grow organically. The limits serve as guardrails for unexpected spikes, allowing us time to make provisioning adjustments behind the scenes.
- A single queue for all consumers (use cases and customers) does not work. Using a separate queue for each consumer might not work either.
- AWS Lambda provisions a fixed number of queues and hashes each customer to a small number of them. Before enqueueing a message, it checks which of those targeted queues contains the fewest messages, and enqueues into that one. When one customer’s workload increases, it will drive a backlog in its mapped queues, but other workloads will automatically be routed away from those queues. It doesn’t take a large number of queues to build in some magical resource isolation (a rough sketch of this follows the list).
- We can also sideline old traffic. When we dequeue a message, we can check how old it is. Rather than just logging the age, we can use the information to decide whether to move the message into a backlog queue that we work through only after we’re caught up on the live queue. If there’s a load spike where we ingest a lot of data, and we get behind, we can sideline that wave of traffic into a different queue as quickly as we can dequeue and re-enqueue the traffic. This frees up consumer resources to work on fresh messages more quickly than if we had simply worked the backlog in order. This is one way to approximate LIFO ordering.
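A rough sketch of the Lambda-style queue selection described above: hash each customer to a small, fixed subset of queues and enqueue into whichever of those is currently shortest. The queue count and subset size are made-up example values:

```python
import hashlib
from collections import deque

NUM_QUEUES = 16
QUEUES_PER_CUSTOMER = 2
queues = [deque() for _ in range(NUM_QUEUES)]


def _candidate_queues(customer_id: str) -> list[int]:
    """Deterministically map a customer to a small subset of queue indexes."""
    return [
        int(hashlib.sha256(f"{customer_id}-{i}".encode()).hexdigest(), 16) % NUM_QUEUES
        for i in range(QUEUES_PER_CUSTOMER)
    ]


def enqueue(customer_id: str, message: dict) -> None:
    # Pick the least-loaded of the customer's candidate queues, so one busy customer backs up
    # its own mapped queues while other customers are routed around the backlog.
    target = min(_candidate_queues(customer_id), key=lambda idx: len(queues[idx]))
    queues[target].append(message)
```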
Video: Amazon’s approach to building resilient services – Link
- You need to close the loop. The loop mentioned in the video: Build → Fail → Analyze the causes of failure → Ops team learns → Build
- Build your systems to be kind. The video distinguishes between kind and wicked learning environments:
- Kind: The things we learn match the environment well. More experience means better predictions and better judgements
- Wicked: The things we learn don’t match the environment. Our experience leads us to the wrong conclusions.
Video: Amazon’s approach to failing successfully – Link
I still need to watch this video.
Implementing health checks – Link
- Health checks are a way of asking a service on a particular server whether or not it is capable of performing work successfully. Load balancers ask each server this question periodically to determine which servers it is safe to direct traffic to.
- Some load-balancing algorithms, such as “least requests,” give more work to the fastest server. When a server fails, it often begins failing requests quickly, creating a “black hole” in the service fleet by attracting more requests than healthy servers. In some cases, we add extra protection to prevent black holes by slowing down failed requests to match the average latency of successful requests.
- There are four types of health checks used at Amazon:
- Liveness checks test the basic connectivity to a service and the presence of a server process. They are often performed by a load balancer or external monitoring agent, and they are unaware of the details about how an application works.
- Local health checks go further than liveness checks to verify that the application is likely to be able to function. These health checks test resources that are not shared with the server’s peers.
- Dependency health checks are a thorough inspection of the ability of an application to interact with its adjacent systems.
- Anomaly detection looks across all servers in a fleet to determine if any server is behaving oddly compared to its peers.
- In overload conditions, it is important for servers to prioritize their health checks over their regular work. In this situation, failing or responding slowly to health checks can make a bad brownout situation even worse.
- Services need to be configured to set resources aside to respond to health checks in a timely way instead of taking on too many additional requests.
- Make a server’s worker pool large enough to accommodate extra health check requests
- Another way to help ensure that services respond in time to a health check ping request is to perform the dependency health check logic in a background thread and update an isHealthy flag that the ping logic checks. In this case, servers respond promptly to health checks, and the dependency health checking produces a predictable load on the external system it interacts with (a minimal sketch follows this list).
- To deal with zombies, systems often reply to health checks with their currently running software version. A central monitoring agent then compares the responses across the fleet to look for anything running an unexpectedly out-of-date version and prevents those servers from moving back into service.
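A minimal sketch of the background-thread pattern mentioned above: the dependency check runs on its own fixed cadence and only flips a flag, so the ping handler always answers quickly. The `check_dependencies` helper and interval are hypothetical:

```python
import threading
import time

is_healthy = True  # read by the cheap ping handler
CHECK_INTERVAL_SECONDS = 10


def check_dependencies() -> bool:
    """Hypothetical deep check of adjacent systems (database, downstream services, disk)."""
    return True


def _health_check_loop() -> None:
    global is_healthy
    while True:
        try:
            is_healthy = check_dependencies()
        except Exception:
            is_healthy = False
        time.sleep(CHECK_INTERVAL_SECONDS)  # predictable, constant load on the dependencies


threading.Thread(target=_health_check_loop, daemon=True).start()


def handle_ping() -> tuple[int, str]:
    # Responds immediately; never blocks on the dependencies themselves.
    return (200, "ok") if is_healthy else (500, "unhealthy")
```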
Challenges with distributed systems – Link
- Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems.
- There are three types of distributed systems
- Offline distributed systems like batch jobs. Relatively easy to implement and benefit from distributed computing
- Soft real-time distributed systems like search indexers. Moderately complex to build.
- Hard real-time distributed systems. Request/reply services like frontend services, transaction processors, telephony, etc. Most difficult to build and maintain.
- A piece of code that makes a network call needs to handle the following scenarios:
- The server never receives the call
- The server receives the call but suffers a transient failure; the client can safely retry in this case
- The server receives the call but fails while processing it
- The request takes too long on the server and the client times out
- The client receives a successful response and updates its local state
- The client receives a response, but it is corrupt or incompatible
Avoiding fallback in distributed systems – Link
- Amazon avoids distributed fallback strategies since they are harder to test and often have latent bugs that show up only when an unlikely set of coincidences occur, potentially months or years after their introduction.
- Amazon avoids fallbacks by:
- Reducing the number of moving parts when responding to requests. If, for example, a service needs data to fulfill a request, and that data is already present locally (it doesn’t need to be fetched), there is no need for a failover strategy
- Running both the fallback and the non-fallback logic continuously. It must not merely run the fallback case but also treat it as an equally valid source of data. For example, a service might randomly choose between the fallback and non-fallback responses (when it gets both back) to make sure they’re both working.
- Avoiding retries turning into fallback by executing them all the time with proactive retry (also known as hedging or parallel requests). Proactive retry follows the design pattern of constant work. Because the redundant requests are always being made, no extra load from retries is added to the system as the need for the redundant requests increases.
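A minimal sketch of proactive retry (hedging) as described above: always send the request to two replicas and use whichever answers first, so the redundant work is constant rather than appearing only under failure. The `call_replica` helper is a hypothetical placeholder:

```python
import concurrent.futures


def call_replica(replica: str, request: dict) -> dict:
    """Hypothetical call to one server/replica of the downstream service."""
    ...


def proactive_retry(request: dict, replicas: list[str]) -> dict:
    """Always issue the request to two replicas and return whichever response arrives first.
    Because the redundant request is always made, load does not spike when things go wrong."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(call_replica, r, request) for r in replicas[:2]]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    pool.shutdown(wait=False)  # do not block on the slower replica
    return done.pop().result()
```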
Static stability using Availability Zones – Link
- In a statically stable design, the overall system keeps working even when a dependency becomes impaired. Perhaps the system doesn’t see any updated information (such as new things, deleted things, or modified things) that its dependency was supposed to have delivered. However, everything it was doing before the dependency became impaired continues to work despite the impaired dependency.
- Amazon EC2 is designed with static stability in mind. EC2 consists of a control plane and a data plane, and the data plane sub-system in particular is designed around the static stability principle.
- Being able to decompose a system into its control plane and data plane can be a helpful conceptual tool for designing highly available services, for a number of reasons:
- It’s typical for the availability of the data plane to be even more critical to the success of the customers of a service than the control plane.
- It’s typical for the data plane to operate at a higher volume (often by orders of magnitude) than its control plane. Thus, it’s better to keep them separate so that each can be scaled according to its own relevant scaling dimensions.
- We’ve found over the years that a system’s control plane tends to have more moving parts than its data plane, so it’s statistically more likely to become impaired for that reason alone.
- To achieve static stability, AWS deploys its own internal services across three Availability Zones. Each zone is overprovisioned by 50% so that each Availability Zone operates at only 66% of the level for which it has been load-tested.
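A quick back-of-the-envelope check of those numbers, assuming a service that needs 100 units of capacity spread over three Availability Zones (the concrete numbers are my own illustration):

```python
required_capacity = 100  # capacity needed to serve peak load
zones = 3
baseline_per_zone = required_capacity / zones   # ~33.3 units per AZ in normal operation
provisioned_per_zone = baseline_per_zone * 1.5  # 50% overprovisioning -> 50 units per AZ

utilization = baseline_per_zone / provisioned_per_zone
print(f"normal utilization per AZ: {utilization:.0%}")  # ~67%, the "operating at only 66%" figure

# If one AZ is lost, the remaining two must absorb the full load with no scale-up.
per_zone_after_failure = required_capacity / (zones - 1)  # 50 units per AZ
print(per_zone_after_failure <= provisioned_per_zone)     # True: still within tested capacity
```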