For the last couple of weeks I have been going over articles and videos in the Amazon Builders' Library. They cover useful patterns that Amazon uses to build and operate software. Below are the important points I captured while going over the material.
Reliability, constant work, and a good cup of coffee – Link
Amazon systems strive to solve problems using reliable constant work patterns. These work patterns have three key features:
One, they don’t scale up or slow down with load or stress.
Two, they don’t have modes, which means they do the same operations in all conditions.
Three, if they have any variation, it’s to do less work in times of stress so they can perform better when you need them most.
There are not many problems that can be solved efficiently using constant work patterns.
For example, if you're running a large website that requires 100 web servers at peak, you could choose to always run 100 web servers. This certainly reduces a source of variance in the system, and is in the spirit of the constant work design pattern, but it's also wasteful. For web servers, scaling elastically can be a better fit because the savings are large. It's not unusual to require half as many web servers off peak as during the peak.
Based on the examples given in the post, it seems that a constant work pattern is suitable for use cases where system reliability, stability, and self-healing are the primary concerns, and where it is acceptable for the system to do some wasteful work and cost more. These are essential concerns for systems that others build their own systems on. I think control plane systems fall into this category. The example mentioned in the post is a system that applies configuration changes to foundational AWS components like the AWS Network Load Balancer. The solution can be designed using either a push-based or a pull-based approach. The pull-based constant work approach lends itself to a simpler and more reliable design.
Although not mentioned in the post, the constant work that the system does should be idempotent.
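A minimal Python sketch of the pull-based idea, under my own assumptions: the class name `ConstantWorkPuller`, the `fetch_snapshot` callable, and the dictionary-shaped configuration are all hypothetical and not taken from the post. Every cycle fetches the complete configuration and applies all of it, whether or not anything changed, so the work per cycle does not depend on the rate of change.

```python
from typing import Callable, Dict

Config = Dict[str, str]


class ConstantWorkPuller:
    """Pulls the complete configuration on every cycle and applies all of it.

    The work per cycle depends only on the snapshot size, not on how many
    entries changed since the previous cycle. (Hypothetical illustration.)
    """

    def __init__(self, fetch_snapshot: Callable[[], Config]) -> None:
        self._fetch_snapshot = fetch_snapshot
        self.applied: Config = {}
        self.cycles = 0

    def run_cycle(self) -> None:
        snapshot = self._fetch_snapshot()
        # Idempotent apply: replace the whole state with the snapshot.
        # Applying the same snapshot twice leaves the same state, and a
        # missed cycle is repaired by the next one (self-healing).
        self.applied = dict(snapshot)
        self.cycles += 1
```

In a real system `run_cycle` would be called on a fixed timer. There is no "nothing changed, skip" branch, which is what keeps the component free of modes.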
This is an amazing read. Etsy engineer Salem Hilal shares their ES6 to TypeScript journey. In this post, he covers the strategy, the technical challenges they faced, the tooling they built, and how they educated their engineers to write effective TypeScript code. Etsy has been built over the last 16 years and they had 17,000 JS files. Migrating such a codebase is a multi-year effort. You need to have a clear plan and ensure there are no tail migration issues.
A couple of months back a customer wanted us to migrate their 20+ TB Oracle database to Postgres. They had hundreds of stored procedures written in Oracle. Their batch processing jobs were also written as stored procedures. They wanted to do the complete migration in a couple of months. We politely told them it was not possible. They went with another vendor that said they could do it in two months. Migrations are very risky. There are so many unknowns involved. For a vendor it is much more difficult because they don't even understand your functional requirements and code base. For migrations I prefer to be safe rather than sorry.
Little’s law states that the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system.
L = Average number of customers in a stationary system
λ = Average arrival rate in the system
W = Average time a customer spends in the system
In the context of an API it means:
L = Average number of concurrent requests the system is serving
λ = Average arrival rate of requests in the system
W = Average latency of each request
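Written as a formula, the law is L = λ × W. A small Python sketch of the arithmetic for the API case follows. The function names and the traffic numbers in the comments are made up for illustration and do not come from any real system.

```python
def concurrent_requests(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W (average requests in flight)."""
    return arrival_rate_per_s * avg_latency_s


def sustainable_arrival_rate(concurrency_limit: float, avg_latency_s: float) -> float:
    """Little's law rearranged: lambda = L / W."""
    return concurrency_limit / avg_latency_s


# Illustrative numbers only:
# 200 requests/s at 0.25 s average latency keeps 50 requests in flight.
in_flight = concurrent_requests(200, 0.25)

# A pool capped at 100 concurrent requests with 0.5 s average latency
# can sustain 200 requests/s on average.
max_rate = sustainable_arrival_rate(100, 0.5)
```

The rearranged form is the one I find most useful in practice: if latency doubles and the concurrency limit stays the same, the sustainable arrival rate halves.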
Here are 11 posts I thought were worth sharing this week.
To Learn a New Language, Read Its Standard Library – Link
I like the idea of learning a new language by reading its standard library. You learn the idiomatic way of writing code in a language by reading source code written by its original authors. I am planning to learn Rust, and I will give this approach a try. There are two situations in which you might struggle with this approach: 1) when the documentation is poor, and 2) when the standard library is implemented in a lower-level language.
In this post, Subbu Allamaraju shares his thoughts on how you can be both a nice and an effective leader. He talks about six different leadership styles and how those styles create positive and negative climates. I am in my first engineering leadership role and still figuring out my leadership style. Based on my limited leadership experience, I think a leader can have multiple leadership styles depending on the situation. There are times you have to course correct and change your leadership style based on the situation and context. Also, I think leaders can be “nice” and “not nice” depending on the context. Leadership is hard.
42 things I learned from building a production database – Link
Not a deep technical post, but it has many useful pieces of advice from Mahesh Balakrishnan. He worked on a Chubby-like system at Facebook. My favorites:
Be conservative on APIs and liberal with implementations
When designing APIs, write code for one implementation; plan actively for the second implementation; and hope/pray that things will work for a third implementation.
Anything that can’t be measured easily (e.g., consistency) is often forgotten; pay particular attention to attributes that are difficult to measure
One of the advantages of a Microservices architecture is that it enables components to have deployment independence. Based on my consulting and software development experience, deployment independence is often overlooked, and very few teams achieve it. Deployment independence is important since it brings true agility and reduces communication overhead between different teams and services.
Shared libraries make Microservices tightly coupled and introduce hard dependencies. A team making a change to a shared library now has to ensure that it does not break another service that depends on it. This requires communication between multiple teams. Also, a change in a shared library forces all the services that depend on it to be redeployed. This leads to long build, release, and deployment times. We might have to consider the deployment order of services as well. All this leads to more synchronization and communication between teams. So, it is recommended that teams avoid using shared libraries in a Microservices architecture.
It is hard to retain good talent in tech. I agree with the reasons given for why employees are resigning in huge numbers. Shortage of good talent, better compensation, lack of purpose, burnout, and career advancement are the reasons that I hear as well. One reason that I don't see covered in the post is poor leadership skills. It might be implicit, but I think it should be called out as well. Good leadership can provide a sense of belonging and purpose that can help retain good employees.
One of the common Web API design anti-patterns that I see in the field is the exposure of the database model in the API contract. If you are building a Java Spring Boot JPA application, this means exposing JPA entities as the Web API's request and response objects. The primary reason this happens is that most teams do not follow a contract-first model of API design. They start from the code and the database schema and then derive the API contract from them.
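The alternative is to keep a separate contract type and map to it explicitly. Below is a minimal sketch of that separation, written in Python with dataclasses rather than Spring and JPA to keep it self-contained. The `CustomerRow` and `CustomerResponse` types, their fields, and the `to_response` mapper are hypothetical names I made up for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class CustomerRow:
    """Persistence model: mirrors a hypothetical customers table."""
    id: int
    first_name: str
    last_name: str
    email: str
    password_hash: str   # must never leave the service
    created_by_job: str  # internal bookkeeping column


@dataclass(frozen=True)
class CustomerResponse:
    """API contract: only the fields clients are promised."""
    customer_id: str
    full_name: str
    email: str


def to_response(row: CustomerRow) -> CustomerResponse:
    # Explicit mapping: new or renamed table columns do not reach the
    # API unless this function is changed on purpose.
    return CustomerResponse(
        customer_id=str(row.id),
        full_name=f"{row.first_name} {row.last_name}",
        email=row.email,
    )


row = CustomerRow(7, "Ada", "Lovelace", "ada@example.com", "x", "import")
body = asdict(to_response(row))  # the dict that would be serialized to JSON
```

The mapping function is extra code, but it is the one place where the database schema and the API contract are allowed to differ.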
This is not the first time I have seen this anti-pattern being applied by development teams. I have seen it often, so I decided to document it so that in the future I can share this post. The advantages of documenting such lessons/patterns/practices are:
I can be thorough in my explanation. Writing helps me understand if my point is valid. Writing is thinking for me.
While explaining to a developer I might forget a key point.
It gives the development team time to reflect upon the feedback by themselves.
Discussion after going over the post might be more productive.
I can keep updating this post.
The following are the reasons I think we should avoid exposing the database model as an API contract.
Here are 7 posts I thought were worth sharing this week.
Google: A Collection Of Best Practices For Production Services – Link
This is an amazing read. Building resilient systems is hard. The first step to building resilient systems is to become aware of the practices that are used in the trenches. All the practices are worth reading and knowing, and you should look for opportunities to apply them in your environment. Every few weeks I see teams struggling with making configuration changes safely. The article gives some practical advice on this. Writing fail-safe and resilient HTTP clients is not easy. HTTP clients are used heavily in Microservices architecture. Most developers consider only the happy path, where a service either succeeds or fails with expected response codes. But we also need to consider retries with jitter, timeouts, queueing, load shedding, etc. while building HTTP clients. This article covers a few more practices that can help us build resilient systems.
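To make "retries with jitter" concrete, here is a minimal Python sketch of a bounded retry loop with capped exponential backoff and full jitter. It is my own illustration, not code from the article: `TransientError`, `backoff_delay`, `call_with_retries`, and the default values are all hypothetical. The sketch assumes the wrapped operation enforces its own timeout and raises `TransientError` for failures worth retrying.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures a bounded number of times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Randomized delay so many clients do not retry in lockstep.
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The retry count is bounded on purpose. Unbounded retries turn a brief outage into extra load on a service that is already struggling.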