distributed-systems – Page 2

Paper Summary: How Complex System Fail

Today, I read paper How Complex System Fail by Richard Cook. This paper was written in 2000. This is a short and easy to read paper written by a doctor. He writes about his learnings in the context of patient care, but most of them are applicable to software development as well.

In this post, I go over the points that I liked from the paper and write how I think they are applicable to software development.

The first point raised in the paper is

All complex systems are intrinsically hazardous systems.

With respect to complex software that most of us build, I don’t think hazardous will be the right word. Luckily, no one will die if the software we write fails. I see it as all complex systems are intrinsically important systems for an organisation. Most of the complex software has business and financial importance. So if the software fails, organization will end up loosing money and reputation. These can be catastrophic for the organization and might bring an organization to its knees. It is the business importance of these systems that drives the creation of defences against failure.

This takes us to the next point in the paper:

Complex systems are heavily and successfully defended against failure.

I don’t think from the day one most complex systems have successful defensive mechanisms placed. But, I agree that over time mechanisms will be added to successful operate the application. Few of the mechanisms that we use in software are data backup, redundancy, monitoring, periodic health checks, training people, creating change approval process, and many others. With time, an organization builds a series of shields that normally divert operations away from failures. Another key point relevant to software is that when we try to rebuild a system we should expect that the new system will become stable over time. Many times you deploy a new system and it fails because the mechanisms used in old system might not be sufficient to run the new system. Over time, an organization will put in place mechanisms to successfully operate the new application.

The next point mentioned in this paper is:

Catastrophe requires multiple failures – single point failures are not enough.

I disagree with author that single point of failures are not enough for catastrophic failures. I think it depends on the way the complex software is built. If your complex software systems is built in such a way that fault in one service brings cause failure in the service using the first one. Then, a single failure in the least important component can cause the entire application to break. So, if you are building complex distributed applications then you should use circuit breakers, service discovery, retries, etc to build resilient applications.

The next point raised by author is not obvious to most of us:

Complex systems contain changing mixtures of failures latent within them.

Author says that it is impossible for us to run complex systems without flaws. I have read a similar point somewhere that Amazon tries for five-nine availability i.e. 99.999%. Five-nines or 99.999% availability means 5 minutes, 15 seconds or less of downtime in a year. They don’t try to go further because it becomes too costly after this time and customers can’t notice the difference between their internet not working and service becoming unavailable.

A corollary of the above point is:

Complex systems run in degraded mode.

The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

The next point in the paper is:

Catastrophe is always just around the corner.

Another way to put the above point is Failure is the norm. Deal with it. If you are building distributed systems then you must be clear in your mind that different components will fail. They might lead to complete failure of the system. So, we should try to build systems that can handle individual component failure. Author writes, it is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.

Then, author goes on to write:

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.

I don’t agree entirely with this point. The author is saying that there is no single cause for failure. According to him, there are multiple contributors to the failure. I agree that multiple reasons would have let to failure. But, I don’t think root cause analysis is fundamentally wrong. I think root cause analysis is the starting point to understand why things go wrong. In software world, blameless postmortem is done to understand failure with the intention of not finding which team or person is responsible but to learn from failure and avoid such failure in future.

The next point brings human biases into picture:

Hindsight biases post-accident assessments of human performance.

It always look obvious in the hindsight that we should have avoided the failure given all the events that happened. This is the most difficult point to follow in practice. I don’t know how we can overcome biases. Author writes, Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

The next point mentioned in the paper is:

All practitioner actions are gambles.

This is an interesting point. If you have ever tried to bring a failure system back to state, you will agree with the author. In those uncertain times, we try different things to bring the system up but at best they are our guesses based on our previous learnings. Author writes, practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I like the way author makes a point about human expertise:

Human expertise in complex systems is constantly changing.

I think most software organisations want to use the latest and greatest technology to build complex systems but they don’t invent enough in developing expertise. Humans are the most critical part of any complex system so we need to plan for their training skill refinement as a function of system itself. Failure to do this will lead to software failures.

As they say, change is the only constant. The next point talks:

Change introduces new forms of failure.

I will quote directly from the paper as author puts it eloquently

The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The last point mentioned in the paper sums it all:

Failure free operations require experience with failure.

As they say, failure is the world greater teacher. When you encounter performance issues or difficult to predict situations then we push the envelope of our understanding of the system. These are the situations that help us discover new mechanisms that we need to put in our system to handle future failures.

Conclusion

I thoroughly enjoyed reading this paper. It has many learnings for software developers. If we become conscious about them I think we can write better resilient software.

Paper Summary: Simple Testing Can Prevent Most Failures

Today evening, I decided to read paper Simple Testing Can Prevent Most Failures: An analysis of Production Failures in Distributed Data-intensive systems.

This paper asks an important question

Why widely used distributed data systems designed for high availability like Cassandra, Redis, Hadoop, HBase, and HDFS. experience failures and what can be done to increase their resiliency?

We have to answer this question keeping in mind that these systems are developed by some of the best software developers in the world following good software development practices and are intensely tested.

These days most of us are building distributed systems. We can apply the findings shared in this post to build systems that are more resilient to failure.

The paper shares:

Most of the catastrophic system failures are result of incorrect handling of non-fatal errors explicitly signalled in the software

This falls into 1) empty error handling blocks, or error blocks with just log statement 2) the error handing aborts the clusters on an overly-general exception 3) the error handling code contains expressions like “FIXME” or “TODO” in the comments.

Most of the developers are guilty of doing all the three above mentioned. Developers are good at finding that something will go wrong but they don’t know what to do when something goes wrong. I looked at the error handling code in one of my projects and I found the same behaviour. I had written TODO comments or caught general exceptions. These are considered to be bad practices but still most of us end up doing.

Overall, we found that the developers are good at anticipating possible errors. In all but one case, the errors were checked by the developers. The only case where developers did not check the error was an unchecked error system call return in Redis.

Another important point mentioned in the paper is

We found that 74% of the failures are deterministic in that they are guaranteed to manifest with an appropriate input sequence, that almost all failures are guaranteed to manifest on no more than three nodes, and that 77% of the failures can be reproduced by a unit test.

Most popular open source projects use unit testing so it could be surprising that the existing tests were not good enough to catch these bugs. Part of this has to do with the fact that these bugs or failure situations happens when a sequence of events happen. The good part is that sequence is deterministic. As a software developer, I could relate to the fact that most of us are not good at thinking through all the permutation and combinations. So, even though we write unit tests they do not cover all scenarios. I think code coverage tools and mutation testing can help here.

It is now universally agreed that unit testing helps reduce bugs in software. Last few years, I have worked with few big enterprises and I can attest most of their code didn’t had unit tests and even if parts of the code had unit tests those tests were useless. So, even though open source projects that we use are getting better through unit testing most of the code that an average developer writes has a long way to go. One thing that we can learn from this paper is to start write high quality tests.

The paper mentions specific events where most of the bugs happen. Some of these events are:

Starting up services
Unreachable nodes
Configuration changes
Adding a node

If you are building distributed application, then you can try to test your application for these events. If you are building applications that uses Microservices based architecture then these are interesting events for your application as well. For example, if you call a service that is not available how your system behaves.

As per the paper, these mature open-source systems has mature logging.

76% of the failures print explicit failure related messages.

Paper mentions three reasons why that is the case:

First, since distributed systems are more complex, and harder to debug, developers likely pay more attention to logging.
Second, the horizontal scalability of these systems makes the performance overhead of outputing log message less critical.
Third, communicating through message-passing provides natural points to log messages; for example, if two nodes cannot communicate with each other because of a network problem, both have the opportunity to log the error.

Authors of the paper built a static analysis tool called Aspirator for locating these bug patterns.

If Aspirator had been used and the captured bugs fixed, 33% of the Cassandra, HBase, HDFS, and MapReduce’s catastrophic failures we studied could have been prevented.

Overall, I enjoyed reading this paper. I found it easy to read and applicable to all software developers.

Thinking about software system in terms of reliability, scalability, and maintainability

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. – Leslie Lamport

Last six months I was building pricing engine for a client. The application was built using multiple components:

We had a change data capture pipeline built using AWS Kinesis that read data from IBM DB2 and writes to PostgreSQL to keep database in sync with changes happening in the source system
We were storing denormalised documents in AWS ElastiCache i.e. Redis
We had a batch job that was doing one time load of the PostgreSQL database
We had a near cache that helped us process our worst requests in few hundred milliseconds

When you build a system using multiple independent components then you have to keep in mind that you are building a data system that it needs to provide certain guarantees. In our case, we had to guarantee:

AWS ElastCache i.e Redis will be updated with changes happening in the source system in less than 30 seconds
Near-cache will be invalidated and updated with latest data so that clients accessing the system will get consistent results. Keeping a global cache like Redis is easier to keep in sync than keeping near-cache in sync. We came up with a novel way to keep near cache in sync with the global cache.
Data will not be lost by our change data capture pipeline. If processing of a message failed then we retry the message
There will be times when different data components i.e. PostgreSQL, Redis, and near-cache will have different state. But, eventually it should become consistent
That there will be a mechanism to observe state of the system at any point of time

Like it or not systems that we are building are becoming more and more distributed. This means there are many more ways they can fail. To help build software systems that meets the end goal, we should keep following three concerns in our mind. These should be defined as clearly as possible so that every team member keep these in mind while building software systems.

Reliability
Scalability
Maintainability

Continue reading “Thinking about software system in terms of reliability, scalability, and maintainability”

Two-phase commit protocol

Welcome to the fourth post in the distributed systems series. In the last post, we covered ACID transactions. ACID transactions guarantee

Atomicity: Either all the operation succeed or none
Consistency: System moves from one consistent state to another at the successful completion of a transaction
Isolation: Concurrent transactions do not interfere with each other
Durability: After successful completion of a transaction all changes made by the transaction persist even in the case of a system failure

If the database is running on a single machine then it is comparatively easier to guarantee ACID semantics in comparison to a distributed database. Following are the reasons you would want to run a database in a distributed fashion:

To be fault-tolerant
To handle more reads and writes

Let’s assume our’s is a read intensive application and our single machine database is not able to scale to our demand. One of the solution to scale read is achieved through replication. The most common replication topology is single master and multiple slaves. All the writes go to the master and reads are performed on slaves. Data from master is replicated to the salves synchronously or asynchronously. In this post, we will assume synchronous replication.

Continue reading “Two-phase commit protocol”

The Minimalistic Guide to ACID Transactions

Welcome to the third post of distributed system series. So far in this series, we have looked at service discovery and CAP theorem. Before we move along in our distributed system learning journey, I thought it will be useful to refresh our memory with understanding of ACID transactions. ACID transactions are at the heart of relational databases. The knowledge of ACID transactions is useful when building distributed applications.

Understanding ACID transactions

A transaction is a sequence of operations that form a single logical unit of work. These transactions are executed on a shared database system to perform a higher-level function. An example of higher-level function is transferring money from one account to another. Transactions represent a basic unit of change in the database. It either executed in its entirety or not at all.

ACID (Atomicity, Consistency, Isolation, and Durability) refers to a set of properties that a database transaction should guarantee even in the event of errors, power failure, etc. The canonical example of ACID transaction is transfer of funds from one bank account to another. In a single fund transferring transaction, you have to check the account balance, debit one account, and credit another transaction. ACID properties guarantee that either money transfer from one account to other occur correctly and permanently or in case of failure both accounts have the same initial state. It would be unacceptable if one account was debited but the other account was credited.

Database transactions are motivated by two independent requirements:

Concurrent database access: Multiple clients can access the system at the same time. This is achieved by the Isolation property of ACID transaction.
Resiliency to system failures: System remains in consistent state in case of a system failure. This is provided by Atomicity, Consistency, and Durability properties of ACID transaction.

Continue reading “The Minimalistic Guide to ACID Transactions”

CAP Theorem for Application Developers

Most of us are building distributed systems. This is a fact. According to Wikipedia, a distributed system is a system whose components are located on different networked computers, which then communicate and coordinate their actions by passing messages to each other. A distributed system could either be a standard three-tier web application or it could be a massive multiplayer online game.

The goal of a distributed system is to solve a problem that can’t be solved on a single machine. A single machine can’t provide enough compute or storage resources required to solve the problem. The user of a distributed system perceives the collection of autonomous machines as a single unit.

The distributed systems are complex as there are several moving parts. You can scale out components to finish the workloads in a reasonable time. Because of numerous moving parts and their different scaling needs it becomes difficult to reason out the characteristics of a distributed applications. CAP theorem can help us.

Continue reading “CAP Theorem for Application Developers”

Service Discovery for Modern Distributed Applications

I am starting a new blog series (with no end date) from today. In this series, I will pick a topic and go in-depth so that I don’t just scratch the surface of the topic. The goal is to build a habit of learning each week and share it with the community. For the next few months, I will write on different aspects of building distributed systems. Each Wednesday, you may expect a new post.

I am sure we all have built applications where one application uses another application to do its job. Most of the time, applications communicate with each other using HTTP REST API but it can be other communication mechanisms like gRPC, Thrift, Message Queues as well. For example, if you are building an application that needs Twitter service for fetching tweets. To call Twitter API, you will need the API URL and access keys to make a successful API call. Most often we rely on static configuration either in the form of a configuration file or environment variable to get the API URL. This approach works fine when you are working with third party APIs like Twitter as their API URLs do not change often. The static configuration approach fails when we build a Microservices architecture based application. The definition of Microservices that I like is by Martin Fowler as described in his blog,

Microservice architecture style is an approach to developing a single application as a suite of small services, each running its own process and communicating with lightweight mechanisms, often an HTTP resource API.

Continue reading “Service Discovery for Modern Distributed Applications”