Paper Summary: Simple Testing Can Prevent Most Failures

This evening, I decided to read the paper Simple Testing Can Prevent Most Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems.

This paper asks an important question:

Why do widely used distributed data systems designed for high availability, such as Cassandra, Redis, Hadoop, HBase, and HDFS, experience failures, and what can be done to increase their resiliency?

We have to answer this question keeping in mind that these systems are developed by some of the best software developers in the world following good software development practices and are intensely tested.

These days most of us are building distributed systems. We can apply the findings shared in this post to build systems that are more resilient to failure.

The paper shares:

Almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.

Many of these handling mistakes are trivial and fall into three patterns: 1) the error handler is empty or contains only a log statement; 2) the error handler aborts the cluster on an overly general exception; 3) the error handler contains expressions like “FIXME” or “TODO” in the comments.

Most developers are guilty of all three. Developers are good at anticipating that something will go wrong, but they often don’t know what to do when it does. I looked at the error handling code in one of my projects and found the same behaviour: I had written TODO comments or caught general exceptions. These are considered bad practices, yet most of us still end up doing them.
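
To make the patterns concrete, here is a minimal Python sketch. It is not taken from the paper or from any of the systems it studies; the names ReplicaUnavailable, write_to_replica, replicate_swallowing, and replicate_checked are hypothetical and exist only for illustration. The first function shows a handler that catches everything, logs, and leaves a TODO. The second catches only the expected error and reports the outcome to the caller.

```python
import logging

logger = logging.getLogger("replica")


class ReplicaUnavailable(Exception):
    """Hypothetical non-fatal error: one replica could not be reached."""


def write_to_replica(replica, record):
    # Hypothetical I/O call; here it fails for replicas marked as down.
    if replica.get("down"):
        raise ReplicaUnavailable(replica["name"])
    replica.setdefault("records", []).append(record)


def replicate_swallowing(replicas, record):
    """Anti-pattern: the handler only logs, so callers believe every write succeeded."""
    for replica in replicas:
        try:
            write_to_replica(replica, record)
        except Exception:  # overly general, and the failure is then ignored
            logger.warning("write failed")  # TODO: handle this properly
    return True


def replicate_checked(replicas, record, min_acks):
    """Catch only the expected error, record it, and report the outcome to the caller."""
    acks, failed = 0, []
    for replica in replicas:
        try:
            write_to_replica(replica, record)
            acks += 1
        except ReplicaUnavailable as err:
            logger.warning("replica %s unavailable", err)
            failed.append(replica["name"])
    # The caller decides what to do when too few replicas acknowledged the write.
    return acks >= min_acks, failed
```

With one of three replicas down, replicate_swallowing still returns True, while replicate_checked returns the list of failed replicas and whether the required number of acknowledgements was reached.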

Overall, we found that the developers are good at anticipating possible errors. In all but one case, the errors were checked by the developers. The only case where developers did not check the error was an unchecked error system call return in Redis.

Another important point mentioned in the paper is:

We found that 74% of the failures are deterministic in that they are guaranteed to manifest with an appropriate input sequence, that almost all failures are guaranteed to manifest on no more than three nodes, and that 77% of the failures can be reproduced by a unit test.

Most popular open source projects use unit testing, so it may be surprising that the existing tests were not good enough to catch these bugs. Part of the reason is that these failures occur only when a specific sequence of events happens. The good part is that the sequence is deterministic. As a software developer, I can relate to the fact that most of us are not good at thinking through all the permutations and combinations. So, even though we write unit tests, they do not cover all scenarios. I think code coverage tools and mutation testing can help here.

It is now widely agreed that unit testing helps reduce bugs in software. Over the last few years, I have worked with a few big enterprises, and I can attest that most of their code didn’t have unit tests, and where parts of the code did have unit tests, those tests were useless. So, even though the open source projects we use are getting better through unit testing, most of the code that an average developer writes has a long way to go. One thing we can learn from this paper is to start writing high-quality tests.

The paper mentions specific events where most of the bugs happen. Some of these events are:

  1. Starting up services
  2. Unreachable nodes
  3. Configuration changes
  4. Adding a node

If you are building a distributed application, you can test how it behaves during these events. If you are building applications that use a microservices-based architecture, these events are relevant to you as well. For example, if you call a service that is not available, how does your system behave?
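
The unreachable-service case can be exercised in an ordinary unit test. The sketch below is only an illustration under stated assumptions: PriceClient and price_or_fallback are hypothetical names, and falling back to a cached value is just one possible policy. It uses unittest.mock from the Python standard library to simulate the service being down.

```python
import unittest
from unittest import mock


class PriceClient:
    """Hypothetical client for a remote pricing service."""

    def fetch(self, sku):
        raise NotImplementedError("performs a network call in a real system")


def price_or_fallback(client, sku, cached_prices):
    """Return the live price, or the last cached price if the service is unreachable."""
    try:
        return client.fetch(sku)
    except ConnectionError:
        # Expected, non-fatal error: fall back to the cached price.
        # A KeyError is raised if the SKU was never cached.
        return cached_prices[sku]


class UnreachableServiceTest(unittest.TestCase):
    def test_falls_back_to_cache_when_service_is_down(self):
        client = mock.Mock(spec=PriceClient)
        client.fetch.side_effect = ConnectionError("service unreachable")
        self.assertEqual(price_or_fallback(client, "sku-1", {"sku-1": 42}), 42)

    def test_uses_live_price_when_service_is_up(self):
        client = mock.Mock(spec=PriceClient)
        client.fetch.return_value = 40
        self.assertEqual(price_or_fallback(client, "sku-1", {"sku-1": 42}), 40)
```

Saved in a test module, these tests can be run with python -m unittest. The same approach extends to the other events in the list, for example by simulating a configuration change between two calls.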

According to the paper, these mature open-source systems have mature logging:

76% of the failures print explicit failure related messages.

The paper mentions three reasons why that is the case:

  1. First, since distributed systems are more complex, and harder to debug, developers likely pay more attention to logging.
  2. Second, the horizontal scalability of these systems makes the performance overhead of outputting log messages less critical.
  3. Third, communicating through message-passing provides natural points to log messages; for example, if two nodes cannot communicate with each other because of a network problem, both have the opportunity to log the error.

The authors of the paper built a static analysis tool called Aspirator for locating these bug patterns.

If Aspirator had been used and the captured bugs fixed, 33% of the Cassandra, HBase, HDFS, and MapReduce’s catastrophic failures we studied could have been prevented.

Overall, I enjoyed reading this paper. I found it easy to read and applicable to all software developers.

What Motivates After You Have Met the Baseline?

In 2016, I wrote a technology series, 52-technologies-in-2016, in which I set out to post a new technical blog/tutorial every week for a year (I managed to write 43 posts). The series became quite popular, receiving more than 7000 stars on GitHub and close to a million page views. I created a page in the project’s GitHub repository covering the mentions it received from the community. I received thank-you notes from software developers around the world. Though it involved a good amount of hard work, I thoroughly enjoyed doing it.

Back in 2017, I remember telling a friend of mine about this achievement and how it had reached thousands of developers around the world. After listening to me for a while, he said, “But you did not make any money with it.” Well, he was right. I didn’t make any money with it. I politely answered that money was not my ultimate goal for the series.

Locust: Load test your REST API

Recently, I wanted to load test one of my applications. In my previous projects, I have used Apache JMeter and Gatling. Both are great tools, but I wanted something with the ease of Apache Benchmark and a web user interface. In my limited experience with Gatling and JMeter, I found that it takes time to get started with them. I don’t use these tools daily, so there is always a learning curve for me.

My quest for an easy-to-use load testing tool led me to Locust. It took me no time to get started, and I was able to understand the performance characteristics of my application by running a load test. I found Locust to be a great tool if you want to find out how many requests per second your system can support and how response time varies across different percentiles of requests.

Locust is an open source load testing tool written in Python. It can simulate millions of users to load test your application. It has an intuitive user interface that you can use to easily get started with it. It allows you to define custom behaviour using Python code.

Getting Started with Amazon Corretto: Production Ready Distribution of OpenJDK

In November 2018, James Gosling, the father of Java (who now works at Amazon), announced Corretto at the Devoxx Belgium conference. Amazon Corretto is a no-cost, multiplatform, production-ready distribution of OpenJDK. This comes at a time when Oracle has announced that it will no longer provide free binary downloads of the JDK after a six-month period, nor will it patch OpenJDK with fixes after that period. Six months is the new release cycle for JDK versions; if you follow Java, you may already know about this change. At the time of writing, the latest version of the JDK is 12.

Issue #18: 10 Reads, A Handcrafted Weekly Newsletter for Humans

The total time to read this newsletter is 130 minutes.
Fortune favors the prepared mind. – Louis Pasteur
  1. Three Sales Mistakes Software Engineers Make: 15 mins read. This post by the PipelineDB folks talks about three sales mistakes software engineers make. I myself find it difficult when I have to take part in any sales initiative. The truth is we all have to sell. Sales does not always mean selling a product. It could be as simple as sharing your idea with an audience. It requires social skills that most software engineers lack. The three mistakes are:
    1. Building a product before validating the market for it. This is part of the lean philosophy. I don’t think it is always feasible to have an audience with which you can validate your idea. So, I think in some cases it makes sense to build a functional MVP and iterate from there. The MVP should not take more than 3 months.
    2. Talking instead of listening. The key message here is to listen to your audience and ask open-ended questions. Some examples of questions you can ask:
      1. How do you think about this problem?
      2. To what extent is this a priority?
      3. Why are you interested in this topic?
      4. Etc.
    3. Mistaking interest for demand. Until you get money in your account, your work is not done. IBM salespeople use BANT (Budget, Authority, Need, Timeline) to qualify sales leads:
      1. Do they have enough budget to purchase the product?
      2. Do they have the authority to make the purchase?
      3. Do they need your product?
      4. Will the transaction be completed in a timeline that is acceptable to you?
  2. Amazon’s HQ2 Spectacle Isn’t Just Shameful—It Should Be Illegal: 20 mins read. This is yet another story of corporations getting things the way they want. They take billions of dollars in government subsidies and give back far less. This is true in all parts of the world. In India, we have seen loans worth crores of rupees given to corporations. When the time comes to repay, the system allows them to get away easily. All governments are hand in glove with corporations. We just don’t matter.
  3. Cloud Computing without Containers: 15 mins read. This looks interesting. I thought containers were the best we could do. But, as mentioned in this post, there are other possibilities like Isolates that can provide more efficient and economical alternatives. This post does a good job of comparing Isolates with AWS Lambda, which uses containers underneath. I will dig deeper into it. Overall, a great post by the Cloudflare folks.
  4. Things I learned from working at Shopify: 10 mins read. This is a great post by Budi Tanrim, a software engineer at Shopify. In this post, he talks about why he left an amazing job at Shopify to go back to Indonesia. A few points from his post that resonated with me:
    1. Come with a learning mindset: I often go on consulting assignments, and I have a tendency to come up with solutions before understanding the problem statement well enough. As he mentions in his post, try to understand the context first and then think about the solution.
    2. Be comfortable with being uncomfortable: When life events do not go as planned, don’t get uncomfortable. If you get stressed, things will get worse. It is always better to take a step back and think about the situation again.
    3. Prepare before presenting your work: This is essential if you want to make an impact. Many times the right words don’t come out at the right time, so preparation helps a lot.
    4. Make a decision log to have firmer decisions: I have also started doing this, but I am not consistent. I agree with Budi that it is essential to document your decisions. The time you spend today documenting a decision will serve you tomorrow when you need to explain your rationale.
  5. What would a message-oriented programming language look like?: 10 mins read. The author thinks the answer is Erlang. I have not worked with Erlang, but I have used Akka with Scala in the past. Erlang, like Akka, is based on the actor model, and when you work with the actor model, different objects communicate with each other using message passing. So, maybe the answer is any language that supports the actor model.
  6. Lessons from the data lake, part 1: Architectural decisions: 15 mins read. This post by AutoTrader engineers goes over the architectural decisions they made while building their own data lake. They started with an on-premise solution but soon faced a series of operational issues. As the author writes, “Cluster computing on-premises is hard and expensive, cloud is easier.” After failing with the on-premise solution, they built a new one using Amazon S3 and Apache Spark, delivered through AWS EMR. They used Terraform to provision cloud resources. To impose structure on the lake, they confined data to five ‘zones’ (in practice, five S3 buckets) named transient, raw, refined, user, and trusted. They used Apache Avro to achieve schema on read.
  7. There’s Seldom Any Traffic on the High Road: 5 mins read. Another meaningful post on Farnam Street. It makes an important point about not reacting when someone behaves rudely toward you. As the author writes, “She was being rude. Yes. But that wasn’t the best version of her.” I see the value of learning this skill. Making enemies is expensive. Sometimes you don’t even realize how expensive.
  8. Peeking under the hood of redesigned Gmail: 15 mins read. This post does a good analysis of the performance issues with the new Gmail interface. Using the tools available in Google Chrome, the author was able to find possible reasons for Gmail’s poor performance. It is sad that the Gmail team does not use the facilities provided by Google’s own browser. I recommend reading this article, as you can apply the same lessons to your own website.
  9. Dealing with significant Postgres database bloat — what are your options? 15 mins read. When data is updated or deleted in Postgres, new data is written. The old data then needs to be vacuumed. That unvacuumed data is known as bloat. Here’s a look at how you can deal with it.
  10. Scalability Worst Practices: 10 mins read. This is an old blog post published in 2008. The worst practices it describes are still applicable today. So, give it a read.

Thinking about software systems in terms of reliability, scalability, and maintainability

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. – Leslie Lamport

For the last six months, I was building a pricing engine for a client. The application was built using multiple components:

  1. We had a change data capture pipeline built using AWS Kinesis that read data from IBM DB2 and wrote it to PostgreSQL to keep the database in sync with changes happening in the source system
  2. We were storing denormalised documents in AWS ElastiCache, i.e. Redis
  3. We had a batch job that did a one-time load of the PostgreSQL database
  4. We had a near-cache that helped us process our worst requests in a few hundred milliseconds

When you build a system using multiple independent components, you have to keep in mind that you are building a data system that needs to provide certain guarantees. In our case, we had to guarantee:

  1. AWS ElastiCache, i.e. Redis, will be updated with changes happening in the source system in less than 30 seconds
  2. The near-cache will be invalidated and updated with the latest data so that clients accessing the system get consistent results. A global cache like Redis is easier to keep in sync than a near-cache. We came up with a novel way to keep the near-cache in sync with the global cache.
  3. Data will not be lost by our change data capture pipeline. If processing of a message fails, we retry the message
  4. There will be times when the different data components, i.e. PostgreSQL, Redis, and the near-cache, have different states. But eventually they should become consistent
  5. There will be a mechanism to observe the state of the system at any point in time
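
The retry guarantee in the list above can be sketched as follows. This is a minimal illustration under assumptions, not the pipeline we built: TransientError, process_with_retry, and the dead_letters list are hypothetical names, and the sketch assumes the handler is idempotent so that applying a message more than once is safe.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def process_with_retry(message, handler, max_attempts=3, base_delay=0.0, dead_letters=None):
    """Apply handler to message, retrying on TransientError.

    Returns True on success. If every attempt fails, the message is appended
    to dead_letters (when provided) so that it is not silently lost.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except TransientError:
            if attempt < max_attempts:
                # Exponential backoff between attempts: base_delay, 2x, 4x, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    if dead_letters is not None:
        dead_letters.append(message)
    return False
```

Only the expected error type is retried; any other exception propagates to the caller, so unexpected failures are not hidden behind the retry loop.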

Like it or not, the systems we are building are becoming more and more distributed. This means there are many more ways they can fail. To build software systems that meet their end goal, we should keep the following three concerns in mind. They should be defined as clearly as possible so that every team member keeps them in mind while building software systems.

  1. Reliability
  2. Scalability
  3. Maintainability

The Compound Effect

We are all looking for a quick way to earn money, lose weight, build relationships, get a promotion at work, or become successful in life. I have failed numerous times in my efforts to achieve my goals. We all give up too quickly. Failing is not an issue if you fail after giving your best. Most of the time we fail because we don’t try hard enough. We give up too soon.
