Category Archives: distributed-systems

MemSQL Introduction: A Hybrid transactional/analytical processing database

[Figure: MemSQL how-it-works ecosystem diagram]

MemSQL is a fast, commercial, ANSI SQL-compliant, highly scalable HTAP database. HTAP databases are those that support both OLTP and OLAP workloads. It supports ACID transactions just like a regular relational database. It also supports document and geospatial data types.

MemSQL is fast because it stores data in-memory. But that does not mean it is not durable: it maintains a copy of the data on disk as well. Transactions are committed to a transaction log on disk and later compressed into full-database snapshots. One of the main reasons new databases are designed in-memory first is that memory keeps getting cheaper; it is estimated that memory prices fall by about 40% every year.

MemSQL has tuneable durability. You can make it fully durable or completely ephemeral, and writes to the transaction log can be synchronous or asynchronous.
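To make the durability discussion concrete, here is a minimal sketch of talking to MemSQL from Java. Because MemSQL speaks the MySQL wire protocol, the standard MySQL JDBC driver works; the host, port, and credentials are placeholders, and the exact CREATE DATABASE durability clause may differ between MemSQL versions, so treat this as an illustration rather than copy-paste-ready code.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MemSqlDurabilitySketch {
        public static void main(String[] args) throws Exception {
            // MemSQL is MySQL wire-protocol compatible, so the MySQL JDBC driver is enough.
            // Host, port, user, and password are placeholders for your own cluster.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://127.0.0.1:3306/", "root", "");
                 Statement stmt = conn.createStatement()) {
                // Ask for synchronous durability: a transaction is acknowledged only after it
                // reaches the on-disk transaction log. ASYNC DURABILITY trades that guarantee
                // for lower latency. The clause below follows the MemSQL docs and may vary by version.
                stmt.execute("CREATE DATABASE demo WITH SYNC DURABILITY");
            }
        }
    }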

MemSQL simplifies your architecture because you don’t have to write ETL jobs to move data from one data store to another. This is the biggest selling point of any HTAP database.

Continue reading

System Design: Design the Amazon Recently Viewed Items Page API

I enjoy working through system design problems. It helps me think about how I would design interesting features of various systems. I will keep posting design solutions to interesting problems.

Today, I will share how I would design the Amazon recently viewed items page. You can view this page by going to https://www.amazon.com/gp/history/

For me, it showed the last 73 items I had viewed on Amazon.com. I don’t think they show the last N items; rather, they show the items viewed in the last X days (or months), up to some maximum limit.

Let’s redefine the problem now that we understand it better.

Design the Amazon recently viewed items page API. The recently viewed items are all the items that you viewed in the last 6 months, capped at a maximum of 100 items.
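Before getting into the full design, here is a minimal in-memory sketch of the per-user data structure I have in mind: a map from item id to its last-viewed timestamp, trimmed to the six-month window and capped at 100 items when read. The class and method names are mine, purely for illustration; the real design will need a persistent, partitioned store behind an API.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Illustrative sketch only: one user's recently viewed items, newest first,
    // limited to a retention window (6 months) and a maximum count (100).
    public class RecentlyViewedItems {
        private static final int MAX_ITEMS = 100;
        private static final Duration RETENTION = Duration.ofDays(180);

        private final Map<String, Instant> lastViewedByItemId = new HashMap<>();

        public synchronized void recordView(String itemId, Instant viewedAt) {
            // Re-viewing an item simply refreshes its timestamp.
            lastViewedByItemId.put(itemId, viewedAt);
        }

        public synchronized List<String> recentlyViewed(Instant now) {
            Instant cutoff = now.minus(RETENTION);
            return lastViewedByItemId.entrySet().stream()
                    .filter(e -> e.getValue().isAfter(cutoff))   // drop items older than 6 months
                    .sorted(Map.Entry.<String, Instant>comparingByValue(Comparator.reverseOrder()))
                    .limit(MAX_ITEMS)                            // cap the page at 100 items
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }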

Continue reading

Introducing Chaos Engineering to an Organization

This post explains what I learned about introducing Chaos Engineering to an organisation. It is based on my experience of re-architecting a monolithic application into a Microservices-based architecture. The Microservices architecture style structures an application as a collection of loosely-coupled services. It has many benefits, like independent development and deployment of services, no long-term commitment to a technology stack, specialized services built by small teams, and many others. One of the drawbacks of Microservices is that it increases the surface area of failures: you now have to deal with failures related to the interaction between services and system boundaries. Our client was facing issues running their distributed application in a steady state. The issues we faced were:

  1. Communication failure between services. There was no clear strategy for handling network failures between services or for giving proper feedback to the customers of the application (a small fault-injection sketch for exercising exactly this failure mode follows this list).
  2. Difficulty in understanding why the whole application became unavailable when only a single service was down. Was there a single point of failure? These kinds of issues were not visible with the usual testing.
  3. The system becoming partially unavailable when the network got choked.
  4. Unwanted local state leading to system unavailability when one instance of the service died.
  5. Out of memory errors in production services leading to complete or partial unavailability of the system.
  6. Possible data loss issues, as data replication and backup strategies were never tested under real workloads.
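One way to make failure modes like the first one visible is to inject them deliberately. Below is a tiny, hypothetical fault-injection wrapper (the names are mine, not from any chaos framework) that randomly adds latency or throws an error around a downstream call, so you can observe how the rest of the system, and the end user, actually experience the failure.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.Supplier;

    // Hypothetical sketch: wrap a downstream call and randomly inject latency or an error,
    // so the caller's timeouts, fallbacks, and user-facing error handling can be exercised.
    public class FaultInjector {
        private final double failureRate;      // e.g. 0.1 means fail roughly 10% of calls
        private final long maxExtraLatencyMs;  // simulate a choked network

        public FaultInjector(double failureRate, long maxExtraLatencyMs) {
            this.failureRate = failureRate;
            this.maxExtraLatencyMs = maxExtraLatencyMs;
        }

        public <T> T call(Supplier<T> downstreamCall) {
            try {
                // Random extra latency stands in for network congestion.
                Thread.sleep(ThreadLocalRandom.current().nextLong(maxExtraLatencyMs + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (ThreadLocalRandom.current().nextDouble() < failureRate) {
                // Stands in for a communication failure between services.
                throw new RuntimeException("Injected fault: downstream service unavailable");
            }
            return downstreamCall.get();
        }
    }

In an experiment you would enable a wrapper like this on one dependency at a time, in a controlled environment first, and verify that users see a sensible error or fallback rather than a blank page or a hung request.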

Continue reading

A minimalistic guide to distributed tracing with OpenTracing and Jaeger

If you have ever worked on a distributed application, you will know that it is difficult to debug when things go wrong. The two common tools for figuring out the root cause of a problem are logging and metrics. But the fact of the matter is that logs and metrics fail to give us a complete picture of a situation in a distributed system; they fail to tell us the complete story.

If you are building a system using a Microservices or Serverless architecture, then you are building a distributed system.

Logs fail to give us the complete picture of a request because they are scattered across a number of log files and it is difficult to link them together to form a shared context. Metrics can tell you that your service has high response times, but they will not help you easily identify the root cause.

Logging and Metrics are not enough to build observable systems.

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. It helps bring visibility into systems. – Wikipedia

Logs, metrics, and traces are the three pillars of observability. While most software teams use logging and monitoring, few of them use traces. Before we look at distributed tracing in depth, let’s define logs, metrics, and traces.
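To give a flavour of what tracing code looks like before we get to those definitions, here is a minimal sketch using the OpenTracing API with the Jaeger Java client. The service and operation names are placeholders, and the tracer configuration is read from the standard JAEGER_* environment variables.

    import io.jaegertracing.Configuration;
    import io.opentracing.Span;
    import io.opentracing.Tracer;

    public class TracingSketch {
        public static void main(String[] args) {
            // Build a Jaeger-backed OpenTracing tracer; sampling and the agent host/port
            // come from JAEGER_* environment variables. "checkout-service" is a placeholder.
            Tracer tracer = Configuration.fromEnv("checkout-service").getTracer();

            // One span per unit of work. Child spans share the parent's trace id,
            // which is what stitches scattered logs into the story of a single request.
            Span parent = tracer.buildSpan("handle-checkout").start();
            parent.setTag("http.method", "POST");

            Span child = tracer.buildSpan("charge-payment").asChildOf(parent).start();
            // ... call the payment service here ...
            child.finish();

            parent.finish();
        }
    }

Jaeger’s UI then shows the parent and child spans on a single timeline, which is exactly the shared context that logs alone fail to provide.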

Continue reading

A simple introduction to distributed systems

A couple of weeks back a junior developer asked me a seemingly simple question – What is a distributed system? One question led to another and we ended up spending more than an hour discussing different aspects of distributed systems. I felt my knowledge of distributed systems was rusty and I was unable to explain the concepts in a simple and clear manner.

In the two weeks since our discussion, I have spent time reading the distributed systems literature to gain a better understanding of the basics. In a series of posts starting today, I will cover distributed systems basics. In today’s post we will cover the what and why of distributed systems.

Continue reading

Z-axis scaling

Last year I was building an application that had to process millions of records. The processing of each record was independent but complicated. Processing all the records took more time than the response time we had to meet as per the SLA. We employed a near cache and processed all the data in memory, which made the application’s memory utilisation high.

At that time I employed the strategy of sharding the data so that each application instance could process a subset of it. This helped us improve cache utilisation and reduce the memory and I/O usage of the application. These days, each time scalability is mentioned, Microservices are thrown in as the solution. As software developers we need to keep in mind the three dimensions of scalability so that we can choose the best possible strategy for the problem. These were first described in the book The Art of Scalability, which I read in 2012 when I was working on the OpenShift platform as a service.

As per the book The Art of Scalability, there are three dimensions of scalability, as shown below.

[Figure: the Scale Cube]

X-axis scaling

This is the traditional way of scaling monolithic applications. We run multiple copies of an application behind a load balancer.

Y-axis scaling

In this model, we break the application vertically into multiple independent services. This is what Microservices architecture allows us to achieve.

Z-axis scaling

In Z-axis scaling each server is responsible for processing a subset of the data. This is the solution that I applied to my problem.
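As a sketch of the routing piece of that solution (the names and the modulo scheme are illustrative; a production setup would likely use consistent hashing to make resharding easier), each record’s key is hashed to pick the instance that owns it:

    // Illustrative Z-axis routing: hash the record key to decide which
    // application instance (shard) is responsible for processing it.
    public class ShardRouter {
        private final int shardCount;

        public ShardRouter(int shardCount) {
            this.shardCount = shardCount;
        }

        public int shardFor(String recordKey) {
            // floorMod keeps the result non-negative even for negative hash codes.
            return Math.floorMod(recordKey.hashCode(), shardCount);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(4);
            // Instance N only pulls and processes records whose shardFor(key) == N,
            // so each instance caches and works on roughly a quarter of the data.
            System.out.println("record-12345 -> shard " + router.shardFor("record-12345"));
        }
    }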

In case you want to read more about the Scale Cube, I suggest you read this post.

Paper Summary: How Complex Systems Fail

Today, I read the paper How Complex Systems Fail by Richard Cook. This paper was written in 2000. It is a short and easy-to-read paper written by a doctor. He writes about his learnings in the context of patient care, but most of them are applicable to software development as well.

In this post, I go over the points that I liked from the paper and describe how I think they apply to software development.

The first point raised in the paper is:

All complex systems are intrinsically hazardous systems.

With respect to the complex software that most of us build, I don’t think hazardous is the right word. Luckily, no one will die if the software we write fails. I see it as: all complex systems are intrinsically important systems for an organisation. Most complex software has business and financial importance, so if the software fails, the organization ends up losing money and reputation. That can be catastrophic and might bring an organization to its knees. It is the business importance of these systems that drives the creation of defences against failure.

This takes us to the next point in the paper:

Complex systems are heavily and successfully defended against failure.

I don’t think most complex systems have successful defence mechanisms in place from day one, but I agree that over time mechanisms are added to successfully operate the application. A few of the mechanisms we use in software are data backups, redundancy, monitoring, periodic health checks, training people, change approval processes, and many others. With time, an organization builds a series of shields that normally divert operations away from failures. Another key point relevant to software is that when we rebuild a system we should expect the new system to become stable over time: many times you deploy a new system and it fails because the mechanisms used for the old system are not sufficient to run the new one. Over time, the organization will put in place mechanisms to successfully operate the new application.

The next point mentioned in this paper is:

Catastrophe requires multiple failures – single point failures are not enough.

I disagree with the author that single points of failure are not enough for catastrophic failures. I think it depends on the way the complex software is built. If your complex software system is built in such a way that a fault in one service causes failure in the services that depend on it, then a single failure in the least important component can cause the entire application to break. So, if you are building complex distributed applications, you should use circuit breakers, service discovery, retries, etc., to build resilient applications.
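For example, a simple retry with exponential backoff (sketched below with made-up names, not any particular library) keeps a transient fault in one service from immediately cascading to its callers; a circuit breaker takes the idea further by failing fast once a dependency looks unhealthy.

    import java.util.function.Supplier;

    // Sketch of retry with exponential backoff. A circuit breaker would additionally
    // stop calling the dependency once the failure rate crosses a threshold.
    public final class Retry {
        public static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMs) {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be at least 1");
            }
            RuntimeException lastFailure = null;
            long delay = initialDelayMs;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return call.get();
                } catch (RuntimeException e) {
                    lastFailure = e;
                    if (attempt == maxAttempts) {
                        break;
                    }
                    try {
                        Thread.sleep(delay);   // wait before the next attempt
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                    delay *= 2;                // exponential backoff
                }
            }
            throw lastFailure;
        }
    }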

The next point raised by the author is not obvious to most of us:

Complex systems contain changing mixtures of failures latent within them.

The author says that it is impossible for us to run complex systems without flaws. I have read a similar point somewhere: that Amazon aims for five-nines availability, i.e. 99.999%. Five-nines availability means 5 minutes, 15 seconds or less of downtime in a year. They don’t try to go further because it becomes too costly beyond that point, and customers can’t tell the difference between their internet not working and the service being unavailable.
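That downtime figure is just arithmetic on the minutes in a year:

    365 days × 24 hours × 60 minutes = 525,600 minutes per year
    525,600 × (1 − 0.99999)          = 5.256 minutes, i.e. about 5 minutes 15 seconds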

A corollary of the above point is:

Complex systems run in degraded mode.

The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

The next point in the paper is:

Catastrophe is always just around the corner.

Another way to put the above point is: failure is the norm, deal with it. If you are building distributed systems then you must be clear in your mind that different components will fail, and their failures might lead to complete failure of the system. So, we should try to build systems that can handle individual component failures. The author writes: it is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.

Then, the author goes on to write:

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.

I don’t entirely agree with this point. The author is saying that there is no single cause of failure; according to him, there are multiple contributors to a failure. I agree that multiple reasons would have led to the failure, but I don’t think root cause analysis is fundamentally wrong. I think root cause analysis is the starting point for understanding why things go wrong. In the software world, a blameless postmortem is done to understand a failure, with the intention not of finding which team or person is responsible, but of learning from the failure and avoiding it in the future.

The next point brings human biases into the picture:

Hindsight biases post-accident assessments of human performance.

It always looks obvious in hindsight that we should have avoided the failure, given all the events that happened. This is the most difficult point to follow in practice; I don’t know how we can overcome these biases. The author writes: Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

The next point mentioned in the paper is:

All practitioner actions are gambles.

This is an interesting point. If you have ever tried to bring a failed system back to a working state, you will agree with the author. In those uncertain times, we try different things to bring the system up, but at best they are guesses based on our previous learnings. The author writes: That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I like the way the author makes a point about human expertise:

Human expertise in complex systems is constantly changing.

I think most software organisations want to use the latest and greatest technology to build complex systems, but they don’t invest enough in developing expertise. Humans are the most critical part of any complex system, so we need to plan for their training and skill refinement as a function of the system itself. Failing to do this will lead to software failures.

As they say, change is the only constant. The next point addresses this:

Change introduces new forms of failure.

I will quote directly from the paper, as the author puts it eloquently:

The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The last point mentioned in the paper sums it all up:

Failure free operations require experience with failure.

As they say, failure is the world’s greatest teacher. When we encounter performance issues or difficult-to-predict situations, we push the envelope of our understanding of the system. These are the situations that help us discover new mechanisms that we need to put in place to handle future failures.

Conclusion

I thoroughly enjoyed reading this paper. It has many lessons for software developers. If we stay conscious of them, I think we can write more resilient software.