In the last couple of months I was involved in a couple of consulting assignments where I had to understand the platform the team has built and give my recommendations. Platform is a broad term and it means different thing in different context.
Today, I read a paper titled Lessons from Giant-Scale Services. This paper is written by Eric Brewer, the guy behind CAP theorem. It is an old paper published in 2001. The paper helps the reader build mental model on how to think about availability of large scale distributed systems.
MemSQL is s fast, commercial, ANSI SQL compliant, highly scalable HTAP database. HTAP databases are those that support both OLTP and OLAP workloads. It supports ACID transactions just like a regular relational database .It also supports document and geospatial data types.
MemSQL is fast because it stores data in-memory. But, it does not mean it is not durable. It maintains a copy of data on disk as well. Transactions are committed to the transaction log on disk and later compressed into full-database snapshots. One of the main reason new databases are designed as in-memory first is because memory is getting cheaper every year. It is estimated memory is becoming cheaper 40% every year.
MemSQL has tuneable durability. You can make it fully durable or completely ephemeral. It can be sync or async.
MemSQL simplifies your architecture as you don’t have to write ETL jobs to move data from one data store to another data store. This is the biggest selling point of any HTAP database.
I enjoy working through system design problems. It helps me think how I will design interesting features of various systems. I will post design solutions to interesting problems.
Today, I will share how I will design Amazon recently viewed items page. You can view this page by going to https://www.amazon.com/gp/history/
To me it showed last 73 items I viewed on Amazon.com. I don’t think they are showing last N items rather they are showing items that I viewed in last X days(or months) with some max limit.
Let’s redefine problem now that we better understand it.
Design the Amazon recently viewed items page API. The recently viewed items are all the items that you viewed in the last 6 months. The max count of items could be 100.
This post explains my learning on how to introduce Chaos Engineering to an organisation. This is based on my experience of re-architecting monolithic application to Microservices based architecture. Microservices architecture style structures an application as a collection of loosely-coupled services. Microservices architecture has many benefits like independent development and deployments of services, eliminate long-term commitment to a technology stack, specialized services built by small teams, and many others. One of the drawbacks of Microservices is that it increases the surface area of failures. You now have to deal with failures related to the interaction between services and system boundaries. Our client was facing issues running their distributed application in a steady state. The issues that we faced were:
- Communication failure between services. There was no clear strategy on how to handle network failure between services and how to give proper feedback to the customers of the application.
- Difficulty in understanding why the whole application became unavailable when only a single service was down. Is there any single point of failure? These types of issues were not visible with usual testing.
- System becoming partially unavailable when the network gets choked.
- Unwanted local state leading to system unavailability when one instance of the service dies.
- Out of memory errors in production services leading to complete or partial unavailability of the system.
- Possible data loss issues as data replication and backup strategies were never tested in real workloads.
If you have ever worked on a distributed application you will know that it is difficult to debug when things go wrong. The two common tools to figure out root cause of the problem are logging and metrics. But the fact of the matter is that logs and metrics fail to give us complete picture of a situation in a distributed system. They fail to tell us the complete story.
If you are building a system using Microservices / Serverless architecture then you are building a distributed system.
Logs fail to give us the complete picture of a request because they are scattered across a number log files and it is difficult to link them together to form a shared context. Metrics can tell you that your service is having high response time but it will not be able to help you easily identify the root cause.
Logging and Metrics are not enough to build observable systems.
Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. It helps bring visibility into systems. – Wikipedia
Logs, metrics, and traces are the three pillars of observability. While most software teams use logging and monitoring few of them use traces. Before we look at distributed tracing in depth, let’s define logs, metrics, and traces.
A couple of weeks back a junior developer asked me a seemingly simple question – What is a distributed system? One question led to another and we end up spending more than an hour discussing different aspects of distributed systems. I felt my knowledge on distributed systems was rusty and I was unable to explain concepts in a simple and clear manner.
In the last two weeks since our discussion I spent time reading distributed systems literature to gain better understanding of the basics. In a series of post starting today, I will cover distributed system basics. In today’s post we will cover what and why of distributed systems.