A simple introduction to distributed systems

A couple of weeks back, a junior developer asked me a seemingly simple question – What is a distributed system? One question led to another, and we ended up spending more than an hour discussing different aspects of distributed systems. I felt my knowledge of distributed systems was rusty, and I was unable to explain the concepts in a simple and clear manner.

In the two weeks since our discussion, I have spent time reading distributed systems literature to gain a better understanding of the basics. In a series of posts starting today, I will cover distributed system basics. In today’s post we will cover the what and why of distributed systems.


Issue #22: 10 Reads, A Handcrafted Weekly Newsletter For Humans

The time to read this newsletter is 150 minutes.

Curiosity is, in great and generous minds, the first passion and the last – Samuel Johnson

  1. Monorepos: Please don’t!: 20 mins read. In this post, Matt Klein gives reasons why the monorepo approach does not provide the benefits most often cited by its proponents. His recommendation is to go with a polyrepo structure. The post makes four valid arguments:
    1. Organisations that use monorepos spend considerable engineering resources on building tools to work with them. Most organisations don’t have that luxury.
    2. A monorepo makes it difficult to open source internal projects, as you have a single commit history.
    3. Most version control systems are not meant to be used for large monolithic repositories. Microsoft has done some work as part of its Git VFS project, but it still has some rough edges.
    4. The last interesting point the post makes is that “the frank reality is that, at scale, how well an organization does with code sharing, collaboration, tight coupling, etc. is a direct result of engineering culture and leadership, and has nothing to do with whether a monorepo or a polyrepo is used.”

    The main drawback of the polyrepo approach is that it creates a culture where different teams own different parts of the code, and few people in the organisation are aware of the big picture. This point is beautifully put by Adam Jacob in his post – Monorepo: please do!

    My take is somewhere in between the two approaches. For example, if you are building a web application, I like to keep all backend microservices in one repository and the front-end application in another. Taking either approach too far does not work.

Z-axis scaling

Last year I was building an application that had to process millions of records. The processing of each record was independent but complicated. Processing all the records took more time than the response time our SLA allowed. We employed a near cache and processed all the data in memory, which made the application’s memory utilisation high.

At that time I employed a strategy of sharding the data so that each application instance could process a subset of it. This helped us improve cache utilisation and reduce the memory and I/O usage of the application. These days, each time scalability is mentioned, microservices are thrown out as the solution. As software developers, we need to keep in mind the three dimensions of scalability so that we can choose the best possible strategy for the problem. These were first described in the book The Art of Scalability, which I read in 2012 when I was working on the OpenShift platform as a service.

As per the book The Art of Scalability, there are three dimensions of scalability, as shown below.

[Figure: the Scale Cube, showing the X, Y, and Z axes of scaling]

X-axis scaling

This is the traditional way of scaling monolithic applications. We run multiple copies of an application behind a load balancer.
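As a rough illustration (a Python sketch with made-up replica addresses, not from the original post), X-axis scaling simply spreads requests across interchangeable clones of the same application:

from itertools import cycle

# Hypothetical pool of identical application replicas; any clone can serve any request.
replicas = cycle(["http://app-1:8080", "http://app-2:8080", "http://app-3:8080"])

def route(path: str) -> str:
    # Round-robin load balancing: pick the next clone regardless of the request.
    return f"{next(replicas)}{path}"

print(route("/orders/42"))
print(route("/orders/43"))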

Y-axis scaling

In this model, we break the application vertically into multiple independent services. This is what the microservices architecture allows us to achieve.
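A minimal sketch of the same idea (with hypothetical service names, not taken from the book): requests are routed by functional area, and each service can be scaled and deployed on its own.

# Hypothetical routing table: each functional area is owned by an independent service.
services = {
    "/orders": "http://order-service:8080",
    "/payments": "http://payment-service:8080",
    "/catalog": "http://catalog-service:8080",
}

def route(path: str) -> str:
    for prefix, backend in services.items():
        if path.startswith(prefix):
            return f"{backend}{path}"
    raise ValueError(f"no service owns {path}")

print(route("/payments/12"))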

Z-axis scaling

In Z-axis scaling, each server is responsible for processing a subset of the data. This is the solution I applied to my problem.
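Here is a minimal sketch of that strategy (in Python, with hypothetical record keys and instance counts; the details of my actual application are not in this post): each instance hashes a record’s key and processes only the records that fall into its own shard.

import hashlib

NUM_INSTANCES = 4   # total application instances (illustrative value)
INSTANCE_ID = 2     # the shard this particular instance owns

def shard_of(record_key: str) -> int:
    # Map a record key to a shard by hashing it and taking the modulo.
    digest = hashlib.sha1(record_key.encode()).hexdigest()
    return int(digest, 16) % NUM_INSTANCES

records = [{"key": f"rec-{i}"} for i in range(1000)]   # stand-in for the real data set
my_records = [r for r in records if shard_of(r["key"]) == INSTANCE_ID]
print(f"instance {INSTANCE_ID} processes {len(my_records)} of {len(records)} records")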

In case you want to read more about the Scale Cube, I suggest you read this post.

How Docker uses cgroups to set resource limits

Today, I was interested to know how Docker uses cgroups to set resource limits. In this short post, I will share with you what I learnt.

I will assume that you have a machine on which Docker is installed.

Docker allows you to pass resource limits using command-line options. Let’s assume that you want to limit the I/O read rate to 1mb per second for a container. You can start a new container with the --device-read-bps option as shown below.

$ docker run -it --device-read-bps /dev/sda:1mb centos

In the above command, we are instantiating a new centos container. We specified the --device-read-bps option to limit the read rate to 1mb per second for the /dev/sda device.
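To see the value Docker actually writes into the cgroup filesystem, you can read the container’s blkio throttle file. The snippet below is a sketch that assumes cgroup v1 with the default cgroupfs driver; the container ID is a placeholder you would fill in (for example, from docker inspect --format '{{.Id}}' <container-name>).

from pathlib import Path

container_id = "<full-container-id>"   # placeholder; use the full 64-character ID
throttle_file = Path("/sys/fs/cgroup/blkio/docker") / container_id / "blkio.throttle.read_bps_device"

# Expect a line such as "8:0 1048576": the device's major:minor numbers
# followed by the limit in bytes per second (1mb == 1048576 bytes).
print(throttle_file.read_text())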


Issue #21: 10 Reads, A Handcrafted Weekly Newsletter For Humans

The time to read this newsletter is 165 minutes.

Quote

He who learns but does not think, is lost! He who thinks but does not learn is in great danger — Confucius

  1. Stop Learning Frameworks: 15 mins read. This post makes a great point: we should focus more on gaining a deep understanding of software development fundamentals than on learning new frameworks. Frameworks come and go, and most of the time the knowledge you gain is not portable. Focusing on core software development topics like algorithms, unit testing, design patterns, HTTP, DDD, etc. will give you a better return on your time. If you encounter a new technology at work, you will learn it anyhow.
    Another post along similar lines that I read this week said the stack is irrelevant. The author writes,
    I’ve started with almost no knowledge about the stack and been up to speed in less than 3 months or so. This is why I always tell people to pick up technology and language agnostic problem solving skills because those are the only skills transferable across stacks.
  2. Analyzing Hacker News book suggestions in Python: 15 mins read. A quick tutorial that will teach you how to do simple text processing in Python. This tutorial covers how to find the top book recommendations in an HN thread. The approach used is simple, and most programmers can easily follow it.
  3. The Major Features in Postgres 11: 20 mins read. This is a 24-page PDF slide deck covering the important features introduced in Postgres 11. My favourite features from the list are:
    1. Partitioning table by hash
    2. Parallel hash joins
    3. Finer-grained access control
  4. Materialized views vs. Rollup tables in Postgres: 15 mins read. I loved this post for being clear and to the point. It teaches when to use materialized views versus rollup tables. I could see myself using this in the near future. Great post!
  5. How to Get Over Productivity Guilt: 10 mins read. Just the kind of post that you need to end your year. I feel productivity guilt on many days, and I am trying to get better at it. Otherwise, it kills all your happiness and you end up doing nothing. Prioritize the essential tasks that you need to do and get them done. Life will be good. Don’t over stress yourself.
  6. The business case for serverless: 10 mins read. I spent a few months building an application using Serverless, on the AWS stack. I enjoyed the Serverless experience, but I found that the pace of development goes down as you are not able to do end-to-end testing on your local machine. This post does a good job of making a business case for Serverless. The main point the author makes is that Serverless increases developer velocity. The author concludes the post by writing:
    One day, complexity will grow past a breaking point and development velocity will begin to decline irreversibly, and so the ultimate job of the founder is to push that day off as long as humanly possible. The best way to do that is to keep your ball of mud to the minimum possible size— serverless is the most powerful tool ever developed to do exactly that.
    Another interesting post on Serverless that I read this week is Serverless Everything by Amazon’s Tim Bray. He proposes an interesting way to look at Serverless using the data plane and control plane analogy. Give it a read; it is not that long.
    The Amazon MQ service, which is a managed version of the excellent Apache ActiveMQ open-source message broker. To make this usable by AWS customers, we had to write a bunch of software to create, deploy, configure, start, stop, and delete message brokers. In this sort of scenario, ActiveMQ itself is called the “data plane” and the management software we wrote is called the “control plane”. The control plane’s APIs are RESTful and, in Amazon MQ, its implementation is entirely serverless, based on Lambda, API Gateway, and DynamoDB.
  7. Benchmark PostgreSQL With Linux HugePages: 15 mins read. PostgreSQL is my first choice when it comes to RDBMS. This post covers how we can tune the Linux HugePages configuration to improve the performance of a Postgres database. The author concludes the post by writing:
    One of my key recommendations is that we must keep Transparent HugePages off. You will see the biggest performance gains when the database fits into the shared buffer with HugePages enabled. Deciding on the size of huge page to use requires a bit of trial and error, but this can potentially lead to a significant TPS gain where the database size is large but remains small enough to fit in the shared buffer.
  8. How we built Globoplay’s API Gateway using GraphQL: 15 mins read. This blog starts with the reasons why Globo’s team decided to use GraphQL instead of REST for building their API. The main reason outlined in the post for choosing GraphQL is the ease with which you can support different requirements for different devices. After covering why GraphQL, the post talks about how you can get started with it.
    In mid 2018, we had two backends for frontends (BFF) doing very similar tasks: One for web, and another for iOS, android and TV. As much as I love the “backend for frontend” idea (and how cool it sounds), we could not keep the current architecture. Not only because of the reasons I just said, but because each BFF was serving slightly different content to its clients while the business team started to ask for something new: Ubiquity among all clients. the more I reviewed everything we needed to support, the more GraphQL started to make sense. While TVs need a big program poster, mobiles need a small one. We need to show exactly the same video duration among all clients. TVs should provide detailed information about each program, but iOS and android could show only a poster + program title.
  9. Envoy Proxy at Reddit: 20 mins read. The post goes into depth on why Reddit moved to Envoy for service-to-service communication. What I like in this post is how they incorporated a new technology in a step-by-step manner rather than taking a big bang approach. It is a well-written post, so you will end up learning a lot about how big sites like Reddit introduce new technologies into their ecosystem and how their architecture evolves over time. The post outlines three requirements the Reddit team had for their service mesh choice. Envoy and its ecosystem fit all of these requirements.
    1. Performance: Avoid adding a performance bottleneck at all costs. Any performance losses at the proxy level need to be offset by considerable feature gains. The two biggest considerations here were resource utilization and latency impact. Our mesh approach accounts for a sidecar proxy on every host, so we wanted the solution to be one that we were comfortable running on every host and at every hop in the network.
    2. Features: The biggest differentiator among the options was the possibility of L7 Thrift support in the proxy. Thrift is our main inter-service RPC protocol and without first-class support for the behavior control we want in a service mesh, it wouldn’t make sense to switch to something that would just be providing the same basic TCP load balancing we’re getting out of HAProxy. We’ll address this in the next section.
    3. Integrations and Extensibility: Being able to contribute or request integrations and possibly extend out-of-the-box functionality was also a core requirement. The network proxy needed to be able to evolve with Reddit’s service needs and developer feature requests.
  10. Going Head-to-Head: Scylla vs Amazon DynamoDB: 30 mins read. This post compares ScyllaDB with Amazon DynamoDB. The author writes: Scylla is a drop-in replacement for Cassandra, implemented from scratch in C++. Cassandra itself was a reimplementation of concepts from the Dynamo paper. So, in a way, Scylla is the “granddaughter” of Dynamo. That means this is a family fight, where a younger generation rises to challenge an older one. It was inevitable for us to compare ourselves against our “grandfather,” and perfectly in keeping with the traditions of Greek mythology behind our name.
    The conclusion from the post says it all

    1. DynamoDB failed to achieve the required SLA multiple times, especially during the population phase.
    2. DynamoDB has 3x-4x the latency of Scylla, even under ideal conditions
    3. DynamoDB is 7x more expensive than Scylla
    4. Dynamo was extremely inefficient in a real-life Zipfian distribution. You’d have to buy 3x your capacity, making it 20x more expensive than Scylla
    5. Scylla demonstrated up to 20x better throughput in the hot-partition test with better latency numbers
    6. Last but not least, Scylla provides you freedom of choice with no cloud vendor lock-in (as Scylla can be run on various cloud vendors, or even on-premises).

I will end this newsletter with a short video, The Electronic Coach. This short video shows how Donald Knuth built a mainframe program that helped a basketball team win 11 out of 14 matches. It is an early example of computers being used in data-driven decision making. In case you don’t know Donald Knuth, he is one of the greatest computer scientists. His book The Art of Computer Programming is included by American Scientist on its list of books that shaped the last century of science.