December 2018 – Shekhar Gulati

Issue #21: 10 Reads, A Handcrafted Weekly Newsletter For Humans

The time to read this newsletter is 165 minutes.

Quote

He who learns but does not think, is lost! He who thinks but does not learn is in great danger — Confucius

Stop Learning Frameworks: 15 mins read. This post makes a great point that we should focus more on gaining deep understanding of software development fundamentals than learning new frameworks. Frameworks come and go and most of the time knowledge you gain is not portable. Focussing on core software development topics like algorithms, unit testing, design patterns, HTTP, DDD, etc will give you better return for your time. If you encounter a new technology at work then you will learn it anyhow.
Another post on similar lines that I read this week said Stack is irrelevant. Author writes,
I’ve started with almost no knowledge about the stack and been up to speed in less than 3 months or so. This is why I always tell people to pick up technology and language agnostic problem solving skills because those are the only skills transferable across stacks.
Analyzing Hacker News book suggestions in Python: 15 mins read. A quick tutorial that will teach you how to do simple text processing in Python. This tutorial covers how to find top book recommendations in an HN thread. The approach used in simple and most programmers can easily follow it.
The Major Features in Postgres 11: 20 mins read. This is 24 page PDF slidedeck covering important features introduced in Postgres 11. My favourite features from the list are:
1. Partitioning table by hash
2. Parallel hash joins
3. Finer-Grained access control
Materialized views vs. Rollup tables in Postgres: 15 mins read. I loved this post for being clear and to the point. It teaches when to use Materialized view and rollup tables. I could see myself using it in near future. Great post!
How to Get Over Productivity Guilt: 10 mins read. Just the kind of post that you need to end your year. I feel productivity guilt on many days. I am trying to be better at it. Else, it kills all your happiness and you end up doing nothing. Prioritize the essentials tasks that you need to do and get them done. Life will be good. Don’t over stress yourself.
The business case for serverless: 10 mins read. I spent few months on building an application using Serverless. I uses AWS stack to build the application.I enjoyed Serverless experience but I found that pace of development goes down as you are not able to do end-to-end testing on your local machine. This post does a good job in making a business case for Serverless. The main point author makes is that Serverless increases developer velocity. Author concludes the post by writing
One day, complexity will grow past a breaking point and development velocity will begin to decline irreversibly, and so the ultimate job of the founder is to push that day off as long as humanly possible. The best way to do that is to keep your ball of mud to the minimum possible size— serverless is the most powerful tool ever developed to do exactly that.
Another interesting post on Serverless that I read this week is by Amazon ‘s Tim Bray — Serverless Everything. He proposed an interesting way to look at Serverless using data plane and control plane analogy. Give it a read, it is not that long.
The Amazon MQ service, which is a managed version of the excellent Apache ActiveMQ open-source message broker. To make this usable by AWS customers, we had to write a bunch of software to create, deploy, configure, start, stop, and delete message brokers. In this sort of scenario, ActiveMQ itself is called the “data plane” and the management software we wrote is called the “control plane”. The control plane’s APIs are RESTful and, in Amazon MQ, its implementation is entirely serverless, based on Lambda, API Gateway, and DynamoDB.
Benchmark PostgreSQL With Linux HugePages: 15 mins read. PostgreSQL is my first choice when it comes to RDBMS. This post covers how we can configure Linux HugePages configuration to improve performance of Postgres database. Author concludes the post by writing
One of my key recommendations is that we must keep Transparent HugePages off. You will see the biggest performance gains when the database fits into the shared buffer with HugePages enabled. Deciding on the size of huge page to use requires a bit of trial and error, but this can potentially lead to a significant TPS gain where the database size is large but remains small enough to fit in the shared buffer.
How we built Globoplay’s API Gateway using GraphQL: 15 mins read. This blog start with the reason why Globo’s team decided to use GraphQL instead of REST for building API. The main reason outlined in the post for choosing GraphQL is the ease with which you can support different requirements for different devices. After covering the why GraphQL, the post talks about how you can get started with it.
In mid 2018, we had two backends for frontends (BFF) doing very similar tasks: One for web, and another for iOS, android and TV. As much as I love the “backend for frontend” idea (and how cool it sounds), we could not keep the current architecture. Not only because of the reasons I just said, but because each BFF was serving slightly different content to its clients while the business team started to ask for something new: Ubiquity among all clients. the more I reviewed everything we needed to support, the more GraphQL started to make sense. While TVs need a big program poster, mobiles need a small one. We need to show exactly the same video duration among all clients. TVs should provide detailed information about each program, but iOS and android could show only a poster + program title.
Envoy Proxy at Reddit: 20 mins read. The post goes into depth on why Reddit moved to Envoy for service to service communication. What I like in this post is how they incorporated a new technology in a step by step manner rather than going the Big Bang approach. It is a well written post so you will end up learning a lot about how big sites like Reddit introduce new technologies in their ecosystem and how they architecture evolve over time. The post outlines three requirements Reddit team had from their service mesh choice. Envoy and its ecosystem fit all of these requirements.
1. Performance: Avoid adding a performance bottleneck at all costs. Any performance losses at the proxy level need to be offset by considerable feature gains. The two biggest considerations here were resource utilization and latency impact. Our mesh approach accounts for a sidecar proxy on every host, so we wanted the solution to be one that we were comfortable running on every host and at every hop in the network.
2. Features: The biggest differentiator among the options was the possibility of L7 Thrift support in the proxy. Thrift is our main inter-service RPC protocol and without first-class support for the behavior control we want in a service mesh, it wouldn’t make sense to switch to something that would just be providing the same basic TCP load balancing we’re getting out of HAProxy. We’ll address this in the next section.
3. Integrations and Extensibility: Being able to contribute or request integrations and possibly extend out-of-the-box functionality was also a core requirement. The network proxy needed to be able to evolve with Reddit’s service needs and developer feature requests.
Going Head-to-Head: Scylla vs Amazon DynamoDB: 30 mins read. This post compares ScyllaDB with Amazon DynamoDB. Author writes, Scylla is a drop-in replacement for Cassandra, implemented from scratch in C++. Cassandra itself was a reimplementation of concepts from the Dynamo paper. So, in a way, Scylla is the “granddaughter” of Dynamo. That means this is a family fight, where a younger generation rises to challenge an older one. It was inevitable for us to compare ourselves against our “grandfather,” and perfectly in keeping with the traditions of Greek mythology behind our name.
The conclusion from the post says it all
1. DynamoDB failed to achieve the required SLA multiple times, especially during the population phase.
2. DynamoDB has 3x-4x the latency of Scylla, even under ideal conditions
3. DynamoDB is 7x more expensive than Scylla
4. Dynamo was extremely inefficient in a real-life Zipfian distribution. You’d have to buy 3x your capacity, making it 20x more expensive than Scylla
5. Scylla demonstrated up to 20x better throughput in the hot-partition test with better latency numbers
6. Last but not least, Scylla provides you freedom of choice with no cloud vendor lock-in (as Scylla can be run on various cloud vendors, or even on-premises).

I will end this newsletter with a short video — The Electronic Coach. This short video shows how Donald Knuth build a mainframe program that helped a basketball team win 11 out of 14 matches. This is an early example of computer used in data driven decision making. In case you don’t know Donald Knuth, he is one of the greatest computer scientist. His book The Art of Computer Programming is included by American Scientist on its list of books that shaped the last century of science.

Issue #20: 10 Read, A Handcrafted Weekly Newsletter For Humans

The total estimated time to read this newsletter is 190 minutes.

The secret of getting ahead is to get started – Mark Twain

Facial recognition: It’s time for action: 30 mins read. This is a post by Microsoft on the need for government regulation and responsible industry measures to address advancing facial recognition technology. This is a welcome step by Microsoft and it shows them that they are on the right side of the issue. The post lays out the potential dangers of facial recognition if it is not regulated. The three main problems outlined in the post are:
1. Certain uses of facial recognition increase the risk of decisions, and, more generally, outcomes that are biased and, in some case, in violation of laws prohibiting discrimination
2. Intrusion into people’s privacy
3. The use of facial recognition technology by a government for mass surveillance can encroach on democratic freedoms
Microsoft has also defined six principles that they are adopting to address the concerns. These are 1) Fairness 2) Transparency 3) Accountability 4) Non-discrimination 5) Notice and consent 6) Lawful surveillance.
Why You Should Never, Ever Use Quora: 15 mins read. I personally don’t use Quora for last many years. I find it full of gossip and useless questions and answers. The author makes a good point about Quora lack of intent to make knowledge accessible. Quora does not provide any API or data export tool. They have explicitly forbidden Internet Archive from indexing their web site. Also, you will be forced to login before you can see full answer text. Moreover, they are having trouble making money. So, you never know if they will exist few years down the line. This makes it even more important that they allow shareability of their data.
The Swiss Army Knife of Hashmaps: 30 mins read. This post covers new implementation of HashMap based called Hashbrown. Hashbrown is based on Google’s SwissTable implementation. The blog starts from HashMap basics covering hashes and different implementations of HashMap using linear probing, Robin Hood hashing, and finally talking about Hashbrown. This post gives you a good understanding of HashMap.
Why you need both rituals and routines to power your workday: 10 mins read. The post covers the importance of routines and rituals to make most of the day. Routine is a series of regularly followed actions. Rituals, on the other hand, are those symbolic actions performed at key moments that help us move through the day smoothly.
How Pinterest runs Kafka at scale: 10 mins read. This post talks about how Pinterest is using Kafka. Pintrest has one of the largest Kafka deployments in the cloud. Their Kafka deployments runs in three AWS regions. They make use of MirrorMaker to transport data among three regions. They created and open sourced DoctorKafka, a Kafka operations automation service to perform partition reassignment during broker failure for operation automation. They use d2.2xlarge instances for brokers.
Using Golang to Build Microservices at The Economist: A Retrospective: 20 mins read. This article by Economist engineer covers in depth why they chose Golang for building their new content platform. The three main reasons outlined in the post are:
1. Go has key design elements required for building distributed systems
2. Go’s concurrency model is relatively easy to implement
3. It is easy too get started and fun to write
Serverless Tip: Don’t overpay when waiting on remote API calls: 15 mins read. I consider as good software developer, you should not shy away from validating your assumptions. This post goes in depth on how the author did detailed analysis to validate his hypothesis. Author writes, My hypothesis was that by lowering the memory configuration, that the execution of the Lambda function would be slower and perhaps not as cost effective. He then carried out experiments to validate his hypothesis. As it turns out, functions that make remote API calls can be broken down into small, asynchronous components with low memory settings. We get the same performance and significantly reduce our costs, especially at scale.
Our learnings from adopting GraphQL: 20 mins read.In this post, Netflix Marketing Technology team shares their learning in adopting GraphQL. Author writes, We have been running GraphQL on NodeJS for about 6 months, and it has proven to significantly increase our development velocity and overall page load performance.
How To Build A Real-Time App With GraphQL Subscriptions On Postgres: 30 mins read. This is a tutorial to build a real time pooling app. Author does a good job explaining why GraphQL is a good choice for building real time apps. You can use this tutorial to build your firsts GraphQL app.
Your Intuition Is Wrong, Unless These 3 Conditions Are Met: 10 mins read. Daniel Kahneman, author of Thinking Fast and Slow explains why most intuitions are wrong. I loved how he compared two definitions of intuitions and explained why the second is better than the first. The first definition is Intuition is defined as knowing without knowing how you know. The second definition is intuition is thinking that you know without knowing why you do.

Bonus: I will end this week newsletter with a great talk on decision making.

Paper Summary: How Complex System Fail

Today, I read paper How Complex System Fail by Richard Cook. This paper was written in 2000. This is a short and easy to read paper written by a doctor. He writes about his learnings in the context of patient care, but most of them are applicable to software development as well.

In this post, I go over the points that I liked from the paper and write how I think they are applicable to software development.

The first point raised in the paper is

All complex systems are intrinsically hazardous systems.

With respect to complex software that most of us build, I don’t think hazardous will be the right word. Luckily, no one will die if the software we write fails. I see it as all complex systems are intrinsically important systems for an organisation. Most of the complex software has business and financial importance. So if the software fails, organization will end up loosing money and reputation. These can be catastrophic for the organization and might bring an organization to its knees. It is the business importance of these systems that drives the creation of defences against failure.

This takes us to the next point in the paper:

Complex systems are heavily and successfully defended against failure.

I don’t think from the day one most complex systems have successful defensive mechanisms placed. But, I agree that over time mechanisms will be added to successful operate the application. Few of the mechanisms that we use in software are data backup, redundancy, monitoring, periodic health checks, training people, creating change approval process, and many others. With time, an organization builds a series of shields that normally divert operations away from failures. Another key point relevant to software is that when we try to rebuild a system we should expect that the new system will become stable over time. Many times you deploy a new system and it fails because the mechanisms used in old system might not be sufficient to run the new system. Over time, an organization will put in place mechanisms to successfully operate the new application.

The next point mentioned in this paper is:

Catastrophe requires multiple failures – single point failures are not enough.

I disagree with author that single point of failures are not enough for catastrophic failures. I think it depends on the way the complex software is built. If your complex software systems is built in such a way that fault in one service brings cause failure in the service using the first one. Then, a single failure in the least important component can cause the entire application to break. So, if you are building complex distributed applications then you should use circuit breakers, service discovery, retries, etc to build resilient applications.

The next point raised by author is not obvious to most of us:

Complex systems contain changing mixtures of failures latent within them.

Author says that it is impossible for us to run complex systems without flaws. I have read a similar point somewhere that Amazon tries for five-nine availability i.e. 99.999%. Five-nines or 99.999% availability means 5 minutes, 15 seconds or less of downtime in a year. They don’t try to go further because it becomes too costly after this time and customers can’t notice the difference between their internet not working and service becoming unavailable.

A corollary of the above point is:

Complex systems run in degraded mode.

The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

The next point in the paper is:

Catastrophe is always just around the corner.

Another way to put the above point is Failure is the norm. Deal with it. If you are building distributed systems then you must be clear in your mind that different components will fail. They might lead to complete failure of the system. So, we should try to build systems that can handle individual component failure. Author writes, it is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.

Then, author goes on to write:

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.

I don’t agree entirely with this point. The author is saying that there is no single cause for failure. According to him, there are multiple contributors to the failure. I agree that multiple reasons would have let to failure. But, I don’t think root cause analysis is fundamentally wrong. I think root cause analysis is the starting point to understand why things go wrong. In software world, blameless postmortem is done to understand failure with the intention of not finding which team or person is responsible but to learn from failure and avoid such failure in future.

The next point brings human biases into picture:

Hindsight biases post-accident assessments of human performance.

It always look obvious in the hindsight that we should have avoided the failure given all the events that happened. This is the most difficult point to follow in practice. I don’t know how we can overcome biases. Author writes, Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

The next point mentioned in the paper is:

All practitioner actions are gambles.

This is an interesting point. If you have ever tried to bring a failure system back to state, you will agree with the author. In those uncertain times, we try different things to bring the system up but at best they are our guesses based on our previous learnings. Author writes, practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I like the way author makes a point about human expertise:

Human expertise in complex systems is constantly changing.

I think most software organisations want to use the latest and greatest technology to build complex systems but they don’t invent enough in developing expertise. Humans are the most critical part of any complex system so we need to plan for their training skill refinement as a function of system itself. Failure to do this will lead to software failures.

As they say, change is the only constant. The next point talks:

Change introduces new forms of failure.

I will quote directly from the paper as author puts it eloquently

The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The last point mentioned in the paper sums it all:

Failure free operations require experience with failure.

As they say, failure is the world greater teacher. When you encounter performance issues or difficult to predict situations then we push the envelope of our understanding of the system. These are the situations that help us discover new mechanisms that we need to put in our system to handle future failures.

Conclusion

I thoroughly enjoyed reading this paper. It has many learnings for software developers. If we become conscious about them I think we can write better resilient software.

Getting started with Apache Dubbo and Spring Boot

This week I decided to play with Apache Dubbo. I follow Github trending repositories daily and for many weeks and months Apache Dubbo is one of their popular Github Java repository. It has more than 20,000 stars. So, I decided to give it a shot. One of the reasons for Dubbo popularity is that it is created by software engineers at Alibaba. Alibaba is a Chinese multinational conglomerate specializing in e-commerce, retail, Internet, AI and technology.

From its website, Apache Dubbo is

A high-performance, light weight, java based RPC framework. Dubbo offers three key functionalities:

Interface based remote call

Fault tolerance and load balancing

Automatic service registration & discovery

Continue reading “Getting started with Apache Dubbo and Spring Boot”

Introduction to Micronaut – A new JVM framework

Recently, I discovered a new framework in Java land called Micronaut. Being a Java developer for last fourteen years, I have seen many Java frameworks come and go. Apart from Spring, there are very few frameworks that made a mark. After a long time, I discovered a framework that is challenging the status quo of web development in Java.

Micronaut is an open-source modern JVM framework for building modular, easily testable microservice and serverless applications. It is a truly polyglot framework. You can build applications in Java, Kotlin, and Groovy.

Continue reading “Introduction to Micronaut – A new JVM framework”

Issue #19: 10 Reads, A Handcrafted Weekly Newsletter for Humans

The total time to read this newsletter is 165 minutes.

GRIT is that mix of passion, perseverance, and self-discipline that keeps us moving forward in spite of obstacles.– Daniel Coyle, author of The Little Book of Talent

Microsoft is building a Chromium-powered web browser that will replace Edge on Windows 10: 10 mins read. This is a big news. Microsoft is building a Chromium based browser. This will replace their current Edge browser powered by EdgeHTML rendering engine. Now, the big question is this a good decision?
1. For software developers this is good as they can expect same code to work in both Google Chrome and Microsoft new browser. This means less code to maintain and possibly fewer cross browser bugs.
2. For the web as a whole, it might not be great news. With Microsoft adopting Chromium, Google is more powerful. The only other viable choices are Firefox and Safari. Firefox team also wrote a post on this topic and highlighted some of the dangers that lie ahead of us.
Faster and simpler with the command line: deep-comparing two 5GB JSON files 3X faster by ditching the code : 15 mins read. This post shows power of jq. jq is a command-line utility for working with JSON. It can pretty print JSON or you can use it to manipulate JSON. The team at Genius was facing an issue where they wanted to compare two big JSON files (5GB in size). They used jq to convert JSON files to a single format. Then, using diff they were able to find the difference between two JSON files. This is a good post that shows how to use a tool effectively to solve a problem.
The deepest problem with deep learning: 30 mins read. The author raises an important point that Deep learning is not the panacea. It solves certain problems but not suitable for every problem. I am not into AI but still I feel we need to understand limitations of a technology.
Software Sprawl, The Golden Path, And Scaling Teams with Agency: 15 mins read. I enjoyed reading this post. It covers an interesting challenge that high-performing engineering teams face. High performing engineering teams have the autonomy and mastery to use the best tools possible for the job. This usually means multiple programming languages, multiple databases, different messaging systems, etc. This works great when organization is small but sooner this makes your operation team overwhelmed. Charity lays down a five step process to control the software sprawl. Read the post to know learn more about this topic.
The Forgotten History of OOP: 30 mins read. I have heard this multiple times in last couple of years in many different posts or talks. Each time I hear this, my mind start to think is Actor model object orientation done right. As mentioned in this post, Alan Kay(who coined object-oriented term) never thought object orientation means class based inheritance and polymorphism. For Alan, the key ideas in OOP are:
1. Message Passing
2. Encapsulation
3. Dynamic binding
Cache warming: Agility for a stateful service: 15 mins read. This post talks about Cache Warmer utility built by Netflix. Cache plays a big role in almost all serious applications. Cache is the only way you can achieve low response times. There are times when you need to spin up a new cache cluster. In this post, Netflix engineers explained multiple approaches they used to build Cache warmer utility. I also built Cache warmer utility recently in one of my projects. It is good to learn how industry stalwarts are doing it. This gives new ideas that we can apply in our systems as well.
You Should Build your Next App on a Boring Stack: 10 mins read. The author summed up it well “The best stack is the one that works the best for you. One you know well enough to have your gut feeling point you in the right direction 90% of the time an error occurs. Because that is what will help you build reliable, working software. And in turn, earn your customer’s loyalty.”
Distributed Systems: When you should build them, and how to scale. A step-by-step guide: 15 mins read. An easy step-by-step post on how to build distributed systems. Main points in the post are:
1. Most of your design choices will be driven by what your product does and who is using it.
2. Focus on figuring out what people need, and try to come up with a solution to their problem, even if it has a lot of manual steps.
3. Then think about ways to automate, spend your time coding and destroying, and use third parties where it makes sense.
4. Don’t scale but always think, code, and plan for scaling.
5. Build your system step by step, don’t address system design issues based on features that are not mature yet, and finally always try to find the best trade-off between the time you will spend and the gain in performance, money, and lowered risk.
React Native at Picnic: 10 mins read. The team at Picnic shares why they decided to use React Native instead of Native mobile app or PWA.
1. An important requirement for the new application was that there should be no device or operating system lock-in
2. Needs to operate well under uncertain networking conditions. Hence, offline support is very important
3. We decided to use TypeScript instead of Flow
4. For state persistence, we use redux combined with redux-persist for offline support
5. On the UI-side, we use styled components for styling and storybook to document our UI components. Snapshots are automatically generated for each story by using StoryShots and React Native Storybook Loader.
6. Any argument about syntax, we defer to Prettier.
7. Finally, as the cherry on the cake, we use husky to run pre-commit and pre-push hooks that verify that all code that we check in is up to the standards that we have set for ourselves.
How to Enjoy Life: 15 mins read. I could relate to each and every word written by the author. Life is all about enjoying small little things. Most of the time it is just the way we look at something. Author writes, “The habit of taking even mild pleasure in such tasks would be life-changing, because most of what we do during a typical day isn’t done for enjoyment’s sake: laundry, exercise, office work, dishes, dusting. We do these things because they make life better in some less immediate sense; they’re rewarding, but not necessarily as you do them.”.

Paper Summary: Simple Testing Can Prevent Most Failures

Today evening, I decided to read paper Simple Testing Can Prevent Most Failures: An analysis of Production Failures in Distributed Data-intensive systems.

This paper asks an important question

Why widely used distributed data systems designed for high availability like Cassandra, Redis, Hadoop, HBase, and HDFS. experience failures and what can be done to increase their resiliency?

We have to answer this question keeping in mind that these systems are developed by some of the best software developers in the world following good software development practices and are intensely tested.

These days most of us are building distributed systems. We can apply the findings shared in this post to build systems that are more resilient to failure.

The paper shares:

Most of the catastrophic system failures are result of incorrect handling of non-fatal errors explicitly signalled in the software

This falls into 1) empty error handling blocks, or error blocks with just log statement 2) the error handing aborts the clusters on an overly-general exception 3) the error handling code contains expressions like “FIXME” or “TODO” in the comments.

Most of the developers are guilty of doing all the three above mentioned. Developers are good at finding that something will go wrong but they don’t know what to do when something goes wrong. I looked at the error handling code in one of my projects and I found the same behaviour. I had written TODO comments or caught general exceptions. These are considered to be bad practices but still most of us end up doing.

Overall, we found that the developers are good at anticipating possible errors. In all but one case, the errors were checked by the developers. The only case where developers did not check the error was an unchecked error system call return in Redis.

Another important point mentioned in the paper is

We found that 74% of the failures are deterministic in that they are guaranteed to manifest with an appropriate input sequence, that almost all failures are guaranteed to manifest on no more than three nodes, and that 77% of the failures can be reproduced by a unit test.

Most popular open source projects use unit testing so it could be surprising that the existing tests were not good enough to catch these bugs. Part of this has to do with the fact that these bugs or failure situations happens when a sequence of events happen. The good part is that sequence is deterministic. As a software developer, I could relate to the fact that most of us are not good at thinking through all the permutation and combinations. So, even though we write unit tests they do not cover all scenarios. I think code coverage tools and mutation testing can help here.

It is now universally agreed that unit testing helps reduce bugs in software. Last few years, I have worked with few big enterprises and I can attest most of their code didn’t had unit tests and even if parts of the code had unit tests those tests were useless. So, even though open source projects that we use are getting better through unit testing most of the code that an average developer writes has a long way to go. One thing that we can learn from this paper is to start write high quality tests.

The paper mentions specific events where most of the bugs happen. Some of these events are:

Starting up services
Unreachable nodes
Configuration changes
Adding a node

If you are building distributed application, then you can try to test your application for these events. If you are building applications that uses Microservices based architecture then these are interesting events for your application as well. For example, if you call a service that is not available how your system behaves.

As per the paper, these mature open-source systems has mature logging.

76% of the failures print explicit failure related messages.

Paper mentions three reasons why that is the case:

First, since distributed systems are more complex, and harder to debug, developers likely pay more attention to logging.
Second, the horizontal scalability of these systems makes the performance overhead of outputing log message less critical.
Third, communicating through message-passing provides natural points to log messages; for example, if two nodes cannot communicate with each other because of a network problem, both have the opportunity to log the error.

Authors of the paper built a static analysis tool called Aspirator for locating these bug patterns.

If Aspirator had been used and the captured bugs fixed, 33% of the Cassandra, HBase, HDFS, and MapReduce’s catastrophic failures we studied could have been prevented.

Overall, I enjoyed reading this paper. I found it easy to read and applicable to all software developers.

What Motivates After You Have Met the Baseline?

In 2016, I wrote a technology series – 52-technologies-in-2016 in which I posted a new technical blog/tutorial every week for a year (I managed to write 43 posts). The series got quite popular, receiving more than 7000 stars on GitHub and close to a million page views. I created a page in the project GitHub repository covering the mentions it received from the community. I received thank you notes from software developers around the world. Though it involved a good amount of hard work, I thoroughly enjoyed doing it.

Back in 2017, I remember sharing with a friend of mine about this achievement and how it managed to reach thousands of developers around the world. After listening to me for a while, he said, “But, you did not make any money with it.” Well, he was right. I didn’t make any money with it. I politely answered him that money was not my ultimate goal of the series.

Continue reading “What Motivates After You Have Met the Baseline?”

Locust: Load test your REST API

Recently, I wanted to load test one of my applications. In my previous projects, I have used Apache JMeter and Gatling. Both of them are great tools but I wanted something with ease of Apache Benchmark and web user interface. In my limited experience with Gatling and JMeter, I found that it takes time to get started with these tools. I don’t use these tools daily so there is always a learning curve for me.

My quest for an easy to use load testing tool lead me to Locust. It took me no time to get started with the tool and I was able to understand performance characteristics of my application by running a load test. I found Locust a great tool if you want to find how many request per second your system can support and how response time varies for different percentiles of requests.

Locust is an open source load testing tool written in Python. It can simulate millions of users to load test your application. It has an intuitive user interface that you can use to easily get started with it. It allows you to define custom behaviour using Python code.

Continue reading “Locust: Load test your REST API”

Getting Started with Amazon Corretto: Production Ready Distribution of OpenJDK

In November 2018, James Gosling (who now works at Amazon), father of Java released Corretto at Devoxx Belgium conference. Amazon Corretto is no-cost, multi platform, production ready distribution of OpenJDK. This comes at a time when Oracle announced that it will no longer provide free binary downloads of JDK after a six-month period; and neither it will patch OpenJDK with fixes after that period. The six-month is the new release cycle for new JDK versions. If you follow Java, then you might be aware that Java has moved to six month release cycle. The latest version of JDK is 12.

Continue reading “Getting Started with Amazon Corretto: Production Ready Distribution of OpenJDK”