Author Archives: shekhargulati

Building an Article Extraction Python API with newspaper3k and flask

Today, I was working on an application that required me to extract the main content html for a web page. This is called article extraction. Most of the time you want to extract the text of the article but I wanted to extract HTML of the main content. For example, if you are reading following WashingtonPost article then I want to extract the main HTML content on the left. I don’t want sidebar HTML containing ads or other information.

Continue reading

Issue #30: 10 Reads, A Handcrafted Weekly Newsletter For Software Developers

The time to read this newsletter is 175 minutes.

Do every act of your life as though it were the very last act of your life – Marcus Aurelius

  1. Learn more programming languages, even if you won’t use them: 10 mins read. I first got this advice few years back when I watched a two minute video by Bjarne Stroustrup, creator of C++. He recommended you should not call yourself programmer if you only know one programming language. The magic number he mentioned in the video was 5. This post also makes the same point. Different programming languages are good at different things. Every programming language makes a tradeoff. They help you think about a problem in different way. I got hang of functional programming once I learnt Scheme basics. I try to learn a new programming language every couple of years. I need to start using them in my side projects.

  2. Announcing AMP Real URL: 20 mins read. In case you are not aware, AMP stands for Accelerated Mobile Pages. AMP is an open source standard led by Google that helps speed up access to websites by caching the content near to user. This is good for readers but for content producers there were few issues. The biggest issue with AMP is that rendered webpage has a URL starting with https://google.com/amp/. Users have become used to looking at the navigation bar in a web browser to see what web site they are visiting. The AMP cache breaks that experience. In this post by Cloudflare folks authors talk about how they fixed the real origin URL problem with AMP using web packaging and Cloudflare workers.

  3. Infrastructure as Code, Part One: 15 mins read. This is an introductory read on infrastructure as code. If you are not aware of it then you should give it a read. It is a nicely written introduction to IaC.

  4. When rules don’t apply: 30 mins read. This is a 30 mins video that talks about how executives at Apple, Google, eBay, Intuit, and other big tech companies conspire against their own employees by secretly agreeing among themselves not to hire each other employees. Tech companies treat their employees as their assets and cheat them.

  5. Designing a modern serverless application with AWS Lambda and AWS Fargate: 20 mins read. A lot of good ideas in this post on how to build modern applications. The key points for me in this post are:

    1. You should different compute services based on the use case. The post talks about why author used both AWS Lambda and AWS Fargate. For short computation jobs use lambda and for long compute jobs that have no designated end use AWS Fargate.
    2. Give a thought on isolation model when deciding which compute service to use. AWS Lambda compute instances are isolated from each other so if one rogue your application will not suffer.
    3. When you are building a side project or building MVP for your startup your goal should be to minimise maintenance and operation tasks. Serverless services help you do that.
    4. AWS CDK is a service that allows you to write IaC in your own preferred language.
  6. Thundering Herds & Promises: 10 mins read. I love this kind of posts which share how team solved a real-world technical problem. This post covers how Instagram solved the thundering herd problem with their cache using the promises. The below explains explains what thundering herd problem means

    > If your cache is hit with 100 concurrent requests, then, since the cache is empty, all of them will get a cache-miss at the one moment, resulting in 100 individual requests to the backend. If the backend is unable to handle this surge of concurrent requests (ex: capacity constraints), additional problems arise. This is what’s sometimes called a thundering herd.

  7. The Good and the Bad of Google Cloud Run: 10 mins read. The key point made in this post is that Google Cloud Run is not FaaS. Google Cloud Run allows developers to push container images with HTTP server to GCP and GCP takes care of running them at scale. If you have build pure serverless application you will know that pure serverless apps architecture is event-driven service-full architecture. This forces developers to think about applications in a different way. According to author, Cloud Run is providing a safety blanket for developers intimidated by the paradigm shift of FaaS and service-full architecture.

  8. Azure Cosmos DB: Microsoft’s Cloud-Born Globally Distributed Database: 20 mins read.This is a detailed explanation of Azure’s Cosmos DB internals. This article was too technical and detailed for me. I will try to re-read it again to better grasp the underlying details of Cosmos DB.

  9. How to Improve Your Memory (Even if You Can’t Find Your Car Keys): 10 mins read. This post by Adam Grant talks about how to improve your retention power. The key points are:

    1. Take rest after learning a new concept.
    2. Don’t re-read stuff
    3. Try to do a small quiz on what you have learnt or try to explain it to someone
    4. I also apply similar technique in my newsletter by summarising what I have learnt from a post in my own words.
  10. An Overview of Go’s Tooling: 30 mins read. This is the post that you should bookmark if you are a Go developer. The post covers most the Go tools a developers need to interact with. I wish more such posts should be written for other languages as well.

Video for this week:

Mental Models for Software Engineers: Hanlon’s Razor

It is difficult to be sanguine when people around you does something wrong with you. The wrong doing does not have to be extreme it could be as simple as a colleague not replying to your email when you expected them to reply. In many situations like these we tend to assume that other person does not want our good and they are doing it intentionally. The result of these explanations is that we strain our relations with people around us. We can’t read their mind but still we end up assuming what is in their mind. Hanlon’s razor mental model can help us overcome this bias.

Continue reading

Issue #29: 10 Reads, A Handcrafted Weekly Newsletter For Software Developers

The time to read this newsletter is 180 minutes.

Quality is not an act, it is a habit – Aristotle

  1. Your boss is 90% of the ‘Employee Experience’. Nothing else comes close.10 mins read.
  2. How NOT to hire a software engineer10 mins read.
  3. How We Used WebAssembly To Speed Up Our Web App By 20X20 mins read.
  4. Why We Moved from Heroku to Google Kubernetes Engine15 mins read.
  5. We deployed Envoy Proxy to make Monzo faster20 mins read.
  6. Achieving consistency where distributed transactions have failed: 30 mins read.
  7. Scaling Elasticsearch Part 1: How to Speed Up Indexing: 15 mins read.
  8. Five Super Helpful Rust Things That Nobody Told You About: 10 mins read.
  9. Linkerd v2: How Lessons from Production Adoption Resulted in a Rewrite of the Service Mesh: 30 mins read.
  10. 5 Things to Stop Doing When You’re Struggling and Feeling Drained: 20 mins read.

Introducing Chaos Engineering to an Organization

This post explains my learning on how to introduce Chaos Engineering to an organisation. This is based on my experience of re-architecting monolithic application to Microservices based architecture. Microservices architecture style structures an application as a collection of loosely-coupled services. Microservices architecture has many benefits like independent development and deployments of services, eliminate long-term commitment to a technology stack, specialized services built by small teams, and many others. One of the drawbacks of Microservices is that it increases the surface area of failures. You now have to deal with failures related to the interaction between services and system boundaries. Our client was facing issues running their distributed application in a steady state. The issues that we faced were:

  1. Communication failure between services. There was no clear strategy on how to handle network failure between services and how to give proper feedback to the customers of the application.
  2. Difficulty in understanding why the whole application became unavailable when only a single service was down. Is there any single point of failure? These types of issues were not visible with usual testing.
  3. System becoming partially unavailable when the network gets choked.
  4. Unwanted local state leading to system unavailability when one instance of the service dies.
  5. Out of memory errors in production services leading to complete or partial unavailability of the system.
  6. Possible data loss issues as data replication and backup strategies were never tested in real workloads.

Continue reading

A minimalistic guide to distributed tracing with OpenTracing and Jaeger

If you have ever worked on a distributed application you will know that it is difficult to debug when things go wrong. The two common tools to figure out root cause of the problem are logging and metrics. But the fact of the matter is that logs and metrics fail to give us complete picture of a situation in a distributed system. They fail to tell us the complete story.

If you are building a system using Microservices / Serverless architecture then you are building a distributed system.

Logs fail to give us the complete picture of a request because they are scattered across a number log files and it is difficult to link them together to form a shared context. Metrics can tell you that your service is having high response time but it will not be able to help you easily identify the root cause.

Logging and Metrics are not enough to build observable systems.

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. It helps bring visibility into systems. – Wikipedia

Logs, metrics, and traces are the three pillars of observability. While most software teams use logging and monitoring few of them use traces. Before we look at distributed tracing in depth, let’s define logs, metrics, and traces.

Continue reading