My Notes on GitLab Postgres Schema Design

I spent some time going over the Postgres schema of GitLab. GitLab is an open source DevOps platform and an alternative to GitHub that you can self-host.

My motivation for studying the schema of a big project like GitLab was to compare it against schemas I am designing and to pick up best practices from their schema definition. I can surely say I learnt a lot.

I am aware that best practices are sometimes context dependent, so you should not apply them blindly.

The GitLab schema file structure.sql [1] is more than 34,000 lines long. GitLab is a monolithic Ruby on Rails application, and the conventional way to manage the schema in Rails is a schema.rb file. The reason the GitLab team adopted structure.sql instead is explained in one of the issues [2] in their issue tracker:

Now what keeps us from using those features is the use of schema.rb. This can only contain standard migrations (using the Rails DSL), which aim to keep the schema file database-system neutral and abstract away from specific SQL. This in turn means we are not able to use extended PostgreSQL features that are reflected in the schema. Some examples include triggers, Postgres partitioning, materialized views, and many other great features.

In order to leverage those features, we should consider using a plain SQL schema file (structure.sql) instead of the Ruby/Rails standard schema.rb.

The change would entail switching to config.active_record.schema_format = :sql and regenerating the schema in SQL. Possibly, some build steps would have to be adjusted, too.
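
For a Rails application in general, the switch is a one-line configuration change (a sketch of the idea, not GitLab's exact change):

# config/application.rb
module MyApp
  class Application < Rails::Application
    # Dump the schema as plain SQL (db/structure.sql) instead of db/schema.rb
    config.active_record.schema_format = :sql
  end
end

Once this is set, running migrations regenerates db/structure.sql instead of db/schema.rb, and that file can contain triggers, partitioning, and other Postgres-specific DDL.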

Now, let’s go over the things I learnt from the GitLab Postgres schema.

If you find this article useful, please share it and tag me @shekhargulati.

Continue reading “My Notes on GitLab Postgres Schema Design”

Improve Git Monorepo Performance

Today, I was exploring the source code of the GitLab project and experienced poor performance of the git status command. GitLab is an open source alternative to GitHub.

Below is the output of the git status command:

 time git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
git status  0.20s user 1.13s system 88% cpu 1.502 total

The total here is the number of seconds it took for the command to complete.

The same was the case for the git add command.

time git add .
git add .  0.21s user 1.11s system 115% cpu 1.146 total

So both commands took more than a second to finish.

These commands are slow because they need to search the entire worktree looking for changes. When the worktree is very large, Git needs to do a lot of work.
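
One common mitigation (assuming a recent Git; 2.37+ ships a built-in filesystem monitor) is to let Git cache filesystem state instead of rescanning the whole worktree on every command:

# Enable the built-in filesystem monitor daemon for this repository
git config core.fsmonitor true

# Cache the results of the untracked-files scan between runs
git config core.untrackedCache true

# Re-time the command to see the difference
time git status

With these settings, subsequent git status and git add runs only need to inspect the paths the monitor reports as changed.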

Continue reading “Improve Git Monorepo Performance”

Why I like gRPC?

I have started using gRPC for service-to-service communication between Microservices and I am liking it so far.

I still prefer to expose APIs to the external world (browser, mobile, or third party) using either REST or GraphQL. I am aware that you can use gRPC in mobile apps and grpc-web in web frontends, but I have not used gRPC for those use cases yet.

I have previously used REST (JSON over HTTP) and/or some form of event-driven communication for service-to-service communication. Both work, but the programming model leaves much to be desired.

gRPC is a modern, efficient, HTTP/2-based inter-process communication framework developed by Google. It is heavily used at Google and at many other major tech companies such as Square, Lyft, Netflix, Cockroach Labs, and Salesforce.

gRPC builds on top of HTTP/2 and TLS/SSL as an efficient and secure transport layer. It uses Protocol Buffers for defining API contracts and for efficient serialization. The gRPC core provides the framework for efficient service-to-service communication, and gRPC tooling generates the clients and servers used by the application tier.
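
To make the Protocol Buffers part concrete, here is a minimal, illustrative service contract (the service and message names are hypothetical, not from any real API):

syntax = "proto3";

package orders.v1;

// gRPC tooling (protoc with the gRPC plugin) generates client stubs
// and server skeletons from this contract in your language of choice.
service OrderService {
  rpc GetOrder (GetOrderRequest) returns (GetOrderResponse);
}

message GetOrderRequest {
  string order_id = 1;
}

message GetOrderResponse {
  string order_id = 1;
  string status = 2;
  int64 total_cents = 3;
}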

Now that we understand the gRPC basics, I will share my reasons for preferring gRPC for service-to-service communication.

Continue reading “Why I like gRPC?”

Choosing Primary Key Type in Postgres

In relational database design, one of the key decisions is choosing the right primary key type for your tables. In this post I am talking about surrogate or synthetic primary keys, so called because they are not derived from application data. In my experience, very few teams give proper thought to primary key types; for every table they go with the default used in their organization, which means all tables end up with an int/bigint, uuid, or varchar key. In this post I am not making a recommendation; I am only discussing how different key types affect insertion speed and data size, and how they compare against each other. You will have to do your own analysis to choose the right type for your use case. Different tables have different needs, so you should use your judgement accordingly. There is no one-size-fits-all solution.

In this post, I am using Postgres 13.3. You can install it using your operating system package manager. I am running it in a Docker container.

To follow along, you can create a sample database and run the SQL queries in it.

create database choosing_pk_type;

Connect to this database

\c choosing_pk_type
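
To give a flavour of the setup such a comparison needs (the table and column names here are my own, purely illustrative), you can create one table per key type, bulk-insert rows, and compare insertion time and on-disk size:

-- bigint surrogate key generated by the database
create table users_bigint (
    id bigint generated always as identity primary key,
    name text not null
);

-- uuid surrogate key; gen_random_uuid() is built into Postgres 13+
create table users_uuid (
    id uuid primary key default gen_random_uuid(),
    name text not null
);

-- time the inserts (\timing in psql) to compare insertion speed
insert into users_bigint (name)
select 'user_' || n from generate_series(1, 1000000) as n;

insert into users_uuid (name)
select 'user_' || n from generate_series(1, 1000000) as n;

-- compare table plus index size on disk
select pg_size_pretty(pg_total_relation_size('users_bigint')) as bigint_size,
       pg_size_pretty(pg_total_relation_size('users_uuid')) as uuid_size;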

Primary Key Types Comparison Summary

The comparison summary table, the detailed benchmark table it is derived from, and the TL;DR version are all in the full post.

Continue reading “Choosing Primary Key Type in Postgres”

Why not X?

I get this question a lot from our customers when we recommend technologies as part of a new software development proposal. Why not MySQL instead of Postgres? Why not Kotlin instead of Java? Why not Vue.js instead of React? Why not Memcached instead of Redis? Why not Dropwizard instead of Spring Boot? Why not Flutter instead of React Native? And the list goes on. The important point to note is that the alternative is not radically different from the proposed technology. Both options have more or less the same characteristics, and there are successful products built using both. In this post I want to discuss how to answer these "why not X" questions.

In this post I am not covering situations where X is very different from the proposed technology. For example, why not Cassandra instead of Postgres? Or why not Aerospike instead of Redis? Or why not Rust instead of Java?

The obvious answer to why we propose certain technologies is that we have expertise in them and it is easier to find engineers in the market with those skills. But when people ask this question they want to hear strong technical reasons, not the usual boring obvious answer. So the obvious answer does not fly.

Continue reading “Why not X?”

Applying DDD To Architect a Digital Bank Part 1: Analyzing domain using Subdomains

I have started work on a new project where we are architecting and building a new digital retail bank from scratch. Architecting a bank from scratch is a huge undertaking; there are many systems and partners involved. I started working in the banking domain only a year ago, and I can surely say it is full of interesting technical challenges. Both incumbent and challenger banks are building systems using cloud native architecture and technologies. It is an interesting time to be working in the BFSI and FinTech space. Digital banks open up the possibility of making banking accessible to millions of people across the world.

Shameless plug: I am hiring. If you are a software engineer or an architect looking for interesting technical work, then contact me using the contact form.

DDD stands for Domain-Driven Design, a term coined and popularized by Eric Evans when he wrote the seminal DDD blue book [1] in 2003. I read the blue book seven years ago. It made me understand why it is important to have a good understanding of the domain when building complex software.

Continue reading “Applying DDD To Architect a Digital Bank Part 1: Analyzing domain using Subdomains”

Architecture of Open source systems #1: Umami: An open source Google Analytics Alternative

As I have gained experience building software, I have realized that the most important skill I can build is understanding existing codebases. You can learn a new technology stack or framework much faster by reading an existing codebase that uses it and trying to rebuild an existing app step by step. Today, I wanted to learn how to build a web analytics service like Google Analytics, the most widely used web analytics service on the web. I found a popular open source project, umami.

Umami is a simple, fast, privacy-focused alternative to Google Analytics.

We will first understand the project from the outside in, and then build the backend of umami step by step.

Tools Used

  • VS Code
  • Node.js
  • DBeaver for creating the ER diagram and as a database client
  • Httpie
  • Git
  • MySQL

Project Details

  • GitHub repo – https://github.com/mikecao/umami
  • 11,296 stars
  • 1,580 forks
  • 132 contributors
  • 47 releases on GitHub; the last one was 15 days ago
  • 1379 commits
  • The project is 1 year and 10 months old. First commit was made on 17th July 2020
  • Close to 10,000 lines of modern JavaScript code (measured with Tokei), covering both backend and frontend
  • MIT license

Technology Stack of Umami

  • Node.js 12+ and a modern JavaScript codebase
  • Next.js as the backend web framework for REST APIs
  • Prisma as the ORM
  • Postgres or MySQL as the database
  • React as the frontend framework

Next.js is a framework for building server-rendered React web applications; it takes building React-based web applications to the next level. The main reasons you would want to use Next.js are listed below (a minimal API route sketch follows the list):

  • Zero config but you can easily override defaults
  • Extensible
  • Server-side rendering
  • Build both dynamic and static websites with a single framework
  • Supports all modern browsers
  • Convention over configuration
  • Code splitting
  • SEO Optimized
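
Since umami uses Next.js API routes for its REST backend, here is a minimal, illustrative route handler (the file name and payload are hypothetical, not umami's actual code):

// pages/api/collect.js: Next.js maps this file to the /api/collect endpoint
export default function handler(req, res) {
  if (req.method !== 'POST') {
    return res.status(405).json({ error: 'method not allowed' });
  }

  // req.body is parsed automatically for JSON requests
  const { website, url } = req.body;

  // In a real analytics backend this is where the pageview/event
  // would be validated and persisted (umami does this via Prisma).
  return res.status(200).json({ ok: true, website, url });
}
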
Continue reading “Architecture of Open source systems #1: Umami: An open source Google Analytics Alternative”

My take on libraries over frameworks (Spring Boot vs Dropwizard)

I am unable to sleep because of a fever, so I thought let me write this post. Maybe it is not the best time to write it, but who cares. A couple of weeks back I got into a discussion with a customer CTO about libraries over frameworks. This customer wants us to prefer libraries over frameworks. This was mainly with regard to the Spring Boot vs Dropwizard discussion.

The best definition I have read on the web about frameworks and libraries is that a framework calls your code, whereas your code calls the library API. A framework does much more and has strong opinions. Libraries are focused, solve one problem, and are swappable (not entirely true without proper abstractions).

The customer wanted us to use Dropwizard instead of Spring Boot because of the following reasons:

  • Spring Boot does too much auto-magic. With something like Dropwizard you have control over how things bind together: you can use manual dependency injection instead of an IoC container doing that magic for you. You can disable auto-configuration in Spring Boot, and you can also wire beans by hand if you want, but I agree the default way is to rely on auto-configuration.
  • Spring Boot's vulnerability surface area is higher because it is too easy to add starter jars and bring in all their transitive dependencies. I think this will be mostly true with any other approach to building software in Java unless 1) you go down the stackless way (I don't think the Java platform is there yet) or 2) you have good governance over what gets into your dependencies.
  • Spring Boot executable size is much higher. I compared bare-bones Spring Boot (spring-boot-starter-web with Tomcat) and Dropwizard (default Maven archetype) executable sizes. As it turns out, Spring Boot was 17M and Dropwizard was 19M.
  • Spring Boot startup time is higher. This depends a lot on what you are doing at startup. The bare-bones Spring Boot app started in 1.64 seconds, whereas the bare-bones Dropwizard app took 1.526 seconds to start up.
  • Spring Boot consumes much more memory. This was true. Spring Boot loaded 7591 classes whereas Dropwizard loaded 6255 classes. Also, Spring Boot's heap space consumption was twice that of Dropwizard.
  • Spring Boot apps are difficult to debug. I agree exception stack traces are too long at times and it takes a minute or two to reach your calling code. But I personally never had much trouble debugging Spring Boot apps; I mostly rely on good tests and logging to debug stuff.
  • Lastly, they wanted us to follow a general principle – prefer libraries over frameworks.

The funny part is that Spring Boot does not call itself a framework, and the Dropwizard documentation states that Dropwizard straddles the line between a library and a framework.

We went with Dropwizard :). I respect their decision and I think their reasons have merit. I myself have seen too many badly architected/built Spring Boot apps, so I am open to trying out a new, simpler, and better alternative.

I have read the Brandon Smith post – Write Libraries, not Frameworks. I also think libraries over frameworks is a good architecture principle that we should strive for. The only problem is when we apply principles blindly without understanding the context. 

I think principles like favouring libraries over frameworks are fundamental. For these principles to work you need to have the right context and provide the right environment. I think it will work for you when:

  • You have a good engineering team that understands the cost of adding libraries; there is no free lunch. It does not work in a typical bottom-heavy pyramid team structure where teams are treated as feature factories: they will pull in any library under the sun to deliver features if their mindset is not aligned with the principle.
  • You have good governance, enforced by healthy code reviews, automation (aka fitness functions), architecture knowledge-sharing sessions, Microservices production readiness reviews, and architects with skin in the game.
  • You spend effort and resources on developer experience to build tooling that makes it easy to scaffold new Microservices with your opinions and choices baked in. I am not sure how it will be any different from a pure framework approach. You will end up building your own Microservices framework with your library choices. 
  • You train your software engineers to buy into this methodology. They will have to unlearn the existing way and learn the new way to build software.
  • You understand productivity might take a hit till developers understand the new way to build software. Frameworks give you a productivity boost by helping you get started faster and solving common problems for you.
  • You use automated checks to continuously prove that software is not deviating from the principle. You can write a build tool task that fails the build if executable size reaches a threshold. You can write tests that fail the build if people use certain libraries. All of this is possible. You need to invest the time of the right engineers to make this happen.

I don't know whether this will work in our environment or context. It will depend on whether we can walk the talk. I have seen too many times all the good things get thrown under the bus when the business pushes for features.

Looks like the medicine (or writing this post) has done its job. I started writing at 2:28am and now, at 3:39am, my fever is down and I am feeling better.

Why naming stuff is hard?

Over the last few months I have spent a lot of time doing code reviews. During the code review exercise I also pair with developers to refactor and improve the quality of their pull requests (PRs). I care about two things in code reviews: correctness and understandability. In this post I will not focus on correctness (I might write a future post on that). Today, I want to focus on the most important aspect of making code easier to understand: good names. Most of the time I spend in code review goes into coming up with intention-revealing names for classes, methods, interfaces, variables, packages, modules, and Microservices. I find most developers (irrespective of experience level) struggle to come up with good names.

There are only two hard things in Computer Science: cache invalidation and naming things. 

Phil Karlton

In this post I will list three reasons I think developers struggle to come up with good names.

Continue reading “Why naming stuff is hard?”

Structuring Spring Boot Microservices Configuration

All software systems we build use some sort of configuration files. These configuration files change depending on the environment in which your service/system is deployed. They allow us to build a single deployable unit that can be deployed in multiple environments without any code change: we just change the configuration file depending on the environment and point our service to the external location where the configuration file exists. Our service then uses the configuration file to bootstrap itself.

Configuration files become unwieldy if not managed well. Incorrect configuration values are one of the major causes of system downtime. Most teams don't write tests for their configurations, so a lot of the time bugs are discovered in higher environments.

I was also seeing the same problem in one of my projects. There was a lack of clarity on which configuration properties change between environments and which remain the same. Also, in local and lower environments I don't mind database credentials in my configuration files, but in higher environments I don't want them to be present in the code.
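
As a minimal sketch of the structure I have in mind (the property names are illustrative, and I am assuming Spring Boot 2.4+, which supports multi-document YAML with profile activation), shared defaults live at the top and environment-specific overrides are activated per profile:

# application.yml: defaults shared by every environment
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/orders
    username: orders_app
    password: local-only-password   # acceptable for local/lower environments
---
# Applied only when the 'prod' profile is active (--spring.profiles.active=prod);
# secrets come from the environment or an external config location, not from code.
spring:
  config:
    activate:
      on-profile: prod
  datasource:
    url: ${DB_URL}
    username: ${DB_USERNAME}
    password: ${DB_PASSWORD}

An external file can also be layered on top at runtime, for example with --spring.config.location or spring.config.import, which keeps higher-environment values out of the repository.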

Continue reading “Structuring Spring Boot Microservices Configuration”