How does FerretDB work?

In recent weeks, I have come across FerretDB on multiple occasions, so I thought I would take a closer look at it. FerretDB particularly interested me because it is a MongoDB implementation built on top of my favourite database, PostgreSQL.

While I do have high-level thoughts on how I would go about building MongoDB on top of Postgres, I wanted to confirm, validate, and learn how the FerretDB team has been doing it.

FerretDB (previously MangoDB) was founded to become the de-facto open-source substitute for MongoDB. FerretDB is an open-source proxy that converts MongoDB 5.0+ wire protocol queries to SQL, using PostgreSQL as the database engine.

At a high level, FerretDB is a proxy that implements the MongoDB wire protocol that MongoDB clients speak. After establishing a connection with a MongoDB client, it translates every query the client sends into SQL queries that Postgres understands.
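
As a rough illustration of what such a translation could look like, assume each collection is kept as a Postgres table with the documents in a jsonb column (the schema, table, and column names below are hypothetical, not FerretDB's actual internals):

-- MongoDB query: db.users.find({age: {$gt: 21}})
-- A possible SQL translation over a jsonb document column:
SELECT _jsonb
FROM "test"."users"
WHERE (_jsonb->>'age')::numeric > 21;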

In the recent release (0.5.0) of FerretDB, it is also possible to use it as a library rather than as a proxy. Using FerretDB as a library removes one network hop, which leads to better performance. This is only possible for applications written in Go, since FerretDB itself is implemented in Go.

Continue reading “How does FerretDB work?”

My Notes on GitLab Postgres Schema Design

I spent some time going over the Postgres schema of GitLab. GitLab is an alternative to GitHub. Since it is an open-source DevOps platform, you can self-host it.

My motivation for understanding the schema of a big project like GitLab was to compare it against the schemas I am designing and to learn some best practices from their schema definition. I can surely say I learnt a lot.

I am aware that best practices are sometimes context-dependent, so you should not apply them blindly.

The GitLab schema file structure.sql [1] is more than 34,000 lines of code. GitLab is a monolithic Ruby on Rails application. The popular way to manage schema migrations in Rails is the schema.rb file. The reason the GitLab team decided to adopt structure.sql instead is mentioned in one of the issues [2] in their issue tracker.

Now what keeps us from using those features is the use of schema.rb. This can only contain standard migrations (using the Rails DSL), which aim to keep the schema file database-system neutral and abstracted away from specific SQL. This in turn means we are not able to use extended PostgreSQL features that are reflected in the schema. Some examples include triggers, Postgres partitioning, materialized views, and many other great features.

In order to leverage those features, we should consider using a plain SQL schema file (structure.sql) instead of the Ruby/Rails standard schema.rb.

The change would entail switching to config.active_record.schema_format = :sql and regenerating the schema in SQL. Possibly, some build steps would have to be adjusted, too.
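
To make the point concrete, here is the kind of Postgres-specific DDL that a plain structure.sql can record but the Rails DSL cannot express. This is a generic sketch, not taken from GitLab's actual schema:

-- Declarative partitioning and a materialized view, neither of which
-- schema.rb can express. All names below are made up for illustration.
CREATE TABLE audit_events (
    id bigint NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_events_2022 PARTITION OF audit_events
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

CREATE MATERIALIZED VIEW audit_events_daily AS
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM audit_events
    GROUP BY 1;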

Now, let’s go over the things I learnt from the GitLab Postgres schema.

Continue reading “My Notes on GitLab Postgres Schema Design”

Choosing Primary Key Type in Postgres

In relational database design, one of the key decisions is choosing the right primary key type for tables. In this post I am talking about surrogate or synthetic primary keys. They are called surrogate or synthetic because these keys are not derived from application data. In my experience, very few teams give proper thought to deciding primary key types. For each table they go with the default type used in their organization, which means all tables will have an int/bigint, uuid, or varchar key.

In this post I am not giving any recommendation; I am only discussing how different key types affect insertion speed and data size, and how they compare against each other. You will have to do your own analysis to choose the right data type based on your use case. Different tables have different needs, so you should make a judgement accordingly. There is no one-size-fits-all solution.

In this post, I am using Postgres 13.3. You can install it using your operating system's package manager; I am running it in a Docker container.

To follow along, you can create a sample database and run the SQL queries in it.

create database choosing_pk_type;

Connect to this database

\c choosing_pk_type;
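
For example, tables using the three common surrogate key types might be defined like this (an illustrative sketch; the table names are made up):

-- Illustrative tables for the three common surrogate key types.
create table users_bigint (
    id bigint generated always as identity primary key,
    name text
);

create table users_uuid (
    -- gen_random_uuid() is built into Postgres 13+
    id uuid default gen_random_uuid() primary key,
    name text
);

create table users_varchar (
    -- the application supplies the key value
    id varchar(36) primary key,
    name text
);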

Primary Key Types Comparison Summary

[Comparison summary chart and its underlying numbers table are not included in this excerpt.]

Now the TLDR version.

Continue reading “Choosing Primary Key Type in Postgres”

Correctly using Postgres as queue

I am building a central notification dispatch system that is responsible for sending different kinds of notifications to the end customer. It relies on multiple third-party APIs for sending the actual email/SMS notifications. A high-level architecture of the system is shown below.

NotificationSender exposes both REST and messaging interfaces for accepting consumer requests. Consumers here are the services that need to send notifications. This is what the notification system does:

  • It accepts requests from upstream services and stores them in the Postgres database after validation. The notification event is written to the Postgres database in the ENQUEUED state. It returns HTTP 202 Accepted to the upstream service if the request is valid; otherwise it returns HTTP 400 Bad Request.
  • At a predefined frequency, a poller that is part of the NotificationDispatcher polls the Postgres database for new notification events, i.e. events in the ENQUEUED state. For now, it respects insertion-time order. A sketch of such a polling query follows this list.
  • If enqueued events are found, it processes them and sends the actual notifications using the downstream SMS and email services.
  • After processing the events, it changes their state to PROCESSED.
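
A common shape for such a polling query in Postgres uses FOR UPDATE SKIP LOCKED so that multiple pollers can dequeue concurrently without blocking each other. This is a sketch under assumed table and column names, not the system's actual code:

-- The notification_events table and its columns are assumed names.
UPDATE notification_events
SET state = 'PROCESSING'
WHERE id IN (
    SELECT id
    FROM notification_events
    WHERE state = 'ENQUEUED'
    ORDER BY created_at      -- respects insertion-time order
    LIMIT 10
    FOR UPDATE SKIP LOCKED   -- concurrent pollers skip locked rows
)
RETURNING id;
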
Continue reading “Correctly using Postgres as queue”

Issues with mmap

A couple of months back I watched a video by Andy Pavlo, Associate Professor of Databases at Carnegie Mellon, where he made the point that databases should not use mmap. He went on to say that if there is only one thing you should take away from his database course, it is to never use mmap when building and designing database management systems. I had not used mmap before, so I was intrigued to understand it in more detail. I was aware that MongoDB used to use an mmap-based storage engine. It allowed them to achieve a faster time to market, but they later had to replace it with a new storage engine, WiredTiger, because of the issues they faced with mmap. MongoDB is not the only database that has used mmap; RavenDB, Elasticsearch, LevelDB, InfluxDB, LMDB, BoltDB, moss (a key-value store from Couchbase), and many others use it too.

Given that so many databases use mmap, I wanted to understand why Andy recommends not using it. In this post I will list all of the reasons I could find in my research and in Andy's video. But before we do that, let's first understand mmap.

Continue reading “Issues with mmap”

The 5 Minute Introduction to DuckDB: The SQLite for Analytics

Updated: 3rd September 2020

A couple of weeks back I learnt about DuckDB while going over the DB Weekly newsletter. It immediately caught my attention, as I could quickly see why the need for such a database exists. Most developers are used to working with an embedded, file-based relational database in their local development environment. The most popular choice among embeddable RDBMSs is SQLite. Developers use embeddable databases because there is no setup required and they can get started in a couple of minutes. This enables quick prototyping, and developers can iterate rapidly on business features.

DuckDB is similar to SQLite in the sense that it is also designed to be used as an embeddable database. Developers can easily include it as a library in their code and start using it. Later in this post, I will cover how we can use DuckDB with Python.
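
To give a taste of what it offers for analytics, DuckDB can query a CSV file directly with plain SQL. This small sketch is runnable in the duckdb CLI; the file name and columns are made up:

-- Query a CSV file directly; read_csv_auto infers the schema.
-- weather.csv is a hypothetical file.
SELECT city, avg(temperature) AS avg_temp
FROM read_csv_auto('weather.csv')
GROUP BY city
ORDER BY avg_temp DESC;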

Continue reading “The 5 Minute Introduction to DuckDB: The SQLite for Analytics”

How To Think About Different Database Data Models: Relational vs Document Data Models

This week I was talking to a developer about how to think about the data models supported by different databases. One thing I have learnt in my 15 years of building web applications is that data models play an important role in the success of any software application.

A data model provides an abstract model for organising elements of data and describes how these elements relate to each other. Data models describe the structure, manipulation, and integrity aspects of the data stored in data management systems such as relational databases. For example, when you model your problem domain in a relational database, you think in terms of real-world entities and how those entities are related to each other. We usually use Entity Relationship (ER) diagrams to model tables in relational databases. In this post, we will focus on two popular data models: Relational and Document. We will discuss when you would use one over the other.

A data model influences two main attributes:

  1. Ease of use: A data model can make some operations easy to achieve and others difficult or impossible
  2. Performance: A data model can be suitable for faster reads but slower writes, or vice versa
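
As a concrete illustration, the same user profile can be modelled both ways. Here Postgres jsonb stands in for a document store, and all table and column names are made up:

-- Relational: normalised into two tables linked by a foreign key.
CREATE TABLE users (
    id bigint PRIMARY KEY,
    name text
);
CREATE TABLE addresses (
    user_id bigint REFERENCES users (id),
    city text
);

-- Document: the whole profile stored as one self-contained record.
CREATE TABLE user_docs (
    doc jsonb
);
INSERT INTO user_docs
VALUES ('{"id": 1, "name": "Jane", "addresses": [{"city": "Pune"}]}');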

Continue reading “How To Think About Different Database Data Models: Relational vs Document Data Models”

MemSQL Introduction: A Hybrid transactional/analytical processing database


MemSQL is a fast, commercial, ANSI SQL compliant, highly scalable HTAP database. HTAP databases are those that support both OLTP and OLAP workloads. It supports ACID transactions just like a regular relational database. It also supports document and geospatial data types.

MemSQL is fast because it stores data in memory. But that does not mean it is not durable: it maintains a copy of the data on disk as well. Transactions are committed to a transaction log on disk and later compressed into full-database snapshots. One of the main reasons new databases are designed as in-memory first is that memory is getting cheaper every year; it is estimated that memory becomes roughly 40% cheaper each year.

MemSQL has tunable durability. You can make it fully durable or completely ephemeral, and disk writes can be synchronous or asynchronous.

MemSQL simplifies your architecture because you don't have to write ETL jobs to move data from one data store to another. This is the biggest selling point of any HTAP database.
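
Since MemSQL speaks ANSI SQL, the same store can serve both sides of the workload. The sketch below is illustrative, with a made-up orders table:

-- OLTP: a transactional point write (orders is a hypothetical table).
INSERT INTO orders (id, customer_id, amount, placed_at)
VALUES (1001, 42, 99.50, NOW());

-- OLAP: an analytical aggregation over the same table, with no ETL step.
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 10;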

Continue reading “MemSQL Introduction: A Hybrid transactional/analytical processing database”

The Minimalistic Guide to ACID Transactions

Welcome to the third post in the distributed systems series. So far in this series, we have looked at service discovery and the CAP theorem. Before we move along in our distributed systems learning journey, I thought it would be useful to refresh our memory with an understanding of ACID transactions. ACID transactions are at the heart of relational databases, and knowledge of them is useful when building distributed applications.

Understanding ACID transactions

A transaction is a sequence of operations that form a single logical unit of work. Transactions are executed on a shared database system to perform a higher-level function. An example of a higher-level function is transferring money from one account to another. Transactions represent the basic unit of change in a database: a transaction either executes in its entirety or not at all.

ACID (Atomicity, Consistency, Isolation, and Durability) refers to a set of properties that a database transaction should guarantee even in the event of errors, power failure, etc. The canonical example of an ACID transaction is the transfer of funds from one bank account to another. In a single fund-transfer transaction, you have to check the account balance, debit one account, and credit another account. The ACID properties guarantee that either the money moves from one account to the other correctly and permanently, or, in case of failure, both accounts retain their initial state. It would be unacceptable if one account was debited but the other account was not credited.

Database transactions are motivated by two independent requirements:

  1. Concurrent database access: Multiple clients can access the system at the same time. This is achieved by the Isolation property of ACID transactions.
  2. Resiliency to system failures: The system remains in a consistent state in case of a failure. This is provided by the Atomicity, Consistency, and Durability properties of ACID transactions.
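
In SQL, the fund-transfer example above boils down to wrapping both updates in one transaction (a sketch; the accounts table is hypothetical):

-- Either both updates become permanent together, or neither does.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;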

Continue reading “The Minimalistic Guide to ACID Transactions”

PostgreSQL learning

To find out the location of the PostgreSQL configuration files, connect to your PostgreSQL database and run the following command.

SELECT name, setting FROM pg_settings WHERE category = 'File Locations';

To describe a table, you should use

\d+ table_name

To list all the databases, you should run either of the following commands.

\list
\l

To view all the tables in a schema

\dt+ pg_catalog.pg_t*

To view core settings of a PostgreSQL server, you should use

SHOW ALL;

To view value of a specific setting like effective_cache_size

SHOW effective_cache_size;

To reload the configuration changes

SELECT pg_reload_conf();

To create a new LOGIN user

CREATE ROLE shekhar LOGIN PASSWORD 'p@ssw0rd';

To create a new database

CREATE DATABASE myappdb;

Schemas are logical compartments in a database. You can divide your database into small logical schemas.

CREATE SCHEMA etcs;

To view all the available extensions

SELECT * from pg_available_extensions;

To view details about an extension

\dx+ plpgsql;

To install an extension and view its details

postgres=# CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION
postgres=# \dx+ fuzzystrmatch;
                    Objects in extension "fuzzystrmatch"
                             Object Description
----------------------------------------------------------------------------
 function difference(text,text)
 function dmetaphone(text)
 function dmetaphone_alt(text)
 function levenshtein(text,text)
 function levenshtein(text,text,integer,integer,integer)
 function levenshtein_less_equal(text,text,integer)
 function levenshtein_less_equal(text,text,integer,integer,integer,integer)
 function metaphone(text,integer)
 function soundex(text)
 function text_soundex(text)
(10 rows)

To view the active running processes

postgres=# SELECT pid, usename from pg_stat_activity;
 pid  | usename
------+----------
 1982 | postgres
 2742 | postgres
 2236 | postgres
 2723 | shekhar
 2744 | postgres
(5 rows)

To kill a connection

postgres=# SELECT pg_terminate_backend(2723);
 pg_terminate_backend
----------------------
 t
(1 row)

To kill all the connections of the user shekhar

SELECT pg_terminate_backend(pid) FROM pg_stat_activity where usename = 'shekhar';

To run a script file, run the following command.

psql -f <path_to_file>

You can also run commands with psql non-interactively

psql -d mydb -c "CREATE SCHEMA test;"

The \set command can be used to create user-defined shortcuts, as shown below.

\set connections_check 'SELECT pid, usename from pg_stat_activity;'
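
You can then execute the shortcut by prefixing its name with a colon:

:connections_check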

The .psqlrc file can be used to define configurations for a session. You can use the PSQLRC environment variable to control the location of the startup file.

\pset null 'Null'
\encoding latin1
\set PROMPT1 '%n@%M:%>%x %/#'
\set PROMPT2
\timing on
\pset pager always

To toggle the display of query execution time

\timing

You can call OS shell commands from within psql as shown below.

admin2pnvtk8:myapp# \! env | grep POSTGRES

To view the uptime of your PostgreSQL database, truncated to minutes, run the following query.

myapp=# select date_trunc('minute',current_timestamp - pg_postmaster_start_time()) as "postgresql_uptime";
 postgresql_uptime 
-------------------
 00:12:00
(1 row)