How Developers Utilize DuckDB: Use Cases and Suitability

In the ever-evolving landscape of data management, DuckDB has carved out a niche for itself as a powerful analytical database designed for efficient in-process data analysis. It is particularly well-suited for developers looking for a lightweight, easy-to-use solution for data processing. In this blog, we will explore how developers use DuckDB, delve into common use cases, and discuss why these scenarios are particularly suitable for this innovative database.

What is DuckDB?

Before diving into its applications, let’s briefly introduce DuckDB. Often described as the “SQLite for analytics,” DuckDB provides a robust SQL interface that allows users to perform complex analytical tasks efficiently. Its architecture is designed for embedded usage, meaning it can be easily integrated into applications without the overhead of a separate server. This makes it particularly attractive for data scientists and developers looking for an efficient way to analyze data locally.

Advantages of Columnar Storage

DuckDB utilizes a columnar storage format, which is a significant advantage for analytical workloads. In a columnar database, data is stored by columns rather than rows. This design allows for highly efficient data compression and significantly faster read speeds for analytical queries, as only the relevant columns need to be read from disk. This contrasts with traditional row-based storage, where entire rows must be read, even if only a few columns are required. Columnar storage also enhances memory efficiency, making DuckDB capable of handling larger-than-memory datasets with ease.
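
To make this concrete, here is a minimal sketch of the kind of analytical query DuckDB is built for. The file name and column names are hypothetical; DuckDB can query the file in place, and for a Parquet file only the columns the query references need to be read:

-- Hypothetical events.parquet queried in place; only the user_id and
-- event_type columns are read from the file.
select user_id, count(*) as page_views
from read_parquet('events.parquet')
where event_type = 'page_view'
group by user_id
order by page_views desc
limit 10;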

Continue reading “How Developers Utilize DuckDB: Use Cases and Suitability”

How does FerretDB work?

In recent weeks, I have come across FerretDB on multiple occasions, and I thought why not take a closer look at it. I took a particular interest in FerretDB because it is a MongoDB implementation built on top of my favourite database, PostgreSQL.

While I do have high-level thoughts on how I would go about building MongoDB on top of Postgres, I wanted to confirm, validate, and learn how the FerretDB team has been doing it.

FerretDB (previously MangoDB) was founded to become the de facto open-source substitute for MongoDB. FerretDB is an open-source proxy that converts MongoDB 5.0+ wire protocol queries to SQL, using PostgreSQL as the database engine.

At a high level, FerretDB is a proxy that implements the MongoDB wire protocol that MongoDB clients speak. After a client establishes a connection, it translates every query the client sends into SQL queries that Postgres understands.
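
To make the idea concrete, here is a rough, hypothetical illustration of such a translation. This is not FerretDB's actual generated SQL; the table and column names are assumptions based only on the general idea of keeping documents in a jsonb column:

-- MongoDB query sent by the client:
--   db.users.find({ "age": { "$gt": 30 } })
-- A conceptual SQL equivalent over a jsonb-backed collection table:
select doc
from users
where (doc->>'age')::numeric > 30;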

In the recent release (0.5.0) of FerretDB, it is also possible to use it as a library rather than as a proxy. Using FerretDB as a library removes one network hop, which leads to better performance. This is only possible for applications written in Go, since FerretDB itself is implemented in Go.


Continue reading “How does FerretDB work?”

My Notes on GitLab Postgres Schema Design

I spent some time going over the Postgres schema of GitLab. GitLab is an alternative to GitHub. Since it is an open-source DevOps platform, you can self-host it.

My motivation for understanding the schema of a big project like GitLab was to compare it against the schemas I am designing and to learn some best practices from their schema definition. I can say with confidence that I learnt a lot.

I am aware that best practices are sometimes context-dependent, so you should not apply them blindly.

The GitLab schema file structure.sql [1] is more than 34,000 lines of code. GitLab is a monolithic Ruby on Rails application. The popular way to manage schema migrations in Rails is the schema.rb file. The reason the GitLab team decided to adopt structure.sql instead is mentioned in one of the issues [2] in their issue tracker.

Now what keeps us from using those features is the use of schema.rb. This can only contain standard migrations (using the Rails DSL), which aim to keep the schema file database system neutral and abstract away from specific SQL. This in turn means we are not able to use extended PostgreSQL features that are reflected in schema. Some examples include triggers, postgres partitioning, materialized views and many other great features.

In order to leverage those features, we should consider using a plain SQL schema file (structure.sql) instead of a Ruby/Rails standard schema.rb.

The change would entail switching config.active_record.schema_format = :sql and regenerate the schema in SQL. Possibly, some build steps would have to be adjusted, too.
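
As a hypothetical illustration (not taken from GitLab's actual schema), this is the kind of PostgreSQL-specific DDL that a plain structure.sql can hold but a database-neutral schema.rb cannot express:

-- Declarative partitioning, one of the Postgres-only features mentioned above.
create table audit_events (
    id bigint not null,
    created_at timestamptz not null
) partition by range (created_at);

create table audit_events_2023 partition of audit_events
    for values from ('2023-01-01') to ('2024-01-01');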

Now, let’s go over the things I learnt from the GitLab Postgres schema.



Continue reading “My Notes on GitLab Postgres Schema Design”

Choosing Primary Key Type in Postgres

In relational database design one of the key decisions is choosing the right primary key type for your tables. In this post I am talking about surrogate or synthetic primary keys. They are called surrogate or synthetic because these keys are not derived from application data. In my experience, very few teams give proper thought to deciding primary key types. For each table they go with the default type used in their organization, which means all tables end up with an int/bigint, uuid, or varchar key. In this post I am not giving any recommendation; I am only discussing how different key types affect insertion speed and data size, and how they compare against each other. You will have to do your own analysis to choose the right data type based on your use case. Different tables have different needs, so you should use your judgement accordingly. There is no one-size-fits-all solution.

In this post, I am using Postgres 13.3. You can install it using your operating system package manager. I am running it in a Docker container.

To follow along you can create a sample database and run SQL queries in that.

create database choosing_pk_type;

Connect to this database

\c choosing_pk_type
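
As a minimal sketch of the kind of tables being compared (the table definitions below are mine, for illustration, not a recommendation), you could create one table per key type and run your inserts against each:

-- Surrogate key as a database-generated bigint.
create table users_bigint (
    id bigserial primary key,
    name text not null
);

-- Surrogate key as a random UUID; gen_random_uuid() is built into Postgres 13.
create table users_uuid (
    id uuid primary key default gen_random_uuid(),
    name text not null
);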

Primary Key Types Comparison Summary


Continue reading “Choosing Primary Key Type in Postgres”

Correctly using Postgres as queue

I am building a central notification dispatch system that is responsible for sending different kinds of notifications to end customers. It relies on multiple third-party APIs for sending the actual email/SMS notifications. The high-level architecture of the system is shown below.

NotificationSender exposes both REST and messaging interfaces for accepting consumer requests. Consumers here refer to the services that need to send notifications. This is what the notification system does:

  • It accepts requests from upstream services and stores them in the Postgres database after validation. The notification event is written to the Postgres database in the ENQUEUED state. It returns HTTP 202 Accepted to the upstream service if the request is valid; otherwise it returns HTTP 400 Bad Request.
  • At a predefined frequency, a poller that is part of the NotificationDispatcher polls the Postgres database for new notification events, i.e. events in the ENQUEUED state. For now, it respects insertion-time order (a sketch of such a poll query appears after this list).
  • If enqueued events are found, it processes them and sends the actual notifications using the downstream SMS and Email services.
  • After processing the events, it changes their state to PROCESSED.
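
Here is a minimal sketch of the kind of poll query involved; the table, column, and status names are assumptions, and the full post discusses what doing this correctly entails. The FOR UPDATE SKIP LOCKED clause lets several pollers claim different batches without blocking each other or picking up the same event twice:

-- Atomically claim up to 100 enqueued events in insertion order, skipping
-- rows another poller has already locked, and mark them as being processed.
update notification_events
set status = 'PROCESSING'
where id in (
    select id
    from notification_events
    where status = 'ENQUEUED'
    order by created_at
    limit 100
    for update skip locked
)
returning id, payload;
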
Continue reading “Correctly using Postgres as queue”

Issues with mmap

A couple of months back I watched a video by Andy Pavlo, Associate Professor of Databases at Carnegie Mellon, where he made the point that databases should not use mmap. He went on to say that if there is only one thing you take away from his database course, it should be to never use mmap when building and designing database management systems. I had not used mmap before, so I was intrigued to understand it in more detail. I was aware that MongoDB used to use an mmap-based storage engine. It allowed them to achieve a faster time to market, but they later had to replace it with a new storage engine, WiredTiger, because of the issues they faced with mmap. MongoDB is not the only database that has used mmap. Many others use it as well, including RavenDB, Elasticsearch, LevelDB, InfluxDB, LMDB, BoltDB, and moss (a key-value store from Couchbase).

Given that so many databases use mmap, I wanted to understand why Andy recommends against it. In this post I will list all of the reasons I could find in my research and in Andy’s video. But before we do that, let’s first understand mmap.

Continue reading “Issues with mmap”

The 5 Minute Introduction to DuckDB: The SQLite for Analytics

Updated: 3rd September 2020

A couple of weeks back I learnt about DuckDB while going over the DB Weekly newsletter. It immediately caught my attention as I was able to quickly understand why the need for such a database exists. Most developers are used to working with an embedded, file-based relational database in their local development environment. The most popular choice among embeddable RDBMSes is SQLite. Developers use embeddable databases because there is no setup required and they can get started in a couple of minutes. This enables quick prototyping, and developers can quickly iterate on business features.

DuckDB is similar to SQLite in the sense that it is also designed to be used as an embedded database. Developers can easily include it as a library in their code and start using it. Later in this post, I will cover how we can use DuckDB with Python.

Continue reading “The 5 Minute Introduction to DuckDB: The SQLite for Analytics”

How To Think About Different Database Data Models: Relational vs Document Data Models

This week I was talking to a developer about how to think about data models supported by different databases. One thing that I have learnt in my 15 years of building web applications is that data models play an important role in the success of any software application.

A data model provides an abstract way to organise elements of data and describes how these elements relate to each other. Data models describe the structure, manipulation, and integrity aspects of the data stored in data management systems such as relational databases. For example, when you are modelling your problem domain in a relational database, you think in terms of real-world entities and how those entities are related to each other. We usually use Entity Relationship (ER) diagrams to model tables in relational databases. In this post, we will focus on two popular data models: Relational and Document. We will discuss when you would use one over the other.

A data model influences two main attributes:

  1. Ease of use: A data model can make some operations easy to achieve and others difficult or impossible
  2. Performance: A data model can be suitable for faster reads but slower writes, or vice versa
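
As a small, hypothetical illustration of the difference, the same "user with addresses" data can be modelled relationally as two tables linked by a foreign key, or as a single self-contained document (here shown stored in a jsonb column):

-- Relational model: entities live in separate tables related by a foreign key.
create table users (
    id bigint primary key,
    name text not null
);

create table addresses (
    id bigint primary key,
    user_id bigint not null references users (id),
    city text not null
);

-- Document model: the same data kept together as one self-contained document.
create table user_documents (
    id bigint primary key,
    doc jsonb not null
    -- e.g. {"name": "Alice", "addresses": [{"city": "Pune"}]}
);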

Continue reading “How To Think About Different Database Data Models: Relational vs Document Data Models”

MemSQL Introduction: A Hybrid transactional/analytical processing database


MemSQL is a fast, commercial, ANSI SQL compliant, highly scalable HTAP database. HTAP databases are those that support both OLTP and OLAP workloads. It supports ACID transactions just like a regular relational database. It also supports document and geospatial data types.

MemSQL is fast because it stores data in memory. But that does not mean it is not durable. It maintains a copy of the data on disk as well. Transactions are committed to a transaction log on disk and later compressed into full-database snapshots. One of the main reasons new databases are designed as in-memory first is that memory keeps getting cheaper; it is estimated that memory becomes roughly 40% cheaper every year.

MemSQL has tuneable durability. You can make it fully durable or completely ephemeral, and durability can be synchronous or asynchronous.

MemSQL simplifies your architecture as you don’t have to write ETL jobs to move data from one data store to another. This is the biggest selling point of any HTAP database.

Continue reading “MemSQL Introduction: A Hybrid transactional/analytical processing database”

The Minimalistic Guide to ACID Transactions

Welcome to the third post of the distributed systems series. So far in this series, we have looked at service discovery and the CAP theorem. Before we move along in our distributed systems learning journey, I thought it would be useful to refresh our memory of ACID transactions. ACID transactions are at the heart of relational databases. Knowledge of ACID transactions is also useful when building distributed applications.

Understanding ACID transactions

A transaction is a sequence of operations that form a single logical unit of work. Transactions are executed on a shared database system to perform a higher-level function. An example of a higher-level function is transferring money from one account to another. A transaction represents a basic unit of change in the database: it either executes in its entirety or not at all.

ACID (Atomicity, Consistency, Isolation, and Durability) refers to a set of properties that a database transaction should guarantee even in the event of errors, power failure, etc. The canonical example of an ACID transaction is the transfer of funds from one bank account to another. In a single fund-transfer transaction, you have to check the account balance, debit one account, and credit another account. ACID properties guarantee that either the money transfer from one account to the other occurs correctly and permanently, or, in case of failure, both accounts retain their initial state. It would be unacceptable if one account was debited but the other account was not credited.
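
A minimal sketch of that canonical example in SQL (the table and account identifiers are hypothetical): either both updates take effect, or neither does.

begin;
-- Debit one account and credit the other. If anything fails before commit,
-- the whole transfer is rolled back and both balances keep their initial state.
update accounts set balance = balance - 100 where id = 'A';
update accounts set balance = balance + 100 where id = 'B';
commit;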

Database transactions are motivated by two independent requirements:

  1. Concurrent database access: Multiple clients can access the system at the same time. This is achieved by the Isolation property of ACID transactions.
  2. Resiliency to system failures: The system remains in a consistent state in case of a system failure. This is provided by the Atomicity, Consistency, and Durability properties of ACID transactions.

Continue reading “The Minimalistic Guide to ACID Transactions”