Issues with mmap

A couple of months back I watched a video by Andy Pavlo, Associate Professor of Databases Carnegie Mellon, where he made a point that databases should not use mmap. He went on to say that if there is only one thing you should get from his database course is to never use mmap when building and designing database management systems. I have not used mmap before so I was intrigued to understand it in more detail. I was aware that MongoDB used to use an mmap based storage engine. It allowed them to achieve faster time to market but later they had to replace it with a new storage engine wiredtiger because of the issues they faced with mmap. MongoDB is not the only database that uses mmap. There are many databases that use mmap. Some of the databases that use mmap are RavenDB, ElasticSearch, LevelDB, InfluxDB, LMDB, BoltDB, moss (key-value store from Couchbase), etc.

Given that so many databases use mmap I wanted to understand why Andy recommended us to not use mmap. I will list all of the reasons I could find in my research and from Andy’s video in this post. But, before we do that let’s first understand mmap.

Continue reading “Issues with mmap”

The 5 Minute Introduction to DuckDB: The SQLite for Analytics

Updated: 3rd September 2020

A couple of weeks back I learnt about DuckDB while going over DB Weekly newsletter. It immediately caught my attention as I was able to quickly understand why need for such a database exist. Most developers are used to working with an embedded file based relational database in their local development environment. Most popular choice among embeddable RDBMS is SQLite. Developers use embeddable databases because there is no set up required and they can get started quickly in a couple of minutes. This enables quick prototyping and developers can quickly iterate on business features.

DuckDB is similar to SQLite in the sense it is also designed to be used as an embeddable database. Developers can easily include it as a library in their code and start using it. Later in this post, I will cover how we can use DuckDB with Python.

Continue reading “The 5 Minute Introduction to DuckDB: The SQLite for Analytics”

How To Think About Different Database Data Models: Relational vs Document Data Models

This week I was talking to a developer about how to think about data models supported by different databases. One thing that I have learnt in my 15 years of building web applications is that data models play an important role in the success of any software application.

Data model provides an abstract model to organise elements of data and how these elements relate to each other. They describe the structure, manipulation and integrity aspects of the data stored in data management systems such as relational databases. For example, when you are modelling your problem domain in a relational database then you think in terms of real-world entities and how those entities are related with each other. We usually use Entity Relation (ER) diagrams to model tables in relational databases. In this post, we will focus on two popular data models — Relational and Document. We will discuss when you will use one over the other.

Data model influence two main attributes:

  1. Easy of use: A data model can make some operations easy to achieve and others difficult or impossible
  2. Performance: A data model can be suitable for faster reads but slower writes or vice versa

Continue reading “How To Think About Different Database Data Models: Relational vs Document Data Models”

MemSQL Introduction: A Hybrid transactional/analytical processing database


MemSQL is s fast, commercial, ANSI SQL compliant, highly scalable HTAP database. HTAP databases are those that support both OLTP and OLAP workloads. It supports ACID transactions just like a regular relational database .It also supports document and geospatial data types.

MemSQL is fast because it stores data in-memory. But, it does not mean it is not durable. It maintains a copy of data on disk as well. Transactions are committed to the transaction log on disk and later compressed into full-database snapshots. One of the main reason new databases are designed as in-memory first is because memory is getting cheaper every year. It is estimated memory is becoming cheaper 40% every year.

MemSQL has tuneable durability. You can make it fully durable or completely ephemeral. It can be sync or async.

MemSQL simplifies your architecture as you don’t have to write ETL jobs to move data from one data store to another data store. This is the biggest selling point of any HTAP database.

Continue reading “MemSQL Introduction: A Hybrid transactional/analytical processing database”

The Minimalistic Guide to ACID Transactions

Welcome to the third post of distributed system series. So far in this series, we have looked at service discovery and CAP theorem. Before we move along in our distributed system learning journey, I thought it will be useful to refresh our memory with understanding of ACID transactions. ACID transactions are at the heart of relational databases. The knowledge of ACID transactions is useful when building distributed applications.

Understanding ACID transactions

A transaction is a sequence of operations that form a single logical unit of work. These transactions are executed on a shared database system to perform a higher-level function. An example of higher-level function is transferring money from one account to another. Transactions represent a basic unit of change in the database. It either executed in its entirety or not at all.

ACID (Atomicity, Consistency, Isolation, and Durability) refers to a set of properties that a database transaction should guarantee even in the event of errors, power failure, etc. The canonical example of ACID transaction is transfer of funds from one bank account to another. In a single fund transferring transaction, you have to check the account balance, debit one account, and credit another transaction. ACID properties guarantee that either money transfer from one account to other occur correctly and permanently or in case of failure both accounts have the same initial state. It would be unacceptable if one account was debited but the other account was credited.

Database transactions are motivated by two independent requirements:

  1. Concurrent database access: Multiple clients can access the system at the same time. This is achieved by the Isolation property of ACID transaction.
  2. Resiliency to system failures: System remains in consistent state in case of a system failure. This is provided by Atomicity, Consistency, and Durability properties of ACID transaction.

Continue reading “The Minimalistic Guide to ACID Transactions”

PostgreSQL learning

To find out the location of PostgreSQL configuration files, just connect with your PostgreSQL database and run the following command.

SELECT name, setting FROM pg_settings WHERE category = 'File Locations';

To describe a table, you should use

\d+ table_name

To list all the databases,you should run following command.


To view all the tables in a schema

\dt+ pg_catalog.pg_t*

To view core settings of a PostgreSQL server, you should use


To view value of a specific setting like effective_cache_size

SHOW effective_cache_size;

To reload the configuration changes

SELECT pg_reload_conf();

To create a new LOGIN user

CREATE ROLE shekhar LOGIN PASSWORD 'p@ssw0rd';

To create a new database


Schema are logical compartments in a database. You can divide your database into small logical schemas.


To view all the available extensions

SELECT * from pg_available_extensions;

To view details about an extension
To view all the available extensions

\dx+ plpgsql;

To install an extension and view its details

postgres=# CREATE EXTENSION fuzzystrmatch;
postgres=# \dx+ fuzzystrmatch;
                    Objects in extension "fuzzystrmatch"
                             Object Description
 function difference(text,text)
 function dmetaphone(text)
 function dmetaphone_alt(text)
 function levenshtein(text,text)
 function levenshtein(text,text,integer,integer,integer)
 function levenshtein_less_equal(text,text,integer)
 function levenshtein_less_equal(text,text,integer,integer,integer,integer)
 function metaphone(text,integer)
 function soundex(text)
 function text_soundex(text)
(10 rows)

To view the active running processes

postgres=# SELECT pid, usename from pg_stat_activity;
 pid  | usename
 1982 | postgres
 2742 | postgres
 2236 | postgres
 2723 | shekhar
 2744 | postgres
(5 rows)

To kill a connection

postgres=# SELECT pg_terminate_backend(2723);
(1 row)

To kill all the connection by user shekhar

SELECT pg_terminate_backend(pid) FROM pg_stat_activity where usename = 'shekhar';

To run a script file, run the following command.

psql -f <path_to_filefile>

You can also run commands with psql non-interactively

psql -d mydb -c "CREATE SCHEMA test;"

The \set command can be used to create user defined shortcuts like as shown below.

\set connections_check 'SELECT pid, usename from pg_stat_activity;'

The .psqlrc file can be used to define configurations for a session. You can use PSQLRC environment variable to control the location of the startup file.

\pset null 'Null'
\encoding latin1
\set PROMPT1 '%n@%M:%>%x %/#'
\set PROMPT2
\timing on
\pset pager always

To turn on timing for the queries execution time


You can call OS shell commands from within psql as shown below.

admin2pnvtk8:myapp#\! env|grep POSTGRES

To view uptime of your PostgreSQL database in minutes, run the following query.

myapp=# select date_trunc('minute',current_timestamp - pg_postmaster_start_time()) as "postgresql_uptime";
(1 row)

Accessing PostgreSQL Server From Remote Machine

Today I was trying to remotely connect to PostgreSQL server and it took me some time to configure it. So, in this short blog I am sharing the procedure to make it work.

  1. Edit the pg_hba.conf file to enable client connections as shown below. The important thing is the use of to allow client access from any IP address. You can also define a subset of ip here.
    # TYPE  DATABASE        USER            ADDRESS                 METHOD
    # "local" is for Unix domain socket connections only
    local   all             all                                     trust
    # IPv4 local connections:
    host    all             all               trust
    host    all             all                   trust
    # IPv6 local connections:
    host    all             all             ::1/128                 trust
    # Allow replication connections from localhost, by a user with the
    # replication privilege.
    #local   replication     postgres                                md5
    host    replication     postgres               trust
    host    replication     postgres        ::1/128                 trust
  2. Edit the postgresql.conf file and change listen_addresses = ‘*’ . This means any client can access the server. By default value is local. You can also specify a comma separated list of ip address also.
  3. Restart the postgreSQL server either using postgres service or first killing the process and then starting using pg_ctl script.
  4. The last and most important thing is to edit the /etc/sysconfig/iptables file and add entry to allow access to 5432 port. This can be done as shown below. Please add the line shown below in iptables file.
    -A INPUT -m state –state NEW -m tcp -p tcp –dport 5432 -j ACCEPT
  5. Then restart the iptables service using service iptables restart
  6. Finally test it with psql client by executing following command ./psql -U postgres -h -p 5432