When to use JSON data type in database schema design?


Today, I was doing solution design for a system when I started to think about when we should use the JSON data type for columns. Coming up with the right schema design takes multiple iterations; I consider it more of an art than a science. All the mainstream RDBMS have JSON support.

  • Postgres has had the JSON data type since version 9.2, released in September 2012 (the binary JSONB type followed in 9.4)
  • MySQL has had the JSON data type since version 5.7.8, released in August 2015
  • SQL Server has had built-in JSON support (functions over NVARCHAR rather than a dedicated type) since SQL Server 2016, released in June 2016
  • Oracle has had JSON support since version 19c, released in February 2019 (a native JSON data type arrived in 21c)

They all support efficient insertion and querying of JSON data. I will not compare their JSON capabilities today. Instead, I want to answer a design question: when should a column have a JSON data type?

I use the JSON data type in the design situations listed below. There could be other places where JSON is a suitable data type as well.

  1. Dump request data that will be processed later
  2. Support extra fields
  3. One-to-many relationship where the many side does not have its own identity
  4. Key-value use case
  5. Simpler EAV design

Let’s talk about each of these use cases in more detail.

# 1. Dump request data that will be processed later

In this design situation you take the request from the client and store that in the database. Then, there are worker threads or processes that process the request asynchronously. Your main goal is to quickly and efficiently persist the request/event and unblock the client. 

There are a couple of recent examples where I used this pattern.

The first example was the schema design of an error-collecting system (similar to Rollbar). The system exposed an API that clients (web, mobile, etc.) called when they encountered an error. The API saved the request data and handed processing off to the worker pool.

The `error_events` table structure was as shown below.

CREATE TABLE error_events (
    id BIGSERIAL PRIMARY KEY,
    body JSONB NOT NULL,
    timestamp TIMESTAMP WITHOUT TIME ZONE DEFAULT(NOW() AT TIME ZONE 'utc') NOT NULL,
    client_details JSONB,
    level VARCHAR
);

The second example is a system that I worked on this week. I was designing a centralized notification service where again I had to accept the incoming request and then send the notification asynchronously. 

The `notification_events` table structure was as shown below.

CREATE TABLE notification_events (
    id BIGSERIAL PRIMARY KEY,
    customer_data JSONB NOT NULL,
    data JSONB NOT NULL,
    timestamp TIMESTAMP WITHOUT TIME ZONE DEFAULT(NOW() AT TIME ZONE 'utc') NOT NULL
);
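Once events are persisted, a worker can read fields straight out of the JSONB payload without any up-front schema for the body. A minimal sketch, assuming the payload carries `channel` and `email` keys (both illustrative, not part of the original design):

```sql
-- Pull a batch of events and project fields from the JSONB payload.
-- The ->> operator extracts a key as text.
SELECT id,
       data->>'channel'        AS channel,
       customer_data->>'email' AS recipient_email
FROM notification_events
ORDER BY timestamp
LIMIT 100;
```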

I favor JSON data type in this design situation:

  • For better performance. Most APIs I work with accept JSON, so I don’t want to pay the cost of serialization/deserialization. 
  • To build a less restrictive ingestion API. JSON columns give clients the flexibility to include extra details without schema changes.

If required, we can use JSON Schema to validate the structure of fields with the JSON data type. So far, I have never needed this.
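Full JSON Schema validation in Postgres needs an extension, but a lightweight guard is possible with a plain CHECK constraint. A minimal sketch, assuming the error body must at least carry a `message` key (the key name is illustrative):

```sql
-- Reject events whose JSONB body lacks a top-level "message" key.
-- The ? operator tests key existence on JSONB.
ALTER TABLE error_events
    ADD CONSTRAINT body_has_message CHECK (body ? 'message');
```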

# 2. Support extra fields

Most developers favor NoSQL databases for their schemaless nature. You can add new fields to a document without worrying about painful schema migration issues. An RDBMS requires us to define the schema up front; otherwise, inserts will fail.

I prefer RDBMS over schemaless NoSQL databases. But there are times when you need flexibility. Before JSON support, developers achieved this by adding spare columns to their tables, as in the `customers` table below.

CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR,
    age NUMERIC,
    timestamp TIMESTAMP WITHOUT TIME ZONE DEFAULT(NOW() AT TIME ZONE 'utc') NOT NULL,
    field_1 VARCHAR,
    field_2 VARCHAR,
    field_3 VARCHAR,
    field_4 VARCHAR
);

A better solution is to use the JSON data type.

CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR,
    age NUMERIC,
    timestamp TIMESTAMP WITHOUT TIME ZONE DEFAULT(NOW() AT TIME ZONE 'utc') NOT NULL,
    extra_attributes JSONB
);
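With this layout, adding an attribute is a data change rather than a schema migration, and Postgres can index the JSONB column for containment queries. A sketch (the `loyalty_tier` key is illustrative):

```sql
-- Set a new attribute on one row; no ALTER TABLE needed.
UPDATE customers
SET extra_attributes =
        jsonb_set(coalesce(extra_attributes, '{}'), '{loyalty_tier}', '"gold"')
WHERE id = 42;

-- A GIN index makes containment queries on the column efficient.
CREATE INDEX idx_customers_extra ON customers USING GIN (extra_attributes);

SELECT id, name
FROM customers
WHERE extra_attributes @> '{"loyalty_tier": "gold"}';
```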

I recently had to use this pattern while designing a schema for a CMS (Content Management System). At a high level there were three main entities – Page, Section, Element. A page has multiple sections, and a section is composed of multiple elements.

The `sections` table, which maps to the `Section` entity, may need to support extra attributes in the future – the order of sections, the look and feel of a section, etc. To support this schema evolution, I used the JSON data type.

CREATE TABLE sections (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR,
    sub_title VARCHAR,
    timestamp TIMESTAMP WITHOUT TIME ZONE DEFAULT(NOW() AT TIME ZONE 'utc') NOT NULL,
    extra_attributes JSONB
);

# 3. One-to-many relationship where the many side does not have its own identity

This is the canonical use case of document databases. From an earlier post I wrote on data models:

> Document databases have better performance compared to RDBMS when you can model your OneToOne or OneToMany relationships as embedded nested documents. The better performance is achieved due to data storage locality and avoidance of multi-table join. You will typically create embedded nested documents when those documents don’t have an identity of their own.

In the CMS example discussed above, I decided to model elements as a nested JSON document, mainly for performance reasons. I am fine with data duplication because, for the CMS use case, performance is more important. There are always trade-offs to be made.

CREATE TABLE sections (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR,
    sub_title VARCHAR,
    timestamp TIMESTAMP WITHOUT TIME ZONE DEFAULT(NOW() AT TIME ZONE 'utc') NOT NULL,
    extra_attributes JSONB,
    elements JSONB
);
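A section row then carries its elements as a JSONB array, so rendering a section is a single-row read; when individual elements are needed, they can still be unnested in SQL. A sketch (the element fields are illustrative):

```sql
-- Elements embedded in the section row; no join needed to render a section.
INSERT INTO sections (title, elements)
VALUES ('Hero',
        '[{"type": "heading", "text": "Welcome"},
          {"type": "image",   "src": "/hero.png"}]');

-- Unnest the embedded elements when per-element rows are needed.
SELECT s.id, e->>'type' AS element_type
FROM sections s
CROSS JOIN jsonb_array_elements(s.elements) AS e;
```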

# 4. Key-value use case

Typically, in a key-value database, the value is an opaque blob that you can’t query. You can use an RDBMS as a key-value database; people have done so for ages. Storing the value as JSON gives you more flexibility and querying capability.

I used this while designing a session store. I always queried by `session_id`, so I stored all the session attributes as a single JSON document.

create table sessions (
	session_id char(36) not null,
	session_attributes JSONB,
	constraint session_pk primary key (session_id)
);

The attributes I stored in session included principal name, expiry time, customer reference number, etc.
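A session lookup is then a primary-key read followed by plain JSONB field access. A sketch (the attribute keys and the session id are illustrative):

```sql
SELECT session_attributes->>'principal_name'             AS principal,
       (session_attributes->>'expiry_time')::timestamptz AS expires_at
FROM sessions
WHERE session_id = '6e1b2c3a-0000-0000-0000-000000000000';
```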

# 5. Simpler EAV design

The last use case where I recently used the JSON data type was redesigning a table that had an EAV design.

In the EAV data model only non-empty values are stored in the database, where each attribute-value (or key-value) pair describes one attribute of a given entity. EAV tables are often characterised as ‘long and skinny’ where ‘long’ refers to multiple rows describing an entity, and ‘skinny’ refers to the small number of columns used.

I had a simple schema with channels and providers. Channels are things like email, SMS, and push notifications. Providers are things like AWS SES, Sendgrid, Twilio, and Firebase. Each provider has configuration – API endpoints, credentials, certificates, etc. One way to design such a configuration table is EAV: a `providers` table and a `provider_configs` table with a one-to-many relationship between them. The `provider_configs` table would have the following structure.

create table provider_configs (
    id BIGSERIAL PRIMARY KEY,
    attribute VARCHAR,
    value VARCHAR,
    type VARCHAR,
    provider_id BIGINT,
    constraint provider_fk foreign key (provider_id) references providers(id) on delete cascade
);

As you might know querying EAV tables is painful and leads to complex queries.

We can simplify the `provider_configs` table by storing the config as JSON. The relationship between `providers` and `provider_configs` then becomes one-to-one. If we want, we can also get rid of the `provider_configs` table entirely and store the configuration in the `providers` table, avoiding the join altogether.

create table provider_configs (
    id BIGSERIAL PRIMARY KEY,
    config JSONB,
    provider_id BIGINT,
    constraint provider_fk foreign key (provider_id) references providers(id) on delete cascade
);
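The difference shows up when several attributes are needed at once: the EAV design requires a pivot (one conditional aggregate or self-join per attribute), while the JSONB design is a plain column access. A sketch (the attribute names are illustrative):

```sql
-- EAV: pivoting two attributes needs two conditional aggregates.
SELECT provider_id,
       max(value) FILTER (WHERE attribute = 'api_endpoint') AS api_endpoint,
       max(value) FILTER (WHERE attribute = 'api_key')      AS api_key
FROM provider_configs
GROUP BY provider_id;

-- JSONB: the same read is a single row with direct key access.
SELECT provider_id,
       config->>'api_endpoint' AS api_endpoint,
       config->>'api_key'      AS api_key
FROM provider_configs;
```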

# Conclusion

In this post, I covered the use cases where I prefer the JSON data type in schema design. The JSON data type can help you evolve your schema without pain, and it can also be more performant for certain use cases.

One thought on “When to use JSON data type in database schema design?”

  1. Hi!
    I am a long-time (Postgre)SQL user, and several years ago I started designing an extensible web editor.
    Let me give an example directly: a “blocks” table with columns “blocks.type” (string) and “blocks.data” (jsonb).
    Each type is bound to a JSON schema for blocks.data. There is a site-level schema that validates any block, based on its type. (There is also a relations table for n-n relations between blocks.)
    This approach is very useful for flexible data validation, while keeping the nice things of a relational DB like Postgres.
    Postgres also brings full-text search, GIN indexes, and many other things.
