TIL #1 : Using liquibase validCheckSum to solve a deployment issue


Taking inspiration from Simon Willison[1] I will start posting TIL (Today I learned) posts on something new/interesting I learn while building software. Today, I was working with a colleague on a problem where in our database migration script was working in the dev environment but failing in the staging environment. The customer platform team has mandated that we can’t access database directly and the only way to fix things is via liquibase scripts. In this post I will not discuss if I agree with them or not. That’s a rant for another day.

In our staging environment we were getting following exception

changelog-main.xml::2::author1 was: 8:a67c8ccae76190339d0fe7211ffa8d98 but is now: 8:d76c3d3a528a73a083836cb6fd6e5654
changelog-main.xml::3::author2 was: 8:0f90fb0771052231b1ax45c1x8bdffax but is now: 8:a25ca918b2eb27a2b453d6e3bf56ff77

If you have worked with Liquibase or any other similar database migration tool you will understand that this happens when developer has changed an existing changeset. This causes checksum to change for an existing changset. So, when next time liquibase tries to apply changset it gives validation error and fails.

Developer should never change an existing changeset and this is one thing we make sure we don’t miss during our code reviews.

Both our dev environment and our tests(that make use of liquibase migration scripts) were working so we were confused why this is happening in staging. We started provisioning staging environment a couple of weeks back only so ideally everything should work.

This was the time to look at the Git commit log for the changeset file to figure out when changes have happened. We had four changsets for in our database migration script and we saw that a developer modified the second changeset at a later date. Ideally, if the second changeset is changed after it is applied liquibase should have failed. But, in dev environment second changeset was never applied because it failed before applying it. We did some major refactoring of our code and one of the activities that we did was moving from xml based liquibase migration script to sql based migration script.

Our liquibase migration script look like as below on date d1.

<databaseChangeLog
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-4.7.xsd"
        xmlns:liquibasePercona="http://www.liquibase.org/xml/ns/dbchangelog-ext/liquibase-percona">

    <changeSet id="1" author="author1">
        <preConditions onFail="MARK_RAN">
            <not>
                <tableExists tableName="t1"/>
            </not>
        </preConditions>

        <sql>

            create table t1
            (
            id bigserial primary key,
            updated_by                       varchar   not null,
            updated_on                       varchar   not null
            );

            create table t2
            (
            id bigserial primary key,
            );
        </sql>
    </changeSet>

    <changeSet id="2" author="author1">
        <sql>
            create type status as enum ('ACTIVE', 'INACTIVE');
            alter table t1 add column status status default 'ACTIVE';
        </sql>
    </changeSet>
    <changeSet id="3" author="author2">
        <sql>
            drop table t2;
            ALTER TABLE t1 RENAME TO t11;
            ALTER TABLE t11 ADD created_by varchar(50), ADD created_on timestamp;
        </sql>
    </changeSet>
</databaseChangeLog>    

The second changset was failing in the dev environment because the table was never created using the changeset 1. The existing table had status enum so when changeset 2 ran it failed. We were using the earlier table we created using the XML based script.

The fix was simple for this. We just had to add a new condition. Since changeset 2 and 3 never ran we also took the liberty of improving our SQL scripts. We changed status enum to customer_status and drop table t2; to drop table IF EXISTS t2;. So, this is our changeset on date d2.

<changeSet id="2" author="author1">
    <preConditions onFail="MARK_RAN">
        <not>
            <columnExists tableName="t1" columnName="status" />
        </not>
    </preConditions>
    <sql>
        create type customer_status as enum ('ACTIVE', 'INACTIVE');
        alter table t1 add column status customer_status default 'ACTIVE';
    </sql>
</changeSet>
<changeSet id="3" author="author2">
        <sql>
            drop table IF EXISTS t2;
            ALTER TABLE t1 RENAME TO t11;
            ALTER TABLE t11 ADD created_by varchar(50), ADD created_on timestamp;
        </sql>
</changeSet>

We pushed the changes and our CD pipeline deployed these changes successfully. We were happy.

We completely missed that platform team is setting up staging environment in parallel. So, somewhere between dates d1 and d2 they deployed this service so the changeset on date d1 got applied. Since, it didn’t have an existing table t1 so all changetset got applied successfully. So, once we changed the migration script at d2 our deployment started to fail. We noticed this a week later when development team was made aware that staging is available.

So, now what to do. We can’t bring our scripts to d1 state because that will fail our dev environment and we can’t bring staging to d2 because d1 changes are already applied. Since we don’t have access to the database the only way is by finding some mechanism in liquibase to not run changesets if they are applied even when there is checksum mismatch.

After reading how Liquibase calculates checksum and reading documentation on different checksum related options we figured out a possible answer. The answer is using <validCheckSum>1:any</validCheckSum> [2]

Adds a checksum that is considered valid for this changeset, regardless of what is stored in the database. It is primarily used when you need to change a changeset and don’t want errors thrown on databases on which it has already been run (not a recommended procedure). Special value "1:any" will match to any checksum and will not execute the changeset on ANY change.

We changed our Liquibase changeset to use validCheckSum. This made both environments happy and we were able to resolve the problem.

<changeSet id="2" author="author1">
    <validCheckSum>1:any</validCheckSum>
    <preConditions onFail="MARK_RAN">
        <not>
            <columnExists tableName="t1" columnName="status" />
        </not>
    </preConditions>
    <sql>
        create type customer_status as enum ('ACTIVE', 'INACTIVE');
        alter table t1 add column status customer_status default 'ACTIVE';
    </sql>
</changeSet>
<changeSet id="3" author="author2">
    <validCheckSum>1:any</validCheckSum>
    <sql>
        drop table IF EXISTS t2;
        ALTER TABLE t1 RENAME TO t11;
        ALTER TABLE t11 ADD created_by varchar(50), ADD created_on timestamp;
    </sql>
</changeSet>

References

  1. https://simonwillison.net/2022/Nov/6/what-to-blog-about/
  2. https://docs.liquibase.com/concepts/changelogs/changelog-formats.html

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: