Reducing size of Docling Pytorch Docker image

For the last couple of days, I’ve been working on optimizing the Docker image size of a PDF processing microservice. The service uses Docling, an open-source library developed by IBM Research that internally uses PyTorch. Docling can extract text from PDFs and various other document types. Here’s a simplified version of our FastAPI microservice that wraps Docling’s functionality.

import os
import shutil
from pathlib import Path
from docling.document_converter import DocumentConverter
from fastapi import FastAPI, UploadFile

app = FastAPI()
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)
converter = DocumentConverter()

@app.post("/")
async def root(file: UploadFile):
    file_location = os.path.join(UPLOAD_DIR, file.filename)
    with open(file_location, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = converter.convert(Path(file_location))
    md = result.document.export_to_markdown()
    return {"filename": file.filename, "text": md}

The microservice workflow is straightforward:

  • Files are uploaded to the uploads directory
  • Docling converter processes the uploaded file and converts it to markdown
  • The markdown content is returned in the response

Here are the dependencies listed in requirements.txt:

fastapi==0.115.8
uvicorn==0.34.0
python-multipart==0.0.20
docling==2.18.0

You can test the service using this cURL command:

curl --request POST \
  --url http://localhost:8000/ \
  --header 'content-type: multipart/form-data' \
  --form file=@/Users/shekhargulati/Downloads/example.pdf

On the first request, Docling downloads the required model from HuggingFace and stores it locally. On my Intel Mac machine, the initial request for a 4-page PDF took 137 seconds, while subsequent requests took less than 5 seconds. For production environments, using a GPU-enabled machine is recommended for better performance.

The Docker Image Size Problem

Initially, building the Docker image with this basic Dockerfile resulted in a massive 9.74GB image:

FROM python:3.12-slim
RUN apt-get update \
    && apt-get install -y
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

docling-blog  v1  51d223c334ea   22 minutes ago   9.74GB

The large size is because PyTorch’s default pip installation includes CUDA packages and other GPU-related dependencies, which aren’t necessary for CPU-only deployments.

The Solution

To optimize the image size, modify the pip installation command to download only CPU-related packages using PyTorch’s CPU-specific package index. Here’s the optimized Dockerfile:

FROM python:3.12-slim
RUN apt-get update \
    && apt-get install -y \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Building with this optimized Dockerfile reduces the image size significantly:

docling-blog v2 ac40f5cd0a01   4 hours ago     1.74GB

The key changes that enabled this optimization:

  1. Added --no-cache-dir to prevent pip from caching downloaded packages
  2. Used --extra-index-url https://download.pytorch.org/whl/cpu to specifically download CPU-only PyTorch packages
  3. Added rm -rf /var/lib/apt/lists/* to clean up apt cache

This optimization reduces the Docker image size by approximately 82%, making it more practical for deployment and distribution.
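
To confirm that the CPU-only wheels actually ended up in the image, a quick sanity check (a minimal sketch; run it with python inside the container) is to inspect the installed PyTorch build:

import torch

# CPU-only wheels typically carry a "+cpu" local version suffix, e.g. "2.x.x+cpu"
print(torch.__version__)
# Should print False, since no CUDA runtime is bundled in the CPU-only build
print(torch.cuda.is_available())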

PostgreSQL Enum Types with SQLModel and Alembic

While working on a product that uses FastAPI, SQLModel, Alembic, and PostgreSQL, I encountered a situation where I needed to add an enum column to an existing table. Since it took me some time to figure out the correct approach, I decided to document the process to help others who might face similar challenges.

Let’s start with a basic scenario. Assume you have a data model called Task as shown below:

import uuid
from datetime import datetime, timezone
from typing import Optional
from sqlmodel import SQLModel, Field

class Task(SQLModel, table=True):
    __tablename__ = "tasks"
    id: uuid.UUID = Field(default_factory=uuid.uuid4, primary_key=True)
    title: str
    description: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

Using Alembic, you can generate the initial migration script with these commands:

alembic revision --autogenerate -m "Created task table"
alembic upgrade head

Now, let’s say you want to add a status field that should be an enum with two values – OPEN and CLOSED. First, define the enum class:

import enum
class TaskStatus(str, enum.Enum):
    OPEN = "open"
    CLOSED = "closed"

Then, add the status field to the Task class:

class Task(SQLModel, table=True):
    __tablename__ = "tasks"
    id: uuid.UUID = Field(default_factory=uuid.uuid4, primary_key=True)
    title: str
    description: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    status: Optional[TaskStatus] = None

If you run the Alembic migration commands at this point, it will define the status column as text. However, if you want to create a proper PostgreSQL enum type instead of storing the data as text, you’ll need to follow these additional steps:

  1. Install the alembic-postgresql-enum library:

pip install alembic-postgresql-enum

or, if you’re using Poetry:

poetry add alembic-postgresql-enum

  2. Add the library import to your Alembic env.py file:

import alembic_postgresql_enum

  3. Modify the status field declaration in your Task class to explicitly use the enum type:

from sqlmodel import SQLModel, Field, Enum, Column


class Task(SQLModel, table=True):
    __tablename__ = "tasks"
    id: uuid.UUID = Field(default_factory=uuid.uuid4, primary_key=True)
    title: str
    description: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    status: Optional[TaskStatus] = Field(default=None, sa_column=Column(Enum(TaskStatus)))

Now you can run the Alembic commands to create a new PostgreSQL type for TaskStatus and use it for the column type:

alembic revision --autogenerate -m "Added status column in tasks table"
alembic upgrade head
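
For reference, the generated migration should now contain something along these lines. This is a rough illustrative sketch, not the exact output of autogenerate; revision identifiers and details will differ:

import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql

def upgrade() -> None:
    # Create the PostgreSQL enum type first, then add the column that uses it
    sa.Enum("OPEN", "CLOSED", name="taskstatus").create(op.get_bind())
    op.add_column(
        "tasks",
        sa.Column(
            "status",
            postgresql.ENUM("OPEN", "CLOSED", name="taskstatus", create_type=False),
            nullable=True,
        ),
    )

def downgrade() -> None:
    op.drop_column("tasks", "status")
    sa.Enum(name="taskstatus").drop(op.get_bind())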

To verify that the enum type was created correctly, connect to your PostgreSQL instance using psql and run the \dT+ command:

taskdb=# \dT+
                                          List of data types
 Schema |    Name     | Internal name | Size | Elements  |  Owner   | Access privileges | Description
--------+-------------+---------------+------+-----------+----------+-------------------+-------------
 public | taskstatus  | taskstatus    | 4    | OPEN     +| postgres |                   |
        |             |               |      | CLOSED    |          |                   |

This approach ensures that your enum values are properly constrained at the database level, providing better data integrity than using a simple text field.
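
As a quick end-to-end check, here is a minimal usage sketch. It assumes the models defined above, that the migrations have been applied, and a hypothetical local connection string that you should adjust to your environment:

from sqlmodel import Session, create_engine

# Hypothetical connection string; replace with your own database URL
engine = create_engine("postgresql://postgres@localhost/taskdb")

with Session(engine) as session:
    task = Task(title="Write blog post", status=TaskStatus.OPEN)
    session.add(task)
    session.commit()
    session.refresh(task)
    # status is persisted using the taskstatus enum type, so invalid values are rejected
    print(task.id, task.status)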

Using a Tor proxy to bypass IP restrictions

Yesterday I was working on a personal project where I needed to download the transcript of a YouTube video. I used the unofficial youtube-transcript-api Python library to fetch the transcript for any YouTube video. It worked fine locally, but as soon as I deployed my app on a cloud VM it started giving errors.

The code to list transcripts for a video is shown below.

from youtube_transcript_api import YouTubeTranscriptApi

video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id)

The code throws the following error on the cloud VM.

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=w8rYQ40C9xo! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

As described in the GitHub issue https://github.com/jdepoix/youtube-transcript-api/issues/303, the only way to overcome this issue is to use a proxy. Most residential proxies cost somewhere between $5 and $10 per GB of data transfer. They are priced per GB, so the cost will vary depending on your usage. One of the commenters suggested using a Tor proxy. Tor provides anonymity by routing your internet traffic through a series of volunteer-operated servers, which can help you avoid being blocked by services like YouTube.

There is an open source project https://github.com/dperson/torproxy that allows you to run a Tor proxy in a Docker container.

You can run it as follows:

docker run -p 8118:8118 -p 9050:9050 -d dperson/torproxy

This will run the SOCKS proxy on port 9050.

You can then change your code to use the proxy.

from youtube_transcript_api import YouTubeTranscriptApi

proxies = {
    'http': "socks5://127.0.0.1:9050",
    'https': "socks5://127.0.0.1:9050",
}

video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)
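
One assumption worth calling out: youtube-transcript-api makes its HTTP calls through the requests library, and SOCKS proxy support in requests usually requires the PySocks extra to be installed separately:

pip install "requests[socks]"

If you also want DNS resolution to go through Tor, you can use the socks5h:// scheme instead of socks5://.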

TIL #3: Using xbar to build ArgoCD deployment monitor

This week I was going over the latest edition (Volume 27) of the Thoughtworks Technology Radar and found the addition of xbar in their Tools section. xbar lets you put the output from any script/program in your macOS menu bar. I first wrote about it in October 2021, when I showed how you can use it to show WordPress page views analytics in your macOS menu bar.

From the Thoughtworks Radar entry on xbar

On remote teams, we sorely lack having a dedicated build monitor in the room; unfortunately, newer continuous integration (CI) tools lack support for the old CCTray format. The result is that broken builds aren’t always picked up as quickly as we’d like. To solve this problem, many of our teams have started using xbar for build monitoring. With xbar, one can execute a script to poll build status, displaying it on the menu bar.

Continue reading “TIL #3: Using xbar to build ArgoCD deployment monitor”

TIL #2: Kafka poison pill message and CommitFailedException

Yesterday I was working with a team that was facing an issue with their Kafka-related code. The Kafka consumer was failing with the following exception:

[] ERROR [2022-11-22 08:32:52,853] com.abc.MyKakfaConsumer: Exception while processing events
! java.lang.NullPointerException: Cannot invoke "org.apache.kafka.common.header.Header.value()" because the return value of "org.apache.kafka.common.header.Headers.lastHeader(String)" is null
! at com.abc.MyKakfaConsumer.run(MyKakfaConsumer.java:83)
! at java.base/java.lang.Thread.run(Thread.java:833)

The consumer code looked as shown below.

Continue reading “TIL #2: Kafka poison pill message and CommitFailedException”

TIL #1 : Using liquibase validCheckSum to solve a deployment issue

Taking inspiration from Simon Willison[1], I will start posting TIL (Today I Learned) posts on something new or interesting I learn while building software. Today, I was working with a colleague on a problem where our database migration script was working in the dev environment but failing in the staging environment. The customer platform team has mandated that we can’t access the database directly, and the only way to fix things is via Liquibase scripts. In this post I will not discuss whether I agree with them or not. That’s a rant for another day.

In our staging environment we were getting the following exception:

changelog-main.xml::2::author1 was: 8:a67c8ccae76190339d0fe7211ffa8d98 but is now: 8:d76c3d3a528a73a083836cb6fd6e5654
changelog-main.xml::3::author2 was: 8:0f90fb0771052231b1ax45c1x8bdffax but is now: 8:a25ca918b2eb27a2b453d6e3bf56ff77

If you have worked with Liquibase or any other similar database migration tool, you will understand that this happens when a developer has changed an existing changeset. This causes the checksum of the existing changeset to change. So, the next time Liquibase tries to apply the changeset, it fails with a validation error.

A developer should never change an existing changeset, and this is one thing we make sure we don’t miss during our code reviews.

Continue reading “TIL #1 : Using liquibase validCheckSum to solve a deployment issue”

Sending Outlook Calendar Invite using Java Mail API

Today I had to implement functionality related to sending an Outlook calendar invite. The app backend was written in Java. As it turns out, the simple task of sending a calendar invite is much more complicated than I expected. You have to construct the message with all the right parameters set, or else your calendar invite will not behave as you expect. I wasted an hour or two figuring out why the RSVP buttons were not appearing in the invite. As it turned out, a missing parameter caused the issue. Also, I wanted the calendar invite to work with both Outlook and Google Calendar.

Continue reading “Sending Outlook Calendar Invite using Java Mail API”

Building a Sentiment Analysis Python Microservice with Flair and Flask

Flair delivers state-of-the-art performance in solving NLP problems such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and text classification. It’s an NLP framework built on top of PyTorch.
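
To give a flavour of the API before the full walkthrough, here is a minimal sentiment-classification sketch with Flair. It assumes the pre-trained 'en-sentiment' model, which is downloaded on first use:

from flair.data import Sentence
from flair.models import TextClassifier

# Load the pre-trained English sentiment model
classifier = TextClassifier.load("en-sentiment")

sentence = Sentence("Flair makes sentiment analysis straightforward.")
classifier.predict(sentence)

# Each label carries a value (POSITIVE/NEGATIVE) and a confidence score
print(sentence.labels)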

In this post, I will cover how to build a sentiment analysis microservice with the Flair and Flask frameworks.

Continue reading “Building a Sentiment Analysis Python Microservice with Flair and Flask”

How Docker uses cgroups to set resource limits?

Today, I was interested to know how Docker uses cgroups to set resource limits. In this short post, I will share with you what I learnt.

I will assume that you have a machine on which Docker is installed.

Docker allows you to pass resource limits using command-line options. Let’s assume that you want to limit the I/O read rate to 1 MB per second for a container. You can start a new container with the device-read-bps option as shown below.

$ docker run -it --device-read-bps /dev/sda:1mb centos

In the above command, we are instantiating a new centos container. We specified the device-read-bps option to limit the read rate to 1 MB per second for the /dev/sda device.

Continue reading “How Docker uses cgroups to set resource limits?”

TIL #8: Installing and Managing PostgreSQL with Brew

This post documents how to install and manage PostgreSQL using the brew package manager on macOS.

To install PostgreSQL using brew, run the following command.

$ brew install postgresql

If you wish to start and stop PostgreSQL, you can do that using the following brew commands.

$ brew services stop postgresql
$ brew services start postgresql

Create a database for your username.

createdb `whoami`

To fix the error role "postgres" does not exist, run the following command.

createuser -s postgres

Now, you will be able to log into psql using both your username and postgres.

psql
psql -U postgres

To get information about your PostgreSQL installation, you can run the following brew command.

brew info postgresql

There are times when you might forget whether you installed PostgreSQL using brew. You can check whether PostgreSQL was installed by brew by running the following command.

brew list | grep postgres

The brew package manager installs PostgreSQL under the following location.

/usr/local/Cellar/postgresql/

The configuration files are inside the following directory.

/usr/local/var/postgres