Making sense of screenshots with CLIP model embeddings

Today I was reading Chapter 9, “Multimodal Large Language Models”, of the Hands-On Large Language Models book and thought of applying it to a problem I face occasionally. The chapter covers CLIP models and how you can use them to embed both text and images in the same vector space.

Like most normal humans, I take a lot of screenshots, and if I don’t categorize them when I take them, finding them later takes a lot of manual effort. So, I decided to build a quick semantic search over them using the llm utility.

This blog post uses the llm CLI tool. You can read installation instructions here.

It is dead simple to generate embeddings for images using llm. You first need to install the llm-clip plugin and then embed the images in a collection.

llm install llm-clip
llm embed-multi screenshots --files ~/screenshots '*.png' --binary -m clip

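In the command above, we asked the llm utility to embed all the .png images in the ~/screenshots directory into a collection named screenshots using the clip model. The --binary flag tells llm to treat the files as binary data rather than text. The CLIP model is loaded via Sentence Transformers.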

My screenshot directory has 100 images. It took close to 40 seconds on my Apple M2 Max machine to generate the embeddings.

These embeddings are saved in the SQLite database that llm uses. You can check the path of your SQLite database by running the following command.

llm collections path

The path on my machine is

/Users/shekhar.gulati/Library/Application Support/io.datasette.llm/embeddings.db

You can open the SQLite database by running the following command.

sqlite3 "$(llm collections path)"

If you list the tables, you will see two of them:

collections
embeddings

The collections table has one row per collection. A collection groups together a set of stored embeddings created using the same model, each with a unique ID within that collection.

The embeddings table stores the actual embedding vectors.
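To get a feel for what sits in that table, here is a minimal Python sketch. It assumes the vectors are stored as blobs of little-endian 32-bit floats, which is how I understand llm serializes them. To stay self-contained it uses an in-memory database with a simplified, hypothetical three-column version of the embeddings table instead of the real embeddings.db, and the row it inserts is made up.

```python
import sqlite3
import struct


def encode(values):
    """Pack a list of floats into little-endian 32-bit floats."""
    return struct.pack("<" + "f" * len(values), *values)


def decode(blob):
    """Unpack a blob of little-endian 32-bit floats into a tuple."""
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)


# In-memory stand-in for embeddings.db; the schema is a simplified assumption.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE embeddings (collection_id INTEGER, id TEXT, embedding BLOB)")
db.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?)",
    (1, "shot-1.png", encode([0.5, -1.0, 0.25])),  # made-up toy vector
)

# Read the row back and turn the blob into floats again.
row_id, blob = db.execute("SELECT id, embedding FROM embeddings").fetchone()
vector = decode(blob)
print(row_id, len(vector), vector)
```

A real CLIP vector has hundreds of dimensions rather than three, but the decoding step would look the same if the storage assumption holds.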

Before we build the search, let’s use clustering to build some intuition about what these embeddings capture. Let’s first install the llm-cluster plugin.

llm install llm-cluster

Then we can group the screenshot embeddings into 10 clusters using the command shown below.

llm cluster screenshots 10
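If you are curious what clustering does with the vectors, the sketch below is a hand-rolled k-means on toy 2-D points. It is not the plugin's code (as far as I can tell, llm-cluster delegates to scikit-learn), and the points and the kmeans helper are made up for illustration. The idea is the same though: vectors that sit close together end up with the same label.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: returns a cluster index for each point."""
    # Deterministic initialisation: use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


# Toy 2-D "embeddings": two tight groups, interleaved.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels = kmeans(points, k=2)
print(labels)
```

Real embeddings have far more dimensions, but the assignment and update steps work the same way on them.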

I was pleased to see that it worked well. Yesterday, I took multiple screenshots of memes of the Turkish Olympian Yusuf Dikec, and they were all placed in a single group. It also grouped screenshots with graphs and charts together, and it grouped architecture diagrams, sequence diagrams, and block diagrams together. I am impressed.

Below is a photo collage of the Turkish Olympian cluster. All of these screenshots were correctly placed in a single cluster.

llm makes it dead simple to add semantic search over embeddings. Let’s search for “shooter”.

llm similar screenshots -c 'shooter'

The top 6 results are the same images that appear in the collage above.

You can limit the number of results with the -n option.

llm similar screenshots -n 3 -c 'shooter'
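Under the hood, this kind of search boils down to embedding the query text with the same model and ranking the stored vectors by cosine similarity. Here is a rough Python sketch of that ranking step. The three-dimensional vectors, the file names, and the top_matches helper are all made up for illustration; real CLIP vectors come from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, stored, n=3):
    """Return the ids of the n stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:n]]


# Made-up toy vectors standing in for stored image embeddings.
stored = {
    "shooter-meme.png": [0.9, 0.1, 0.0],
    "bar-chart.png": [0.0, 1.0, 0.1],
    "sequence-diagram.png": [0.1, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of the text "shooter"
print(top_matches(query, stored, n=2))
```

Because text and images share one vector space, the query vector can be compared directly against image vectors, which is what makes a text search over screenshots possible.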

We can create the collage by piping the output to the ImageMagick montage CLI utility.
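One way to glue the two tools together is a small Python helper. This is a sketch under a few assumptions: that llm similar prints one JSON object per line with an id field holding the file path relative to the embedded directory, and that montage is called in its basic form of input files followed by an output file. The montage_command helper, the sample lines, the directory, and the collage.png name are all hypothetical.

```python
import json
import shlex


def montage_command(similar_output, image_dir, out_file="collage.png"):
    """Build a montage command line from JSON-lines search output."""
    # Each non-empty line is assumed to be a JSON object with an "id" key.
    ids = [json.loads(line)["id"] for line in similar_output.splitlines() if line.strip()]
    paths = [f"{image_dir}/{item_id}" for item_id in ids]
    # shlex.join quotes arguments that contain spaces or shell metacharacters.
    return shlex.join(["montage", *paths, out_file])


# Made-up sample of what the search output is assumed to look like.
sample = """{"id": "a.png", "score": 0.31, "content": null, "metadata": null}
{"id": "b c.png", "score": 0.29, "content": null, "metadata": null}"""
print(montage_command(sample, "/tmp/screenshots"))
```

The helper only builds the command string. You would still run it yourself in a shell, after checking that the paths look right.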
