Today I was reading Chapter 9, “Multimodal Large Language Models”, of the Hands-On Large Language Models book and thought of applying it to a problem I face occasionally. The chapter covers the CLIP model and how it can be used to embed both text and images in the same vector space.
Like most normal humans, I take a lot of screenshots, and if I don’t categorize them at the time, finding them later takes a lot of manual effort. So I decided to build a quick semantic search over them using the llm utility.
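As a rough sketch of what this could look like: the `llm` CLI supports embedding plugins, and the `llm-clip` plugin (an assumption here; the post itself does not name the plugin) can embed image files and text queries into the same CLIP vector space, which makes text-to-screenshot search a few commands. The collection name `screenshots` and the directory path are placeholders:

```
# Install the CLIP embedding plugin for llm (assumed setup)
llm install llm-clip

# Embed every PNG in the screenshots folder into a collection
# named "screenshots", storing raw binary content with -m clip
llm embed-multi screenshots --files ~/Screenshots '*.png' --binary -m clip

# Search the collection with a natural-language text query;
# CLIP maps the query into the same space as the images
llm similar screenshots -c 'error dialog about disk space'
```

Because CLIP embeds text and images into one shared space, the similarity search needs no captions or manual tags on the screenshots themselves.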