One of the tasks I frequently use ChatGPT-like tools for is extracting markdown text from images. I enjoy watching conference videos on YouTube. Often, I find slides during these videos that I want to keep for future reference. To achieve this, I take screenshots and add them to my notebook. However, if I forget to add any textual comments with the screenshots, searching for them later becomes difficult. Additionally, there are times when I need to extract text in markdown format from the screenshots for future use.
Let’s look at an example screenshot that I took yesterday from a talk by an OpenAI engineer on fine-tuning.

Option 1. Using ChatGPT to extract a markdown table
I used a simple prompt, Convert the following to a markdown table, on https://chatgpt.com/. It gave me the following output.
| When to fine-tune | |
|---|---|
| Good for ✅ | Not good for ❌ |
| – Following a given format or tone for the output | – Teaching the model new knowledge |
| – Processing the input following specific, complex instructions | → Use RAG or custom models instead |
| – Improving latency | – Performing well at multiple, unrelated tasks |
| – Reducing token usage | → Do prompt-engineering or create multiple FT models instead |
| – Including up-to-date content in responses | |
| → Use RAG instead |
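As an aside of my own (not part of the original workflow), a quick way to spot malformed rows like the last one above is to count the cells in each row and flag any row whose count differs from the header:

```python
def row_cells(line):
    # Drop the outer pipes, then split on the remaining ones.
    return [c.strip() for c in line.strip().strip("|").split("|")]

def malformed_rows(md_table):
    # Collect table rows and compare each row's cell count to the header's.
    rows = [l for l in md_table.strip().splitlines() if l.strip().startswith("|")]
    width = len(row_cells(rows[0]))
    return [r for r in rows if len(row_cells(r)) != width]

table = """| When to fine-tune | |
|---|---|
| Good for ✅ | Not good for ❌ |
| → Use RAG instead |"""

print(malformed_rows(table))  # ['| → Use RAG instead |']
```

Note this naive splitter does not handle cells containing an escaped `\|`, which is fine for eyeballing LLM output.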
To keep the indentation right, we can improve our prompt to Convert the following to a markdown table. Make sure to keep indentation right. Do not include bulletpoint characters or any other redundant text. ChatGPT's output is shown below.
| When to fine-tune | |
|---|---|
| Good for ✅ | Not good for ❌ |
| Following a given format or tone for the output | Teaching the model new knowledge ➔ Use RAG or custom models instead |
| Processing the input following specific, complex instructions | Performing well at multiple, unrelated tasks ➔ Do prompt-engineering or create multiple FT models instead |
| Improving latency | Including up-to-date content in responses ➔ Use RAG instead |
| Reducing token usage |
I will give this 5/5. With almost zero effort, I was able to extract the markdown table from the image.
Option 2. Using the OpenAI API
ChatGPT is a great product. It is not simply a wrapper around OpenAI API calls; there is a lot of magic behind the scenes that we are unaware of. So I always prefer to reproduce my results with the OpenAI API.
Below is the Python code for extracting the markdown table from the image.
```python
from openai import OpenAI
import base64

client = OpenAI()

def encode_image(image_path):
    # Read the image and base64-encode it for the data URL below.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

img_path = "fine-tuning.png"

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Convert the following to a markdown table. Make sure to keep indentation right. Do not include bulletpoint characters or any other redundant text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(img_path)}", "detail": "high"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
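If you do this for many screenshots, the message construction can be factored into a small helper. The function below is my own sketch; it depends only on the standard library and reproduces the image-message structure used above.

```python
import base64

def vision_messages(image_bytes, prompt, mime="image/png", detail="high"):
    # Build the chat-completions message list for one text prompt plus
    # one image, embedded as a base64 data URL.
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url, "detail": detail}},
            ],
        }
    ]

msgs = vision_messages(b"hi", "Convert the following to a markdown table.")
print(msgs[0]["content"][1]["image_url"]["url"])  # data:image/png;base64,aGk=
```

With this in place, comparing models is just a different `model=` argument to `client.chat.completions.create`.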
Printing response.choices[0].message.content gives the same output as ChatGPT.
| Good for ✅ | Not good for ❌ |
|---|---|
| Following a given format or tone for the output | Teaching the model new knowledge ➔ Use RAG or custom models instead |
| Processing the input following specific, complex instructions | Performing well at multiple, unrelated tasks ➔ Do prompt-engineering or create multiple FT models instead |
| Improving latency | Including up-to-date content in responses ➔ Use RAG instead |
| Reducing token usage |
I also tried gpt-4o-mini, the smaller and cheaper version of gpt-4o. It also did a fairly good job.
| When to fine-tune | Good for ✅ | Not good for ❌ |
|---|---|---|
| Following a given format or tone for the output | Teaching the model new knowledge Use RAG or custom models instead | |
| Processing the input following specific, complex instructions | Performing well at multiple, unrelated tasks Do prompt-engineering or create multiple FT models instead | |
| Improving latency | Including up-to-date content in responses Use RAG instead | |
| Reducing token usage |
Option 3. Using Google’s Gemini
Below is the output of https://gemini.google.com/ for the same prompt: Convert the following to a markdown table. Make sure to keep indentation right. Do not include bulletpoint characters or any other redundant text.
| When to fine-tune | Good for | Not good for |
|---|---|---|
| Following a given format or tone for the output | Processing the input following specific, complex instructions | Teaching the model new knowledge |
| Improving latency | Reducing token usage | Performing well at multiple, unrelated tasks |
| Including up-to-date content in responses | Use RAG instead |
I will give Gemini 2/5 for this task.
Option 4. Using the open-source LLaVA model
Lastly, I tried the open-source LLaVA model as well.
I ran the 4-bit quantized LLaVA model on my M2 MacBook Pro using llamafile, an open-source project by Mozilla that lets you turn large language model (LLM) weights into executables. You can use it in three ways: through the packaged web UI in your browser at http://127.0.0.1:8080/, through the CLI, or via its OpenAI-compatible REST API.
Using the LLaVA model is as simple as running the following commands in your terminal (the -O flag keeps the ?download=true query string out of the saved filename):

```shell
wget -O llava-v1.5-7b-q4.llamafile "https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile?download=true"
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile
```
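Besides the web UI and CLI, the model can also be queried over the OpenAI-compatible REST API mentioned above. The sketch below is my own and rests on assumptions: that the endpoint lives at http://127.0.0.1:8080/v1/chat/completions and accepts OpenAI's image-message format. Run it only after the llamafile server above is up.

```python
import base64
import json
import urllib.request

def ask_llava(image_path, prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    # Assumes a llamafile server is already running locally and that its
    # OpenAI-compatible endpoint accepts base64 data-URL images.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    body = {
        "model": "LLaVA",  # a single-model server typically ignores this
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the local server to be running):
# print(ask_llava("fine-tuning.png", "Convert the following to a markdown table."))
```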
You can then use either the web interface or the CLI. The web interface runs at http://127.0.0.1:8080/. You can upload an image and prompt it with Convert the following to a markdown table. Below is the markdown output.
| When to Fine-Tune | Good for Fine-tuning | Not good for fine-tuning |
|---|---|---|
| Red text with a green arrow pointing upwards and the word “Good” written in it. | Following guidelines for optimal performance. | Changing the model’s new knowledge to improve its output. |
| Green text with an orange X through it. | Not good for fine-tuning. | Processing input data using a specific algorithm. |
The output is not great. I will give it 1/5 on this task.
Conclusion
OpenAI models shine at accurately extracting text from images, especially into specific formats like markdown tables. This makes them a powerful asset for anyone who needs to turn images into usable data. As AI continues to evolve, expect even more impressive text-extraction capabilities from LLMs.