One of the tasks I frequently use ChatGPT-like tools for is extracting markdown text from images. I enjoy watching conference videos on YouTube. Often, I find slides during these videos that I want to keep for future reference. To achieve this, I take screenshots and add them to my notebook. However, if I forget to add any textual comments with the screenshots, searching for them later becomes difficult. Additionally, there are times when I need to extract text in markdown format from the screenshots for future use.
Let’s look at an example screenshot that I took yesterday from a talk by an OpenAI engineer on fine-tuning.

Option 1. Using ChatGPT to extract a markdown table
I used a simple prompt, Convert the following to a markdown table, on https://chatgpt.com/. It gave me the following output.
| When to fine-tune | |
|---|---|
| Good for ✅ | Not good for ❌ |
| – Following a given format or tone for the output | – Teaching the model new knowledge |
| – Processing the input following specific, complex instructions | → Use RAG or custom models instead |
| – Improving latency | – Performing well at multiple, unrelated tasks |
| – Reducing token usage | → Do prompt-engineering or create multiple FT models instead |
| – Including up-to-date content in responses | |
| → Use RAG instead |
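As an aside of my own (not part of the original workflow), a quick way to spot malformed rows like the last one above is to count the cells in each row and flag any row whose count differs from the header:

```python
def row_cells(line):
    # Drop the outer pipes, then split on the remaining ones.
    return [c.strip() for c in line.strip().strip("|").split("|")]

def malformed_rows(md_table):
    # Collect table rows and compare each row's cell count to the header's.
    rows = [l for l in md_table.strip().splitlines() if l.strip().startswith("|")]
    width = len(row_cells(rows[0]))
    return [r for r in rows if len(row_cells(r)) != width]

table = """| When to fine-tune | |
|---|---|
| Good for ✅ | Not good for ❌ |
| → Use RAG instead |"""

print(malformed_rows(table))  # ['| → Use RAG instead |']
```

Note this naive splitter does not handle cells containing an escaped `\|`, which is fine for eyeballing LLM output.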
To keep the indentation right, we can improve our prompt to Convert the following to a markdown table. Make sure to keep indentation right. Do not include bulletpoint characters or any other redundant text. ChatGPT's output is shown below.
| When to fine-tune | |
|---|---|
| Good for ✅ | Not good for ❌ |
| Following a given format or tone for the output | Teaching the model new knowledge ➔ Use RAG or custom models instead |
| Processing the input following specific, complex instructions | Performing well at multiple, unrelated tasks ➔ Do prompt-engineering or create multiple FT models instead |
| Improving latency | Including up-to-date content in responses ➔ Use RAG instead |
| Reducing token usage |
I will give this 5/5. With almost zero effort, I was able to extract the markdown table from the image.
Option 2. Using the OpenAI API
ChatGPT is a great product. It is not simply a wrapper around OpenAI API calls; there is a lot of magic behind the scenes that we are unaware of. So I always prefer to reproduce my results with the OpenAI API.
Below is the Python code for extracting the markdown table from the image.
```python
from openai import OpenAI
import base64

client = OpenAI()

def encode_image(image_path):
    # Read the image and base64-encode it for the data URL below.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

img_path = "fine-tuning.png"

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Convert the following to a markdown table. Make sure to keep indentation right. Do not include bulletpoint characters or any other redundant text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(img_path)}", "detail": "high"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
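If you do this for many screenshots, the message construction can be factored into a small helper. The function below is my own sketch; it depends only on the standard library and reproduces the image-message structure used above.

```python
import base64

def vision_messages(image_bytes, prompt, mime="image/png", detail="high"):
    # Build the chat-completions message list for one text prompt plus
    # one image, embedded as a base64 data URL.
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url, "detail": detail}},
            ],
        }
    ]

msgs = vision_messages(b"hi", "Convert the following to a markdown table.")
print(msgs[0]["content"][1]["image_url"]["url"])  # data:image/png;base64,aGk=
```

With this in place, comparing models is just a different `model=` argument to `client.chat.completions.create`.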
Printing response.choices[0].message.content gives the same output as ChatGPT.
| Good for ✅ | Not good for ❌ |
|---|---|
| Following a given format or tone for the output | Teaching the model new knowledge ➔ Use RAG or custom models instead |
| Processing the input following specific, complex instructions | Performing well at multiple, unrelated tasks ➔ Do prompt-engineering or create multiple FT models instead |
| Improving latency | Including up-to-date content in responses ➔ Use RAG instead |
| Reducing token usage |
I also tried gpt-4o-mini, the smaller and cheaper version of gpt-4o. It also did a fairly good job.
| When to fine-tune | Good for ✅ | Not good for ❌ |
|---|---|---|
| Following a given format or tone for the output | Teaching the model new knowledge Use RAG or custom models instead | |
| Processing the input following specific, complex instructions | Performing well at multiple, unrelated tasks Do prompt-engineering or create multiple FT models instead | |
| Improving latency | Including up-to-date content in responses Use RAG instead | |
| Reducing token usage |
Option 3. Using Google’s Gemini
Below is the output of https://gemini.google.com/ for the same prompt: Convert the following to a markdown table. Make sure to keep indentation right. Do not include bulletpoint characters or any other redundant text.
| When to fine-tune | Good for | Not good for |
|---|---|---|
| Following a given format or tone for the output | Processing the input following specific, complex instructions | Teaching the model new knowledge |
| Improving latency | Reducing token usage | Performing well at multiple, unrelated tasks |
| Including up-to-date content in responses | Use RAG instead |
I will give Gemini 2/5 for this task.
Option 4. Using the open-source LLaVA model
Lastly, I tried the open-source LLaVA model as well.
I ran the 4-bit quantized LLaVA model on my M2 MacBook Pro using llamafile, an open-source project by Mozilla that lets you turn large language model (LLM) weights into executables. You can use it in three ways: through the packaged web UI in your browser at http://127.0.0.1:8080/, through the CLI, or via its OpenAI-compatible REST API.
Using the LLaVA model is as simple as running the following commands in your terminal (the -O flag keeps the ?download=true query string out of the saved filename):

```shell
wget -O llava-v1.5-7b-q4.llamafile "https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile?download=true"
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile
```
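Besides the web UI and CLI, the model can also be queried over the OpenAI-compatible REST API mentioned above. The sketch below is my own and rests on assumptions: that the endpoint lives at http://127.0.0.1:8080/v1/chat/completions and accepts OpenAI's image-message format. Run it only after the llamafile server above is up.

```python
import base64
import json
import urllib.request

def ask_llava(image_path, prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    # Assumes a llamafile server is already running locally and that its
    # OpenAI-compatible endpoint accepts base64 data-URL images.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    body = {
        "model": "LLaVA",  # a single-model server typically ignores this
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the local server to be running):
# print(ask_llava("fine-tuning.png", "Convert the following to a markdown table."))
```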
You can then use either the web interface or the CLI. The web interface runs at http://127.0.0.1:8080/. You can upload an image and prompt it with Convert the following to a markdown table. Below is the markdown output.
| When to Fine-Tune | Good for Fine-tuning | Not good for fine-tuning |
|---|---|---|
| Red text with a green arrow pointing upwards and the word “Good” written in it. | Following guidelines for optimal performance. | Changing the model’s new knowledge to improve its output. |
| Green text with an orange X through it. | Not good for fine-tuning. | Processing input data using a specific algorithm. |
The output is not great. I will give it 1/5 on this task.
Conclusion
OpenAI models shine at accurately extracting text from images, especially into specific formats like markdown tables. This makes them a powerful asset for anyone who needs to turn images into usable data. As AI continues to evolve, expect even more impressive text-extraction capabilities from LLMs.