search – Shekhar Gulati

I am using an open-source library called Docling to extract text from PDF documents. It was developed by IBM Research, and it works surprisingly well on my PDF documents.

from pathlib import Path
from docling.document_converter import DocumentConverter

# Path to the PDF to convert
source = "document.pdf"
converter = DocumentConverter()
# Parse the PDF into a Docling document
result = converter.convert(source)
# Write the parsed document out as Markdown
result.document.save_as_markdown(filename=Path("document.md"))

The code above generated a good-looking Markdown document. It cleanly extracted tables from my PDF. I am still benchmarking it with multiple PDFs, but it has been a good first experience with Docling.
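Benchmarking against several PDFs means running the same conversion in a loop. Here is a minimal sketch of how that could look, assuming the PDFs sit in one folder. The helper names `convert_many` and `make_docling_to_markdown` are my own and not part of Docling, and I use `export_to_markdown()` instead of `save_as_markdown` so that the loop writes the files itself.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def make_docling_to_markdown() -> Callable[[Path], str]:
    """Build a PDF-to-Markdown function backed by a single Docling converter.

    Docling is imported lazily so the rest of this sketch works without it.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # reuse one converter for every file

    def to_markdown(pdf_path: Path) -> str:
        return converter.convert(pdf_path).document.export_to_markdown()

    return to_markdown


def convert_many(
    pdf_paths: Iterable[Path],
    out_dir: Path,
    to_markdown: Callable[[Path], str],
) -> List[Path]:
    """Convert each PDF and write <stem>.md into out_dir (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in pdf_paths:
        pdf_path = Path(pdf_path)
        target = out_dir / (pdf_path.stem + ".md")
        target.write_text(to_markdown(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

To run it against a folder, something like `convert_many(sorted(Path("pdfs").glob("*.pdf")), Path("markdown"), make_docling_to_markdown())` should do, where the folder names are placeholders. Passing the conversion function in as an argument also makes it easy to swap in another extractor when comparing results.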

Its README.md mentions that it uses an OCR engine, but it does not specify which one. Before diving into the source code, I decided to see if any GenAI search solutions could find the answer for me.


Only ChatGPT Search Got It Right