Yes, I remember that I posted a couple of days back that I was not impressed by Grok 3. Until then I didn’t have access to the model; I had only seen the launch video by Elon and his team. I found the video so boring, scripted, and self-congratulatory that I decided to write off xAI’s Grok 3. Today, I saw on X that Grok 3 is freely available, so I decided to give it a try.
I am working on a problem where we have multiple Excel reports with different formats, and we want to read all the tables in these Excel files in a generic manner. A sheet can contain one or more tables. Before trying Grok 3, I had tried two other approaches, and they worked to some extent.
- In the first approach, we created a JSON configuration that defined how to read the Excel. This required end users to write these configuration files. I tried using a vision language model to help developers write the configuration. It worked for simpler Excel files, but as the complexity of Excel increased, it required a lot of work.
- The second approach I tried was to convert Excel to PDF and then use OCR to extract tables. This worked surprisingly well, but it felt wrong to rely on a lossy mechanism to get the data.
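To give a sense of what the first approach demanded of end users, here is a hypothetical sketch of such a configuration. The field names are illustrative, not our actual schema:

```python
import json

# Hypothetical sketch of the per-report configuration end users had to
# write in the first approach. All field names here are illustrative.
config = json.loads("""
{
  "sheets": [
    {
      "name": "Fig1.1",
      "tables": [
        {"header_row": 3, "data_start_row": 4, "data_end_row": 9}
      ]
    }
  ]
}
""")

# The parser would walk this structure and slice each table out of the sheet.
for sheet_cfg in config["sheets"]:
    for table_cfg in sheet_cfg["tables"]:
        print(sheet_cfg["name"], table_cfg["header_row"], table_cfg["data_end_row"])  # Fig1.1 3 9
```

Hand-writing one of these per report is exactly the maintenance burden that pushed us toward letting a model infer the structure instead.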
Today, I tried Grok 3 on this problem. I had already tried ChatGPT, Claude, and Deepseek, but they couldn’t provide an end-to-end solution.
Below is the prompt I gave Grok 3 with “Thinking” mode:

> Help me develop a generic Excel report parsing system. The goal is to transform diverse Excel reports into a dataframe that can be later transformed. In our use case we can have multiple tables per sheet. So, we need to handle this scenario.
> We can leverage LLM like OpenAI gpt-4o to infer table structure. Write the first version that makes use of both heuristics and language models.
> We will use Python. We will give an excel and it will return table dataframes.
> Also, add logs to the code for easier debugging.
It thought for a minute and then generated flawless code that uses openpyxl, pandas, and openai libraries.
```python
import openpyxl
import pandas as pd
from openai import OpenAI

# Initialize OpenAI client (replace with your API key)
client = OpenAI()


def get_cell_value(sheet, cell):
    """Get cell value, handling merged cells by returning the top-left value."""
    for merged_range in sheet.merged_cells:
        if cell.coordinate in merged_range:
            return sheet[merged_range.start_cell.coordinate].value
    return cell.value if cell.value is not None else ""


def is_empty_row(sheet, row):
    """Check if a row is empty (all cells are None or whitespace)."""
    return all(sheet.cell(row, col).value is None or str(sheet.cell(row, col).value).strip() == ''
               for col in range(1, sheet.max_column + 1))


def find_potential_start_rows(sheet):
    """Find rows that might start a table based on bold text or empty row transitions."""
    potential_starts = []
    for row in range(1, sheet.max_row + 1):
        first_cell = sheet.cell(row, 1)
        if first_cell.value is not None and first_cell.font.bold:
            potential_starts.append(row)
        elif row > 1 and is_empty_row(sheet, row - 1) and not is_empty_row(sheet, row):
            potential_starts.append(row)
    return potential_starts


def serialize_window(sheet, start_row, window_size=15):
    """Serialize a window of rows into a text description with formatting."""
    end_row = min(start_row + window_size - 1, sheet.max_row)
    serialized = ""
    for row in range(start_row, end_row + 1):
        row_data = []
        for col in range(1, sheet.max_column + 1):
            cell = sheet.cell(row, col)
            value = str(get_cell_value(sheet, cell))
            formatting = "bold" if cell.font.bold else ""
            if cell.coordinate in sheet.merged_cells:
                formatting += ", merged" if formatting else "merged"
            row_data.append(f"{value} ({formatting})" if formatting else value)
        serialized += f"Row {row}: [{' | '.join(row_data)}]\n"
    return serialized


def query_gpt4o(serialized_window, sheet_name, start_row):
    """Ask the model to identify the first table in the window."""
    prompt = (
        f"Analyze the Excel sheet '{sheet_name}' starting from row {start_row}. "
        f"Rows {start_row} to {start_row + 14}:\n\n{serialized_window}\n\n"
        f"Identify the first table. Specify the header row and data start row. "
        f"Respond with 'Header row: X, Data start row: Y' or 'No table found'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


def parse_model_response(response):
    """Parse the model’s response to get header and data start rows."""
    if "no table found" in response.lower():
        return None, None
    try:
        parts = response.split(",")
        header_row = int(parts[0].split(":")[1].strip())
        data_start_row = int(parts[1].split(":")[1].strip())
        return header_row, data_start_row
    except Exception:
        return None, None  # Invalid response, treat as no table


def find_data_end_row(sheet, start_row, potential_start_rows):
    """Find the last data row, stopping at an empty row or another table start."""
    for row in range(start_row, sheet.max_row + 1):
        if is_empty_row(sheet, row) or row in potential_start_rows:
            return row - 1
    return sheet.max_row


def extract_table(sheet, header_row, data_start_row, data_end_row):
    """Extract the table into a DataFrame, handling merged headers."""
    headers = [get_cell_value(sheet, sheet.cell(header_row, col))
               for col in range(1, sheet.max_column + 1)]
    data = [[sheet.cell(row, col).value for col in range(1, sheet.max_column + 1)]
            for row in range(data_start_row, data_end_row + 1)]
    return pd.DataFrame(data, columns=headers)


def extract_tables_from_sheet(sheet, sheet_name):
    """Extract all tables from a sheet."""
    potential_start_rows = find_potential_start_rows(sheet)
    tables = []
    current_row = 1
    while current_row <= sheet.max_row:
        next_start = next((r for r in potential_start_rows if r >= current_row), None)
        if not next_start:
            break
        serialized = serialize_window(sheet, next_start)
        response = query_gpt4o(serialized, sheet_name, next_start)
        header_row, data_start_row = parse_model_response(response)
        if header_row and data_start_row and data_start_row > header_row:
            data_end_row = find_data_end_row(sheet, data_start_row, potential_start_rows)
            if data_end_row >= data_start_row:
                df = extract_table(sheet, header_row, data_start_row, data_end_row)
                tables.append(df)
                print(f"Extracted table from rows {header_row}-{data_end_row} in '{sheet_name}'")
                current_row = data_end_row + 1
            else:
                current_row = next_start + 1
        else:
            current_row = next_start + 1
    return tables


def parse_excel_to_dataframes(excel_file):
    """Parse the Excel file and return all tables as DataFrames."""
    wb = openpyxl.load_workbook(excel_file)
    all_tables = []
    for sheet_name in wb.sheetnames:
        sheet = wb[sheet_name]
        if sheet.max_row == 0 or sheet.max_column == 0:
            print(f"Sheet '{sheet_name}' is empty. Skipping.")
            continue
        tables = extract_tables_from_sheet(sheet, sheet_name)
        for idx, df in enumerate(tables):
            all_tables.append((sheet_name, idx + 1, df))
    return all_tables


# Example usage
if __name__ == "__main__":
    excel_file = "IDR2024-Part1-AllCharts-Tables.xlsx"
    tables = parse_excel_to_dataframes(excel_file)
    print("Total tables:", len(tables))
    for sheet_name, table_idx, df in tables:
        # Drop columns where all values are NaN
        df = df.dropna(axis=1, how='all')
        print(f"Number of rows: {len(df)}")
        print(f"Sheet: {sheet_name}, Table {table_idx}")
        print("Columns:", df.columns)
        print(df.to_string(index=False))
```
The above code works as follows (adapted from the Grok 3 output):
- Serialization: the serialize_window function converts a window of the Excel sheet into a text format that describes each row, preserving formatting and structure.
- Model Analysis: the serialized text is sent to gpt-4o, which identifies tables by recognizing patterns like bold headers followed by data rows.
- Boundary Extraction: the model’s response is parsed to get precise row numbers for headers and data ranges.
- Data Extraction: openpyxl extracts the raw data within the identified boundaries, and pandas structures it into DataFrames.
- Multiple Tables: the process naturally supports multiple tables per sheet, as the model can detect distinct table blocks in the serialized text.
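One fragile spot is the boundary-extraction step: parse_model_response assumes the model replies in exactly the `Header row: X, Data start row: Y` format. A regex-based variant (my sketch, not part of Grok 3's output) tolerates extra prose around the numbers:

```python
import re

# A more forgiving replacement for parse_model_response (my sketch, not
# Grok 3's output): pull the two row numbers out with regexes instead of
# relying on an exact "Header row: X, Data start row: Y" reply.
def parse_model_response(response: str):
    if "no table found" in response.lower():
        return None, None
    header = re.search(r"header\s+row\s*:?\s*(\d+)", response, re.IGNORECASE)
    data = re.search(r"data\s+start\s+row\s*:?\s*(\d+)", response, re.IGNORECASE)
    if header and data:
        return int(header.group(1)), int(data.group(1))
    return None, None  # Unparseable reply: treat as no table


print(parse_model_response("Header row: 3, Data start row: 4"))           # (3, 4)
print(parse_model_response("Sure! Header row: 12, Data start row: 13."))  # (12, 13)
print(parse_model_response("No table found in this window"))              # (None, None)
```

A structured-output or JSON-mode request would be even more robust, but the regex keeps the first version dependency-free.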
I tried the above code on multiple internal Excel reports, and it worked fine. For this blog, I tried an Excel report from the World Bank. You can access it here.
This Excel file has close to 30 sheets. The code extracted 28 tables. Below are some of the tables it extracted.
Below is a table from sheet Fig1.1:
```
Number of rows: 5
Sheet: Fig1.1, Table 1
Columns: Index(['', 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023], dtype='object')
                                       2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023
Bilateral creditors                0.366970 0.371380 0.377771 0.404280 0.443366 0.470214 0.477443 0.507959 0.503992 0.492445 0.498977
Multilateral creditors (incl. IMF) 0.631531 0.633970 0.653714 0.684524 0.747430 0.798379 0.853867 0.990499 1.195996 1.239521 1.323926
Bondholders                        1.017324 1.211877 1.237212 1.350969 1.641447 1.762545 1.976626 2.198213 2.298837 2.162042 2.140701
Banks and other private            1.835008 2.050797 2.101781 2.201981 2.243818 2.322858 2.397311 2.414909 2.479336 2.472614 2.532256
Short-term                         1.861783 2.022696 1.530982 1.483022 1.808832 2.027953 2.048435 2.061853 2.341120 2.264032 2.340647
```
Below is the table from sheet Fig1.3:
```
Number of rows: 4
Sheet: Fig1.3, Table 1
Columns: Index(['US$ (trillion)', 2013, 2014, 2015,
                2016, 2017, 2018, 2019,
                2020, 2021, 2022, 2023],
               dtype='object')
US$ (trillion)                         2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023
Bilateral creditors                0.334637 0.340524 0.357688 0.377822 0.421346 0.450154 0.459031 0.490972 0.488794 0.479388 0.486920
Multilateral creditors (incl. IMF) 0.585712 0.589597 0.610460 0.641698 0.703285 0.754219 0.808774 0.944425 1.108898 1.155497 1.240161
Bondholders                        0.664609 0.774435 0.816182 0.908023 1.071782 1.135259 1.218169 1.261615 1.280056 1.240152 1.275009
Other private creditors            0.194034 0.245576 0.260939 0.264953 0.265712 0.294386 0.298296 0.301451 0.312462 0.312019 0.313955
```
Below is the table from the last sheet Fig3.2:
```
Number of rows: 5
Sheet: Fig3.2, Table 1
Columns: Index(['', 'March ', 'April', 'June', 'Late and non-reporters'], dtype='object')
     March  April  June  Late and non-reporters
2019  38.0     51  74.0                    28.0
2020  47.0     68  87.0                    13.0
2021  45.0     66  88.0                    12.0
2022  48.0     74  90.0                     9.0
2023  51.0     80  95.0                     7.0
```
There are still improvements we can make to the table detection logic. Overall, it is a solid starting point, and one I don’t think most developers could come up with in under 30 minutes. One more thing I will add is that Grok 3 is really fast. Latency is a feature.
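For example, one improvement I would explore is a pure-heuristic fallback for when the model call fails or returns something unparseable: treat a row as a plausible header if it is mostly text and the row below it shifts toward numbers. A rough sketch of the idea, working on a plain list of rows rather than an openpyxl sheet:

```python
def looks_numeric(value) -> bool:
    """Return True if a cell value parses as a number."""
    try:
        float(str(value).replace(",", ""))
        return True
    except ValueError:
        return False


def guess_header_row(rows):
    """Guess the header row index: the first mostly-text row whose
    following row contains more numeric cells than it does."""
    for i in range(len(rows) - 1):
        current = [v for v in rows[i] if v not in (None, "")]
        following = [v for v in rows[i + 1] if v not in (None, "")]
        if not current or not following:
            continue
        numeric_now = sum(looks_numeric(v) for v in current)
        numeric_next = sum(looks_numeric(v) for v in following)
        # Mostly text, and the next row shifts toward numbers -> header.
        if (len(current) - numeric_now) / len(current) >= 0.5 and numeric_next > numeric_now:
            return i
    return None  # No convincing header row found


rows = [
    ["Creditor", "Region", "Amount (US$ trillion)"],
    ["Bondholders", "Global", 2.16],
    ["Banks and other private", "Global", 2.47],
]
print(guess_header_row(rows))  # 0
```

This would not replace the model (year columns like 2013 are themselves numeric, which trips simple ratios), but it is a cheap sanity check to cross-validate the model's answer against.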