xAI Grok 3 is impressive


Yes, I remember posting a couple of days back that I was not impressed by Grok 3. At that point I didn’t have access to the model; I had only seen the launch video by Elon and his team. I found the video so boring, scripted, and self-congratulatory that I decided to write off xAI Grok 3. Today, I saw on X that Grok 3 is freely available, so I decided to give it a try.

I am working on a problem where we have multiple Excel reports in different formats, and we want to read all the tables in these files in a generic manner. A single sheet can contain one or more tables. I had tried two approaches before trying Grok 3, and they worked to some extent.

  • In the first approach, we created a JSON configuration that defined how to read the Excel file, which required end users to write these configuration files. I tried using a vision language model to help developers write the configuration. It worked for simpler Excel files, but as the files grew more complex, it required a lot of manual work (a hypothetical example of such a configuration is sketched after this list).
  • The second approach was to convert the Excel file to PDF and then use OCR to extract tables. This worked surprisingly well, but it felt wrong to rely on a lossy mechanism to get the data (see the pipeline sketch after this list).
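To make the first approach concrete, here is a hypothetical configuration of the kind I mean. The schema, field names, and values below are illustrative assumptions, not the actual format we used:

{
  "sheet": "Fig1.1",
  "tables": [
    {
      "name": "creditor_composition",
      "header_row": 3,
      "data_start_row": 4,
      "data_end_row": 9,
      "columns": "A:L"
    }
  ]
}

Every new report layout meant hand-writing another block like this, which is exactly the work that didn’t scale.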
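For the second approach, I won’t name the exact tools here, so treat this as a minimal sketch under assumptions: LibreOffice in headless mode for the Excel-to-PDF conversion, and camelot as a stand-in PDF table extractor.

import subprocess
import camelot  # assumption: any PDF table extractor could stand in here


def excel_to_pdf_tables(excel_file, outdir="out"):
    """Sketch of the Excel -> PDF -> table-extraction approach."""
    # LibreOffice headless conversion writes <name>.pdf into outdir
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf", excel_file, "--outdir", outdir],
        check=True,
    )
    name = excel_file.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    # Extract every table on every page as a pandas DataFrame
    tables = camelot.read_pdf(f"{outdir}/{name}.pdf", pages="all")
    return [t.df for t in tables]

It works, but anything the PDF renderer drops is lost for good, which is the lossiness that bothered me.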

Today, I tried Grok 3 on this problem. I had already tried ChatGPT, Claude, and Deepseek, but they couldn’t provide an end-to-end solution.

Below is the prompt I gave Grok 3 with “Thinking” mode enabled:

Help me develop a generic Excel report parsing system. The goal is to transform diverse Excel reports into a dataframe that can be transformed later. In our use case we can have multiple tables per sheet, so we need to handle this scenario.

We can leverage an LLM like OpenAI gpt-4o to infer table structure. Write the first version that makes use of both heuristics and language models.

We will use Python. We will give it an Excel file and it will return table dataframes.

Also, add logs to the code for easier debugging.

It thought for a minute and then generated flawless code that uses the openpyxl, pandas, and openai libraries.

import openpyxl
import pandas as pd
from openai import OpenAI

# Initialize OpenAI client (expects OPENAI_API_KEY in the environment)
client = OpenAI()


def get_cell_value(sheet, cell):
    """Get cell value, handling merged cells by returning the top-left value."""
    for merged_range in sheet.merged_cells:
        if cell.coordinate in merged_range:
            return sheet[merged_range.start_cell.coordinate].value
    return cell.value if cell.value is not None else ""


def is_empty_row(sheet, row):
    """Check if a row is empty (all cells are None or whitespace)."""
    return all(sheet.cell(row, col).value is None or str(sheet.cell(row, col).value).strip() == ''
               for col in range(1, sheet.max_column + 1))


def find_potential_start_rows(sheet):
    """Find rows that might start a table based on bold text or empty row transitions."""
    potential_starts = []
    for row in range(1, sheet.max_row + 1):
        first_cell = sheet.cell(row, 1)
        if first_cell.value is not None and first_cell.font.bold:
            potential_starts.append(row)
        elif row > 1 and is_empty_row(sheet, row - 1) and not is_empty_row(sheet, row):
            potential_starts.append(row)
    return potential_starts


def serialize_window(sheet, start_row, window_size=15):
    """Serialize a window of rows into a text description with formatting."""
    end_row = min(start_row + window_size - 1, sheet.max_row)
    serialized = ""
    for row in range(start_row, end_row + 1):
        row_data = []
        for col in range(1, sheet.max_column + 1):
            cell = sheet.cell(row, col)
            value = str(get_cell_value(sheet, cell))
            formatting = "bold" if cell.font.bold else ""
            if cell.coordinate in sheet.merged_cells:
                formatting += ", merged" if formatting else "merged"
            row_data.append(f"{value} ({formatting})" if formatting else value)
        serialized += f"Row {row}: [{' | '.join(row_data)}]\n"
    return serialized


def query_gpt4o(serialized_window, sheet_name, start_row):
    """Ask the model to identify the first table in the window."""
    prompt = (
        f"Analyze the Excel sheet '{sheet_name}' starting from row {start_row}. "
        f"Rows {start_row} to {start_row + 14}:\n\n{serialized_window}\n\n"
        f"Identify the first table. Specify the header row and data start row. "
        f"Respond with 'Header row: X, Data start row: Y' or 'No table found'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


def parse_model_response(response):
    """Parse the model’s response to get header and data start rows."""
    if "no table found" in response.lower():
        return None, None
    try:
        parts = response.split(",")
        header_row = int(parts[0].split(":")[1].strip())
        data_start_row = int(parts[1].split(":")[1].strip())
        return header_row, data_start_row
    except Exception:
        return None, None  # Invalid response, treat as no table


def find_data_end_row(sheet, start_row, potential_start_rows):
    """Find the last data row, stopping at an empty row or the next table start."""
    for row in range(start_row, sheet.max_row + 1):
        # Skip the start row itself; a bold first data cell would otherwise
        # be mistaken for the next table and truncate this one to nothing.
        if is_empty_row(sheet, row) or (row > start_row and row in potential_start_rows):
            return row - 1
    return sheet.max_row


def extract_table(sheet, header_row, data_start_row, data_end_row):
    """Extract the table into a DataFrame, handling merged headers."""
    headers = [get_cell_value(sheet, sheet.cell(header_row, col))
               for col in range(1, sheet.max_column + 1)]
    data = [[sheet.cell(row, col).value for col in range(1, sheet.max_column + 1)]
            for row in range(data_start_row, data_end_row + 1)]
    return pd.DataFrame(data, columns=headers)


def extract_tables_from_sheet(sheet, sheet_name):
    """Extract all tables from a sheet."""
    potential_start_rows = find_potential_start_rows(sheet)
    tables = []
    current_row = 1

    while current_row <= sheet.max_row:
        next_start = next((r for r in potential_start_rows if r >= current_row), None)
        if not next_start:
            break

        serialized = serialize_window(sheet, next_start)
        response = query_gpt4o(serialized, sheet_name, next_start)
        header_row, data_start_row = parse_model_response(response)

        if header_row and data_start_row and data_start_row > header_row:
            data_end_row = find_data_end_row(sheet, data_start_row, potential_start_rows)
            if data_end_row >= data_start_row:
                df = extract_table(sheet, header_row, data_start_row, data_end_row)
                tables.append(df)
                print(f"Extracted table from rows {header_row}-{data_end_row} in '{sheet_name}'")
                current_row = data_end_row + 1
            else:
                current_row = next_start + 1
        else:
            current_row = next_start + 1

    return tables


def parse_excel_to_dataframes(excel_file):
    """Parse the Excel file and return all tables as DataFrames."""
    wb = openpyxl.load_workbook(excel_file)
    all_tables = []
    for sheet_name in wb.sheetnames:
        sheet = wb[sheet_name]
        if sheet.max_row == 0 or sheet.max_column == 0:
            print(f"Sheet '{sheet_name}' is empty. Skipping.")
            continue
        tables = extract_tables_from_sheet(sheet, sheet_name)
        for idx, df in enumerate(tables):
            all_tables.append((sheet_name, idx + 1, df))
    return all_tables


# Example usage
if __name__ == "__main__":
    excel_file = "IDR2024-Part1-AllCharts-Tables.xlsx"
    tables = parse_excel_to_dataframes(excel_file)
    print("Total tables:", len(tables))
    for sheet_name, table_idx, df in tables:
        # Drop columns where all values are NaN
        df = df.dropna(axis=1, how='all')
        print(f"Number of rows: {len(df)}")
        print(f"Sheet: {sheet_name}, Table {table_idx}")
        print("Columns:", df.columns)
        print(df.to_string(index=False))

The above code works as follows (from Grok 3 output):

  1. Serialization:
  • The serialize_window function converts a window of the Excel sheet into a text format that describes each row, preserving formatting and structure.
  2. Model Analysis:
  • The serialized text is sent to gpt-4o, which identifies tables by recognizing patterns like bold headers followed by data rows.
  3. Boundary Extraction:
  • The model’s response is parsed to get precise row numbers for headers and data ranges.
  4. Data Extraction:
  • openpyxl extracts the raw data within the identified boundaries, and pandas structures it into DataFrames.
  5. Multiple Tables:
  • The process naturally supports multiple tables per sheet, as the model can detect distinct table blocks in the serialized text.
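To make the contract between the model analysis and boundary extraction steps concrete, here is the round trip with made-up row numbers (not part of the generated code, just an illustration of how parse_model_response behaves):

# A well-formed reply from the model looks like this (values illustrative)
response = "Header row: 3, Data start row: 4"
print(parse_model_response(response))  # (3, 4)

# Refusals and malformed replies degrade gracefully to (None, None)
print(parse_model_response("No table found"))  # (None, None)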

I tried the above code on multiple internal Excel reports, and it worked fine. For this blog post, I used an Excel report from the World Bank. You can access it here.

This Excel file has close to 30 sheets. The code extracted 28 tables. Below are some of the tables it extracted.

Below is a table from sheet Fig1.1:

Number of rows: 5
Sheet: Fig1.1, Table 1
Columns: Index(['', 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023], dtype='object')
                                       2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023
               Bilateral creditors 0.366970 0.371380 0.377771 0.404280 0.443366 0.470214 0.477443 0.507959 0.503992 0.492445 0.498977
Multilateral creditors (incl. IMF) 0.631531 0.633970 0.653714 0.684524 0.747430 0.798379 0.853867 0.990499 1.195996 1.239521 1.323926
                       Bondholders 1.017324 1.211877 1.237212 1.350969 1.641447 1.762545 1.976626 2.198213 2.298837 2.162042 2.140701
           Banks and other private 1.835008 2.050797 2.101781 2.201981 2.243818 2.322858 2.397311 2.414909 2.479336 2.472614 2.532256
                        Short-term 1.861783 2.022696 1.530982 1.483022 1.808832 2.027953 2.048435 2.061853 2.341120 2.264032 2.340647

Below is the table from sheet Fig1.3:

Number of rows: 4
Sheet: Fig1.3, Table 1
Columns: Index(['US$ (trillion)',             2013,             2014,             2015,
                   2016,             2017,             2018,             2019,
                   2020,             2021,             2022,             2023],
      dtype='object')
                     US$ (trillion)     2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023
                Bilateral creditors 0.334637 0.340524 0.357688 0.377822 0.421346 0.450154 0.459031 0.490972 0.488794 0.479388 0.486920
Multilateral creditors  (incl. IMF) 0.585712 0.589597 0.610460 0.641698 0.703285 0.754219 0.808774 0.944425 1.108898 1.155497 1.240161
                        Bondholders 0.664609 0.774435 0.816182 0.908023 1.071782 1.135259 1.218169 1.261615 1.280056 1.240152 1.275009
            Other private creditors 0.194034 0.245576 0.260939 0.264953 0.265712 0.294386 0.298296 0.301451 0.312462 0.312019 0.313955

Below is the table from the last sheet Fig3.2:

Number of rows: 5
Sheet: Fig3.2, Table 1
Columns: Index(['', 'March ', 'April', 'June', 'Late and non-reporters'], dtype='object')
      March   April  June  Late and non-reporters
2019    38.0     51  74.0                    28.0
2020    47.0     68  87.0                    13.0
2021    45.0     66  88.0                    12.0
2022    48.0     74  90.0                     9.0
2023    51.0     80  95.0                     7.0

There are still improvements that can be made to the table detection logic (one suggestion is sketched below), but overall it is a solid starting point, and one I don’t think most developers could come up with in less than 30 minutes. One more thing I will add is that Grok 3 is really fast. Latency is a feature.
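For example, the free-form “Header row: X, Data start row: Y” reply is fragile to rephrasing by the model. One improvement would be to request JSON and parse it directly. Below is a minimal sketch using OpenAI’s JSON mode; this is my suggestion, not part of the Grok 3 output, and it reuses the client from the code above:

import json


def query_gpt4o_json(serialized_window, sheet_name, start_row):
    """Variant of query_gpt4o that requests a machine-parseable JSON reply."""
    prompt = (
        f"Analyze the Excel sheet '{sheet_name}' starting from row {start_row}:\n\n"
        f"{serialized_window}\n\n"
        'Identify the first table. Respond with JSON only: '
        '{"header_row": <int>, "data_start_row": <int>}, '
        "using nulls if no table is found."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # forces syntactically valid JSON
    )
    parsed = json.loads(response.choices[0].message.content)
    return parsed.get("header_row"), parsed.get("data_start_row")

This would replace the string-splitting in parse_model_response and make failures explicit rather than silently swallowed.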

