Generating architecture.md with code2prompt and OpenAI gpt-4o-mini model


New contributors often struggle to understand a project when there is no documentation that explains its architecture. This post shows how we can use Large Language Models (LLMs) to automate the generation of architecture documentation (architecture.md) directly from a project’s codebase. I first learnt about architecture.md from a post published in 2021. Many popular open source projects, such as Caddy, keep an architecture.md file in their source tree.

We call our script as shown below.

./architecturemd-generator.sh https://github.com/frdel/agent-zero.git

It generates architecture.md, as shown in the screenshot below. You can look at the complete generated architecture.md file in the GitHub repository.

GitHub Repo

The complete source code is available at https://github.com/shekhargulati/llm-tools.

architecturemd-generator.sh

The following is the code for the architecturemd-generator.sh script. It uses the llm, code2prompt, and git CLI tools. For the latest version of the script, always refer to the GitHub repo.

  • code2prompt: This command-line tool simplifies the process of providing context to LLMs by generating a comprehensive Markdown file containing the content of your codebase.
  • llm: This is a CLI utility and Python library for interacting with LLMs, allowing you to leverage their capabilities directly from your terminal.
#!/bin/bash
repo_url="$1"
exclude_dirs="$2"

temp_dir=$(mktemp -d)

git clone "$repo_url" "$temp_dir"

checked_out_path="$temp_dir"

code_md_path="${checked_out_path}/code.md"

# Quote the paths so directories containing spaces are handled correctly
code2prompt --path "$checked_out_path" --exclude "${exclude_dirs}" --tokens --output "$code_md_path"

code_md_content=$(cat "$code_md_path")

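# Quote the heredoc delimiter so the backticks in the prompt are kept literally
# instead of being run as a command substitution
prompt=$(cat <<'EOF'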
# IDENTITY and PURPOSE
You are an experienced software architect. You take as input code content of a Git repository and you generate an architecture document.
Take a deep breath and think step by step about how to best accomplish this goal using the following steps.

# STEPS

- Combine all of your understanding of the project architecture in a section called HIGH LEVEL UNDERSTANDING.
    - Provide a high-level overview of the architecture. Use Mermaid diagram to illustrate the overall architecture and data flow.
- List the technology stack used by the application in a section called TECHNOLOGY STACK
- Create a section called DESIGN DECISIONS where you discuss different design decisions project creators took
- Create a section GETTING STARTED that guides user in a step by step manner on how to use the project
- Create a section called ENTRY POINTS that tells user how to get started with the code base. This should help new developers learn to navigate the code faster
- Create a section called COMPONENTS that describe major components of the system
    - Use headings (e.g., `### Component Name`) to clearly label each section.
    - For each component:
        - Describe the purpose of the component.
        - Explain the key data structures or algorithms used.
        - Mention any important design patterns or architectural styles employed (like microservices, event-driven architecture, etc.).
        - Highlight any architectural invariants or rules that govern the component.
- Create a section called CODE MAP
    - Provide a brief description of the directory structure, if applicable. Highlight important directories and their functions.
    - Help user understand API of the application

# INPUT
EOF
)

model="gpt-4o-mini"

echo "$code_md_content" | llm -m "$model" -s "$prompt" -o temperature 0.4 -o max_tokens 2000

Before diving in, ensure you have the following tools installed:

  • Git (https://git-scm.com/)
  • code2prompt (installable via pip: pip install code2prompt)
  • llm (installable via pip: pip install llm)
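A quick way to confirm that these tools are on your PATH before running the script is a small check like the one below. This is only a sketch: check_tools is a name I made up for illustration, and the version of the check in the GitHub repo may look different.

```shell
#!/bin/bash
# check_tools is a hypothetical helper name, not taken from the original script.
# It reports every required command that cannot be found on PATH.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Error: '$tool' is not installed or not on PATH" >&2
      missing=1
    fi
  done
  # Non-zero status if at least one tool was missing
  return "$missing"
}

# Typical call near the top of the script:
# check_tools git llm code2prompt || exit 1
```

Checking all tools before exiting means a single run lists everything that still needs to be installed.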

The Script Breakdown

The provided bash script automates the process of generating architecture documentation from a Git repository URL. Here’s a step-by-step breakdown of what the script does:

  1. Input Validation:
  • The script takes two arguments:
    • $1: The URL of the Git repository to analyze.
    • $2 (Optional): A comma-separated list of directories to exclude from the analysis (e.g., “tests,docs”).
  • The script checks if git, llm, and code2prompt commands are available. If not, it throws an error message and exits.
  2. Cloning the Repository:
  • The script clones the provided Git repository URL into a temporary directory created using mktemp -d.
  • The script then stores the path to the cloned repository in the checked_out_path variable.
  3. Generating Code Prompts:
  • The script uses code2prompt to collect the codebase in the temporary directory into a single Markdown document. The generated output, which contains the contents of the repository’s files, is stored in a file named code.md within the temporary directory.
  • The --exclude flag allows you to specify directories to exclude from the analysis, ensuring a focused prompt for the LLM.
  4. Crafting the LLM Prompt:
  • The script defines a multi-line string variable named prompt. This string essentially acts as a detailed instruction manual for the LLM, guiding it through the process of generating the architecture document.
  • The prompt outlines the desired structure of the document, including sections as shown below. You can customize it to your needs.
    • HIGH LEVEL UNDERSTANDING (with a Mermaid diagram)
    • TECHNOLOGY STACK
    • DESIGN DECISIONS
    • GETTING STARTED
    • ENTRY POINTS
    • COMPONENTS (detailed breakdown of each component)
    • CODE MAP
    • CROSS CUTTING CONCERNS (testing, error handling, etc.); the prompt listed above has no step for this section, so add one if you want it in the output
  • The prompt ends with an # INPUT heading. The actual code content is not embedded in the prompt; it is supplied separately when the llm command is run.
  5. Feeding the Code and Generating Output:
  • The script uses cat to read the contents of the code.md file, effectively capturing the analyzed code information.
  • This code content is then piped to the standard input of the llm command, while the prompt is passed as the system prompt through the -s option.
  • The LLM processes the prompt, code content, and its own internal knowledge to generate the architecture documentation. The generated output is displayed on the terminal.
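The first two steps above can be made a little more defensive. The sketch below is an assumption on my part, not code from the repo: validate_args and with_temp_dir are hypothetical helper names. It rejects a missing repository URL and removes the temporary directory once the work is done.

```shell
#!/bin/bash
# validate_args and with_temp_dir are hypothetical helpers for illustration;
# they are not part of the original script.

# Fail with a usage message when the repository URL ($1) is missing.
validate_args() {
  if [ -z "${1:-}" ]; then
    echo "Usage: architecturemd-generator.sh <repo_url> [exclude_dirs]" >&2
    return 1
  fi
}

# Run the given command with a fresh temporary directory appended as its
# last argument, then delete the directory and return the command's status.
with_temp_dir() {
  local dir status
  dir=$(mktemp -d) || return 1
  "$@" "$dir"
  status=$?
  rm -rf "$dir"
  return "$status"
}

# Example (needs network access, so it is left commented out):
# validate_args "$1" && with_temp_dir git clone "$1"
```

Because the directory is passed as the last argument, with_temp_dir git clone "$repo_url" expands to the same git clone "$repo_url" "$temp_dir" call the script already makes. In a full script the code2prompt and llm steps would have to run inside the wrapped command, before the directory is deleted.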

Conclusion

This script demonstrates how the combination of code2prompt and llm can automate the generation of architecture documentation from code. This approach can save developers significant time and effort, allowing them to focus on more strategic tasks. Remember to experiment with different LLM models and edit the prompt instructions to meet your needs.
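One low-effort way to experiment with different models is to read the model name from an environment variable and fall back to gpt-4o-mini. This is a minimal sketch; ARCH_MODEL and pick_model are names I invented for this example and do not exist in the original script.

```shell
#!/bin/bash
# ARCH_MODEL and pick_model are hypothetical names used only in this sketch.
# Print the model to use: the value of ARCH_MODEL if set, else gpt-4o-mini.
pick_model() {
  echo "${ARCH_MODEL:-gpt-4o-mini}"
}

# In the script, the hard-coded assignment could then become:
# model=$(pick_model)
# echo "$code_md_content" | llm -m "$model" -s "$prompt" -o temperature 0.4 -o max_tokens 2000
```

With this in place, a call such as ARCH_MODEL=gpt-4o ./architecturemd-generator.sh <repo_url> would switch models without editing the script, assuming the chosen model is one that your llm installation knows about.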

I am building a course on how to build production apps using LLMs. We will cover topics like prompt engineering, RAG, search, testing and evals, fine tuning, feedback analysis, and agents. You can register now and get a 50% discount using this form: https://forms.gle/twuVNs9SeHzMt8q68

