How Good Are LLMs at Generating Functional and Aesthetic UIs? An Experiment

I conducted an LLM training session last week. To teach attendees about structured output, I built an HTML/JS web application. This application allows users to input a webpage and specify fields they want to extract. The web app uses OpenAI’s LLM to extract the relevant information. Before making the OpenAI call, the app first sends a request to Jina to retrieve a markdown version of the webpage. Then, the extracted markdown is passed to OpenAI for further processing. You can access the tool here: Structured Extraction Tool.
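The two-step flow described above (fetch markdown first, then hand it to the model) can be sketched roughly as follows. This is a minimal illustration and not the app's actual code: `buildMessages`, `extractFromPage`, `fetchMarkdown`, and `callModel` are hypothetical names, and the two network calls are passed in as functions so the sketch does not assume any particular Jina or OpenAI endpoint.

```javascript
// Turn the user's field list and the page markdown into chat-style messages.
// The prompt wording here is an assumption, not the wording used by the app.
function buildMessages(markdown, fields) {
  const fieldLines = fields
    .map((f) => `- ${f.name} (${f.type}): ${f.description}`)
    .join("\n");
  return [
    {
      role: "system",
      content: "Extract the requested fields from the page and reply with JSON only.",
    },
    {
      role: "user",
      content: `Fields:\n${fieldLines}\n\nPage markdown:\n${markdown}`,
    },
  ];
}

// Step 1: fetch the markdown. Step 2: send it to the model and parse the reply.
// fetchMarkdown and callModel are injected, hypothetical dependencies.
async function extractFromPage(url, fields, { fetchMarkdown, callModel }) {
  const markdown = await fetchMarkdown(url);
  const reply = await callModel(buildMessages(markdown, fields));
  return JSON.parse(reply);
}

// Usage with stubs standing in for the real network calls.
extractFromPage(
  "https://example.com/case-study",
  [{ name: "customer", type: "string", description: "Customer name" }],
  {
    fetchMarkdown: async () => "# Case study\nAcme Corp cut costs by 30%.",
    callModel: async () => '{"customer": "Acme Corp"}',
  }
).then((data) => console.log(data));
```

Injecting the two calls keeps the pipeline easy to test in a browser console without spending tokens.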

Note: The tool will prompt you to enter your OpenAI key, which is stored in your browser’s local storage.

Below, I will demonstrate the app’s workflow using screenshots. The user starts by entering the webpage URL. In this example, I want to extract some information from a case study.

Next, users specify the fields they want to extract. We also support field templates for reusability. For each field, users provide a name, description, and its type.
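One plausible way to turn these field definitions into something a structured-output call can use is to build a JSON Schema from them. The sketch below is an assumption about how that could look, not the app's real implementation: `buildJsonSchema` and the type mapping are hypothetical, and the type names (int, string, boolean, array, date) are taken from the prompt quoted later in this post.

```javascript
// Hypothetical mapping from the form's field types to JSON Schema fragments.
// Representing "date" as a string with a format hint, and "array" as an array
// of strings, are both assumptions made for this sketch.
const TYPE_MAP = new Map([
  ["int", { type: "integer" }],
  ["string", { type: "string" }],
  ["boolean", { type: "boolean" }],
  ["array", { type: "array", items: { type: "string" } }],
  ["date", { type: "string", format: "date" }],
]);

// Build an object schema from [{ name, description, type }, ...].
function buildJsonSchema(fields) {
  const properties = {};
  for (const field of fields) {
    const fragment = TYPE_MAP.get(field.type);
    if (!fragment) {
      throw new Error(`Unsupported field type: ${field.type}`);
    }
    properties[field.name] = { ...fragment, description: field.description };
  }
  return {
    type: "object",
    properties,
    required: fields.map((f) => f.name),
    additionalProperties: false,
  };
}

const schema = buildJsonSchema([
  { name: "customer", description: "Name of the customer in the case study", type: "string" },
  { name: "team_size", description: "Number of engineers involved", type: "int" },
]);
console.log(JSON.stringify(schema, null, 2));
```

Keeping the description on each property matters, because that text is what tells the model what to look for on the page.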

After specifying the fields, users press the Extract Data button. The app displays the extracted data, along with token usage and cost.
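The cost figure is simple arithmetic on the token counts returned with the response. Here is a minimal sketch, assuming the usage object exposes `prompt_tokens` and `completion_tokens` and that prices are quoted per million tokens; `estimateCost` is a hypothetical helper and the prices in the example are illustrative placeholders, not real rates.

```javascript
// Estimate the cost of one call from token counts and per-million-token prices.
function estimateCost(usage, pricesPerMillion) {
  const inputCost = (usage.prompt_tokens / 1e6) * pricesPerMillion.input;
  const outputCost = (usage.completion_tokens / 1e6) * pricesPerMillion.output;
  return { inputCost, outputCost, total: inputCost + outputCost };
}

// Illustrative numbers only; check the provider's pricing page for real values.
const cost = estimateCost(
  { prompt_tokens: 4000, completion_tokens: 200 },
  { input: 2.5, output: 10 }
);
console.log(`Estimated cost: $${cost.total.toFixed(4)}`);
```

Since the page markdown usually dominates the prompt, the input side tends to drive the total for this kind of extraction.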

This application was entirely generated using Claude in a five-message, back-and-forth conversation.

I wanted to explore the kind of UI/UX other LLMs could generate, so I experimented with multiple models using WebDev Arena. This platform allows you to run a prompt in an “AI battle mode,” where two random LLMs generate and render a Next.js React web app. Interestingly, they didn’t opt for plain HTML/JS. I can think of two possible reasons:

LLMs are trained on more React applications than plain HTML/JS code.
React is more suitable for typical enterprise use cases, making it a more realistic choice.

WebDev Arena is an open-source benchmark evaluating AI capabilities in web development, developed by LMArena.

I asked Claude to summarize my multi-message conversation into a single prompt.

Below is the complete prompt:

Generate a web application using **HTML**, **JavaScript**, and **Tailwind CSS** with the following features:

1. **Purpose**:
The app should extract structured data from web pages using a Large Language Model (LLM). It should be intuitive, visually pleasing, and optimized for users who will spend a significant portion of their day using it.

2. **UI Design**:

- Use a **pleasing color palette** and ensure all UI components are styled using **Tailwind CSS**.
- Create a **beautiful text input box** for users to enter the URL of a webpage.
- Add tooltips to provide contextual help wherever needed.
- Design for **ease of use** and consider best UX practices for intuitive interaction.

3. **Workflow**:

- After entering the webpage URL, prompt users to **list the fields** they want to extract.
- Each field requires:
- **Name**
- **Description**
- **Type** (choose from: int, string, boolean, array, or date).
- Display an initial form with one field, but include a **"Add More Fields" button** to dynamically add additional fields.
- Include validation for inputs where appropriate (e.g., ensure URL is valid, required fields are filled).

4. **Data Extraction**:

- Once the user defines all fields, show an **"Extract" button**.

- When the button is clicked, display a **loader animation**.

- Use the following cURL request to fetch the webpage's markdown via Jina

"""
curl -X POST "https://api.jina.ai/v1/webpage" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com"
}'
"""
- Parse and display the extracted data in a clean, structured format based on the user's specified fields.

5. **Additional Features**:

- Make the UI **responsive** for different screen sizes.
- Include error handling for failed API calls or invalid inputs.
- Optimize the app for fast performance and a seamless user experience as there can be many fields that a user might want to extract.

Generate the HTML, JavaScript, and Tailwind CSS code required to implement this application.

My expectations from this exercise

I wanted to see what was possible in a single shot.
My primary interest was in the app’s UX. I didn’t expect it to make actual Jina or OpenAI API calls.
I wanted to evaluate how the models handled a long-form prompt. Did they follow all the instructions? I was particularly curious about how reasoning-focused models like o1 would perform.
Would the models consider UX aspects, such as adding a delete button for fields? I explicitly mentioned in my prompt: “a seamless user experience as there can be many fields that a user might want to extract.”
What title would they use for the generated web page or form?
Would they include any surprise UX elements? I hinted at this multiple times in the prompt.
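To make the delete-button expectation concrete, here is roughly the state handling I had in mind for a dynamic field list. It is a sketch under my own assumptions rather than code from any of the models: `createField`, `addField`, and `removeField` are hypothetical names, and keeping at least one row is a design choice I am assuming, not something the prompt requires.

```javascript
// Each row gets a stable id so it can be removed regardless of its position.
let nextId = 1;

function createField() {
  return { id: nextId++, name: "", description: "", type: "string" };
}

// Return a new array with one more empty row (handler for "Add More Fields").
function addField(fields) {
  return [...fields, createField()];
}

// Return a new array without the given row (handler for a per-row Remove button).
// The form always keeps at least one row, so the last one cannot be removed.
function removeField(fields, id) {
  if (fields.length <= 1) {
    return fields;
  }
  return fields.filter((field) => field.id !== id);
}

let fields = [createField()];
fields = addField(fields);
fields = removeField(fields, fields[0].id);
console.log(fields.length);
```

With many fields on screen, removing by id rather than by index avoids deleting the wrong row after earlier rows have been removed.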

Model Details

We will try multiple LLMs. Below are the details of each.

S No	Model Name	Description
1	deepseek-v3	An advanced Mixture-of-Experts (MoE) language model developed by DeepSeek AI, featuring 671 billion total parameters with 37 billion activated per token.
2	o1-2024-12-17	OpenAI’s latest AI model, o1, represents a significant advancement towards human-like reasoning capabilities. This model is part of OpenAI’s initiative to usher in the “Intelligence Age,” aiming to leverage AI for solving complex global challenges.
3	gemini-2.0-flash-thinking-exp-1219	A cutting-edge AI model developed by Google DeepMind, Gemini 2.0 integrates advanced reasoning capabilities with real-time data processing.
4	o1-mini-2024-09-12	A scaled-down version of OpenAI’s o1 model, o1-Mini retains the core reasoning capabilities of its predecessor while being optimized for deployment in resource-constrained environments.
5	gpt-4o-2024-11-20	An iteration of OpenAI’s GPT-4 model, GPT-4o incorporates optimizations for enhanced performance in natural language understanding and generation tasks.
6	gemini-2.0-flash-exp	A variant of the Gemini 2.0 series by Google DeepMind, this model emphasizes rapid processing capabilities, enabling it to handle tasks that require swift data analysis and response generation.
7	gemini-exp-1206	An experimental model in the Gemini series, Gemini-Exp-1206 explores novel architectures and training methodologies to push the boundaries of AI capabilities.
8	qwen2p5-coder-32b-instruct	A specialized AI model tailored for coding assistance, Qwen2.5 Coder is designed to understand and generate code across multiple programming languages.
9	gemini-1.5-pro-002	Part of the Gemini 1.5 series, this professional-grade model offers robust performance in various AI tasks, including language understanding, translation, and summarization.
10	claude-3-5-sonnet-20241022	Developed by Anthropic, Claude 3.5 Sonnet is an AI assistant model optimized for conversational interactions. It focuses on providing coherent and contextually relevant responses, enhancing user experience in dialogue systems.

Let’s battle now.

1. deepseek-v3 vs o1-2024-12-17

The first battle was between deepseek-v3 and o1-2024-12-17.

deepseek-v3 generated the following UI.

As you can see, it generated a standard form with a standard color palette. Users can add one or more fields.

My analysis of the deepseek-v3 output:

It added a Remove button in case the user wants to delete a field
No visual indicators for required fields
The UX is pretty standard
The app name, Data Extraction App, also sounded too generic and boring
Not sure why the buttons are so long.

o1-2024-12-17 generated the following UI.

Nice to see that it added a ? icon for tooltips
Added validations, but they were not inline with the fields
No remove button for fields. I was expecting o1 to figure this out.
It also chose Data Extraction App as the name of the app.

2. gemini-2.0-flash-thinking-exp-1219 vs o1-mini-2024-09-12

gemini-2.0-flash-thinking-exp-1219 is the thinking model from Google. Gemini 2.0 Flash Thinking Mode is an experimental model that’s trained to generate the “thinking process” the model goes through as part of its response. As a result, Thinking Mode is capable of stronger reasoning in its responses than the Gemini 2.0 Flash Experimental model.

gemini-2.0-flash-thinking-exp-1219 generated the following UI.

No delete button for fields
Each field is rendered as a horizontal row containing all its inputs. This helps avoid a long form, but it will struggle if a description is long or if we decide to add more fields.
No tooltips or validation
I like that it added a subtitle to the page: Enter a URL and specify the fields to extract.
Button sizes also looked fine.

o1-mini-2024-09-12 generated the following UI.

Has tooltips and validation. But again, validation happens only when you press the Extract button, and the messages are not inline.
No delete button for fields
Each field is rendered as a horizontal row containing all its inputs. This helps avoid a long form, but it will struggle if a description is long or if we decide to add more fields.
Better field placeholder names
Button size looked fine

This is another o1-mini output.

This time it added a Remove button as a little red cross
It also added required-field indicators along with tooltips
Not sure why the Extract button is so large

3. gpt-4o-2024-11-20 vs gemini-2.0-flash-exp

Below is the version generated by gpt-4o-2024-11-20.

No validation or tooltips
No delete button for fields
No visual indicators for required fields
Standard UX and color palette
Long button sizes
Standard placeholder for fields

Below is the version generated by gemini-2.0-flash-exp.

It added a Remove button
No tooltips
Buttons were attached to each other and had different sizes
Standard color palette

4. qwen2p5-coder-32b-instruct vs gemini-exp-1206

qwen2p5-coder-32b-instruct generated the following UI.

It added a Remove button
Has validation but no tooltip markers
Standard color palette

gemini-exp-1206 generated the UI below. Nothing much to add.

5. gemini-1.5-pro-002 vs claude-3-5-sonnet-20241022

gemini-1.5-pro-002 generated a very poor UI. I don’t know what to write about it.

claude-3-5-sonnet-20241022 generated a good-looking UI.

Added a delete button for removing a field. For a delete action, it is always recommended to ask for confirmation. Claude Sonnet didn’t add that.
Added validation and tooltips. I would have liked the validation messages to be shown next to the corresponding HTML elements.
Didn’t add required-field indicators
Separate sections for entering the web page URL and the fields.
I like the name Web Data Extractor
Clear separation between sections
Add Field button at the top right. Nice.
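The confirmation step I mention for delete actions is cheap to add. Below is a minimal sketch, assuming each field row carries an `id` and a `name`; `removeFieldWithConfirm` is a hypothetical helper, and the confirm function is passed in so that a browser can supply `window.confirm` or a custom dialog.

```javascript
// Remove a field only after the user confirms the action.
// confirmFn receives a message and returns true (proceed) or false (cancel).
function removeFieldWithConfirm(fields, id, confirmFn) {
  const target = fields.find((field) => field.id === id);
  if (!target) {
    return fields; // Unknown id: nothing to remove.
  }
  const label = target.name ? `"${target.name}"` : "this field";
  if (!confirmFn(`Remove ${label}?`)) {
    return fields; // User cancelled: keep the list unchanged.
  }
  return fields.filter((field) => field.id !== id);
}

// In a browser this could be: (message) => window.confirm(message)
const alwaysYes = () => true;
const remaining = removeFieldWithConfirm(
  [{ id: 1, name: "customer" }, { id: 2, name: "industry" }],
  1,
  alwaysYes
);
console.log(remaining);
```

Including the field name in the message helps when the form has many rows and the user clicked the wrong one.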

Conclusion

This exercise highlighted several strengths and weaknesses in the UX generated by various LLMs. While some models, like Claude, showcased thoughtful design elements such as tooltips and delete buttons, others, like gemini-1.5-pro-002, produced subpar UIs with little to no attention to UX.

What stood out the most:

Many models failed to inline validation messages with the fields, an essential UX feature for form-heavy applications.
The lack of required field indicators in most UIs was surprising, given its necessity for usability.
Some models showed creativity by adding titles or subtitles, but overall, most outputs stayed generic.
The inclusion of a delete button for fields was inconsistent, even though it’s critical for dynamic forms.
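As an illustration of the inline validation point, one approach is to have validation return an error per input instead of one combined message, so each message can be rendered next to its field. This is a sketch under my own assumptions: `validateForm`, `isValidUrl`, and the `fields.<index>.<property>` key format are hypothetical, and treating both name and description as required is my choice for the example.

```javascript
// Accept only absolute http(s) URLs.
function isValidUrl(value) {
  try {
    const parsed = new URL(value);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false;
  }
}

// Return a map from input key to error message. An empty map means the form
// is valid. Keys such as "fields.0.name" let the UI place each message
// directly under the matching input.
function validateForm(form) {
  const errors = {};
  if (!isValidUrl(form.url)) {
    errors.url = "Enter a valid http(s) URL.";
  }
  form.fields.forEach((field, index) => {
    if (!field.name.trim()) {
      errors[`fields.${index}.name`] = "Name is required.";
    }
    if (!field.description.trim()) {
      errors[`fields.${index}.description`] = "Description is required.";
    }
  });
  return errors;
}

const errors = validateForm({
  url: "example.com",
  fields: [{ name: "customer", description: "", type: "string" }],
});
console.log(errors);
```

The same required-field knowledge can drive the visual indicators, for example an asterisk next to each label that `validateForm` checks.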

These experiments helped me understand how different LLMs approach UI generation and how they interpret user prompts. While no model delivered a flawless UX, each provided insights into their design reasoning and capabilities.

As Paul Graham’s tweet suggests, the potential of AI to replace tools like Figma with generative solutions like Replit is growing. It’s exciting to imagine how far AI-driven UI design can evolve in the near future.

My experiments with language models for UI generation show that they can quickly create a generic first draft of a UI. However, they often miss critical usability requirements, as discussed above. I believe there is significant value in focusing on design before moving to prototyping. Design encourages thoughtful consideration of the problem, which may not happen if you jump straight to prototyping. My approach is to invest just enough effort in design and then use LLMs for rapid prototyping. This workflow has proven effective for me.
