shekhargulati

STT Disambiguation in Voice Agents: It is “Now” Not “No”

I was reviewing our voice calls today and discovered that in one of voice agents when user was saying “Now” our system was handling it as “No”. After we reset their password we ask them “Would you like to try logging in with this temporary password now, or would you prefer to do it later?”. In many cases when user said Now STT made converted it to No so we ended the call instead of being there with user while they login.

Speech-to-text layer adds failure modes which are frustrating for end users and they start thinking your voice agent is stupid.

If Speech-to-Text layer misheard a single word, then tiny error cascade into a completely broken user experience.

Why This Happens

Speech-to-text systems face a fundamental challenge with words that sound alike. Research on homophone disambiguation shows that while humans use phonetic detail and conversational context to disambiguate similar-sounding words, STT models typically don’t incorporate this linguistic information explicitly. We use Deepgram for STT model.

The problem gets worse in real-world conditions. As noted in Amazon’s research on ASR error detection, speech-enabled agents are vulnerable to errors stemming from misrecognition, and these errors propagate through the system, resulting in unexpected changes in conversation flow.

“Now” and “no” are particularly tricky because:

They share the same initial consonant
The vowel sounds are close enough that background noise, microphone quality, or accent variations can blur the distinction
Both are extremely common words, so the language model doesn’t have strong prior expectations either way

The Fix (That Doesn’t Work)

Your first instinct might be to reach for keyword boosting. This feature lets you tell the model to favor certain words when it’s uncertain. You could boost “now” and hope the model picks it up more reliably.

But here’s the thing: keyword boosting works best for uncommon words, proper nouns, or domain-specific terminology. As the documentation notes, you shouldn’t send common words that the model already handles. “Now” is about as common as words get. Boosting it aggressively risks introducing false positives elsewhere in your transcripts.

Our fix

We applied the fix at the LLM layer so that it becomes smarter about handling ambiguous input. I am still looking if we can use some metadata from Deepgram to make better decision. It is still work-in-progress.

Here’s what we changed:

1. Changed the question we ask user

Users typically reply with what they hear so we changed our question to following. This helps avoid Now in most of the calls.

Would you like to try logging in right away, or would you prefer to try later?

2. Flagged the ambiguous case explicitly

We told the LLM that “no” by itself is suspicious:

HANDLING AMBIGUOUS "NO":
If the user says just "no" without additional context, this is likely
STT mishearing "now". Ask for clarification: "Just to confirm - would
you like to try logging in right away, or save it for later?"

We also updated our function schema to prevent the LLM from proceeding with ambiguous input:

Do NOT call with false if user just said 'no' without context -clarify first.

Why Clarification Beats Assumption

Voice UX research distinguishes between implicit and explicit confirmation. Implicit confirmation keeps things moving but can lead to unrecoverable errors. Explicit confirmation slows things down but catches mistakes.

The conventional wisdom is to use explicit confirmation for high-stakes actions (payments, bookings) and implicit confirmation for low-stakes ones. But there’s a third case that doesn’t fit neatly: when the input itself is unreliable.

In our scenario, asking “Just to confirm…” adds maybe 3 seconds to the conversation. Routing the user down the wrong path means they hang up confused, call back, and start over. The math isn’t close.

The natural flow of conversation already includes methods for “repairing” misunderstandings through repetition and paraphrasing. Your voice agent should do the same.

Broader Lessons

STT errors are systematic, not random

If “now” gets mistranscribed as “no” once, it will happen again. These aren’t one-off glitches. They’re predictable failure modes based on acoustic similarity. Identify them through testing and handle them explicitly.

We recommend simulating background noise during testing specifically to reveal which keywords get dropped or substituted by ASR.

Your LLM is your safety net

LLMs can compensate for ASR errors by using conversational context. The LLM knows what question was just asked. It knows what answers make sense. Use that knowledge.

A bare “no” in response to “would you like to try now or later?” is semantically odd. The user didn’t say “later” or “not right now” or “I’ll pass.” They just said “no.” A well-prompted LLM can recognize this as a signal to seek clarification.

Function schemas are prompts too

If you’re using function calling, the descriptions in your schema are instruction surface area. They influence how and when the LLM invokes functions. We found that adding disambiguation guidance directly to the function description reinforced the behavior we wanted in the task prompt.

Building Production-Ready Voice Agents

I spent the second half of 2025 building a voice agent platform for a company that provides IT support services to higher education institutions. The platform is now live and handling calls from students and staff across multiple universities.

We built a multi-tenant system where students call a phone number, speak with an AI agent, and get help with common IT tasks. The three primary use cases we’ve deployed are:

Password resets — verifying identity and generating new credentials
FAQ responses — answering common questions about IT services
Front desk routing — transferring calls to the appropriate department or staff member

The platform is extensible, allowing us to add new use cases with minimal changes to the core system.

Tech Stack

Backend: Python 3.x, FastAPI, Pipecat, Pipecat Flows
Speech-to-Text: Deepgram
LLM: OpenAI GPT-4.1 and GPT-4.1-mini
Text-to-Speech: Cartesia
Telephony: Twilio
Database: PostgreSQL

This stack worked well for a small team of three developers building and iterating quickly.

Architecture Overview

Incoming calls connect via Twilio (or WebRTC for browser-based testing). The WebSocket handler creates a processing pipeline that orchestrates the speech-to-text, LLM, and text-to-speech services. The Flow Manager maintains conversation state, determining which prompts and functions are available at each step.

In this post, I’ll share the principles, best practices, and lessons learned from building this system. While our examples use Pipecat, the concepts apply regardless of your tech stack.

Debugging failed HTTP request with Claude Code

I have a Python HTTP client built using the httpx library. The client worked fine when I was working with the test environment, but as soon as I pointed it to the prod environment, the request started failing with an HTTP 403 error. To be 100% sure, I tested the same request using the Bruno REST client and it worked. I used Bruno to generate cURL and Python 3 (without any dependencies) code, and they both worked as well. I spent 15 minutes debugging myself but couldn’t figure out the reason. The same request passing in Bruno, cURL, and Python 3 (without dependencies) was working, but the client code written with the httpx client library was failing.

So, I opened Claude Code and prompted it to debug the issue for me.

 I am facing an issue where if I run the cURL request below it succeeeds. The same call if we make it to @app/clients/identity_client.py  via id_test.py give 403. I am 
  having hard time understanding the difference between the two. Help me debug what is the difference between two and why it is failing. Then, help me fix it.


  curl --request POST \
    --url URL \
    --header 'content-type: application/json' \
    --data '{
    "key": "00e78489023b4d6ea2ed440ef5fb3e2f",
    "tenantDomain": "domain",
    "username": "user"
  }'

Claude Code first suggested that the 403 error was coming because of:

Different base URLs: Your Python client uses env1.identitysystem.io but the working cURL uses env2.identitysystem.io
Different key: The cURL uses a specific key while your client generates a random UUID

I told Claude that this was not the case. In my id_test.py I set the base URL to env2.identitysystem.io.

Claude Code then said:

You’re absolutely right! You are already setting the correct base URL in id_test.py. Looking at the client code again, the issue is likely the random key generation.

Claude Code then ran the Python client code with the key value from the cURL request and found out that the key is not the issue. Even with the same key, the API call was failing.

The issue persists. Let me debug further by checking if there are any HTTP headers differences. Let me modify the client to print the actual request being made and add proper headers:

Claude Code added debug logs to print headers and data to debug the issue.

Using the debug logs, it figured out the issue.

Great! Now I can see the issue more clearly. The response is an HTML 403 error from the load balancer (server: awselb/2.0), not from the API itself. This suggests there might be a User-Agent or other header requirement.

Now let’s test if setting a cURL-like User-Agent fixes the issue

Perfect! The issue was indeed the User-Agent header. The API/load balancer was blocking requests with the default httpx User-Agent but allows requests with a cURL-like User-Agent.

httpx library uses python-httpx/X.Y.Z as user agent where X.Y.Z depends on the httpx library version.

Claude Code successfully fixed the issue.

Root Cause: The Identity provider API/load balancer was blocking requests based on the User-Agent header. The default httpx User-Agent was being rejected with a 403 Forbidden error.

Conclusion

This type of issue is particularly frustrating for developers because environment dissimilarities create false assumptions. When the same request works in Bruno and cURL but fails in your application, you don’t immediately suspect something as basic as the User-Agent header.

These bugs are time-consuming because they violate our expectations – if the HTTP request is identical, it should work regardless of the client. The root cause often lies in subtle differences that aren’t obvious, like default headers that vary between tools.

Having a systematic debugging approach, whether through AI assistance or methodical logging, helps identify these hidden variables more efficiently than manual trial and error. Sometimes an external perspective is needed to spot what you’ve overlooked.

Notes from Gemini Embedding Paper

I was reading a paper by the Google DeepMind team on how they trained Gemini Embedding, a state-of-the-art, unified embedding model. This is the second paper I’ve read this month on training embedding models. Last week, I read about how the Jina embedding model was trained. The Jina embedding paper was thin and lacked details, so I didn’t write about it. This paper is full of insights, so I thought I’d write a short post sharing what I learned.

Gemini Embedding achieves state-of-the-art performance across MMTEB’s multilingual, English, and code benchmarks.

Gemini embeddings use a multi-resolution loss (MRL) so that a single model can produce embeddings of different sizes (768, 1536, 3072). During training, the model applies separate contrastive losses on different sub-portions of the embedding vector, ensuring that both shorter and longer embeddings are well-trained. This provides flexibility: smaller embeddings for efficiency, larger ones for accuracy — all from the same model.

They cite two main reasons why the Gemini Embedding model achieves state-of-the-art performance in benchmarks:

The Gemini Embedding model is initialized from the weights of the Gemini LLM backbone. They also note that several recent embedding models such as E5-Mistral, SFR-Mistral, BGE-ICL, and NV-Embed have been initialized from the Mistral-7B (Jiang et al., 2023) backbone and then further adapted as embedding models. The same is true for the jina-code-embeddings-0.5b and 1.5b models, as they are built on the Qwen2.5-Coder-0.5B and Qwen2.5-Coder-1.5B backbones.
The second reason they cite is high-quality datasets. These datasets are synthetically generated using Gemini LLM. They mention: “Leveraging Gemini’s diverse capabilities, we train Gemini Embedding on a comprehensive suite of embedding tasks. To construct a high-quality, heterogeneous training dataset, we employ Gemini for several critical data curation steps: filtering low-quality examples, determining relevant positive and negative passages for retrieval, and generating rich synthetic datasets. This curated dataset facilitates training with a contrastive learning objective, enabling Gemini Embedding to learn robust semantic representations.”

In the paper, they also mention that the Gemini embedding model is trained with a contrastive loss that pulls queries close to their correct targets while pushing away incorrect ones. Negatives are usually sampled from the same batch, and sometimes hard negatives are added to make learning more robust. Each example is also tagged with a task type, which conditions the model to learn embeddings useful across different domains like Q&A or fact-checking.

Each training example also includes a task description such as "question answering" or "fact checking". This string tells the model what kind of relationship between the query and target it should focus on. In effect, it makes the embeddings task-aware, allowing a single embedding model to generalize across multiple use cases.

They also discuss that to train the model they used a two-stage process — Pre-finetuning and Finetuning.

Pre-finetuning: First, the model is “pre-finetuned” on a large number of potentially noisy (query, target) pairs, omitting the hard-negative term from the loss function. They found it beneficial to use a large batch size, as the primary objective is to adapt the parameters from autoregressive generation to encoding.
Finetuning: Next, the model is fine-tuned on a large mixture of task-specific datasets containing (query, target, hard negative target) triples. For this phase of training, they found it beneficial to use smaller batch sizes (e.g., less than 1024) and to limit each batch to a single dataset, as distinguishing a given positive target from in-batch targets from the same task provides greater signal than discerning (say) a retrieval target from a classification label.

Comparing Different OpenAI Models on Extracting Structured Information from PDF Documents

I was working on a problem where I needed to extract information from hotel tariff sheet PDF documents. These documents provide details on seasonal room rates, occupancy terms, and related supplements. They serve as standard reference material for travel agents, tour operators, and partners when contracting accommodations. Below is a screenshot of a synthetic document (similar to the original) that I created using ChatGPT.

For this use case I used OpenAI responses API. I tried extraction with gpt-4.1-mini, gpt-4o, gpt-4o-mini, gpt-5-nanoand gpt-5-mini models.

Notes on mini-swe-agent

I was going over the code base of mini-swe-agent today. The core agent loop is 100 lines long. All agentic framework does something similar. Interesting facts about mini-swe-agent:

Only uses bash tool
Does not depend on function calling. It parses the response to extract commands that need to be run

The Mini-SWE-Agent operates in a continuous loop, iteratively solving problems by querying an LLM for actions, executing bash commands, and observing results until the task is complete.

How executives select GenAI vendors

I was reading state of AI in Business 2025 report today and below resonated with me. I am also building enterprise Generative AI products and found it useful.

Memory is becoming critical for succeeding at Generative AI products. People expect Generative AI products to become smart. I have listened to multiple talks this year. These talks were from folks at Microsoft and different AI labs. They all are talking about memory.

Paper: Working with AI: Measuring the Occupational Implications of Generative AI

Today I was going over a paper by Microsoft Research team on how AI is impacting professsional work. This paper was published in July 2025. They analyzed 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot to understand how generative AI impacts different occupations and work activities.

They seperated analysis into two distinct perspectives:

User Goals: What people are trying to accomplish with AI assistance
AI Actions: What work activities the AI actually performs

They used O*NET database’s 332 Intermediate Work Activities (IWAs) as the basis of their classification. One of the surprising finding of the paper is that in 40% of conversations, user goals and AI actions were completely different – AI often acts as a coach/advisor rather than directly performing the user’s task.

They also list occupations where there is highest AI applicability like translators, sales reps, customer service representatives, writers, etc.

As per their study currently AI augments human work rather than fully automating it. Most occupations have some AI applicability, but none are fully automated. They also mentions that impact is uneven – some work activities highly affected, others not at all. Even successful AI assistance typically covers only moderate portions of work activities.

I Tested Gemma 3 270M on the Simplest NLP Task

Google recently released Gemma 3 270M, a remarkably compact 270 million parameter language model that promises efficient AI capabilities in a tiny package. As someone building AI voice agents, I was immediately interested in testing whether this model could handle one of my simplest but frequent use cases: generating message variations for conversational AI.

For example, given a message like “Please wait. I am checking if your username exists in the system,” I want the LLM to generate semantically equivalent variations such as “One moment please while I verify your username in our system.” This is a lightweight task that models like GPT-4.1-mini, Claude Haiku, or Gemini Flash handle well, but they still add latency. To minimize this, I’m considering using the Gemma 270M model in a sidecar to eliminate unnecessary network delays.

The Gemma 3 270M represents Google’s “right tool for the job” philosophy—a model designed specifically for fine-tuning rather than general-purpose use. According to Google’s release:

“Its true power is unlocked through fine-tuning. Once specialized, it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness.”

What makes this model particularly interesting from a technical perspective is its parameter allocation: approximately 170M parameters are dedicated to embeddings, with only 100M for the transformer layers. This unusual split reflects Google’s strategy to maintain a large vocabulary while keeping the model compact—a design choice that facilitates adaptation to different languages and domains through fine-tuning.

The model is available in GGUF format and can run efficiently on CPU, making it accessible for edge deployment scenarios where larger models would be prohibitive.

Making coderunner-ui work with Docker using Claude Code

Today, I was browsing Hacker News when I stumbled upon an interesting project: coderunner-ui. The premise was compelling – a local-first AI workspace that lets you chat with LLMs and execute generated code in isolated environments, all without sending your data to the cloud. As someone who’s always looking for tools that respect privacy while providing powerful capabilities, this caught my attention immediately.

I cloned the repository, excited to try it out. Then I hit a wall: “Requires macOS on Apple Silicon.”

I use an Intel Mac. The Apple container system that coderunner-ui depends on is only available on Apple Silicon Macs.I have spent considerable time last few weeks solving something similar so I decided to dig deeper.