The Hard Parts of Voice AI: Latency, Prompts, and the Pink Elephant Problem

5 min read · AI Voice · Prompt Engineering · GenAI · Latency

After months of building Path AI at Layerpath, I’ve learned that voice AI is a completely different beast from text-based chatbots. The problems aren’t what you’d expect. It’s not about making the AI smarter. It’s about making it faster and more predictable, and, surprisingly, about teaching it what to do instead of what to avoid.

Here’s what actually matters when you’re building a voice-first inbound sales agent.

The Latency Problem Nobody Talks About

When someone calls your product, they expect a response in under a second. Not two seconds. Not “let me think about that.” Under one second.

The naive approach is: User speaks → Transcribe → Send full conversation to LLM → Get response → Speak it back.

The problem? If you send the entire conversation history to the LLM every turn, you’re looking at 2+ second response times by turn 10. Users will say “Hello? Are you there?” and interrupt the bot just as it starts responding.

The State + Window Solution

Instead of relying on the LLM to read through 50 messages to remember that the user’s name is “Sarah,” we extract key facts into a structured state object and inject it into the system prompt. Then we aggressively truncate the actual message history.

Before (slow):

System Prompt
+ Message 1
+ Message 2
+ ...
+ Message 50
+ User's latest message
= 2000+ tokens

After (fast):

System Prompt
+ Extracted State (name, intent, stage)
+ Last 6 messages only
+ User's latest message
= ~400 tokens

The state object contains extracted entities: user name, role, intent, email, current conversation stage. This gets rebuilt into the system prompt on every turn. The actual message history? We keep only the last 6-8 turns.
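Here’s a minimal sketch of that assembly step, assuming the state lives in a plain dict and the model takes an OpenAI-style messages list. The field names, base prompt, and window size below are illustrative, not our production schema.

# Minimal sketch of per-turn prompt assembly: extracted state in, short window out.
WINDOW_SIZE = 6
BASE_PROMPT = "You are a friendly inbound sales agent. Keep answers short and natural for speech."

def render_state(state: dict) -> str:
    """Flatten the extracted facts into a short block for the system prompt."""
    facts = [f"- {key}: {value}" for key, value in state.items() if value]
    return "Known facts about this caller:\n" + "\n".join(facts)

def build_turn_messages(state: dict, history: list[dict], user_text: str) -> list[dict]:
    """System prompt + extracted state + last N messages + the new user turn."""
    system = BASE_PROMPT + "\n\n" + render_state(state)
    return (
        [{"role": "system", "content": system}]
        + history[-WINDOW_SIZE:]
        + [{"role": "user", "content": user_text}]
    )

# Roughly constant prompt size, no matter how long the call has been running.
state = {"name": "Sarah", "role": "Head of Sales", "intent": "pricing", "stage": "discovery"}
messages = build_turn_messages(state, history=[], user_text="What does the Pro plan cost?")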

The result: consistent sub-800ms response times even in long conversations.

What About “What Did You Say Earlier?”

Sometimes users ask about something that has already fallen out of the sliding window. The solution isn’t to keep more history. It’s to give the agent a tool that searches the full transcript, which is stored separately outside the context window.

If the user asks “what was that price you mentioned?”, the agent calls the search tool rather than hallucinating.
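A sketch of what that tool could look like: every turn gets appended to a store that is never truncated, and the tool does a simple keyword match (a stand-in for whatever retrieval you actually use). Everything here is illustrative.

# Sketch: the full transcript lives outside the context window and is
# searched on demand. Naive keyword scoring stands in for real retrieval.
full_transcript: list[dict] = []   # every turn appended here, never truncated

def log_turn(role: str, text: str) -> None:
    full_transcript.append({"role": role, "text": text})

def search_transcript(query: str, limit: int = 3) -> list[str]:
    """Tool exposed to the agent: return the turns most relevant to the query."""
    terms = query.lower().split()
    scored = [
        (sum(term in turn["text"].lower() for term in terms), turn)
        for turn in full_transcript
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f"{turn['role']}: {turn['text']}" for score, turn in scored[:limit] if score > 0]

# "What was that price you mentioned?" -> call the tool instead of guessing.
log_turn("assistant", "The Pro plan price is $99 per month, billed annually.")
print(search_transcript("what was that price"))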

The Pink Elephant Problem

This one took me weeks to figure out.

We kept adding guardrails to our prompts. “NEVER say you’re checking the database.” “NEVER mention internal systems.” “CRITICAL: Do not say ‘let me search.’”

The agent got worse. It started hallucinating answers instead of calling tools. It would skip the knowledge base entirely and make up pricing information.

Here’s why: LLMs struggle with negation, the same way “don’t think of a pink elephant” guarantees you’ll think of one. When you write “NEVER say ‘I am checking’”, the model attends strongly to the concept of “checking.” Combined with instructions to “be natural,” it skips the tool call entirely to avoid the forbidden phrase.

Positive Instructions Beat Negative Constraints

Instead of listing 10 things NOT to say, give one instruction on what TO do.

Bad: “NEVER mention internal systems or searching.”

Good: “Your knowledge retrieval is instantaneous. Speak as if you already possess the information.”

Bad: “Do not talk about tools to the user.”

Good: “Use internal tools silently. Present results directly without narrating your process.”

We cut our “forbidden phrases” section from 30 lines to 3 positive guidelines. Tool calling accuracy went up significantly.

Stage-Based Tool Loading

Another latency killer: we were loading all 20 tools into the context at conversation start. The LLM had to reason about booking tools when the user just said “Hi.”

The fix is a simple state machine. Different conversation stages get different tools.

%%{init: {"layout": "dagre"}}%%
stateDiagram-v2
  direction TB
  [*] --> Discovery
  Discovery --> Demo: User asks to see product
  Discovery --> Booking: User wants to schedule
  Demo --> Booking: User ready to meet
  Booking --> [*]: Meeting booked
  
  note right of Discovery
    Tools: search_knowledge_base
  end note
  
  note right of Demo
    Tools: navigate_demo, search_demo
  end note
  
  note right of Booking
    Tools: check_availability, book_meeting
  end note

Fewer tools mean faster inference and fewer hallucinated tool calls. The model can’t call book_meeting in the discovery stage because it literally doesn’t have access to it.
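The mechanism can be as simple as a dictionary lookup. A sketch, assuming OpenAI-style function-calling schemas; the descriptions and parameters are illustrative.

# Sketch: only the current stage's tools are sent with each request.
# Tool names mirror the state machine above; schemas are illustrative.
def tool_schema(name: str, description: str, params: dict) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": params},
        },
    }

TOOLS = {
    "search_knowledge_base": tool_schema("search_knowledge_base", "Answer product questions", {"query": {"type": "string"}}),
    "navigate_demo": tool_schema("navigate_demo", "Jump to a step in the interactive demo", {"step": {"type": "string"}}),
    "search_demo": tool_schema("search_demo", "Find a relevant demo moment", {"query": {"type": "string"}}),
    "check_availability": tool_schema("check_availability", "List open meeting slots", {"date": {"type": "string"}}),
    "book_meeting": tool_schema("book_meeting", "Book a meeting slot", {"slot_id": {"type": "string"}}),
}

STAGE_TOOLS = {
    "discovery": ["search_knowledge_base"],
    "demo": ["navigate_demo", "search_demo"],
    "booking": ["check_availability", "book_meeting"],
}

def tools_for_stage(stage: str) -> list[dict]:
    """The booking tools simply don't exist for the model during discovery."""
    return [TOOLS[name] for name in STAGE_TOOLS[stage]]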

Filler Phrases: Perceived Latency vs Actual Latency

When the agent calls a tool like check_availability, there’s a 2-4 second delay while the API runs. Dead silence. Users think the bot died.

The trick: inject filler phrases immediately when a tool call is detected, before the tool completes.

  • check_availability detected → “Let me pull up the calendar…”
  • search_knowledge_base detected → “Let me see…”
  • book_meeting detected → “Locking that in for you…”

Perceived latency drops to near zero even though actual latency stays the same. Users hear acknowledgment and wait patiently.
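A sketch of that injection point, assuming an async pipeline where the tool call is detected before its result comes back. speak() and run_tool() are placeholders for the real TTS and tool-execution layers.

import asyncio

# Sketch: say a canned acknowledgment the moment a tool call is detected,
# then run the (slow) tool. speak() and run_tool() are placeholders.
FILLERS = {
    "check_availability": "Let me pull up the calendar...",
    "search_knowledge_base": "Let me see...",
    "book_meeting": "Locking that in for you...",
}

async def speak(text: str) -> None:
    print(f"[tts] {text}")        # stand-in for the TTS pipeline

async def run_tool(name: str, args: dict) -> dict:
    await asyncio.sleep(3)        # stand-in for a 2-4 second API call
    return {"ok": True}

async def handle_tool_call(name: str, args: dict) -> dict:
    # Fire the filler immediately; don't wait for the tool to finish.
    asyncio.create_task(speak(FILLERS.get(name, "One moment...")))
    return await run_tool(name, args)

asyncio.run(handle_tool_call("check_availability", {"date": "tomorrow"}))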

Prompt Architecture That Scales

After rewriting our prompts multiple times, we landed on a modular architecture. Instead of one massive text block, the prompt is assembled from partials at runtime based on feature flags.

%%{init: {"layout": "elk"}}%%
flowchart TB
    Config[Client Config] --> Builder[Prompt Builder]

    subgraph Partials
        Identity[Identity Partial]
        Demo[Demo Instructions]
        Booking[Booking Instructions]
        Safety[Safety Protocols]
    end

    Builder --> Identity
    Builder -->|"if demo_enabled"| Demo
    Builder -->|"if booking_enabled"| Booking
    Builder --> Safety

    Identity --> Final[Final Prompt]
    Demo --> Final
    Booking --> Final
    Safety --> Final

Want to disable demos for a client? Flip a boolean. The demo instructions and demo tools both disappear from the agent’s context. No more editing prompts manually per client.
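A simplified version of that builder, assuming the partials are plain template strings and the client config is a flat dict of feature flags. Every name and string below is illustrative.

# Sketch: assemble the system prompt from partials based on per-client flags.
PARTIALS = {
    "identity": "You are {agent_name}, an inbound sales agent for {company}.",
    "demo": "You can walk callers through an interactive product demo...",
    "booking": "You can check calendar availability and book meetings...",
    "safety": "Stay on topic. If asked about anything unrelated to {company}, redirect politely.",
}

def build_prompt(config: dict) -> str:
    sections = [PARTIALS["identity"]]
    if config.get("demo_enabled"):
        sections.append(PARTIALS["demo"])
    if config.get("booking_enabled"):
        sections.append(PARTIALS["booking"])
    sections.append(PARTIALS["safety"])        # always included
    return "\n\n".join(sections).format(**config)

prompt = build_prompt({
    "agent_name": "Ava",
    "company": "Acme",
    "demo_enabled": False,     # flip the boolean: the demo partial (and its tools) disappear
    "booking_enabled": True,
})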

The Backchannel Problem

Voice has a problem text doesn’t: humans say “uh-huh,” “right,” and “okay” while listening. Standard voice activity detection treats these as interruptions. The bot stops mid-sentence, apologizes, and the conversation flow is destroyed.

The fix: filter short utterances that match common backchannels. If speech is under 0.8 seconds and transcribes to “yeah,” “okay,” “uh-huh,” or “right,” ignore it. Let the bot keep talking.
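The filter itself is tiny. A sketch using the 0.8-second threshold from above; the exact word list and where this hooks into your VAD and interruption logic will depend on your stack.

# Sketch: ignore short utterances that are just backchannels instead of
# treating them as interruptions.
BACKCHANNELS = {"yeah", "ok", "okay", "uh-huh", "mm-hmm", "right", "sure", "got it"}
MAX_BACKCHANNEL_SECONDS = 0.8

def is_backchannel(duration_s: float, transcript: str) -> bool:
    normalized = transcript.lower().strip(" .,!?")
    return duration_s < MAX_BACKCHANNEL_SECONDS and normalized in BACKCHANNELS

def should_interrupt(duration_s: float, transcript: str) -> bool:
    return not is_backchannel(duration_s, transcript)

assert not should_interrupt(0.4, "uh-huh")                        # bot keeps talking
assert should_interrupt(1.6, "wait, how much does that cost?")    # real interruption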

Testing Prompts Like Code

Prompts don’t crash. They drift. The agent slowly stops calling tools it used to call perfectly. It starts adding parameters that don’t exist. It forgets mandatory scripts.

We treat prompts like code now:

  1. Golden test cases: Input/output pairs for critical paths. Every prompt change runs against them.

  2. Tool call assertions: Did the agent call the expected tool? Did it pass valid parameters?

  3. Semantic evaluation: A smaller model grades the transcript. “Did the agent collect the email before booking? YES/NO.”

No more “let’s listen to 50 calls and hope we catch regressions.”
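A sketch of what a golden test with tool-call assertions might look like in pytest. run_agent_turn() is a hypothetical harness that runs one scripted turn against the real prompt and returns the model’s tool calls; the allowed-parameter table is illustrative.

import json
import pytest

from agent_test_harness import run_agent_turn   # hypothetical helper, not a real package

GOLDEN_CASES = [
    # (stage, user utterance, expected tool)
    ("discovery", "How much does the Pro plan cost?", "search_knowledge_base"),
    ("booking", "Can we meet on Thursday afternoon?", "check_availability"),
]

ALLOWED_PARAMS = {
    "search_knowledge_base": {"query"},
    "check_availability": {"date", "time_range"},
}

@pytest.mark.parametrize("stage,user_text,expected_tool", GOLDEN_CASES)
def test_expected_tool_is_called(stage, user_text, expected_tool):
    result = run_agent_turn(stage=stage, user_text=user_text)

    tool_names = [call.name for call in result.tool_calls]
    assert expected_tool in tool_names, f"expected {expected_tool}, got {tool_names}"

    # Catch "invented parameter" drift: arguments must parse as JSON and
    # only use fields the tool actually defines.
    for call in result.tool_calls:
        args = json.loads(call.arguments)
        assert set(args) <= ALLOWED_PARAMS[call.name]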

What I Wish I Knew Earlier

  1. Voice latency budget is ~800ms. Everything you build has to fit in that window.

  2. State extraction beats long context. Extract facts, truncate history.

  3. Positive instructions only. Every “NEVER do X” makes the model more likely to do X.

  4. Tools should match the conversation stage. Don’t give the model capabilities it doesn’t need yet.

  5. Perceived latency matters more than actual latency. Fill the silence.

  6. Prompts drift silently. Test them automatically.

Building voice AI taught me that the hard problems aren’t about AI capability. They’re about latency, predictability, and understanding how LLMs actually interpret instructions. The model is smart enough. The engineering around it is what makes or breaks the experience.


Building voice AI? I’d love to hear what challenges you’re facing. Reach out on LinkedIn.