How Real-Time AI Voice Agent Stacks Actually Work

A practical explanation of the modern real-time AI voice agent stack: audio transport, turn detection, STT, reasoning, tools, TTS, latency, and safety.

Real-time AI voice agents look simple when they work. You speak, the assistant answers, and the whole exchange feels like a call with a fast human. Under the surface, though, a voice agent is a stack of timing-sensitive systems: audio capture, transport, turn detection, speech recognition or direct audio reasoning, tool calls, response planning, speech generation, playback, logging, and safety controls.

That is why voice agents are a different category from normal text-to-speech. A narration tool can take a paragraph, generate audio, and export a file. A real-time agent has to listen while the user is still speaking, decide when to interrupt or wait, reason under latency pressure, call tools, speak naturally, and recover when the user changes direction halfway through a sentence.

The Short Version

  • A real-time voice agent is usually built from audio transport, turn detection, speech understanding, reasoning, tools, TTS, playback, memory, and safety layers.
  • There are two main architectures: a pipeline stack that chains STT, LLM, and TTS, and a speech-to-speech stack that handles audio more directly.
  • WebRTC is common for browser and mobile audio; WebSockets are common for server media pipelines; SIP matters when the agent needs phone calls.
  • The hardest product problems are latency, interruption handling, transcript accuracy, tool timing, and making the voice sound intentional instead of rushed.
  • Murmur is not a real-time agent runtime. It fits the adjacent creator workflow: scripted voice assets, local drafts, private narration, and reusable voice experiments on Mac.

The Basic Real-Time Voice Agent Stack

| Layer | What it does | Common failure mode |
| --- | --- | --- |
| Audio capture | Gets microphone or phone audio into the system | Bad device input, echo, noise, clipping |
| Transport | Streams audio between user, agent, and backend | Jitter, dropped packets, high round-trip latency |
| Turn detection | Decides when the user is done or interrupting | Talk-over, awkward pauses, missed interruptions |
| Speech understanding | Turns audio into text or model-understandable speech | Wrong names, numbers, accents, domain terms |
| Reasoning | Plans the response and decides what to do next | Slow answers, brittle prompts, weak context handling |
| Tools | Calls search, CRM, booking, billing, or app APIs | Wrong tool call, stale data, no recovery path |
| Speech generation | Turns the answer into audio | Robotic pacing, hallucinated words, bad pronunciation |
| Playback | Streams audio back to the user quickly | Delayed first audio, choppy output, poor interruption behavior |
| Monitoring | Logs transcripts, latency, errors, and outcomes | No way to debug why a call failed |

Current vendor documentation makes this stack visible. OpenAI's Realtime API overview separates voice agents, live translation, transcription, speech generation, connection methods, voice activity detection, tools, webhooks, and cost management. That is the right mental model: the voice is only one part of a real-time system.

Architecture 1: STT to LLM to TTS

The classic voice-agent pipeline is speech-to-text, then an LLM, then text-to-speech. It is easy to understand and flexible because each layer can be swapped. You can use one provider for transcription, another for reasoning, and another for speech generation. Frameworks like LiveKit Agents describe this kind of stack as streaming audio through an STT-LLM-TTS pipeline with turn detection, interruptions, orchestration, tools, and provider integrations.

The tradeoff is latency and error propagation. If STT hears the wrong product name, the LLM reasons over the wrong input. If the LLM takes too long, the TTS layer cannot start. If the TTS layer is slow, the user feels the delay. The pipeline is modular, but every boundary adds timing pressure.
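
To make the boundary problem concrete, here is a minimal sketch of one conversational turn through a chained pipeline. The three stage functions are hypothetical stand-ins, not any provider's API; a real system would stream between stages instead of waiting for each one to finish.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TurnResult:
    transcript: str
    reply_text: str
    audio_out: bytes
    # (stage name, seconds spent) in the order the stages ran
    stage_seconds: List[Tuple[str, float]] = field(default_factory=list)


def fake_stt(audio: bytes) -> str:
    # Hypothetical stand-in: treat the audio bytes as UTF-8 text.
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    # Hypothetical stand-in for the reasoning step.
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    # Hypothetical stand-in: "audio" is just the encoded reply text.
    return text.encode("utf-8")


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> TurnResult:
    timings: List[Tuple[str, float]] = []

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        timings.append((name, time.perf_counter() - start))
        return out

    # Each stage consumes only the previous stage's output, so an STT
    # mistake flows straight into the LLM and then into the spoken reply.
    transcript = timed("stt", stt, audio)
    reply = timed("llm", llm, transcript)
    speech = timed("tts", tts, reply)
    return TurnResult(transcript, reply, speech, timings)
```

Swapping in a stage that mishears, for example `run_turn(b"Acme Pro", stt=lambda a: "Acne Pro")`, shows the error propagation described above: the reply is built on the wrong transcript and nothing downstream can tell.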

Architecture 2: Speech-to-Speech

The newer architecture is speech-to-speech: one realtime model handles more of the listening, reasoning, and speaking loop directly. OpenAI describes gpt-realtime as a production voice model that processes and generates audio directly through a single model and API, reducing latency and preserving more speech nuance than a chain of separate models.

This does not make the stack disappear. It changes where the stack lives. You still need transport, session state, tool calls, prompt design, audio quality controls, evaluation, and fallbacks. The advantage is that the model can react to tone, interruption, and speech context with less handoff between systems.

Transport: WebRTC, WebSockets, and SIP

Transport choice shapes the whole product. OpenAI's Realtime docs recommend WebRTC for browser and mobile clients that capture and play audio directly, WebSockets when a server already receives raw audio from a media pipeline, and SIP when the use case is telephony. That maps cleanly to three product surfaces: web app, backend media service, and phone call.
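
The mapping from product surface to transport is simple enough to write down as a lookup. This is only a sketch of the guidance summarized above; the enum names and string labels are illustrative assumptions, not identifiers from any SDK.

```python
from enum import Enum


class Surface(Enum):
    BROWSER_OR_MOBILE = "browser_or_mobile"          # client captures and plays audio
    SERVER_MEDIA_PIPELINE = "server_media_pipeline"  # backend already receives raw audio
    PHONE_CALL = "phone_call"                        # telephony


# Illustrative labels only; a real project would map these to client setup code.
TRANSPORT_FOR_SURFACE = {
    Surface.BROWSER_OR_MOBILE: "webrtc",
    Surface.SERVER_MEDIA_PIPELINE: "websocket",
    Surface.PHONE_CALL: "sip",
}


def choose_transport(surface: Surface) -> str:
    return TRANSPORT_FOR_SURFACE[surface]
```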

The security shape changes too. In browser WebRTC flows, OpenAI's WebRTC guide uses ephemeral client credentials minted by a developer-controlled server. The point is straightforward: the browser should not hold your standard API key, and the server should bind identity and session policy before the client connects.
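
The general pattern of a short-lived client credential can be sketched with nothing but the standard library. This is not OpenAI's mechanism or token format; the secret, the claim names, and both function names are hypothetical, and the example only illustrates why an expiring, server-signed credential is safer to hand to a browser than a long-lived key.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret; it stays on the server and never reaches the browser.
SERVER_SECRET = b"replace-with-a-real-secret"


def mint_ephemeral_token(user_id: str, ttl_seconds: int = 60,
                         now: Optional[float] = None) -> str:
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued + ttl_seconds}).encode()
    signature = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(signature).decode())


def verify_ephemeral_token(token: str, now: Optional[float] = None) -> Optional[dict]:
    current = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload was altered or signed with a different secret
    claims = json.loads(payload)
    if claims["exp"] < current:
        return None  # expired
    return claims
```

The server mints the token after it has authenticated the user, so identity and session policy are fixed before the client ever opens a media connection.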

Phone calls add another layer. Twilio ConversationRelay is a good example of telephony packaging: Twilio handles speech recognition, text-to-speech, and audio streaming to the caller, while the developer focuses on application logic over WebSockets.

Turn Detection Is the Product

A voice agent that cannot handle turns will feel broken even if the model is smart. Humans interrupt. They pause mid-thought. They say "wait" after the assistant starts speaking. They trail off and then continue. Turn detection decides whether the agent should wait, answer, stop speaking, or ask a repair question.
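
The simplest form of end-of-turn detection is an energy threshold with a silence timeout. The sketch below assumes audio has already been split into fixed-length frames with one energy value each; the threshold and frame counts are made-up defaults, and production systems typically use trained VAD or turn models rather than raw energy.

```python
from typing import Iterable, Optional


def detect_end_of_turn(
    frame_energies: Iterable[float],
    speech_threshold: float = 0.1,
    silence_frames_to_end: int = 25,
) -> Optional[int]:
    """Return the index of the frame where the turn is judged finished,
    or None if the user never spoke or has not been silent long enough."""
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= speech_threshold:
            heard_speech = True
            silent_run = 0          # a mid-thought pause is forgiven once speech resumes
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_to_end:
                return index
    return None
```

The `silence_frames_to_end` value is the tradeoff in miniature: set it low and the agent talks over people who pause mid-thought; set it high and every answer starts with an awkward gap.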

This is why LiveKit puts turn detection and interruptions near the center of its voice-agent framing, and why realtime APIs expose voice activity detection, conversation state, and session events. The naturalness of a voice agent is not just voice quality. It is timing.
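
Interruption handling usually comes down to cancelling playback the moment user speech is detected. Here is a minimal `asyncio` sketch of that idea; `speak`, `run_with_barge_in`, and `demo` are hypothetical names, `asyncio.sleep` stands in for playing an audio chunk, and the `asyncio.Event` stands in for whatever signal a real VAD would raise.

```python
import asyncio
from typing import List, Optional, Tuple


async def speak(chunks: List[str], played: List[str], chunk_seconds: float) -> None:
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # stand-in for playing one audio chunk
        played.append(chunk)


async def run_with_barge_in(
    chunks: List[str], user_speech: asyncio.Event, chunk_seconds: float = 0.05
) -> Tuple[List[str], bool]:
    played: List[str] = []
    playback = asyncio.create_task(speak(chunks, played, chunk_seconds))
    barge_in = asyncio.create_task(user_speech.wait())
    done, _pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback not in done
    # Cancel whichever task is still running; cancelling a finished task is a no-op.
    for task in (playback, barge_in):
        task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted


async def demo(interrupt_after: Optional[float]) -> Tuple[List[str], bool]:
    user_speech = asyncio.Event()

    async def user_talks() -> None:
        await asyncio.sleep(interrupt_after)
        user_speech.set()

    talker = asyncio.create_task(user_talks()) if interrupt_after is not None else None
    result = await run_with_barge_in(["a", "b", "c", "d", "e"], user_speech)
    if talker is not None:
        await talker
    return result
```

Calling `asyncio.run(demo(0.12))` stops playback partway through the five chunks, while `asyncio.run(demo(None))` lets the agent finish. Returning the list of chunks that actually played matters in practice: the conversation state should reflect what the user heard, not what the model intended to say.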

Where TTS Still Matters

Even in speech-to-speech systems, the speech layer still matters. The voice has to start quickly, sound appropriate for the situation, preserve numbers and names, handle multiple languages, and avoid strange substitutions. Soniox's TTS model docs show why this matters for voice systems: they emphasize real-time generation, 60+ languages, hallucination-free output, alphanumeric accuracy, and streaming generation before a sentence ends.
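
Streaming generation depends on handing text to the speech layer before the full answer exists. One common approach, sketched below with a hypothetical `chunk_for_tts` helper, is to buffer model tokens and flush at sentence boundaries. The boundary rule here (punctuation followed by a token that starts with whitespace) is a simplifying assumption that keeps decimals like "3.50" intact at the cost of one token of lookahead.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 120) -> Iterator[str]:
    """Group streamed text tokens into speakable chunks."""
    buffer = ""
    for token in tokens:
        # Confirm a sentence boundary only when the next token begins with
        # whitespace, so "3." followed by "50" is not split mid-number.
        if buffer.rstrip().endswith(SENTENCE_END) and token[:1].isspace():
            yield buffer.strip()
            buffer = ""
        buffer += token
        # Safety valve: never hold back more than max_chars of text.
        if len(buffer) >= max_chars:
            if buffer.strip():
                yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()
```

With the tokens `["Your", " total", " is", " $3.", "50", ".", " Anything", " else", "?"]`, the first chunk is ready for synthesis as soon as the token `" Anything"` arrives, well before the reply is complete.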

Expressive control matters too. Inworld Realtime TTS-2 includes natural-language steering, stronger multilingual support, cross-lingual voice synthesis, voice localization, and delivery modes that trade consistency against emotional range. That is exactly the kind of control a voice agent needs when it has to sound calm during support, concise during verification, or expressive during a game dialogue scene.

Latency Is Not One Number

When people say a voice agent is fast, they usually mean several things at once. How quickly does it detect the user's turn? How soon does the model start reasoning? How soon does first audio play? Can audio stream before the full answer is ready? Can the user interrupt? Does a tool call freeze the conversation?

| Latency point | What users feel | How teams improve it |
| --- | --- | --- |
| End-of-turn delay | The agent waits too long | Better VAD and turn models |
| Reasoning delay | The agent feels slow to think | Shorter prompts, lower reasoning effort, smaller task scope |
| Tool delay | The agent stalls while fetching data | Prefetch, streaming status, faster APIs |
| First-audio delay | Silence before the answer | Streaming TTS or direct speech-to-speech |
| Playback delay | Choppy or late audio | WebRTC tuning, region choice, buffering |
| Interruption delay | The agent talks over the user | Barge-in handling and cancellation |
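
Measuring these points separately requires timestamping each event in a turn rather than timing the turn as a whole. The sketch below is one possible breakdown; the field names and the exact boundaries between segments are assumptions, and a real system would add tool and playback timestamps as well.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TurnTimeline:
    # All values are seconds on the same monotonic clock.
    user_stopped_speaking: float
    end_of_turn_detected: float
    first_model_token: float
    first_audio_played: float

    def breakdown(self) -> Dict[str, float]:
        return {
            "end_of_turn_delay": self.end_of_turn_detected - self.user_stopped_speaking,
            "reasoning_delay": self.first_model_token - self.end_of_turn_detected,
            "first_audio_delay": self.first_audio_played - self.first_model_token,
            # What the user actually experiences as silence before the answer.
            "total_response_delay": self.first_audio_played - self.user_stopped_speaking,
        }
```

Two agents with the same total delay can need completely different fixes: one may be waiting too long to decide the user is finished, while the other is stuck in reasoning.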

The Production Checklist

  1. Test with noisy microphones, mobile networks, and real accents.
  2. Include phone numbers, names, emails, addresses, SKUs, dates, and prices.
  3. Measure end-of-turn delay, first-audio latency, interruption recovery, and tool-call latency separately.
  4. Log transcript deltas, model events, tool calls, TTS output, and user interruptions.
  5. Add repair paths for unclear audio, failed tools, and policy-sensitive requests.
  6. Decide which parts need human review, especially for sales, healthcare, finance, legal, and support workflows.
  7. Run real conversations, not only demo prompts.
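
A repair path for failed tools can be as small as a timeout plus a fallback line for the agent to say. The following is a minimal sketch under that assumption; `call_tool_with_repair`, the result dictionary shape, and the spoken fallback text are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def call_tool_with_repair(
    tool: Callable[..., Awaitable[Any]],
    args: Dict[str, Any],
    timeout_seconds: float = 2.0,
) -> Dict[str, Any]:
    """Run an async tool, but always return something the agent can act on."""
    try:
        result = await asyncio.wait_for(tool(**args), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return {
            "ok": False,
            "reason": "timeout",
            "say": "That is taking longer than expected. Want me to keep trying?",
        }
    except Exception as exc:
        return {
            "ok": False,
            "reason": "error",
            "detail": repr(exc),  # goes to logs, not to the caller's ear
            "say": "I could not look that up just now. Can I help another way?",
        }
    return {"ok": True, "result": result}
```

Keeping the failure reason and the spoken fallback in the same result makes both checklist items easier: the log records why the call failed, and the conversation never stalls in silence.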

Where Murmur Fits

Murmur is not trying to be a real-time voice-agent runtime. It does not replace WebRTC infrastructure, SIP, live interruption handling, or a production agent framework. Murmur sits on the adjacent creator side of the voice market: scripted audio, local narration, private drafts, voice experiments, and export-ready files on Apple Silicon Macs.

That distinction matters. If you are building a call-center agent, you need a realtime stack. If you are producing a product video, training lesson, podcast intro, game line, course narration, or client preview, you need a repeatable audio workflow. Murmur helps with the second job: turn scripts into local voice output, compare models, revise without a credit meter, and export audio for the rest of your project.

The overlap is learning. The same market pressure that makes voice agents better also makes creator TTS better: lower latency, stronger multilingual speech, better turn and sentence handling, more expressive control, and fewer hallucinations. You can follow the realtime agent stack to understand where AI voice is going, then choose a tool that fits your actual job.

The Takeaway

Real-time AI voice agents are not just TTS with a chatbot attached. They are event-driven audio systems where timing, turn-taking, reasoning, tools, and voice output all have to cooperate. The best teams think in stacks, not demos. They test the whole conversation loop before trusting the voice.
