How Real-Time AI Voice Agent Stacks Actually Work
A practical explanation of the modern real-time AI voice agent stack: audio transport, turn detection, STT, reasoning, tools, TTS, latency, and safety.
Real-time AI voice agents look simple when they work. You speak, the assistant answers, and the whole exchange feels like a call with a fast human. Under the surface, though, a voice agent is a stack of timing-sensitive systems: audio capture, transport, turn detection, speech recognition or direct audio reasoning, tool calls, response planning, speech generation, playback, logging, and safety controls.
That is why voice agents are a different category from normal text-to-speech. A narration tool can take a paragraph, generate audio, and export a file. A real-time agent has to listen while the user is still speaking, decide when to interrupt or wait, reason under latency pressure, call tools, speak naturally, and recover when the user changes direction halfway through a sentence.
The Short Version
- A real-time voice agent is usually built from audio transport, turn detection, speech understanding, reasoning, tools, TTS, playback, memory, and safety layers.
- There are two main architectures: a pipeline stack that chains STT, LLM, and TTS, and a speech-to-speech stack that handles audio more directly.
- WebRTC is common for browser and mobile audio; WebSockets are common for server media pipelines; SIP matters when the agent needs phone calls.
- The hardest product problems are latency, interruption handling, transcript accuracy, tool timing, and making the voice sound intentional instead of rushed.
- Murmur is not a real-time agent runtime. It fits the adjacent creator workflow: scripted voice assets, local drafts, private narration, and reusable voice experiments on Mac.
The Basic Real-Time Voice Agent Stack
| Layer | What it does | Common failure mode |
|---|---|---|
| Audio capture | Gets microphone or phone audio into the system | Bad device input, echo, noise, clipping |
| Transport | Streams audio between user, agent, and backend | Jitter, dropped packets, high round-trip latency |
| Turn detection | Decides when the user is done or interrupting | Talk-over, awkward pauses, missed interruptions |
| Speech understanding | Turns audio into text or model-understandable speech | Wrong names, numbers, accents, domain terms |
| Reasoning | Plans the response and decides what to do next | Slow answers, brittle prompts, weak context handling |
| Tools | Calls search, CRM, booking, billing, or app APIs | Wrong tool call, stale data, no recovery path |
| Speech generation | Turns the answer into audio | Robotic pacing, hallucinated words, bad pronunciation |
| Playback | Streams audio back to the user quickly | Delayed first audio, choppy output, poor interruption behavior |
| Monitoring | Logs transcripts, latency, errors, and outcomes | No way to debug why a call failed |
Current vendor documentation makes this stack visible. OpenAI's Realtime API overview separates voice agents, live translation, transcription, speech generation, connection methods, voice activity detection, tools, webhooks, and cost management. That is the right mental model: the voice is only one part of a real-time system.
Architecture 1: STT to LLM to TTS
The classic voice-agent pipeline is speech-to-text, then an LLM, then text-to-speech. It is easy to understand and flexible because each layer can be swapped. You can use one provider for transcription, another for reasoning, and another for speech generation. Frameworks like LiveKit Agents describe this kind of stack as streaming audio through an STT-LLM-TTS pipeline with turn detection, interruptions, orchestration, tools, and provider integrations.
The tradeoff is latency and error propagation. If STT hears the wrong product name, the LLM reasons over the wrong input. If the LLM takes too long, the TTS layer cannot start. If the TTS layer is slow, the user feels the delay. The pipeline is modular, but every boundary adds timing pressure.
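One common way to ease that timing pressure is to hand the LLM's output to the TTS layer sentence by sentence instead of waiting for the full answer. The sketch below illustrates the idea with stand-ins only: `fake_llm_tokens` and `fake_tts` are hypothetical placeholders, not any provider's API, and the sentence splitter is deliberately naive.

```python
import asyncio
import re
from typing import AsyncIterator, List


async def fake_llm_tokens(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields a canned reply word by word."""
    reply = "Your order shipped today. It should arrive on Friday."
    for word in reply.split(" "):
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield word + " "


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so TTS can start before the reply ends."""
    buffer = ""
    async for token in tokens:
        buffer += token
        # Naive rule: flush when the buffer ends with ., ! or ? (abbreviations would split early).
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


async def fake_tts(sentence: str) -> bytes:
    """Hypothetical stand-in for a TTS call: returns placeholder bytes, not real audio."""
    await asyncio.sleep(0)
    return sentence.encode("utf-8")


async def respond(transcript: str) -> List[bytes]:
    """One turn: transcript -> LLM tokens -> sentence chunks -> audio chunks."""
    chunks: List[bytes] = []
    async for sentence in sentences(fake_llm_tokens(transcript)):
        chunks.append(await fake_tts(sentence))
    return chunks
```

In a real system each audio chunk would be sent to playback as soon as it is produced rather than collected in a list; the list is only there to keep the sketch easy to inspect.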
Architecture 2: Speech-to-Speech
The newer architecture is speech-to-speech: one real-time model handles more of the listening, reasoning, and speaking loop directly. OpenAI describes gpt-realtime as a production voice model that processes and generates audio directly through a single model and API, reducing latency and preserving more speech nuance than a chain of separate models.
This does not make the stack disappear. It changes where the stack lives. You still need transport, session state, tool calls, prompt design, audio quality controls, evaluation, and fallbacks. The advantage is that the model can react to tone, interruption, and speech context with less handoff between systems.
Transport: WebRTC, WebSockets, and SIP
Transport choice shapes the whole product. OpenAI's Realtime docs recommend WebRTC for browser and mobile clients that capture and play audio directly, WebSockets when a server already receives raw audio from a media pipeline, and SIP when the use case is telephony. That maps cleanly to three product surfaces: web app, backend media service, and phone call.
The security model changes too. In browser WebRTC flows, OpenAI's WebRTC guide uses ephemeral client credentials minted by a developer-controlled server. The point is straightforward: the browser should not hold your standard API key, and the server should bind identity and session policy before the client connects.
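The general pattern behind short-lived client credentials can be sketched with a signed, expiring token. This is a generic HMAC illustration under stated assumptions, not OpenAI's actual mechanism: `SERVER_SECRET`, `mint_token`, `verify_token`, and the token layout are all hypothetical names chosen for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret; in practice it lives only on the server, never in the browser.
SERVER_SECRET = b"replace-with-a-real-secret"


def mint_token(user_id: str, ttl_seconds: int = 60, now: Optional[float] = None) -> str:
    """Create a short-lived token that binds a user identity to an expiry time."""
    issued = time.time() if now is None else now
    payload = json.dumps({"exp": issued + ttl_seconds, "sub": user_id}, sort_keys=True).encode()
    signature = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        return None
    return claims["sub"]
```

The design choice to notice is that the long-lived secret never leaves the server; the client only ever sees a credential that expires within about a minute and is tied to one identity.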
Phone calls add another layer. Twilio ConversationRelay is a good example of telephony packaging: Twilio handles speech recognition, speech synthesis, and audio streaming to the caller, while the developer implements the application logic over a WebSocket connection.
Turn Detection Is the Product
A voice agent that cannot handle turns will feel broken even if the model is smart. Humans interrupt. They pause mid-thought. They say "wait" after the assistant starts speaking. They trail off and then continue. Turn detection decides whether the agent should wait, answer, stop speaking, or ask a repair question.
This is why LiveKit puts turn detection and interruptions near the center of its voice-agent framing, and why real-time APIs expose voice activity detection, conversation state, and session events. The naturalness of a voice agent is not just voice quality. It is timing.
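The simplest form of end-of-turn detection is energy-based endpointing: treat a frame as speech when its loudness crosses a threshold, and declare the turn over after enough consecutive quiet frames. The sketch below is a minimal version of that idea; the threshold, frame size, and silence window are assumed values, and production systems typically use trained VAD or turn models instead.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EndpointConfig:
    frame_ms: int = 20               # duration of each audio frame
    energy_threshold: float = 0.02   # RMS at or above this counts as speech (assumed value)
    min_silence_ms: int = 500        # trailing silence required to end the turn


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of a frame of samples in the range [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(
    frames: Sequence[Sequence[float]],
    cfg: EndpointConfig = EndpointConfig(),
) -> Optional[int]:
    """Return the index of the frame where the turn is declared over, or None."""
    needed = math.ceil(cfg.min_silence_ms / cfg.frame_ms)
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= cfg.energy_threshold:
            heard_speech = True
            silent_run = 0            # a mid-thought pause is forgiven once speech resumes
        elif heard_speech:            # leading silence before any speech is ignored
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

The `min_silence_ms` setting is where the tradeoff described above shows up: a short window makes the agent talk over people who pause mid-thought, and a long one produces awkward gaps before every answer.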
Where TTS Still Matters
Even in speech-to-speech systems, the speech layer still matters. The voice has to start quickly, sound appropriate for the situation, preserve numbers and names, handle multiple languages, and avoid strange substitutions. Soniox's TTS model docs show why this matters for voice systems: they emphasize real-time generation, 60+ languages, hallucination-free output, alphanumeric accuracy, and streaming generation before a sentence ends.
Expressive control matters too. Inworld Realtime TTS-2 includes natural-language steering, stronger multilingual support, cross-lingual voice synthesis, voice localization, and delivery modes that trade consistency against emotional range. That is exactly the kind of control a voice agent needs when it has to sound calm during support, concise during verification, or expressive during a game dialogue scene.
Latency Is Not One Number
When people say a voice agent is fast, they usually mean several things at once. How quickly does it detect the user's turn? How soon does the model start reasoning? How soon does first audio play? Can audio stream before the full answer is ready? Can the user interrupt? Does a tool call freeze the conversation?
| Latency point | What users feel | How teams improve it |
|---|---|---|
| End-of-turn delay | The agent waits too long | Better VAD and turn models |
| Reasoning delay | The agent feels slow to think | Shorter prompts, lower reasoning effort, smaller task scope |
| Tool delay | The agent stalls while fetching data | Prefetch, streaming status, faster APIs |
| First-audio delay | Silence before the answer | Streaming TTS or direct speech-to-speech |
| Playback delay | Choppy or late audio | WebRTC tuning, region choice, buffering |
| Interruption delay | The agent talks over the user | Barge-in handling and cancellation |
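The last row of the table, barge-in handling and cancellation, can be sketched with `asyncio`: run playback as a task and cancel it as soon as an interruption signal arrives. The function names and the chunk-by-chunk playback model here are assumptions made for illustration; a real agent would also need to cancel in-flight model and TTS requests.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str], chunk_seconds: float = 0.05) -> None:
    """Pretend to play audio chunk by chunk, recording what was actually played."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # stands in for real playback time
        played.append(chunk)


async def speak_until_interrupted(
    chunks: List[str], interrupt: asyncio.Event, played: List[str]
) -> bool:
    """Play chunks, stopping as soon as `interrupt` is set. Returns True if interrupted."""
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(interrupt.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and playback not in done
    # Cancel whichever task is still running, then wait for both to settle.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return interrupted
```

Recording which chunks were actually played matters for more than debugging: after an interruption, the conversation history should reflect only what the user heard, not the full answer the agent intended to give.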
The Production Checklist
- Test with noisy microphones, mobile networks, and real accents.
- Include phone numbers, names, emails, addresses, SKUs, dates, and prices.
- Measure end-of-turn delay, first-audio latency, interruption recovery, and tool-call latency separately.
- Log transcript deltas, model events, tool calls, TTS output, and user interruptions.
- Add repair paths for unclear audio, failed tools, and policy-sensitive requests.
- Decide which parts need human review, especially for sales, healthcare, finance, legal, and support workflows.
- Run real conversations, not only demo prompts.
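The checklist item about measuring each delay separately can be made concrete by recording a few timestamps per turn and deriving the individual latencies from them. The sketch below assumes a hypothetical `TurnTimestamps` record; the field names are illustrative and not taken from any framework.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TurnTimestamps:
    """Monotonic timestamps in seconds for one turn (field names are illustrative)."""
    user_stopped_speaking: float
    turn_detected: float
    tool_call_started: float
    tool_call_finished: float
    first_audio_played: float


def latency_breakdown_ms(t: TurnTimestamps) -> Dict[str, float]:
    """Split one turn into separately reportable latencies, in milliseconds."""
    return {
        # How long the agent waited before deciding the user was done.
        "end_of_turn_ms": round((t.turn_detected - t.user_stopped_speaking) * 1000, 1),
        # Time spent waiting on the external tool.
        "tool_call_ms": round((t.tool_call_finished - t.tool_call_started) * 1000, 1),
        # What the user perceives: silence between their last word and the first audio.
        "first_audio_ms": round((t.first_audio_played - t.user_stopped_speaking) * 1000, 1),
    }
```

Reporting these as separate numbers, rather than one overall response time, shows whether a slow turn came from turn detection, a tool, or speech generation.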
Where Murmur Fits
Murmur is not trying to be a real-time voice-agent runtime. It does not replace WebRTC infrastructure, SIP, live interruption handling, or a production agent framework. Murmur sits on the adjacent creator side of the voice market: scripted audio, local narration, private drafts, voice experiments, and export-ready files on Apple Silicon Macs.
That distinction matters. If you are building a call-center agent, you need a real-time stack. If you are producing a product video, training lesson, podcast intro, game line, course narration, or client preview, you need a repeatable audio workflow. Murmur helps with the second job: turn scripts into local voice output, compare models, revise without a credit meter, and export audio for the rest of your project.
Where the two overlap is in the underlying technology. The same market pressure that makes voice agents better also makes creator TTS better: lower latency, stronger multilingual speech, better turn and sentence handling, more expressive control, and fewer hallucinations. You can follow the real-time agent stack to understand where AI voice is going, then choose a tool that fits your actual job.
The Takeaway
Real-time AI voice agents are not just TTS with a chatbot attached. They are event-driven audio systems where timing, turn-taking, reasoning, tools, and voice output all have to cooperate. The best teams think in stacks, not demos. They test the whole conversation loop before trusting the voice.