What is a real-time AI voice agent?

A real-time AI voice agent is a system that listens to spoken input, reasons over the conversation, may call tools, and responds with spoken audio quickly enough to feel conversational.

What is the basic voice agent architecture?

The common architecture includes audio transport, turn detection, speech understanding, reasoning, tools, text-to-speech or direct speech generation, playback, monitoring, and safety controls.

Is speech-to-speech better than STT to LLM to TTS?

Speech-to-speech can reduce latency and preserve more audio nuance, but pipeline architectures are still useful because they are modular and easier to customize. The right choice depends on how much latency the product can tolerate and how much per-layer control it needs.

Does Murmur build real-time voice agents?

No. Murmur is a local Mac text-to-speech app for generated and exported audio, not a real-time agent framework. It is better suited to narration, voiceovers, drafts, and creator audio workflows.

What should I test before shipping a voice agent?

Test turn-taking, interruptions, noisy audio, accents, numbers, names, tool-call failures, first-audio latency, and human review paths with real conversations.

How Real-Time AI Voice Agent Stacks Actually Work

A practical explanation of the modern real-time AI voice agent stack: audio transport, turn detection, STT, reasoning, tools, TTS, latency, and safety.

June 12, 2026 · 7 min read

Real-time AI voice agents look simple when they work. You speak, the assistant answers, and the whole exchange feels like a call with a fast human. Under the surface, though, a voice agent is a stack of timing-sensitive systems: audio capture, transport, turn detection, speech recognition or direct audio reasoning, tool calls, response planning, speech generation, playback, logging, and safety controls.

That is why voice agents are a different category from normal text-to-speech. A narration tool can take a paragraph, generate audio, and export a file. A real-time agent has to listen while the user is still speaking, decide when to speak or wait, reason under latency pressure, call tools, speak naturally, and recover when the user changes direction halfway through a sentence.

The Short Version

A real-time voice agent is usually built from audio transport, turn detection, speech understanding, reasoning, tools, TTS, playback, memory, and safety layers.
There are two main architectures: a pipeline stack that chains STT, LLM, and TTS, and a speech-to-speech stack that handles audio more directly.
WebRTC is common for browser and mobile audio; WebSockets are common for server media pipelines; SIP matters when the agent needs phone calls.
The hardest product problems are latency, interruption handling, transcript accuracy, tool timing, and making the voice sound intentional instead of rushed.
Murmur is not a real-time agent runtime. It fits the adjacent creator workflow: scripted voice assets, local drafts, private narration, and reusable voice experiments on Mac.

The Basic Real-Time Voice Agent Stack

Layer	What it does	Common failure mode
Audio capture	Gets microphone or phone audio into the system	Bad device input, echo, noise, clipping
Transport	Streams audio between user, agent, and backend	Jitter, dropped packets, high round-trip latency
Turn detection	Decides when the user is done or interrupting	Talk-over, awkward pauses, missed interruptions
Speech understanding	Turns audio into text or model-understandable speech	Wrong names, numbers, accents, domain terms
Reasoning	Plans the response and decides what to do next	Slow answers, brittle prompts, weak context handling
Tools	Calls search, CRM, booking, billing, or app APIs	Wrong tool call, stale data, no recovery path
Speech generation	Turns the answer into audio	Robotic pacing, hallucinated words, bad pronunciation
Playback	Streams audio back to the user quickly	Delayed first audio, choppy output, poor interruption behavior
Monitoring	Logs transcripts, latency, errors, and outcomes	No way to debug why a call failed

Current vendor documentation makes this stack visible. OpenAI's Realtime API overview separates voice agents, live translation, transcription, speech generation, connection methods, voice activity detection, tools, webhooks, and cost management. That is the right mental model: the voice is only one part of a real-time system.

Architecture 1: STT to LLM to TTS

The classic voice-agent pipeline is speech-to-text, then an LLM, then text-to-speech. It is easy to understand and flexible because each layer can be swapped. You can use one provider for transcription, another for reasoning, and another for speech generation. Frameworks like LiveKit Agents describe this kind of stack as streaming audio through an STT-LLM-TTS pipeline with turn detection, interruptions, orchestration, tools, and provider integrations.

The tradeoff is latency and error propagation. If STT hears the wrong product name, the LLM reasons over the wrong input. If the LLM takes too long, the TTS layer cannot start. If the TTS layer is slow, the user feels the delay. The pipeline is modular, but every boundary adds timing pressure.
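The chaining described above can be sketched in a few lines. This is a minimal illustration, not any framework's actual API: `run_turn`, `TurnTrace`, and the three `fake_*` stages are hypothetical stand-ins for real STT, LLM, and TTS providers. It shows two things from the paragraph above: each stage consumes the previous stage's output, so an STT mistake flows straight into the LLM, and the per-stage latencies add up because nothing downstream can start early.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnTrace:
    """Per-stage output and wall-clock latency for one user turn."""
    outputs: Dict[str, object] = field(default_factory=dict)
    latency_ms: Dict[str, float] = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        # In a strictly sequential pipeline the stage latencies simply add.
        return sum(self.latency_ms.values())


def run_turn(audio: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes]) -> TurnTrace:
    """Run STT, then LLM, then TTS in sequence, timing each stage."""
    trace = TurnTrace()
    value: object = audio
    for name, stage in (("stt", stt), ("llm", llm), ("tts", tts)):
        start = time.perf_counter()
        value = stage(value)  # each stage sees only the previous stage's output
        trace.latency_ms[name] = (time.perf_counter() - start) * 1000
        trace.outputs[name] = value
    return trace


# Hypothetical stand-ins for real providers (assumptions for illustration).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


def fake_llm(text: str) -> str:
    return f"You asked about: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are synthesized audio
```

Swapping `fake_stt` for a stage that mishears the input is enough to see error propagation: the LLM and TTS stages faithfully process the wrong text. Real stacks reduce the additive latency by streaming partial results between stages rather than waiting for each one to finish.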

Architecture 2: Speech-to-Speech

The newer architecture is speech-to-speech: one realtime model handles more of the listening, reasoning, and speaking loop directly. OpenAI describes gpt-realtime as a production voice model that processes and generates audio directly through a single model and API, reducing latency and preserving more speech nuance than a chain of separate models.

This does not make the stack disappear. It changes where the stack lives. You still need transport, session state, tool calls, prompt design, audio quality controls, evaluation, and fallbacks. The advantage is that the model can react to tone, interruption, and speech context with less handoff between systems.

Transport: WebRTC, WebSockets, and SIP

Transport choice shapes the whole product. OpenAI's Realtime docs recommend WebRTC for browser and mobile clients that capture and play audio directly, WebSockets when a server already receives raw audio from a media pipeline, and SIP when the use case is telephony. That maps cleanly to three product surfaces: web app, backend media service, and phone call.
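The mapping from product surface to transport is simple enough to write down as a lookup. The sketch below only encodes the recommendation summarized above; the `Surface` and `Transport` enums and the `choose_transport` function are names invented for this example, not part of any SDK.

```python
from enum import Enum


class Surface(Enum):
    BROWSER = "browser"            # client captures and plays audio directly
    MOBILE = "mobile"              # same, on a phone or tablet app
    SERVER_MEDIA = "server_media"  # backend already receives raw audio
    PHONE = "phone"                # telephony


class Transport(Enum):
    WEBRTC = "webrtc"
    WEBSOCKET = "websocket"
    SIP = "sip"


# Encodes the guidance from the paragraph above, nothing more.
_TRANSPORT_BY_SURFACE = {
    Surface.BROWSER: Transport.WEBRTC,
    Surface.MOBILE: Transport.WEBRTC,
    Surface.SERVER_MEDIA: Transport.WEBSOCKET,
    Surface.PHONE: Transport.SIP,
}


def choose_transport(surface: Surface) -> Transport:
    """Return the default transport for a given product surface."""
    return _TRANSPORT_BY_SURFACE[surface]
```

Treat this as a starting default. A real product may mix transports, for example WebRTC to the browser and WebSockets between internal services.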

The security shape changes too. In browser WebRTC flows, OpenAI's WebRTC guide uses ephemeral client credentials minted by a developer-controlled server. The point is straightforward: the browser should not hold your standard API key, and the server should bind identity and session policy before the client connects.
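The general pattern behind ephemeral credentials can be illustrated with a short-lived signed token. This is a generic sketch of the idea, not OpenAI's actual mechanism or token format: `SERVER_SECRET`, `mint_token`, and `verify_token` are hypothetical, and a production system would use the provider's own client-secret endpoint. The point it shows is that the long-lived secret stays on the server while the client receives something that names a user and expires quickly.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical long-lived secret. It stays on the server and is never sent
# to the browser.
SERVER_SECRET = b"replace-with-a-real-secret"


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def mint_token(user_id: str, ttl_s: int = 60,
               now: Optional[float] = None) -> str:
    """Return a signed token binding a user id to a short expiry."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued + ttl_s},
                         sort_keys=True).encode("utf-8")
    signature = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong shape or invalid base64
        return None
    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload was altered or signed with another key
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        return None  # expired
    return claims["sub"]
```

The server would call `mint_token` only after authenticating the user and applying session policy, then hand the result to the browser. A leaked token is useful to an attacker for at most `ttl_s` seconds, which is the property the standard API key lacks.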

Phone calls add another layer. Twilio ConversationRelay is a good example of telephony packaging: Twilio handles speech recognition, text-to-speech, and audio streaming to the caller while the developer focuses on the application logic over WebSockets.

Turn Detection Is the Product

A voice agent that cannot handle turns will feel broken even if the model is smart. Humans interrupt. They pause mid-thought. They say "wait" after the assistant starts speaking. They trail off and then continue. Turn detection decides whether the agent should wait, answer, stop speaking, or ask a repair question.

This is why LiveKit puts turn detection and interruptions near the center of its voice-agent framing, and why realtime APIs expose voice activity detection, conversation state, and session events. The naturalness of a voice agent is not just voice quality. It is timing.
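To make the timing problem concrete, here is a minimal sketch of the simplest kind of end-of-turn detector: an energy threshold plus a silence "hangover" window. It is an assumption-laden toy, not how LiveKit or any realtime API implements turn detection; production systems typically use trained VAD or semantic turn models. `EndOfTurnDetector`, its parameters, and the default thresholds are all hypothetical.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class EndOfTurnDetector:
    """Declare the turn over after enough consecutive quiet frames."""

    def __init__(self, frame_ms: int = 20, speech_threshold: float = 0.02,
                 silence_ms: int = 600) -> None:
        self.speech_threshold = speech_threshold
        # Number of consecutive quiet frames that counts as "done talking".
        self.frames_needed = max(1, silence_ms // frame_ms)
        self.heard_speech = False
        self.silent_frames = 0

    def push(self, frame: Sequence[float]) -> bool:
        """Feed one frame; return True once the turn is judged complete."""
        if rms(frame) >= self.speech_threshold:
            self.heard_speech = True
            self.silent_frames = 0  # a mid-thought pause that ends resets the count
            return False
        if not self.heard_speech:
            return False  # silence before any speech is not a turn
        self.silent_frames += 1
        return self.silent_frames >= self.frames_needed
```

The `silence_ms` value is the product decision in miniature. Set it low and the agent talks over people who pause mid-thought; set it high and every answer starts with an awkward gap.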

Where TTS Still Matters

Even in speech-to-speech systems, the speech layer still matters. The voice has to start quickly, sound appropriate for the situation, preserve numbers and names, handle multiple languages, and avoid strange substitutions. Soniox's TTS model docs show why this matters for voice systems: they emphasize real-time generation, 60+ languages, hallucination-free output, alphanumeric accuracy, and streaming generation before a sentence ends.
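Streaming generation before a full answer is ready usually means cutting the LLM's token stream into speakable chunks. The sketch below is one hedged way to do that, not Soniox's or any vendor's method: `speakable_chunks`, `BOUNDARIES`, and `min_chars` are hypothetical names. It flushes at sentence or clause punctuation, but only once the next token starts with whitespace, so a price like "$4.50" is not cut in half.

```python
from typing import Iterable, Iterator

BOUNDARIES = ".!?;:"


def speakable_chunks(tokens: Iterable[str],
                     min_chars: int = 12) -> Iterator[str]:
    """Yield text chunks that are safe to send to a TTS engine early."""
    buffer = ""
    for token in tokens:
        # Punctuation only counts as a boundary when the following token
        # begins with whitespace, so "4." followed by "50" stays together.
        at_boundary = bool(buffer) and buffer[-1] in BOUNDARIES
        if (at_boundary and token[:1].isspace()
                and len(buffer.strip()) >= min_chars):
            yield buffer.strip()
            buffer = ""
        buffer += token
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```

The cost of the lookahead is one token of extra delay per chunk. `min_chars` keeps very short fragments from being synthesized on their own, where prosody tends to suffer.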

Expressive control matters too. Inworld Realtime TTS-2 includes natural-language steering, stronger multilingual support, cross-lingual voice synthesis, voice localization, and delivery modes that trade consistency against emotional range. That is exactly the kind of control a voice agent needs when it has to sound calm during support, concise during verification, or expressive during a game dialogue scene.

Latency Is Not One Number

When people say a voice agent is fast, they usually mean several things at once. How quickly does it detect the user's turn? How soon does the model start reasoning? How soon does first audio play? Can audio stream before the full answer is ready? Can the user interrupt? Does a tool call freeze the conversation?

Latency point	What users feel	How teams improve it
End-of-turn delay	The agent waits too long	Better VAD and turn models
Reasoning delay	The agent feels slow to think	Shorter prompts, lower reasoning effort, smaller task scope
Tool delay	The agent stalls while fetching data	Prefetch, streaming status, faster APIs
First-audio delay	Silence before the answer	Streaming TTS or direct speech-to-speech
Playback delay	Choppy or late audio	WebRTC tuning, region choice, buffering
Interruption delay	The agent talks over the user	Barge-in handling and cancellation
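The last row, barge-in handling and cancellation, is the one that most often needs explicit code. Below is a minimal `asyncio` sketch of the idea under stated assumptions: `speak` stands in for real audio output, `barge_in` is an event that some upstream detector would set when the user starts talking, and both function names are hypothetical. Playback races against the interrupt signal, and whichever loses is cancelled.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str],
                chunk_s: float) -> None:
    """Pretend to play audio chunks; each one takes chunk_s seconds."""
    for chunk in chunks:
        await asyncio.sleep(chunk_s)  # stands in for writing audio to output
        played.append(chunk)


async def speak_until_barge_in(chunks: List[str], barge_in: asyncio.Event,
                               played: List[str],
                               chunk_s: float = 0.05) -> bool:
    """Play chunks but stop once barge_in is set. Return True if interrupted."""
    playback = asyncio.create_task(speak(chunks, played, chunk_s))
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait({playback, interrupt},
                                 return_when=asyncio.FIRST_COMPLETED)
    interrupted = interrupt in done and playback not in done
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()  # stop the loser of the race
    # Wait for cancellation to finish so no task is left dangling.
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return interrupted


async def demo() -> tuple:
    """Simulate a user who starts talking shortly after playback begins."""
    played: List[str] = []
    barge_in = asyncio.Event()
    asyncio.get_running_loop().call_later(0.06, barge_in.set)
    chunks = [f"chunk-{i}" for i in range(10)]
    interrupted = await speak_until_barge_in(chunks, barge_in, played)
    return interrupted, played
```

Running `asyncio.run(demo())` should report an interruption with only the first chunk or so played. In a real agent the same cancellation also has to reach the model and the TTS stream, and the conversation state should record how much of the answer the user actually heard.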

The Production Checklist

Test with noisy microphones, mobile networks, and real accents.
Include phone numbers, names, emails, addresses, SKUs, dates, and prices.
Measure end-of-turn delay, first-audio latency, interruption recovery, and tool-call latency separately.
Log transcript deltas, model events, tool calls, TTS output, and user interruptions.
Add repair paths for unclear audio, failed tools, and policy-sensitive requests.
Decide which parts need human review, especially for sales, healthcare, finance, legal, and support workflows.
Run real conversations, not only demo prompts.
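Measuring the latency points separately is mostly a matter of logging timestamped events and subtracting. The sketch below assumes a hypothetical event vocabulary (`user_speech_end`, `turn_detected`, `llm_first_token`, `tool_start`, `tool_end`, `first_audio`); a real stack would map its own session events onto something similar.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    name: str    # hypothetical event name, e.g. "first_audio"
    t_ms: float  # milliseconds since the start of the call


def latency_breakdown(events: List[Event]) -> Dict[str, float]:
    """Compute separate latency spans for one turn from logged events."""
    first_seen: Dict[str, float] = {}
    for event in events:
        first_seen.setdefault(event.name, event.t_ms)  # keep first occurrence

    def span(start: str, end: str) -> Optional[float]:
        if start in first_seen and end in first_seen:
            return first_seen[end] - first_seen[start]
        return None  # that part of the turn did not happen or was not logged

    spans = {
        "end_of_turn_ms": span("user_speech_end", "turn_detected"),
        "reasoning_ms": span("turn_detected", "llm_first_token"),
        "tool_ms": span("tool_start", "tool_end"),
        "first_audio_ms": span("user_speech_end", "first_audio"),
    }
    return {name: value for name, value in spans.items() if value is not None}
```

Keeping the spans separate is what makes a slow call debuggable: a long `first_audio_ms` with a short `reasoning_ms` points at turn detection or TTS, not at the model.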

Where Murmur Fits

Murmur is not trying to be a real-time voice-agent runtime. It does not replace WebRTC infrastructure, SIP, live interruption handling, or a production agent framework. Murmur sits on the adjacent creator side of the voice market: scripted audio, local narration, private drafts, voice experiments, and export-ready files on Apple Silicon Macs.

That distinction matters. If you are building a call-center agent, you need a realtime stack. If you are producing a product video, training lesson, podcast intro, game line, course narration, or client preview, you need a repeatable audio workflow. Murmur helps with the second job: turn scripts into local voice output, compare models, revise without a credit meter, and export audio for the rest of your project.

The two categories overlap in what they teach each other. The same market pressure that makes voice agents better also makes creator TTS better: lower latency, stronger multilingual speech, better turn and sentence handling, more expressive control, and fewer hallucinations. You can follow the real-time agent stack to understand where AI voice is going, then choose a tool that fits your actual job.

The Takeaway

Real-time AI voice agents are not just TTS with a chatbot attached. They are event-driven audio systems where timing, turn-taking, reasoning, tools, and voice output all have to cooperate. The best teams think in stacks, not demos. They test the whole conversation loop before trusting the voice.

Need scripted AI voice, not a live agent?

Murmur gives Mac creators local text-to-speech, model choice, private revision, and export-ready audio for videos, courses, podcasts, and client work.

Buy Murmur · $49

macOS 15+ · Apple Silicon required · 7-day refund policy