Text to Speech Trends 2026: AI Voice Update
A detailed look at the latest text-to-speech updates, from MAI-Voice-2 and Supertonic 3 to realtime voice agents, Qwen3, Soniox, and local TTS workflows.
Text to speech changed shape quickly in the last month. The category is no longer just about making a sentence sound natural. The new center of gravity is expressive control, realtime conversation, local privacy, multilingual reach, and whether a model can follow the script without inventing, dropping, or rearranging words.
That shift matters for creators. A YouTube narrator, audiobook author, course maker, podcast producer, indie app builder, and voice-agent team all ask different questions now. Knowing which model sounds best still matters, but it is no longer enough. The better question is: which voice system stays reliable in the workflow you actually use?
The Short Version
- Microsoft launched MAI-Voice-2 with broader language coverage, emotion tags, voice prompting, and consent-gated cloning.
- Supertone released Supertonic 3, a lightweight TTS model focused on 31 languages, stable reading, and efficient generation.
- AWS added Qwen3 speech models to SageMaker JumpStart, making open speech models easier to deploy in enterprise workflows.
- Inworld and Soniox pushed realtime TTS forward with low-latency streaming and stronger control for voice agents.
- OpenAI moved the voice-agent stack toward realtime reasoning, translation, and streaming transcription rather than plain one-shot TTS.
- Community discussion keeps circling the same pain points: hallucinations, long-form drift, voice consistency, cost, and privacy.
1. Expressive TTS Is Becoming the Default
The biggest product signal is that expressive control is no longer a premium novelty. Microsoft MAI-Voice-2, announced June 2, expands from English-only to a broader multilingual model family and adds granular emotion control, zero-shot voice prompting from short reference audio, and stronger speaker consistency for long-form content.
Microsoft also frames consent as a product boundary: production voice synthesis is limited to authorized, licensed voices. That is not a small detail. Voice cloning is getting easier, so the guardrails are becoming part of the product story, not a legal footnote.
Google Gemini 3.1 Flash TTS, released earlier this spring and still central to current discussion, points in the same direction. Google emphasizes audio tags, multi-speaker dialogue, 70+ languages, and SynthID watermarking. The pattern is obvious: modern TTS is moving from voice selection to voice direction.
2. Local and On-Device TTS Got More Serious
The local TTS story is not just hobbyist energy anymore. Supertonic 3 is a good example. Supertone positions it around fast generation, stable reading quality, and lower cost, with support for 31 languages and availability in both Play and the API.
The Supertonic 3 model card adds the technical angle: ONNX Runtime, local inference, fewer repeat and skip failures, expression tags, and a compact model footprint around 99M parameters. That is exactly the kind of release that helps developers build voice into local apps, browser tools, mobile experiments, and edge workflows.
This is why local TTS is becoming more than a privacy slogan. It offers a different creative loop. You can generate, listen, revise, and generate again without watching a cloud credit meter. For drafts, private documents, client scripts, and long-form experiments, that matters.
3. Enterprise Platforms Are Packaging Open Speech Models
On May 14, AWS added Qwen3-TTS and Qwen3-ASR models to SageMaker JumpStart. The lineup includes Qwen3-TTS-12Hz-1.7B-CustomVoice, Qwen3-TTS-12Hz-1.7B-Base, and Qwen3-ASR-1.7B.
That move is important. Qwen3-TTS already had traction with local and open-source users, and SageMaker packaging now makes it easier for teams to deploy the models inside familiar enterprise infrastructure. The TTS models support multilingual speech, custom voice styles, instruction-driven control over timbre, emotion, and prosody, and rapid voice cloning from reference audio.
For creators, this signals a wider trend: the useful models are no longer trapped in research repos. They are being wrapped into APIs, cloud catalogs, local runtimes, and desktop products. The winners will not only have the best model. They will have the least annoying path from script to final audio.
4. Realtime Voice Is Pulling TTS Into Agent Stacks
Realtime voice is changing what people expect from TTS. A voice agent cannot wait for a whole paragraph to render. It needs low latency, turn-taking, interruption handling, emotional continuity, and speech that responds to the conversation context.
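If latency is the requirement, the number to watch is time to first audio chunk, not total render time. Here is a minimal sketch of how that could be measured. The `fake_tts_stream` generator is a hypothetical stand-in that yields placeholder bytes, not any vendor's client; in practice you would pass in whatever chunk iterator your streaming TTS library returns.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client.

    Yields one placeholder "audio" chunk per word after a short delay.
    """
    for word in text.split():
        time.sleep(chunk_delay_s)
        yield word.encode("utf-8")


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time_to_first_chunk_s, total_time_s, chunk_count)."""
    start = time.perf_counter()
    first_chunk_at = None
    count = 0
    for _ in chunks:
        if first_chunk_at is None:
            # The moment the first chunk arrives is what a listener perceives as latency.
            first_chunk_at = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # If the stream produced nothing, report the total wait as the first-chunk time.
    return (first_chunk_at if first_chunk_at is not None else total, total, count)


if __name__ == "__main__":
    first, total, count = measure_stream(fake_tts_stream("hello from a voice agent"))
    print(f"first chunk: {first:.3f}s, total: {total:.3f}s, chunks: {count}")
```

Run the same measurement on short replies and long ones. A model that feels instant on one sentence can still stall on a paragraph.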
Inworld Realtime TTS-2, launched May 5, shows where this is going. It supports natural language steering with bracketed directions, production-quality synthesis across 15 languages, experimental support for 90+ additional languages, cross-lingual voice synthesis, voice localization, and delivery modes that trade off consistency against emotional range.
Soniox TTS also moved quickly, with tts-rt-v1 now generally available after an April preview. Its docs emphasize realtime streaming, 60+ languages, hallucination-free output, accurate alphanumeric pronunciation, and WebSocket plus REST support. That is a very developer-shaped value proposition: predictable speech in systems where errors are expensive.
OpenAI's May 7 voice API update is adjacent but important. GPT-Realtime-2, live translation, and streaming speech-to-text point toward speech-to-speech systems where the old STT-to-LLM-to-TTS pipeline starts to blur. For voice products, TTS is becoming one component in a larger realtime interface.
5. Research Is Targeting Stability, Not Just Naturalness
The most interesting research releases are not only chasing prettier voices. They are trying to reduce drift, improve cloning, and make open systems more reproducible.
dots.tts, posted June 5, presents a 2B-parameter continuous autoregressive TTS foundation model. The paper claims strong open-source performance, voice cloning ability, emotional expressiveness, and low first-packet latencies after distillation.
PilotTTS, posted May 26, is interesting for a different reason. It argues that a disciplined architecture and data pipeline can compete without millions of hours of proprietary data. It supports zero-shot voice cloning, emotion synthesis, paralinguistic synthesis, and Chinese dialect synthesis.
Regional and language-specific work is also getting sharper. JaiTTS targets Thai voice cloning and Thai-English code-switching. Praxy Voice focuses on Indic TTS gaps for Telugu, Tamil, Hindi, and code-mixed speech. This matters because global TTS quality is uneven. English demos are not enough.
6. The Community Still Cares About the Boring Problems
The community signal is less glamorous than the launch posts, but probably more useful. In long-form testing threads on r/TextToSpeech, users are not asking for one universal winner. They are comparing models by workflow: fast draft narration, personality, voice cloning, controllable voices, emotional quality, multilingual support, and long-form reliability.
Another voice AI discussion names the hard problems directly: hallucinated words, dropped words, repeated phrases, unnatural turn-taking, weak evaluation, and the need to validate TTS output with speech-to-text loops. That is not a niche complaint. It is the gap between a demo and a production workflow.
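The speech-to-text loop mentioned above can be roughly sketched as a word error rate check: generate the audio, transcribe it with any STT system, then compare the transcript against the original script. The code below covers only the comparison step, using a standard word-level edit distance. The function names and the normalization rule are assumptions made for illustration, and the transcript is expected to come from whichever STT tool you already use.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase the text and keep letters, digits, and apostrophes as word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(script: str, transcript: str) -> float:
    """Word-level edit distance between script and transcript, divided by script length.

    Substitutions, insertions (added or repeated words), and deletions
    (dropped words) each count as one error.
    """
    ref, hyp = normalize(script), normalize(transcript)
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped from the audio
                            curr[j - 1] + 1,      # extra word in the audio
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    script = "The quick brown fox"
    heard = "the quick fox"  # hypothetical STT transcript with one dropped word
    print(word_error_rate(script, heard))  # 0.25
```

A nonzero score does not prove the TTS model is at fault, since the STT step makes its own mistakes. It is still a cheap way to flag which sections deserve a human listen.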
The reaction to MisoTTS 8B shows the same thing. The model is ambitious, with an 8B Sesame-style architecture for conversational speech and voice continuation. But the Reddit thread includes reports of hallucinations, pronunciation issues, missing punctuation pauses, and rough product polish. Bigger is not automatically more usable.
What This Means for Creators
| Trend | What changed | Creator takeaway |
|---|---|---|
| Expressive TTS | Tags, emotion control, voice prompting, multi-speaker direction | Write scripts like performance notes, not only plain text. |
| Local TTS | Smaller models and ONNX runtimes are improving | Private drafts and repeated revisions are easier to justify locally. |
| Enterprise packaging | Qwen3 speech models are easier to deploy through AWS | Open models are moving into production infrastructure. |
| Realtime voice | TTS is part of agent stacks with STT, turn-taking, and memory | Voice apps need conversation design, not just nice audio. |
| Stability focus | Research and docs emphasize fewer hallucinations and better transcript following | Test models with real scripts, numbers, names, and long sections. |
For a creator, the practical lesson is simple: choose TTS by job, not by leaderboard. A fast local model can be better for draft narration than a more expressive cloud model. A realtime model can be better for a voice agent than a beautiful long-form narrator. A commercial cloud studio can be better for teams, while a local app can be better for private scripts and unlimited revision.
Where Murmur Fits
Murmur sits on the local creator side of this shift. It is a Mac app for Apple Silicon that turns text into exportable audio locally after setup. The point is not to replace every cloud voice API. The point is to give Mac creators a private, predictable workflow for narration, drafts, voice experiments, long-form content, and repeated revision.
That positioning is more relevant as TTS gets better. When local models were clearly worse, cloud was the obvious answer. Now the tradeoff is more nuanced. Cloud tools still win for broad collaboration, hosted APIs, some top-tier voices, and the widest managed language coverage. Local tools win when privacy, offline generation, cost control, and ownership matter more.
Murmur costs $49 one-time. There is no free trial and no monthly subscription. For people who generate audio every week, that changes the creative loop. You can regenerate a paragraph because the pacing is slightly off. You can test a different voice. You can make a draft for a client without worrying about burning credits.
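The arithmetic behind that comparison is simple enough to sketch. The $49 figure comes from the paragraph above. The monthly subscription price is a made-up placeholder, not a quote from any real product, so substitute the plan you are actually considering.

```python
import math


def break_even_months(one_time_price: float, monthly_price: float) -> int:
    """First month in which cumulative subscription spend reaches the one-time price."""
    if monthly_price <= 0:
        raise ValueError("monthly_price must be positive")
    return math.ceil(one_time_price / monthly_price)


if __name__ == "__main__":
    ONE_TIME = 49               # one-time price stated in the article
    HYPOTHETICAL_MONTHLY = 15   # assumption for illustration only
    months = break_even_months(ONE_TIME, HYPOTHETICAL_MONTHLY)
    print(f"Subscription spend reaches ${ONE_TIME} in month {months}")  # month 4
```

This ignores usage caps, credit overages, and any difference in voice quality, so treat it as a starting point for the cost question and not the whole answer.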
How to Evaluate a TTS Tool in 2026
- Test your real script, not the vendor demo.
- Include numbers, acronyms, names, URLs, and mixed-language phrases.
- Generate at least one long section to catch drift and voice changes.
- Check how easy it is to fix one bad sentence without redoing the whole project.
- Compare total workflow cost, not only the first-month price.
- Decide whether your text and voice samples are safe to upload before using cloud cloning.
- Listen for fatigue over several minutes, not only naturalness over ten seconds.
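Part of that checklist can be automated before you generate any audio. The sketch below checks whether a test script actually contains the hard cases listed above. Every pattern is a rough heuristic chosen for illustration: the non-ASCII check is only a proxy for mixed-language text, the 300-word threshold for a "long section" is an arbitrary assumption, and proper names are left to a human read because a regex cannot reliably spot them.

```python
import re
from typing import Dict, List

# Rough, illustrative patterns. Tune them for your own scripts.
CHECKS = {
    "numbers": re.compile(r"\d"),
    "acronyms": re.compile(r"\b[A-Z]{2,}\b"),
    "urls": re.compile(r"https?://\S+|\bwww\.\S+"),
    "non_ascii_text": re.compile(r"[^\x00-\x7F]"),  # proxy for mixed-language phrases
}


def script_coverage(script: str, min_words_long: int = 300) -> Dict[str, bool]:
    """Report which stress cases the test script contains."""
    report = {name: bool(pattern.search(script)) for name, pattern in CHECKS.items()}
    report["long_section"] = len(script.split()) >= min_words_long
    return report


def missing_cases(script: str, min_words_long: int = 300) -> List[str]:
    """List the stress cases the script does not cover yet."""
    report = script_coverage(script, min_words_long)
    return [name for name, covered in report.items() if not covered]


if __name__ == "__main__":
    sample = "Visit https://example.com by 9 AM, says the FAQ. Olá!"
    print(missing_cases(sample))  # ['long_section']
```

The listening steps stay manual. No script check can tell you whether a voice gets tiring after five minutes.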
The Takeaway
The latest text-to-speech updates point in one direction: voice generation is becoming more controllable, more realtime, more multilingual, and more practical to run locally. But the best products will be the ones that solve the whole workflow. A model that sounds amazing for one line can still fail if it hallucinates words, drifts over long-form audio, costs too much to revise, or forces private scripts into the cloud.
For creators, that is good news. You have more choices than ever. The smart move is not to chase every new release. It is to map the release to the job: local drafts, polished narration, voice agents, multilingual dubbing, audiobooks, course videos, client work, or accessibility audio. Text to speech is not one market anymore. It is becoming a stack of very different workflows.
Try local text-to-speech on your Mac.
Murmur helps Mac creators generate private narration, test voices, revise scripts, and export audio locally for $49 once.
macOS 14+ · Apple Silicon required · 7-day refund policy