
The Best Free and Local TTS Models in 2026

Kokoro, Chatterbox, Fish Audio, Qwen3: a hands-on comparison of the open-source TTS models you can run on your Mac today.


Local TTS Has Caught Up

Two years ago, running a text-to-speech model on your laptop meant settling for robotic output that no one wanted to listen to. That era is over. In 2026, several open-source TTS models produce speech that sounds natural, expressive, and genuinely pleasant. The best part: they run entirely on your hardware, no cloud account required.

This guide covers four models that stand out from the field. Each one is available for local inference on Apple Silicon Macs, and each excels at something different. We tested all four through Murmur, which bundles them into a single app, but the models themselves are open-source and can be run independently.

Kokoro: The Lightweight Workhorse

Kokoro is an 82-million-parameter model, making it one of the smallest capable TTS engines available. That compact size translates to fast generation and low memory usage. On an M1 MacBook Air with 8GB of RAM, Kokoro generates a 1,500-word passage in roughly 30 to 45 seconds. On M2 Pro or M3 hardware, it is noticeably faster.

The voice quality is clean and consistent. Kokoro handles long-form narration well because it maintains steady pacing and tone across extended passages. It supports 9 languages and ships with over 860 community-contributed voices spanning different accents, genders, and age ranges. For creators who need reliable, everyday TTS, Kokoro is the default choice for good reason.

Where it falls short: Kokoro does not match the emotional range of larger models. Sarcasm, dramatic pauses, and subtle tonal shifts are not its strength. If your content is primarily informational (tutorials, blog posts, documentation), that barely matters. For fiction or dramatic narration, you will want one of the models below.

Qwen3 TTS: Multilingual and Versatile

Qwen3 TTS comes from Alibaba's Qwen team and brings strong multilingual capability to the local TTS landscape. It handles code-switching between languages more naturally than any other local model we tested. A passage that mixes English and Japanese, for example, flows without the jarring accent shifts you hear from single-language models forced into multilingual duty.

Qwen3 also supports voice design, meaning you can describe the voice characteristics you want (age, gender, speaking style) rather than picking from a preset library. The model is larger than Kokoro, so generation takes longer and memory usage is higher. Expect roughly 60 to 90 seconds for a 1,500-word passage on M-series hardware. The quality jump in expressiveness and multilingual handling often justifies that tradeoff.

Fish Audio S2 Pro: Studio-Grade Output

Fish Audio S2 Pro is the model you reach for when quality matters more than speed. It produces the most natural-sounding speech of any local model we tested. The prosody is remarkably human: pauses land where a real speaker would pause, emphasis feels earned rather than algorithmic, and the overall rhythm avoids the mechanical cadence that plagues lesser models.

The tradeoff is generation time. Fish Audio S2 Pro is significantly slower than Kokoro, often taking 2 to 3 minutes for a 1,500-word passage. It also requires more memory. If you are producing a polished audiobook or a client-facing video, the wait is worth it. For quick drafts or iterative work, Kokoro or Qwen3 is the better choice.

Chatterbox: Emotional Range

Chatterbox Turbo specializes in expressive, emotionally varied speech. Where Kokoro maintains a steady, professional tone, Chatterbox shifts its delivery based on content. Dialogue sounds like dialogue. Questions sound like questions. Exclamations carry genuine energy. This makes it excellent for fiction, storytelling, and any content where emotional texture matters.

Chatterbox also handles voice cloning well. With a 10-second sample, it captures speaker characteristics and applies them to new text with reasonable fidelity. The model sits between Kokoro and Fish Audio in terms of generation speed, typically finishing a 1,500-word passage in about 60 seconds.

Model Comparison at a Glance

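| Feature             | Kokoro             | Qwen3 TTS          | Fish Audio S2 Pro    | Chatterbox Turbo   |
|---------------------|--------------------|--------------------|----------------------|--------------------|
| Model size          | 82M params         | ~500M params       | ~600M params         | ~300M params       |
| Speed (1,500 words) | 30-45 sec          | 60-90 sec          | 2-3 min              | ~60 sec            |
| Languages           | 9                  | 15+                | 8                    | 6                  |
| Voice cloning       | Yes (10s sample)   | Yes (voice design) | Yes (10s sample)     | Yes (10s sample)   |
| Best for            | Everyday narration | Multilingual content | Studio-quality audio | Expressive/fiction |
| Emotional range     | Moderate           | Good               | Excellent            | Excellent          |
| Memory usage        | Low (~2GB)         | Medium (~4GB)      | High (~6GB)          | Medium (~3GB)      |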
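
To get a feel for what these speeds mean on a longer project, the 1,500-word timings in the table can be scaled to other lengths. The Python sketch below is a rough estimate only: it assumes generation time grows linearly with word count, which we did not measure, and the function names are hypothetical helpers rather than part of any model's API.

```python
# Seconds to generate 1,500 words, taken from the "Speed" row of the table.
# Each entry is a (low, high) range; "~60 sec" is treated as a single point.
REFERENCE_WORDS = 1_500
SECONDS_PER_REFERENCE = {
    "Kokoro": (30, 45),
    "Qwen3 TTS": (60, 90),
    "Fish Audio S2 Pro": (120, 180),
    "Chatterbox Turbo": (60, 60),
}


def estimate_seconds(model: str, word_count: int) -> tuple[float, float]:
    """Scale the quoted 1,500-word timing to word_count.

    Assumption: time scales linearly with length. Real runs may differ.
    """
    low, high = SECONDS_PER_REFERENCE[model]
    scale = word_count / REFERENCE_WORDS
    return (low * scale, high * scale)


def format_minutes(seconds: float) -> str:
    """Render a duration in seconds as minutes with one decimal place."""
    return f"{seconds / 60:.1f} min"


if __name__ == "__main__":
    words = 9_000  # hypothetical script length, six times the reference passage
    for model in SECONDS_PER_REFERENCE:
        low, high = estimate_seconds(model, words)
        print(f"{model}: {format_minutes(low)} to {format_minutes(high)}")
```

Under that linear assumption, the gap between models widens quickly as the script gets longer, which is why the slower models are easier to justify for final renders than for drafts.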

Which Model Should You Use?

There is no single best model. The right choice depends on what you are creating. For blog posts, tutorials, and documentation, Kokoro's speed and consistency make it the practical default. For multilingual content or voice-designed characters, pick Qwen3. For polished final audio that needs to sound broadcast-ready, pick Fish Audio S2 Pro. For fiction and emotionally rich narration, pick Chatterbox.
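
The same decision can be written down as a small lookup. The Python sketch below is illustrative only: the content-type labels, the `pick_model` helper, and the fallback rule (use the fastest model that fits in memory) are our own assumptions, while the memory and speed figures come from the comparison table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    approx_memory_gb: float        # from the "Memory usage" row
    max_seconds_per_1500_words: int  # upper end of the "Speed" row


MODELS = [
    ModelProfile("Kokoro", 2, 45),
    ModelProfile("Qwen3 TTS", 4, 90),
    ModelProfile("Fish Audio S2 Pro", 6, 180),
    ModelProfile("Chatterbox Turbo", 3, 60),
]

# Hypothetical content-type labels mapped to the recommendations above.
RECOMMENDED = {
    "blog": "Kokoro",
    "tutorial": "Kokoro",
    "documentation": "Kokoro",
    "multilingual": "Qwen3 TTS",
    "voice-design": "Qwen3 TTS",
    "broadcast": "Fish Audio S2 Pro",
    "fiction": "Chatterbox Turbo",
}


def pick_model(content_type: str, free_memory_gb: float) -> str:
    """Return the recommended model, or the fastest one that fits in memory."""
    by_name = {m.name: m for m in MODELS}
    # Unknown content types fall back to Kokoro, the article's default.
    preferred = by_name[RECOMMENDED.get(content_type, "Kokoro")]
    if preferred.approx_memory_gb <= free_memory_gb:
        return preferred.name
    fitting = [m for m in MODELS if m.approx_memory_gb <= free_memory_gb]
    if not fitting:
        raise ValueError("no listed model fits in the available memory")
    return min(fitting, key=lambda m: m.max_seconds_per_1500_words).name
```

For example, `pick_model("fiction", 8)` returns Chatterbox Turbo, while `pick_model("broadcast", 3)` falls back to Kokoro because Fish Audio S2 Pro's roughly 6GB footprint does not fit.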

Murmur bundles all four models into one macOS app, so you can switch between them without managing Python environments, downloading weights, or configuring inference pipelines. Pick the model that fits the job, generate, and export. One purchase, all models, unlimited use.


All four models. One app. $49.

Kokoro, Qwen3, Fish Audio, and Chatterbox, bundled into a native Mac app. No subscriptions, no cloud, no usage limits. Generate unlimited speech on your own hardware.

macOS 14+ · Apple Silicon required · 7-day refund policy