# Introduction
Voice-enabled applications are everywhere, from virtual assistants to customer service chatbots. But for developers, building natural-sounding speech into apps has often meant relying on expensive cloud APIs or dealing with robotic, unnatural voices.
Mistral AI aims to change that with Voxtral TTS. It is a powerful, open-weight text-to-speech (TTS) model that you can run on your own hardware. Released on March 26, 2026, this 4-billion-parameter model generates human-like speech in nine languages and adapts to a new voice from as little as three seconds of reference audio.
In this Voxtral TTS tutorial, you will learn how the model works, what makes its voice cloning and low-latency performance special, and how to start generating speech with just a few lines of Python code.
# What Is Voxtral TTS?
Voxtral TTS is Mistral AI's first TTS model. Unlike many commercial offerings that lock you into cloud APIs, Voxtral TTS is released with open weights. You can download the model and run it entirely on your own infrastructure. This gives you full control over your data, costs, and customization.
The model is built on Mistral's existing Ministral 3B architecture, making it small enough to run on consumer hardware, including laptops and edge devices. According to Mistral, Voxtral TTS delivers "frontier-quality" performance that matches or exceeds leading proprietary systems in human listening tests.
// Open Weight vs. Open Source
It is important to understand that "open weight" is not the same as fully open source. Voxtral TTS gives you access to the trained model weights, which you can use for research and personal projects under a CC BY-NC 4.0 license. However, commercial use requires a separate licensing agreement or using Mistral's paid API.
// Key Features
Voxtral TTS offers a powerful set of features designed for real-world voice applications:
- Clones a new voice from just 3 seconds of reference audio.
- Delivers low latency: 70ms model latency and approximately 100ms time-to-first-audio.
- Achieves a real-time factor (RTF) of 9.7x, generating 10 seconds of speech in just over 1 second.
- Supports 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
- Has 4 billion parameters.
- Provides open weights under CC BY-NC 4.0 for non-commercial use, with an API option for commercial projects.
- Includes native support for low-latency streaming inference.
# Cloning a Voice from Three Seconds of Audio
One of Voxtral TTS's most impressive capabilities is zero-shot voice cloning. Traditional voice cloning systems often need 30 seconds or more of reference audio to capture a person's voice. Voxtral TTS works with as little as 3 seconds.
When you provide a short voice prompt, the model analyzes the speaker's unique characteristics — like accent, intonation, rhythm, and even emotional tone — and can then generate new speech in that same voice. This works across all nine supported languages, meaning you can create a multilingual voice clone that speaks English, French, or Hindi while preserving the original voice identity.
// How Voxtral TTS Compares to ElevenLabs
In blind human evaluations conducted by native speakers across all nine languages, Voxtral TTS achieved an overall 68.4% win rate over ElevenLabs Flash v2.5. Win rates by language:
| Language | Win Rate vs. ElevenLabs Flash v2.5 |
|---|---|
| Spanish | 87.8% |
| Hindi | 79.8% |
| Portuguese | 74.4% |
| Arabic | 72.9% |
| German | 72.0% |
| English | 60.8% |
| Italian | 57.1% |
| French | 54.4% |
| Dutch | 49.4% |
# Latency Performance: Built for Real-Time Conversations
For voice agents and interactive applications, speed matters. A delay of even a few hundred milliseconds can make a conversation feel awkward or broken.
Voxtral TTS is designed specifically for low-latency streaming inference. According to Mistral's official documentation, the model achieves:
- 70ms model latency for a typical input of 10 seconds of voice sample and 500 characters of text.
- ~100ms time-to-first-audio (TTFA) — the time from when you send the text to when you hear the first sound.
- An RTF of 9.7x — meaning it generates speech nearly ten times faster than real time.
To put that in perspective: a 10-second audio clip can be generated in just over 1 second. This makes Voxtral TTS suitable for real-time applications like:
- Conversational AI agents
- Live customer support systems
- Real-time translation tools
- Voice-enabled IoT devices
The model can natively generate up to two minutes of continuous audio without interruption.
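For scripts longer than the two-minute cap, you would need to split the text and concatenate the resulting clips. Here is a simple greedy splitter (our own sketch; the ~15 characters-per-second speaking-rate estimate is a rough assumption, not a Voxtral specification):

```python
def split_for_tts(text: str, max_seconds: float = 110.0,
                  chars_per_second: float = 15.0) -> list[str]:
    """Greedily pack sentences into chunks under an estimated duration cap."""
    max_chars = int(max_seconds * chars_per_second)
    chunks, current = [], ""
    # Naive sentence split; a real pipeline would use a proper tokenizer.
    for sentence in text.replace("!", ".").replace("?", ".").split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = (current + ". " if current else "") + sentence
        if len(candidate) > max_chars and current:
            chunks.append(current + ".")
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current + ".")
    return chunks
```

Each returned chunk can then be sent as a separate synthesis request and the audio segments joined afterwards.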
// Understanding Real-Time Factor
RTF measures how quickly a model generates audio compared to the actual duration of that audio. An RTF of 1.0 means generation takes the same time as the audio length. An RTF of 9.7 means generation is 9.7 times faster — a 10-second clip takes only about 1.03 seconds to produce.
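The arithmetic is straightforward; a small helper (our own, purely for illustration) makes the relationship concrete:

```python
def generation_time(audio_seconds: float, rtf: float) -> float:
    """Seconds needed to generate a clip at a given real-time factor."""
    return audio_seconds / rtf

# A 10-second clip at Voxtral TTS's reported RTF of 9.7:
print(round(generation_time(10, 9.7), 2))  # → 1.03
```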
# How Voxtral TTS Works
Without going too deep into the mathematics, here is a high-level overview of the model's architecture.
Voxtral TTS uses a hybrid approach that combines two techniques:
- Semantic token generation. The model first generates "semantic tokens" that represent the meaning and structure of what needs to be spoken. This is similar to how a language model generates text tokens.
- Flow matching for acoustic tokens. These semantic tokens are then converted into acoustic tokens that represent the actual sound waves of speech.
Both types of tokens are encoded and decoded using the Voxtral Codec, a custom speech tokenizer trained from scratch with a hybrid vector quantization and finite scalar quantization (VQ-FSQ) scheme.
This two-stage process allows the model to separate what to say (content) from how to say it (voice style, emotion, accent). That is why the model can clone a voice from a short sample; it learns the "how" from the reference audio and applies it to any text.
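As a mental model only (every function below is hypothetical and stubbed; Voxtral's real internals are described in the paper), the separation of content and style looks roughly like this:

```python
# Hypothetical sketch of the two-stage design described above.
# None of these functions exist in any Voxtral SDK; the stubs just
# show how content and voice style stay separate until the end.

def generate_semantic_tokens(text: str) -> list[int]:
    # Stage 1: map text to semantic tokens (the "what"), much like an
    # LLM emits text tokens. Stubbed here as a per-character encoding.
    return [ord(ch) % 512 for ch in text]

def flow_match_acoustic(semantic: list[int], voice_style: list[float]) -> list[float]:
    # Stage 2: flow matching turns semantic tokens plus a voice-style
    # embedding (the "how", learned from reference audio) into
    # acoustic tokens. Stubbed as scaling by the style's mean.
    gain = sum(voice_style) / len(voice_style)
    return [tok * gain for tok in semantic]

def synthesize(text: str, voice_style: list[float]) -> list[float]:
    # Any voice style can be applied to any text, which is why a
    # 3-second reference clip is enough to clone a voice.
    return flow_match_acoustic(generate_semantic_tokens(text), voice_style)
```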
For a deeper technical dive, see the full Voxtral TTS paper on arXiv.
# Getting Started: Installation and Setup
You can use Voxtral TTS in two ways:
- Via Mistral's API — easiest for quick testing and commercial use.
- Self-hosted with open weights — full control, free for non-commercial use.
Prerequisites:
- Basic familiarity with Python and the command line.
- Python 3.10 or higher.
- The `pip` package manager.
- For self-hosting: an NVIDIA GPU (8GB+ VRAM recommended) or Apple Silicon Mac.
// Option 1: Using the Mistral API
Mistral offers a simple Python SDK. First, install the Mistral AI client:
Then, generate speech with just a few lines:
```python
from mistralai import Mistral

api_key = "your-api-key"  # Get from console.mistral.ai
client = Mistral(api_key=api_key)

response = client.audio.speech.create(
    model="voxtral-tts-26-03",
    input="Hello, world! This is a test of Voxtral TTS.",
    voice="alloy",  # or a custom voice prompt
)

# Save the audio to a file
with open("output.wav", "wb") as f:
    f.write(response.audio)
```
The API costs $0.016 per 1,000 characters. You can also test the model for free in Mistral Studio.
// Option 2: Self-Hosting with Open Weights
For self-hosting, you can download the model weights from Hugging Face. The model is released under a CC BY-NC 4.0 license. A popular community-developed option is to use int4 quantization for efficient inference. The voxtral-int4 implementation achieves:
- 4.6x real-time speech generation.
- 3.7GB VRAM usage on an RTX 3090.
- 54% VRAM reduction compared to full precision.
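As a quick sanity check on those figures, a 54% reduction down to 3.7GB implies a full-precision footprint of roughly 8GB, which is consistent with a 4-billion-parameter model stored at 16-bit precision:

```python
int4_vram_gb = 3.7
reduction = 0.54

# If 3.7 GB is what remains after a 54% cut, full precision is:
full_precision_gb = int4_vram_gb / (1 - reduction)
print(round(full_precision_gb, 1))  # → 8.0

# Cross-check: 4e9 parameters at 2 bytes each (FP16/BF16)
print(4e9 * 2 / 1e9)  # → 8.0 GB of weights alone
```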
# Voice Cloning with a Custom Voice: A Practical Example
One of the most powerful features is adapting the model to any voice. Here is a complete example using the Mistral API:
```python
from mistralai import Mistral

api_key = "your-api-key"
client = Mistral(api_key=api_key)

# Step 1: Load or record a reference audio file (3+ seconds)
reference_audio_path = "my_voice_sample.wav"

# Step 2: Open the audio file for upload
with open(reference_audio_path, "rb") as f:
    audio_content = f.read()

# Step 3: Generate speech using the cloned voice
response = client.audio.speech.create(
    model="voxtral-tts-26-03",
    input="This is my voice, cloned from just a few seconds of audio.",
    voice=audio_content,  # Pass the reference audio directly
)

# Save the generated speech
with open("cloned_voice_output.wav", "wb") as f:
    f.write(response.audio)
```
The reference audio should be clear, without background noise, and at least 3 seconds long. The longer the sample (up to about 25 seconds), the better the voice quality.
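Before uploading a reference clip, it is worth verifying that it actually meets the 3-second minimum. For uncompressed WAV files, Python's standard-library `wave` module is enough (the helper functions below are our own, not part of any SDK):

```python
import wave

def reference_audio_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def is_usable_reference(path: str, minimum: float = 3.0) -> bool:
    # Voxtral TTS needs at least ~3 seconds of clean reference audio.
    return reference_audio_seconds(path) >= minimum
```

For compressed formats such as MP3, you would need a third-party library to read the duration instead.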
# Use Cases
Here are practical scenarios where Voxtral TTS excels:
- Voice Assistants and Chatbots. The low latency (~100ms TTFA) means conversations feel natural and responsive. Unlike cloud APIs, which add network round-trips and per-request costs, self-hosted Voxtral TTS keeps everything on your own servers.
- Multilingual Customer Support. With support for nine major languages and cross-language voice cloning, a single model can serve global customers. For example, you can generate English speech with a French accent based on a short reference prompt.
- Content Localization. Translate and dub videos, podcasts, or e-learning content into multiple languages while preserving the original speaker's voice identity across languages.
- Accessibility Tools. Build screen readers and assistive technologies with natural, expressive voices that users can customize to their preferred voice.
- Gaming and Interactive Media. Generate dynamic character dialogue in real time, adapting to player choices without pre-recording every line.
# Licensing and Deployment Considerations
// Open Weights (CC BY-NC 4.0)
- Permitted: research, personal projects, academic use, internal testing.
- Not permitted: commercial products, services that generate revenue, redistribution for commercial purposes.
- Requires attribution to Mistral AI.
// Commercial Use
For commercial applications, you have two options:
- Use Mistral's API — pay-as-you-go at $0.016 per 1,000 characters.
- Negotiate a commercial license — contact Mistral for enterprise licensing.
If you need unlimited scaling without per-request costs, self-hosting with a commercial license is the most cost-effective path for high-volume use cases. For low to medium volume, the API is simpler.
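To estimate where that break-even sits for your own volume, the API-side arithmetic is simple (rate from the pricing above; the helper function is ours):

```python
def monthly_api_cost_usd(characters_per_month: int,
                         rate_per_1k_chars: float = 0.016) -> float:
    """API cost at $0.016 per 1,000 characters."""
    return characters_per_month / 1000 * rate_per_1k_chars

# Example: a support bot generating 5 million characters a month
print(round(monthly_api_cost_usd(5_000_000), 2))  # ≈ 80.0 USD
```

Compare that monthly figure against your GPU and licensing costs to decide between the API and self-hosting.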
# Conclusion
Voxtral TTS brings enterprise-grade, open-weight text-to-speech within reach of any developer. With just 3 seconds of audio for voice cloning, 70ms latency, and a 9.7x real-time factor, it is built for the real-time, conversational applications that users expect today.
Whether you choose the simplicity of Mistral's API or the full control of self-hosted deployment, Voxtral TTS gives you a powerful foundation for adding natural, expressive speech to your projects.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.