Add Text-to-Speech to Your Flutter App in 15 Minutes
A step-by-step guide to adding high-quality, on-device TTS to a Flutter app using Xybrid and the Kokoro model. No cloud APIs, no API keys, no per-request costs.
Cloud TTS APIs are convenient — until you see the bill. Or deal with the latency. Or explain to your users why their voice assistant needs an internet connection.
Let’s add on-device text-to-speech to a Flutter app. The model runs locally, works offline, and costs nothing per request.
We’ll use Xybrid with the Kokoro TTS model (82M parameters, multiple voices, high quality output).
What You’ll Build
A Flutter app that:
- Converts any text to natural-sounding speech
- Runs entirely on-device (no network after initial model download)
- Supports multiple voices
- Works on iOS, Android, and macOS
Prerequisites
- Flutter 3.x installed
- Rust toolchain (`rustup` installed)
- ~10 minutes
Step 1: Add the Dependency
Add xybrid_flutter to your pubspec.yaml:
```yaml
dependencies:
  xybrid_flutter: ^0.1.0
```

Then run `flutter pub get`.

Step 2: Initialize the SDK
In your app’s startup (e.g., main.dart):
```dart
import 'package:xybrid_flutter/xybrid_flutter.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await Xybrid.init();
  runApp(const MyApp());
}
```

Android only: call `initSdkCacheDir()` with a path from `path_provider` before loading models. This tells Xybrid where to cache downloads.
```dart
// Android setup (add to your init flow)
import 'package:path_provider/path_provider.dart';

final dir = await getApplicationSupportDirectory();
await initSdkCacheDir(cacheDir: dir.path);
```

Step 3: Load the Model
```dart
final model = await Xybrid.model(modelId: 'kokoro-82m').load();
```

The first call downloads the model (~80MB) and caches it locally. Subsequent calls load from cache instantly.
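Because that first load pulls ~80MB over the network, it can fail on a flaky connection. Here is a minimal retry sketch; the bare `catch` and backoff policy are assumptions, so check the package docs for the specific exception types Xybrid actually throws:

```dart
import 'package:xybrid_flutter/xybrid_flutter.dart';

/// Attempt to load Kokoro up to [attempts] times with simple linear backoff.
Future<XybridModel> loadKokoroWithRetry({int attempts = 3}) async {
  for (var i = 1; i <= attempts; i++) {
    try {
      // First successful attempt downloads ~80MB; later loads hit the cache.
      return await Xybrid.model(modelId: 'kokoro-82m').load();
    } catch (e) {
      if (i == attempts) rethrow; // give up after the last attempt
      await Future.delayed(Duration(seconds: i * 2));
    }
  }
  throw StateError('unreachable');
}
```

Run this before showing the TTS UI so a transient network error surfaces as a retry instead of a broken screen.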
Step 4: Run TTS
```dart
final result = await model.run(
  envelope: Envelope.text(text: "Hello! This is running entirely on your device."),
);
```

That’s it. `result` contains the audio as WAV bytes.
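The WAV bytes don't have to go straight to a player; you can also persist them for later playback or sharing. A small sketch that writes the output to a file (using `path_provider` for the directory is an assumption; any writable path works):

```dart
import 'dart:io';
import 'dart:typed_data';
import 'package:path_provider/path_provider.dart';

/// Save synthesized speech to <documents>/<name>.wav and return the file.
Future<File> saveSpeech(Uint8List wavBytes, String name) async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/$name.wav');
  // The model output is already a complete WAV container,
  // so a raw byte write produces a playable file.
  return file.writeAsBytes(wavBytes);
}
```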
Step 5: Play the Audio
Use any audio player package. Here’s a simple approach with audioplayers:
```yaml
# pubspec.yaml
dependencies:
  audioplayers: ^6.0.0
```

```dart
import 'dart:typed_data';
import 'package:audioplayers/audioplayers.dart';

final player = AudioPlayer();

Future<void> speak(Uint8List audioBytes) async {
  await player.play(BytesSource(audioBytes));
}
```

Call it with the result:

```dart
final audioBytes = result.audioBytes();
if (audioBytes != null) {
  await speak(audioBytes);
}
```

Putting It Together
Here’s a complete, minimal TTS screen:
```dart
import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';
import 'package:audioplayers/audioplayers.dart';

class TTSScreen extends StatefulWidget {
  const TTSScreen({super.key});

  @override
  State<TTSScreen> createState() => _TTSScreenState();
}

class _TTSScreenState extends State<TTSScreen> {
  final _controller = TextEditingController();
  final _player = AudioPlayer();
  XybridModel? _model;
  bool _loading = true;
  bool _speaking = false;

  @override
  void initState() {
    super.initState();
    _loadModel();
  }

  Future<void> _loadModel() async {
    final model = await Xybrid.model(modelId: 'kokoro-82m').load();
    setState(() {
      _model = model;
      _loading = false;
    });
  }

  Future<void> _speak() async {
    final text = _controller.text.trim();
    if (text.isEmpty || _model == null) return;
    setState(() => _speaking = true);
    try {
      final result = await _model!.run(
        envelope: Envelope.text(text: text),
      );
      final audioBytes = result.audioBytes();
      if (audioBytes != null) {
        await _player.play(BytesSource(audioBytes));
      }
    } finally {
      setState(() => _speaking = false);
    }
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('On-Device TTS')),
      body: Padding(
        padding: const EdgeInsets.all(16),
        child: Column(
          children: [
            TextField(
              controller: _controller,
              maxLines: 4,
              decoration: const InputDecoration(
                hintText: 'Enter text to speak...',
                border: OutlineInputBorder(),
              ),
            ),
            const SizedBox(height: 16),
            SizedBox(
              width: double.infinity,
              child: ElevatedButton.icon(
                onPressed: _loading || _speaking ? null : _speak,
                icon: _speaking
                    ? const SizedBox(
                        width: 16,
                        height: 16,
                        child: CircularProgressIndicator(strokeWidth: 2),
                      )
                    : const Icon(Icons.volume_up),
                label: Text(
                  _loading
                      ? 'Loading model...'
                      : _speaking
                          ? 'Speaking...'
                          : 'Speak',
                ),
              ),
            ),
          ],
        ),
      ),
    );
  }

  @override
  void dispose() {
    _controller.dispose();
    _player.dispose();
    super.dispose();
  }
}
```

Bonus: Choose a Voice
Kokoro supports multiple voices. List them and let the user pick:
```dart
// Get available voices
final voices = await _model!.voices();

// Run with a specific voice
final result = await _model!.run(
  envelope: Envelope.text(text: text),
  voiceId: voices[selectedIndex].id,
);
```

Build a dropdown:
```dart
DropdownButton<int>(
  value: _selectedVoice,
  items: voices.asMap().entries.map((entry) {
    return DropdownMenuItem(
      value: entry.key,
      child: Text(entry.value.name),
    );
  }).toList(),
  onChanged: (index) => setState(() => _selectedVoice = index!),
),
```

Bonus: Warmup for Instant Response
The first inference is slower because the model weights still have to load into memory. Use `warmup()` to preload:
```dart
final model = await Xybrid.model(modelId: 'kokoro-82m').load();
await model.warmup(); // Pre-loads weights, compiles shaders

// Now the first speak() call is as fast as subsequent ones
```

Call this during your loading screen or splash screen.
Performance
Measured on real devices:
| Device | First Inference | Subsequent | Notes |
|---|---|---|---|
| iPhone 15 Pro | ~800ms | ~200ms | CoreML acceleration |
| Pixel 8 | ~1.2s | ~400ms | CPU inference |
| MacBook Pro M2 | ~300ms | ~100ms | Metal acceleration |
These are for a short sentence (~10 words). Longer text scales linearly.
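Because synthesis time scales with input length, long passages feel slow if you wait for the whole result. A common workaround is to split the text into sentence-sized chunks and synthesize them one at a time, playing each chunk while the next is generated. The splitter below is plain Dart; the chunk size of 200 characters is an arbitrary assumption to tune for your model:

```dart
/// Split [text] into chunks of at most [maxChars] characters,
/// breaking only at sentence boundaries (., !, ?).
List<String> splitIntoChunks(String text, {int maxChars = 200}) {
  final sentences = text.split(RegExp(r'(?<=[.!?])\s+'));
  final chunks = <String>[];
  var current = '';
  for (final s in sentences) {
    if (current.isNotEmpty && current.length + s.length + 1 > maxChars) {
      chunks.add(current); // current chunk is full, start a new one
      current = s;
    } else {
      current = current.isEmpty ? s : '$current $s';
    }
  }
  if (current.isNotEmpty) chunks.add(current);
  return chunks;
}
```

Feed each chunk to `model.run(envelope: Envelope.text(text: chunk))` in sequence and queue the resulting audio, so playback of the first sentence starts while later ones are still synthesizing.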
Why Not Just Use a Cloud API?
| | Cloud TTS | On-Device (Xybrid) |
|---|---|---|
| Latency | 200-500ms network + processing | 100-400ms total |
| Privacy | Text sent to third party | Never leaves device |
| Cost | $4-16 per 1M characters | Free forever |
| Offline | No | Yes |
| API key | Required | Not needed |
For apps where privacy matters (journaling, health, accessibility) or where you need offline support — on-device wins.
Next Steps
- Add ASR: Use `whisper-tiny` for speech-to-text with `XybridStreamer` for real-time transcription
- Chain models: Build a pipeline that listens, processes, and speaks back
- Explore voices: Kokoro has dozens of voice options across languages
Check out the Xybrid Flutter example app for a full-featured demo with 8 screens.
Xybrid is open-source: github.com/xybrid-ai/xybrid