Add Text-to-Speech to Your Flutter App in 15 Minutes

Cloud TTS APIs are convenient — until you see the bill. Or deal with the latency. Or explain to your users why their voice assistant needs an internet connection.

Let’s add on-device text-to-speech to a Flutter app. The model runs locally, works offline, and costs nothing per request.

We’ll use Xybrid with the Kokoro TTS model (82M parameters, multiple voices, high quality output).

What You’ll Build

A Flutter app that:

Converts any text to natural-sounding speech
Runs entirely on-device (no network after initial model download)
Supports multiple voices
Works on iOS, Android, and macOS

Prerequisites

Flutter 3.x installed
Rust toolchain (rustup installed)
~10 minutes

Step 1: Add the Dependency

Add xybrid_flutter to your pubspec.yaml:

dependencies:
  xybrid_flutter: ^0.1.0

flutter pub get

Step 2: Initialize the SDK

In your app’s startup (e.g., main.dart):

import 'package:xybrid_flutter/xybrid_flutter.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await Xybrid.init();
  runApp(const MyApp());
}

Android only: Call initSdkCacheDir() with a path from path_provider before loading models. This tells Xybrid where to cache downloads.

// Android setup (add to your init flow)
import 'package:path_provider/path_provider.dart';

final dir = await getApplicationSupportDirectory();
await initSdkCacheDir(cacheDir: dir.path);

Step 3: Load the Model

final model = await Xybrid.model(modelId: 'kokoro-82m').load();

The first call downloads the model (~80MB) and caches it locally. Subsequent calls load from cache instantly.

Step 4: Run TTS

final result = await model.run(
  envelope: Envelope.text(text: "Hello! This is running entirely on your device."),
);

That’s it. result contains the audio as WAV bytes.

Step 5: Play the Audio

Use any audio player package. Here’s a simple approach with audioplayers:

# pubspec.yaml
dependencies:
  audioplayers: ^6.0.0

import 'package:audioplayers/audioplayers.dart';

final player = AudioPlayer();

Future<void> speak(Uint8List audioBytes) async {
  await player.play(BytesSource(audioBytes));
}

Call it with the result:

final audioBytes = result.audioBytes();
if (audioBytes != null) {
  await speak(audioBytes);
}

Putting It Together

Here’s a complete, minimal TTS screen:

import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';
import 'package:audioplayers/audioplayers.dart';

class TTSScreen extends StatefulWidget {
  const TTSScreen({super.key});

  @override
  State<TTSScreen> createState() => _TTSScreenState();
}

class _TTSScreenState extends State<TTSScreen> {
  final _controller = TextEditingController();
  final _player = AudioPlayer();

  XybridModel? _model;
  bool _loading = true;
  bool _speaking = false;

  @override
  void initState() {
    super.initState();
    _loadModel();
  }

  Future<void> _loadModel() async {
    final model = await Xybrid.model(modelId: 'kokoro-82m').load();
    setState(() {
      _model = model;
      _loading = false;
    });
  }

  Future<void> _speak() async {
    final text = _controller.text.trim();
    if (text.isEmpty || _model == null) return;

    setState(() => _speaking = true);

    try {
      final result = await _model!.run(
        envelope: Envelope.text(text: text),
      );

      final audioBytes = result.audioBytes();
      if (audioBytes != null) {
        await _player.play(BytesSource(audioBytes));
      }
    } finally {
      setState(() => _speaking = false);
    }
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('On-Device TTS')),
      body: Padding(
        padding: const EdgeInsets.all(16),
        child: Column(
          children: [
            TextField(
              controller: _controller,
              maxLines: 4,
              decoration: const InputDecoration(
                hintText: 'Enter text to speak...',
                border: OutlineInputBorder(),
              ),
            ),
            const SizedBox(height: 16),
            SizedBox(
              width: double.infinity,
              child: ElevatedButton.icon(
                onPressed: _loading || _speaking ? null : _speak,
                icon: _speaking
                    ? const SizedBox(
                        width: 16,
                        height: 16,
                        child: CircularProgressIndicator(strokeWidth: 2),
                      )
                    : const Icon(Icons.volume_up),
                label: Text(
                  _loading
                      ? 'Loading model...'
                      : _speaking
                          ? 'Speaking...'
                          : 'Speak',
                ),
              ),
            ),
          ],
        ),
      ),
    );
  }

  @override
  void dispose() {
    _controller.dispose();
    _player.dispose();
    super.dispose();
  }
}

Bonus: Choose a Voice

Kokoro supports multiple voices. List them and let the user pick:

// Get available voices
final voices = await _model!.voices();

// Run with a specific voice
final result = await _model!.run(
  envelope: Envelope.text(text: text),
  voiceId: voices[selectedIndex].id,
);

Build a dropdown:

DropdownButton<int>(
  value: _selectedVoice,
  items: voices.asMap().entries.map((entry) {
    return DropdownMenuItem(
      value: entry.key,
      child: Text(entry.value.name),
    );
  }).toList(),
  onChanged: (index) => setState(() => _selectedVoice = index!),
),

Bonus: Warmup for Instant Response

The first inference is slower (model weights load into memory). Use warmup() to preload:

final model = await Xybrid.model(modelId: 'kokoro-82m').load();
await model.warmup(); // Pre-loads weights, compiles shaders

// Now the first speak() call is as fast as subsequent ones

Call this during your loading screen or splash screen.

Performance

Measured on real devices:

Device	First Inference	Subsequent	Notes
iPhone 15 Pro	~800ms	~200ms	CoreML acceleration
Pixel 8	~1.2s	~400ms	CPU inference
MacBook Pro M2	~300ms	~100ms	Metal acceleration

These are for a short sentence (~10 words). Longer text scales linearly.

Why Not Just Use a Cloud API?

	Cloud TTS	On-Device (Xybrid)
Latency	200-500ms network + processing	100-400ms total
Privacy	Text sent to third party	Never leaves device
Cost	$4-16 per 1M characters	Free forever
Offline	No	Yes
API key	Required	Not needed

For apps where privacy matters (journaling, health, accessibility) or where you need offline support — on-device wins.

Next Steps

Add ASR: Use whisper-tiny for speech-to-text with XybridStreamer for real-time transcription
Chain models: Build a pipeline that listens, processes, and speaks back
Explore voices: Kokoro has dozens of voice options across languages

Check out the Xybrid Flutter example app for a full-featured demo with 8 screens.

Xybrid is open-source: github.com/xybrid-ai/xybrid