Tutorial

Add Text-to-Speech to Your Flutter App in 15 Minutes

A step-by-step guide to adding high-quality, on-device TTS to a Flutter app using Xybrid and the Kokoro model. No cloud APIs, no API keys, no per-request costs.

Glenn Sonna
8 min read
flutter · tts · tutorial · on-device-ai

Cloud TTS APIs are convenient — until you see the bill. Or deal with the latency. Or explain to your users why their voice assistant needs an internet connection.

Let’s add on-device text-to-speech to a Flutter app. The model runs locally, works offline, and costs nothing per request.

We’ll use Xybrid with the Kokoro TTS model (82M parameters, multiple voices, high quality output).

What You’ll Build

A Flutter app that:

  • Converts any text to natural-sounding speech
  • Runs entirely on-device (no network after initial model download)
  • Supports multiple voices
  • Works on iOS, Android, and macOS

Prerequisites

  • Flutter 3.x installed
  • Rust toolchain (rustup installed)
  • ~10 minutes

Step 1: Add the Dependency

Add xybrid_flutter to your pubspec.yaml:

dependencies:
  xybrid_flutter: ^0.1.0

Then fetch packages:

flutter pub get

Step 2: Initialize the SDK

In your app’s startup (e.g., main.dart):

import 'package:xybrid_flutter/xybrid_flutter.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await Xybrid.init();
  runApp(const MyApp());
}

Android only: Call initSdkCacheDir() with a path from path_provider before loading models. This tells Xybrid where to cache downloads.

// Android setup (add to your init flow)
import 'package:path_provider/path_provider.dart';

final dir = await getApplicationSupportDirectory();
await initSdkCacheDir(cacheDir: dir.path);

Step 3: Load the Model

final model = await Xybrid.model(modelId: 'kokoro-82m').load();

The first call downloads the model (~80MB) and caches it locally. Subsequent calls load from cache instantly.
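That first download needs a network connection and can fail, so it's worth wrapping the load in a try/catch. A minimal sketch using only the calls shown above (loadTtsModel is an illustrative helper name, not part of the SDK):

```dart
import 'package:flutter/foundation.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';

/// Loads the Kokoro model, returning null if the download/load fails
/// (e.g. no connectivity on first run).
Future<XybridModel?> loadTtsModel() async {
  try {
    // Downloads ~80MB on first run; loads from cache afterwards.
    return await Xybrid.model(modelId: 'kokoro-82m').load();
  } catch (e) {
    debugPrint('Model load failed: $e');
    return null;
  }
}
```

Callers can then show a retry button instead of crashing when the first download is interrupted.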

Step 4: Run TTS

final result = await model.run(
  envelope: Envelope.text(text: "Hello! This is running entirely on your device."),
);

That’s it. result contains the audio as WAV bytes.
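Since the result is plain WAV bytes, you can also write it to disk, for example to cache generated clips instead of re-synthesizing them. A minimal sketch with dart:io (saveSpeech and the path are illustrative):

```dart
import 'dart:io';
import 'dart:typed_data';

/// Writes WAV bytes (as returned by result.audioBytes()) to a file.
Future<File> saveSpeech(Uint8List wavBytes, String path) async {
  final file = File(path);
  await file.writeAsBytes(wavBytes, flush: true);
  return file;
}

// Usage: final file = await saveSpeech(audioBytes, 'greeting.wav');
```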

Step 5: Play the Audio

Use any audio player package. Here’s a simple approach with audioplayers:

# pubspec.yaml
dependencies:
  audioplayers: ^6.0.0

Then wire up a player:

import 'dart:typed_data';

import 'package:audioplayers/audioplayers.dart';

final player = AudioPlayer();

Future<void> speak(Uint8List audioBytes) async {
  await player.play(BytesSource(audioBytes));
}

Call it with the result:

final audioBytes = result.audioBytes();
if (audioBytes != null) {
  await speak(audioBytes);
}

Putting It Together

Here’s a complete, minimal TTS screen:

import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';
import 'package:audioplayers/audioplayers.dart';

class TTSScreen extends StatefulWidget {
  const TTSScreen({super.key});

  @override
  State<TTSScreen> createState() => _TTSScreenState();
}

class _TTSScreenState extends State<TTSScreen> {
  final _controller = TextEditingController();
  final _player = AudioPlayer();

  XybridModel? _model;
  bool _loading = true;
  bool _speaking = false;

  @override
  void initState() {
    super.initState();
    _loadModel();
  }

  Future<void> _loadModel() async {
    final model = await Xybrid.model(modelId: 'kokoro-82m').load();
    if (!mounted) return; // Screen may have been disposed during the async load.
    setState(() {
      _model = model;
      _loading = false;
    });
  }

  Future<void> _speak() async {
    final text = _controller.text.trim();
    if (text.isEmpty || _model == null) return;

    setState(() => _speaking = true);

    try {
      final result = await _model!.run(
        envelope: Envelope.text(text: text),
      );

      final audioBytes = result.audioBytes();
      if (audioBytes != null) {
        await _player.play(BytesSource(audioBytes));
      }
    } finally {
      if (mounted) setState(() => _speaking = false);
    }
  }
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('On-Device TTS')),
      body: Padding(
        padding: const EdgeInsets.all(16),
        child: Column(
          children: [
            TextField(
              controller: _controller,
              maxLines: 4,
              decoration: const InputDecoration(
                hintText: 'Enter text to speak...',
                border: OutlineInputBorder(),
              ),
            ),
            const SizedBox(height: 16),
            SizedBox(
              width: double.infinity,
              child: ElevatedButton.icon(
                onPressed: _loading || _speaking ? null : _speak,
                icon: _speaking
                    ? const SizedBox(
                        width: 16,
                        height: 16,
                        child: CircularProgressIndicator(strokeWidth: 2),
                      )
                    : const Icon(Icons.volume_up),
                label: Text(
                  _loading
                      ? 'Loading model...'
                      : _speaking
                          ? 'Speaking...'
                          : 'Speak',
                ),
              ),
            ),
          ],
        ),
      ),
    );
  }

  @override
  void dispose() {
    _controller.dispose();
    _player.dispose();
    super.dispose();
  }
}

Bonus: Choose a Voice

Kokoro supports multiple voices. List them and let the user pick:

// Get available voices
final voices = await _model!.voices();

// Run with a specific voice
final result = await _model!.run(
  envelope: Envelope.text(text: text),
  voiceId: voices[selectedIndex].id,
);

Build a dropdown:

// Assumes an `int _selectedVoice = 0;` field and the `voices` list from above.
DropdownButton<int>(
  value: _selectedVoice,
  items: voices.asMap().entries.map((entry) {
    return DropdownMenuItem(
      value: entry.key,
      child: Text(entry.value.name),
    );
  }).toList(),
  onChanged: (index) => setState(() => _selectedVoice = index!),
),

Bonus: Warmup for Instant Response

The first inference is slower (model weights load into memory). Use warmup() to preload:

final model = await Xybrid.model(modelId: 'kokoro-82m').load();
await model.warmup(); // Pre-loads weights, compiles shaders

// Now the first speak() call is as fast as subsequent ones

Call this during your loading screen or splash screen.

Performance

Measured on real devices:

Device           First Inference   Subsequent   Notes
iPhone 15 Pro    ~800ms            ~200ms       CoreML acceleration
Pixel 8          ~1.2s             ~400ms       CPU inference
MacBook Pro M2   ~300ms            ~100ms       Metal acceleration

These are for a short sentence (~10 words). Longer text scales linearly.
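Because synthesis time grows with input length, one way to keep long passages responsive is to split on sentence boundaries and synthesize chunk by chunk, so the first sentence plays while later ones are still being generated. A rough sketch reusing the run() call from Step 4 (speakLongText is an illustrative helper; the naive regex split won't handle abbreviations like "Dr."):

```dart
import 'package:audioplayers/audioplayers.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';

/// Synthesizes long text sentence by sentence so playback can
/// begin before the whole passage has been processed.
Future<void> speakLongText(
    XybridModel model, AudioPlayer player, String text) async {
  final sentences = text
      .split(RegExp(r'(?<=[.!?])\s+'))
      .where((s) => s.trim().isNotEmpty);

  for (final sentence in sentences) {
    final result = await model.run(
      envelope: Envelope.text(text: sentence),
    );
    final bytes = result.audioBytes();
    if (bytes == null) continue;

    await player.play(BytesSource(bytes));
    // play() resolves when playback starts; wait for completion
    // before the next chunk so clips don't overlap.
    await player.onPlayerComplete.first;
  }
}
```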

Why Not Just Use a Cloud API?

           Cloud TTS                         On-Device (Xybrid)
Latency    200-500ms network + processing    100-400ms total
Privacy    Text sent to third party          Never leaves device
Cost       $4-16 per 1M characters           Free forever
Offline    No                                Yes
API key    Required                          Not needed

For apps where privacy matters (journaling, health, accessibility) or where you need offline support — on-device wins.

Next Steps

  • Add ASR: Use whisper-tiny for speech-to-text with XybridStreamer for real-time transcription
  • Chain models: Build a pipeline that listens, processes, and speaks back
  • Explore voices: Kokoro has dozens of voice options across languages

Check out the Xybrid Flutter example app for a full-featured demo with 8 screens.


Xybrid is open-source: github.com/xybrid-ai/xybrid
