This document outlines the requirements for the Playback API, which provides a unified interface for controlling text-to-speech (TTS) playback.
- Start, pause, resume, and stop
- Handle both individual and batched text/SSML input
- Report current playback state (playing, paused, stopped)
- Accept plain text and SSML input
- Support multiple utterances
- Emit events for state changes
- Provide word/sentence boundary information
- Report errors and warnings
- Select from available voices
- Configure voice parameters (rate, pitch, volume)
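The requirements above can be sketched as a minimal API surface. This is only an illustration of the listed capabilities, not the actual design; all type and member names here are assumptions.

```typescript
// Hypothetical sketch of the Playback API surface; every name is illustrative.
type PlaybackState = "playing" | "paused" | "stopped";

interface Utterance {
  text: string;      // plain text or SSML
  isSsml?: boolean;
}

interface VoiceParams {
  rate: number;      // 1.0 = normal speed
  pitch: number;     // 1.0 = default pitch
  volume: number;    // 0.0 - 1.0
}

// A tiny in-memory implementation to illustrate state transitions and events.
class Playback {
  private state: PlaybackState = "stopped";
  private listeners: Array<(s: PlaybackState) => void> = [];

  // "Emit events for state changes"
  onStateChange(cb: (s: PlaybackState) => void): void {
    this.listeners.push(cb);
  }

  private setState(s: PlaybackState): void {
    this.state = s;
    this.listeners.forEach(cb => cb(s));
  }

  // "Report current playback state"
  getState(): PlaybackState { return this.state; }

  // "Start, pause, resume, and stop"
  start(): void { this.setState("playing"); }
  pause(): void { if (this.state === "playing") this.setState("paused"); }
  resume(): void { if (this.state === "paused") this.setState("playing"); }
  stop(): void { this.setState("stopped"); }
}
```

Boundary events, error reporting, and batched input would extend the same listener pattern.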
[WIP]
A PlaybackEngineProvider lists the available voices and creates PlaybackEngine instances bound to one specific voice, language, etc.
A PlaybackEngine uses a single voice; its parameters can be set, it is loaded with utterances, it can preload them with context, and it lets you speak an utterance by index.
A PlaybackNavigator then handles navigation, continuous play, etc.
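The provider/engine/navigator layering might look like the following sketch. Only PlaybackEngineProvider, PlaybackEngine, and PlaybackNavigator come from the text above; all other names and signatures are assumptions.

```typescript
// Illustrative layering sketch; names beyond those in the text are assumptions.
interface Voice {
  id: string;
  language: string;
}

interface PlaybackEngine {
  setParams(rate: number, pitch: number, volume: number): void;
  loadUtterances(utterances: string[]): void;
  speak(index: number): void; // speak one utterance by index
}

interface PlaybackEngineProvider {
  availableVoices(): Voice[];
  createEngine(voice: Voice): PlaybackEngine;
}

// The navigator sits on top of an engine and drives continuous play.
class PlaybackNavigator {
  private index = 0;

  constructor(private engine: PlaybackEngine, private count: number) {}

  // Speak the next utterance; returns false when there is nothing left.
  next(): boolean {
    if (this.index >= this.count) return false;
    this.engine.speak(this.index++);
    return true;
  }
}
```

The point of the split is that the engine knows nothing about reading order: navigation concerns stay in the navigator.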
After a talk with Hadrien, I understand a bit better what the issue was. Here is my proposal:
- A ReadAloudNavigator using audio files (media overlays) and TTS to read content aloud, based on guided navigation documents.
- A component converting Publication HTML files into guided navigation documents.
- A component converting Publication SMIL files (probably along with HTML files) into guided navigation documents following the original structure.
- These guided navigation documents feed the ReadAloudNavigator.
- Apps use the ReadAloudNavigator for an integrated experience.

If we agree on this, the Playback API turns again into what I thought it was: a technical API adapting TTS engines/services for use inside the ReadAloudNavigator. Then, we can freely use any of the discussed approaches without non-technical considerations.
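Under that reading, the Playback API reduces to a thin adapter over interchangeable TTS backends, so the ReadAloudNavigator never talks to an engine or service directly. A minimal sketch, assuming hypothetical names throughout:

```typescript
// Hedged sketch: the Playback API as an adapter over different TTS backends.
// TtsBackend and TtsAdapter are illustrative names, not part of any real API.
interface TtsBackend {
  // A platform engine or remote service; invokes `done` when speech finishes.
  speak(text: string, done: () => void): void;
}

// Uniform promise-based interface, regardless of which backend is plugged in.
class TtsAdapter {
  constructor(private backend: TtsBackend) {}

  speak(text: string): Promise<void> {
    return new Promise(resolve => this.backend.speak(text, resolve));
  }
}
```

Swapping engines/services then only means providing a different backend, which keeps the "discussed approaches" purely a technical choice.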