This document outlines the requirements for the Playback API, which provides a unified interface for controlling text-to-speech (TTS) playback.
- Start, pause, resume, and stop
- Handle both individual and batched text/SSML input
- Report current playback state (playing, paused, stopped)
- Accept plain text and SSML input
- Support multiple utterances
- Emit events for state changes
- Provide word/sentence boundary information
- Report errors and warnings
- Select from available voices
- Configure voice parameters (rate, pitch, volume)
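
As a rough illustration, the control, state, and event requirements listed above could be modelled as follows. This is only a sketch: every identifier (`PlaybackState`, `PlaybackEvent`, `PlaybackControls`, `PlaybackStateTracker`) is hypothetical and not part of any agreed API, and the tracker only records state transitions without producing audio.

```typescript
// Hypothetical names throughout: none of these identifiers come from the requirements.
type PlaybackState = "playing" | "paused" | "stopped";

// Events a caller could subscribe to: state changes, boundaries, errors and warnings.
type PlaybackEvent =
  | { type: "statechange"; state: PlaybackState }
  | { type: "boundary"; unit: "word" | "sentence"; charIndex: number }
  | { type: "error" | "warning"; message: string };

type PlaybackListener = (event: PlaybackEvent) => void;

interface PlaybackControls {
  readonly state: PlaybackState;
  start(): void;
  pause(): void;
  resume(): void;
  stop(): void;
  on(listener: PlaybackListener): void;
}

// State-only stand-in: tracks transitions and emits events, but speaks nothing.
class PlaybackStateTracker implements PlaybackControls {
  private current: PlaybackState = "stopped";
  private listeners: PlaybackListener[] = [];

  get state(): PlaybackState {
    return this.current;
  }

  on(listener: PlaybackListener): void {
    this.listeners.push(listener);
  }

  start(): void {
    this.transition("playing");
  }

  // pause() only has an effect while playing.
  pause(): void {
    if (this.current === "playing") this.transition("paused");
  }

  // resume() only has an effect while paused.
  resume(): void {
    if (this.current === "paused") this.transition("playing");
  }

  stop(): void {
    this.transition("stopped");
  }

  // Emits a "statechange" event only when the state actually changes.
  private transition(next: PlaybackState): void {
    if (next === this.current) return;
    this.current = next;
    for (const listener of this.listeners) {
      listener({ type: "statechange", state: next });
    }
  }
}
```

Modelling events as a discriminated union lets a caller narrow on `type` and handle state changes, boundaries, and errors in a single listener.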
[WIP]
A PlaybackEngineProvider allows you to get available voices and create instances of the PlaybackEngine using one specific voice, language, etc.
A PlaybackEngine uses a single voice whose parameters can be set. It is loaded with utterances, can preload them with context, and lets you speak an utterance by its index.
A PlaybackNavigator then handles navigation, continuous play, etc.
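
The split between provider, engine, and navigator described above might look something like the sketch below. All names and signatures (`getVoices`, `createEngine`, `loadUtterances`, `preload`, `speak`, `playAll`) are assumptions made for illustration, since the design is still in progress, and `RecordingProvider` is a made-up stand-in that records text instead of speaking it.

```typescript
// All names and signatures below are assumptions for illustration only.
interface Voice {
  id: string;
  language: string;
}

interface VoiceParameters {
  rate: number;
  pitch: number;
  volume: number;
}

interface PlaybackEngine {
  readonly voice: Voice;
  setParameters(params: Partial<VoiceParameters>): void;
  loadUtterances(utterances: string[]): void;
  preload(index: number): Promise<void>;
  speak(index: number): Promise<void>;
}

interface PlaybackEngineProvider {
  getVoices(): Promise<Voice[]>;
  createEngine(voice: Voice): PlaybackEngine;
}

// Caller-side continuous play: the navigator owns the current index.
class PlaybackNavigator {
  private engine: PlaybackEngine;
  private utterances: string[];
  private index = 0;

  constructor(engine: PlaybackEngine, utterances: string[]) {
    this.engine = engine;
    this.utterances = utterances;
    engine.loadUtterances(utterances);
  }

  get currentIndex(): number {
    return this.index;
  }

  async playAll(): Promise<void> {
    for (; this.index < this.utterances.length; this.index++) {
      // Start preloading the next utterance (if any) while the current one plays.
      const next = this.index + 1;
      const preloading =
        next < this.utterances.length ? this.engine.preload(next) : Promise.resolve();
      await this.engine.speak(this.index);
      await preloading;
    }
  }
}

// Made-up provider for demonstration: records spoken text instead of producing audio.
class RecordingProvider implements PlaybackEngineProvider {
  readonly spoken: string[] = [];

  async getVoices(): Promise<Voice[]> {
    return [{ id: "demo-en", language: "en-US" }];
  }

  createEngine(voice: Voice): PlaybackEngine {
    const spoken = this.spoken;
    let loaded: string[] = [];
    return {
      voice,
      setParameters: () => {},
      loadUtterances: (utterances) => {
        loaded = [...utterances];
      },
      preload: async () => {},
      speak: async (index) => {
        const text = loaded[index];
        if (text === undefined) throw new RangeError(`No utterance at index ${index}`);
        spoken.push(text);
      },
    };
  }
}
```

In this arrangement the engine never advances on its own: the navigator decides which index to speak next, so the caller always knows the current utterance.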
I think Mickaël's point makes sense for any caller, not only a navigator. The API caller is responsible for loading the utterances and starting playback of each one, so it knows both the number of utterances and the current one. And, as Mickaël said, it should probably be the source of truth.
I also notice a lack of information about the preloading state. We need to report to the caller when the engine is ready to play the utterances because it should report when the playback is starving (we'll try to limit this but we can't prevent it entirely). The state could be `playing`, `paused`, `ready`, `loading`, `idle`. I can't see the point of the `stop` state and methods. Is it supposed to free resources?

I suggest again, as I did during the call, to separate voice selection from the playback engine. In the mobile style, we'd probably have a `PlaybackEngineProvider` providing a list of the available voices along with the ability to create a `PlaybackEngine` for a specific voice. I can see two advantages: