@JayPanoz
Last active August 28, 2025 15:06
Readium – Playback API
interface Voice {
  id: string;       // Unique identifier
  name: string;     // Display name
  language: string; // BCP 47 language tag
}

interface SpeechContent {
  text: string;   // Text or SSML content
  ssml?: boolean; // If true, text contains SSML
}

type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";

type PlaybackEvent =
  | "start"
  | "pause"
  | "resume"
  | "end"
  | "error"
  | "boundary"
  | "mark"
  | "idle"
  | "loading"
  | "ready";

interface PlaybackEngine {
  // Queue Management
  loadUtterances(contents: SpeechContent[]): void;

  // Voice config
  setRate(rate: number): void;
  setPitch(pitch: number): void;
  setVolume(volume: number): void;

  // Playback
  speak(utteranceIndex?: number): void;

  // State
  getState(): PlaybackState;

  // Events; returns an unsubscribe function
  on(
    event: PlaybackEvent,
    callback: (...args: any[]) => void
  ): () => void;

  // Cleanup
  destroy(): Promise<void>;
}

interface EngineProvider {
  // Voice Management
  getVoices(): Promise<Voice[]>;

  // Engine Creation
  createEngine(voice: Voice): Promise<PlaybackEngine>;

  // Lifecycle
  destroy(): Promise<void>;
}

interface PlaybackNavigator {
  // TBD
  // Playback controls + events: play(idx?), pause, resume, stop
  // navigation + events: next, previous, jumpTo(idx)
  // readiumSpeech-prefixed events?
}

Playback API - Requirements

This document outlines the requirements for the Playback API, which provides a unified interface for controlling text-to-speech (TTS) playback.

1.1 Playback Control

  • Start, pause, resume, and stop
  • Handle both individual and batched text/SSML input
  • Report current playback state (playing, paused, stopped)

1.2 Text Processing

  • Accept plain text and SSML input
  • Support multiple utterances

1.3 Event System

  • Emit events for state changes
  • Provide word/sentence boundary information
  • Report errors and warnings

1.4 Voice Management

  • Select from available voices
  • Configure voice parameters (rate, pitch, volume)

Design

[WIP]

A PlaybackEngineProvider lets you list the available voices and create PlaybackEngine instances for one specific voice, language, etc.

Each PlaybackEngine uses a single voice; its parameters can be set, it is loaded with utterances, it can preload with context, and it lets you speak the utterance at a given index.

A PlaybackNavigator then handles navigation, continuous play, etc.
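The intended flow can be sketched with minimal in-memory stand-ins. The MockProvider and MockEngine classes below are hypothetical illustrations conforming to the interfaces above, not part of the proposal; a real provider would wrap the Web Speech API, a cloud TTS service, etc.

```typescript
// Hypothetical in-memory stand-ins for EngineProvider and PlaybackEngine,
// only to illustrate the provider -> engine flow.
interface Voice { id: string; name: string; language: string; }
interface SpeechContent { text: string; ssml?: boolean; }
type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";
type Listener = (...args: any[]) => void;

class MockEngine {
  private state: PlaybackState = "idle";
  private utterances: SpeechContent[] = [];
  private listeners = new Map<string, Set<Listener>>();

  loadUtterances(contents: SpeechContent[]): void {
    this.utterances = contents;
    this.state = "ready";
    this.emit("ready");
  }
  setRate(_rate: number): void {}
  setPitch(_pitch: number): void {}
  setVolume(_volume: number): void {}
  speak(utteranceIndex = 0): void {
    if (utteranceIndex >= this.utterances.length) {
      this.emit("error", new Error("utterance index out of range"));
      return;
    }
    this.state = "playing";
    this.emit("start", utteranceIndex);
    // A real engine would synthesize audio asynchronously here.
    this.state = "idle";
    this.emit("end", utteranceIndex);
  }
  getState(): PlaybackState { return this.state; }
  // Returns an unsubscribe function, as in the interface above.
  on(event: string, callback: Listener): () => void {
    if (!this.listeners.has(event)) this.listeners.set(event, new Set());
    this.listeners.get(event)!.add(callback);
    return () => { this.listeners.get(event)?.delete(callback); };
  }
  async destroy(): Promise<void> { this.listeners.clear(); }
  private emit(event: string, ...args: any[]): void {
    this.listeners.get(event)?.forEach((cb) => cb(...args));
  }
}

class MockProvider {
  async getVoices(): Promise<Voice[]> {
    return [{ id: "v1", name: "Test Voice", language: "en-US" }];
  }
  async createEngine(_voice: Voice): Promise<MockEngine> {
    return new MockEngine();
  }
  async destroy(): Promise<void> {}
}

// Typical caller flow: list voices, create an engine for one voice,
// configure it, load utterances, subscribe to events, speak by index.
async function demo(): Promise<string[]> {
  const events: string[] = [];
  const provider = new MockProvider();
  const [voice] = await provider.getVoices();
  const engine = await provider.createEngine(voice);
  const off = engine.on("end", (i: number) => events.push(`end:${i}`));
  engine.setRate(1.0);
  engine.loadUtterances([
    { text: "Hello world" },
    { text: "<speak>Hi</speak>", ssml: true },
  ]);
  engine.speak(0);
  off(); // unsubscribe
  await engine.destroy();
  return events;
}
```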

@mickael-menu

  getCurrentUtteranceIndex(): number;
  getUtteranceCount(): number;

I'm not sure these APIs are useful for the caller. The list of utterances is useful for preloading/context purposes but in my opinion the source of truth for the currently played utterance should be in the navigator.

  getVoices(): Promise<SpeechSynthesisVoice[]>;
  setVoice(voiceURI: string): void;
  setRate(rate: number): void;
  setPitch(pitch: number): void;
  setVolume(volume: number): void;

On Mobile I think we'll use a Preferences API for these settings, so they can be exposed by the navigator.

@JayPanoz
Author

Note as a “hidden” requirement that it has to work as a standalone module for web consumers as well, who will not rely on a Navigator and Preferences API.

In Readium Speech we will probably use an init where you pass your engines, and you can getVoices() after that. As for pitch, rate, volume, etc., we discussed this morning that they should go into loadUtterances in some form, as some TTS engines require these to be set for each utterance and otherwise do not work well.
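If per-utterance settings do end up in loadUtterances, one possible shape (an assumption for illustration, not a settled design) is to carry optional parameters on each utterance and merge them with engine-level defaults:

```typescript
// Hypothetical extension of SpeechContent with per-utterance parameters,
// since some TTS engines require rate/pitch/volume per utterance.
// Names and ranges below are assumptions, not part of the proposal.
interface UtteranceParams {
  rate?: number;   // e.g. 0.1-10, engine-dependent
  pitch?: number;  // e.g. 0-2
  volume?: number; // e.g. 0-1
}

interface SpeechContentWithParams {
  text: string;
  ssml?: boolean;
  params?: UtteranceParams;
}

// Merge engine-level defaults with per-utterance overrides before
// handing each utterance to the underlying engine.
function resolveParams(
  defaults: Required<UtteranceParams>,
  content: SpeechContentWithParams
): Required<UtteranceParams> {
  return { ...defaults, ...content.params };
}
```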

@qnga

qnga commented Aug 26, 2025

Note as a “hidden” requirement that it has to work as a standalone module for web consumers as well, who will not rely on a Navigator and Preferences API.

I think Mickael's point makes sense for any caller, not only a navigator. The API caller is responsible for loading the utterances and starting playing each one, so it's aware of the number of utterances and of the current one. And, as Mickaël said, it probably should be the source of truth.

I also notice a lack of information about the preloading state. We need to report to the caller when the engine is ready to play the utterances, because the caller should know when playback is starving (we'll try to limit this, but we can't prevent it entirely). The state could be `playing`, `paused`, `ready`, `loading`, or `idle`. I can't see the point of the `stop` state and methods. Is it supposed to free resources?

I suggest again, as I did during the call, separating voice selection from the playback engine. In the mobile style, we'd probably have a PlaybackEngineProvider providing a list of the available voices along with the ability to create a PlaybackEngine for a specific voice. I can see two advantages:

  • A bit less mutability in the engine implementation and a bit more simplicity in the contract. Online engines would have to start the work from scratch again if the voice changes anyway.
  • We may need several instances of the same playback engine with different voices for preloading purposes.
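The preloading state mentioned above could be consumed like this. A sketch assuming the `ready`/`loading` states under discussion; EngineLike is a minimal structural type for illustration, not a proposed interface:

```typescript
// Minimal structural type for illustration only.
type EngineLike = {
  getState(): string;
  on(event: string, callback: (...args: any[]) => void): () => void;
  speak(utteranceIndex?: number): void;
};

// Speak immediately if the engine is ready; otherwise wait for the
// "ready" event once, unsubscribe, and then speak. This is how a caller
// could gate playback to avoid starving.
function speakWhenReady(engine: EngineLike, index: number): Promise<void> {
  return new Promise((resolve) => {
    const start = () => {
      engine.speak(index);
      resolve();
    };
    if (engine.getState() === "ready") {
      start();
    } else {
      const off = engine.on("ready", () => {
        off();
        start();
      });
    }
  });
}
```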

@JayPanoz
Author

I think Mickael's point makes sense for any caller, not only a navigator.

To clarify, I am not disagreeing with this. Sorry if it came across as a disagreement; it was just an additional detail I had forgotten when listing requirements. We actually discussed that yesterday and it should indeed be removed. My bad, it was added out of TS habits.

I also notice a lack of information about the preloading state.

That’s a good point.

I can't see the point of the stop state and methods. Is it supposed to free resources?

That’s another good point. Do we want to have something that acts as a cancel?

Re voice configuration: it is not really thought out at the moment; it’s mostly here as a reminder that this exists and should be handled. Sorry if that was unclear. I am actually working on this right now, so any ideas and feedback are highly appreciated. Thanks!

@JayPanoz
Author

JayPanoz commented Aug 26, 2025

Could we perhaps start with a review/update of the list of requirements?

It sounds like we are trying to implement two different approaches (preloading multiple utterances and playing one, vs. a mini-player) at the same time with the current list, which makes it difficult to come up with something in terms of types and interfaces, and to manage platform idiosyncrasies. 🙏

@qnga

qnga commented Aug 26, 2025

Sounds like a good idea. I suggest even listing the requirements for all TTS-related features in Readium, so that we can properly split them into components with different responsibilities.

@qnga

qnga commented Aug 28, 2025

After a talk with Hadrien, I understand a bit better what the issue was. Here is my proposal:

  • A non-visual ReadAloudNavigator using audio files (media overlays) and TTS to read content aloud, based on guided navigation documents. 
  • A component turning the Publication HTML files into guided navigation documents.
  • A component turning the Publication SMIL files (probably along with HTML files) into guided navigation documents following the original structure.
  • A component turning any piece of HTML into a guided navigation document for external use of the ReadAloudNavigator.
  • A new navigator putting together the EPUB/Web navigators and this ReadAloudNavigator for an integrated experience.

If we agree on this, the Playback API turns again into what I thought it was: a technical API adapting TTS engines/services for use inside the ReadAloudNavigator. Then, we can freely use any of the discussed approaches without non-technical considerations.
