@JayPanoz
Last active August 28, 2025 15:06
Readium – Playback API
interface Voice {
  id: string; // Unique identifier
  name: string; // Display name
  language: string; // BCP 47 language tag
}

interface SpeechContent {
  text: string; // Text or SSML content
  ssml?: boolean; // If true, text contains SSML
}

type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";

type PlaybackEvent =
  | "start"
  | "pause"
  | "resume"
  | "end"
  | "error"
  | "boundary"
  | "mark"
  | "idle"
  | "loading"
  | "ready";

interface PlaybackEngine {
  // Queue Management
  loadUtterances(contents: SpeechContent[]): void;

  // Voice config
  setRate(rate: number): void;
  setPitch(pitch: number): void;
  setVolume(volume: number): void;

  speak(utteranceIndex?: number): void;

  // State
  getState(): PlaybackState;

  // Events
  on(
    event: PlaybackEvent,
    callback: (...args: any[]) => void
  ): () => void;

  // Cleanup
  destroy(): Promise<void>;
}

interface EngineProvider {
  // Voice Management
  getVoices(): Promise<Voice[]>;

  // Engine Creation
  createEngine(voice: Voice): Promise<PlaybackEngine>;

  // Lifecycle
  destroy(): Promise<void>;
}

interface PlaybackNavigator {
  // TBD
  // Playback controls + events: play(idx?), pause, resume, stop
  // navigation + events: next, previous, jumpTo(idx)
  // readiumSpeech-prefixed events?
}

Playback API - Requirements

This document outlines the requirements for the Playback API, which provides a unified interface for controlling text-to-speech (TTS) playback.

1.1 Playback Control

  • Start, pause, resume, and stop
  • Handle both individual and batched text/SSML input
  • Report current playback state (playing, paused, stopped)
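
One way to read the controls above is as transitions over the gist's `PlaybackState` type. The sketch below is only an illustration under assumptions the gist does not state: `ControlAction` and `nextState` are hypothetical names, the transition rules are guesses, and because `PlaybackState` has no "stopped" member, stop is mapped to "idle" here.

```typescript
// Mirrors the PlaybackState union declared in the gist.
type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";

// Hypothetical: the four controls from the requirement list.
type ControlAction = "start" | "pause" | "resume" | "stop";

// Assumed transition rules; any combination not listed leaves the state unchanged.
function nextState(state: PlaybackState, action: ControlAction): PlaybackState {
  switch (action) {
    case "start":
      // Assumption: playback can only start once utterances are ready.
      return state === "ready" ? "playing" : state;
    case "pause":
      return state === "playing" ? "paused" : state;
    case "resume":
      return state === "paused" ? "playing" : state;
    case "stop":
      // Assumption: stop maps to "idle" because the union has no "stopped".
      return state === "playing" || state === "paused" ? "idle" : state;
    default:
      return state;
  }
}
```

Keeping the rules in a pure function like this makes it easy to compare them against what each platform engine actually reports.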

1.2 Text Processing

  • Accept plain text and SSML input
  • Support multiple utterances
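
As an illustration of individual versus batched input, the sketch below normalizes both into the `SpeechContent[]` shape that `loadUtterances` expects. `SpeechInput` and `toUtterances` are hypothetical helpers, not part of the gist, and treating a bare string as plain text is an assumption.

```typescript
// Mirrors the SpeechContent interface declared in the gist.
interface SpeechContent {
  text: string; // Text or SSML content
  ssml?: boolean; // If true, text contains SSML
}

// Hypothetical convenience type: a bare string is treated as plain text.
type SpeechInput = string | SpeechContent;

// Accepts one item or a batch and always returns a list of utterances.
function toUtterances(input: SpeechInput | SpeechInput[]): SpeechContent[] {
  const items = Array.isArray(input) ? input : [input];
  return items.map((item) => (typeof item === "string" ? { text: item } : item));
}

// A single plain-text utterance.
const singleUtterance = toUtterances("Chapter one.");

// A batch mixing plain text and SSML.
const mixedBatch = toUtterances([
  "Plain text sentence.",
  { text: '<speak>Hello <break time="300ms"/> world</speak>', ssml: true },
]);
```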

1.3 Event System

  • Emit events for state changes
  • Provide word/sentence boundary information
  • Report errors and warnings
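
The `on(event, callback)` signature in the gist returns a function, which suggests an unsubscribe handle. A minimal sketch of that pattern follows. `PlaybackEventHub` and its `emit` method are hypothetical, and the boundary payload shape in the usage lines (`charIndex`, `charLength`) is an assumption, since the gist does not define event arguments.

```typescript
// Mirrors the PlaybackEvent union declared in the gist.
type PlaybackEvent =
  | "start" | "pause" | "resume" | "end" | "error"
  | "boundary" | "mark" | "idle" | "loading" | "ready";

type Listener = (...args: unknown[]) => void;

// Hypothetical helper an engine could use internally to implement on().
class PlaybackEventHub {
  private listeners = new Map<PlaybackEvent, Set<Listener>>();

  // Registers a callback and returns a function that removes it again.
  on(event: PlaybackEvent, callback: Listener): () => void {
    const set = this.listeners.get(event) ?? new Set<Listener>();
    this.listeners.set(event, set);
    set.add(callback);
    return () => {
      set.delete(callback);
    };
  }

  // Calls every listener registered for the event with the given arguments.
  emit(event: PlaybackEvent, ...args: unknown[]): void {
    const set = this.listeners.get(event);
    if (!set) return;
    // Copy first so a listener can unsubscribe while being notified.
    Array.from(set).forEach((listener) => listener(...args));
  }
}

const hub = new PlaybackEventHub();
const boundaries: unknown[] = [];
const stopListening = hub.on("boundary", (info) => {
  boundaries.push(info);
});
hub.emit("boundary", { charIndex: 0, charLength: 5 }); // assumed payload shape
stopListening();
hub.emit("boundary", { charIndex: 6, charLength: 5 }); // no longer delivered
```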

1.4 Voice Management

  • Select from available voices
  • Configure voice parameters (rate, pitch, volume)
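
Voice selection and parameter handling could look like the sketch below. `pickVoice` and `clamp` are hypothetical helpers. Falling back to the primary language subtag is an assumption, and so are the ranges in the usage lines, which are borrowed from the Web Speech API (volume 0 to 1, pitch 0 to 2) and may not match what other engines accept.

```typescript
// Mirrors the Voice interface declared in the gist.
interface Voice {
  id: string; // Unique identifier
  name: string; // Display name
  language: string; // BCP 47 language tag
}

// Prefers an exact language match, then any voice sharing the primary subtag.
function pickVoice(voices: Voice[], language: string): Voice | undefined {
  const wanted = language.toLowerCase();
  const primary = wanted.split("-")[0];
  return (
    voices.find((v) => v.language.toLowerCase() === wanted) ??
    voices.find((v) => v.language.toLowerCase().split("-")[0] === primary)
  );
}

// Keeps a parameter inside the range an engine accepts.
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

const sampleVoices: Voice[] = [
  { id: "a", name: "Voice A", language: "en-US" },
  { id: "b", name: "Voice B", language: "en-GB" },
  { id: "c", name: "Voice C", language: "fr-FR" },
];

const britishVoice = pickVoice(sampleVoices, "en-GB"); // exact match
const canadianFrenchVoice = pickVoice(sampleVoices, "fr-CA"); // falls back to fr-FR
const safeVolume = clamp(1.5, 0, 1); // assumed Web Speech volume range
const safePitch = clamp(-1, 0, 2); // assumed Web Speech pitch range
```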

Design

[WIP]

An EngineProvider lets you get the available voices and create instances of the PlaybackEngine, each using one specific voice, language, etc.

A PlaybackEngine is bound to a voice. Its parameters can be set, it is loaded with utterances, it can preload with context, and it lets you speak the utterance at a given index.
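
To make the provider-then-engine flow concrete, here is a sketch with in-memory stand-ins. `InMemoryProvider`, `InMemoryEngine`, and `demo` are hypothetical, they cover only a subset of the gist's `EngineProvider` and `PlaybackEngine` interfaces, no audio is produced, and the state changes (`idle` to `ready` on load, `playing` on speak) are assumptions.

```typescript
interface Voice { id: string; name: string; language: string; }
interface SpeechContent { text: string; ssml?: boolean; }

// Same members as the gist's PlaybackState, renamed for this sketch.
type EngineState = "playing" | "paused" | "idle" | "loading" | "ready";

// Stand-in for a subset of PlaybackEngine.
class InMemoryEngine {
  readonly voice: Voice;
  rate = 1;
  private contents: SpeechContent[] = [];
  private state: EngineState = "idle";

  constructor(voice: Voice) {
    this.voice = voice;
  }

  loadUtterances(contents: SpeechContent[]): void {
    this.contents = contents;
    // Assumption: a non-empty queue makes the engine "ready".
    this.state = contents.length > 0 ? "ready" : "idle";
  }

  setRate(rate: number): void {
    this.rate = rate;
  }

  speak(utteranceIndex: number = 0): void {
    if (utteranceIndex < 0 || utteranceIndex >= this.contents.length) {
      throw new RangeError(`No utterance at index ${utteranceIndex}`);
    }
    this.state = "playing";
  }

  getState(): EngineState {
    return this.state;
  }
}

// Stand-in for a subset of EngineProvider.
class InMemoryProvider {
  private voices: Voice[];

  constructor(voices: Voice[]) {
    this.voices = voices;
  }

  async getVoices(): Promise<Voice[]> {
    return this.voices.slice();
  }

  async createEngine(voice: Voice): Promise<InMemoryEngine> {
    return new InMemoryEngine(voice);
  }
}

// Provider -> voice -> engine -> utterances -> speak by index.
async function demo(): Promise<EngineState> {
  const provider = new InMemoryProvider([
    { id: "v1", name: "Demo voice", language: "en-US" },
  ]);
  const voices = await provider.getVoices();
  const voice = voices[0];
  if (!voice) throw new Error("No voice available");
  const engine = await provider.createEngine(voice);
  engine.setRate(1.25);
  engine.loadUtterances([{ text: "First." }, { text: "Second." }]);
  engine.speak(1);
  return engine.getState();
}
```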

A PlaybackNavigator then handles navigation, continuous play, etc.
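
Since `PlaybackNavigator` is still TBD in the gist, the following is only one possible reading of "navigation, continuous play": a navigator that listens for the engine's "end" event and speaks the next index. `MiniEngine`, `FakeEngine`, and `ContinuousNavigator` are hypothetical names, and `MiniEngine` is a reduced subset of `PlaybackEngine` chosen for this sketch.

```typescript
interface SpeechContent { text: string; ssml?: boolean; }

// Reduced, hypothetical subset of the gist's PlaybackEngine.
interface MiniEngine {
  loadUtterances(contents: SpeechContent[]): void;
  speak(utteranceIndex?: number): void;
  on(event: "end", callback: () => void): () => void;
}

// Test double: records what was "spoken" and lets the caller end an utterance.
class FakeEngine implements MiniEngine {
  readonly spoken: string[] = [];
  private contents: SpeechContent[] = [];
  private endListeners = new Set<() => void>();

  loadUtterances(contents: SpeechContent[]): void {
    this.contents = contents;
  }

  speak(utteranceIndex: number = 0): void {
    const content = this.contents[utteranceIndex];
    if (content) this.spoken.push(content.text);
  }

  on(_event: "end", callback: () => void): () => void {
    this.endListeners.add(callback);
    return () => {
      this.endListeners.delete(callback);
    };
  }

  // Simulates the engine reaching the end of the current utterance.
  finishCurrent(): void {
    Array.from(this.endListeners).forEach((listener) => listener());
  }
}

class ContinuousNavigator {
  private engine: MiniEngine;
  private count: number;
  private index = 0;
  private unsubscribe: () => void;

  constructor(engine: MiniEngine, contents: SpeechContent[]) {
    this.engine = engine;
    this.count = contents.length;
    engine.loadUtterances(contents);
    // Continuous play: when one utterance ends, move on to the next.
    this.unsubscribe = engine.on("end", () => {
      this.next();
    });
  }

  get currentIndex(): number {
    return this.index;
  }

  play(idx: number = 0): boolean {
    return this.jumpTo(idx);
  }

  next(): boolean {
    return this.jumpTo(this.index + 1);
  }

  previous(): boolean {
    return this.jumpTo(this.index - 1);
  }

  // Returns false, and speaks nothing, when idx is out of range.
  jumpTo(idx: number): boolean {
    if (idx < 0 || idx >= this.count) return false;
    this.index = idx;
    this.engine.speak(idx);
    return true;
  }

  destroy(): void {
    this.unsubscribe();
  }
}
```

Keeping the index bookkeeping in the navigator leaves the engine free to stay a thin adapter over whatever TTS backend a platform offers.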

@JayPanoz
Author

I think Mickael's point makes sense for any caller, not only a navigator.

To clarify, I am not disagreeing with this. Sorry if that came across as a disagreement, it was just an additional detail I forgot when listing requirements. Actually, we discussed that yesterday and it should indeed be removed. My bad, that was added out of TS habits.

I also notice a lack of information about the preloading state.

That’s a good point.

I can't see the point of the stop state and methods. Is it supposed to free resources?

That’s another good point. Do we want to have something that acts as cancel?

Re Voice configuration, it is not really thought out at the moment, it’s kinda here as a reminder this exists and should be handled. Sorry if that was unclear. I am actually working on this at the moment so any idea and feedback is highly appreciated. Thanks!

@JayPanoz
Author

JayPanoz commented Aug 26, 2025

Could we perhaps start with a review/update of the list of requirements?

It’s been sounding like we are trying to implement two different approaches (preloading multiple utterances and playing one vs miniplayer) at the same time with the current one, and it makes it difficult to come up with something in terms of types and interfaces, and help manage platform idiosyncrasies. 🙏

@qnga

qnga commented Aug 26, 2025

Sounds like a good idea. I suggest we even list the requirements for all TTS-related features in Readium so that we can properly split them into components with different responsibilities.

@qnga

qnga commented Aug 28, 2025

After a talk with Hadrien, I understand a bit better what the issue was. Here is my proposal:

  • A non-visual ReadAloudNavigator using audio files (media overlays) and TTS to read content aloud, based on guided navigation documents. 
  • A component turning the Publication HTML files into guided navigation documents.
  • A component turning the Publication SMIL files (probably along with HTML files) into guided navigation documents following the original structure.
  • A component turning any piece of HTML into a guided navigation document for external use of the ReadAloudNavigator.
  • A new navigator putting together the Epub/Web navigators and this ReadAloudNavigator for an integrated experience.

If we agree on this, the Playback API turns again into what I thought it was: a technical API adapting TTS engines/services for use inside the ReadAloudNavigator. Then, we can freely use any of the discussed approaches without non-technical considerations.
