@JayPanoz
Last active August 28, 2025 15:06
Readium – Playback API
interface Voice {
  id: string; // Unique identifier
  name: string; // Display name
  language: string; // BCP 47 language tag
}
interface SpeechContent {
  text: string; // Text or SSML content
  ssml?: boolean; // If true, text contains SSML
}
type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";
type PlaybackEvent =
  | "start"
  | "pause"
  | "resume"
  | "end"
  | "error"
  | "boundary"
  | "mark"
  | "idle"
  | "loading"
  | "ready";
interface PlaybackEngine {
  // Queue Management
  loadUtterances(contents: SpeechContent[]): void;
  // Voice config
  setRate(rate: number): void;
  setPitch(pitch: number): void;
  setVolume(volume: number): void;
  // Playback
  speak(utteranceIndex?: number): void;
  // State
  getState(): PlaybackState;
  // Events: on() returns a function that removes the listener
  on(
    event: PlaybackEvent,
    callback: (...args: any[]) => void
  ): () => void;
  // Cleanup
  destroy(): Promise<void>;
}
interface EngineProvider {
  // Voice Management
  getVoices(): Promise<Voice[]>;
  // Engine Creation
  createEngine(voice: Voice): Promise<PlaybackEngine>;
  // Lifecycle
  destroy(): Promise<void>;
}
interface PlaybackNavigator {
  // TBD
  // Playback controls + events: play(idx?), pause, resume, stop
  // navigation + events: next, previous, jumpTo(idx)
  // readiumSpeech-prefixed events?
}

Playback API - Requirements

This document outlines the requirements for the Playback API, which provides a unified interface for controlling text-to-speech (TTS) playback.

1.1 Playback Control

  • Start, pause, resume, and stop playback
  • Handle both individual and batched text/SSML input
  • Report current playback state (playing, paused, stopped; the PlaybackState type above also covers idle, loading, and ready)
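
The controls and states listed above could be modelled as a small transition table. This is only a sketch: `PlaybackState` is copied from the gist, while `ControlAction`, `transitions`, and `nextState` are hypothetical names, and treating "stopped" as the `"idle"` state is an assumption, since the type has no `"stopped"` member.

```typescript
// PlaybackState is copied from the gist; everything else here is hypothetical.
type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";
type ControlAction = "play" | "pause" | "resume" | "stop";

// Allowed transitions per action. Assumption: "stopped" maps to "idle".
const transitions: Record<ControlAction, Partial<Record<PlaybackState, PlaybackState>>> = {
  play: { ready: "playing", idle: "playing", paused: "playing" },
  pause: { playing: "paused" },
  resume: { paused: "playing" },
  stop: { playing: "idle", paused: "idle" },
};

// Returns the state after applying an action; invalid actions leave the state unchanged.
function nextState(current: PlaybackState, action: ControlAction): PlaybackState {
  return transitions[action][current] ?? current;
}
```

Ignoring invalid actions (rather than throwing) keeps the controls safe to call from UI code, but an engine could just as well emit an `"error"` event instead.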

1.2 Text Processing

  • Accept plain text and SSML input
  • Support multiple utterances
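
As a rough illustration of accepting both input kinds, individually or in batches, the helper below normalises strings into `SpeechContent` items. `toSpeechContent` and `toUtterances` are hypothetical names, and detecting SSML by a leading `<speak>` element is an assumed heuristic (it would miss documents that start with an XML prolog), not something the gist specifies.

```typescript
// SpeechContent is copied from the gist.
interface SpeechContent {
  text: string; // Text or SSML content
  ssml?: boolean; // If true, text contains SSML
}

// Assumption: input whose first element is <speak> is treated as SSML.
function toSpeechContent(input: string): SpeechContent {
  const trimmed = input.trim();
  const isSsml = /^<speak[\s>]/.test(trimmed);
  return isSsml ? { text: trimmed, ssml: true } : { text: trimmed };
}

// Accepts a single string or a batch, and drops entries that are empty after trimming.
function toUtterances(input: string | string[]): SpeechContent[] {
  const items = Array.isArray(input) ? input : [input];
  return items.map(toSpeechContent).filter((content) => content.text.length > 0);
}
```

The result can be passed directly to `loadUtterances(contents)`.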

1.3 Event System

  • Emit events for state changes
  • Provide word/sentence boundary information
  • Report errors and warnings
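
One way to back the `on(event, callback)` signature from the interface is a small emitter that returns an unsubscribe function. `PlaybackEventEmitter` and its `emit` method are hypothetical; the gist does not define event payloads, so the arguments passed to listeners are left untyped here, as in the original signature.

```typescript
// PlaybackEvent is copied from the gist; the emitter class is a hypothetical sketch.
type PlaybackEvent =
  | "start" | "pause" | "resume" | "end" | "error"
  | "boundary" | "mark" | "idle" | "loading" | "ready";
type Listener = (...args: any[]) => void;

class PlaybackEventEmitter {
  private listeners = new Map<PlaybackEvent, Set<Listener>>();

  // Registers a listener and returns a function that removes it again.
  on(event: PlaybackEvent, callback: Listener): () => void {
    let set = this.listeners.get(event);
    if (!set) {
      set = new Set<Listener>();
      this.listeners.set(event, set);
    }
    set.add(callback);
    const registered = set;
    return () => {
      registered.delete(callback);
    };
  }

  // Calls every listener for the event, in registration order.
  emit(event: PlaybackEvent, ...args: any[]): void {
    const set = this.listeners.get(event);
    if (!set) return;
    // Iterate over a copy so a listener may unsubscribe during dispatch.
    for (const listener of Array.from(set)) {
      listener(...args);
    }
  }
}
```

Returning the unsubscribe function from `on` avoids a separate `off` method and makes cleanup in `destroy()` straightforward.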

1.4 Voice Management

  • Select from available voices
  • Configure voice parameters (rate, pitch, volume)
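
A minimal sketch of both requirements follows. `pickVoice` and `clampParam` are hypothetical helpers, and the numeric ranges are an assumption borrowed from the Web Speech API's `SpeechSynthesisUtterance` (rate 0.1 to 10, pitch 0 to 2, volume 0 to 1); the gist itself does not state valid ranges.

```typescript
// Voice is copied from the gist.
interface Voice {
  id: string; // Unique identifier
  name: string; // Display name
  language: string; // BCP 47 language tag
}

// Prefers an exact (case-insensitive) language tag match, then the same primary language.
function pickVoice(voices: Voice[], language: string): Voice | undefined {
  const wanted = language.toLowerCase();
  const primary = wanted.split("-")[0];
  return (
    voices.find((v) => v.language.toLowerCase() === wanted) ??
    voices.find((v) => v.language.toLowerCase().split("-")[0] === primary)
  );
}

// Assumed ranges, taken from the Web Speech API rather than from the gist.
const ranges = { rate: [0.1, 10], pitch: [0, 2], volume: [0, 1] } as const;

// Clamps a parameter into its assumed range before it is handed to setRate/setPitch/setVolume.
function clampParam(param: keyof typeof ranges, value: number): number {
  const [min, max] = ranges[param];
  return Math.min(max, Math.max(min, value));
}
```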

Design

[WIP]

An EngineProvider (the interface above) lets you get the available voices and create PlaybackEngine instances, each using one specific voice, language, etc.
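
To make the provider's role concrete, here is an in-memory sketch that follows the `getVoices` / `createEngine` / `destroy` shape of the interface. `InMemoryEngineProvider`, `EngineStub`, and `activeEngines` are hypothetical; `EngineStub` is a reduced stand-in for `PlaybackEngine`, and destroying all created engines from the provider's `destroy()` is an assumed policy.

```typescript
// Voice is copied from the gist.
interface Voice {
  id: string;
  name: string;
  language: string;
}

// Hypothetical stand-in for the gist's PlaybackEngine, reduced to what this sketch needs.
interface EngineStub {
  readonly voice: Voice;
  destroy(): Promise<void>;
}

// Hypothetical provider; a real one would wrap a TTS engine or service.
class InMemoryEngineProvider {
  private engines = new Set<EngineStub>();

  constructor(private readonly voices: Voice[]) {}

  async getVoices(): Promise<Voice[]> {
    return this.voices.slice();
  }

  // Creates an engine bound to one voice; rejects voices this provider does not offer.
  async createEngine(voice: Voice): Promise<EngineStub> {
    if (!this.voices.some((v) => v.id === voice.id)) {
      throw new Error(`Unknown voice: ${voice.id}`);
    }
    const engine: EngineStub = {
      voice,
      destroy: async () => {
        this.engines.delete(engine);
      },
    };
    this.engines.add(engine);
    return engine;
  }

  // Assumption: destroying the provider also destroys every engine it created.
  async destroy(): Promise<void> {
    const created = Array.from(this.engines);
    await Promise.all(created.map((engine) => engine.destroy()));
  }

  get activeEngines(): number {
    return this.engines.size;
  }
}
```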

A PlaybackEngine is bound to one voice. Its parameters (rate, pitch, volume) can be set, it is loaded with utterances, it can preload with context, and it can speak the utterance at a given index.
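
The engine contract can be exercised without any audio by a fake that records what it would speak. Method names follow the `PlaybackEngine` interface; `FakePlaybackEngine` and `spoken` are hypothetical, and the event payloads, the synchronous "playback", and the return to `"ready"` after an utterance ends are all assumptions. Preloading is not modelled.

```typescript
// Types copied from the gist.
interface Voice { id: string; name: string; language: string }
interface SpeechContent { text: string; ssml?: boolean }
type PlaybackState = "playing" | "paused" | "idle" | "loading" | "ready";
type PlaybackEvent =
  | "start" | "pause" | "resume" | "end" | "error"
  | "boundary" | "mark" | "idle" | "loading" | "ready";
type Listener = (...args: any[]) => void;

// Hypothetical fake: records text instead of producing audio.
class FakePlaybackEngine {
  readonly spoken: string[] = [];
  private utterances: SpeechContent[] = [];
  private state: PlaybackState = "idle";
  private params = { rate: 1, pitch: 1, volume: 1 };
  private listeners = new Map<PlaybackEvent, Set<Listener>>();

  constructor(readonly voice: Voice) {}

  loadUtterances(contents: SpeechContent[]): void {
    this.utterances = contents.slice();
    this.state = "ready";
    this.emit("ready");
  }

  setRate(rate: number): void { this.params.rate = rate; }
  setPitch(pitch: number): void { this.params.pitch = pitch; }
  setVolume(volume: number): void { this.params.volume = volume; }

  // Speaks one utterance synchronously; emits "error" for an index that is not loaded.
  speak(utteranceIndex: number = 0): void {
    const utterance = this.utterances[utteranceIndex];
    if (utterance === undefined) {
      this.emit("error", new RangeError(`No utterance at index ${utteranceIndex}`));
      return;
    }
    this.state = "playing";
    this.emit("start", utteranceIndex);
    this.spoken.push(utterance.text); // stands in for audio output
    this.state = "ready"; // assumption: utterances stay loaded after playback
    this.emit("end", utteranceIndex);
  }

  getState(): PlaybackState { return this.state; }

  on(event: PlaybackEvent, callback: Listener): () => void {
    const set = this.listeners.get(event) ?? new Set<Listener>();
    this.listeners.set(event, set);
    set.add(callback);
    return () => { set.delete(callback); };
  }

  async destroy(): Promise<void> {
    this.listeners.clear();
    this.utterances = [];
    this.state = "idle";
  }

  private emit(event: PlaybackEvent, ...args: any[]): void {
    const set = this.listeners.get(event);
    if (!set) return;
    for (const listener of Array.from(set)) listener(...args);
  }
}
```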

A PlaybackNavigator then handles navigation, continuous play, etc.
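
Since `PlaybackNavigator` is still marked TBD, the following is only one possible shape, built from the method names in the interface comments (`play(idx?)`, `stop`, `next`, `previous`, `jumpTo(idx)`). `SketchNavigator` and `SpeakingEngine` are hypothetical, and continuous play is assumed to mean "advance to the next utterance whenever the engine emits `end`". Pause and resume are omitted.

```typescript
// Minimal slice of the gist's PlaybackEngine that this sketch depends on.
interface SpeakingEngine {
  speak(utteranceIndex?: number): void;
  on(event: "end", callback: () => void): () => void;
}

// Hypothetical navigator: tracks the current index and chains utterances on "end".
class SketchNavigator {
  private index = 0;
  private continuous = false;
  private readonly unsubscribe: () => void;

  constructor(
    private readonly engine: SpeakingEngine,
    private readonly count: number // number of loaded utterances
  ) {
    this.unsubscribe = engine.on("end", () => {
      if (this.continuous && this.index < this.count - 1) {
        this.jumpTo(this.index + 1);
      } else {
        this.continuous = false; // reached the last utterance, or single-shot playback
      }
    });
  }

  get currentIndex(): number {
    return this.index;
  }

  // Starts continuous play from idx (defaults to the current index).
  play(idx: number = this.index): void {
    this.continuous = true;
    this.jumpTo(idx);
  }

  // Stops chaining; the utterance in progress is left to the engine.
  stop(): void {
    this.continuous = false;
  }

  next(): boolean {
    return this.jumpTo(this.index + 1);
  }

  previous(): boolean {
    return this.jumpTo(this.index - 1);
  }

  // Speaks the utterance at idx; returns false when idx is out of range.
  jumpTo(idx: number): boolean {
    if (!Number.isInteger(idx) || idx < 0 || idx >= this.count) return false;
    this.index = idx;
    this.engine.speak(idx);
    return true;
  }

  destroy(): void {
    this.unsubscribe();
  }
}
```

Keeping the index in the navigator rather than the engine matches the split described above: the engine only speaks what it is told to, and the navigator decides what comes next.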

qnga commented Aug 28, 2025

After a talk with Hadrien, I understand a bit better what the issue was. Here is my proposal:

  • A non-visual ReadAloudNavigator using audio files (media overlays) and TTS to read content aloud, based on guided navigation documents. 
  • A component turning the Publication HTML files into guided navigation documents.
  • A component turning the Publication SMIL files (probably along with HTML files) into guided navigation documents following the original structure.
  • A component turning any piece of HTML into a guided navigation document for external use of the ReadAloudNavigator.
  • A new navigator putting together the Epub/Web navigators and this ReadAloudNavigator for an integrated experience.

If we agree on this, the Playback API turns again into what I thought it was: a technical API adapting TTS engines/services for use inside the ReadAloudNavigator. Then, we can freely use any of the discussed approaches without non-technical considerations.
