Intent Pipeline Elo Benchmarking System

Overview

This proposal introduces an Elo-style rating system to benchmark OVOS intent pipeline configurations using real user utterances and human-in-the-loop feedback.

Instead of traditional dataset validation, users are presented with predictions from two different intent pipeline configurations and asked to judge which one is more accurate. This approach allows us to:

  • Benchmark pipelines using real-world utterances.
  • Collect high-quality, user-validated labeled data.
  • Engage the community with gamification features.

Motivation

  • Real-world benchmarking: Measure how well intent pipelines perform on live data from the OpenData Dashboard.
  • Data labeling: Collect ground-truth labels for ambiguous or unhandled utterances.
  • Community engagement: Make participation more interactive and rewarding via gamification.
  • Continuous evaluation: Track the performance of pipelines over time as models and configs evolve.

Core Concept

Each pipeline configuration (plugin combo + settings) is treated like a “player” in a competitive rating system.

Judgement UI

Users are shown:

  • One real utterance
  • Two predictions (from different pipelines)

Users can select:

  1. A is better
  2. B is better
  3. ⚖️ Both are correct
  4. Both are wrong

Elo Rating System

  • Pipelines gain or lose Elo points based on user feedback.
  • The standard Elo formula updates scores based on the current rating difference and the outcome (see the sketch after this list).
  • Tie and double-loss cases are also handled:
    • Both correct → small Elo gain
    • Both wrong → small Elo loss
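
Below is a minimal Python sketch of this update, assuming a K-factor of 32 and flat ±2 adjustments for the "both correct" / "both wrong" cases; the exact constants are placeholders, not part of this spec.

import math

K = 32  # assumed K-factor; tune once real data is available


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + math.pow(10, (rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, outcome: str) -> tuple:
    """Return updated (rating_a, rating_b) for one match.

    outcome is one of: "a_better", "b_better", "both_correct", "both_wrong".
    """
    e_a = expected_score(rating_a, rating_b)
    if outcome == "a_better":
        return rating_a + K * (1 - e_a), rating_b - K * (1 - e_a)
    if outcome == "b_better":
        return rating_a - K * e_a, rating_b + K * e_a
    if outcome == "both_correct":
        return rating_a + 2, rating_b + 2  # small flat gain for both
    return rating_a - 2, rating_b - 2      # "both_wrong": small flat loss

For example, update_elo(1000, 1000, "a_better") moves A to 1016 and B to 984.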

Implementation Goals

Phase 1: Core System

  • Define and store pipeline configuration hashes/IDs (see the hashing sketch after this list)
  • Set up backend to:
    • Track Elo scores
    • Store match results (utterance, predictions, user choice)
  • Basic UI to show utterance + predictions, and collect feedback
  • Serve utterances sampled from OpenData Dashboard
  • Record matches for later analysis/labeled dataset generation
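
One possible way to derive stable configuration IDs is to hash a canonical serialization of the config; the plugin names and settings below are purely illustrative.

import hashlib
import json


def pipeline_config_id(config: dict) -> str:
    """Stable short ID for a pipeline configuration (plugin combo + settings)."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Illustrative configuration only; real pipeline/plugin names depend on the deployment.
config_id = pipeline_config_id({
    "pipeline": ["converse", "adapt_high", "padatious_high", "fallback_low"],
    "padatious": {"conf_high": 0.95},
})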

Phase 2: Dataset Generation

  • Export all match data in a structured format (e.g., JSONL; see the example record after this list)
  • Label each utterance with:
    • Ground truth (from majority or trusted votes)
    • Pipeline performance metadata
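
A hypothetical JSONL match record (field names and values are illustrative; the actual schema is open for discussion):

{
  "utterance": "turn on the kitchen lights",
  "pipeline_a": "a1b2c3d4e5f60718",
  "pipeline_b": "9f8e7d6c5b4a3210",
  "prediction_a": {"intent": "HomeAssistantSkill:TurnOnDevice", "confidence": 0.82},
  "prediction_b": {"intent": null, "confidence": null},
  "user_vote": "a_better",
  "lang": "en-US",
  "timestamp": "2025-06-08T16:35:00Z"
}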

Phase 3: Gamification

  • User login or pseudo-anonymous ID system
  • Contributor leaderboard (based on activity/accuracy)
  • Badges and achievements (e.g., “100 matches rated”)
  • Voting streaks or challenge modes

Phase 4: Advanced Features (optional)

  • Rating decay over time
  • Confidence-based display (model certainty shown to user)
  • Weight votes by trust score (agreement with majority)
  • Admin dashboard for reviewing edge cases

Tech Stack Suggestions

  • Frontend: Web app (React/Svelte/Flask+HTMX)
  • Backend: FastAPI or Flask + SQLite/PostgreSQL
  • Elo logic: Simple Python Elo implementation or off-the-shelf Elo module
  • Storage: JSONL/Parquet for data export, database for live stats
  • Authentication (optional): GitHub/Matrix/TOTP guest login

Volunteer Tasks

If you're interested in helping out, we need:

  • UI developer (build the user feedback interface)
  • Backend developer (API, Elo logic, persistence)
  • Gamification designer (badges, rewards, UX)
  • Community testers (help test and improve the interface)

Feel free to fork and experiment. All contributions welcome!


License

All data and code in this project will be released under an OSI-approved license (e.g., Apache 2.0 or MIT).


Status

🟡 Spec in progress – looking for volunteers

STT Pipeline Elo Benchmarking System

Overview

This proposal extends the Elo-style rating system to benchmark Speech-to-Text (STT) pipelines using real-world audio samples from existing datasets.

Users will be presented with:

  • An audio sample to play
  • Ground-truth metadata (language, dataset name)
  • Two competing STT transcriptions (from different pipelines)

They will vote on which transcription is more accurate. This setup allows us to:

  • Evaluate STT pipelines on real-world data
  • Collect labeled user preference data
  • Crowdsource performance feedback
  • Gamify STT evaluation to make it engaging for contributors

Core Concept

Each STT model + configuration is treated like a “player” in an Elo-based system. A match compares two STT outputs on the same utterance.

Judgement UI

Users are shown:

  • 🔉 Audio playback button
  • 🌍 Language + Dataset metadata
  • 📜 Ground-truth transcription (optional toggle)
  • 🅰️ Transcription A
  • 🅱️ Transcription B

Users select one of:

  1. A is better
  2. B is better
  3. ⚖️ Tie (both are about equally good)
  4. Both suck (neither is acceptable)

Elo Rating Logic

  • Winners gain Elo points, losers lose points (the same update sketched in the intent section applies).
  • Tie: both pipelines gain a small Elo bump.
  • Both bad: both lose a small amount.

Example Entry Display


🔉 [▶️ Play Audio]
📄 Language: pt-PT
📂 Dataset: CORAA (TEDx segment)
📜 Ground Truth: "o jovem gaspar está a estudar engenharia informática"

🅰️ A: "o jovem caspar está estudar engenharia informática"
🅱️ B: "o jovem gaspar esta estudar engenheira informática"

[ A is better ] [ B is better ] [ Tie ] [ Both suck ]


Implementation Goals

Phase 1: MVP

  • Fetch or stream audio samples from known datasets (e.g., Common Voice, CORAA, FLEURS)
  • Render audio + metadata + two transcriptions
  • Log user votes and metadata
  • Store and update Elo scores for each pipeline

Phase 2: Dataset Building

  • Export human-voted comparisons in a structured format
  • Tag data with correctness labels for each transcription
  • Derive approximate quality ratings per STT config

Phase 3: Gamification

  • Leaderboard for top contributors
  • Badges for “STT Whisperer”, “Dataset Diver”, etc.
  • Streak modes: quick-fire comparisons
  • Option to show accuracy score vs ground truth after vote
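
If the post-vote accuracy feedback is implemented, word error rate (WER) against the ground-truth transcription is the natural metric. A self-contained sketch with no external dependencies (libraries such as jiwer could be used instead):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

In the example entry display above, each transcription would be scored against the ground truth and the result shown to the user after they vote.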

Phase 4: Advanced Features (Optional)

  • Trust-weighted voting (e.g., more weight to experienced users)
  • Rating decay for older scores
  • Multi-language support for evaluation
  • Multi-pipeline tournament mode (e.g., round robin ratings)

Tech Stack Suggestions

  • Frontend: HTMX/Flask for fast prototyping, or React/Svelte
  • Backend: FastAPI or Flask + SQLite/PostgreSQL
  • Audio: HTML5 audio or Web Audio API
  • Data: Use HF Datasets or Parquet for data storage/export
  • Pipeline Matching: Sample audio and match pipelines with similar Elo ratings
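
A minimal sketch of the Elo-proximity matchmaking idea; the 100-point window is an arbitrary placeholder.

import random


def pick_match(elo_scores: dict, max_gap: float = 100.0) -> tuple:
    """Pick two pipelines with similar Elo ratings for the next comparison.

    elo_scores maps pipeline/config IDs to their current rating.
    """
    a = random.choice(list(elo_scores))
    close = [p for p, r in elo_scores.items()
             if p != a and abs(r - elo_scores[a]) <= max_gap]
    others = [p for p in elo_scores if p != a]
    # Prefer a close opponent, fall back to anyone if none is within the window.
    b = random.choice(close) if close else random.choice(others)
    return a, b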

Volunteer Tasks

We're looking for help with:

  • UI/UX development for the rating interface
  • Backend logic and Elo algorithm
  • Integration with existing open STT models
  • Dataset integration (e.g., Common Voice, CORAA)
  • Gamification and community engagement features

All contributions welcome!


License

All code and evaluation data will be released under an OSI-approved license (e.g., Apache 2.0 or MIT).


Status

🟡 Spec in progress – looking for volunteers


TTS Pipeline Elo Benchmarking System

Overview

This proposal defines a human-in-the-loop evaluation framework to benchmark Text-to-Speech (TTS) pipelines using an Elo-style rating system. Each comparison evaluates two different TTS outputs of the same sentence.

Unlike STT or intent evaluation, TTS quality depends on multiple perceptual dimensions, so we track two separate Elo scores:

  • 🎯 Pronunciation Accuracy: Correctness of phonemes and words.
  • 🎵 Naturalness / Prosody: Human-likeness, rhythm, stress, and flow.

Core Concept

Each TTS system (model + voice + config) is treated as a "player." Two audio samples are generated from the same input text and compared by a human judge.

Judgement UI

Users are shown:

  • 📜 Input text

  • 🔉 Two audio players (A and B)

  • 🗳️ Voting panel for:

    • 🎯 Pronunciation:

      • A is better
      • B is better
      • Tie
      • Both bad
    • 🎵 Naturalness:

      • A is better
      • B is better
      • Tie
      • Both bad

Optionally:

  • 🔍 Dataset/lang/source (used internally or toggled by user)

Example Entry Display


📜 Input Text:
"A inteligência artificial pode transformar o futuro da educação."

🔉 A: [▶️ Play A]
🔉 B: [▶️ Play B]

🎯 Pronunciation
[ A is better ] [ B is better ] [ Tie ] [ Both bad ]

🎵 Naturalness
[ A is better ] [ B is better ] [ Tie ] [ Both bad ]


Elo Rating System

Each TTS configuration has two Elo scores:

  • elo_pronunciation
  • elo_naturalness

Each vote updates the appropriate Elo score using standard Elo logic.

  • Tie → small gain for both
  • Both bad → small loss for both
  • Winner vs loser → adjusted gain/loss based on rating delta
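
A sketch of how the dual scores could be stored and updated; the starting rating, K-factor, and flat tie/both-bad deltas are placeholder choices.

import math
from dataclasses import dataclass

K = 32  # assumed K-factor


def elo_pair(r_a: float, r_b: float, score_a: float) -> tuple:
    """One standard Elo step; score_a is 1.0 if A wins, 0.0 if B wins."""
    e_a = 1.0 / (1.0 + math.pow(10, (r_b - r_a) / 400))
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))


@dataclass
class TTSRating:
    """Two independent Elo scores per TTS configuration."""
    elo_pronunciation: float = 1000.0
    elo_naturalness: float = 1000.0


def apply_vote(a: TTSRating, b: TTSRating, dimension: str, outcome: str) -> None:
    """Route a vote to the matching Elo axis."""
    attr = "elo_pronunciation" if dimension == "pronunciation" else "elo_naturalness"
    r_a, r_b = getattr(a, attr), getattr(b, attr)
    if outcome in ("a_better", "b_better"):
        r_a, r_b = elo_pair(r_a, r_b, 1.0 if outcome == "a_better" else 0.0)
    elif outcome == "tie":
        r_a, r_b = r_a + 2, r_b + 2  # small gain for both
    else:  # "both_bad"
        r_a, r_b = r_a - 2, r_b - 2  # small loss for both
    setattr(a, attr, r_a)
    setattr(b, attr, r_b)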

Implementation Goals

Phase 1: Core Benchmarking Loop

  • Generate TTS outputs from multiple systems using shared input text
  • UI to present audio A/B test
  • Two voting blocks: pronunciation and naturalness
  • Backend logic to store results and update dual Elo scores
  • Track source text and system IDs for reproducibility

Phase 2: Dataset Integration

  • Option to use curated sentence sets (e.g., CSS10, LJSpeech, CMU ARCTIC)
  • Support for multilingual text-to-speech comparison
  • Export votes and ratings for research analysis

Phase 3: Gamification

  • Leaderboard of top contributors
  • Achievements like “Ear for Detail”, “TTS Judge”
  • Accuracy feedback (e.g., agreement with majority)
  • Fast rating mode (hotkey controls)

Phase 4: Advanced Features (Optional)

  • Trust-weighted voting
  • Context-aware voting (e.g., emphasis on tricky words)
  • Audio quality checks (clipping, noise)
  • Blind tests with human voice samples included

Tech Stack Suggestions

  • Frontend: Web-based interface with audio players, HTMX or React
  • Backend: FastAPI or Flask, storing dual Elo scores
  • Audio: HTML5 audio with local or streamed files
  • TTS Pipelines: Run offline or via inference endpoints (Mycroft Mimic3, Coqui TTS, Bark, etc.); a generation sketch follows this list
  • Data export: JSONL/Parquet files with per-vote metadata
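
As a concrete example of offline generation from a shared input text, a sketch assuming Coqui TTS's documented Python API; the model names are illustrative, and any engine that can write a wav file works just as well.

from TTS.api import TTS  # Coqui TTS

SENTENCE = "Artificial intelligence may transform the future of education."

# Two illustrative contenders; swap in whichever models/configs are being benchmarked.
for label, model_name in (
    ("a", "tts_models/en/ljspeech/tacotron2-DDC"),
    ("b", "tts_models/en/ljspeech/glow-tts"),
):
    tts = TTS(model_name=model_name, progress_bar=False)
    tts.tts_to_file(text=SENTENCE, file_path=f"sample_{label}.wav")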

Volunteer Tasks

We are looking for help with:

  • UI development (A/B audio interface)
  • Backend Elo system with dual scores
  • Integration with existing TTS systems
  • Multilingual sentence sourcing
  • UX and gamification

License

All source code and rating data will be released under an OSI-approved license (Apache 2.0 or MIT preferred). Generated audio may be subject to model license restrictions.


Status

🟡 Spec in progress – seeking volunteers

Wake Word Detection Elo Benchmarking System

Overview

This system benchmarks Wake Word Detection (WWD) models using an Elo-style crowd-powered evaluation loop. It presents users with short audio clips and asks whether they contain the wake word or not — then compares this against multiple model predictions.

This setup allows:

  • Evaluation of models on real-world and edge-case examples
  • Discovery of false positives/negatives
  • Labeling and refinement of new test sets
  • Gamified engagement with users

Core Concept

Each WWD model (with its specific settings) is a "player" in an Elo tournament.

Users listen to short audio samples and:

  • Decide if they contain a wake word
  • See predictions from 2 models (or optionally more)
  • Select which model performed better (or neither)

Dual Elo Scoring

Each model receives two Elo scores:

  • 🟢 Recall Score (catching real wake words)
  • 🔴 Precision Score (avoiding false alarms)

Judgement UI

Users are shown:

  • 🔉 Play audio (usually 1–3 seconds)
  • ✅ Human label: wake word or not? (user confirms/overrides)
  • 🤖 Prediction from Model A and Model B: "Wake word" or "No wake word"
  • 🗳️ Which model performed better?

Response Options

  1. A is better
  2. B is better
  3. ⚖️ Tie
  4. Both are wrong

Example Entry Display


🔉 [▶️ Play audio]
🎧 Ground Truth: Wake word present (e.g., "Hey Mycroft")

🤖 Model A: No wake word
🤖 Model B: Wake word detected

[ A is better ] [ B is better ] [ Tie ] [ Both wrong ]


Elo Scoring Rules

Each model's precision and recall scores are updated:

  • If the sample contains a wake word:
    • Correct detections improve recall Elo
    • Missed detections hurt recall Elo
  • If the sample does not contain a wake word:
    • False positives hurt precision Elo
    • Correct silence improves precision Elo

Users vote on relative quality, not just raw correctness — letting us crowdsource nuanced judgment.
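
A minimal sketch of that routing; the flat delta, and how the pairwise A/B vote scales it, are open design questions rather than part of this spec.

def wake_word_elo_deltas(ground_truth: bool, prediction: bool, delta: float = 4.0) -> dict:
    """Map one model's prediction on one clip to adjustments of its two Elo scores.

    ground_truth: the clip actually contains the wake word.
    prediction:   the model reported a detection.
    """
    if ground_truth:
        # Wake word present: this clip only carries information about recall.
        return {"recall": delta if prediction else -delta, "precision": 0.0}
    # No wake word present: this clip only carries information about precision.
    return {"precision": -delta if prediction else delta, "recall": 0.0}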


Implementation Goals

Phase 1: MVP

  • Build an interface to present short audio + predictions
  • Display ground truth label
  • Accept user votes on model performance
  • Update model Elo scores (precision & recall separately)

Phase 2: Data Collection & Curation

  • Import audio from real usage logs or datasets (e.g., Precise Wake Words, Porcupine test sets)
  • Store user decisions as labeled dataset
  • Label edge cases (background speech, accents, cut-off wake words)

Phase 3: Gamification

  • Leaderboard for “Wake Word Whisperers”
  • Awards for catching false positives/negatives
  • Confidence training (blur prediction labels unless user wants to see them)

Phase 4: Advanced Features

  • Compare >2 models at a time
  • Model match history: who wins most often?
  • Model vetoing (e.g., too many false alarms → disabled)

Tech Stack Suggestions

  • Frontend: HTMX, React, or Svelte
  • Backend: Flask/FastAPI with SQL for model scores
  • Audio: Local .wav or streamed short clips
  • Wake Word Models: Precise, Porcupine, Vosk, Whisper keyword spotting, custom ONNX
  • Vote Format: JSONL logs with user choice, predictions, ground truth, and audio reference

Example JSONL Record

{
  "audio_file": "sample_135.wav",
  "ground_truth": true,
  "model_a_prediction": false,
  "model_b_prediction": true,
  "user_vote": "b_better",
  "lang": "en",
  "wake_word": "hey mycroft"
}

Volunteer Tasks

  • Wake word audio curation (wake word + hard negatives)
  • Frontend to present short clips and model predictions
  • Elo score update logic per class (recall vs precision)
  • Gamified dashboard

License

All UI, model evaluations, and labeled datasets will be released under a permissive license such as Apache 2.0 or MIT. Wake word audio may require filtering by license or recording source.


Status

🟡 Spec in progress – seeking contributors
