This proposal introduces an Elo-style rating system to benchmark OVOS intent pipeline configurations using real user utterances and human-in-the-loop feedback.
Instead of traditional dataset validation, users are shown the predictions that two different intent pipeline configurations produce for the same utterance and asked to judge which prediction is more accurate. This approach allows us to:
- Benchmark pipelines using real-world utterances.
- Collect high-quality, user-validated labeled data.
- Engage the community with gamification features.
In more detail, the goals are:
- Real-world benchmarking: Measure how well intent pipelines perform on live data from the OpenData Dashboard.
- Data labeling: Collect ground-truth labels for ambiguous or unhandled utterances.
- Community engagement: Make participation more interactive and rewarding via gamification.
- Continuous evaluation: Track the performance of pipelines over time as models and configs evolve.
Each pipeline configuration (plugin combo + settings) is treated like a “player” in a competitive rating system.
Users are shown:
- One real utterance
- Two predictions (from different pipelines)
Users can select:
- ✅ A is better
- ✅ B is better
- ⚖️ Both are correct
- ❌ Both are wrong
- Pipelines gain or lose Elo points based on user feedback.
- The standard Elo formula updates both scores based on the current rating difference and the outcome (a minimal sketch follows this list).
- Tie and double-loss cases are also handled:
- Both correct → small Elo gain
- Both wrong → small Elo loss
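A minimal sketch of the update logic, assuming the standard Elo expected-score formula, a K-factor of 32, and small flat adjustments for the "both correct" / "both wrong" cases. The function and constant names are illustrative, not an existing OVOS API:

```python
# Hedged sketch of the Elo update for one user vote; constants are placeholders.
K_FACTOR = 32      # full adjustment for a decisive "A is better" / "B is better" vote
TIE_DELTA = 2.0    # small flat gain/loss when both pipelines are correct/wrong

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that pipeline A 'wins' given the current rating difference."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(rating_a: float, rating_b: float, outcome: str) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one vote.

    outcome is one of: "a_better", "b_better", "both_correct", "both_wrong".
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a

    if outcome == "a_better":
        return rating_a + K_FACTOR * (1 - exp_a), rating_b + K_FACTOR * (0 - exp_b)
    if outcome == "b_better":
        return rating_a + K_FACTOR * (0 - exp_a), rating_b + K_FACTOR * (1 - exp_b)
    if outcome == "both_correct":
        # Both predictions were judged correct: small flat gain for both pipelines.
        return rating_a + TIE_DELTA, rating_b + TIE_DELTA
    if outcome == "both_wrong":
        # Both predictions were judged wrong: small flat loss for both pipelines.
        return rating_a - TIE_DELTA, rating_b - TIE_DELTA
    raise ValueError(f"unknown outcome: {outcome}")
```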
- Define and store pipeline configuration hashes/IDs
- Set up backend to:
- Track Elo scores
- Store match results (utterance, predictions, user choice); a possible record format is sketched after this list
- Basic UI to show utterance + predictions, and collect feedback
- Serve utterances sampled from OpenData Dashboard
- Record matches for later analysis/labeled dataset generation
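The exact schema is still open; as one possibility, the backend could derive a deterministic configuration ID from the plugin combination and settings, and store each vote as a simple record. All names below are illustrative, not an existing OVOS schema:

```python
# Hedged sketch of the records the backend could persist (SQLite row or JSONL line).
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

def pipeline_config_id(plugins: list[str], settings: dict) -> str:
    """Deterministic ID for a pipeline configuration (plugin combo + settings)."""
    payload = json.dumps({"plugins": plugins, "settings": settings}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

@dataclass
class MatchResult:
    """One user judgement: an utterance, two predictions, and the vote."""
    utterance: str
    pipeline_a: str            # pipeline configuration ID
    pipeline_b: str
    prediction_a: str          # intent predicted by pipeline A
    prediction_b: str
    user_choice: str           # "a_better" | "b_better" | "both_correct" | "both_wrong"
    voter_id: str | None = None
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
```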
- Export all match data in a structured format (e.g., JSONL); an export sketch follows this list
- Label each utterance with:
- Ground truth (from majority or trusted votes)
- Pipeline performance metadata
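The export step could group stored votes by utterance and take the majority choice as the ground-truth label. The sketch below assumes the hypothetical `MatchResult` records from the previous sketch and writes one JSON object per line:

```python
# Hedged sketch of labeled-dataset export: majority vote per utterance, JSONL output.
import json
from collections import Counter, defaultdict

def export_labeled_dataset(matches: list[MatchResult], path: str) -> None:
    votes_per_utterance: dict[str, list[MatchResult]] = defaultdict(list)
    for match in matches:
        votes_per_utterance[match.utterance].append(match)

    with open(path, "w", encoding="utf-8") as f:
        for utterance, votes in votes_per_utterance.items():
            # Count which predicted intent users favored; ties/ambiguous stay unlabeled.
            winners: Counter[str] = Counter()
            for v in votes:
                if v.user_choice == "a_better":
                    winners[v.prediction_a] += 1
                elif v.user_choice == "b_better":
                    winners[v.prediction_b] += 1
                elif v.user_choice == "both_correct":
                    winners[v.prediction_a] += 1
                    winners[v.prediction_b] += 1
            ground_truth = winners.most_common(1)[0][0] if winners else None
            f.write(json.dumps({
                "utterance": utterance,
                "ground_truth": ground_truth,
                "num_votes": len(votes),
            }, ensure_ascii=False) + "\n")
```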
- User login or pseudo-anonymous ID system
- Contributor leaderboard (based on activity/accuracy)
- Badges and achievements (e.g., “100 matches rated”)
- Voting streaks or challenge modes
- Rating decay over time
- Confidence-based display (model certainty shown to user)
- Weight votes by trust score (agreement with majority); a weighting sketch follows this list
- Admin dashboard for reviewing edge cases
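One simple way to weight votes by trust is to scale the K-factor by a contributor's historical agreement with the majority. The function below is a rough illustration, not a settled formula:

```python
# Hedged sketch: scale the Elo K-factor by a contributor's trust score.
def trusted_k(base_k: float, agreements: int, total_votes: int) -> float:
    """Scale K by the fraction of a user's past votes that matched the majority."""
    if total_votes == 0:
        return base_k * 0.5            # unknown contributors get half weight
    trust = agreements / total_votes   # in [0, 1]
    return base_k * (0.5 + 0.5 * trust)
```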
- Frontend: Web app (React/Svelte/Flask+HTMX)
- Backend: FastAPI or Flask + SQLite/PostgreSQL (a minimal FastAPI sketch follows this list)
- Elo logic: Simple Python Elo implementation or off-the-shelf Elo module
- Storage: JSONL/Parquet for data export, database for live stats
- Authentication (optional): GitHub/Matrix/TOTP guest login
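To show how these pieces might fit together, here is a minimal FastAPI sketch with a single vote endpoint that reuses the `update_ratings` function sketched earlier. The route, request model, and in-memory storage are placeholders; a real service would persist ratings and matches to SQLite/PostgreSQL:

```python
# Hedged sketch of a vote endpoint; not an existing OVOS service.
from collections import defaultdict

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
ratings: dict[str, float] = defaultdict(lambda: 1000.0)  # pipeline config ID -> Elo

class Vote(BaseModel):
    utterance: str
    pipeline_a: str
    pipeline_b: str
    outcome: str  # "a_better" | "b_better" | "both_correct" | "both_wrong"

@app.post("/vote")
def submit_vote(vote: Vote) -> dict:
    # Update in-memory Elo scores and return the new ratings to the client.
    new_a, new_b = update_ratings(ratings[vote.pipeline_a], ratings[vote.pipeline_b], vote.outcome)
    ratings[vote.pipeline_a], ratings[vote.pipeline_b] = new_a, new_b
    return {"pipeline_a": new_a, "pipeline_b": new_b}
```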
If you're interested in helping out, we need:
- UI developer (build the user feedback interface)
- Backend developer (API, Elo logic, persistence)
- Gamification designer (badges, rewards, UX)
- Community testers (help test and improve the interface)
Feel free to fork and experiment. All contributions welcome!
All data and code in this project will be released under an OSI-approved license (e.g., Apache 2.0 or MIT).
🟡 Spec in progress – looking for volunteers