Documentation Index
Fetch the complete documentation index at: https://podonos.com/docs/llms.txt
Use this file to discover all available pages before exploring further.
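To illustrate, here is a minimal sketch of parsing an llms.txt index into (title, URL) pairs. It assumes the conventional llms.txt layout of markdown bullet links; check the actual file at https://podonos.com/docs/llms.txt, which may differ.

```python
import re

def parse_llms_txt(text: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from an llms.txt index.

    Assumes markdown-style links like "- [Title](https://...)",
    the common llms.txt convention (an assumption, not a
    guarantee about the Podonos file).
    """
    return re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)", text)

# Invented sample content for demonstration only.
sample = """# Podonos Docs
- [Templates](https://podonos.com/docs/templates): pre-built evaluations
- [Custom review](https://podonos.com/docs/custom): design your own
"""
pages = parse_llms_txt(sample)
```

Each discovered URL can then be fetched individually before exploring further.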
Two ways to design an evaluation
Use a Podonos template
Pick from a library of pre-built, science-backed evaluation templates. Question wording, scale, anchors, and instructions are already calibrated.
Bring your own design
Write a custom evaluation. Our team reviews and proposes edits before it goes live.
Podonos templates
We maintain a library of evaluation templates that map to common research and product questions. Each template carries:
- Calibrated question wording. No ambiguous “naturalness” — the wording is the version that has produced the most consistent inter-evaluator agreement in our internal validation.
- Scale + anchors. The right number of points for the question, with concrete example audios at each level.
- Pre-set attention checks appropriate to the evaluation type.
- Default evaluator count and assignment policy tuned to the typical confidence interval customers want.
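The link between evaluator count and confidence interval can be sketched with a normal-approximation interval around a mean opinion score. The scores below are invented, and this is generic statistics, not the Podonos assignment policy.

```python
import math
import statistics

def mos_with_ci(scores: list[int], z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score and 95% CI half-width (normal approximation).

    More evaluators shrink the half-width roughly as 1/sqrt(n),
    which is why the default evaluator count is tied to a target
    confidence interval.
    """
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half

# Ten hypothetical 5-point ratings for one audio clip.
scores = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
mean, half = mos_with_ci(scores)
```

With these ten ratings the score is 4.0 with a half-width of about 0.41; quadrupling the evaluator count would roughly halve it.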
Naturalness (NMOS)
Five-point Likert mean opinion score with anchored examples.
Voice similarity (SMOS)
Compare a generated voice to a reference voice.
Speech quality (P.808)
ITU-T P.808 protocol for telecom-grade quality assessment.
Preferences (PREF)
Two-way A/B preference between models.
Ranking
N-way ranking with adaptive pairing for fixed-budget global rank.
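As a sketch of how pairwise outcomes become a global rank, here is a plain Bradley-Terry fit on win counts. This is a generic illustration of the idea, not Podonos's adaptive-pairing algorithm, and the win counts are invented.

```python
def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 100) -> dict[str, float]:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of times a beat b. Uses the standard
    minorization-maximization update; higher strength = better.
    """
    items = sorted({x for pair in wins for x in pair})
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            w_i = sum(c for (a, b), c in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new[i] = w_i / denom if denom else p[i]
        total = sum(new.values())
        p = {i: v * len(items) / total for i, v in new.items()}
    return p

# Hypothetical A/B outcomes across three models.
wins = {("A", "B"): 8, ("B", "A"): 2, ("A", "C"): 7,
        ("C", "A"): 3, ("B", "C"): 6, ("C", "B"): 4}
strengths = bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

Adaptive pairing spends a fixed comparison budget on the pairs that most reduce uncertainty in such a fit, rather than comparing every pair equally often.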
Comparative similarity
Pairwise similarity to a reference under common-target conditions.
Custom evaluation review
If a template does not fit your question, you can design your own — and we will review it before launch.
You draft

Write your question, scale, instructions, and anchors. Submit through the Workspace or your Slack channel.
We review
A Podonos evaluation specialist reads the draft for the failure modes we see most often: ambiguous wording, scale mismatch (too many points, too few), missing anchors, leading questions, and instructions that bury critical context.
We propose edits
You receive concrete proposed wording with the reasoning behind each change. You can accept, reject, or iterate.
What review catches
Ambiguous target terms
Words like “natural,” “expressive,” “good,” or “high quality” mean different things to different evaluators. We replace them with concrete behavioral prompts.
Scale mismatch
Five-point Likert is right for many tasks but wrong for others. Binary preferences should not have a 1–5 scale; subtle quality gradations need more than three points.
Missing anchors
Every scale point needs a concrete audio example. Without anchors, scores drift and inter-evaluator agreement collapses.
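The collapse in agreement is easy to see with a crude proxy: the mean absolute difference between evaluator pairs on each item (lower is better). Real studies use statistics such as Krippendorff's alpha; the ratings below are invented for illustration.

```python
import itertools
import statistics

def mean_pairwise_disagreement(ratings: dict[str, list[int]]) -> float:
    """Average |score_a - score_b| over all evaluator pairs, all items.

    A simple agreement proxy: 0 means perfect agreement, larger
    values mean evaluators are drifting apart on the scale.
    """
    diffs = []
    for scores in ratings.values():
        for a, b in itertools.combinations(scores, 2):
            diffs.append(abs(a - b))
    return statistics.mean(diffs)

# Hypothetical 5-point scores from three evaluators on two clips.
anchored = {"clip1": [4, 4, 5], "clip2": [2, 2, 3]}
unanchored = {"clip1": [2, 4, 5], "clip2": [1, 3, 5]}
a = mean_pairwise_disagreement(anchored)
u = mean_pairwise_disagreement(unanchored)
```

In this toy data the anchored panel disagrees by about 0.67 points on average versus 2.33 without anchors.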
Leading questions
Questions phrased to bias the answer (“how clearly does this voice articulate?” presupposes that clarity is present). We rewrite them with neutral framing.
Buried instructions
Critical context placed at the end of a 500-word instruction page is invisible. We surface it.

