Intro
The Comparative Mean Opinion Score (CMOS) evaluation assesses how a target audio compares to a reference audio. It can be used for several comparison purposes:
- Quality Comparison: “Is the target better or worse than the reference?”
- Similarity/Fidelity: “How similar is the target to the reference?”
- Multi-dimensional Assessment: Evaluate specific aspects like naturalness, clarity, or speaker similarity.

- Objective: Compare target audio against a reference audio on various dimensions.
- Use Case: TTS quality assessment, voice cloning fidelity, audio codec evaluation, speech enhancement validation.
- Type: `CMOS` in the SDK.
Creating CMOS Evaluations
There are three ways to create CMOS evaluations:

| Method | Description | Use Case |
|---|---|---|
| `create_evaluator()` | Direct creation with `type="CMOS"` | Quick setup with default settings |
| `create_evaluator_from_template()` | Use predefined `SPEECH_CMOS` template | Standardized evaluations with preset questions |
| `create_evaluator_from_template_json()` | Custom JSON with `custom_type="SINGLE_REF"` | Fully customized evaluation questions |
Example: Using create_evaluator()
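A minimal sketch of direct creation. Only `create_evaluator()` and `type="CMOS"` come from this page; the other keyword names (`name`, `desc`, `lan`, `num_eval`), the `podonos.init()` call in the usage comment, and the helper name `make_cmos_evaluator` are assumptions for illustration. The client is passed in so the helper can be exercised without network access.

```python
from typing import Any


def make_cmos_evaluator(client: Any, name: str, desc: str, num_eval: int = 10) -> Any:
    """Create a CMOS evaluator from an already-initialized SDK client.

    Hypothetical helper: only type="CMOS" is taken from this page; the
    remaining keyword names are assumptions about the SDK.
    """
    if num_eval < 1:
        raise ValueError("num_eval must be at least 1")
    return client.create_evaluator(
        name=name,
        desc=desc,
        type="CMOS",       # evaluation type named on this page
        lan="en-us",       # assumed language-code format
        num_eval=num_eval,  # assumed: number of human ratings per item
    )


# Possible usage (assumes podonos.init() returns the client):
#   import podonos
#   client = podonos.init(api_key="<YOUR_API_KEY>")
#   etor = make_cmos_evaluator(client, "tts_vs_human", "TTS against a human reference")
```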
Add Files for Evaluation
Add one stimulus audio and one reference audio. The reference file must be specified with `is_ref=True`.
- Stimulus: The generated or synthesized audio to evaluate (`is_ref=False`)
- Reference: The ground-truth or original audio (`is_ref=True`)
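The pairing above could be sketched as follows. This is an illustration under assumptions, not the SDK's documented usage: `file_cls` stands for the SDK's file class (assumed to be `podonos.File`) and is injected so the sketch runs without the SDK installed. Only `is_ref`, `model_tag`, and `add_files()` are named on this page; the `path` and `tags` arguments and the `file0`/`file1` keywords are assumed.

```python
from typing import Any, Callable


def add_cmos_pair(
    etor: Any,
    file_cls: Callable[..., Any],
    reference_path: str,
    stimulus_path: str,
    model_tag: str,
) -> None:
    """Add one reference and one stimulus to a CMOS evaluator.

    Hypothetical helper; keyword names other than is_ref and model_tag
    are assumptions about the SDK.
    """
    # The ground-truth or original recording is flagged as the reference.
    reference = file_cls(
        path=reference_path, model_tag="reference", tags=["reference"], is_ref=True
    )
    # The generated audio under test is the stimulus.
    stimulus = file_cls(
        path=stimulus_path, model_tag=model_tag, tags=["stimulus"], is_ref=False
    )
    etor.add_files(file0=reference, file1=stimulus)
```

With the SDK installed, a call might look like `add_cmos_pair(etor, File, "human.wav", "tts_v1.wav", model_tag="tts_v1")`. Calling it once per model with a distinct `model_tag` would compare several models against the same reference.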
`model_tag` values for each stimulus.
Example: Using create_evaluator_from_template()
Use a predefined template for standardized CMOS evaluations.
Example: Using create_evaluator_from_template_json()
Create custom CMOS-style evaluations with the `SINGLE_REF` custom type.
- Using String
- Using CustomType Enum
Full Example
Here is a complete example comparing multiple TTS models against the same reference recordings:
Key Considerations
For CMOS evaluation, one file must be marked as reference (`is_ref=True`) and one as stimulus (`is_ref=False`). If both files have the same `is_ref` value, an error is raised.
- File Configuration: Exactly two files are required: one reference and one stimulus.
- Method Restriction: Use the `add_files()` method, not `add_file()`, which is only for single-stimulus evaluations.
- Evaluation Logic: Evaluators will rate the quality difference between the stimulus and the reference audio.
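The file rules above can be checked locally before uploading. The sketch below is a stand-alone illustration: `AudioSpec` is a hypothetical stand-in for an SDK file object, and `check_cmos_files` is not part of the SDK.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AudioSpec:
    """Hypothetical stand-in for an SDK file object."""
    path: str
    is_ref: bool = False


def check_cmos_files(files: Sequence[AudioSpec]) -> None:
    """Raise ValueError unless `files` is one reference plus one stimulus."""
    if len(files) != 2:
        raise ValueError(f"CMOS needs exactly 2 files, got {len(files)}")
    ref_count = sum(1 for f in files if f.is_ref)
    if ref_count != 1:
        raise ValueError(
            "CMOS needs one file with is_ref=True and one with is_ref=False, "
            f"got {ref_count} reference file(s)"
        )


# One reference and one stimulus: passes silently.
check_cmos_files([AudioSpec("human.wav", is_ref=True), AudioSpec("tts.wav")])
```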
custom_type Options
When using `create_evaluator_from_template_json()`, choose the appropriate `custom_type` based on your evaluation needs:

| Value | Description | batch_size | File Configuration |
|---|---|---|---|
| `SINGLE` | Single stimulus evaluation | 1 | 1 stimulus |
| `DOUBLE` | Double stimulus evaluation | 2 | 2 stimuli (no reference) |
| `SINGLE_REF` | Reference-based comparison (CMOS) | 2 | 1 reference + 1 stimulus |
| `RANKING` | Ranking evaluation | 2+ | Multiple stimuli |
DOUBLE vs SINGLE_REF
| | DOUBLE | SINGLE_REF (CMOS) |
|---|---|---|
| File Configuration | 2 stimuli | 1 ref + 1 stimulus |
| `is_ref` Usage | Not allowed (causes error) | Required |
| Comparison Type | A vs B | Reference vs Stimulus |
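To make the difference between the custom types concrete, the two tables can be encoded as a small lookup. This is a sketch only: `TypeRule`, `RULES`, and `check_batch` are hypothetical names, not SDK objects. Where this page does not state an `is_ref` rule (`SINGLE`, `RANKING`), the sketch leaves it unchecked.

```python
from typing import Dict, NamedTuple, Optional, Sequence


class TypeRule(NamedTuple):
    min_files: int
    max_files: Optional[int]  # None means no upper bound
    ref_count: Optional[int]  # required number of is_ref=True files; None = not stated


# Values transcribed from the custom_type tables on this page.
RULES: Dict[str, TypeRule] = {
    "SINGLE": TypeRule(1, 1, None),
    "DOUBLE": TypeRule(2, 2, 0),         # is_ref is not allowed
    "SINGLE_REF": TypeRule(2, 2, 1),     # exactly one reference
    "RANKING": TypeRule(2, None, None),  # batch_size "2+"
}


def check_batch(custom_type: str, is_ref_flags: Sequence[bool]) -> None:
    """Raise ValueError if one batch of files does not fit `custom_type`."""
    if custom_type not in RULES:
        raise ValueError(f"unknown custom_type: {custom_type!r}")
    rule = RULES[custom_type]
    n = len(is_ref_flags)
    if n < rule.min_files or (rule.max_files is not None and n > rule.max_files):
        raise ValueError(f"{custom_type} does not accept {n} file(s)")
    refs = sum(1 for flag in is_ref_flags if flag)
    if rule.ref_count is not None and refs != rule.ref_count:
        raise ValueError(
            f"{custom_type} expects {rule.ref_count} reference file(s), got {refs}"
        )
```

For example, `check_batch("SINGLE_REF", [True, False])` passes, while `check_batch("DOUBLE", [True, False])` raises because `DOUBLE` does not allow a reference.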
Current Limitations
Use Cases
| Scenario | Evaluation Focus | Example Question |
|---|---|---|
| TTS Quality Assessment | Quality comparison | “Is the synthesized speech better or worse than the human recording?” |
| Voice Cloning Fidelity | Speaker similarity | “How similar is the cloned voice to the original speaker?” |
| Audio Codec Evaluation | Quality degradation | “How much quality loss occurred after compression?” |
| Speech Enhancement | Improvement measurement | “Is the enhanced audio clearer than the original noisy recording?” |
| Prosody Transfer | Style similarity | “Does the target match the speaking style of the reference?” |
| Emotion Synthesis | Emotion fidelity | “How well does the target convey the same emotion as the reference?” |

