Intro
The Comparative Mean Opinion Score (CMOS) evaluation is designed to assess how a target audio compares to a reference audio. This evaluation is versatile and can be used for various comparison purposes:
- Quality Comparison: “Is the target better or worse than the reference?”
- Similarity/Fidelity: “How similar is the target to the reference?”
- Multi-dimensional Assessment: Evaluate specific aspects like naturalness, clarity, or speaker similarity.
CMOS is particularly useful when you have a ground-truth or baseline audio and want to measure how your generated audio compares against it.
- Objective: Compare target audio against a reference audio on various dimensions.
- Use Case: TTS quality assessment, voice cloning fidelity, audio codec evaluation, speech enhancement validation.
- Type: CMOS in the SDK.
Beta Feature: CMOS support is currently under development in the Workspace. Some features such as file management (add/edit/delete) may not be fully functional yet. Creating CMOS templates in the Workspace is also not supported at this time—use create_evaluator_from_template_json() with a JSON template instead.
Creating CMOS Evaluations
There are three ways to create CMOS evaluations:
| Method | Description | Use Case |
|---|---|---|
| create_evaluator() | Direct creation with type="CMOS" | Quick setup with default settings |
| create_evaluator_from_template() | Use predefined SPEECH_CMOS template | Standardized evaluations with preset questions |
| create_evaluator_from_template_json() | Custom JSON with custom_type="SINGLE_REF" | Fully customized evaluation questions |
Example: Using create_evaluator()
Initialize the Client
Begin by initializing the Podonos client with your API key.
import podonos
client = podonos.init("<API_KEY>")
Create the Evaluator
Set up the evaluator for a CMOS evaluation.
evaluator = client.create_evaluator(
    name="TTS Quality Comparison",
    desc="Compare synthesized speech against reference recording",
    type="CMOS",
    lan="en-us",
    num_eval=10
)
Add Files for Evaluation
Add one stimulus audio and one reference audio. The reference file must be specified with is_ref=True.
from podonos import File
evaluator.add_files(
    file0=File(path="generated.wav", model_tag="tts_v2", is_ref=False),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)
- Stimulus: The generated or synthesized audio to evaluate (is_ref=False)
- Reference: The ground-truth or original audio (is_ref=True)
You can compare multiple models against the same reference by using different model_tag values for each stimulus.
Finalize the Evaluation
Close the evaluator to complete the setup.
evaluator.close()
Example: Using create_evaluator_from_template()
Use a predefined template for standardized CMOS evaluations.
import podonos
from podonos import File
client = podonos.init("<API_KEY>")
# Create evaluation using a template
evaluator = client.create_evaluator_from_template(
template_id="<TEMPLATE_ID>",
name="CMOS Evaluation",
num_eval=5
)
# Add files (1 reference + 1 stimulus)
evaluator.add_files(
file0=File(path="synthesized.wav", model_tag="tts"),
file1=File(path="reference.wav", model_tag="human", is_ref=True)
)
evaluator.close()
Example: Using create_evaluator_from_template_json()
Create custom CMOS-style evaluations with the SINGLE_REF custom type.
Two variants follow: the first passes custom_type as a string, and the second passes the CustomType enum.
import podonos
from podonos import File
client = podonos.init("<API_KEY>")
template_json = {
"questions": [
{
"type": "SCORED",
"question": "How similar is the synthesized audio to the reference?",
"options": [
{"label_text": "Identical"},
{"label_text": "Very similar"},
{"label_text": "Somewhat similar"},
{"label_text": "Different"},
{"label_text": "Completely different"}
]
}
]
}
evaluator = client.create_evaluator_from_template_json(
json=template_json,
name="CMOS Style Evaluation",
custom_type="SINGLE_REF" # CMOS style (1 ref + 1 stimulus)
)
# Add files
evaluator.add_files(
file0=File(path="synthesized.wav", model_tag="tts"),
file1=File(path="reference.wav", model_tag="human", is_ref=True)
)
evaluator.close()
import podonos
from podonos import File
from podonos.common.enum import CustomType
client = podonos.init("<API_KEY>")
template_json = {
"questions": [
{
"type": "SCORED",
"question": "How similar is the synthesized audio to the reference?",
"options": [
{"label_text": "Identical"},
{"label_text": "Very similar"},
{"label_text": "Somewhat similar"},
{"label_text": "Different"},
{"label_text": "Completely different"}
]
}
]
}
evaluator = client.create_evaluator_from_template_json(
json=template_json,
name="CMOS Style Evaluation",
custom_type=CustomType.SINGLE_REF # Using enum for type safety
)
# Add files
evaluator.add_files(
file0=File(path="synthesized.wav", model_tag="tts"),
file1=File(path="reference.wav", model_tag="human", is_ref=True)
)
evaluator.close()
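When a template has several questions, writing each options list by hand gets repetitive. The helper below is a minimal sketch that builds one question in the same shape as the template_json examples above. The function name scored_question and the rule that a question needs at least two options are assumptions made for illustration, not part of the Podonos SDK.

```python
import json


def scored_question(question: str, labels: list[str]) -> dict:
    """Build one SCORED question in the shape used by the examples above.

    Hypothetical helper; the "at least two options" check is an assumption.
    """
    if len(labels) < 2:
        raise ValueError("a scored question needs at least two options")
    return {
        "type": "SCORED",
        "question": question,
        "options": [{"label_text": label} for label in labels],
    }


template_json = {
    "questions": [
        scored_question(
            "How similar is the synthesized audio to the reference?",
            [
                "Identical",
                "Very similar",
                "Somewhat similar",
                "Different",
                "Completely different",
            ],
        )
    ]
}

# Inspect the structure before passing it as json=template_json.
print(json.dumps(template_json, indent=2))
```

The resulting dictionary is identical to the hand-written template_json above, so it can be passed to create_evaluator_from_template_json() in the same way.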
Full Example
Here is a complete example comparing multiple TTS models against the same reference recordings:
import podonos
from podonos import File
client = podonos.init("<API_KEY>")
evaluator = client.create_evaluator(
name="Multi-Model TTS Quality Comparison",
desc="Compare multiple TTS models against human reference recordings",
type="CMOS",
lan="en-us",
num_eval=5
)
# Define models to compare
models = ["tts_model_v1", "tts_model_v2", "tts_model_v3"]
# Define audio samples (script IDs)
scripts = ["001", "002", "003"]
# Add evaluation pairs for each model and script combination
for script_id in scripts:
    reference_path = f"reference_{script_id}.wav"
    for model_name in models:
        generated_path = f"{model_name}_{script_id}.wav"
        # One stimulus/reference pair per model and script
        evaluator.add_files(
            file0=File(
                path=generated_path,
                model_tag=model_name,
                tags=["english", "female"],
                is_ref=False
            ),
            file1=File(
                path=reference_path,
                model_tag="human_speaker",
                tags=["english", "female"],
                is_ref=True
            )
        )
evaluator.close()
In this example, three different TTS models are evaluated against the same set of human reference recordings. The results will show the quality comparison for each model, allowing you to identify which model produces audio closest to the reference quality.
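Before uploading, it can help to list the pairs the nested loop will produce and confirm that the files exist locally. The sketch below uses only the standard library and the file-naming scheme from the example above. The helper names build_pairs and find_missing are hypothetical and not part of the SDK.

```python
from itertools import product
from pathlib import Path


def build_pairs(models: list[str], scripts: list[str]) -> list[tuple[str, str]]:
    """Return (generated_path, reference_path) for every script/model combination."""
    return [
        (f"{model}_{script}.wav", f"reference_{script}.wav")
        for script, model in product(scripts, models)
    ]


def find_missing(pairs: list[tuple[str, str]]) -> list[str]:
    """Return the sorted, de-duplicated paths that do not exist on disk."""
    return sorted({p for pair in pairs for p in pair if not Path(p).is_file()})


models = ["tts_model_v1", "tts_model_v2", "tts_model_v3"]
scripts = ["001", "002", "003"]

pairs = build_pairs(models, scripts)
print(len(pairs))  # 3 scripts x 3 models = 9 pairs
for path in find_missing(pairs):
    print("missing:", path)
```

Each reference file appears in three pairs, one per model, which is why the missing-file list is de-duplicated.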
Key Considerations
For CMOS evaluation, one file must be marked as reference (is_ref=True) and one as stimulus (is_ref=False). Both files having the same is_ref value will cause an error.
- File Configuration: Exactly two files are required - one reference and one stimulus.
- Method Restriction: Use the add_files() method, not add_file(), which is only for single-stimulus evaluations.
- Evaluation Logic: Evaluators will rate the quality difference between the stimulus and the reference audio.
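The is_ref rule above can be checked locally before calling add_files(). The following is a minimal sketch, not SDK behavior: FileSpec is a hypothetical stand-in for podonos.File that holds only the fields used here, and check_cmos_pair is an illustrative helper name.

```python
from dataclasses import dataclass


@dataclass
class FileSpec:
    """Hypothetical stand-in for podonos.File (illustration only)."""
    path: str
    model_tag: str
    is_ref: bool = False


def check_cmos_pair(file0: FileSpec, file1: FileSpec) -> tuple[FileSpec, FileSpec]:
    """Return (stimulus, reference), or raise ValueError for an invalid pair."""
    if file0.is_ref == file1.is_ref:
        raise ValueError(
            "CMOS needs exactly one reference (is_ref=True) and one stimulus "
            f"(is_ref=False); both files have is_ref={file0.is_ref}"
        )
    # Exactly one file is the reference, so order the result accordingly.
    return (file1, file0) if file0.is_ref else (file0, file1)


stimulus, reference = check_cmos_pair(
    FileSpec("generated.wav", "tts_v2", is_ref=False),
    FileSpec("reference.wav", "human", is_ref=True),
)
print(stimulus.path, reference.path)
```

Running a check like this on every pair surfaces a mislabeled file before any upload starts.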
custom_type Options
When using create_evaluator_from_template_json(), choose the appropriate custom_type based on your evaluation needs:
| Value | Description | batch_size | File Configuration |
|---|---|---|---|
| SINGLE | Single stimulus evaluation | 1 | 1 stimulus |
| DOUBLE | Double stimulus evaluation | 2 | 2 stimuli (no reference) |
| SINGLE_REF | Reference-based comparison (CMOS) | 2 | 1 reference + 1 stimulus |
| RANKING | Ranking evaluation | 2+ | Multiple stimuli |
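The table can also be written as a small lookup that checks whether a group of files fits a given custom_type. This is an illustrative sketch only: the names CUSTOM_TYPE_RULES and files_match_custom_type are hypothetical, and treating references as disallowed for SINGLE and RANKING is an assumption, since the table states "no reference" only for DOUBLE.

```python
from typing import Optional

# Mirrors the table above: (min_files, max_files, needs_reference).
# max_files=None means no upper bound (RANKING takes 2 or more stimuli).
CUSTOM_TYPE_RULES: dict[str, tuple[int, Optional[int], bool]] = {
    "SINGLE": (1, 1, False),
    "DOUBLE": (2, 2, False),
    "SINGLE_REF": (2, 2, True),
    "RANKING": (2, None, False),
}


def files_match_custom_type(custom_type: str, is_ref_flags: list[bool]) -> bool:
    """Check a group of files, given as their is_ref flags, against the table."""
    min_files, max_files, needs_ref = CUSTOM_TYPE_RULES[custom_type]
    count = len(is_ref_flags)
    if count < min_files or (max_files is not None and count > max_files):
        return False
    ref_count = sum(is_ref_flags)
    # SINGLE_REF needs exactly one reference; other types are assumed to need none.
    return ref_count == 1 if needs_ref else ref_count == 0


print(files_match_custom_type("SINGLE_REF", [False, True]))
print(files_match_custom_type("DOUBLE", [False, True]))
```

The first call describes a valid CMOS pair, while the second fails because DOUBLE does not accept a reference file.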
DOUBLE vs SINGLE_REF
| | DOUBLE | SINGLE_REF (CMOS) |
|---|---|---|
| File Configuration | 2 stimuli | 1 ref + 1 stimulus |
| is_ref Usage | Not allowed (causes error) | Required |
| Comparison Type | A vs B | Reference vs Stimulus |
Current Limitations
CMOS is currently in beta. The following limitations apply:
- Workspace File Management: Adding, editing, or deleting files in the Workspace UI may not work as expected.
- Workspace Template Creation: Creating CMOS templates directly in the Workspace is not supported. Use create_evaluator_from_template_json() with a JSON template to create custom CMOS evaluations.
These features are under active development and will be available in future updates.
Use Cases
| Scenario | Evaluation Focus | Example Question |
|---|---|---|
| TTS Quality Assessment | Quality comparison | “Is the synthesized speech better or worse than the human recording?” |
| Voice Cloning Fidelity | Speaker similarity | “How similar is the cloned voice to the original speaker?” |
| Audio Codec Evaluation | Quality degradation | “How much quality loss occurred after compression?” |
| Speech Enhancement | Improvement measurement | “Is the enhanced audio clearer than the original noisy recording?” |
| Prosody Transfer | Style similarity | “Does the target match the speaking style of the reference?” |
| Emotion Synthesis | Emotion fidelity | “How well does the target convey the same emotion as the reference?” |