Intro

The Comparative Mean Opinion Score (CMOS) evaluation is designed to assess how a target audio compares to a reference audio. This evaluation is versatile and can be used for various comparison purposes:
  • Quality Comparison: “Is the target better or worse than the reference?”
  • Similarity/Fidelity: “How similar is the target to the reference?”
  • Multi-dimensional Assessment: Evaluate specific aspects like naturalness, clarity, or speaker similarity.
CMOS is particularly useful when you have a ground-truth or baseline audio and want to measure how your generated audio compares against it.
  • Objective: Compare target audio against a reference audio on various dimensions.
  • Use Case: TTS quality assessment, voice cloning fidelity, audio codec evaluation, speech enhancement validation.
  • Type: CMOS in the SDK.
Beta Feature: CMOS support is currently under development in the Workspace. Some features, such as file management (add/edit/delete), may not be fully functional yet. Creating CMOS templates in the Workspace is also not supported at this time. Use create_evaluator_from_template_json() with a JSON template instead.

Creating CMOS Evaluations

There are three ways to create CMOS evaluations:
Method | Description | Use Case
create_evaluator() | Direct creation with type="CMOS" | Quick setup with default settings
create_evaluator_from_template() | Use predefined SPEECH_CMOS template | Standardized evaluations with preset questions
create_evaluator_from_template_json() | Custom JSON with custom_type="SINGLE_REF" | Fully customized evaluation questions

Example: Using create_evaluator()

1. Initialize the Client

Begin by initializing the Podonos client with your API key.
import podonos

client = podonos.init("<API_KEY>")

2. Create the Evaluator

Set up the evaluator for a CMOS evaluation.
evaluator = client.create_evaluator(
    name="TTS Quality Comparison",
    desc="Compare synthesized speech against reference recording",
    type="CMOS",
    lan="en-us",
    num_eval=10
)

3. Add Files for Evaluation

Add one stimulus audio and one reference audio. The reference file must be specified with is_ref=True.
from podonos import File

evaluator.add_files(
    file0=File(path="generated.wav", model_tag="tts_v2", is_ref=False),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)
  • Stimulus: The generated or synthesized audio to evaluate (is_ref=False)
  • Reference: The ground-truth or original audio (is_ref=True)
You can compare multiple models against the same reference by using different model_tag values for each stimulus.

4. Finalize the Evaluation

Close the evaluator to complete the setup.
evaluator.close()

Example: Using create_evaluator_from_template()

Use a predefined template for standardized CMOS evaluations.
import podonos
from podonos import File

client = podonos.init("<API_KEY>")

# Create evaluation using a template
evaluator = client.create_evaluator_from_template(
    template_id="<TEMPLATE_ID>",
    name="CMOS Evaluation",
    num_eval=5
)

# Add files (1 reference + 1 stimulus)
evaluator.add_files(
    file0=File(path="synthesized.wav", model_tag="tts"),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)

evaluator.close()

Example: Using create_evaluator_from_template_json()

Create custom CMOS-style evaluations with the SINGLE_REF custom type.
import podonos
from podonos import File

client = podonos.init("<API_KEY>")

template_json = {
    "questions": [
        {
            "type": "SCORED",
            "question": "How similar is the synthesized audio to the reference?",
            "options": [
                {"label_text": "Identical"},
                {"label_text": "Very similar"},
                {"label_text": "Somewhat similar"},
                {"label_text": "Different"},
                {"label_text": "Completely different"}
            ]
        }
    ]
}

evaluator = client.create_evaluator_from_template_json(
    json=template_json,
    name="CMOS Style Evaluation",
    custom_type="SINGLE_REF"  # CMOS style (1 ref + 1 stimulus)
)

# Add files
evaluator.add_files(
    file0=File(path="synthesized.wav", model_tag="tts"),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)

evaluator.close()
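
The template dictionary above has a regular shape: a "questions" list whose entries carry a "type", a "question", and a list of "options" with a "label_text" each. As a minimal sketch, assuming only the keys shown in that example, a small helper can build it from a question and a list of labels. The helper name build_scored_template and the two-label minimum are illustrative assumptions, not part of the Podonos SDK.

```python
import json


def build_scored_template(question: str, labels: list[str]) -> dict:
    """Build a one-question template dict with the keys used in the example above.

    Hypothetical helper: the SDK does not provide it, and the minimum of two
    labels is an assumption made here, not a documented rule.
    """
    if len(labels) < 2:
        raise ValueError("A scored question needs at least two labels")
    return {
        "questions": [
            {
                "type": "SCORED",
                "question": question,
                "options": [{"label_text": label} for label in labels],
            }
        ]
    }


template_json = build_scored_template(
    "How similar is the synthesized audio to the reference?",
    ["Identical", "Very similar", "Somewhat similar", "Different", "Completely different"],
)

# Inspect the structure before passing it as json=template_json.
print(json.dumps(template_json, indent=2))
```

The resulting dictionary can then be passed as json=template_json to create_evaluator_from_template_json(), as in the example above.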

Full Example

Here is a complete example comparing multiple TTS models against the same reference recordings:
import podonos
from podonos import File

client = podonos.init("<API_KEY>")

evaluator = client.create_evaluator(
    name="Multi-Model TTS Quality Comparison",
    desc="Compare multiple TTS models against human reference recordings",
    type="CMOS",
    lan="en-us",
    num_eval=5
)

# Define models to compare
models = ["tts_model_v1", "tts_model_v2", "tts_model_v3"]

# Define audio samples (script IDs)
scripts = ["001", "002", "003"]

# Add evaluation pairs for each model and script combination
for script_id in scripts:
    reference_path = f"reference_{script_id}.wav"

    for model_name in models:
        generated_path = f"{model_name}_{script_id}.wav"

        evaluator.add_files(
            file0=File(
                path=generated_path,
                model_tag=model_name,
                tags=["english", "female"],
                is_ref=False
            ),
            file1=File(
                path=reference_path,
                model_tag="human_speaker",
                tags=["english", "female"],
                is_ref=True
            )
        )

evaluator.close()
In this example, three different TTS models are evaluated against the same set of human reference recordings. The results will show the quality comparison for each model, allowing you to identify which model produces audio closest to the reference quality.
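
The nested loop above produces one pair per script and model combination, nine in total for three models and three scripts. The following is a minimal sketch, assuming the same file-naming scheme as the example, that lists those pairs and reports any path missing on disk before upload. plan_pairs and missing_files are hypothetical helper names and are not part of the SDK.

```python
from itertools import product
from pathlib import Path


def plan_pairs(models, scripts, root="."):
    """Return (generated_path, reference_path) tuples, scripts outer, models inner.

    Hypothetical helper; the naming scheme mirrors the example above.
    """
    base = Path(root)
    return [
        (base / f"{model_name}_{script_id}.wav", base / f"reference_{script_id}.wav")
        for script_id, model_name in product(scripts, models)
    ]


def missing_files(pairs):
    """Return the sorted, de-duplicated paths that do not exist on disk."""
    return sorted({str(p) for pair in pairs for p in pair if not p.is_file()})


models = ["tts_model_v1", "tts_model_v2", "tts_model_v3"]
scripts = ["001", "002", "003"]

pairs = plan_pairs(models, scripts)
print(len(pairs))  # 9 pairs: 3 scripts x 3 models

# Report missing audio before calling evaluator.add_files() on each pair.
for path in missing_files(pairs):
    print(f"missing: {path}")
```

Running a check like this first avoids discovering a missing file partway through adding pairs to the evaluator.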

Key Considerations

For CMOS evaluation, one file must be marked as the reference (is_ref=True) and the other as the stimulus (is_ref=False). Giving both files the same is_ref value causes an error.
  • File Configuration: Exactly two files are required: one reference and one stimulus.
  • Method Restriction: Use the add_files() method, not add_file(), which is only for single-stimulus evaluations.
  • Evaluation Logic: Evaluators will rate the quality difference between the stimulus and the reference audio.
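
The pairing rule above can be checked locally before anything is sent to the evaluator. This is a minimal sketch under stated assumptions: PairEntry is a hypothetical stand-in that mirrors only the path, model_tag, and is_ref fields of File used in this guide, and validate_cmos_pair is an illustrative helper, not an SDK function.

```python
from dataclasses import dataclass


@dataclass
class PairEntry:
    """Hypothetical stand-in for the File fields used in this guide."""
    path: str
    model_tag: str
    is_ref: bool = False


def validate_cmos_pair(file0: PairEntry, file1: PairEntry) -> None:
    """Raise ValueError unless exactly one of the two entries is the reference."""
    if file0.is_ref == file1.is_ref:
        role = "reference" if file0.is_ref else "stimulus"
        raise ValueError(
            f"Both files are marked as {role}: {file0.path!r}, {file1.path!r}"
        )


# A valid pair (one stimulus, one reference) passes silently.
validate_cmos_pair(
    PairEntry("generated.wav", "tts_v2", is_ref=False),
    PairEntry("reference.wav", "human", is_ref=True),
)
```

The same check applies regardless of argument order, since it only compares the two is_ref flags.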

custom_type Options

When using create_evaluator_from_template_json(), choose the appropriate custom_type based on your evaluation needs:
Value | Description | batch_size | File Configuration
SINGLE | Single stimulus evaluation | 1 | 1 stimulus
DOUBLE | Double stimulus evaluation | 2 | 2 stimuli (no reference)
SINGLE_REF | Reference-based comparison (CMOS) | 2 | 1 reference + 1 stimulus
RANKING | Ranking evaluation | 2+ | Multiple stimuli

DOUBLE vs SINGLE_REF

Aspect | DOUBLE | SINGLE_REF (CMOS)
File Configuration | 2 stimuli | 1 ref + 1 stimulus
is_ref Usage | Not allowed (causes error) | Required
Comparison Type | A vs B | Reference vs Stimulus
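
For a two-file evaluation, the distinction above reduces to how many files are marked is_ref=True. The sketch below, assuming only the rules in this comparison, picks the matching custom_type from the two flags. pick_custom_type is a hypothetical helper name, not an SDK function.

```python
def pick_custom_type(is_ref0: bool, is_ref1: bool) -> str:
    """Return the custom_type that fits a two-file configuration.

    Hypothetical helper based on the DOUBLE vs SINGLE_REF comparison:
    no reference -> "DOUBLE", exactly one reference -> "SINGLE_REF".
    """
    n_ref = int(is_ref0) + int(is_ref1)
    if n_ref == 0:
        return "DOUBLE"
    if n_ref == 1:
        return "SINGLE_REF"
    raise ValueError("At most one of the two files may be marked is_ref=True")


print(pick_custom_type(False, False))  # DOUBLE
print(pick_custom_type(False, True))   # SINGLE_REF
```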

Current Limitations

CMOS is currently in beta. The following limitations apply:
  • Workspace File Management: Adding, editing, or deleting files in the Workspace UI may not work as expected.
  • Workspace Template Creation: Creating CMOS templates directly in the Workspace is not supported. Use create_evaluator_from_template_json() with a JSON template to create custom CMOS evaluations.
These features are under active development and will be available in future updates.

Use Cases

Scenario | Evaluation Focus | Example Question
TTS Quality Assessment | Quality comparison | “Is the synthesized speech better or worse than the human recording?”
Voice Cloning Fidelity | Speaker similarity | “How similar is the cloned voice to the original speaker?”
Audio Codec Evaluation | Quality degradation | “How much quality loss occurred after compression?”
Speech Enhancement | Improvement measurement | “Is the enhanced audio clearer than the original noisy recording?”
Prosody Transfer | Style similarity | “Does the target match the speaking style of the reference?”
Emotion Synthesis | Emotion fidelity | “How well does the target convey the same emotion as the reference?”