Reference-Based Comparison

Intro

The Comparative Mean Opinion Score (CMOS) evaluation assesses how a target audio compares to a reference audio. It supports several kinds of comparison:

Quality Comparison: “Is the target better or worse than the reference?”
Similarity/Fidelity: “How similar is the target to the reference?”
Multi-dimensional Assessment: Evaluate specific aspects like naturalness, clarity, or speaker similarity.

CMOS is particularly useful when you have a ground-truth or baseline audio and want to measure how your generated audio compares against it.

Objective: Compare target audio against a reference audio on various dimensions.
Use Case: TTS quality assessment, voice cloning fidelity, audio codec evaluation, speech enhancement validation.
Type: CMOS in the SDK.

Beta Feature: CMOS support is currently under development in the Workspace. Some features, such as file management (add/edit/delete), may not be fully functional yet. Creating CMOS templates in the Workspace is also not supported at this time; use create_evaluator_from_template_json() with a JSON template instead.

Creating CMOS Evaluations

There are three ways to create CMOS evaluations:

Method	Description	Use Case
`create_evaluator()`	Direct creation with `type="CMOS"`	Quick setup with default settings
`create_evaluator_from_template()`	Use predefined `SPEECH_CMOS` template	Standardized evaluations with preset questions
`create_evaluator_from_template_json()`	Custom JSON with `custom_type="SINGLE_REF"`	Fully customized evaluation questions

Example: Using create_evaluator()

Initialize the Client

Begin by initializing the Podonos client with your API key.

import podonos

client = podonos.init("<API_KEY>")
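
To keep the key out of source control, it can be read from the environment before the client is initialized. The sketch below shows one way to do that. The helper name load_api_key and the variable name PODONOS_API_KEY are assumptions made for illustration, not names confirmed by this page.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "PODONOS_API_KEY",  # hypothetical variable name
) -> str:
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your API key")
    return key
```

The result can then be passed to the client, for example podonos.init(load_api_key()).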

Create the Evaluator

Set up the evaluator for a CMOS evaluation.

evaluator = client.create_evaluator(
    name="TTS Quality Comparison",
    desc="Compare synthesized speech against reference recording",
    type="CMOS",
    lan="en-us",
    num_eval=10
)

Add Files for Evaluation

Add one stimulus audio and one reference audio. The reference file must be specified with is_ref=True.

from podonos import File

evaluator.add_files(
    file0=File(path="generated.wav", model_tag="tts_v2", is_ref=False),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)

Stimulus: The generated or synthesized audio to evaluate (is_ref=False)
Reference: The ground-truth or original audio (is_ref=True)

You can compare multiple models against the same reference by using different model_tag values for each stimulus.
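
One way to do this is a small loop that pairs each model's output with the shared reference. The helper below is a sketch under assumptions: add_against_reference and its parameters are hypothetical names, and the File constructor is passed in as make_file (expected to be podonos.File) so the function can be tried without the SDK installed.

```python
from typing import Any, Callable, Mapping


def add_against_reference(
    evaluator: Any,
    make_file: Callable[..., Any],
    reference_path: str,
    stimuli: Mapping[str, str],
    reference_tag: str = "human",
) -> int:
    """Add one (stimulus, reference) pair per model and return the pair count.

    `stimuli` maps each model_tag to the path of that model's generated audio.
    """
    count = 0
    for model_tag, stimulus_path in stimuli.items():
        evaluator.add_files(
            # Stimulus: the generated audio for this model.
            file0=make_file(path=stimulus_path, model_tag=model_tag, is_ref=False),
            # Reference: the same ground-truth recording for every model.
            file1=make_file(path=reference_path, model_tag=reference_tag, is_ref=True),
        )
        count += 1
    return count
```

With the SDK, a call might look like add_against_reference(evaluator, File, "reference.wav", {"tts_v1": "v1.wav", "tts_v2": "v2.wav"}), where the paths are placeholders.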

Finalize the Evaluation

Close the evaluator to complete the setup.

evaluator.close()

Example: Using create_evaluator_from_template()

Use a predefined template for standardized CMOS evaluations.

import podonos
from podonos import File

client = podonos.init("<API_KEY>")

# Create evaluation using a template
evaluator = client.create_evaluator_from_template(
    template_id="<TEMPLATE_ID>",
    name="CMOS Evaluation",
    num_eval=5
)

# Add files (1 reference + 1 stimulus)
evaluator.add_files(
    file0=File(path="synthesized.wav", model_tag="tts"),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)

evaluator.close()

Example: Using create_evaluator_from_template_json()

Create custom CMOS-style evaluations with the SINGLE_REF custom type.

The two variants below are identical except for how custom_type is passed: first as a string, then as a CustomType enum member.

import podonos
from podonos import File

client = podonos.init("<API_KEY>")

template_json = {
    "questions": [
        {
            "type": "SCORED",
            "question": "How similar is the synthesized audio to the reference?",
            "options": [
                {"label_text": "Identical"},
                {"label_text": "Very similar"},
                {"label_text": "Somewhat similar"},
                {"label_text": "Different"},
                {"label_text": "Completely different"}
            ]
        }
    ]
}

evaluator = client.create_evaluator_from_template_json(
    json=template_json,
    name="CMOS Style Evaluation",
    custom_type="SINGLE_REF"  # CMOS style (1 ref + 1 stimulus)
)

# Add files
evaluator.add_files(
    file0=File(path="synthesized.wav", model_tag="tts"),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)

evaluator.close()

import podonos
from podonos import File
from podonos.common.enum import CustomType

client = podonos.init("<API_KEY>")

template_json = {
    "questions": [
        {
            "type": "SCORED",
            "question": "How similar is the synthesized audio to the reference?",
            "options": [
                {"label_text": "Identical"},
                {"label_text": "Very similar"},
                {"label_text": "Somewhat similar"},
                {"label_text": "Different"},
                {"label_text": "Completely different"}
            ]
        }
    ]
}

evaluator = client.create_evaluator_from_template_json(
    json=template_json,
    name="CMOS Style Evaluation",
    custom_type=CustomType.SINGLE_REF  # Using enum for type safety
)

# Add files
evaluator.add_files(
    file0=File(path="synthesized.wav", model_tag="tts"),
    file1=File(path="reference.wav", model_tag="human", is_ref=True)
)

evaluator.close()
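
Both variants above repeat the same template dictionary. If several templates are needed, the dictionary can be assembled by a small builder instead of being written out by hand. This is a sketch: scored_question and build_template are hypothetical helpers, and the check for at least two options is an assumption rather than a documented SDK rule. Only the keys that appear in the examples above are used.

```python
from typing import Dict, List, Sequence


def scored_question(question: str, labels: Sequence[str]) -> Dict[str, object]:
    """Build one SCORED question in the shape used by the examples above."""
    if not question.strip():
        raise ValueError("question must not be empty")
    if len(labels) < 2:  # assumed minimum, not a documented limit
        raise ValueError("a scored question needs at least two options")
    return {
        "type": "SCORED",
        "question": question,
        "options": [{"label_text": label} for label in labels],
    }


def build_template(questions: Sequence[Dict[str, object]]) -> Dict[str, List[Dict[str, object]]]:
    """Wrap question dictionaries in the top-level "questions" object."""
    return {"questions": list(questions)}
```

The returned dictionary has the same shape as template_json in the examples, so it can be passed as the json argument of create_evaluator_from_template_json().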

Full Example

Here is a complete example comparing multiple TTS models against the same reference recordings:

import podonos
from podonos import File

client = podonos.init("<API_KEY>")

evaluator = client.create_evaluator(
    name="Multi-Model TTS Quality Comparison",
    desc="Compare multiple TTS models against human reference recordings",
    type="CMOS",
    lan="en-us",
    num_eval=5
)

# Define models to compare
models = ["tts_model_v1", "tts_model_v2", "tts_model_v3"]

# Define audio samples (script IDs)
scripts = ["001", "002", "003"]

# Add evaluation pairs for each model and script combination
for script_id in scripts:
    reference_path = f"reference_{script_id}.wav"

    for model_name in models:
        generated_path = f"{model_name}_{script_id}.wav"

        evaluator.add_files(
            file0=File(
                path=generated_path,
                model_tag=model_name,
                tags=["english", "female"],
                is_ref=False
            ),
            file1=File(
                path=reference_path,
                model_tag="human_speaker",
                tags=["english", "female"],
                is_ref=True
            )
        )

evaluator.close()

In this example, three different TTS models are evaluated against the same set of human reference recordings. The results will show the quality comparison for each model, allowing you to identify which model produces audio closest to the reference quality.
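
Because the example derives every path from a naming pattern, it can help to list the planned pairs and confirm the files exist before adding them. The sketch below assumes the same file naming as the example (model name plus script ID for generated audio, reference_ plus script ID for references). The names plan_pairs, missing_files, and audio_dir are hypothetical.

```python
from itertools import product
from pathlib import Path
from typing import List, Sequence, Tuple

# (model_tag, generated_path, reference_path)
Pair = Tuple[str, Path, Path]


def plan_pairs(models: Sequence[str], scripts: Sequence[str], audio_dir: str = ".") -> List[Pair]:
    """List one pair per script and model, in the same order as the nested loops above."""
    root = Path(audio_dir)
    return [
        (model, root / f"{model}_{script_id}.wav", root / f"reference_{script_id}.wav")
        for script_id, model in product(scripts, models)
    ]


def missing_files(pairs: Sequence[Pair]) -> List[Path]:
    """Return each expected path that is not a file on disk, without duplicates."""
    missing: List[Path] = []
    seen = set()
    for _, generated, reference in pairs:
        for path in (generated, reference):
            if path in seen:
                continue
            seen.add(path)
            if not path.is_file():
                missing.append(path)
    return missing
```

For the three models and three scripts above, plan_pairs yields nine pairs. An empty list from missing_files means every generated and reference file was found.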

Key Considerations

For CMOS evaluation, one file must be marked as reference (is_ref=True) and one as stimulus (is_ref=False). Both files having the same is_ref value will cause an error.

File Configuration: Exactly two files are required: one reference and one stimulus.
Method Restriction: Use the add_files() method, not add_file(), which is only for single-stimulus evaluations.
Evaluation Logic: Evaluators will rate the quality difference between the stimulus and the reference audio.
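
The reference/stimulus rule can also be checked locally before any files are added. The following is a minimal sketch, not SDK code: FileSpec is a hypothetical stand-in for podonos.File that holds only the fields being checked, and split_cmos_pair is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileSpec:
    """Hypothetical stand-in for podonos.File, holding only the fields checked here."""
    path: str
    is_ref: bool = False


def split_cmos_pair(file_a: FileSpec, file_b: FileSpec) -> Tuple[FileSpec, FileSpec]:
    """Return (stimulus, reference), or raise if the pair lacks exactly one reference."""
    if file_a.is_ref == file_b.is_ref:
        kind = "references" if file_a.is_ref else "stimuli"
        raise ValueError(f"CMOS needs one reference and one stimulus, got two {kind}")
    return (file_b, file_a) if file_a.is_ref else (file_a, file_b)
```

The function accepts the two files in either order and always returns the stimulus first, matching the file0/file1 order used in the examples on this page.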

custom_type Options

When using create_evaluator_from_template_json(), choose the appropriate custom_type based on your evaluation needs:

Value	Description	batch_size	File Configuration
`SINGLE`	Single stimulus evaluation	1	1 stimulus
`DOUBLE`	Double stimulus evaluation	2	2 stimuli (no reference)
`SINGLE_REF`	Reference-based comparison (CMOS)	2	1 reference + 1 stimulus
`RANKING`	Ranking evaluation	2+	Multiple stimuli
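
The batch_size column can be expressed as a lookup that checks how many files a single evaluation item carries. This is a sketch that transcribes the table above. BATCH_SIZE and fits_batch_size are hypothetical names, and treating RANKING as having no upper bound is an assumption based on the "2+" entry.

```python
from typing import Dict, Optional, Tuple

# (minimum files, maximum files) per item; None means no upper bound is stated.
BATCH_SIZE: Dict[str, Tuple[int, Optional[int]]] = {
    "SINGLE": (1, 1),
    "DOUBLE": (2, 2),
    "SINGLE_REF": (2, 2),
    "RANKING": (2, None),  # "2+" in the table
}


def fits_batch_size(custom_type: str, num_files: int) -> bool:
    """Return True if num_files is a valid file count for the given custom_type."""
    try:
        low, high = BATCH_SIZE[custom_type]
    except KeyError:
        raise ValueError(f"unknown custom_type: {custom_type!r}") from None
    return num_files >= low and (high is None or num_files <= high)
```

For example, fits_batch_size("SINGLE_REF", 2) holds, while fits_batch_size("SINGLE", 2) does not.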

DOUBLE vs SINGLE_REF

	DOUBLE	SINGLE_REF (CMOS)
File Configuration	2 stimuli	1 ref + 1 stimulus
`is_ref` Usage	Not allowed (causes error)	Required
Comparison Type	A vs B	Reference vs Stimulus

Current Limitations

CMOS is currently in beta. The following limitations apply:

Workspace File Management: Adding, editing, or deleting files in the Workspace UI may not work as expected.
Workspace Template Creation: Creating CMOS templates directly in the Workspace is not supported. Use create_evaluator_from_template_json() with a JSON template to create custom CMOS evaluations.

These features are under active development and will be available in future updates.

Use Cases

Scenario	Evaluation Focus	Example Question
TTS Quality Assessment	Quality comparison	“Is the synthesized speech better or worse than the human recording?”
Voice Cloning Fidelity	Speaker similarity	“How similar is the cloned voice to the original speaker?”
Audio Codec Evaluation	Quality degradation	“How much quality loss occurred after compression?”
Speech Enhancement	Improvement measurement	“Is the enhanced audio clearer than the original noisy recording?”
Prosody Transfer	Style similarity	“Does the target match the speaking style of the reference?”
Emotion Synthesis	Emotion fidelity	“How well does the target convey the same emotion as the reference?”
