
Intro

When comparing three or more speech synthesis models, a ranking evaluation is an effective method for determining the relative quality of each model. Rather than comparing pairs individually, evaluators listen to a set of audio samples generated from the same script and rank them from best to worst. The Ranking evaluation is flexible in its evaluation criteria. You can rank models based on naturalness, overall preference, clarity, expressiveness, or any other quality dimension that matters to your use case.
  • Objective: Determine the relative ordering of multiple models by having evaluators rank them.
  • Use Case: Ideal for comparing TTS providers, model versions, or synthesis configurations side by side.
  • Type: RANKING in the SDK.
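To make the idea concrete, here is a minimal, library-independent sketch of how a batch of per-evaluator rankings can be reduced to an overall ordering by mean rank. The rankings below are made-up illustration data, and mean rank is just one common aggregation, not necessarily the exact method podonos uses:

```python
from collections import defaultdict

# Illustration data: each evaluator returns a best-to-worst ordering of model tags.
rankings = [
    ['Provider B', 'Provider A', 'Provider C'],
    ['Provider B', 'Provider C', 'Provider A'],
    ['Provider A', 'Provider B', 'Provider C'],
]

# Sum each model's rank position (1 = best) across evaluators.
totals = defaultdict(int)
for ranking in rankings:
    for position, model in enumerate(ranking, start=1):
        totals[model] += position

# Lower mean rank is better.
mean_rank = {model: total / len(rankings) for model, total in totals.items()}
ordering = sorted(mean_rank, key=mean_rank.get)
print(ordering)  # best to worst by mean rank
```

With this data, Provider B wins (mean rank 1.33), followed by Provider A (2.0) and Provider C (2.67).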

Example

In this example, we compare three different TTS providers by generating speech from the same scripts and submitting the audio for a ranking evaluation. Here is a complete code example you can run right away:
python
import podonos
from podonos import File

# Placeholder modules standing in for your TTS providers' SDKs
import provider_a, provider_b, provider_c

client = podonos.init()
etor = client.create_evaluator(
    name='TTS Provider Ranking',
    desc='Ranking evaluation across TTS providers',
    type='RANKING',
    lan='en-us',
    num_eval=10,
)

scripts = [
    'But in less than five minutes',
    'The two doctors therefore entered the room alone',
]

for i, script in enumerate(scripts):
    # Use a distinct output file per script so earlier files are not overwritten.
    path_a = provider_a.synthesize(text=script, output=f'provider_a_{i}.wav')
    path_b = provider_b.synthesize(text=script, output=f'provider_b_{i}.wav')
    path_c = provider_c.synthesize(text=script, output=f'provider_c_{i}.wav')

    etor.add_ranking_set([
        File(path=path_a, model_tag='Provider A', tags=['tts']),
        File(path=path_b, model_tag='Provider B', tags=['tts']),
        File(path=path_c, model_tag='Provider C', tags=['tts']),
    ])

etor.close()
Now, let’s walk through the example step by step.
1. Create a Client

Let’s first create a new instance of Client.
python
client = podonos.init()
2. Create an Evaluator

Then, you create a new instance of Evaluator with type='RANKING':
python
etor = client.create_evaluator(
    name='TTS Provider Ranking',
    desc='Ranking evaluation across TTS providers',
    type='RANKING',
    lan='en-us',
    num_eval=10,
)
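As a quick sanity check on scale, assuming num_eval is the number of evaluators who rank each set (our reading of this parameter, worth confirming against the SDK reference), the total number of ranking judgments grows as scripts × evaluators:

```python
num_scripts = 2      # scripts in this example
num_eval = 10        # evaluators per ranking set (assumed semantics)
models_per_set = 3   # one audio file per provider

total_rankings = num_scripts * num_eval          # rankings collected
total_listens = total_rankings * models_per_set  # audio samples auditioned
print(total_rankings, total_listens)
```

For this example, that is 20 rankings covering 60 audio auditions, a useful figure when budgeting evaluation time and cost.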
3. Generate speech and add ranking sets

For each script, generate speech from all providers and add a ranking set. Each ranking set contains one audio file per provider.
python
etor.add_ranking_set([
    File(path=path_a, model_tag='Provider A', tags=['tts']),
    File(path=path_b, model_tag='Provider B', tags=['tts']),
    File(path=path_c, model_tag='Provider C', tags=['tts']),
])
4. Close

Finally, close the Evaluator object.
python
etor.close()
With this, you can collect rankings of multiple TTS providers from real human evaluators via podonos. Once these steps finish, you can check the results in your Workspace.
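Beyond an overall ordering, the same rankings also imply head-to-head comparisons: every model in a ranking beats every model placed below it. Here is a hedged sketch of deriving pairwise win rates, using made-up illustration data rather than the Workspace's actual report format:

```python
from itertools import combinations
from collections import Counter

# Illustration data: best-to-worst orderings from three evaluators.
rankings = [
    ['Provider B', 'Provider A', 'Provider C'],
    ['Provider B', 'Provider C', 'Provider A'],
    ['Provider A', 'Provider B', 'Provider C'],
]

wins = Counter()
for ranking in rankings:
    # Each model beats every model ranked below it in this ordering.
    for i, j in combinations(range(len(ranking)), 2):
        wins[(ranking[i], ranking[j])] += 1

# Fraction of rankings in which the first model placed above the second.
n = len(rankings)
win_rate = {pair: count / n for pair, count in wins.items()}
print(win_rate[('Provider B', 'Provider A')])  # B above A in 2 of 3 rankings
```

Pairwise win rates like these can reveal, for instance, that a model which places second overall still beats the winner on a subset of scripts.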

Use Case

Consider a scenario where you are evaluating multiple TTS providers to decide which one to integrate into your product. Each provider may have different strengths: one might excel at naturalness while another handles proper nouns better. Using the Ranking evaluation, you can have human evaluators directly compare all providers on the same scripts and produce a clear ordering, giving you confidence in your selection.