Why bias minimization matters

A subjective evaluation is only as good as the conditions it runs under. Order effects, model recognition, vague scales, and uneven loudness will all corrupt MOS, preference, and ranking results — sometimes by more than the effect size you are trying to measure. Podonos applies four core layers of defense.

Query-order shuffling

Each evaluator sees queries in an independently shuffled order to break order bias.

Within-query randomization

For double / triple comparison evaluations, the position of each audio is randomized per query.

Anchored instructions

Concrete examples of every scale point are pinned to the rating UI throughout the session.

Loudness normalization

All audio is loudness-normalized so loud renders do not win on volume alone.

1. Query-order shuffling

When two evaluators evaluate the same 100 queries, they do not see those queries in the same order. We shuffle the order independently for each evaluator. This breaks:
  • Anchoring on the first item. Whatever an evaluator hears first calibrates their internal scale; shuffling spreads that calibration evenly across the query set.
  • End-of-session fatigue concentration. Without shuffling, the last 10% of queries always come from the same part of your data. With shuffling, fatigue noise spreads uniformly.
The shuffling is not purely random — when there are clear quality differences across models, we adjust the order to avoid patterns that would let an evaluator infer “this is the new model” or “this is the same model again” across queries.
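
As a rough illustration, independent per-evaluator shuffling can be implemented by seeding a random order from the evaluator's identity. This is a minimal sketch, not the Podonos implementation, and it omits the pattern-avoidance adjustment described above; all names are illustrative:

```python
import random

def shuffled_query_order(query_ids, evaluator_id, session_seed):
    # Seed an independent RNG per evaluator so each one sees a different
    # order, while keeping that order reproducible for auditing.
    rng = random.Random(f"{session_seed}:{evaluator_id}")
    order = list(query_ids)
    rng.shuffle(order)
    return order

# Two evaluators working the same query set get different orders.
queries = [f"q{i:03d}" for i in range(100)]
print(shuffled_query_order(queries, "evaluator-A", "session-42")[:5])
print(shuffled_query_order(queries, "evaluator-B", "session-42")[:5])
```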

2. Within-query randomization

For evaluations that compare multiple audio files in a single query — preferences, comparative similarity, ranking — the position of each audio (left/right, A/B/C) is randomized per query, per evaluator.
Without this, an evaluator who prefers the first audio out of habit (a real and well-documented bias) would always favor whichever model you placed in slot A. Randomization eliminates this.
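
A minimal sketch of per-query, per-evaluator slot assignment (illustrative names, not the Podonos implementation):

```python
import random

def assign_slots(audio_files, evaluator_id, query_id):
    # Shuffle which audio lands in slot A/B/C independently for every
    # (evaluator, query) pair, so no model consistently owns a slot.
    rng = random.Random(f"{evaluator_id}:{query_id}")
    shuffled = list(audio_files)
    rng.shuffle(shuffled)
    return {chr(ord("A") + i): audio for i, audio in enumerate(shuffled)}

# The same query can place model_x in slot A for one evaluator and slot B for another.
print(assign_slots(["model_x.wav", "model_y.wav"], "evaluator-A", "q007"))
print(assign_slots(["model_x.wav", "model_y.wav"], "evaluator-B", "q007"))
```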

3. Anchored instructions

The word “naturalness” means different things to different people. So does “excellent.” Two evaluators can give the same audio a 3 and a 5 not because they disagree on quality, but because they disagree on what “5” means. Podonos anchors at three places in every session:
1. Training phase

Before the session starts, evaluators hear example audios labeled “this is what an Excellent sounds like,” “this is a Poor,” and so on for every scale point. They also receive precise question wording — for example, instead of “how natural is this voice?” we use “how much does this voice sound like it was spoken by a real human voice actor?”
2. Inline anchors during rating

A small clickable audio icon sits next to every scale option. If an evaluator forgets what “Excellent” sounded like 30 minutes in, one click reminds them. Anchor memory decays fast — this refresh is essential.
3. Question-phrasing review

Vague terms like “natural,” “good,” or “high quality” are replaced with concrete behavioral descriptions during evaluation design. See Evaluation Design & Review.

4. Loudness normalization

People prefer the louder audio. This is a well-replicated finding in audio perception research: when a listener compares two clips, the louder one wins on naturalness, preference, and quality metrics, even when the listener is told to ignore loudness.

Loudness has to be controlled in two places: in the file and at the listener’s ear. Every audio file in an evaluation is loudness-normalized to a consistent integrated loudness target before evaluators hear it. Every session also opens with a calibration step: evaluators play a reference audio, set a comfortable listening volume, and that level holds for the rest of the session. Without this second step, file-side normalization is undone the moment one evaluator runs the session at half the volume of another.

Together, these two controls neutralize the loudness bias entirely, leaving only the qualities you actually want to measure.
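
The file-side step is standard integrated-loudness normalization. Here is a sketch using the pyloudnorm and soundfile packages; it illustrates the technique rather than the Podonos pipeline, and the -23 LUFS target is an assumption:

```python
import soundfile as sf
import pyloudnorm as pyln

TARGET_LUFS = -23.0  # assumed target; any consistent target removes the volume advantage

def normalize_loudness(in_path, out_path, target=TARGET_LUFS):
    data, rate = sf.read(in_path)
    meter = pyln.Meter(rate)                     # ITU-R BS.1770 loudness meter
    measured = meter.integrated_loudness(data)   # integrated loudness of the render
    normalized = pyln.normalize.loudness(data, measured, target)
    sf.write(out_path, normalized, rate)

normalize_loudness("render_model_a.wav", "render_model_a_norm.wav")
```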
Content-side loudness normalization is on by default. You can disable it with use_loudness_normalization=False when you need evaluators to hear raw, unnormalized renders, but we strongly recommend leaving it on for any preference, naturalness, or quality study.

Blinding

Evaluators never see model names, customer identifiers, or condition labels. The rating UI shows only the audio and the question; every piece of metadata that could let an evaluator infer “this came from model X” or “this is the new model” is stripped before the session opens. Combined with query-order shuffling and within-query randomization, this leaves no signal an evaluator could use to bias their rating toward a specific model or customer.

Cross-evaluation bias

In addition to the per-session controls above, Podonos applies a one-month cooldown to every evaluator across all evaluations on our platform. Once an evaluator participates in any Podonos evaluation, they are locked out of all Podonos evaluations for one month — across customers, evaluation types, and languages. This prevents memory-driven recognition of scripts, cross-customer model fingerprinting, and skill-drift from repeated exposure to the same evaluation type.
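
The eligibility rule itself is a simple date comparison. An illustrative sketch follows (not the production logic; a 30-day window stands in for the one-month cooldown):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COOLDOWN = timedelta(days=30)  # stand-in for the one-month lockout

def is_eligible(last_participation: Optional[datetime], now: Optional[datetime] = None) -> bool:
    # An evaluator is eligible if they have never participated, or if their
    # most recent participation anywhere on the platform is older than the cooldown.
    if last_participation is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_participation >= COOLDOWN

print(is_eligible(None))                                            # True
print(is_eligible(datetime.now(timezone.utc) - timedelta(days=7)))  # False
```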