Intro

One of the most common questions in speech/audio evaluation is the quality of the generated output. Quality is not directly about naturalness or intelligibility alone; it is connected with many aspects, including all those mentioned above.

One of the most widely used quality evaluation methods for speech/audio is the mean opinion score (MOS). Its scale typically ranges from 1 (lowest quality) to 5 (highest, human-like quality) in steps of 1, a so-called five-point Likert scale. Through Podonos, you can evaluate the overall quality of your speech/audio in a fully managed service.
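For intuition, here is a minimal sketch of how a MOS is computed from raw ratings; it is plain Python, not part of the Podonos SDK, and the ratings are made-up numbers. A MOS is simply the arithmetic mean of the individual scores, often reported with a confidence interval.

import statistics

def mean_opinion_score(ratings):
    # MOS is the arithmetic mean of the 1-5 ratings.
    mos = statistics.mean(ratings)
    # Standard error of the mean; 1.96 approximates a 95% interval
    # under a normal approximation.
    sem = statistics.stdev(ratings) / len(ratings) ** 0.5
    return mos, 1.96 * sem

# Seven evaluators rated the same clip on the five-point scale.
ratings = [4, 5, 3, 4, 4, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")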

Quality Mean Opinion Score Measurement

As one way of measuring quality, we demonstrate evaluating synthesized human voice with and without additive noise. Below is an executable code example:

import math
import os

import podonos
from podonos import *
from pydub import AudioSegment
from pydub.generators import WhiteNoise

def add_noise(sound, noise_level=0.005):
    # Generate white noise matching the input length, scale it down to
    # the requested linear amplitude, and overlay it on the input.
    gain_db = 20 * math.log10(noise_level)
    noise = WhiteNoise().to_audio_segment(duration=len(sound), volume=gain_db)
    return sound.overlay(noise)

client = podonos.init()
etor = client.create_evaluator(
    name="Quality evaluation with/without additive noise",
    desc="Noise is added to the clean audio",
    type="QMOS", lan="de-de", num_eval=7)

src_folder = "./src"
dst_folder = "./dst"
os.makedirs(dst_folder, exist_ok=True)

for name in os.listdir(src_folder):
    # Load the clean audio and write a noisy copy.
    sound = AudioSegment.from_file(os.path.join(src_folder, name))
    noisy_sound = add_noise(sound)
    noisy_sound.export(os.path.join(dst_folder, name), format="wav")

    # Add both versions to the evaluator.
    etor.add_file(File(path=os.path.join(src_folder, name), model_tag='Original',
                       tags=["original"]))
    etor.add_file(File(path=os.path.join(dst_folder, name), model_tag='Noisy',
                       tags=["level 0.005"]))

etor.close()

With this, you can compare the overall quality of the original audio and the version with additive noise.
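Once the evaluation finishes, you will typically want to compare the per-model MOS. As a rough sketch, suppose you have exported the individual ratings to a CSV file with model_tag and score columns; the file name and layout here are assumptions for illustration, not a Podonos API.

import csv
from collections import defaultdict

# Hypothetical export: one row per rating with "model_tag" and "score"
# columns. Adjust to your actual export format.
scores = defaultdict(list)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        scores[row["model_tag"]].append(float(row["score"]))

# Report the MOS per model tag, e.g. 'Original' vs. 'Noisy'.
for tag, vals in scores.items():
    print(f"{tag}: MOS = {sum(vals) / len(vals):.2f} ({len(vals)} ratings)")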

In-depth Quality Measurement

Another way of evaluating speech quality is to follow the ITU-T P.808 recommendation. It specifies 1) how to qualify the evaluators, 2) how to train them, and 3) how to collect and analyze the evaluation results. Setting up such a system and running the evaluation yourself is a demanding process. With Podonos, you can set up the whole evaluation with a few lines of code.

Example

In this example, let’s assume you are developing a new speech enhancement algorithm, called MNSE (My New Speech Enhancement). We will use mnse as the name of your package.

Here is a code example that you can immediately execute.

import podonos
from podonos import *
import mnse  # This is your speech enhancement package.

client = podonos.init()
etor = client.create_evaluator(name='mnse', desc='mnse_param1_param2',
                               type='P808', lan='es-es', num_eval=15)
total_audio_files = 50
for i in range(total_audio_files):
    # Generate the enhanced audio file.
    enhanced_audio_path = mnse.enhance(f'/path/to/audio_{i}.wav')
    etor.add_file(File(path=enhanced_audio_path,
                       model_tag='MNSE', tags=['bella', 'female', 'echo']))
etor.close()

Behind the scenes

In addition to selecting a proper group of evaluators, ITU-T P.808 requires additional steps to ensure that the evaluation environment is suitable and that evaluators conduct each session in an appropriate manner.

Evaluator qualification

Following ITU-T P.808, we qualify evaluators by reviewing their hearing device, mother tongue, age, gender distribution, hearing capability, and geographic location. Evaluators who are disqualified are stopped from continuing the evaluation session.
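The exact criteria are managed by Podonos, but conceptually the screening behaves like the sketch below; the field names and thresholds are assumptions made for illustration, not the actual implementation.

def is_qualified(evaluator):
    # Illustrative qualification check in the spirit of ITU-T P.808.
    # Field names and thresholds are assumptions for this sketch.
    checks = [
        evaluator["hearing_device"] in ("headphones", "earbuds"),  # no loudspeakers
        evaluator["mother_tongue"] == "de",  # matches the evaluation language
        18 <= evaluator["age"] <= 65,
        evaluator["passed_hearing_test"],
        evaluator["country"] in ("DE", "AT", "CH"),
    ]
    return all(checks)

evaluator = {"hearing_device": "headphones", "mother_tongue": "de",
             "age": 31, "passed_hearing_test": True, "country": "DE"}
print(is_qualified(evaluator))  # True -> may start the session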

Evaluation with gold references

While your audio files are being evaluated, we automatically inject gold references (so-called anchor or hidden questions) for which the correct responses are known. If an evaluator answers them incorrectly, their evaluation results are automatically rejected afterwards.
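Conceptually, the mechanism resembles the following sketch; the clip names, data layout, and tolerance are assumptions made for illustration.

# Each gold reference pairs a hidden question with its known answer.
gold_references = {
    "gold_clip_01.wav": 5,  # clean studio recording -> expect a high score
    "gold_clip_02.wav": 1,  # heavily distorted clip -> expect a low score
}

def passes_gold_checks(responses, tolerance=1):
    # Reject a session if any gold-reference answer deviates from the
    # known score by more than `tolerance` points. Illustrative only.
    return all(abs(responses[clip] - expected) <= tolerance
               for clip, expected in gold_references.items()
               if clip in responses)

responses = {"gold_clip_01.wav": 4, "gold_clip_02.wav": 2, "your_clip.wav": 3}
print(passes_gold_checks(responses))  # True -> session results are kept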

Reliability evaluation

Once an evaluation session is done, we automatically compute its overall reliability. Evaluations that are significantly less reliable than the others are marked and excluded.
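One simple way to detect such outliers, sketched below with made-up scores, is to measure how far each evaluator's ratings deviate from the per-clip consensus; the method and threshold are illustrative, not the exact Podonos procedure.

import statistics

def flag_unreliable(score_matrix, max_dev=1.0):
    # score_matrix[i][j] = evaluator i's score for clip j.
    # Flags evaluators whose mean absolute deviation from the per-clip
    # consensus exceeds max_dev points. Illustrative only.
    consensus = [statistics.mean(col) for col in zip(*score_matrix)]
    flagged = []
    for i, own in enumerate(score_matrix):
        mad = statistics.mean(abs(s - c) for s, c in zip(own, consensus))
        if mad > max_dev:
            flagged.append(i)
    return flagged

scores = [
    [4, 5, 3, 4],  # evaluator 0
    [4, 4, 3, 5],  # evaluator 1
    [5, 5, 2, 4],  # evaluator 2
    [1, 5, 5, 1],  # evaluator 3: far from the consensus
]
print(flag_unreliable(scores))  # [3]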