Naturalness

Intro

One popular measure of synthesized speech is naturalness: how natural the synthesized speech sounds. One of the most widely used naturalness evaluation methods for speech and audio is the mean opinion score (MOS). Its scale typically ranges from 1 (lowest naturalness, like an old robot) to 5 (highest naturalness, like a human) in steps of 1, which is known as a five-point Likert scale. With podonos, you can evaluate the naturalness of your speech or audio in a fully managed way.
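To make the scale concrete, here is a minimal sketch of how a MOS could be computed by hand from a set of ratings. The function name `mean_opinion_score` and the sample ratings are made up for illustration; this is not part of the podonos API, which handles the collection and analysis for you.

```python
def mean_opinion_score(ratings):
    """Return the arithmetic mean of ratings on a 1-to-5 scale."""
    if not ratings:
        raise ValueError('at least one rating is required')
    for rating in ratings:
        # Only the five whole-number points of the scale are accepted.
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f'rating {rating!r} is outside the 1-5 scale')
    return sum(ratings) / len(ratings)


# Hypothetical ratings from ten listeners for a single audio clip.
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
print(mean_opinion_score(ratings))  # 4.0
```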

Example

Our first example uses AWS Polly to generate a synthesized human voice and podonos to evaluate it. Of course, you can use your own TTS (text-to-speech) model, or even your own voice. Here is a code example that you can run right away.

python

import podonos
from podonos import *  # Brings File into scope.
import boto3


polly_client = boto3.Session().client('polly')
client = podonos.init()
etor = client.create_evaluator(
    name='nmos_polly',
    desc='naturalness of Polly TTS',
    type='NMOS',
    num_eval=10,
)

scripts = [
    'But in less than five minutes',
    'The two doctors therefore entered the room alone'
]

for index, script in enumerate(scripts):
    # Generate the synthesized speech.
    response = polly_client.synthesize_speech(
        VoiceId='Brian', OutputFormat='mp3',
        Text=script, Engine='neural'
    )

    # Save each clip under its own name so earlier clips are not overwritten.
    filename = f'my_synthesized_speech_{index}.mp3'
    with open(filename, 'wb') as file:
        file.write(response['AudioStream'].read())

    # Add the saved file to the evaluator.
    etor.add_file(File(path=filename, model_tag='AWS Polly',
                       tags=['polly', 'brian']))

# Close the evaluator once all files have been added.
etor.close()

OK, let’s go through it step by step.

Create a Client

Let’s first create a new instance of Client.

python

client = podonos.init()

Create an Evaluator

Then, you create a new instance of Evaluator:

python

etor = client.create_evaluator(name='nmos_polly',
    desc='naturalness of Polly TTS', type='NMOS', num_eval=10)

Add files

Now, add each synthesized speech file to the evaluator.

python

etor.add_file(File(path=filename, model_tag='AWS Polly',
                   tags=['polly', 'brian']))
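When several scripts are synthesized in a loop, it can help to write each clip to its own path before adding it. The sketch below is one way to do that, under a few assumptions: `synthesize_to_file` is a hypothetical helper, not part of podonos or boto3, and `FakePolly` is a made-up stand-in that returns placeholder bytes so the example runs without AWS credentials. With a real `boto3` Polly client, you would pass that client instead.

```python
import io
import os
import tempfile


def synthesize_to_file(polly_client, text, path, voice_id='Brian'):
    """Synthesize `text` and write the MP3 bytes to `path`; return `path`."""
    response = polly_client.synthesize_speech(
        VoiceId=voice_id, OutputFormat='mp3',
        Text=text, Engine='neural'
    )
    # The context manager closes the file even if the write fails.
    with open(path, 'wb') as audio_file:
        audio_file.write(response['AudioStream'].read())
    return path


class FakePolly:
    """Hypothetical offline stand-in for the boto3 Polly client."""

    def synthesize_speech(self, **kwargs):
        # Returns placeholder bytes, not real audio.
        return {'AudioStream': io.BytesIO(kwargs['Text'].encode('utf-8'))}


scripts = [
    'But in less than five minutes',
    'The two doctors therefore entered the room alone',
]

with tempfile.TemporaryDirectory() as out_dir:
    for index, script in enumerate(scripts):
        # One file per script, so no clip overwrites another.
        path = os.path.join(out_dir, f'speech_{index}.mp3')
        synthesize_to_file(FakePolly(), script, path)
        print(path, os.path.getsize(path), 'bytes')
```

Each returned path can then be passed to `File(path=...)` exactly as in the snippet above.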

Finally, close the Evaluator object.

python

etor.close()

With this, you can evaluate the naturalness of the two synthesized human voices via podonos on a 5-point MOS scale, rated by real human evaluators. By default, 10 people evaluate the naturalness of each audio file, and their results are then analyzed. Once these steps finish, you can check the status in your Workspace.
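As a rough illustration of what analyzing 10 ratings per file can look like, the sketch below computes a per-file mean and an approximate 95% confidence interval. It is not the analysis podonos performs: the `summarize` function, the sample ratings, and the normal approximation (z = 1.96) are all assumptions made for this example.

```python
from math import sqrt
from statistics import fmean, stdev


def summarize(ratings_by_file, z=1.96):
    """Map each file to (mean rating, half-width of an approximate 95% CI)."""
    summary = {}
    for path, ratings in ratings_by_file.items():
        # Normal approximation; stdev() needs at least two ratings per file.
        half_width = z * stdev(ratings) / sqrt(len(ratings))
        summary[path] = (fmean(ratings), half_width)
    return summary


# Hypothetical ratings: 10 evaluators per audio file.
ratings_by_file = {
    'speech_0.mp3': [4, 5, 3, 4, 4, 5, 4, 3, 4, 4],
    'speech_1.mp3': [3, 4, 3, 3, 4, 2, 3, 4, 3, 3],
}

for path, (mos, half_width) in summarize(ratings_by_file).items():
    print(f'{path}: MOS {mos:.2f} +/- {half_width:.2f}')
```

With only 10 ratings per file, a t-distribution would give a somewhat wider interval than the normal approximation used here.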
