Intro

Naturalness, meaning how natural synthesized speech sounds, is one of the most widely used measures of synthesized speech quality. A common way to evaluate it for speech and audio is the mean opinion score (MOS). Its scale typically ranges from 1 (least natural, like an old robot) to 5 (most natural, like a human) in steps of 1, which is known as a five-point Likert scale. With podonos, you can evaluate the naturalness of your speech or audio in a fully managed way.
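To make the scale concrete, here is a minimal sketch of how a MOS is usually computed from individual 1-to-5 ratings. The function name `mean_opinion_score` and the sample ratings are made up for illustration, and the 95% interval uses a simple normal approximation; the analysis podonos runs on its side may differ.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of integer ratings from 1 to 5.

    half_width is the half-width of an approximate 95% confidence interval
    based on the normal approximation (an assumption of this sketch).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")

    mos = float(statistics.mean(ratings))
    if len(ratings) < 2:
        # A single rating gives no estimate of spread.
        return mos, 0.0

    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Ten hypothetical ratings for one audio file.
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only ten ratings per file the interval is fairly wide, which is one reason to collect more evaluations when you need to compare models that are close in quality.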

Example

Our first example uses AWS Polly to generate a synthesized human voice and podonos to evaluate it. You can of course use your own TTS (text-to-speech) model instead, or even your own voice. Here is a code example that you can run right away.

python
import podonos
from podonos import *
import boto3


polly_client = boto3.Session().client('polly')
client = podonos.init()
etor = client.create_evaluator(
    name='nmos_polly',
    desc='naturalness of Polly TTS',
    type='NMOS',
    num_eval=10,
)

scripts = [
    'But in less than five minutes',
    'The two doctors therefore entered the room alone',
]

for i, script in enumerate(scripts):
    # Generate the synthesized speech.
    response = polly_client.synthesize_speech(
        VoiceId='Brian', OutputFormat='mp3',
        Text=script, Engine='neural'
    )

    # Save each clip under its own name so that a later clip
    # does not overwrite an earlier one.
    filename = f'my_synthesized_speech_{i}.mp3'
    with open(filename, 'wb') as file:
        file.write(response['AudioStream'].read())

    # Add the saved clip to the evaluator.
    etor.add_file(File(path=filename, model_tag='AWS Polly',
                       tags=['polly', 'brian']))

# Close the evaluator once all files are added.
etor.close()
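When you synthesize several scripts, each clip needs its own file on disk so that a later clip does not overwrite an earlier one. A minimal sketch of that step follows. The helper `save_clip` is hypothetical and not part of podonos or boto3; its only assumption is that the stream has a `read()` method, as `response['AudioStream']` does. In-memory byte streams stand in for Polly responses so the sketch runs without AWS credentials.

```python
import io
import tempfile
from pathlib import Path


def save_clip(stream, index, out_dir="clips", prefix="speech"):
    """Write one audio stream to its own numbered file and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"{prefix}_{index:03d}.mp3"
    # The with-block closes the file even if the write fails.
    with open(target, "wb") as f:
        f.write(stream.read())
    return str(target)


# Stand-ins for response['AudioStream'] from two synthesis calls.
fake_streams = [io.BytesIO(b"first clip"), io.BytesIO(b"second clip")]

with tempfile.TemporaryDirectory() as tmp:
    paths = [save_clip(s, i, out_dir=tmp) for i, s in enumerate(fake_streams)]
    names = [Path(p).name for p in paths]
    sizes = [Path(p).stat().st_size for p in paths]

print(names)  # ['speech_000.mp3', 'speech_001.mp3']
```

Each returned path could then be passed as the `path` argument when adding the file to the evaluator.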

OK, let’s go through it step by step.

1

Create a Client

Let’s first create a new instance of Client.

python
client = podonos.init()

2

Create an Evaluator

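Next, create a new instance of Evaluator. The type 'NMOS' selects the naturalness MOS evaluation, and num_eval sets how many human evaluators rate each audio file: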

python
etor = client.create_evaluator(
    name='nmos_polly',
    desc='naturalness of Polly TTS',
    type='NMOS',
    num_eval=10,
)

3

Add files

Now, add each synthesized speech file to the evaluator.

python
etor.add_file(File(path=filename, model_tag='AWS Polly',
                   tags=['polly', 'brian']))

4

Close

Finally, close the Evaluator object.

python
etor.close()

With this, you can evaluate the naturalness of the two synthesized voices via podonos on a 5-point MOS scale, rated by real human evaluators. By default, 10 people evaluate the naturalness of each audio file, and their results are then analyzed. Once these steps finish, you can check the status in your Workspace.
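As a quick sanity check on the size of the job, each file is rated `num_eval` times, so the example above collects 2 × 10 = 20 individual ratings. The function below is a hypothetical helper that only does this arithmetic; it does not call podonos.

```python
def total_ratings(num_files, num_eval=10):
    """Return how many individual ratings an evaluation collects."""
    if num_files < 0 or num_eval < 0:
        raise ValueError("counts must be non-negative")
    return num_files * num_eval


print(total_ratings(2))  # 20
```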