What is the best text-to-speech engine? Amazon Polly, Microsoft Azure Cognitive Services, Google Cloud?

Text-to-speech has come a long way since the robotic voice adopted by Stephen Hawking in 1986, a peculiar voice which he kept until his death in March 2018, aged 76. That computer-generated voice was created by MIT engineer Dennis Klatt and based on Klatt’s own voice. In his later years, the software Hawking used to compose his sentences relied on word-prediction algorithms developed by SwiftKey, a British company later acquired by Microsoft. You can listen to Stephen Hawking’s final public lecture (A Brief History Of Time) in this video.

Despite the advances in text-to-speech synthesis, Stephen Hawking refused to upgrade his voice. The original 1980s sound had become part of his public persona.

What does a synthetic voice sound like today?

Since I’m passionate about the possibilities of AI-assisted creative automation, I tested the three leading text-to-speech engines: Amazon Polly, Microsoft Azure Cognitive Services and Google Cloud.

The purpose of this article is to give you my honest opinion on how each of them renders a human voice from the same text prompt.

For the sake of this quick experiment, I will use a male voice for all three services. There will obviously be differences due to the tone of voice but I’ve tried to pick the best example for each provider. The text prompt will be a poem by Thomas Hardy: “She Opened The Door”. As a bonus, we’ll conclude the review with a human recording.

Google Cloud Text-To-Speech

You can test Google’s text-to-speech offering on https://cloud.google.com/text-to-speech

The most advanced synthetic voices at Google are named WaveNet voices, powered by machine learning algorithms. Here’s the test rendered by WaveNet Voice D.

The text is read correctly, with no obvious mistakes, but you’ll notice that it lacks emotion.

Bear in mind that it’s impossible to download the MP3 rendering from the public test page. I’ve used AudioHijack to record the output of my browser.

Amazon Polly text-to-speech

Let’s see how Amazon Polly performs given the same poetic prompt. State-of-the-art AI synthetic voices are called Neural at Amazon.

You can test Amazon’s service at https://console.aws.amazon.com/polly/home/SynthesizeSpeech?region=us-east-1

I’ve hired Matthew to perform Thomas Hardy’s poem “She Opened The Door”. Let’s listen to his version.

I prefer this voice to the one tested at Google but that’s subjective (the tone of voice is different). In terms of performance, I would say Matthew is slightly more engaged than his Google counterpart but he still lacks the emotion of a human actor.

Note: you can download the MP3 rendition straight from the Amazon Polly public console.

Let’s see how Microsoft compares to its competitors.

Microsoft Azure Cognitive Services Text-To-Speech

You can test Microsoft’s text-to-speech offering at https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#features

The most advanced voices are also named “neural” at Microsoft. Let’s pick Guy as our performer. I’ve pasted each poem section as a single line in the Azure console, to avoid long pauses between each line.
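If you’d rather not paste each section by hand, the same preparation can be scripted. Here’s a minimal Python sketch, assuming the poem is stored as plain text with a blank line between sections. The function name `flatten_stanzas` and the placeholder text are mine, not part of any Azure tooling.

```python
def flatten_stanzas(poem: str) -> str:
    """Join the lines of each stanza into one line, one stanza per output line."""
    stanzas = []
    current = []
    for raw in poem.splitlines():
        line = raw.strip()
        if line:
            # Still inside a stanza: collect the line.
            current.append(line)
        elif current:
            # Blank line: close the stanza collected so far.
            stanzas.append(" ".join(current))
            current = []
    if current:
        stanzas.append(" ".join(current))
    return "\n".join(stanzas)


# Placeholder text standing in for the poem (hypothetical, not Hardy's lines).
sample = """first line of stanza one,
second line of stanza one.

first line of stanza two,
second line of stanza two."""

print(flatten_stanzas(sample))
```

Each output line can then be pasted (or sent) as a single block of text.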

Guy isn’t an accomplished actor but in my opinion he performs marginally better than his silicon colleagues. A glaring issue is that he doesn’t understand that he’s reading a poem, a feat which requires much more emotion than reciting a training manual. Guy has a second voice option (Newscast) but it didn’t really fit the use case. I also used AudioHijack to capture the MP3 recording.

Can you control the expressivity of synthetic voices?

On all services, using the API or the console, you can add SSML tags to your texts to insert pauses and other pronunciation instructions, which can in turn improve the expressivity of the performance. However, doing this requires time-consuming manual input.
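To give an idea of what this markup looks like, here’s a small Python sketch that wraps a few lines of text in SSML, slows the delivery down and adds a pause after each line. It only uses the standard `<speak>`, `<prosody>` and `<break>` elements. The helper name `lines_to_ssml` and the default values are my own assumptions, and each provider has its own list of supported tags (Azure, for example, also expects a `<voice>` element and extra attributes on `<speak>`), so check the documentation before using it as-is.

```python
from xml.sax.saxutils import escape


def lines_to_ssml(lines, pause_ms=400, rate="slow"):
    """Wrap lines in <speak>, slow the delivery, and pause after each line."""
    spoken = []
    for line in lines:
        # escape() protects characters such as & and < that would break the XML.
        spoken.append(f'{escape(line)}<break time="{pause_ms}ms"/>')
    body = " ".join(spoken)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


# Placeholder lines (hypothetical, not taken from the poem).
print(lines_to_ssml(["A first line of verse,", "and a second line & more."]))
```

Tuning the pause length and rate line by line is where the manual work comes in.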

You can also create your own custom voice, based on a voice talent’s recordings, to develop a unique rendering, powered by machine learning. This also requires some time and fine-tuning.

What’s the price of text-to-speech voice synthesis?

Compared to a human reader, it’s of course very cheap.

You can get a full reading of “A Christmas Carol” by Charles Dickens (64 pages / 165K characters) for $2.64.

To compare the state-of-the-art option for all three providers (WaveNet / Neural / Neural), here’s the pricing at time of writing.

Google Cloud: https://cloud.google.com/text-to-speech/pricing WaveNet voices: $16 per 1 million characters

Amazon Polly: https://aws.amazon.com/polly/pricing/ Neural TTS Cost: $16 per 1 million characters

Microsoft Azure: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/ Neural cost: 0.5 million free characters per month, then $16 per 1 million characters

All three providers charge the same price ($16 per 1 million characters), but Microsoft Azure gives you 500,000 free characters per month (and a free £150 allowance across all of Microsoft’s Cognitive Services when you sign up).
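The arithmetic behind these figures is simple enough to script. Here’s a quick Python sketch that assumes the $16 per 1 million characters rate quoted above and an optional free allowance. The function name `tts_cost_usd` is mine, and real invoices may differ (rounding, taxes, price changes).

```python
PRICE_PER_MILLION_USD = 16.0  # neural / WaveNet rate quoted in this article


def tts_cost_usd(characters, free_characters=0):
    """Estimate the cost of synthesizing `characters` characters of text."""
    billable = max(0, characters - free_characters)
    return billable * PRICE_PER_MILLION_USD / 1_000_000


# "A Christmas Carol": roughly 165K characters.
print(f"${tts_cost_usd(165_000):.2f}")                           # $2.64
# Same text within a 500,000-character free allowance.
print(f"${tts_cost_usd(165_000, free_characters=500_000):.2f}")  # $0.00
```

In other words, a book of that length would fit entirely within a 500,000-character monthly allowance.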

You can use all these services either via the no-code consoles listed above (you’ll have to use a hack to download the MP3 for Google and Microsoft) or call the API, following the online instructions.

I’ve successfully connected Microsoft Azure’s API to Integromat via a single authentication and was able to process a series of text prompts from a Google Sheet. Amazon Polly and Google Cloud require more advanced authentication methods.
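For readers who prefer code to a no-code tool, here’s a rough Python sketch of what a single-key call to the Azure text-to-speech REST endpoint can look like. It is not my Integromat scenario: the function name `build_azure_tts_request`, the default region, the key placeholder and the output format are assumptions for illustration, so check them against the current Azure documentation. The sketch only builds the request; the lines that would actually send it are left as comments because they need a real key and network access.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_azure_tts_request(text, subscription_key,
                            region="westeurope", voice="en-US-GuyNeural"):
    """Build (but do not send) a POST request for Azure neural text-to-speech."""
    # Azure expects the text wrapped in SSML, with the voice named explicitly.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        headers={
            # Single authentication: the key of your Speech resource.
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
            "User-Agent": "tts-blog-sketch",
        },
        method="POST",
    )


# One row of a spreadsheet, as a hypothetical example.
request = build_azure_tts_request("Hello from a spreadsheet row.", "YOUR_KEY")
print(request.full_url)

# To synthesize for real (requires a valid key):
# with urllib.request.urlopen(request) as response:
#     with open("row.mp3", "wb") as mp3_file:
#         mp3_file.write(response.read())
```

Looping over the rows of a sheet and saving one MP3 per row is then a matter of calling this function once per text prompt.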

How does synthetic text-to-speech compare to a human actor?

I’ve picked Thomas Hardy’s poem “She Opened The Door” since in 2019 I had commissioned the recording of a series of Hardy’s poems from a professional voice talent.

You can listen to his performance in the video below, which will show that, as we speak, human actors still perform at a higher level than machines, with few or no detailed instructions. Human consciousness still has the advantage of intuition.

But the fast improvement of text-to-speech quality has proven that we’re not far from a synthetic emotional rendition. Singularity is near. Stay tuned.

👉 I went full meta, asking Microsoft Azure TTS to voice my article dedicated to Generative AI.

Here’s the recording, using 3 different neural voices.

User-friendly AI Voice Studios

Replica by AI Lab

I’ve recently discovered an amazing service offered by a US/AU startup: Replica by AI Lab.

They’ve trained neural-quality realistic & expressive voices which you can use as programmatic voice actors in games and other multimedia productions.

I love the personality of the voices presented in their demo, which you can discover via this link.

👀 Their voice engine is available as a plugin on the iClone / Reallusion platform. The output is truly mind-blowing. I could see myself spending hours animating 3D characters with this software.

Murf.ai

After reading an article dedicated to AI-based content creation, I revisited Murf.ai which didn’t really impress me when I first tried it in early 2022. Their recent updates have drastically improved the quality of the TTS output.

Murf.ai is a user-friendly TTS service which offers expressive AI voice actors for all sorts of purposes. You can search the database for voices optimized for audiobooks, product demos, meditation or advertisements. The quality of Pro-level voice rendering is pretty impressive. Those voices have a lot of character. I could see myself using them for simple audio ads.

You can also fine-tune the emphasis of specific words in your copy. Of course, as with the other solutions available on the market, you’ll achieve the best results in English, which benefits from more training material. French voices, for instance, still feel very robotic.

You can easily mix multiple voices in the same project to record a dialogue.

You’ll pay a premium compared to a pure API-based service like Microsoft Cognitive Services, but the user-friendliness and the control over the expressivity of the voices fully justify the price tag.

Murf.ai voices optimized for advertisements

Filip Sokolowski · A Small Blue Dot – Chapter 1 – A New Friend (SAMPLE)
