Text-to-speech has come a long way since the robotic voice developed for Stephen Hawking in 1986, a peculiar voice which he kept until his death in March 2018, aged 76. That computer-generated voice, created by MIT engineer Dennis Klatt and based on Klatt's own voice, was later paired with predictive-text software from SwiftKey, a British company subsequently acquired by Microsoft. You can listen to Stephen Hawking's final public lecture (A Brief History Of Time) in this video:
Despite the advances in text-to-speech synthesis, Stephen Hawking refused to upgrade his voice. The original 1980s sound had become part of his public persona.
What does a synthetic voice sound like today?
Since I'm passionate about the possibilities of AI-assisted creative automation, I tested the three leading text-to-speech engines: Amazon Polly, Microsoft Azure Cognitive Services and Google Cloud.
The purpose of this article is to give you my honest opinion about the way they render human voice based on the same text prompt.
For the sake of this quick experiment, I will use a male voice for all three services. There will obviously be differences due to the tone of voice but I've tried to pick the best example for each provider. The text prompt will be a poem by Thomas Hardy: "She Opened The Door". As a bonus, we'll conclude the review with a human recording.
Google Cloud Text-To-Speech
You can test Google's text-to-speech offering on https://cloud.google.com/text-to-speech
The most advanced synthetic voices at Google are named WaveNet voices, powered by machine learning algorithms. Here's the test rendered by WaveNet Voice D.
The text is read properly, with no obvious mistakes, but you'll have noticed that it lacks emotion.
Bear in mind that it's impossible to download the MP3 rendering from the public test page. I've used AudioHijack to record the output of my browser.
Amazon Polly text-to-speech
Let's see how Amazon Polly performs given the same poetic prompt. State-of-the-art AI synthetic voices are called Neural at Amazon.
You can test Amazon's service at https://console.aws.amazon.com/polly/home/SynthesizeSpeech?region=us-east-1
I've hired Matthew to perform Thomas Hardy's poem "She Opened The Door". Let's listen to his version.
I prefer this voice to the one tested at Google but that's subjective (the tone of voice is different). In terms of performance, I would say Matthew is slightly more engaged than his Google counterpart but he still lacks the emotion of a human actor.
Note: you can download the MP3 rendition straight from the Amazon Polly public console.
Let's see how Microsoft compares to its competitors.
Microsoft Azure Cognitive Services Text-To-Speech
You can test Microsoft's text-to-speech offering at https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#features
The most advanced voices are also named "neural" at Microsoft. Let's pick Guy as our performer. I've pasted each poem section as a single line in the Azure console, to avoid long pauses between each line.
Guy isn't an accomplished actor but in my opinion he performs marginally better than his silicon colleagues. A glaring issue is that he doesn't understand that he's reading a poem, a feat which requires much more emotion than reciting a training manual. Guy has a second voice option (Newscast) but it didn't really fit the use case. I also used AudioHijack to capture the MP3 recording.
Can you control the expressivity of synthetic voices?
On all three services, whether you use the API or the console, you can add SSML (Speech Synthesis Markup Language) tags to your text to insert pauses and other pronunciation instructions, which can in turn improve the expressivity of the performance. However, doing this requires time-consuming manual input.
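To give an idea of what this markup looks like, here is a minimal Python sketch that wraps a poem in SSML, with a short pause between lines and a longer one between stanzas. The function name `poem_to_ssml` and the pause durations are illustrative assumptions, not settings used for the recordings in this article. The `<speak>` and `<break>` elements are part of standard SSML, but each provider supports its own subset of tags, and Azure additionally expects a `<voice>` element inside `<speak>`, so check the provider's documentation before relying on it.

```python
from xml.sax.saxutils import escape


def poem_to_ssml(poem: str, line_pause_ms: int = 300, stanza_pause_ms: int = 800) -> str:
    """Wrap a poem in SSML with a short pause between lines
    and a longer pause between stanzas (stanzas are separated by a blank line)."""
    stanzas = [s for s in poem.strip().split("\n\n") if s.strip()]
    rendered = []
    for stanza in stanzas:
        # Escape characters such as & and < so the result stays valid XML.
        lines = [escape(line.strip()) for line in stanza.splitlines() if line.strip()]
        rendered.append(f'<break time="{line_pause_ms}ms"/>'.join(lines))
    body = f'<break time="{stanza_pause_ms}ms"/>'.join(rendered)
    return f"<speak>{body}</speak>"


# Placeholder text; paste the real poem here.
print(poem_to_ssml("First line\nSecond line\n\nThird line"))
```

Tuning the pause values by ear for each stanza is exactly the kind of manual work that makes this approach time-consuming.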
You can also create your own custom voice, based on a voice talent's recordings, to develop a unique rendering powered by machine learning. This also requires some time and fine-tuning.
What's the price of text-to-speech voice synthesis?
Compared to a human reader, it's of course very cheap.
You can get a full reading of "A Christmas Carol" by Charles Dickens (64 pages / 165K characters) for $2.64.
To compare the state-of-the-art option for all three providers (WaveNet / Neural / Neural), here's the pricing at time of writing.
Google Cloud: https://cloud.google.com/text-to-speech/pricing WaveNet voices: $16 per 1 million characters
Amazon Polly: https://aws.amazon.com/polly/pricing/ Neural TTS cost: $16 per 1 million characters
Microsoft Azure: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/ Neural cost: 0.5 million free characters per month, then $16 per 1 million characters
All providers charge the same rate ($16 per 1 million characters) but you can get 500,000 free characters per month at Microsoft Azure (which also gives you a free £150 allowance across all Microsoft's Cognitive Services when you sign up).
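The arithmetic behind these figures is simple enough to sketch in a few lines of Python. The helper below is only an illustration: the name `estimate_cost` is made up, the $16 rate is the one quoted above at time of writing, and the free allowance is modelled as a flat number of characters deducted before billing, which is a simplification of how a monthly free tier works.

```python
PRICE_PER_MILLION_USD = 16.0  # neural / WaveNet rate quoted above, at time of writing


def estimate_cost(characters: int, free_characters: int = 0,
                  price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Estimate the cost in USD of synthesising a text of the given length."""
    billable = max(0, characters - free_characters)
    return round(billable * price_per_million / 1_000_000, 2)


# "A Christmas Carol", roughly 165K characters, with no free tier:
print(estimate_cost(165_000))  # 2.64

# Same text with a 500,000-character free allowance:
print(estimate_cost(165_000, free_characters=500_000))  # 0.0
```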
You can use all these services either via the no-code consoles listed above (you'll have to use a hack to download the MP3 for Google and Microsoft) or call the API, following the online instructions.
I've successfully connected Microsoft Azure's API to Integromat with a single authentication step and was able to process a series of text prompts from a Google Sheet. Amazon Polly and Google Cloud require more advanced authentication methods.
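For readers who would rather call the API from code than from Integromat, here is a hedged Python sketch of what a request to Azure's text-to-speech REST endpoint can look like, using only the standard library. It is not the setup used in this article. The endpoint pattern, header names and output format string are written from Azure's Speech REST documentation as best recalled, so verify them against the current docs. The region `westeurope`, the voice name `en-US-GuyNeural`, the `User-Agent` value and both function names are assumptions for illustration.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_azure_tts_request(text: str, subscription_key: str,
                            region: str = "westeurope",
                            voice: str = "en-US-GuyNeural") -> urllib.request.Request:
    """Build (but do not send) a POST request for Azure text-to-speech."""
    # Azure expects the text wrapped in SSML with an explicit <voice> element.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voice}">{escape(text)}</voice></speak>'
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,  # the single key-based authentication
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
            "User-Agent": "tts-comparison-sketch",  # arbitrary client name
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and save the returned audio bytes to a file."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


# Usage (requires a real key, so it is left commented out):
# req = build_azure_tts_request("She opened the door", "YOUR_KEY")
# synthesize_to_file(req, "guy.mp3")
```

Because the response body is the audio itself, this route also avoids having to record the browser output to get an MP3.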
How does synthetic text-to-speech compare to a human actor?
I've picked Thomas Hardy's poem "She Opened The Door" since in 2019 I had commissioned the recording of a series of Hardy's poems from a professional voice talent.
You can listen to his performance in the video below, which will show that, as we speak, human actors still perform at a higher level than machines, with little to no detailed instructions. Human consciousness still has the advantage of intuition.
But the fast improvement of text-to-speech quality suggests that we're not far from a synthetic rendition with real emotion. The singularity is near. Stay tuned.