AI Speech Gets Real: BASE TTS

Jim Griffin 03/06/2024

Amazon has introduced a new model called BASE TTS (TTS = text-to-speech). Text-to-speech models accept written text as input and speak it aloud. They are what we use to create talking avatars and chatbots, among many other use cases.

BASE stands for Big Adaptive Streamable TTS with Emergent abilities.

The top TTS models until now have been YourTTS, Bark and Tortoise-TTS. They’ve all been pushing speech synthesis closer and closer to human-like speech, so BASE from Amazon set out to beat them by training on more data than they did. It’s a billion-parameter model trained on 100,000 hours of audio data.

The video covers seven areas where text-to-speech is known to stumble sometimes. In ascending order of difficulty, those are:

Compound nouns
Syntactically complex sentences
Foreign words
Unusual punctuation
Questions
Paralinguistics (things like groans, laughs, and whispers)
Emotions (the most difficult of all)

The video then presents 8 audio samples created by BASE TTS, each showing the model attempting one of the especially difficult tasks described above.

The results are quite impressive. Give a listen and see what you think!
