
AI Speech Gets Real: BASE TTS

Jim Griffin June 3, 2024


Background

Amazon has introduced an amazing new model called BASE TTS (TTS = text-to-speech). TTS models accept written text as input and speak it aloud; they are what we use to create talking avatars and chatbots, among many other use cases.

BASE stands for Big Adaptive Streamable TTS with Emergent abilities.

The top TTS models until now have been YourTTS, Bark and Tortoise-TTS, each of which has pushed synthesis closer to human-like speech. Amazon set out to beat them with BASE TTS by training on more data: it’s a billion-parameter model trained on 100,000 hours of audio.

The video covers seven areas where text-to-speech is known to stumble sometimes. In ascending order of difficulty, those are:

  1. Compound nouns
  2. Syntactically complex sentences
  3. Foreign words
  4. Unusual punctuation
  5. Questions
  6. Paralinguistics (things like groans, laughs, and whispers)
  7. Emotions (the most difficult of all)

The video then presents 8 audio samples created by BASE TTS, each illustrating its attempt at one of the difficult tasks listed above.

The results are quite impressive. Give a listen and see what you think!
