Abstract:

State-of-the-art speech synthesis models try to get as close as possible to the human voice. Hence, modelling emotions is an essential part of Text-To-Speech (TTS) research. In our work, we select FastSpeech2 as the starting point and propose a series of modifications for synthesizing emotional speech. According to automatic and human evaluation, our model, EmoSpeech, surpasses existing models in both mean opinion score (MOS) and emotion recognition accuracy of the generated speech. We provide a detailed ablation study for every extension to the FastSpeech2 architecture that forms EmoSpeech. Emotion is distributed unevenly across a text, and accounting for this is crucial for better synthesized speech and intonation perception. Our model includes a conditioning mechanism that handles this issue by allowing emotions to contribute to each phone with varying intensity levels. Human assessment indicates that the proposed modifications generate audio with higher MOS and emotional expressiveness.

Notation of the models:

Please refer to Section 5 for the detailed notation of the models.

ID | Model
#0 | Expressive FastSpeech2, baseline
#1 | #0 + eGeMAPS predictor
#2 | #1 + CLN
#3 | #2 + CCA
EmoSpeech | #3 + JCU
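
To make the CLN step in the table more concrete, here is a minimal sketch. It assumes that CLN stands for conditional layer normalization, where the scale and bias applied after normalization are linear projections of a conditioning embedding (for example, an emotion embedding). The function name and the `w_scale` and `w_bias` matrices are hypothetical and chosen for illustration only; this is not the EmoSpeech implementation.

```python
import math


def conditional_layer_norm(x, cond, w_scale, w_bias, eps=1e-5):
    """Normalize x, then scale and shift it with values predicted from cond.

    x       : hidden vector for one phone (list of floats)
    cond    : conditioning embedding (list of floats), e.g. an emotion embedding
    w_scale : hypothetical projection matrix, len(x) rows by len(cond) columns
    w_bias  : hypothetical projection matrix, same shape as w_scale
    """
    # Standard layer normalization over the feature dimension.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]

    # Scale and bias depend on the condition instead of being fixed parameters.
    scale = [sum(w * c for w, c in zip(row, cond)) for row in w_scale]
    bias = [sum(w * c for w, c in zip(row, cond)) for row in w_bias]

    return [s * n + b for s, n, b in zip(scale, normed, bias)]


# Illustrative values only: the same hidden vector under two different conditions.
hidden = [1.0, 2.0, 3.0]
w_scale = [[1.0, 2.0]] * 3
w_bias = [[0.0, 0.5]] * 3
out_first = conditional_layer_norm(hidden, [1.0, 0.0], w_scale, w_bias)
out_second = conditional_layer_norm(hidden, [0.0, 1.0], w_scale, w_bias)
```

The two calls normalize the same hidden vector but apply a different scale and shift, which is the sense in which the condition modulates the layer output.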

Audio samples

Speaker 0011

Model | Neutral | Angry | Happy | Sad | Surprise
Sentence | Monster made a deep bow. | Who is been repeating all that hard stuff to you? | Rat came and replied on the leaves. | The football teams give a tea party. | As rich as Peter’s son in law!
original
baseline
#1
#2
#3
EmoSpeech

Speaker 0012

Model | Neutral | Angry | Happy | Sad | Surprise
Sentence | All smile were real and the happier, the more sincere. | I thought you meant how old are you? | Let’s make the noise a snake. | She is now choosing skirt to wear. | The football teams give a tea party.
original
baseline
#1
#2
#3
EmoSpeech

Speaker 0014

Model | Neutral | Angry | Happy | Sad | Surprise
Sentence | A divine wrath made her blue eyes awful. | Rat came and replied on the leaves. | The football teams give a tea party. | As rich as Peter’s son in law! | Let’s make the noise a snake.
original
baseline
#1
#2
#3
EmoSpeech

Speaker 0015

Model | Neutral | Angry | Happy | Sad | Surprise
Sentence | In which fox loses a tail and its elder sister finds one. | She is now choosing skirt to wear. | Hold up my chin, slow and solid. | I thought you meant how old are you? | A divine wrath made her blue eyes awful.
original
baseline
#1
#2
#3
EmoSpeech

Speaker 0017

Model | Neutral | Angry | Happy | Sad | Surprise
Sentence | Who is been repeating all that hard stuff to you? | Our thanks to God’s oath. | She had said, so that one could keep up a conversation! | Monster made a deep bow. | How I hate this foul pool!
original
baseline
#1
#2
#3
EmoSpeech