State-of-the-art speech synthesis models try to get as close as
possible to the human voice. Hence, modelling emotions is an
essential part of Text-To-Speech (TTS) research. In our work,
we selected FastSpeech2 as the starting point and proposed a
series of modifications for synthesizing emotional speech.
According to automatic and human evaluation, our model,
EmoSpeech, surpasses existing models in both MOS and emotion
recognition accuracy in generated speech. We provide a detailed
ablation study for every extension to the FastSpeech2
architecture that forms EmoSpeech. Accounting for the uneven
distribution of emotions in the text is crucial for better
synthesized speech and intonation perception. Our model includes
a conditioning mechanism that handles this issue by allowing
emotions to contribute to each phone with varying intensity
levels. The human assessment indicates that the proposed
modifications generate audio with higher MOS and emotional
expressiveness.
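To make the idea of emotions contributing to each phone with varying intensity more concrete, here is a minimal sketch in plain Python. It is not the EmoSpeech implementation: the function name `condition_phones`, the use of scaled dot-product attention over a small set of emotion token vectors, and the residual addition are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def condition_phones(phone_states, emotion_tokens):
    """Hypothetical helper: mix emotion tokens into each phone state.

    Every phone computes its own attention weights over the emotion
    tokens, so the emotion contributes with a different intensity
    at each position in the utterance.
    """
    scale = math.sqrt(len(emotion_tokens[0]))
    conditioned, all_weights = [], []
    for state in phone_states:
        # Per-phone attention weights over the emotion tokens.
        weights = softmax([dot(state, tok) / scale for tok in emotion_tokens])
        # Weighted mix of emotion tokens for this phone.
        mix = [
            sum(w * tok[i] for w, tok in zip(weights, emotion_tokens))
            for i in range(len(state))
        ]
        # Residual connection: keep the phone state, add the emotion mix.
        conditioned.append([s + m for s, m in zip(state, mix)])
        all_weights.append(weights)
    return conditioned, all_weights
```

Calling `condition_phones` with two phone states and two emotion tokens returns one weight vector per phone. The weights differ from phone to phone, which is the property the abstract describes.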
Notation of the models:
Please refer to Section 5 for detailed notation of the models.
ID          Model
#0          Expressive FastSpeech2, baseline
#1          #0 + eGeMAPS predictor
#2          #1 + CLN
#3          #2 + CCA
EmoSpeech   #3 + JCU
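As an illustration of one extension in the table, the sketch below assumes that CLN stands for conditional layer normalization, in which the scale and bias applied after normalization are predicted from an emotion embedding. The function names, the weight matrices `w_scale` and `w_bias`, and the plain-Python formulation are hypothetical and not taken from the EmoSpeech code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, vec):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def conditional_layer_norm(hidden, emotion, w_scale, w_bias):
    """Assumed CLN: scale and bias come from the emotion embedding."""
    normed = layer_norm(hidden)
    scale = linear(w_scale, emotion)  # one scale value per hidden unit
    bias = linear(w_bias, emotion)    # one bias value per hidden unit
    return [s * n + b for s, n, b in zip(scale, normed, bias)]
```

With this formulation, changing the emotion embedding changes how every hidden unit is rescaled and shifted, while the normalization itself stays the same.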
Audio samples
Speaker 0011
Emotion     Sentence
Neutral     Monster made a deep bow.
Angry       Who is been repeating all that hard stuff to you?
Happy       Rat came and replied on the leaves.
Sad         The football teams give a tea party.
Surprise    As rich as Peter’s son in law!
Samples: original, baseline, #1, #2, #3, EmoSpeech
Speaker 0012
Emotion     Sentence
Neutral     All smile were real and the happier,the more sincere.
Angry       I thought you meant how old are you?
Happy       Let’s make the noise a snake.
Sad         She is now choosing skirt to wear.
Surprise    The football teams give a tea party.
Samples: original, baseline, #1, #2, #3, EmoSpeech
Speaker 0014
Emotion     Sentence
Neutral     A divine wrath made her blue eyes awful.
Angry       Rat came and replied on the leaves.
Happy       The football teams give a tea party.
Sad         As rich as Peter’s son in law!
Surprise    Let’s make the noise a snake.
Samples: original, baseline, #1, #2, #3, EmoSpeech
Speaker 0015
Emotion     Sentence
Neutral     In which fox loses a tail and its elder sister finds one.
Angry       She is now choosing skirt to wear.
Happy       Hold up my chin, slow and solid.
Sad         I thought you meant how old are you?
Surprise    A divine wrath made her blue eyes awful.
Samples: original, baseline, #1, #2, #3, EmoSpeech
Speaker 0017
Model
Neutral
Angry
Happy
Sad
Surprise
Sentence
Who is been repeating all that hard stuff to you?
Our thanks to God’s oath.
She had said, so that one could keep up a conversation!