Demos for "Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis"

Paper:  Under review.
Authors:  Yanyao Bian, Changbin Chen, Yongguo Kang, Zhenglin Pan

Abstract

Speech style control and transfer techniques aim to enrich the diversity and expressiveness of synthesized speech. Existing approaches model all speech styles into one representation, lacking the ability to control a specific speech feature independently. To address this issue, we introduce a novel multi-reference structure to Tacotron and propose an intercross training approach, which together ensure that each sub-encoder of the multi-reference encoder independently disentangles and controls a specific style. Experimental results show that our model is able to control and transfer desired speech styles individually.

Introduction

Demos here are generated by WaveNet for better audio fidelity. However, the work in the paper is independent of audio quality; when listening to the demos, readers are encouraged to focus on distinguishing the different styles of the audio.

Contents

Experiments in the paper

1 Single-reference

1.1 Style transfer

We compare the performance of non-parallel style transfer (the text content of the source and target is unrelated) between the original model and ours (intercross). Reference audios are randomly chosen from the same speaker. A conceptual sketch of the transfer pipeline is given after the synthesized text below.
These demos correspond to Figure 5 in our paper.
Reference audios
Original
Intercross
Synthesized text: 正提高自己的能力,为了更好的为你解决难题。 (I am improving my abilities so that I can better solve problems for you.)
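Conceptually, non-parallel style transfer extracts a style embedding from the reference audio and then conditions synthesis of an unrelated target text on that embedding. The sketch below only illustrates this data flow; reference_encoder and synthesize are hypothetical stand-ins, not the paper's actual interfaces.

import numpy as np

def reference_encoder(ref_mel):
    # Stand-in: collapse a reference mel-spectrogram (frames x mel bins)
    # into a fixed-size style embedding.
    return ref_mel.mean(axis=0)

def synthesize(text, style_embedding):
    # Stand-in for Tacotron conditioned on text plus a style embedding;
    # a real model would return a mel-spectrogram for the vocoder.
    return np.tile(style_embedding, (10 * len(text), 1))

ref_mel = np.random.randn(200, 80)   # reference utterance; its content is unrelated to the target text
style = reference_encoder(ref_mel)   # style embedding of the reference
mel_out = synthesize("正提高自己的能力,为了更好的为你解决难题。", style)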
More non-parallel style transfer cases across different speakers and different text contents.
speaker1 speaker2 speaker3 speaker4
Reference audios
Text1
Text2
Text3
Text1: 我可不会。 (I certainly won't.)
Text2: 南开永远年轻,她的学生也都充满活力。 (Nankai is forever young, and her students are all full of vitality.)
Text3: “你知道,我是用稻草填塞的,所以我没有脑子。”他悲伤的回答。 ("You know, I am stuffed with straw, so I have no brains," he answered sadly.)

1.2 Style control

Visualization of the gradual changes in the mel-spectrograms. For an intuitive view, we present GIFs for Text1, Text2, and Text3 from left to right.
We reuse the reference audios of speaker1 and speaker4 from the above experiments and conduct linear interpolation between them, as introduced in Section 2.5 of our paper (a minimal sketch is given after the text list below).
These demos correspond to Figure 6 in the paper.
speaker1, α = 0.0 α = 0.25 α = 0.5 α = 0.75 speaker4, α = 1.0
Text1
Text2
Text3
Text1: 我可不会。 (I certainly won't.)
Text2: 南开永远年轻,她的学生也都充满活力。 (Nankai is forever young, and her students are all full of vitality.)
Text3: “你知道,我是用稻草填塞的,所以我没有脑子。”他悲伤的回答。 ("You know, I am stuffed with straw, so I have no brains," he answered sadly.)
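The interpolation can be pictured as moving along a straight line in the style-embedding space between the two references. A minimal sketch, assuming the two embeddings have already been extracted (the variable names are ours):

import numpy as np

# Style embeddings of the two reference audios (placeholder values; in practice
# they would come from the reference encoder).
emb_speaker1 = np.random.randn(128)
emb_speaker4 = np.random.randn(128)

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    # alpha = 0.0 reproduces speaker1's style, alpha = 1.0 reproduces speaker4's.
    emb = emb_speaker1 + alpha * (emb_speaker4 - emb_speaker1)
    # Each interpolated embedding then conditions synthesis of Text1-Text3,
    # giving the gradually changing mel-spectrograms shown in the GIFs.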

1.3 Random sampling

These demos are generated by the random sampling introduced in Section 2.5 of our paper.
Synthesized text: 我是发音人随机采样的测试文本。 (I am a test sentence for speaker random sampling.)
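One common way to realize such random sampling, used here purely as an illustrative assumption rather than the exact procedure of Section 2.5, is to fit a simple Gaussian to the style embeddings of the training utterances and draw new embeddings from it:

import numpy as np

# Style embeddings extracted from training utterances (placeholder values).
train_embeddings = np.random.randn(1000, 128)

# Fit a diagonal Gaussian over the embedding space and draw a new embedding from it.
mu = train_embeddings.mean(axis=0)
sigma = train_embeddings.std(axis=0)
rng = np.random.default_rng(seed=0)
sampled_style = rng.normal(mu, sigma)   # this embedding then conditions synthesis of the test sentence above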

1.4 One-shot & few-shot

We use audio from 20 additional speakers to conduct one-shot speaker conversion, as introduced in Section 3.1.4. Not all of them are converted successfully; we present two successful cases and two failed cases here.
Reference audios Text1 Text2 Text3
Case1 (success)
Case2 (success)
Case3 (failed)
Case4 (failed)
Text1: 我赢了所有人,但却输掉了你。 (I defeated everyone, yet I lost you.)
Text2: 我心里就甜滋滋的,像吃了蜜一样。 (My heart feels sweet, as if I had eaten honey.)
Text3: 他眼里迸射出仇恨的火花。 (Sparks of hatred flashed from his eyes.)
We conduct few-shot speaker conversion on the speakers from the failed one-shot cases, using different numbers of utterances to fine-tune the models and evaluating their performance (see the sketch after the text list below).
Utterance number Text1 Text2 Text3
Case3 10
20
Case4 10
20
Text1: 我赢了所有人,但却输掉了你。 (I defeated everyone, yet I lost you.)
Text2: 我心里就甜滋滋的,像吃了蜜一样。 (My heart feels sweet, as if I had eaten honey.)
Text3: 他眼里迸射出仇恨的火花。 (Sparks of hatred flashed from his eyes.)
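The sketch below shows only the shape of the few-shot setup: each utterance budget adapts a fresh copy of the pre-trained model, which is then evaluated on Text1-Text3. load_pretrained and fine_tune are illustrative stand-ins, not the paper's training code.

def load_pretrained():
    # Stand-in for loading the pre-trained multi-speaker checkpoint.
    return {"steps": 0}

def fine_tune(model, utterances, steps=2000):
    # Stand-in for adapting the model on a small set of target-speaker utterances.
    adapted = dict(model)
    adapted["steps"] += steps
    adapted["num_adaptation_utterances"] = len(utterances)
    return adapted

target_utterances = [f"utt_{i:02d}.wav" for i in range(20)]   # recordings of a failed-case speaker

for n in (10, 20):                       # the utterance budgets compared in the table above
    model_n = fine_tune(load_pretrained(), target_utterances[:n])
    # model_n would then synthesize Text1-Text3 for evaluation.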

2 Multi-reference

2.1 Style Control

In these experiments, we control two styles: speaker and prosody. We define 5 prosodies corresponding to different speaking scenes: news, story, radio, poetry, and call-center. We briefly summarize their features as follows.
news: relatively fast; formal.
story: many transitions and breaks.
radio: relatively slow; deep and attractive voice.
poetry: slow; follows rhyming rules.
call-center: relatively fast; sweet.
We conduct parallel multi-style control. As Table 1 in our paper shows, speaker F17 and speaker M5 both have the news and radio styles. Formally, we generate audio according to the following equations (a sketch of generating this interpolation grid is given after the synthesized text below):
speaker = F17 + α1 * (M5 - F17)
prosody = radio + α2 * (news - radio)
These demos correspond to Figure 8(a) in the paper.
α1 = 0.0 α1 = 0.25 α1 = 0.5 α1 = 0.75 α1 = 1.0
α2 = 0.0
α2 = 0.25
α2 = 0.5
α2 = 0.75
α2 = 1.0
Synthesized text: 2018已经成为大家共同的回忆,2019年的“冰与火之歌”正在唱响。 (2018 has become a shared memory for everyone, and 2019's "Song of Ice and Fire" is now being sung.)
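The 5 x 5 grid above can be generated by interpolating the speaker and prosody sub-embeddings independently, exactly as the two equations describe. A minimal sketch with placeholder embeddings (the variable names and the final concatenation are our own assumptions; the model's actual way of combining sub-encoder outputs may differ):

import numpy as np

# Sub-embeddings produced by the speaker and prosody sub-encoders (placeholder values).
emb_F17, emb_M5 = np.random.randn(64), np.random.randn(64)
emb_radio, emb_news = np.random.randn(64), np.random.randn(64)

alphas = (0.0, 0.25, 0.5, 0.75, 1.0)
grid = {}
for a1 in alphas:                # columns of the table above
    for a2 in alphas:            # rows of the table above
        speaker = emb_F17 + a1 * (emb_M5 - emb_F17)
        prosody = emb_radio + a2 * (emb_news - emb_radio)
        # Concatenation is used here only for illustration.
        grid[(a1, a2)] = np.concatenate([speaker, prosody])
# Each entry of `grid` conditions synthesis of the sentence above, giving one cell of the table.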
We conduct non-parallel multi-style control. As Table 1 in our paper shows, speaker F4 only has the call-center style and speaker M2 only has the poetry style, so here we generate styles that these speakers do not have in the training data. Formally, we generate audio according to the following equations:
speaker = M2 + α1 * (F4 - M2)
prosody = poetry + α2 * (call-center - poetry)
These demos correspond to Figure 8(b) in the paper.
α1 = 0.0 α1 = 0.25 α1 = 0.5 α1 = 0.75 α1 = 1.0
α2 = 0.0
α2 = 0.25
α2 = 0.5
α2 = 0.75
α2 = 1.0
Synthesized text: 2018已经成为大家共同的回忆,2019年的“冰与火之歌”正在唱响。 (2018 has become a shared memory for everyone, and 2019's "Song of Ice and Fire" is now being sung.)

Extra experiments

1 Emotion

1.1 Style control

Similar to speaker control, we conduct emotion control and present demos here.
source emotion, α = 0.0 α = 0.3 α = 0.5 α = 0.7 target emotion, α = 1.0
Neutral-happy
Happy-angry
Angry-confuse
Confuse-sad
Sad-surprise
Surprise-fear
Text: 同学们欣喜若狂,全都兴高采烈地欢呼起来。 (The students were overjoyed and all cheered with excitement.)

1.2 Random sampling

Similar to speaker random sampling, we conduct emotion random sampling and present demos here.
Synthesized text: 我是情感随机采样的测试文本。 (I am a test sentence for emotion random sampling.)

Possible products preview

Our model has more applications than we can show here; we list some of those we are working on.

1 Expressive TTS

Adjusting the synthesis prosody to fit the text content significantly improves performance. The following demos are synthesized with the same speaker but different prosodies. Furthermore, by providing a different style embedding when synthesizing each sentence, we can obtain audio in various styles and implement role-play TTS (a sketch of this per-sentence style selection is given after the table below).
Style-news Style-poetry Style-radio
Text-news
Text-poetry
Text-radio
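A small sketch of the role-play idea: each sentence of a script requests its own prosody embedding while the speaker stays fixed. The prosody_embeddings dictionary and the synthesize function are hypothetical stand-ins.

import numpy as np

# Hypothetical prosody embeddings for one speaker (placeholder values).
prosody_embeddings = {
    "news": np.random.randn(64),
    "poetry": np.random.randn(64),
    "radio": np.random.randn(64),
}

def synthesize(text, prosody):
    # Stand-in for the TTS model conditioned on a fixed speaker and the chosen prosody.
    return np.zeros((10 * max(len(text), 1), 80))

# Each entry pairs a desired prosody with a sentence of the script.
script = [("news", "..."), ("poetry", "..."), ("radio", "...")]
audio_segments = [synthesize(text, prosody_embeddings[style]) for style, text in script]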

2 Customization TTS

Everyone can have their own synthesized voice using one-shot/few-shot adaptation. Thanks to our PMs for providing many recordings for testing.
We present 2 ground-truth recordings and 2 synthesized recordings for each speaker. Try to distinguish them.
speaker1
speaker2