
The first TTS model supporting mixed Mandarin and dialect speech: fluent Henan and Shanghai dialects

2024-08-13




AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

Since the debut of GPT-4o in 2024, companies across the industry have invested heavily in the research and development of large TTS models. In recent months, Chinese speech synthesis large models such as ChatTTS, Seed-TTS, and CosyVoice have appeared in quick succession.

Although current speech synthesis models for Mandarin perform at a level almost indistinguishable from real speakers, TTS models are rarely able to handle the complexity of Chinese dialects. Training a unified speech synthesis model covering the many Chinese dialects is an extremely challenging task.

Industry pain points and technical bottlenecks

At present, large speech synthesis models have made significant progress on Mandarin, but development for dialects remains slow. China has dozens of major dialects, each with unique phonetic features and grammatical structures, which makes training a TTS large model that covers them all extremely complicated.

Most existing TTS models focus on Mandarin and cannot meet the diverse needs of dialect speech synthesis. In addition, the scarcity of dialect corpora and the lack of high-quality annotated data further raise the technical difficulty.

Giant Network AI Lab's Technological Innovation and Breakthroughs

To solve these problems, the algorithm experts and linguists of the Giant Network AI Lab team worked together to build a Mandarin-and-dialect dataset covering 20 dialects and more than 200,000 hours of speech, organized around the Chinese dialect system. With this huge dataset, the team trained Bailing-TTS, the first TTS model that supports mixed Mandarin and dialect speech. Bailing-TTS can generate not only high-quality Mandarin speech but also speech in multiple dialects, including Henan dialect, Shanghai dialect, and Cantonese.



ArXiv: https://arxiv.org/pdf/2408.00284

Homepage: https://giantailab.github.io/bailingtts_tech_report/index.html

Paper title: Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

Audio demo link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650930007&idx=5&sn=383cf581d916b0802b940366bd4b9d5f&chksm=84e43f29b393b63f434ae60d4633694cd0362cec7590badfae2b0b683a5bd0c112e725c1f80d&token=2010422951&lang=zh_CN#rd

Below is an example of Bailing-TTS synthesizing Henan dialect:



Below is an example of zero-shot voice cloning in Mandarin:





We have adopted a number of innovative technologies to achieve this goal:

1. Unified dialect token specification: We standardized the tokens of each dialect and made Mandarin tokens partially overlap with those of each dialect, so that Mandarin provides the basic pronunciation capability. This allows high-quality dialect speech synthesis even under limited data conditions.

2. Refined token alignment technology: We propose a refined token-wise alignment technique based on large-scale multimodal pre-training.

3. Hierarchical mixture-of-experts structure: We design a hierarchical mixture-of-experts architecture that learns both unified representations shared across Chinese dialects and specific representations for each dialect.

4. Hierarchical reinforcement learning enhancement strategy: We propose a hierarchical reinforcement learning strategy that further enhances the dialect expressiveness of the TTS model by combining basic and advanced training strategies.
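The overlapping-vocabulary idea in point 1 can be sketched as follows. This is a toy illustration only: the token names and vocabularies are invented, and the paper does not describe its actual tokenizer internals. The point is that tokens shared with Mandarin reuse Mandarin's ids, so every dialect inherits base pronunciation coverage from the Mandarin subset.

```python
# Hypothetical sketch of a unified token vocabulary in which Mandarin
# tokens partially overlap each dialect's token set. Shared tokens reuse
# the Mandarin id; dialect-only tokens get fresh ids appended after it.

def build_unified_vocab(mandarin_tokens, dialect_tokens):
    """Map every token to one integer id across Mandarin and all dialects."""
    vocab = {tok: i for i, tok in enumerate(sorted(mandarin_tokens))}
    for dialect in sorted(dialect_tokens):
        for tok in sorted(dialect_tokens[dialect]):
            if tok not in vocab:          # dialect-specific token
                vocab[tok] = len(vocab)
    return vocab

# Toy token sets (invented, not from the paper).
mandarin = {"ma1", "ma2", "ni3", "hao3"}
dialects = {
    "henan":    {"ma1", "ni3", "zhong1"},   # overlaps Mandarin on ma1, ni3
    "shanghai": {"hao3", "nong2", "hau5"},  # overlaps Mandarin on hao3
}
vocab = build_unified_vocab(mandarin, dialects)
```

Because the shared tokens keep a single id, any pronunciation knowledge the model learns for them on abundant Mandarin data transfers directly to the dialects.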

Implementation details



Figure 1 Bailing-TTS overall architecture

1. Fine-grained Token Alignment Based on Large-Scale Multimodal Pre-training

To achieve refined alignment of text and speech tokens, we propose a multi-stage, multimodal pre-training learning framework.

In the first stage, we use an unsupervised sampling strategy to perform coarse training on a large-scale dataset. In the second stage, we adopt a refined sampling strategy to perform fine-grained training on a high-quality dialect dataset. This method can effectively capture the fine-grained correlation between text and speech and promote the alignment of the two modalities.
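The two-stage sampling schedule described above can be sketched like this. The corpora, batch sizes, and the sampler interface are toy stand-ins invented for illustration; the paper does not publish its actual data-loading code.

```python
import random

# Hedged sketch of a two-stage sampling schedule: stage 1 samples
# coarsely from a large mixed corpus, stage 2 samples only from a
# curated high-quality dialect subset for fine-grained training.

def make_sampler(large_corpus, dialect_corpus, stage):
    """Return a batch-sampling function for the given training stage."""
    pool = large_corpus if stage == 1 else dialect_corpus
    def sample(n, rng):
        return [rng.choice(pool) for _ in range(n)]
    return sample

# Toy corpora: (dialect label, utterance id) pairs.
large = [("mandarin", i) for i in range(1000)] + [("henan", i) for i in range(50)]
curated = [("henan", i) for i in range(50)]

stage2 = make_sampler(large, curated, stage=2)
batch = stage2(8, random.Random(0))
assert all(dialect == "henan" for dialect, _ in batch)
```

The coarse stage lets the model learn general text-speech alignment from plentiful data, while the refined stage concentrates updates on the scarce, high-quality dialect material.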

2. Hierarchical Hybrid Expert Transformer Network Structure

To train a unified TTS model for multiple Chinese dialects, we designed a hierarchical hybrid expert network structure and a multi-stage multi-dialect token learning strategy.

First, we propose a specially designed mixture-of-experts architecture to learn unified representations for multiple Chinese dialects as well as specific representations for each dialect. Then, we inject dialect tokens into different layers of the TTS model through a cross-attention-based fusion mechanism to improve the model's multi-dialect expressiveness.
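The shared-plus-specific expert idea can be sketched with plain vector arithmetic. In the real model the experts are Transformer layers and the fusion uses cross-attention; the scalar "experts" and the fixed gate below are stand-ins invented purely to show how a unified representation and a dialect-specific one are blended.

```python
# Minimal sketch of a hierarchical mixture of experts: one shared expert
# produces a representation common to all dialects, a per-dialect expert
# adds dialect-specific features, and a gate blends the two outputs.

def shared_expert(x):
    return [2.0 * v for v in x]           # stand-in for the shared branch

DIALECT_EXPERTS = {
    "henan":    lambda x: [v + 1.0 for v in x],
    "shanghai": lambda x: [v - 1.0 for v in x],
}

def hierarchical_moe(x, dialect, gate=0.5):
    """Blend unified and dialect-specific representations via `gate`."""
    s = shared_expert(x)
    d = DIALECT_EXPERTS[dialect](x)
    return [gate * a + (1 - gate) * b for a, b in zip(s, d)]

out = hierarchical_moe([1.0, 2.0], "henan", gate=0.5)
# shared -> [2.0, 4.0]; henan -> [2.0, 3.0]; blend -> [2.0, 3.5]
```

The design choice this illustrates: commonalities across dialects (shared phonetics) are learned once in the shared branch, so each dialect expert only has to model what makes its dialect different, which matters when per-dialect data is scarce.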

3. Hierarchical reinforcement learning enhancement strategy

We propose a hierarchical reinforcement learning strategy that further enhances the dialect expressiveness of the TTS model by combining a basic training strategy with an advanced one. The basic strategy supports exploration of high-quality dialect speech expressions, and the advanced strategy then strengthens the speech characteristics of each dialect, achieving high-quality multi-dialect synthesis.
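One way to picture the two-level reward idea is below. Both scoring functions, the `quality`/`dialect_fidelity` fields, and the weighting are invented for illustration; the paper does not specify its reward design at this level of detail.

```python
# Hedged sketch of a hierarchical reward: a base reward scores general
# speech quality, and an advanced reward layers a dialect-fidelity term
# on top of it, so the advanced stage refines what the base stage found.

def base_reward(sample):
    return sample["quality"]                 # e.g. a MOS-like score in [0, 1]

def advanced_reward(sample, w_dialect=0.5):
    return base_reward(sample) + w_dialect * sample["dialect_fidelity"]

def pick_best(candidates, stage):
    """Select the highest-reward candidate under the given stage's reward."""
    reward = base_reward if stage == "base" else advanced_reward
    return max(candidates, key=reward)

cands = [
    {"id": "a", "quality": 0.9, "dialect_fidelity": 0.2},
    {"id": "b", "quality": 0.8, "dialect_fidelity": 0.9},
]
assert pick_best(cands, "base")["id"] == "a"       # quality alone wins
assert pick_best(cands, "advanced")["id"] == "b"   # dialect term flips it
```

The hierarchy matters because the two objectives can disagree, as in the toy candidates above: the base stage secures overall quality first, and the advanced stage then trades some of it for stronger dialect character.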



Figure 2 Dialect MoE structure

Experimental Results

Bailing-TTS achieves near-human levels of robustness, generation quality, and naturalness in both Mandarin and multiple dialects.



Table 1. Test results of Bailing-TTS on Mandarin Chinese and dialects

In actual application scenario evaluations, Bailing-TTS has achieved good results.



Table 2 Test results of Bailing-TTS speaker fine-tuning and zero-shot cloning on Mandarin Chinese and dialects

Technology application and future prospects

At present, this multi-dialect TTS model has been applied in many practical scenarios, such as voicing NPCs in games and dialect dubbing in video creation. Through this technology, game and video content can stay closer to regional culture and strengthen users' sense of immersion.

In the future, as end-to-end voice interaction models develop further, this technology will show even greater potential in areas such as dialect culture preservation and dialect-speaking game AI NPCs. In dialect preservation, multi-dialect voice interaction makes it easy for the next generation to learn, inherit, and protect Chinese dialects, so that dialect culture can endure. In games, intelligent NPCs that speak dialects and interact by voice will further enhance the expressiveness of game content.

Giant Network AI Lab will continue to be committed to promoting the innovation and application of this technology to bring users a smarter and more convenient voice interaction experience.

About the Team

Giant Network AI Lab was established in 2022 as an artificial intelligence application and research institution affiliated with Giant Network. It focuses on AIGC content generation (image/text/audio/video/3D models, etc.), working toward fully intelligent content production and creation and driving gameplay innovation. The lab has built a full-pipeline AI production line within Giant, completed the regulatory filing of GiantGPT, the first vertical large model in the game industry, and took the lead in putting it into commercial use.