news

Voice cloning reaches human level: Microsoft's new VALL-E 2 model produces deepfakes comparable to voice actors

2024-07-24



New Intelligence Report

Editor: Qiao Yang

【New Intelligence Introduction】Following the first-generation VALL-E model released at the beginning of last year, Microsoft has now launched VALL-E 2, the first text-to-speech model to reach human parity in the robustness, similarity, and naturalness of its synthesized speech.

Microsoft recently released VALL-E 2, a zero-shot text-to-speech (TTS) model that is the first to reach human parity, a milestone for the TTS field.


Paper address: https://arxiv.org/pdf/2406.05370

Thanks to the rapid progress of deep learning in recent years, models trained on clean single-speaker speech recorded in a studio environment can already match human quality, but zero-shot TTS remains a challenging problem.

"Zero-shot" means that at inference time the model hears only a short sample of an unfamiliar voice and must speak the target text in that same voice, like an impressionist who can mimic someone on the spot.

Hearing this, you may well become alert: a model with this capability is the perfect tool for deepfakes.

Fortunately, MSRA has taken this into account and currently treats the VALL-E series purely as a research project, with no plans to incorporate it into products or expand its use.

Although VALL-E 2 has strong zero-shot learning capabilities and can imitate voices like a voice actor, the similarity and naturalness depend on factors such as the length and quality of the voice prompt and background noise.

On the project page and in the paper, the authors make an ethical statement: if VALL-E is to be extended to real-world applications, it needs at least a strong synthetic-speech detection model, plus an authorization mechanism that ensures the voice owner has approved the synthesis.

Some netizens expressed disappointment with Microsoft's practice of only publishing papers but not products.


After all, the string of recent product flops has taught us that judging from a demo alone is completely unreliable; if you can't try it yourself, it might as well not exist.


But some Reddit users speculated that Microsoft simply doesn't want to be the first mover, withholding the model out of concern over the criticism and negative publicity it might attract.

Once there is a viable way to turn VALL-E into a product, or competing products appear on the market, would Microsoft really pass up the chance to make money?



The netizens have a point: from the demos currently posted on the project page, it is hard to judge VALL-E 2's true capability.


Project page: https://www.microsoft.com/en-us/research/project/vall-ex/vall-e-2/

The five sample texts are all short English sentences of no more than ten words, the voice prompts have very similar timbres, and the English accents are not diverse.

Although there are only a few demos, one gets the impression that the model imitates British and American accents almost perfectly, whereas prompts with a slight Indian or Scottish accent fall short of being indistinguishable from the real speaker.

Method

VALL-E 2's predecessor, VALL-E, was released in early 2023 and was already a major breakthrough in zero-shot TTS: it can synthesize personalized speech from just a 3-second recording while preserving the speaker's voice, emotion, and acoustic environment.

However, VALL-E has two key limitations:

1) Stability: the random sampling used during inference can produce unstable output, and nucleus sampling with a small top-p value can fall into infinite-loop repetition (a minimal sketch of nucleus sampling is given after this list). This can be mitigated by sampling multiple times and re-ranking the results, but at extra computational cost.

2) Efficiency: VALL-E's autoregressive architecture is bound to the high frame rate of the off-the-shelf audio codec and cannot be adjusted, resulting in slow inference.
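To make the stability issue concrete, nucleus (top-p) sampling keeps only the smallest set of top-ranked tokens whose cumulative probability exceeds p; with a very small p the candidate set can shrink to a single token, which is how the repetitive loops mentioned above can arise. Below is a minimal NumPy sketch of plain nucleus sampling, purely for illustration and not taken from the paper:

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, top_p: float, rng: np.random.Generator) -> int:
    """Sample a token id from `probs`, keeping only the smallest set of
    top-ranked tokens whose cumulative probability exceeds `top_p`."""
    order = np.argsort(probs)[::-1]                        # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

rng = np.random.default_rng(0)
probs = np.array([0.6, 0.2, 0.1, 0.1])
# With a tiny top_p the nucleus collapses to token 0, so decoding keeps emitting it.
print(nucleus_sample(probs, top_p=0.3, rng=rng))
```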

Many follow-up studies have tried to address these problems, but they often complicate the model architecture and make it harder to scale up the training data.

Building on these previous works, VALL-E 2 contains two key innovations: repetition-aware sampling and grouped code modeling.

Repetition-aware sampling improves on VALL-E's random sampling: it adaptively switches between random sampling and nucleus sampling depending on how often recent tokens repeat, which effectively alleviates VALL-E's infinite-loop problem and greatly improves decoding stability.


Algorithm description of repetition-aware sampling
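The exact procedure is given in the paper's algorithm box; the following is only a schematic Python sketch of the idea described above: decode each token with nucleus sampling by default, and if that token already dominates a recent window of the history, fall back to random sampling from the full distribution for this step. The window size and repetition threshold below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def nucleus_sample(probs, top_p, rng):
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    nucleus = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

def repetition_aware_sample(probs, history, top_p, window=10, ratio=0.5, rng=None):
    """Nucleus sampling by default; if the sampled token already fills more than
    `ratio` of the last `window` decoded tokens, resample from the full
    distribution instead, breaking out of a repetition loop."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > ratio:
        token = int(rng.choice(len(probs), p=probs))   # fall back to random sampling
    return token

rng = np.random.default_rng(0)
history = [7, 7, 7, 7, 7, 7]          # decoder stuck repeating token 7
probs = np.zeros(16)
probs[7], probs[3] = 0.9, 0.1
print(repetition_aware_sample(probs, history, top_p=0.5, rng=rng))
```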

Grouped code modeling partitions the codec codes into groups, and each group is modeled as a single frame in the autoregressive step. This not only shortens the sequence and speeds up inference, but also improves performance by easing the long-context modeling problem.
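As a rough illustration (not the paper's code), grouping amounts to reshaping the first-codebook code sequence so that each autoregressive step predicts one group of codes, cutting the modeled sequence length by the group size:

```python
import numpy as np

def group_codes(codes: np.ndarray, group_size: int) -> np.ndarray:
    """Reshape a 1-D codec code sequence of length T into T // group_size frames
    of `group_size` codes each; each frame is one autoregressive modeling step."""
    usable = len(codes) - len(codes) % group_size   # drop the ragged tail for simplicity
    return codes[:usable].reshape(-1, group_size)

codes = np.arange(12)                  # stand-in for 12 codec codes at the codec frame rate
print(group_codes(codes, 1).shape)     # (12, 1): ungrouped baseline, one code per step
print(group_codes(codes, 4).shape)     # (3, 4): modeled sequence length cut by 4x
```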

It is worth noting that VALL-E 2 requires only paired speech and transcription data for training, without any additional complex data, which greatly simplifies data collection and processing and improves potential scalability.

Specifically, for each speech-transcription pair in the dataset, the audio codec encoder and the text tokenizer represent it as a codec code sequence C = [c_0, c_1, …, c_{T−1}] and a text token sequence x = [x_0, x_1, …, x_{L−1}], which are used to train the autoregressive (AR) and non-autoregressive (NAR) models.
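A toy sketch of how such a training pair might be laid out for the two models is shown below; the array shapes and ids are made up for illustration, and special tokens and embeddings are omitted:

```python
import numpy as np

# Illustrative ids only; in the paper the text side comes from a BPE tokenizer
# and the speech side from the EnCodec codec (8 codebooks per frame).
text_ids  = np.array([17, 4, 92, 3])                # x = [x_0, ..., x_{L-1}]
codec_ids = np.random.randint(0, 1024, (25, 8))     # C = [c_0, ..., c_{T-1}]

# AR model: predicts the first-codebook sequence C[:, 0] conditioned on the text,
# so its training input is the concatenation of the two sequences.
ar_input = np.concatenate([text_ids, codec_ids[:, 0]])

# NAR model: predicts the remaining codebooks given the text and the earlier codebooks.
nar_target = codec_ids[:, 1:]
print(ar_input.shape, nar_target.shape)             # (29,) (25, 7)
```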


Both the AR and NAR models use the Transformer architecture, and the subsequent evaluation compares four variants: they share the same NAR model, but the AR model's group size is 1, 2, 4, or 8.

Inference likewise combines the AR and NAR models. The AR model first generates the first-codebook target code sequence c_{≥T′,0} conditioned on the text sequence x and the code prompt c_{<T′,0}, producing one group of target codes per autoregressive step.


Given the c_{≥T′,0} sequence, the NAR model is then inferred with the text condition x and the acoustic condition c_{<T′} to generate the remaining target code sequences c_{≥T′,≥1}.
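Putting the two stages together, inference can be pictured roughly as in the sketch below. The `ar_model` and `nar_model` callables are placeholder stand-ins for the trained Transformers, not Microsoft's actual API, and the code ids are random:

```python
import numpy as np

EOS = 1024                       # illustrative end-of-sequence code id
rng = np.random.default_rng(0)

def ar_model(text_ids, first_codebook_so_far, group_size):
    """Placeholder AR step: returns the next group of first-codebook codes."""
    return rng.integers(0, 1024, group_size).tolist()

def nar_model(text_ids, prompt_codes, target_first):
    """Placeholder NAR pass: fills codebooks 2..8 for every generated frame at once."""
    return rng.integers(0, 1024, (len(target_first), 7))

def synthesize(text_ids, prompt_codes, group_size=2, max_frames=50):
    # Stage 1 (AR): generate the first-codebook target codes c_{>=T',0} group by group,
    # conditioned on the text x and the prompt's first-codebook codes c_{<T',0}.
    first_codebook = list(prompt_codes[:, 0])
    while len(first_codebook) - len(prompt_codes) < max_frames:
        group = ar_model(text_ids, first_codebook, group_size)
        first_codebook.extend(group)
        if EOS in group:
            break
    target_first = np.array(first_codebook[len(prompt_codes):])

    # Stage 2 (NAR): generate the remaining codebooks c_{>=T',>=1} given the text,
    # the acoustic prompt c_{<T'}, and the first-codebook codes from stage 1.
    target_rest = nar_model(text_ids, prompt_codes, target_first)
    return np.column_stack([target_first, target_rest])   # frames x 8 codebooks -> codec decoder

codes = synthesize(np.array([5, 9, 2]), rng.integers(0, 1024, (30, 8)))
print(codes.shape)
```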

Training used the Libriheavy corpus, which contains 50,000 hours of speech from about 7,000 speakers reading English audiobooks. Text was tokenized with BPE, and speech was tokenized with the open-source pre-trained EnCodec codec.
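For readers who want to see what the speech tokenization step looks like in practice, the snippet below encodes audio into discrete codec codes with the Hugging Face port of EnCodec; this is only an assumed setup for illustration, and the paper's exact codec configuration may differ:

```python
# pip install torch transformers
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

audio = torch.zeros(24_000).numpy()   # one second of silence as a stand-in utterance

inputs = processor(raw_audio=audio, sampling_rate=24_000, return_tensors="pt")
with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])

codes = encoded.audio_codes           # discrete codec codes, one sequence per codebook
print(codes.shape)
```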

In addition, the open-source pre-trained Vocos model serves as the audio decoder for speech generation.

Evaluation

To verify whether the model's synthesized speech reaches human parity, two subjective metrics, SMOS and CMOS, were used for evaluation, with real human speech as the ground truth.

SMOS (Similarity Mean Opinion Score) is used to evaluate the similarity between the speech and the original prompt, with a score range of 1 to 5 in increments of 0.5 points.

CMOS (Comparative Mean Opinion Score) is used to evaluate the naturalness of synthesized speech compared with a given reference speech. The scale range is -3 to 3, with an increment of 1.


According to the results in Table 2, VALL-E 2's subjective scores not only exceed those of the first-generation VALL-E but also surpass real human speech.

In addition, the paper also uses objective indicators such as SIM, WER and DNSMOS to evaluate the similarity, robustness and overall perceptual quality of synthesized speech.
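To reproduce the objective side on your own samples, WER can be computed from an ASR transcript with an edit-distance library such as jiwer, and SIM is typically the cosine similarity between speaker embeddings of the prompt and the synthesized speech. A rough sketch follows; the paper's specific ASR and speaker-verification models are not assumed here:

```python
import numpy as np
import jiwer   # pip install jiwer

# WER: word error rate between the reference text and an ASR transcript of the synthesis.
reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumps over a lazy dog"
print("WER:", jiwer.wer(reference, hypothesis))

def speaker_similarity(emb_prompt: np.ndarray, emb_synth: np.ndarray) -> float:
    """SIM: cosine similarity between speaker embeddings of the prompt and the synthesis."""
    return float(emb_prompt @ emb_synth /
                 (np.linalg.norm(emb_prompt) * np.linalg.norm(emb_synth)))
```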


On all three objective metrics, VALL-E 2 improves over VALL-E across the board regardless of group size. Its WER and DNSMOS scores even beat real human speech, though a gap remains on the SIM score.

The results in Table 3 also show that VALL-E 2 performs best when the AR model's group size is 2.

The evaluation on the VCTK dataset leads to similar conclusions. As the prompt length grows, grouped code modeling shortens the sequence and alleviates the generation errors caused by incorrect attention alignments in the Transformer architecture, thereby improving the WER score.


About the Author

The first author of the paper, Sanyuan Chen, is a joint PhD student at Harbin Institute of Technology and Microsoft Research Asia. He has been a research intern in the Natural Language Computing group at MSRA since 2020, focusing on pre-trained language models for speech and audio processing.


References:

https://arxiv.org/abs/2406.05370