news

The latest technical details behind the globally popular AI audio model

2024-07-24



Smart Things
Compiled by Meng Qiang
Edited by Yunpeng

According to Zhidongxi on July 24, Stability AI shared the Stable Audio Open research paper on arXiv on July 19, disclosing the technical details behind the model.

Stable Audio Open is an open-source text-to-audio model launched by Stability AI in June this year. It can generate, for free, audio samples and sound effects up to 47 seconds long as high-quality 44.1 kHz stereo audio, and it runs on consumer-grade GPUs. Beyond being free and open source, the model also focuses on protecting creators' copyright and tries to avoid ethical issues in data training.

The paper reveals that Stable Audio Open is a variant of the commercial Stable Audio 2 launched by Stability AI in March this year. The overall architecture remains the same, but adjustments were made to the training data and to parts of the architecture. The key components are an autoencoder, T5-based text embeddings, and a diffusion transformer (DiT).

Paper address: https://arxiv.org/html/2407.14358v1

1. Three key components enable free generation of high-quality 44.1 kHz stereo short audio

Stable Audio Open introduces a text-to-audio model with three main components:

  1. Autoencoder: compresses waveform data to a manageable sequence length;
  2. Text embedding based on T5;
  3. Diffusion-based Transformer (DiT) models: operate in the latent space of the autoencoder.

An autoencoder is a neural network architecture consisting of an encoder and a decoder. The encoder compresses the input data into a smaller latent space representation, while the decoder decompresses the latent representation. The autoencoder in Stable Audio Open compresses the audio waveform into a shorter sequence for subsequent processing.
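To make the sequence-length compression concrete, here is a minimal numpy sketch. It is not Stability AI's learned autoencoder — `encode` and `decode` are illustrative stand-ins invented for this example — but it shows how folding a 44.1 kHz waveform into frames yields a far shorter sequence for the diffusion model to operate on:

```python
import numpy as np

def encode(waveform: np.ndarray, hop: int = 1024) -> np.ndarray:
    """Toy 'encoder': fold the waveform into hop-sized frames and
    average each frame. A real learned encoder maps each frame to a
    latent vector; here we only illustrate the length compression."""
    n = (len(waveform) // hop) * hop        # drop the ragged tail
    return waveform[:n].reshape(-1, hop).mean(axis=1)

def decode(latents: np.ndarray, hop: int = 1024) -> np.ndarray:
    """Toy 'decoder': expand each latent value back to hop samples."""
    return np.repeat(latents, hop)

# One second of 44.1 kHz audio becomes a much shorter latent sequence.
wave = np.random.randn(44_100).astype(np.float32)
z = encode(wave)
print(len(wave), "->", len(z))   # 44100 -> 43
recon = decode(z)
```

A trained encoder would emit a latent vector per frame (and the decoder would be trained to reconstruct the waveform faithfully), but the hop-factor reduction in sequence length is the same idea.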


T5 (Text-to-Text Transfer Transformer) is a natural language processing model developed by Google that can convert input text into another text representation. In Stable Audio Open, the T5 model converts user-entered text into text embeddings to facilitate the integration of text information into the audio generation process.
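As a rough illustration of what "text embeddings" means here, the toy sketch below maps prompt tokens to vectors through a lookup table. The vocabulary, dimension, and random weights are invented for the example; a real T5 encoder uses learned weights plus self-attention, so each token's vector reflects the whole prompt:

```python
import numpy as np

VOCAB = {"a": 0, "drum": 1, "loop": 2, "ambient": 3, "rain": 4, "<unk>": 5}
DIM = 8
rng = np.random.default_rng(0)
EMB = rng.standard_normal((len(VOCAB), DIM))  # stand-in for learned weights

def embed(prompt: str) -> np.ndarray:
    """Map each whitespace token to its embedding vector.

    Returns an array of shape (num_tokens, DIM) — the per-token
    sequence a diffusion model can attend to as conditioning."""
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in prompt.lower().split()]
    return EMB[ids]

e = embed("a drum loop")
print(e.shape)  # (3, 8)
```

The point is the output shape: one vector per token, which the DiT consumes as conditioning when denoising the audio latents.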

DiT (Diffusion Transformer) is a diffusion model that operates in the autoencoder's latent space, iteratively denoising the compressed representation so that the decoder can reconstruct coherent, high-quality audio.
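The reverse-diffusion idea can be sketched in a few lines: start from pure noise and repeatedly move toward the denoiser's prediction of the clean latent. The `denoiser` stub below simply returns a fixed target; in Stable Audio Open, the DiT plays that role, conditioned on the text embedding and the timestep:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 2 * np.pi, 64))  # stand-in "clean" latent

def denoiser(x: np.ndarray, t: int) -> np.ndarray:
    """Stub for the DiT: predicts the clean latent. A real DiT is a
    Transformer conditioned on text embeddings and the timestep t."""
    return target

def sample(steps: int = 50) -> np.ndarray:
    x = rng.standard_normal(64)      # start from pure noise
    for t in range(steps, 0, -1):
        x_hat = denoiser(x, t)       # predicted clean latent
        x = x + (x_hat - x) / t      # step a fraction toward it
    return x

x = sample()
print(float(np.abs(x - target).max()))  # near zero after all steps
```

Each step shrinks the gap to the prediction by a factor of (t-1)/t, so the iterate converges to the "clean" latent — the same denoise-in-latent-space loop a real sampler runs before handing the result to the decoder.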


As a variant of Stable Audio 2, Stable Audio Open adjusts the training data and parts of the architecture. It uses a completely different dataset and replaces CLAP (Contrastive Language-Audio Pretraining) with T5 for text conditioning. T5, developed by Google, focuses on text and handles a wide range of natural language processing tasks, while CLAP is a contrastive model that jointly embeds language and audio data.

As an open-source, free model, Stable Audio Open cannot generate full, coherent tracks, nor is it optimized for complete songs, melodies, or vocals.

Stability AI says Stable Audio Open focuses on audio demos and sound-effect production and can generate up to 47 seconds of high-quality 44.1 kHz stereo audio for free. The trained model is well suited to creating drum beats, instrument loops, ambient sounds, foley recordings, and other audio samples for music production and sound design.

Another key benefit of this open source version is that users can fine-tune the model based on their own custom audio data. This way, users can train the model with their own drum recordings and generate unique beats in their own style.

2. The training process focuses on protecting copyright

Against the backdrop of generative AI's rapid development, debate over its use in the music industry is growing, especially around copyright. Ed Newton-Rex, Stability AI's former vice president of audio and a contributor to Stable Audio's development, left at the end of 2023 because he disagreed with the company's use of copyrighted audio to train models, which he considered unethical.

Training data for generative AI is like a black box: no one but the developers knows whether the data used is protected by copyright. Newton-Rex said: "Many multi-billion-dollar technology companies use creators' works without permission to train generative AI models, and then use those models to generate new content." In a public resignation letter, he said he could not accept this way of profiting from infringing creators' copyrights.

Stability AI said that, to respect creators' copyright, the datasets used for Stable Audio Open come from Freesound and the Free Music Archive (FMA), and all recordings used were released under Creative Commons (CC) licenses. CC is a copyright licensing mechanism that lets creators share their works while specifying how others may use them.


To avoid including copyrighted material, Stability AI says it used an audio tagger to identify music samples in Freesound; the identified samples were sent to the content detection company Audible Magic to ensure potentially copyrighted music was removed from the dataset.

“This allows us to create an open audio model while fully respecting the rights of creators,” Stability AI said.

Conclusion: An open-source, free model makes text-to-audio generation more accessible

The launch of Stable Audio Open demonstrates Stability AI's innovation and progress in text-to-audio models. Although the model is limited in the length and coherence of the audio it generates, its advantages are also clear: it generates high-quality 44.1 kHz stereo audio for free and runs on consumer-grade GPUs, lowering the barrier to using text-to-audio tools.

At the same time, Stable Audio Open not only opens up audio generation technology, but also sets a new benchmark for copyright protection. In the future, with the continuous advancement of technology and the improvement of ethical standards, Stable Audio Open is expected to realize its potential in more application scenarios and promote the development and popularization of audio generation technology.

Currently, the Stable Audio Open model weights are available on the machine learning model platform Hugging Face. Stability AI encourages sound designers, musicians, developers, and anyone interested in audio to explore the model's capabilities and provide feedback.

Source: Stability AI