
China's first large audio generation model has been registered

2024-09-20


Phoenix.com Technology News, September 20: The Shanghai Cyberspace Administration recently released its latest list of generative large models approved for registration in Shanghai. The Ximalaya audio large model was approved alongside text large models such as those from miHoYo and China Literature Group's Dream Island, making it the first audio generation large model in the country to pass the Cyberspace Administration's registration for generative artificial intelligence services.

The Ximalaya audio large model is presented as the world's first fourth-generation audio generation large model, with multi-emotional interpretation and exceptionally natural expression. The model is expected to lead AIGC across the audio industry from third-generation audio generation models to fourth-generation audio generation large models.

The Ximalaya audio large model is built on an LLM framework that the Everest AI team developed around its own text-audio joint modeling, which trains audio and text jointly under the same vector space representation. This joint modeling approach gives the audio generation task rich semantic information and makes full use of the intrinsic connections and complementary information between the two modalities, greatly improving the model's performance and generalization ability. It is also the core technological breakthrough that enables the fourth-generation audio model to surpass the previous generation.

During training, Ximalaya Everest AI first preprocesses the audio data and text data separately, converts them into tokens suitable for model input, and maps the audio tokens and text tokens into the same vector space representation, so that the model can better understand and process the relationship between audio and text. The overall training process includes pre-training, supervised fine-tuning (SFT), domain supervised fine-tuning (domain SFT), speaker supervised fine-tuning (speaker SFT), and reinforcement learning (RL). After these stages, the model has the following features: (1) 15-second timbre cloning and voice conversion; (2) super-human, multi-emotional speech generation aligned with human preferences; (3) highly controllable style and paralinguistic capabilities.
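The idea of placing audio tokens and text tokens in one vector space can be illustrated with a toy sketch. It is not Everest AI's implementation, which the article does not describe in detail. It assumes a common approach in which audio token ids are offset past the text vocabulary so that both share a single embedding table. The vocabulary sizes, the embedding dimension, and the function names are all hypothetical.

```python
import random

# Hypothetical toy sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 8
AUDIO_VOCAB_SIZE = 4
EMBED_DIM = 3


def to_joint_ids(text_ids, audio_ids):
    """Merge text and audio token ids into one sequence over a joint vocabulary.

    Audio ids are shifted by TEXT_VOCAB_SIZE so the two id ranges never collide.
    """
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for a in audio_ids:
        if not 0 <= a < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio id out of range: {a}")
    return list(text_ids) + [TEXT_VOCAB_SIZE + a for a in audio_ids]


def make_embedding_table(seed=0):
    """Build one randomly initialized table covering both vocabularies."""
    rng = random.Random(seed)
    rows = TEXT_VOCAB_SIZE + AUDIO_VOCAB_SIZE
    return [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(rows)]


def embed(joint_ids, table):
    """Look up each joint id in the shared table."""
    return [table[i] for i in joint_ids]


joint = to_joint_ids([1, 5], [0, 3])
vectors = embed(joint, make_embedding_table())
print(joint)
print(len(vectors), len(vectors[0]))
```

Because both kinds of token are looked up in the same table, a single sequence model can attend over text and audio positions alike. In a real system the table would be learned during training rather than left at its random initialization.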

The Ximalaya Everest AI R&D team evaluated the trained model and found that, for long-form audio content such as audio novels, it scored significantly higher than third-generation audio generation models from China and abroad on the controllability of each character's interpretation style, the stability of phoneme rendering, and the naturalness of speech flow, rhythm, and pauses.

The Ximalaya audio large model follows the paradigm of "integration of production and model": combining the model with the business forms a positive feedback loop of business, data, and algorithms. It is widely used in business scenarios such as AIGC audiobooks and conversational chat. For example, the recently popular audiobook "My Altay" was generated by the Ximalaya audio large model. Ximalaya Everest AI said that the model's capabilities can be tried directly on the Everest AI official website, where users can create their own audio content.