
"China's first model with voice capabilities comparable to GPT-4o", Lingo voice AI model opens for internal testing

2024-08-24


IT Home reported on August 24 that Xihu Xinchen, a company backed by Jinke Tom Cat, launched its Xinchen Lingo large voice model this August. It is the first end-to-end large voice model in China, and reservations for internal beta testing opened today (August 24).

According to the official announcement released on August 21, an end-to-end speech model is a more comprehensive technology than traditional TTS. It not only performs speech recognition but also integrates natural language processing, intent recognition, dialogue management, and speech synthesis, covering the complete interaction from speech input to spoken response and greatly enriching the depth and breadth of human-computer interaction.

Citing the official press release, IT Home said that Lingo is the first model in China with voice capabilities on par with GPT-4o, and that it has three notable technical features:

Native speech understanding: As an end-to-end model, Lingo not only recognizes the words in speech but also accurately captures other important features such as emotion, tone, pitch, and even ambient sound. This helps the model understand speech more comprehensively and provide a more natural and vivid interactive experience.

Multiple voice styles: Lingo can adaptively adjust the speed, pitch, and noise intensity of its speech according to the context and user instructions. It can also generate voice responses in a variety of styles, such as conversation, singing, and crosstalk (xiangsheng), which improves the model's flexibility and adaptability across application scenarios.

Voice modality super compression: Lingo uses a voice codec with a compression ratio in the hundreds, which can compress speech to an extremely short length. This significantly reduces computing and storage costs while helping the model generate high-quality speech.