
Don't focus only on the ChatGPT version of "Her": domestic players are also making strides in multimodal, human-like AI interaction.

2024-07-31


Synced

Author: Du Wei

How far has AI come in recognizing human emotions? Earlier this month, a high-profile competition that challenged AI to become more emotionally aware came to a close.

The event was the 2nd Multimodal Emotion Recognition Challenge (MER24), jointly initiated by Professor Tao Jianhua of Tsinghua University, Lian Zheng of the Institute of Automation of the Chinese Academy of Sciences, Björn W. Schuller of Imperial College London, Zhao Guoying of the University of Oulu, and Erik Cambria of Nanyang Technological University, and held at the top AI conference IJCAI 2024. It explored how to use multimodal data such as text, audio, and video for AI emotion recognition, and how to push the related technologies into real human-computer interaction scenarios.



Competition official website: https://zeroqiaoba.github.io/MER2024-website/#organization

This year's challenge had three tracks: Semi (semi-supervised learning), Noise (noise robustness), and Ov (open-vocabulary emotion recognition). The Semi track attracted the most teams, was the most difficult, and saw the fiercest competition.

Taking the Semi track as an example, participating teams had to train their models on a small amount of labeled video data together with a large amount of unlabeled video data, and were evaluated on the models' performance and generalization ability on unlabeled data. The key to winning this track was to improve emotion recognition performance, such as the accuracy of predicted emotion categories, through better semi-supervised learning techniques.

Since the competition opened in May, nearly 100 teams from around the world, including well-known universities and young startups, competed over roughly two months. First place in the Semi track went to the social platform Soul App, whose voice technology team pulled ahead with a practical and innovative technical solution.



Before diving into the Soul team's technical solution, however, it is worth first looking at where AI stands on multimodal emotion recognition.

The next step in human-computer interaction

Let AI understand emotions

Today's AI seems almost omnipotent: it can hold conversations, generate images and videos, solve math problems, and handle tasks at different levels such as perception, learning, reasoning, and decision-making. Backed by large models, AI is arguably smart enough, yet it still falls short on emotional qualities such as empathy.

In human-computer interaction, users sometimes want AI not only to follow instructions and complete tasks, but also to provide enough emotional value to meet their emotional needs. Moving from functional "basics" to emotional "advanced skills", the repertoire AI needs to master has to be upgraded.

Multimodal emotion recognition has therefore become an active research topic in AI. AI that can understand and convey emotion has become a new focus of industry attention and is widely regarded as the next major breakthrough in the field. Over the past six months, several AI startups and industry giants have shown us new forms of immersive human-computer interaction.

In early April, the overseas startup Hume AI released a voice conversation bot, the Empathic Voice Interface (EVI), which analyzes the interlocutor's tone in voice conversations to detect as many as 53 emotions. It can also simulate different emotional states, coming closer to a real person in interaction. This breakthrough in emotional AI also helped the startup quickly secure $50 million in Series B financing.

Then OpenAI played its big card: its flagship model GPT-4o demonstrated real-time audio and video conversation, responding instantly to users' emotions and tone. It was dubbed the ChatGPT version of "Her", and the voice feature was recently rolled out to users. With such eloquence and emotional perception, AI now has people proclaiming that the science fiction era has arrived.

Domestic players such as Microsoft XiaoIce and Lingxin Intelligence are also committed to building emotional AI products. A trend is emerging: emotion recognition is becoming an increasingly integral part of multimodal AI applications spanning text, audio, and video. However, to go further in human-like emotion recognition, problems such as the scarcity of labeled data and the instability and inaccuracy of subjective emotion recognition still have to be solved.

It is therefore especially important to encourage academia and industry to pay more attention to multimodal emotion recognition and to accelerate innovation in the related technologies. Top AI conferences such as ACM MM and AAAI already treat affective computing as an important research topic, and conferences such as CVPR and ACL have hosted affective-computing challenges. In the era of big data and large models in particular, how to exploit massive amounts of unlabeled data in multimodal emotion recognition, and how to effectively process and fuse information from different modalities, are major challenges facing the industry. This is precisely the motivation and significance of the MER24 challenge.

The Soul team's first place in the Semi track was built on its accumulated capabilities and innovations in multimodal data understanding, emotion recognition algorithms, and model optimization tooling, as well as its internal workflows and the efficient collaboration of its technical team.

Winning the most difficult track

What did the Soul team do?

If the Semi track was the most difficult, what exactly made it hard, and how did the Soul team take first place? Read on.

Data is one of the three pillars of AI. Without sufficient, and especially high-quality, training data, a model cannot deliver good performance. Facing data scarcity, the industry must not only expand every type of data, including AI-generated data, but also improve models' generalization in data-sparse scenarios. Multimodal emotion recognition is no exception: it hinges on massive amounts of labeled data in which text, audio, and video content is tagged with emotion labels such as joy, anger, sorrow, and sadness. The reality, however, is that data with emotion labels is very scarce on the internet.

The Semi track of this competition provided only 5,030 labeled samples; the remaining 115,595 samples were unlabeled. The scarcity of labeled data was therefore the first problem every participating team, the Soul team included, had to face.



Image source: MER24 baseline paper: https://arxiv.org/pdf/2404.17113

On the other hand, compared with the Noise and Ov tracks, the Semi track tests core backbone technology: it places more weight on the choice of model architecture and the generalization of feature extraction, and thus demands deeper accumulation and innovation in multimodal large-model technology.



Given the track's combination of scarce labeled data and high technical demands, the Soul team prepared thoroughly before the competition, building on modules of its previously developed in-house large models, and settled on a feasible, innovative technical solution. The overall strategy was "backbone first, then fine-tuning": first improve the generalization of each core feature-extraction model, then fuse them together. In the concrete implementation, the team focused on the following areas of work, which constitute its core advantages.

First, the team focused on multimodal feature extraction in the early stage. Within an end-to-end model architecture, it used pre-trained models to extract emotion representations from the text, speech, and vision modalities, attending to both the commonalities and the differences of emotion across modalities to improve recognition. In the later stage, it designed effective fusion methods based on each modality's features and combined these modules into a single model architecture. To improve the generalization of the pre-trained video model, the Soul team proposed EmoVCLIP, introduced for the first time in the field of emotion recognition: a model that combines the large model CLIP with prompt-learning techniques to generalize better in video emotion recognition.
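To make the idea concrete, here is a minimal sketch of CLIP-based video emotion scoring in the spirit of EmoVCLIP, assuming a Hugging Face transformers CLIP checkpoint. The real EmoVCLIP uses learnable prompt vectors (prompt learning) rather than the hand-written prompt templates shown here, and the emotion labels, prompt texts, and frame sampling below are illustrative assumptions rather than the team's actual setup.

```python
# Sketch: frame-pooled CLIP features scored against emotion prompts.
# EmoVCLIP replaces these hand-written templates with learnable prompt
# vectors; labels, prompts, and frame sampling here are illustrative.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

EMOTIONS = ["happy", "angry", "sad", "worried", "surprised", "neutral"]
PROMPTS = [f"a video frame of a person who feels {e}" for e in EMOTIONS]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_emotion_scores(frames):
    """frames: list of PIL images sampled from one video clip."""
    with torch.no_grad():
        inputs = processor(text=PROMPTS, images=frames,
                           return_tensors="pt", padding=True)
        out = model(**inputs)
        # logits_per_image has shape (num_frames, num_prompts); averaging
        # over frames gives one clip-level score per emotion prompt.
        return out.logits_per_image.mean(dim=0).softmax(dim=-1)

# Dummy 8-frame clip, just to show the call signature.
frames = [Image.fromarray((np.random.rand(224, 224, 3) * 255).astype(np.uint8))
          for _ in range(8)]
scores = clip_emotion_scores(frames)
print({e: round(float(s), 3) for e, s in zip(EMOTIONS, scores)})
```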

In addition, to strengthen emotion recognition in the text modality, the Soul team used GPT-4 to pseudo-label emotions for the text data, taking full advantage of GPT-4's sensitivity to emotional cues to improve text-modality accuracy and to lay a better foundation for the subsequent modality fusion.
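The article does not disclose the team's prompt, but the workflow is easy to picture. Below is a minimal sketch of text-modality pseudo-labelling with GPT-4 through the OpenAI Python client (v1+); the prompt wording, label set, and fallback behaviour are assumptions for illustration only.

```python
# Sketch of pseudo-labelling a transcript's emotion with GPT-4.
# Prompt text, label set, and fallback are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["happy", "angry", "sad", "worried", "surprised", "neutral"]

def pseudo_label(transcript: str) -> str:
    """Ask GPT-4 to pick one emotion label for a video transcript."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify the speaker's emotion in the following utterance. "
                f"Answer with exactly one word from {LABELS}.\n\n{transcript}"
            ),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"  # fall back on parse failure

print(pseudo_label("I can't believe they cancelled the trip again..."))
```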

Second, for multimodal feature fusion, the Soul team applied a Modality Dropout strategy for the first time in multimodal emotion recognition and studied how different dropout rates affect performance. To ease the competition between modalities, a given modality (text, speech, or video) is randomly suppressed during training, which improves robustness and strengthens the model's generalization to data beyond the provided labeled set.
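As a rough illustration, the sketch below zeroes out each modality's feature vector per sample with some probability before a simple concatenation-and-MLP fusion head. The feature dimensions, dropout rate, and fusion design are placeholder choices, not the team's actual architecture.

```python
# Sketch of modality dropout: during training, each modality's feature vector
# is zeroed out per sample with probability p_drop before fusion.
import torch
import torch.nn as nn

class ModalityDropoutFusion(nn.Module):
    def __init__(self, dims, num_classes=6, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims.values()), 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, feats):
        parts = []
        for name, x in feats.items():  # x: (batch, dim) for each modality
            if self.training:
                # Keep-mask per sample: 0 suppresses this modality entirely.
                keep = (torch.rand(x.shape[0], 1, device=x.device)
                        >= self.p_drop).float()
                x = x * keep
            parts.append(x)
        return self.classifier(torch.cat(parts, dim=-1))

dims = {"text": 768, "audio": 512, "video": 512}
model = ModalityDropoutFusion(dims).train()
feats = {k: torch.randn(4, d) for k, d in dims.items()}
print(model(feats).shape)  # torch.Size([4, 6])
```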

Finally, semi-supervised learning comes into play. The basic idea is to train a model on the labeled data, use it to predict the unlabeled data, and generate pseudo labels from those predictions; the pseudo-labeled data is then folded back into training to keep improving the model. Using this self-training strategy, the Soul team cyclically pseudo-labeled the more than 110,000 unlabeled samples in the Semi track, added them to the training set, and iteratively updated the model to obtain the final system.
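The loop itself is straightforward. Here is a minimal, self-contained sketch of the self-training idea, using a scikit-learn logistic regression and random arrays as stand-ins for the fused multimodal model and the competition features; the confidence threshold and number of rounds are arbitrary illustrative values.

```python
# Sketch of self-training: train on labeled data, pseudo-label the unlabeled
# pool, keep confident predictions, retrain, and repeat. Model and data are
# stand-ins; thresholds and round counts are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(500, 32))          # labeled features
y_lab = rng.integers(0, 6, size=500)        # 6 emotion classes
X_unlab = rng.normal(size=(5000, 32))       # unlabeled pool

X_train, y_train = X_lab, y_lab
for round_idx in range(3):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = model.predict_proba(X_unlab)
    conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
    keep = conf >= 0.8                      # keep only confident pseudo-labels
    X_train = np.concatenate([X_lab, X_unlab[keep]])
    y_train = np.concatenate([y_lab, pseudo[keep]])
    print(f"round {round_idx}: {keep.sum()} pseudo-labelled samples added")
```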



Soul team's technical plan for the competition.

From the overall approach to multimodal feature fusion, contrastive learning, and self-training on unlabeled data, the Soul team's technical solution paid off. In terms of the accuracy of multimodal emotion recognition across speech, vision, and text, the team's system improved on the baseline by 3.7%, reaching more than 90%. The system is also better at distinguishing emotions with easily confused boundaries, such as anxiety and worry.



Image source: MER24 baseline paper: https://arxiv.org/pdf/2404.17113

Looking deeper, the Soul team's success in the MER24 challenge is a concentrated reflection of its long-term investment in large-model technology, and in multimodal emotional interaction in particular, within the social domain.

Innovative multimodal anthropomorphic interaction

Social AI is the next level

The social domain has a natural need for emotional AI. A mainstream view holds that the essence of social interaction is the exchange of emotional value, and emotions are diverse. If AI is to blend seamlessly into social scenarios and function effectively, it must provide rich emotional feedback and experiences, just as a real person would.

The foundation for empathetic AI is powerful multimodal emotion recognition, which lets it evolve from a simple "task executor" into a "companion that meets human emotional needs". Yet it is still very difficult for AI to genuinely understand emotion: it differs fundamentally from humans in understanding context, perceiving user emotions, giving emotional feedback, and thinking. Continuous innovation in the relevant technologies and algorithms is therefore essential.

For Soul, which is rooted in social networking, building AI with emotional capability has become a key question. When it launched in 2016, Soul first asked how innovative technology and products could better meet user needs, and introducing AI to help people connect with one another became the key to its foothold and growth in the social space. Its early "Lingxi Engine" uses intelligent recommendation algorithms to mine and analyze users' interest graphs and full-scene features on the platform, making it easier for them to find people they can talk to and the content they need, and forming a highly sticky ecosystem of users and content. To this day, this "smarter" algorithmic matching remains one of the most active features on Soul.

Building on this early experience with AI-assisted social interaction, and riding the current wave of rapid progress in large-model technology, Soul is further exploring new possibilities for human-computer interaction on top of AI's role in social interaction and relationship networks.

Since starting AIGC algorithm R&D in 2020, Soul has focused on multimodality and built up cutting-edge capabilities in intelligent dialogue, image generation, and voice and music generation. Compared with purely technology-driven AI startups, a distinguishing feature of Soul is its "model-application integration" strategy: advancing large models and consumer-facing AIGC applications in tandem, concentrating on AI with emotion-recognition capability so that genuinely warm feedback is delivered in rich, human-like interaction scenarios.

Soul's moves over the past two years show an accelerating push to bring AIGC into social scenarios. In 2023 it launched its self-developed language model Soul X, an important piece of infrastructure for its AIGC + social strategy. With the model's prompt-driven generation, controllable conditional generation, context understanding, and multimodal understanding, in-app conversations are not only smooth and natural but also emotionally warm.

Text was the first proving ground for Soul's emotion-recognition capability, which has since extended from a single modality to many. This year Soul launched a large voice-generation model and officially upgraded its self-developed voice models, covering sub-fields such as voice generation, speech recognition, voice dialogue, and music generation. Alongside features such as real-time voice generation and voice DIY, it can also simulate real-time dialogue with multiple emotions.

Beyond continuing to build more emotional AI at the model level, Soul has also put these models to work in the diverse social scenarios on its platform, further enriching users' AI interaction experience.

Take Soul's human-like conversational bot "AI Goudan" as an example. Built on Soul's self-developed language model Soul X, it achieves human-like interaction: it can accurately understand the text, images, and other multimodal content users send, and in multi-turn conversations it proactively expresses care appropriate to the context, much like chatting with a real person. Users can also customize their own Goudan for a unique virtual-companion experience.



AI Goudan also shows an integration of human-likeness, knowledge, multimodality, time awareness, and more, leaving many users on Soul marveling at its human-like interaction. That is why so many of them post half-joking complaints along the lines of "I keep suspecting Goudan is actually a real person."

In addition, Soul has used Soul X to introduce AI NPCs into its game scenario "Werewolf Phantom". With the help of advanced reinforcement learning, these NPCs exhibit human-like decision-making across the stages of the game, including bluffing, building trust, leading, and confrontation, so they can play Werewolf with users directly and speak without feeling out of place.

Another example is "Echoes of Another World", Soul's first standalone new app outside its main platform. On this AI social platform, users can have immersive, real-time exchanges with virtual characters across multiple scenes and styles, all of whom come with their own appearance, voice, and personality. Users can also create custom virtual characters and tailor their personas (background, experiences, personality, and so on) to their liking, which makes it highly playable.

Likewise, the self-developed voice models play a role in scenarios such as AI Goudan, Werewolf Phantom, and Echoes of Another World. In Echoes of Another World, for example, voice calls are supported: virtual characters with lifelike voices can talk with users naturally in real time, enriching the interactive experience.



"Echoes from Another World" real-time voice call function.

Beyond deepening human-like AI interaction in social scenarios such as intelligent conversation, games, and voice, Soul is also building visual-generation capabilities for diverse artistic styles suited to its own aesthetic and creating AI digital avatars, moving toward a multi-dimensional, comprehensive interactive experience.

It is clear that Soul's work on AI emotion recognition now spans the language, voice, and visual modalities, with efforts in the text, image, audio, and video scenarios most closely tied to social interaction, letting users experience warm AI through multi-sensory, well-rounded human-computer interaction.

Conclusion

Many in the industry call 2024 the first year of AIGC applications. The focus is no longer just on parameters and base capabilities; as attention shifts from the model layer to the application layer, only by landing AI in vertical fields and concrete scenarios can companies win more users and markets. For consumer-facing human-computer interaction in particular, it is natural to center on user needs, and the social domain illustrates this well.

Previously, a number of companion-chat apps such as AlienChat were shut down, and "the first batch of young people who fell in love with AI have broken up" became a trending topic. Behind this lies partly the homogeneity of features, and partly the fact that the experience never progressed from an assistant or NPC role to a companion that truly provides emotional support. In the social domain, this requires enriching the ways and scenarios of human-computer interaction so that AI can participate fully in every social link, engage in deep emotional exchange with users, and deliver emotional value.

This may well be one of the core competitive battlegrounds in AI social networking, which makes it easy to understand why Soul, an application-layer company, attaches such importance to accumulating self-developed technology. Over the past period it has, on the one hand, worked to build personalized, human-like, and diverse AI capabilities; on the other, it has accelerated the rollout of AI-native applications across multiple dimensions, including better social experiences, AI social networking, and AI games, forming a complete AI product chain and bringing the fun of AI interaction to users across social scenarios.

Over recent years, Soul has incubated a series of products on top of its self-developed language and voice models, accumulating rich technical innovation and practical experience in improving the emotional interaction between AI and users. This groundwork paved the way for its first place in the MER24 challenge and gives it the footing to compete and exchange with top international teams.

Such challenges are also becoming more common, from the NTIRE 2024 AIGC Quality Assessment Challenge at the CVPR 2024 workshop to the MER challenges held in both 2023 and 2024. Domestic companies have repeatedly done well on the strength of technology honed in practice: SenseTime, which took first place in MER23 last year, and Soul, which took first place this year, have both achieved notable results by investing in AIGC technology and applications.

It is foreseeable that platforms like Soul, which insist on technological and product innovation, will keep creating value for users as they unlock AI's capabilities, and in doing so will realize more lasting and diversified business value on top of a thriving content and community ecosystem.