"Her" gets a face! Video-call an AI with almost no delay, backed by Sequoia and YC

2024-08-16


The fastest conversational video AI in history is here, with a delay of less than one second.

It is end-to-end: it can listen, see, speak, and show a face.



This product does not come from a company that has already made a name for itself, such as OpenAI or HeyGen, and it does not have a specific name.

Because it comes from the startup team Tavus, it is also called Conversational Replicas by Tavus.

The main function is to build an immersive AI-generated video experience.

After launching today, it has already topped Product Hunt's list of the day's new products, and its upvote count is still rising.



Tavus officially summarized the product features for everyone:

  • Latency less than one second
  • Realistic and intelligent digital twin
  • Plug-and-play end-to-end building blocks
  • Modular, customizable components, such as the LLM and speech synthesis

Netizens were excited:

Okay, now there's "someone" to sit through Zoom video meetings for me, hahaha!



Many netizens also regard this as a better human-computer interaction interface than reading documents or chatting:

This conversational video interface is a game changer!
I can already imagine the endless possibilities for immersive experiences.



You can try it for 2 minutes on the web page

After seeing this message, QuantumBit rushed to Tavus' official website in one second.

On the official website, you can experience this "fastest conversation video in history" online for 2 minutes.

In the default setup, the person you talk to during the trial is Carter, a character created by Tavus.

Carter, portrayed as an employee of AI video research company Tavus, responded with humor while being helpful.

This is the man below:



Although Carter is a virtual character, video chatting with him is just like video chatting with your own friends.

Tavus advises staying in a quiet room when chatting with Carter, after granting camera and microphone access.

Carter mentioned in the conversation that, besides asking him about the AI technology used by Tavus, people most like to share their daily experiences with him and tell jokes.

He told a joke on the spot:

Question: why can't a bicycle stand up on its own?
Answer: because it is two-tired (too tired).

After delivering the punchline, Carter played along with his own joke and laughed twice.



QuantumBit also actually experienced it for 2 minutes, and the overall feeling is as follows:

First, Tavus's response speed is really fast, in line with the official claim of "less than one second".

Even if you suddenly speak while he is talking, Carter will immediately stop and listen to your latest statement.

Second, although the official website claims support for more than 30 languages, Carter was unable to speak Chinese, whether we asked in Chinese or in English.

When we asked him "Can you speak Chinese?", Carter answered: "I prefer to speak English!"



Third, Tavus's AI can indeed "see with its eyes".

During the trial, QuantumBit felt awkward for a moment, did not know what to ask, and could only smile foolishly.

Carter immediately spoke:

Oh! You smiled at me~



Fourth, in the trial version, Carter's lip movements and the words he speaks are almost perfectly synchronized.

It is no wonder that some netizens said after trying it:

It's really impressive, with fast response times and excellent video and audio production capabilities.



Now, you can use Tavus’ conversational video AI by just registering.

In the official version, Carter isn't the only AI character you can talk to. There are both men and women, with roles ranging from sales to life coaching.

The background of the chat can also be changed according to the user's choice, and is not limited to the office scene.



At the same time, you can manually enter context for the conversation.

It can be said that the degree of personalized customization is very high.



There are currently free and paid versions, each with different generation permissions.



Built on in-house models

Behind the Tavus conversational video AI is the Phoenix-2 model developed by the Tavus team.

This is a combination of 3D models and 2D GANs driven by audio and text, which can generate realistic short videos of 1-2 minutes.

The generation process is roughly divided into the following four steps:

TTS (Text-to-Speech) – 3D reconstruction of head and shoulders – Cue-word script-driven facial animation – High-fidelity rendering.
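The four steps above can be read as a simple staged pipeline. The sketch below only illustrates that structure: the stage names mirror the article's list, but every stage body is a hypothetical placeholder, not Tavus's code or API.

```python
import time
from typing import Any, Callable

# A stage is a (name, function) pair; each function transforms the payload.
Stage = tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: list[Stage], payload: Any) -> tuple[Any, dict[str, float]]:
    """Run stages in order and record how long each one takes (seconds)."""
    timings: dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


# Hypothetical placeholder stages: they only tag the payload with strings.
stages: list[Stage] = [
    ("tts", lambda text: {"text": text, "audio": "<synthesized speech>"}),
    ("reconstruct_3d", lambda d: {**d, "head_model": "<3D head and shoulders>"}),
    ("animate_face", lambda d: {**d, "animation": "<facial motion>"}),
    ("render", lambda d: {**d, "video": "<rendered frames>"}),
]

result, timings = run_pipeline(stages, "Hello from Carter")
```

Timing each stage separately is one way to see where a sub-second latency budget would be spent in a pipeline of this shape.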



△ Fine-tuning facial geometry details through differentiable rendering

To make the AI character that talks with users more realistic, the Tavus team built Phoenix-2's video rendering pipeline as a combination of GAN and 3D Gaussian splatting.

The reason for this is that traditional GANs are usually limited by image resolution, while volumetric models always lack temporal consistency.

So Tavus thought of combining the two.

Training GANs requires large datasets and expensive computing resources, and due to their two-dimensional nature and temporal consistency issues, the inference time and video quality are usually limited.

Tavus uses 3D models as an “intermediate”, achieving rendering speeds of over 100 FPS and a higher degree of controllability and versatility due to the physically aware constraints around dynamic objects.



△ Comparison of 2D and 3D talking-head models

In addition, the Phoenix-2 model improves on the previous generation by replacing the NeRF used in the original Phoenix model.

Instead, it uses 3D Gaussian splatting to learn how audio drives dynamic facial deformation in 3D space, and uses this information to render views from unseen audio.

Team members said that compared with NeRF, 3D Gaussian splatting performs better in terms of data, memory, computational complexity, process, and rendering efficiency.

The Phoenix-2 model pipeline based on 3D Gaussian splatting can be trained 70% faster than the original model and rendered at 60+ FPS.
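For a sense of scale, the frame rates quoted in this section translate into per-frame time budgets. The snippet below is plain arithmetic on the article's figures (100 FPS and 60 FPS), not a measurement of Tavus's system, and the function name is our own.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


print(frame_budget_ms(100))           # 10.0
print(round(frame_budget_ms(60), 1))  # 16.7
```

In other words, rendering at 60 FPS or more leaves under 17 ms per frame, a small slice of a one-second response budget.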



Tavus said that the conversation includes end-of-turn detection and interruptibility, which makes it feel more real to users.

Additionally, because facial information is extremely sensitive, the team provides safety checks, security protocols, automated content review, and anti-hallucination checks to protect information security.
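The article does not describe how Tavus implements these two behaviors. As a rough illustration only, the toy state machine below ends the user's turn after a run of silent audio frames and stops the agent the moment the user speaks over it. The class name, thresholds, and event names are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class TurnTaker:
    """Toy turn-taking logic (illustrative; not Tavus's implementation)."""

    speech_threshold: float = 0.5  # hypothetical voice-activity score cutoff
    end_silence_frames: int = 3    # hypothetical: silent frames that end a turn
    agent_speaking: bool = False
    user_speaking: bool = False
    silence_run: int = 0
    events: list = field(default_factory=list)

    def on_frame(self, vad_score: float) -> None:
        """Process one audio frame's voice-activity score (0.0 to 1.0)."""
        if vad_score >= self.speech_threshold:
            self.silence_run = 0
            if self.agent_speaking:
                # Interruption: the user talks over the agent, so it stops.
                self.agent_speaking = False
                self.events.append("agent_interrupted")
            if not self.user_speaking:
                self.user_speaking = True
                self.events.append("user_turn_start")
        elif self.user_speaking:
            self.silence_run += 1
            if self.silence_run >= self.end_silence_frames:
                # Enough silence: treat the user's turn as finished.
                self.user_speaking = False
                self.silence_run = 0
                self.agent_speaking = True
                self.events.append("user_turn_end")


tt = TurnTaker()
for score in [0.9, 0.8, 0.1, 0.1, 0.1, 0.9]:
    tt.on_frame(score)
print(tt.events)
# ['user_turn_start', 'user_turn_end', 'agent_interrupted', 'user_turn_start']
```

A real system would likely use more than a silence counter to decide that a turn has ended, but the trade-off is visible even here: a shorter silence window responds faster but risks cutting the user off mid-sentence.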



It is worth mentioning that the Phoenix series models also power another Tavus product: generating talking videos of a user's digital twin.

You only need to provide 2 minutes of footage and pay from $1 to call the API and generate video content.



According to Tavus, its end-to-end solution offers the following capabilities:

  • Use APIs to build secure and realistic digital twins or AI agents
  • Customize LLM, dialogue characters and backgrounds
  • Stream conversations in embedded meeting rooms
  • Record, transcribe and share conversations
  • Handle high traffic volumes with production-grade scalability

"If it's not under 1 second, it's not human"

The Tavus team is a small AI video startup that was founded four years ago.

Most of the members come from Amazon, Descript, Google, Apple, etc.

Public information shows that as of March this year, the company had raised a Series A round from Sequoia, Scale VC, and YC, totaling approximately US$18 million.



Tavus's co-founder and CEO is Hassaan Raza, who previously worked at Google and Apple.



The company's co-founder and COO wrote on Product Hunt that the conversational video AI took a long time to build, with research, engineering, and construction adding up to thousands of hours.

As for why the team pursues a latency of one second or less, the official answer is to simulate human-to-human video conversation as closely as possible.

After all, if the response takes longer than 1 second, then whoever is chatting with you does not seem human.
