
Revealing the secret of the AI-forged Xiao Yang recording: zero cost, and it takes only three seconds

2024-09-30


The "Lu Wenqing recording" exposed in the Xiao Yang incident first caused a public outcry over the explicitness of its content, and was then found to have been entirely forged by AI.

With that, AI technology has once again been pushed into the spotlight.

Image: official response from Yanyu Technology

Setting aside whether the technology itself is good or bad, AI-synthesized recording is in essence a kind of deepfake: deep learning models simulate and forge audio and video, splicing together people's voices, facial expressions and body movements into highly realistic fake content.

Technically speaking, it is neutral. Besides voice simulation, related methods include AI face-swapping, face synthesis and video generation, collectively known as deepfakes.

Neutral technology, however, cannot stop users with malicious intent.

Lan MediaHui consulted Lin Hongxiang, founder and CEO of Fengping Intelligence, a leading domestic AI digital-human company. Speaking frankly about this type of incident, Lin said that the efficiency gains AI brings are comprehensive, but as its applications expand, completely isolating abuses may require systematic regulation and effective enforcement.

At the industry's current technical level, a user only needs a few minutes of scattered material as AI training samples to quickly clone a complete AI voice. Pauses, emotion and intonation in the recording can then be added, removed or adjusted through technical means.

Moreover, in practice, the cost of cloning an AI voice is now low. Many applications on the market offer free entry points. Taking the model involved in this incident as an example, the Reecho model provides a free voice-cloning service; the more professional version requires an additional fee.

A clip of Boss Lu's livestream, taken from the internet, was converted into audio and imported. In just a few seconds, Boss Lu's AI voice was cloned.

We then imitated a recording from the original incident with highly exaggerated emotion and wording, imported it into the model as a script, and produced a recording of "Lu Wenqing" commenting on Musk. Done.

"Xiao Ma and the others are gone, right? I'm telling you, whoever I want to be popular can be popular, understand? I know a lot of CEOs; I praise whoever I want to praise. Don't mention Musk to me. It doesn't work, you know, it doesn't work, it doesn't work even when we drink. Who is he? Without Three Sheep, who would sell goods for him, do you understand?"

Frankly, if you have heard enough AI scam calls, or are sensitive to human voices, you can tell that the AI audio has a "machine feel": the intonation is too stable from beginning to end, which is never how people sound when they are emotionally worked up. But this was only the most basic model with its instant-cloning function. With a larger corpus and the professional cloning option, the effect becomes even more "real".
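To illustrate that "too-stable intonation" tell, here is a minimal sketch (not any vendor's actual detector, and the frame sizes and signals are arbitrary assumptions): it compares pitch variability between a monotone signal and one with a drifting pitch contour, using a crude autocorrelation pitch estimator.

```python
import numpy as np

SR = 16000    # sample rate (Hz)
FRAME = 800   # 50 ms analysis frames

def frame_pitch(frame: np.ndarray) -> float:
    """Crude autocorrelation pitch estimate (Hz) for one frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SR // 400, SR // 60          # search the 60-400 Hz range
    lag = lo + int(np.argmax(ac[lo:hi]))  # lag of strongest periodicity
    return SR / lag

def pitch_std(audio: np.ndarray) -> float:
    """Standard deviation of per-frame pitch: low = flat intonation."""
    pitches = [frame_pitch(audio[i:i + FRAME])
               for i in range(0, len(audio) - FRAME, FRAME)]
    return float(np.std(pitches))

t = np.arange(SR * 2) / SR
flat = np.sin(2 * np.pi * 150 * t)              # monotone, "machine-like"
f_contour = 150 + 40 * np.sin(2 * np.pi * 0.7 * t)
varied = np.sin(2 * np.pi * np.cumsum(f_contour) / SR)  # drifting, "human-like"

print(pitch_std(flat) < pitch_std(varied))  # True: monotone audio varies far less
```

Real detectors are far more sophisticated, but the underlying signal is the same: a voice whose pitch contour barely moves across an emotional outburst is suspicious.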

So, can AI-synthesized audio and video be checked against data, as intuitively as a lie detector, to distinguish real from fake?

At the technical level, it is feasible. Lin Hongxiang said that beyond requiring the subject's own authorization, the AI digital-human industry does have relevant standards under construction, requiring that all kinds of AI-generated content carry special, identifiable "feature marks".

This mark is not simply a "generated by XX AI" watermark in the corner. Taking AI-synthesized sound as an example, extra noise bands are added outside the frequency range of human speech, and certain characteristic bands can even be added within the audible range.

These characteristic frequencies can be identified by machine: when verification is needed, a device can extract the bands and, in theory, determine authenticity.
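A minimal sketch of the idea, with numpy: embed a faint tone above the speech band and later look for it in the spectrum. The marker frequency, amplitude and detection threshold here are arbitrary assumptions for illustration, not any industry standard (real audio watermarks are far more robust than a single sine tone).

```python
import numpy as np

SR = 44100       # sample rate (Hz)
MARK_HZ = 19000  # assumed marker frequency, well above typical speech energy

def embed_mark(audio: np.ndarray, amplitude: float = 0.01) -> np.ndarray:
    """Add a faint sine tone at MARK_HZ as a detectable 'feature band'."""
    t = np.arange(len(audio)) / SR
    return audio + amplitude * np.sin(2 * np.pi * MARK_HZ * t)

def has_mark(audio: np.ndarray, threshold: float = 5.0) -> bool:
    """Check whether energy near MARK_HZ stands out from the noise floor."""
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1 / SR)
    band_peak = spectrum[np.abs(freqs - MARK_HZ) < 50].max()
    noise_floor = np.median(spectrum) + 1e-12
    return bool(band_peak / noise_floor > threshold)

# Simulated 1-second "speech" signal (white noise stand-in)
rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(SR)
marked = embed_mark(speech)

print(has_mark(speech))  # False: no marker band present
print(has_mark(marked))  # True: marker band detected
```

The same check fails once the audio is re-encoded with a lossy codec that strips high frequencies, which is one reason robust watermarking is an active engineering problem rather than a solved one.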

At present, however, not many companies are willing to popularize this feature. The limiting factor is the cost of the extra step: although a single model run is cheap, the investment in the pre-training phase of each audio and video model, plus the cost of developing the next generation after each release, still puts great pressure on AI companies at this stage.

The AI audio and video industry is still in its early stages; how to acquire customers while covering costs during the promotion stage is a question practitioners cannot avoid.

But these are clearly not things criminals with evil intentions would consider. Whether gunpowder makes fireworks or bombs depends on how it is used.

More than half a year ago, the Hong Kong police disclosed a fraud case involving a total of HK$200 million. Employees of a multinational company's Hong Kong branch received notice from the CFO at headquarters that the company was planning a "secret transaction" and needed to transfer funds to several local Hong Kong accounts for later use.

The employees were then invited to a "multi-person video conference" initiated by headquarters and, following the meeting's instructions, transferred HK$200 million in 15 payments to 5 bank accounts.

Source: CCTV News

In fact, everyone in this video conference except the branch employees was an artificial-intelligence image synthesized by the fraudsters from public audio and video clips. Using face-swapping and voice-changing on the call, the fraud team simply became the executive team calling the shots.

In the Hong Kong case, the criminals effectively used AI face-swapping plus an AI voice changer to appear live. The forged Xiao Yang recording, by contrast, was synthesized entirely by a large model after learning from Lu Wenqing's audio materials at Three Sheep, with emotion close to a real person throughout. The process really is that simple: AI-synthesized audio and video is already a mature technology, and related products have grown into a complete industry.

That said, the mainstream use of AI-synthesized audio and video is certainly not fakery. In the plot of The Wandering Earth 2, Tu Hengyu, played by Andy Lau, resurrects Yaya as a digital life form; off screen, the late film star Ng Man-tat has also returned to the screen through AI.

Therefore, if another incident like the Xiao Yang recording occurs, before debating whether the technology is guilty or innocent, we should first try to govern the people using it.

Take care of the humans, and spare the AI.