Deepfake Research: How Easy Is It to Scam Individuals?
According to data aggregated by the Software Engineering Institute of Carnegie Mellon University, there was a “nearly five-fold increase” in deepfake incidents from 2022 to 2023. That far outstrips the recorded 32% increase in reported overall artificial intelligence (AI) controversies over the same period, suggesting that deepfakes are spreading faster than other AI technologies.
This trend is highly concerning, especially given the rise in cases of misuse of this technology. In February 2024, for example, a cybercriminal used live deepfake technology to pose as the chief financial officer of a company in Hong Kong and tricked a finance worker into paying out $25 million.
Worryingly, the technology behind face and voice cloning is becoming increasingly accessible. In September 2024, BBC News covered a lawsuit filed by two voice actors whose voices had been cloned without permission and replicated for sale and broadcast. A month later, Microsoft announced a new feature on Teams, allowing real-time AI translation that mimics user voices.
With generative AI becoming ever more pervasive in the tech space, we at WizCase sought to explore the complexity behind real-time face and voice cloning. Specifically, we wanted to learn how easy it is to create live deepfake videos and hopefully understand this technology’s potential societal impacts.
Our Experiment: Creating and Testing Deepfakes
To find out the practical requirements and complications of generating deepfakes, we ran an experiment using real-time cloning tools to impersonate a test subject. We then sampled our live deepfake on unsuspecting colleagues to see if they would spot anything wrong and what, if any, red flags they picked up on. Here’s how it went.
Software and Hardware Used
We set everything up on a Windows-based PC. Below is an overview of the tools and software used to carry out the experiment:
- Deep Live Cam: This program was used to clone the subject’s likeness in real time using only their LinkedIn profile photo. It requires a moderately powerful GPU and Python programming knowledge, but it’s generally easy to install and run for users with coding experience. The software became the #1 trending repository on GitHub in August 2024.
- RVC (Retrieval-based-Voice-Conversion-WebUI): This is most commonly used as a voice cloner for songs, but it doesn’t work in real time. Instead, it requires an audio track file, which the program then converts to the desired voice. So, we used this framework to train the model that was later used to clone the test subject’s voice in real time.
- Voice Changer (by w-okada): This program was used alongside Deep Live Cam to clone the subject’s voice in real time, using the model trained using RVC.
- Alacritty: We used this Rust-based terminal emulator to run the commands needed to launch Deep Live Cam, RVC, and Voice Changer.
- Python 3.12 and 3.10.15: We used Python 3.12 to run Deep Live Cam and Voice Changer, while Python 3.10 was used to run RVC.
- OBS (Open Broadcaster Software): We used this open-source recording and streaming app to input the face-cloning results from Deep Live Cam into Google Meet.
- VB-Audio: This software simulates having an audio input cable (microphone) connected to the computer, allowing us to send the output from the Voice Changer to this virtual cable.
- Voice Memos: Our subject used Voice Memos on an iMac to record themselves reciting the Harvard sentences over 15 minutes of audio. The test sentences were used to train the working model.
- Audio splitter: The model-training process requires audio files to be no longer than 10 seconds each, so we used a generic audio splitter to divide the recording into 140 .wav files, separating each complete sentence.
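The splitting step above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the tool we used: it cuts at fixed intervals, whereas our recording was split on complete sentences, and the file paths here are hypothetical.

```python
import os
import wave

def split_wav(src_path, out_dir, max_seconds=10):
    """Split a long .wav recording into chunks no longer than max_seconds.

    Note: this cuts at fixed time intervals; splitting on complete
    sentences (as we did) additionally requires silence detection or
    manual markers.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with wave.open(src_path, "rb") as src:
        frames_per_chunk = src.getframerate() * max_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of recording
            out_path = os.path.join(out_dir, f"sample_{index:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                # Preserve the source's channel count, sample width, and rate
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunk_paths.append(out_path)
            index += 1
    return chunk_paths
```

Running this on a 15-minute recording with the default 10-second cap would yield roughly 90 fixed-length files; sentence-based splitting produced 140 in our case because sentences vary in length.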
Generating the Deepfake Models
Installing and running the face-swapping software went smoothly for the most part. The program was downloaded from GitHub and required Python 3.12 to run in the version we used. We used Alacritty to run the command to activate the program (the default Windows terminal should also work).
Below is a breakdown of how we used Deep Live Cam once it had been successfully activated:
The preview generated by the program replaced the tester’s likeness with the subject’s while responding seamlessly to the tester’s movements. Our tester scored the generated likeness 9 out of 10, even though the source image was merely downloaded from LinkedIn. A better-quality image, a higher-resolution camera (for recording), and more advanced hardware would produce even better results.
Python Errors Encountered With Deep Live Cam
Python’s complexity and modularity became a significant issue, which we believe casual or inexperienced users might also find to be the biggest hurdle. Python errors can be difficult to decode from the console or terminal, so it often takes longer to find and resolve problems.
Moreover, modules and dependencies are usually loaded in sequence. As such, if one module fails, other subsequent errors aren’t flagged until the first one is resolved.
Several libraries were missing when we first tried to run Deep Live Cam, and we had to decipher the errors manually one by one.
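The decipher-and-install loop can be roughly automated. Below is a minimal Python sketch of the pattern we worked through by hand, not the project's actual installer; the caller supplies the list of module names to check.

```python
import importlib
import subprocess
import sys

def install_missing(modules):
    """Try to import each module; pip-install any that are missing.

    Caveat: a module's import name can differ from its pip package
    name (e.g., cv2 is installed as opencv-python), so a project's
    requirements file is the safer source of truth.
    """
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ModuleNotFoundError:
            missing.append(name)
    for name in missing:
        # Install with the same interpreter that will import the module
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
    return missing
```

Even with a helper like this, errors surface one at a time because modules load in sequence, which is why resolving them manually took us so long.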
After all of the missing dependencies had been installed, the software required a separate companion program, yet Deep Live Cam still would not run properly. Thorough and time-consuming research eventually revealed that the program needed a specific patch release of Python 3.12.
We encountered one final issue after Deep Live Cam had booted. During the first few attempts, the program wasn’t using the AMD graphics card, resulting in pixelation and low frame rates. Our tester eventually traced the problem to missing execution flags in the startup command.
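For illustration, building the launch command with an explicit GPU execution provider might look like the sketch below. The `--execution-provider` flag name is an assumption based on the project's documentation, and `run.py` is a placeholder for the program's entry script; verify both against the version you install.

```python
import sys

def build_launch_command(script="run.py", provider="cuda"):
    """Assemble the terminal command to launch the face-swapping app
    with an explicit GPU execution provider.

    Both the script name and the flag are assumptions drawn from the
    project's docs, not verified constants; omitting the provider flag
    was what caused our CPU-only, low-frame-rate runs.
    """
    return [sys.executable, script, "--execution-provider", provider]
```

On an AMD card, a DirectML-style provider would be passed instead of `cuda`, subject to what the installed version actually supports.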
Overall, these issues took several hours to fully resolve.
Cloning the test subject’s voice turned out to be more challenging. While finding the necessary software and learning how to use it was not a major issue, executing the process took trial and error.
We first downloaded RVC to train the model for voice-cloning. RVC has several versions, each catering to different types of hardware (for NVIDIA and for AMD or Intel).
The necessary program only provided binaries for execution, which meant that additional dependencies had to be manually installed using a package manager in the terminal. After installing all the necessary packages, we started RVC using a terminal command, which launched RVC through a web app.
We then went to the “Train” tab of the program to input the sample recordings we had and train the model. Below, we detailed how we used RVC for model training:
When the program finished running, it generated two files: one with a .pth extension and another with a .index extension. These files contained the parameters needed for the real-time voice-cloning.
Python Errors Encountered With RVC
During RVC installation and execution, the program requested dependencies that we had already installed while resolving the Deep Live Cam issues. After a thorough investigation, we found that RVC required not only a specific version of Python but also specific versions of every dependency.
However, those installations failed, either because a different version of each module was already present or because the required versions weren’t available for Python 3.12.
To solve these issues, we created a local Python environment to run the program’s requirements in isolation. We used pyenv, which let the necessary versions run in an isolated virtual environment.
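A simple guard like the following can surface version mismatches early, before they show up as cryptic dependency errors. The required version here reflects our setup (RVC ran under Python 3.10); treat it as an assumption for any other tool.

```python
import sys

def check_python_version(required=(3, 10)):
    """Fail fast with a clear message if the interpreter version is wrong.

    The (3, 10) default matches what RVC needed in our setup; adjust it
    for the tool at hand.
    """
    if sys.version_info[:2] != required:
        raise SystemExit(
            f"This tool needs Python {required[0]}.{required[1]}, but you "
            f"are running {sys.version_info.major}.{sys.version_info.minor}. "
            "Consider pyenv or a dedicated virtual environment."
        )
```

Calling this at the top of a launch script turns hours of dependency archaeology into a one-line diagnosis.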
Hardware Issues Encountered With RVC
After resolving all issues with Python, we found that we still couldn’t use RVC because it required an NVIDIA graphics card with CUDA cores for model training. The error message read as follows:
We resolved the problem by enlisting the help of our systems engineer, who had an NVIDIA RTX 2070 and an RTX 4080. The sample audio files were sent to them, and they sent back the .pth and .index files after two days.
We must note that we found forums where some users claimed to have run RVC on an AMD graphics card. However, this process would’ve been much more complicated and time-consuming, especially for casual users, or would have required a specific AMD graphics card model.
Overall, the issues with voice training and cloning took around a week to address.
After successfully training the model with RVC, we were ready to run the Voice Changer program. It activates from the terminal and functions as a web app that operates locally.
Upon successfully activating the program, we found preloaded models that were primarily based on anime characters’ voices. To use our custom model for cloning the test subject’s voice, we followed the process outlined below:
Real-Time Cloning and Testing
We prepared the test by installing OBS and VB-Audio and setting them up as a virtual camera and mic to transmit the cloned live video and voice to Google Meet. We followed the process below:
We conducted our real-time deepfake experiment on two unknowing colleagues. Our first target (a new team member) had never met the test subject, while the second had worked with the subject for a few months and had previously joined several video meetings with them.
Two separate Google Meet video calls were arranged. To simulate a scenario wherein cybercriminals use deepfake technology for fraud, we posited that the test subject had been hacked. The subject used their corporate email to send meeting invitations to both targets.
Before entering the chat room, we changed the microphone and video input settings. Instead of using the default physical microphone and camera, we selected the VB-Audio output and OBS camera as inputs.
Results and Target Feedback
We successfully deceived both targets. While the simulation was far from perfect and both targets noticed minor anomalies during the video call, they chalked the video and audio issues up to potential connection problems.
Notably, the second target was more convinced of the deception, even though they were more familiar with the test subject. This could be attributed to our tester’s lack of experience during the first test.
During the debrief, the first target noted that the impersonator’s manner of speaking wasn’t confident or convincing. After learning of the deception, they rated the video and voice impersonation 4 and 3 out of 10, respectively. Even so, they didn’t raise any flags during the call and dismissed their misgivings.
On the other hand, the second target rated the video 9 and the voice 4 out of 10. They found the cloned appearance completely believable and only doubted the tester’s accent.
The target trusted that the issues they noticed were only caused by poor internet connectivity. They followed through on the fake assignment given by our imitator despite the task being outside their job description and very different from the team’s usual projects.
We believe that the shortcomings in our deepfake cloning test could easily be resolved with more time to practice the impersonation — mastering the subject’s gestures, speech patterns, and vocabulary, for example. More advanced hardware might also be able to produce real-time simulations with much higher resolutions and quality, making the deception even more convincing.
Experiment Conclusion and Insights
From the results of our experiment, we gather that organized cybercriminals would likely find it very easy to scam unsuspecting and unprepared individuals.
We completed the experiment over the course of three weeks, with one team member working on the bulk of the technical aspects. While our tester had a moderate understanding of programming principles, they didn’t have an extensive background in computer science or coding. As such, they needed to conduct thorough research and, at one point, required the assistance of our systems engineer.
The process would likely be much faster and easier for people with greater expertise in programming or computer science, especially those with coordinated teams.
While the programs used can’t run on the most basic hardware, such as typical Apple machines or out-of-the-box Windows PCs, they’re perfectly usable on systems with above-average builds. Specifically, our team believes that people involved in the following fields are likely to have the necessary hardware to run the software:
- Software engineers and developers (advanced programming)
- Data scientists (intensive data processing)
- Videographers and animators (demanding video rendering)
- Gamers (low-latency and high-resolution gaming)
In terms of technical know-how, we believe that people comfortable using a command-line interface (CLI) can learn how to operate the deepfake programs with relative ease. Individuals familiar with Linux-based systems, for instance, might already have ample experience with terminals such as GNOME Terminal, giving them a decent starting point for working with Python.
Overall, we can conclude that anyone with the time and patience to learn basic Python coding, along with above-average hardware that can handle intensive processing, is very capable of creating deepfakes independently.
For clarifications or inquiries about our methodology, experiment results, or further analyses, please don’t hesitate to contact us here.
Other Popular Deepfake Tools and Software
We centered our experiment on the most popular face- and voice-swapping tools available for public use, but several other platforms are making waves in the field. While most don’t offer real-time cloning, they can still be exploited for unethical or criminal activities.
Other widely used face-altering tools include DeepFaceLab, FaceSwap, SimSwap, and Avatarify. These apps have varying levels of realism and accessibility to casual users, which we classified according to the following criteria:
| Rating | Realism | Accessibility |
| --- | --- | --- |
| High | Videos with little to no perceptible artifacts that could raise suspicion | Can easily be used by people with little technical expertise |
| Moderate | Fairly believable videos with minor anomalies (e.g., mouth movements that are noticeably inconsistent with speech) | Can be learned by users with average expertise, a decent hardware setup, and time to learn basic programming |
| Low | Videos that are obviously doctored | Designed for advanced users or developers with a firm understanding of machine learning frameworks |
We included Deep Live Cam, the program we used in the experiment, for comparison in the table below:
| Face-Altering Technology | Realism | Accessibility | Real-Time Conversion |
| --- | --- | --- | --- |
| Deep Live Cam | High | Moderate | Yes |
| DeepFaceLab | Moderate | Low | No |
| FaceSwap | Moderate | Low | No |
| SimSwap | Moderate | Moderate | No |
| Avatarify | Low | High | No |
Similarly, a number of other open-source projects are publicly available for deepfake voice-altering purposes. The most popular include Real-Time-Voice-Cloning, which generates synthetic speech from an audio source, and So-VITS-SVC (Soft Voice Conversion VITS Singing Voice Conversion). The latter is primarily used to convert the voice in singing audio while preserving pitch and intonation.
Notably, most voice-altering tools have low accessibility, as they typically require advanced hardware and Python programming knowledge.
What Threats Do Deepfakes Pose?
Fraud, financial and identity theft, and non-consensual media have been the biggest ethical and privacy concerns since the emergence and rapid rise of deepfakes. With the development of tools like Deep Live Cam, the threats have only increased. For now, such tools are only accessible to people with advanced hardware and intermediate technical knowledge, but that can soon change.
With voice cloning, users can skip the most time-intensive and complicated step, model training, because some websites offer it as a service. They only need to send recordings of the person they want to clone, and these sites will run the data through RVC (or similar software). With models trained by a third party, apps like Voice Changer can be used directly and without much difficulty.
This poses even more significant privacy issues because there’s no guarantee that data submitted to such sites are protected by data privacy and use laws.
A disclaimer on the GitHub page of Deep Live Cam acknowledged the software’s “potential for unethical applications.” The program purportedly has a mechanism to block graphic content and nudity. Still, this covers only a narrow range of threats. For now, at least, there’s no way to automatically flag the use of the software for misinformation, financial crimes, or other fraudulent activities.
The developers have also stated their willingness to “shut down the project or add watermarks” if required by the law. In the meantime, however, users with the necessary hardware and technical know-how can continue to use the software for any purpose they choose. Moreover, being open-source, the program could easily be duplicated, updated, and redistributed by anyone with sufficient programming skills.
What’s in Store for Deepfake Technology?
Our experiment focused on real-time or live cloning, which has a wide range of malicious potential. Remarkably, we found it moderately easy to accomplish. Pre-recorded deepfakes are even easier to produce, especially as rapidly evolving tools become publicly available. As a result, the line between fact and fabrication may continue to blur in the coming years.
While deepfake detection tools are continuously being designed and improved, there’s no telling whether they can keep pace with deepfake creation software. As such, users need to remain vigilant about the media they consume, as well as the way they interact with friends, family, and colleagues online.
People should be wary of any anomalies in speech, movement, or message when talking to somebody over voice or video call. Never dismiss doubts, and always make an effort to confirm the identity of whom you’re speaking with.