How Researchers Measure, Detect and Benchmark AI Manipulation
#Security

Startups Reporter
6 min read

A comprehensive survey of deepfake technology covering definitions, performance metrics, datasets, challenges, and detection methods from both English and Chinese research literature.

Recent advancements in AI and machine learning have significantly increased the capability to produce realistic synthetic media, leading to the emergence of "deepfakes": manipulated or synthetic media that is difficult for humans to distinguish from authentic content. This paper provides a comprehensive overview of deepfake technology, covering definitions, performance metrics, datasets, challenges, competitions, and benchmarks from both English and Chinese research literature.

Definitions and Scope

The term "deepfake" combines "deep" (referring to deep learning) and "fake," typically referring to manipulation of existing media or generation of new synthetic media using deep learning-based approaches. However, there is no universal definition, and the boundary between deepfakes and non-deepfakes is not always clear-cut.

Some researchers adopt a broader understanding of deepfake, considering it as audio-visual manipulation using various technical sophistication levels. Legal definitions in the United States have also taken a broader approach, focusing on authenticity and impersonation rather than the specific use of deep learning technology.

For this survey, we adopt an inclusive approach covering all forms of manipulated or synthetic media considered deepfakes in a broader sense, including related topics like biometrics and multimedia forensics.

Performance Metrics and Standards

Deepfake detection is primarily a binary classification problem, and several performance metrics are used to evaluate detection systems:

Confusion Matrix

The fundamental tool for evaluating binary classifiers, showing true positives, true negatives, false positives, and false negatives.
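To make the four cells concrete, here is a minimal Python sketch of tallying them from label lists (the function name and the 1 = fake, 0 = real label convention are our own assumptions, not from the survey):

```python
def confusion_matrix(y_true, y_pred):
    """Tally the four confusion-matrix cells for binary labels (1 = fake, 0 = real)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # fakes caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # real flagged as fake
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # fakes missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # real passed as real
    return tp, fp, fn, tn

print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```

Every metric below can be derived from these four counts.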

Precision and Recall

Precision is the fraction of predicted positives that are actually positive (TP / (TP + FP)), while recall is the fraction of actually positive samples that the classifier correctly flags (TP / (TP + FN)).

True and False Positive Rates

True positive rate (TPR) is the fraction of actually positive samples predicted positive, TP / (TP + FN), which is identical to recall, while false positive rate (FPR) is the fraction of actually negative samples wrongly predicted positive, FP / (FP + TN).
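Given the confusion-matrix counts, these rates reduce to one-line formulas; a minimal sketch (function names are ours):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives correctly flagged; identical to the TPR."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Fraction of actual negatives wrongly flagged positive."""
    return fp / (fp + tn)

print(precision(8, 2), recall(8, 2), false_positive_rate(2, 8))  # 0.8 0.8 0.2
```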

Equal Error Rate (EER)

The operating point at which the FPR equals the false negative rate (FNR); the shared error value at that point is reported as the EER. It is especially common in biometric applications.
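In practice the EER is approximated by sweeping a decision threshold over the observed scores and finding where the two error rates cross. A sketch under our own assumptions (higher score means "more likely fake", and we average FPR and FNR at the closest crossing):

```python
def equal_error_rate(genuine_scores, fake_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumes higher scores mean 'more likely fake' (the positive class).
    """
    best_gap, best_eer = float("inf"), 0.0
    for thr in sorted(genuine_scores + fake_scores):
        fpr = sum(s >= thr for s in genuine_scores) / len(genuine_scores)  # real flagged fake
        fnr = sum(s < thr for s in fake_scores) / len(fake_scores)         # fakes missed
        if abs(fpr - fnr) < best_gap:
            best_gap, best_eer = abs(fpr - fnr), (fpr + fnr) / 2
    return best_eer

print(equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9]))  # 0.0
```

With perfectly separated score distributions, as in the example, the EER is zero; overlapping distributions push it toward 0.5.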

Accuracy and F-Score

Accuracy is the fraction of all samples predicted correctly, while the F-score (the harmonic mean of precision and recall) gives a single measure that balances the two.
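Both follow directly from the confusion-matrix counts; a minimal sketch (names are ours, and `f1_score` is the common F-score variant with equal weight on precision and recall):

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (the F1 variant of the F-score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(accuracy(tp=45, fp=5, fn=10, tn=40))  # 0.85
```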

Receiver Operating Characteristic (ROC) Curve

Shows how TPR and FPR change with different prediction thresholds, with the area under the curve (AUC) providing a single performance metric.
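The AUC also equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (the rank, or Mann-Whitney, interpretation), which allows a threshold-free sketch (function name is ours):

```python
def roc_auc(pos_scores, neg_scores):
    """AUC via its rank interpretation: P(random positive outranks random negative)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

print(roc_auc([0.9, 0.8], [0.1, 0.2]))  # 1.0 (perfect separation)
```

An AUC of 1.0 means perfect ranking, 0.5 is chance level, and values below 0.5 indicate a classifier that is systematically inverted.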

Log Loss

Measures the performance of classifiers that can return probability scores for predicted labels.
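Concretely, log loss is the mean negative log-likelihood of the true labels under the predicted probabilities; a sketch (the clipping constant is our own safeguard against taking log of zero):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(log_loss([1, 0], [0.5, 0.5]))  # ln(2) ~ 0.693: a maximally uncertain classifier
```

Unlike accuracy, log loss rewards well-calibrated confidence: a confidently wrong prediction is penalized far more heavily than an uncertain one.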

Perceptual Quality Assessment (PQA)

Mean opinion score (MOS) is the most widely used subjective PQA metric, calculated by averaging subjective scores given by human judges. Objective PQA metrics like mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) have also been developed.
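The objective metrics compare a processed signal against a reference; MSE and PSNR, the two simplest, can be sketched in a few lines (treating images as flat pixel sequences for brevity; the names and the 255 peak value for 8-bit images are our assumptions):

```python
import math

def mse(ref, test):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(ref, test)
    return float("inf") if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

print(mse([0, 0], [3, 4]))  # 12.5
```

SSIM is more involved (it compares local luminance, contrast, and structure statistics) and is usually taken from an image-processing library rather than written by hand.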

Standards

Several ISO/IEC standards are relevant to deepfake detection, including ISO/IEC 19795-1:2021 for biometric performance testing and ISO/IEC 30107-3:2017 for biometric presentation attack detection.

Deepfake-Related Datasets

Numerous datasets have been created for deepfake research, categorized into image, video, audio/speech, and hybrid datasets:

Deepfake Image Datasets

  • SwapMe and FaceSwap dataset: 4,310 images including 2,010 fake images
  • Fake Faces in the Wild (FFW) dataset: 131,500 face images
  • Generated.photos datasets: Up to nearly 2.7 million synthetic face images
  • MesoNet Deepfake Dataset: 19,457 face images
  • 100K-Generated-Images: 100,000 synthesized face, bedroom, car, and cat images

Deepfake Video Datasets

  • DeepfakeTIMIT: 620 deepfake face videos
  • FaceForensics++: 5,000 face videos with over 1.8 million manipulated frames
  • UADFV dataset: 98 face videos
  • Deep Fakes Dataset: 142 "in the wild" deepfake portrait videos
  • DFDC (Deepfake Detection Challenge) full dataset: 128,154 face videos

Deepfake Audio/Speech Datasets

  • Voice Conversion Challenge datasets: Released biennially since 2016
  • ASVspoof Challenge datasets: Released biennially since 2015
  • Baidu Research dataset: 134 utterances including real, cloned, and manipulated samples

Hybrid Deepfake Datasets

  • NIST OpenMFC Datasets: GAN-generated deepfake images and videos
  • ForgeryNet: 2,896,062 images and 221,247 videos for comprehensive forgery analysis

Deepfake-Related Challenges, Competitions & Benchmarks

Several initiatives aim to advance deepfake detection and generation through competitions and benchmarks:

Detection of Manipulated Media

  • Deepfake Detection Challenge (DFDC): Facebook-led competition with 2,114 participants
  • ASVspoof Challenge: Biennial competition focused on automatic speaker verification spoofing
  • Face Anti-spoofing Challenge: Annual competition on presentation attack detection
  • FaceForensics Benchmark: Ongoing automated benchmark for face manipulation detection
  • Open Media Forensics Challenge: Annual image and video forensics evaluation
  • DeeperForensics Challenge: Deepfake face detection challenge
  • Face Forgery Analysis Challenge: Competition on photo-realistic manipulation detection
  • 2020 CelebA-Spoof Face Anti-Spoofing Challenge: Challenge on face anti-spoofing
  • 2021 CSIG Challenge: Chinese challenge with fake media forensic tasks
  • 2020 China Artificial Intelligence Competition: Chinese AI competition with deepfake tasks

Generation of Manipulated Media

  • Voice Conversion Challenge: Biennial competition promoting voice conversion technology
  • Deepfake Africa Challenge: Initiative focusing on creative potential of synthetic media in African context

Generation and Detection of Manipulated Media

  • DeepFake Game Competition (DFGC): Competition promoting adversarial game between creation and detection agents

Meta-Review of Deepfake-Related Surveys

A meta-review of 12 selected deepfake-related survey papers reveals several insights:

Definitions and Scope

Most authors discussed the history of deepfakes and pointed out the combination of "deep learning" and "fake," but some used broader definitions. Many authors focused more on face images and videos, with some even limiting the definition of "deepfake" to face swapping.

Performance Metrics

Surprisingly, none of the surveys covered performance metrics explicitly, though some used them to explain and compare the performance of the methods they reviewed. The most commonly used metrics are accuracy, EER, and AUC.

Datasets

Many surveys list deepfake-related datasets, but their coverage is mostly limited compared to the comprehensive review provided in this paper.

Challenges, Competitions and Benchmarks

Coverage of challenges, competitions, and benchmarks is also mostly limited, with some major initiatives not covered in any of the surveys.

Performance Comparison

Only some surveys explicitly covered performance comparison between different methods. Due to quality issues of many deepfake-related datasets, performance comparison results should be treated with caution.

Challenges and Recommendations

Key challenges identified include developing more robust, scalable, generalizable, and explainable deepfake detection methods. Other recommendations include considering fusion of different methods, improving dataset quality, and developing active defense mechanisms.

Conclusion

The rapid growth in media manipulation capabilities has led to significant concerns about deepfakes and their potential misuse. This paper provides a comprehensive overview of deepfake technology, covering definitions, performance metrics, datasets, challenges, competitions, and benchmarks from both English and Chinese research literature.

The lack of universal definitions and standards presents challenges for the field, but ongoing efforts in competitions, benchmarks, and standardization are helping to advance the state of the art. As deepfake technology continues to evolve, it's crucial to develop more robust detection methods while also considering positive applications in entertainment, creative arts, and privacy protection.

Future research directions include improving dataset quality, developing more generalizable detection methods, and considering the social media laundering effects in training data. The field is likely to remain dynamic as both generation and detection techniques continue to advance in an ongoing arms race.
