Abstract: Deepfake media represents an important and growing threat not only to computing systems but to society at large. Datasets of image, video, and voice deepfakes are being created to assist researchers in building strong defenses against these emerging threats. However, despite the growing number of datasets and the relative diversity of their samples, little guidance exists to help researchers select datasets and then meaningfully contrast their results against prior efforts. To assist in this process, this paper presents the first systematization of deepfake media. Using traditional anomaly detection datasets as a baseline, we characterize the metrics, generation techniques, and class distributions of existing datasets. Through this process, we discover significant problems impacting the comparability of systems using these datasets, including unaccounted-for heavy class imbalance and reliance upon limited metrics. These observations have a potentially profound impact should such systems be transitioned to practice - as an example, we demonstrate that the widely-viewed best detector applied to a typical call center scenario would result in only 1 out of 333 flagged results being a true positive. To improve reproducibility and future comparisons, we provide a template for reporting results in this space and advocate for the release of model score files such that a wider range of statistics can easily be found and/or calculated. Through this, and our recommendations for improving dataset construction, we provide important steps to move this community forward.
This is a companion website for our paper -- the goal is to provide additional information to help digest the information in the paper, and to provide reproducibility steps for other researchers. (Paper and Presentation Link)
The base-rate fallacy is a probabilistic detection phenomenon that arises when the base-rate of incidence is not taken into account when building a detector. Failure to consider the base-rate results in a misinterpretation of performance, generally with that performance being overstated.
Stemming from Bayesian statistics, the relationship between a specific conditional probability and its inverse gives Bayes' theorem:

P(I|A) = \frac{P(A|I) P(I)}{P(A)}

Where I is an anomaly, A is an alarm, P(I) is the probability of an anomaly, P(A) is the probability of an alarm, P(I|A) is the probability of an anomaly given an alarm, and P(A|I) is the probability of an alarm given an anomaly.
Rewriting Bayes' theorem for two classes gives us the Bayesian Detection Rate (BDR):

BDR = P(I|A) = \frac{P(I) P(A|I)}{P(I) P(A|I) + P(\neg I) P(A|\neg I)}
A real-world example of the base-rate fallacy helps highlight this often-missed issue. Consider an anomaly detection system with a measured true detection rate of 99.9% and a false positive rate of 1% (0.01). The anomalies, meanwhile, have a proven base-rate of 1 in 100,000 (i.e., the number of benign events is substantially larger than the number of anomaly events).
If the detector reports an anomaly, we calculate the probability of this result as follows:
Let A represent an alarm and \neg A represent the inverse, no alarm. Similarly, let I represent anomaly and \neg I represent benign.
We can express this example as a set of probabilities:

P(I) = 0.00001, \quad P(\neg I) = 0.99999, \quad P(A|I) = 0.999, \quad P(A|\neg I) = 0.01
The goal, then, is to find P(I|A) using Bayes' theorem:

P(I|A) = \frac{P(I) P(A|I)}{P(I) P(A|I) + P(\neg I) P(A|\neg I)} = \frac{0.00001 \times 0.999}{0.00001 \times 0.999 + 0.99999 \times 0.01} \approx 0.001
The BDR, or probability of an anomaly given an alarm, is the inverse conditional of the true detection rate, the probability of an alarm given an anomaly. From Eq. 2, BDR = P(I|A) = 0.001 and TPR = P(A|I) = 0.999. As a result, even though the detector catches 99.9% of anomalies, there is only a 1 in 1,000 chance that an alarm truly indicates an anomaly. This is due to the large population size imbalance between the two classes (anomaly and benign).
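The arithmetic above can be checked in a few lines of Python. The function is simply the two-class BDR equation; note that reproducing the 1-in-1,000 figure requires a false alarm rate of 0.01 (i.e., 1%).

```python
def bdr(base_rate: float, tpr: float, fpr: float) -> float:
    """Bayesian Detection Rate: P(I|A) for two classes."""
    p_alarm_on_anomaly = base_rate * tpr          # P(I) * P(A|I)
    p_alarm_on_benign = (1.0 - base_rate) * fpr   # P(~I) * P(A|~I)
    return p_alarm_on_anomaly / (p_alarm_on_anomaly + p_alarm_on_benign)

# Base-rate of 1 in 100,000, TPR of 99.9%, false alarm rate of 1%:
print(bdr(base_rate=1e-5, tpr=0.999, fpr=0.01))  # ~0.000998, about 1 in 1,000
```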
The BDR gives the fraction of incidents flagged as positive that are actually positive. Figure 7 is model-agnostic and shows that a reasonable BDR requires an exceedingly small false positive rate, as the number of false positives otherwise rapidly dwarfs the number of true positives. For instance, achieving a BDR of 80% at a base-rate of 0.1% and a true positive detection rate of 100% requires a false alarm rate of roughly 0.025%. In this scenario, a detector is only allowed to misclassify 1 in 4,000 instances of the negative class. We observe that despite assuming a perfect true detection rate, the detector must demonstrate a near-zero false positive rate to achieve a high BDR.
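The false-alarm budget can be derived by rearranging the BDR equation for the false positive rate; the sketch below is just that algebra (the helper name is ours), recomputed for a base-rate of 0.1% and a perfect true detection rate.

```python
def max_fpr(base_rate: float, tpr: float, target_bdr: float) -> float:
    """Largest false alarm rate P(A|~I) that still achieves target_bdr.

    Solves target_bdr = b*t / (b*t + (1-b)*f) for f.
    """
    return (base_rate * tpr * (1.0 - target_bdr)) / (target_bdr * (1.0 - base_rate))

# Base-rate of 0.1%, perfect true detection rate, target BDR of 80%:
f = max_fpr(base_rate=0.001, tpr=1.0, target_bdr=0.8)
print(f)  # ~0.00025, i.e. misclassify at most about 1 in 4,000 negatives
```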
Modeling the distribution of sample lengths, the number of speakers, the number of utterances per speaker, and the ratio of female to male speakers allows us to accurately sample from an additional corpus and match the distribution of ASV_trn. We collected real speech samples from the widely-used LibriSpeech train-clean-360 corpus and subsampled the utterances to match the distribution of utterances per speaker. We sample and segment utterances from LibriSpeech based on the ASV_trn real-speech mean, standard deviation, minimum, and maximum durations. Each successive class distribution is additive and builds upon the previous class distribution's samples. For a more concrete example of what a new training distribution looks like, let us examine the D_75F/25R training set. ASV_trn contains 22,800 deepfake samples and 2,580 real samples.
To achieve the desired 75/25 ratio we calculate:

22,800 \times \frac{25}{75} = 7,600 total real samples required; \quad 7,600 - 2,580 = 5,020 samples to add.
Thus, we need to sample 5,020 real-speech samples from Librispeech. ASV_trn specifies 200 utterances per speaker and a 6:4 female-to-male ratio. This then requires the 5,020 LibriSpeech samples to come from 25 speakers and a split of 15 females and 10 males.
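The sampling arithmetic above generalizes to any target ratio. A sketch: only the ASV_trn counts (22,800 fakes, 2,580 reals), the 200 utterances per speaker, and the 6:4 female-to-male ratio come from the text; the helper names are ours.

```python
def reals_needed(n_fake: int, n_real: int, real_frac: float) -> int:
    """Real samples to add so reals make up real_frac of the final set."""
    target_real = round(n_fake * real_frac / (1.0 - real_frac))
    return max(0, target_real - n_real)

# D_75F/25R against ASV_trn: 22,800 fakes fixed, 2,580 reals on hand.
extra = reals_needed(n_fake=22_800, n_real=2_580, real_frac=0.25)
print(extra)  # 5020 LibriSpeech samples to draw

# ASV_trn specifies 200 utterances per speaker and a 6:4 female-to-male split.
speakers = extra // 200          # 25 speakers
females = round(speakers * 0.6)  # 15 female
males = speakers - females       # 10 male
print(speakers, females, males)  # 25 15 10
```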
We follow a similar strategy to the ASVspoof augmentation in the previous section, with a few modifications. First, CFAD does not specify the number of utterances per speaker, nor the split of male/female speakers present in the final dataset. Because of this, we only match the length distribution of the real samples from CFAD_trn when subsampling from WeNetSpeech. As the WeNetSpeech samples are substantially longer than the CFAD samples, we first apply Voice Activity Detection (VAD) based on WebRTC VAD to the samples to create independent speech-only subsamples. We then model the length distribution of CFAD_trn and randomly select samples from our VAD-processed WeNetSpeech set that are within ±10% of the duration of each CFAD sample. As an example, D_25F/75R starts with 12,800 real samples and 25,600 fake samples. To match the required distribution we add 64,000 real samples from WeNetSpeech.
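The ±10% duration matching can be sketched as a greedy draw without replacement. The durations, helper name, and selection order below are our assumptions; only the ±10% tolerance comes from the text.

```python
import random

def match_lengths(target_durations, pool_durations, tol=0.10, seed=123):
    """For each target duration, draw one unused pool sample within +/-tol of it."""
    rng = random.Random(seed)
    pool = list(pool_durations)
    picks = []
    for d in target_durations:
        candidates = [s for s in pool if abs(s - d) <= tol * d]
        if not candidates:
            continue  # no pool sample close enough to this duration
        choice = rng.choice(candidates)
        pool.remove(choice)  # sample without replacement
        picks.append(choice)
    return picks

# Toy example: CFAD-like target durations (seconds) vs. VAD subsample durations.
print(match_lengths([3.0, 5.0], [2.8, 4.6, 10.0]))  # [2.8, 4.6]
```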
The following figures correspond to Tests 1 & 2 from Section 3.2.3 (Class Distribution in Datasets) in the paper. The figures presented here are for the other models we tested but did not have room to include in the paper. The trends highlighted in the paper hold for all the additional models from ASVspoof 2021 and CFAD.
A sample metric reporting template, filled in with all the measured values that we calculated for all models (ASVspoof and CFAD).
We employ the same methodology (outlined in Section 3.2 of the paper) against the CIFAKE dataset to test whether our hypothesis holds across dataset domains. CIFAKE does not meet all of our criteria defined in Section 3; however, it is the dataset that comes closest to meeting them. Specifically, CIFAKE is publicly available, contains a well-defined train/test split, has suggested metrics, and has a loosely defined baseline model. The loosely defined baseline model is the issue with this dataset, and in order to perform these experiments we make assumptions about the values of the missing training hyper-parameters (i.e., we take “standard” values).
We retrain the CIFAKE model using CIFAKE_trn with the given, or assumed, parameters. Using the trained models, we determine per-class probabilities for predictions against CIFAKE_eval. We evaluate the reproducibility of the baseline model based on the metrics provided and whether our measured retraining metrics meet the reported values of the baseline model.
We show the results of retraining the baseline model, along with the reported results from that model, in Table 5. Our retraining is similar to the reported metrics: the reported Accuracy, Recall, and F1-Score for the baseline model and our measured values are all within 3.2% (relative). However, as we were forced to make substantial assumptions about the hyper-parameters for the CIFAKE baseline model, this raises an interesting reproducibility question. Most likely, the parameters used in training the reported CIFAKE baseline model are indeed the “standard” values, which is consistent with the small relative difference between measured and reported values.
Using the provided baseline model, descriptions from the CIFAKE paper, and standard hyper-parameters we create the following model architecture:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses

# Load the training images; train_dir points at the CIFAKE training split on disk.
train_ds = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    seed=123,
    image_size=(32, 32),
    batch_size=32,
    label_mode='binary')

# Baseline CNN: two Conv/MaxPool stages followed by a small dense head.
model = keras.Sequential()
model.add(layers.Rescaling(1./255))  # scale pixel values to [0, 1]
model.add(layers.Conv2D(filters=32, kernel_size=3, activation='relu'))
model.add(layers.MaxPool2D())
model.add(layers.Conv2D(filters=32, kernel_size=3, activation='relu'))
model.add(layers.MaxPool2D())
model.add(layers.Flatten())
model.add(layers.Dense(units=64, activation='relu'))
model.add(layers.Dense(units=1, activation='sigmoid'))  # binary real/fake output

model.compile(loss=losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy',
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])

model.fit(train_ds, epochs=100, verbose=1)
Regardless of the similar results, the lack of explicit steps leaves many questions. CIFAKE faces reproducibility issues like ASVspoof and CFAD, just in a different direction.
As CIFAKE contains only a single baseline model, we cannot compare the efficacy of reported metrics across models, nullifying this research question for this dataset.
CIFAKE_eval contains 50% deepfakes and is thus perfectly balanced and provides a good comparison to the speech deepfake datasets. To understand any biases, we retrain the baseline CIFAKE model with varying class distributions. We test the following class distributions: D_90F/10R, D_75F/25R, D_50F/50R, and D_25F/75R -- the same parameters and epochs are unaltered for each model and training class-distribution combination. We then evaluate each model against the default CIFAKE_eval set. We augment the number of real samples in CIFAKE_trn with images from the CIFAR-100 dataset to build the different distribution experiment sets. As CIFAKE directly uses the CIFAR-10 images and the CIFAR-100 images are gathered and processed by the same group, this dataset is the perfect fit for augmentation. There are no overlapping samples in the two datasets.
We show that, similar to the speech deepfake datasets, as the training set class distribution skews towards a specific class (i.e., a move left or right on the x-axis) the predictive ability on that specific class increases, while the predictive ability on the opposite class decreases. To be precise, a model is biased towards the class with the greatest sample distribution density.
To test bias in the training set of CIFAKE, we explore a scenario where there are no deepfake samples in an evaluation set. Similar to ASVspoof and CFAD, we collect x samples from STL-10. STL-10 is a publicly available image corpus modeled on the CIFAR-10 dataset and designed for image classification, and thus fits well as a real-only test dataset.
We show the false positives and true negatives for the CIFAKE baseline model with the 4 different training distributions. This demonstrates the overwhelming number of false positives for D_90F/10R in an all-real evaluation set; moving to D_25F/75R dramatically reduces false positive predictions. Adding a minor amount of the underrepresented class to the training data reduces bias and follows the trends shown in Section 3.2.3 for the speech deepfake datasets. These trends, while present, are much less pronounced with this baseline model compared to ASVspoof and CFAD.
To showcase how the BDR helps evaluate the performance of a model, we examine the CIFAKE baseline model against the real-only test set as the training class distribution changes. The correct classification of true negatives increases from 4.4% to 44.5% and the false positive classifications decrease from 95.6% to 55.5%, a relative false positive decrease of roughly 42%. While these results are an improvement, they do not contextualize the output of the detector within the scope of any base-rate. We plot the training class distribution for the CIFAKE baseline model against a range of base-rates in Figure 13 and show that all CIFAKE training distributions are grouped at the far right. This shows that even when training with primarily real samples, as in D_25F/75R, base-rates have a substantial impact on model performance. The same trends outlined in Section 3.2.4 of the paper for the speech deepfake datasets hold in the deepfake image domain; however, the trends are more impactful on the CIFAKE baseline model.
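The BDR calculation from the base-rate discussion contextualizes these numbers. In the sketch below, the 55.5% false positive rate is the D_25F/75R figure from the text; treating the true detection rate as perfect is an assumption on our part (the all-real set measures no true positives), so the resulting BDRs are a generous upper bound.

```python
def bdr(base_rate: float, tpr: float, fpr: float) -> float:
    """Bayesian Detection Rate: P(I|A) for two classes."""
    return (base_rate * tpr) / (base_rate * tpr + (1.0 - base_rate) * fpr)

# D_25F/75R false positive rate of 55.5% with an assumed-perfect TPR:
for base_rate in (0.5, 0.1, 0.01, 0.001):
    print(f"base-rate {base_rate:>5}: BDR = {bdr(base_rate, 1.0, 0.555):.4f}")
```

Even under this optimistic assumption, the BDR collapses as the base-rate shrinks, matching the grouping at the far right of Figure 13.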
For deepfake images, we gather ArtiFact, CIFAKE, DFFD, r/Fakeddit, CT-GAN, ForgeryNet, FakeSpotter, Two-Stream, and Head Poses.
For deepfake videos, we gather UADFV, Deepfake-TIMIT, DFDC, FaceForensics++, DeeperForensics, Celeb-DF, WildDeepfake, KoDF, FakeAVCeleb, LAV-DF, ForgeryNet, DeepFake MNIST+, DeePhy, and DFFD.
For deepfake speech, we gather ASVspoof 2015, ASVspoof 2017, ASVspoof 2019, ASVspoof 2021, ADD 2022, ADD 2023, SASV 2022, Fake or Real, WaveFake, FakeAVCeleb, In The Wild, CFAD, Half-Truth, FMFCC-A, VSDC, ReMASC, RedDots: Redeployed, BTAS 2016, Baidu, H-Voice, VCC 2016, VCC 2018, and VCC 2020.
For Network IDS, we gather TUIDS, TRAbID, DARPA1998, UNSW-NB15, CSECICIDS2018, AWID, Botnet, CIDDS-002, Empirical Botnet, DDoS, Protocol Profiles, Kyoto 2006+, PU-IDS, SSENet-2014, and PUF.