ExDDV: A New Dataset for Explainable Deepfake Detection in Video (2025)

Vlad Hondru1, Eduard Hogea2, Darian Onchiş2, Radu Tudor Ionescu1,
1University of Bucharest, Romania, 2West University of Timişoara, Romania
Corresponding author: raducu.ionescu@gmail.com.

Abstract

The ever-growing realism and quality of generated videos make it increasingly harder for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.

1 Introduction

Online fraud and misinformation based on deepfake videos reached unprecedented expansion rates in recent years. A recent forensics report suggests that identity fraud rates nearly doubled, showing a significant rise in the prevalence of deepfake videos between 2022 and 2024, from 29% to 49% (https://regulaforensics.com/news/deepfake-fraud-doubles-down/). This rise of deepfake content is primarily caused by recent advances in generative AI, especially the emergence of highly capable diffusion models [15, 46, 71, 70, 74, 81, 76, 77]. The high quality and realism of generated videos make it difficult for online users to tell the difference between real and fake content. In this context, humans can turn to state-of-the-art automatic deepfake detectors for help [10, 75, 78, 56, 32, 44, 25, 29]. To be truly useful, such models need to be robust and trustworthy, while also providing explainable decisions that would enable users to gain important insights regarding the kinds of artifacts they should look for in a video. Yet, Croitoru et al. [19] showed that state-of-the-art deepfake detectors fail to generalize to content generated by new generative models (not seen during training). Moreover, to the best of our knowledge, the task of generating textual descriptions to explain the artifacts observed in deepfake videos has not been explored so far.

[Figure 1]

To this end, we introduce the first dataset and benchmark for Explainable Deepfake Detection in Video, called ExDDV. Our new dataset comprises approximately 5.4K real and deepfake videos that are manually annotated with text descriptions, clicks and difficulty levels. The text descriptions explain the artifacts observed by human annotators, while the clicks provide precise localizations of the described artifacts, as shown in Figure 1. The annotated videos are gathered from a broad set of existing datasets for video deepfake detection, including DeeperForensics [36], FaceForensics++ [67], DeepFake Detection Challenge [21] and BioDeepAV [19], to enhance the diversity of our collection. The inter-annotator agreement (0.6238 cosine similarity in Sentence-BERT space) confirms that the collected annotations are consistent and of high quality. ExDDV comes with an official split into training, validation and test, which facilitates result reproducibility and future comparisons.

We further evaluate a number of vision-language models (VLMs) on ExDDV, comparing various architectures, training procedures and supervision signals. In terms of architectures, we experiment with BLIP-2 [40], Phi-3-Vision [1] and LLaVA-1.5 [45]. In terms of training strategies, we consider pre-trained versions, as well as versions based on in-context learning and fine-tuning. Regarding the supervision signals, we consider text descriptions alone or in combination with clicks. For click supervision, we study two alternative approaches, namely soft and hard input masking. Our empirical results show that fine-tuning provides the most accurate explanations for all VLMs, confirming the utility of ExDDV in developing robust explainable models for deepfake videos. Moreover, we find that both text and click supervision signals are required to jointly localize and describe the observed artifacts, as well as to generate top-scoring explanations.

In summary, our contribution is threefold:

  • We introduce the first dataset for explainable deepfake detection in video, comprising 5.4K videos that are manually labeled with descriptions, clicks and difficulty levels.

  • We study various VLM architectures and training strategies for explainable deepfake detection, all leading to a comprehensive benchmark.

  • We publicly release our dataset and code to reproduce the results and foster future research.

2 Related Work

Deepfake detection in video. Nowadays, deepfakes pose a real threat, as generative methods have significantly evolved and their number has increased. Nevertheless, substantial effort has been made to develop detection methods [19] and counter the misuse of generative AI technology. Early methods for deepfake detection in video were based on convolutional networks [2, 24, 27, 68, 3, 53, 56, 32, 44, 82, 18, 25]. To handle both spatial and temporal dimensions, two different strategies are commonly adopted. The first is to apply 2D convolutions on individual frames and subsequently combine the resulting latent representations, either by using basic operations (such as pooling or concatenation) [2, 24] or by employing recurrent neural networks [27, 68, 3, 53, 56, 32, 44]. The second strategy involves extending the convolutions to 3D to capture spatio-temporal features [82, 18, 25].

To learn more robust feature representations or enhance the detection of frame-level inconsistencies and motion artifacts, researchers have incorporated attention mechanisms into deepfake detectors [10, 75, 78]. More recent works [83, 26, 17] employed the transformer architecture [73, 22]. Such methods have a superior ability to capture long-range dependencies and are effectively applied to detect temporal inconsistencies.

To the best of our knowledge, current video deepfake detectors do not have intrinsic capabilities to explain their decisions. This is primarily caused by the lack of deepfake datasets providing explanatory annotations for the video content.

Deepfake video datasets. The task of deepfake detection has been extensively studied, and thus, there are many datasets that are now publicly available. Among these, the most popular are LAV-DF [11], GenVideo [13], DeepFake Detection Challenge [21], DeeperForensics [36], FakeAVCeleb [37], Celeb-DF [41], FaceForensics++ [67] and WildDeepfake [85]. Although such datasets contain the binary label (real or fake) associated with each video, they do not provide other kinds of annotations about the video content. Unlike existing datasets, we provide a dataset for deepfake detection with textual explanations for the artifacts observed by human annotators, along with clicks (points) that indicate artifact locations. To the best of our knowledge, our novel dataset is the first to provide explanatory annotations for deepfake video content.

Explainable deepfake detection. The research community has extensively explored explainable AI (XAI), proposing various approaches [38, 84, 34, 55, 47, 66, 48, 69]. While there are many established methods for XAI, such as Gradient-weighted Class Activation Mapping (Grad-CAM) [69], SHapley Additive exPlanations (SHAP) [48] or LIME [66], explainable AI has barely been applied to deepfake detection. Ishrak et al. [33] implemented a binary classification model to detect whether video frames are artificially generated or not. They then employed Grad-CAM [69] to estimate the salient regions that could explain the prediction, subsequently verifying if these regions overlap with the area of the face. Grad-CAM is a generic framework that back-propagates gradients at various layers and computes a global average to obtain saliency maps.

Other methods detect deepfakes by focusing on a single specific factor. Using only convolutional architectures, Haliassos et al. [28, 29] designed a detection method that concentrates on mouth movements, while Demir et al. [20] proposed an approach that analyzes face motions. Due to their focus on specific factors, these methods can inherently provide some limited level of explainability.

To the best of our knowledge, none of the existing explainable AI methods are specifically designed to explain deepfake videos. However, we acknowledge that researchers have studied the explainability of generated images. For example, the WHOOPS dataset [9] provides explanations for why synthetic images defy common sense. The authors note that such images can be easily identified by humans, raising the question of whether machines can do the same. In contrast, we focus on deepfake video content instead of generated images. We emphasize that deepfake content is potentially harmful to humans. Since deepfake videos are typically generated with a harmful intent in mind, they are not meant to defy common sense (quite the contrary). We thus consider the study of Bitton-Guetta et al. [9] as complementary to ours.

3 Proposed Benchmark

The main contribution of our work is to introduce ExDDV, a benchmark specifically designed to facilitate human-interpretable explanations of deepfake videos. The novelty of ExDDV stems not only from being the first of its kind, but also from its comprehensive exploration of the task.

Video collection. The fake videos are collected from four different sources: DeeperForensics [36], FaceForensics++ [67], DeepFake Detection Challenge (DFDC) [21] and BioDeepAV [19]. This ensures that the dataset covers a wide range of generative methods and includes various video resolutions and durations. As real video samples, we use the original movies from DeeperForensics and FaceForensics++. From FaceForensics++, we include all videos generated by the Face2Face, FaceSwap, and FaceShifter methods. From the other two datasets (DFDC and BioDeepAV), we randomly select a number of videos. In Table 1, we present the breakdown of source datasets from which we gather videos for ExDDV. We provide an official split of the video collection into 4,380 training videos, 482 validation videos and 485 test videos.

Table 1: Breakdown of source datasets and generation methods included in ExDDV.

Source Dataset                    | Method       | #Samples
DeeperForensics / FaceForensics++ | Real         | 1,000
DeeperForensics                   | DF-VAE       | 1,000
FaceForensics++                   | Face2Face    | 1,000
FaceForensics++                   | FaceSwap     | 1,000
FaceForensics++                   | FaceShifter  | 1,000
DeepFake Detection Challenge      | Multiple     | 269
BioDeepAV                         | Talking-face | 100
Total                             |              | 5,369

Annotation process. To annotate the dataset, we developed a simple application from scratch, with a basic graphical user interface (GUI). As illustrated in Figure 2, the GUI contains two video players, side by side. The deepfake video is played on the left side of the window, while the corresponding real video is played on the right side. By default, the real video is not played. If the annotators need to play the real video along with the deepfake one, they can simply press the button below the player screen. While the fake video is playing, the user can click anywhere on the frame to point out the location of artifacts. Behind the scenes, the application records the relative pixel location and the timestamp (i.e. frame index) of the click. Under the fake video player, there is a text box in which the user can write the details describing the visual issues. In the bottom right, there are three radio buttons, which allow the user to indicate the difficulty level of identifying deepfake evidence. We ask users to label deepfake videos as hard when they need to play the deepfake video at least two times, or when they need to activate the real video to observe artifacts. Similarly, we instruct them to label videos as easy if they are able to identify more than one artifact with a single play of the deepfake video. For real videos, we do not collect clicks and difficulty annotations.

[Figure 2]

The annotation effort was carried out by two paid human annotators. Each source dataset is split equally into two subsets, one per annotator, such that each method and each possible kind of visual issue is described from two distinct points of view. We present a random selection of annotated videos in Figure 7 from the supplementary.

Table 2: Inter-annotator agreement between the text descriptions of the two annotators.

Measure       | Average Score
Sentence-BERT | 0.6238
BERTScore     | 0.3857
BLEU          | 0.0808
METEOR        | 0.2349
ROUGE-1       | 0.3023
ROUGE-2       | 0.1037
ROUGE-L       | 0.2434

Inter-annotator agreement. To estimate the inter-annotator agreement and guarantee the consistency of the annotations, we provide 100 deepfake videos to both annotators. All these videos have explanations, clicks and difficulty levels from both annotators.

We first evaluate the similarity between the text descriptions provided by the two annotators in terms of the same semantic similarity metrics used to evaluate the XAI models, which are described in Section 5. Although we include several measures for completeness, we emphasize that conventional scores, such as BLEU [60], are not able to capture whether two sentences have the same meaning if the words are different [12], e.g. when annotators use synonyms to refer to the same concept. We therefore consider the similarities based on Sentence-BERT [64] and BERTScore [80] as more reliable. The results shown in Table 2 demonstrate a close semantic alignment between textual descriptions, with an average cosine similarity in the Sentence-BERT embedding space of 0.6238. The discrepancy between the higher Sentence-BERT and BERTScore values and the lower BLEU, METEOR and ROUGE scores indicates that the annotators often used different words to explain the same artifacts.

Table 3: Temporal, spatial and joint agreement between the clicks of the two annotators, for various temporal (frames) and spatial (pixels) windows.

Temporal Window | Spatial Window | Temporal Agreement (%) | Spatial Agreement (%) | Joint Agreement (%)
30              | 50×50          | 54.70                  | 79.60                 | 35.10
60              | 75×75          | 66.14                  | 89.69                 | 53.82
120             | 100×100        | 84.54                  | 93.94                 | 75.87

We also assess the temporal and spatial agreement for the clicks provided by the two annotators. We measure the temporal alignment in terms of the percentage of click pairs that match inside a temporal window of at most 30, 60 and 120 frames, respectively. Since the average FPS is 30, the window lengths correspond to 1, 2 and 4 seconds, respectively. Similarly, we measure the spatial alignment in terms of the percentage of click pairs that match inside a spatial window of 50, 75, and 100 pixels. Finally, we compute the joint spatio-temporal agreement as the percentage of click pairs situated inside the same temporal and spatial windows. As shown in Table 3, more than 53% of the clicks made by the annotators are paired within a reasonable spatio-temporal window of 60 frames (about 2 seconds) and 75×75 pixels. This high percentage of matched clicks indicates a significant agreement in terms of the artifact locations indicated by the two annotators through clicks.
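For illustration, a minimal sketch of how such spatio-temporal click matching could be computed. The tuple layout, the box-shaped spatial window and the greedy one-to-one pairing are our assumptions, not the authors' exact procedure:

```python
import numpy as np

def click_agreement(clicks_a, clicks_b, max_frames=60, max_pixels=75):
    """Fraction of annotator A's clicks that can be paired with a click from
    annotator B within the given temporal window (in frames) and spatial
    window (interpreted here as a max_pixels x max_pixels box)."""
    matched = 0
    for f_a, x_a, y_a in clicks_a:
        for f_b, x_b, y_b in clicks_b:
            temporal_ok = abs(f_a - f_b) <= max_frames
            spatial_ok = abs(x_a - x_b) <= max_pixels and abs(y_a - y_b) <= max_pixels
            if temporal_ok and spatial_ok:
                matched += 1
                break  # pair each click from A at most once
    return matched / max(len(clicks_a), 1)

# Hypothetical clicks (frame_index, x, y) from two annotators on the same video.
a = [(120, 340, 210), (450, 300, 220)]
b = [(130, 355, 205), (800, 510, 430)]
print(click_agreement(a, b))  # 0.5: only the first pair matches
```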

To assess the agreement of the difficulty ratings, we compute Cohen's κ coefficient between the difficulty categories assigned by the two annotators. The resulting κ of 0.59 indicates moderate agreement, with most disagreements observed between the easy and medium levels.
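Cohen's κ between two annotators' categorical labels can be computed directly with scikit-learn; the ratings below are hypothetical placeholders:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical difficulty ratings from the two annotators for the same videos.
ratings_a = ["easy", "medium", "hard", "medium", "easy"]
ratings_b = ["easy", "easy", "hard", "medium", "medium"]
print(cohen_kappa_score(ratings_a, ratings_b))
```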

Overall, the various inter-annotator agreement measures indicate that the annotations are consistent. In addition, we also visually inspected the collected annotations and confirmed that they are of sufficient quality.

Post-processing. One of the post-processing steps is to correct spelling errors. We employ an automatic spell checker [6] to find misspelled words, and then correct them manually. We also change the forms of some words in order to consistently use American English instead of British English (e.g. changing “colour” to “color”).
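A minimal sketch of flagging misspelled words with pyspellchecker [6]; the function name and the example sentence are illustrative, and the flagged words are then fixed by hand:

```python
from spellchecker import SpellChecker  # pyspellchecker [6]

spell = SpellChecker(language="en")

def flag_misspellings(description: str):
    """Return the words in an annotation that the spell checker does not know."""
    words = [w.strip(".,!?") for w in description.lower().split()]
    return spell.unknown(words)

# Hypothetical annotation; flagged words (e.g. 'unnaturaly') are corrected manually.
print(flag_misspellings("The colour of the lips flickers unnaturaly"))
```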

Table 4: Statistics of the videos in ExDDV.

Statistic        | Min  | Max   | Average
FPS              | 8    | 60    | 27.66
Frames           | 50   | 1814  | 477.5
Length (seconds) | 2.01 | 72.56 | 17.54
Clicks           | 1    | 35    | 4.9

Statistics. The dataset consists of a total of 5,369 videos and 2,553,148 frames, with an average number of frames per video of approximately 477. We report several statistics about ExDDV in Table 4. The frames per second (FPS) rate varies from 8 to 60, with 90% of all videos having either 25 or 30 FPS. The total number of clicks is 21,282. On average, each movie is annotated with around 4.9 clicks, the number of clicks per video ranging from 1 to 35. ExDDV comprises a wide range of video resolutions, from 480×272 to 1080×1920 pixels, with most videos having 720×1280, 480×640 or 1080×1920 pixels (see Figure 9 from the supplementary for the number of videos per resolution). Each video is annotated with a single text description.

4 Explainable Methods

We consider various vision-language models as candidates for deepfake video XAI methods. Although there have been many efforts on visual question answering (VQA) [8, 51, 4, 23, 52, 65, 50, 79, 61], it was only with the significant growth of large language models (LLMs) [62, 58, 57, 72, 35, 54, 43] that capable vision-language models (VLMs) have recently emerged. For our task, we consider three different model families: Bootstrapping Language-Image Pre-training (BLIP) [39, 40], Large Language and Vision Assistant (LLaVA) [45] and Phi-3 [1].

4.1 Baseline architectures

All chosen VLMs involve three components: an image encoder, a text encoder, and a decoder. First, the input image and the query are encoded with their corresponding modules. The second step is to project the visual tokens into the same vector space as the text tokens. Finally, all tokens are combined and decoded to generate an answer.

BLIP-2. The original BLIP architecture is designed to train a unified VLM end-to-end. It bootstraps from noisy image-text pairs by generating synthetic captions and filtering out noise. The training regime combines contrastive learning, image-text matching and autoregressive language modeling. BLIP-2 takes a different route by avoiding an end-to-end approach. Instead, it introduces a lightweight Querying Transformer (Q-Former) that serves as a bridge between the image encoder and the LLM. The Q-Former employs a fixed set of learnable query tokens to extract the most relevant features from the image encoder, resulting in an efficient design with minimal computational costs. We opt for the BLIP-2 model based on an Open Pretrained Transformer (OPT) backbone with 2.7B parameters. For efficiency, we use the version optimized through int8 quantization.
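A minimal loading and querying sketch in the Hugging Face ecosystem, assuming the Salesforce/blip2-opt-2.7b checkpoint and the bitsandbytes int8 path; the checkpoint name, prompt and generation settings are our assumptions, not prescribed by the paper:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# BLIP-2 with an OPT-2.7B language backbone, loaded with int8 quantization.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto"
)

frame = Image.open("frame.png").convert("RGB")  # a frame extracted from the video
prompt = "Question: What looks manipulated in this face? Answer:"
inputs = processor(images=frame, text=prompt, return_tensors="pt").to(
    device=model.device, dtype=torch.float16
)
generated_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```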

Phi-3-Vision. The Phi-3 [1] suite of models ranges in functionality from solving pure language tasks to multimodal (vision and language) tasks. For our task, we employ the Phi-3-Vision model, which adopts a data-centric approach rather than simply increasing model size or applying distinct training regimes. The model is pre-trained using a curated pipeline that combines heavily filtered public web data with synthetic data generated by LLMs. This pre-training strategy first develops a broad language understanding, then enhances the model's reasoning and logical inference capabilities. We select the variant with a 128K-token context, which uses flash attention.

LLaVA-1.5. The LLaVA [45] family seamlessly integrates visual inputs with an LLM to generate natural language descriptions for visual tasks. When an image is provided as input, a pre-trained vision encoder extracts the relevant features, which are then projected into the LLM’s token space through a linear transformation that preserves both spatial and semantic information. The LLM processes this sequence auto-regressively, producing responses that are inherently grounded in visual input. In our application, this approach enables the model not only to accurately answer questions about deepfake detection, but also to provide detailed and interpretable explanations of its reasoning. We opt for the LLaVA-1.5 version, with 7B parameters.

[Figure 3]

Prompting. We prompt the pre-trained VLMs with the following query to obtain explanations about visual artifacts: “Analyze the face in the image. Identify any deepfake artifacts (if any), focusing specifically on the affected parts of the face. Provide a short and direct explanation highlighting the inconsistencies or manipulations”.
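As an illustration, a minimal sketch of issuing this query to one of the pre-trained VLMs (here LLaVA-1.5), assuming the llava-hf/llava-1.5-7b-hf checkpoint and its USER/ASSISTANT conversation format; both are assumptions about the concrete setup:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed LLaVA-1.5 (7B) checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

query = ("Analyze the face in the image. Identify any deepfake artifacts (if any), "
         "focusing specifically on the affected parts of the face. Provide a short "
         "and direct explanation highlighting the inconsistencies or manipulations")
prompt = f"USER: <image>\n{query} ASSISTANT:"  # LLaVA-1.5 conversation format

frame = Image.open("middle_frame.png").convert("RGB")
inputs = processor(images=frame, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```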

4.2 In-context learning

Pipeline. In Figure 3, we illustrate our few-shot in-context learning pipeline. The training examples are chosen by a k-nearest neighbor (k-NN) model for each test video. The k-NN model extracts feature vectors with the CLIP [63] image encoder, which is based on the ResNet [30] architecture. Features are only extracted from the training frames that are annotated with clicks, and each such frame is treated as an independent sample by the k-NN. For efficient inference, we store the training feature embeddings along with the associated annotations. When a test video is provided, we obtain its features via a similar process, extracting the ResNet features for the middle frame. Then, we search for the closest k training frames based on the cosine similarity between feature embeddings. The annotations of the nearest neighbors are used to enrich the custom prompt given to the VLM. The k-NN procedure is expected to provide a set of k deepfake explanations that are potentially relevant for the test video. The custom prompt instructs the VLM to provide a similar explanation for the input test video.
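A minimal sketch of the retrieval step, assuming the CLIP ResNet features have already been precomputed and stored as arrays; the function names and prompt wording are illustrative:

```python
import numpy as np

def retrieve_neighbors(test_feat, train_feats, train_explanations, k=5):
    """Return the k training explanations whose frame embeddings are most
    similar (cosine) to the test frame embedding.

    test_feat: (d,) array; train_feats: (n, d) array of precomputed CLIP
    ResNet features for the click-annotated training frames.
    """
    test_feat = test_feat / np.linalg.norm(test_feat)
    train_norm = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = train_norm @ test_feat            # cosine similarities to all training frames
    top_k = np.argsort(-sims)[:k]            # indices of the k nearest frames
    return [train_explanations[i] for i in top_k]

def build_prompt(neighbor_explanations, base_query):
    """Enrich the VLM prompt with the retrieved in-context examples."""
    examples = "\n".join(f"- {e}" for e in neighbor_explanations)
    return (f"Here are explanations written for visually similar deepfake frames:\n"
            f"{examples}\nNow, {base_query}")
```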

Hyperparameters. We consider two alternative CLIP image encoders for the k-NN: ResNet-50 and ResNet-101. For the number of neighbors k, we test values in the set {1, 3, 5, 10}. The best choice is k=5. To perform the k-NN search at various representation levels, we extract features at three different depth levels. We use low-level features right before the first residual block, mid-level features from an intermediate residual block (equally distanced from the input and output), and high-level features from the last layer. The features producing the best results are the high-level ones. We keep the default parameters of all VLMs during inference, except for the temperature, which is set to 0 to reduce the chance of generating hallucinations.

4.3 Fine-tuning

Pipeline. On the one hand, we follow the default fine-tuning methodology for the BLIP-2 model. On the other hand, given the increased memory requirements of Phi-3-Vision and LLaVA-1.5, we employ Low-Rank Adaptation (LoRA) [31] to fine-tune these two models.

To make a prediction for a whole video, the first step is to extract multiple frames from the video. Our strategy is to use equally-spaced frames between the first and last frames, while the number of frames depends on the model. Since BLIP-2 and Phi-3-Vision are lighter models, we provide 5 frames as input to these models. Phi-3-V natively supports multiple frames, while for BLIP-2, we process each frame with the visual encoder, then aggregate the resulting embeddings and provide the average embedding to the Q-Former. Along with the input frames, we provide a unique query prompt: “What is wrong in the images? Explain why they look real or fake”. In our preliminary experiments, we tried to vary the prompt, considering even more complex prompts. However, we did not observe any significant differences, so we decided to stick with a single and relatively simple prompt for the presented experiments.
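A sketch of the frame selection and of the BLIP-2 embedding averaging, assuming a Hugging Face-style vision encoder whose output exposes last_hidden_state; both helper names are ours:

```python
import numpy as np
import torch

def sample_frame_indices(num_frames_in_video, num_frames_to_keep=5):
    """Equally-spaced frame indices between the first and last frame of the video."""
    return np.linspace(0, num_frames_in_video - 1, num_frames_to_keep).round().astype(int)

def average_visual_embedding(vision_encoder, frames):
    """Encode each frame separately and average the visual embeddings, which
    would then be passed to the Q-Former (sketch of the BLIP-2 setup).

    frames: list of (C, H, W) tensors already preprocessed for the encoder."""
    with torch.no_grad():
        embeds = [vision_encoder(f.unsqueeze(0)).last_hidden_state for f in frames]
    return torch.stack(embeds, dim=0).mean(dim=0)  # (1, num_patches, hidden_size)
```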

Hyperparameters. Phi-3-Vision and LLaVA are trained for 10 epochs with mini-batches of 16 and 32 samples, respectively. BLIP-2 is also fine-tuned for 10 epochs with a mini-batch size of 16, but with a gradient accumulation of 2. All models are optimized with AdamW, using a learning rate of 2×10^-4 and a cosine annealing scheduler to reduce the learning rate during training. For Phi-3-V, we use a LoRA rank of 64 and a dropout rate of 0.05. For LLaVA, we set the rank to 128 and do not employ dropout. The LoRA parameter α is set to 256 for both models.
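A hypothetical LoRA configuration mirroring the reported Phi-3-Vision hyperparameters (rank 64, α=256, dropout 0.05), using the peft library; the target_modules list is an assumption that depends on the concrete model implementation:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,                       # LoRA rank
    lora_alpha=256,             # LoRA alpha
    lora_dropout=0.05,          # LoRA dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wraps the base VLM with LoRA adapters
```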

4.4 Click supervision

We harness the additional supervision signal in our dataset, namely the locations of the visual flaws represented by mouse clicks. While this idea was found to be useful in a number of vision tasks [59, 7, 14, 16, 49], to our knowledge, it has not been explored to train explainable video models. The rationale behind using click supervision is to help the explainable model to focus on the region of interest (ROI) and consequently provide precise explanations.

[Figure 4]
Table 5: Results of BLIP-2, Phi-3-Vision and LLaVA-1.5 in the pre-trained, in-context and fine-tuned settings, with and without click supervision (hard or soft masking). Values are averages over three runs (± standard deviation).

Model | Masking | Sentence-BERT | BERTScore | BLEU | METEOR | ROUGE-1 | ROUGE-2 | ROUGE-L
BLIP-2 pre-trained | - | 0.09 | 0.03 | 0.01 | 0.04 | 0.05 | 0.00 | 0.05
BLIP-2 in-context | - | 0.44±0.008 | 0.19±0.004 | 0.03±0.001 | 0.21±0.001 | 0.14±0.003 | 0.06±0.001 | 0.14±0.001
BLIP-2 in-context | hard | 0.47±0.002 | 0.20±0.001 | 0.04±0.001 | 0.21±0.003 | 0.14±0.005 | 0.06±0.001 | 0.14±0.002
BLIP-2 in-context | soft | 0.46±0.004 | 0.20±0.003 | 0.04±0.001 | 0.20±0.001 | 0.14±0.001 | 0.06±0.001 | 0.14±0.002
BLIP-2 fine-tuned | - | 0.45±0.009 | 0.29±0.002 | 0.09±0.010 | 0.14±0.015 | 0.21±0.010 | 0.09±0.007 | 0.20±0.009
BLIP-2 fine-tuned | hard | 0.55±0.005 | 0.36±0.006 | 0.14±0.005 | 0.22±0.007 | 0.31±0.004 | 0.38±0.004 | 0.29±0.004
BLIP-2 fine-tuned | soft | 0.54±0.006 | 0.36±0.002 | 0.13±0.002 | 0.21±0.001 | 0.30±0.004 | 0.14±0.001 | 0.28±0.003
Phi-3-V pre-trained | - | 0.25 | 0.10 | 0.01 | 0.01 | 0.05 | 0.00 | 0.04
Phi-3-V in-context | - | 0.30±0.002 | 0.14±0.003 | 0.03±0.002 | 0.18±0.005 | 0.15±0.002 | 0.05±0.002 | 0.13±0.002
Phi-3-V in-context | hard | 0.30±0.002 | 0.14±0.005 | 0.03±0.004 | 0.18±0.005 | 0.16±0.003 | 0.05±0.003 | 0.13±0.005
Phi-3-V in-context | soft | 0.30±0.006 | 0.14±0.003 | 0.03±0.004 | 0.18±0.004 | 0.15±0.002 | 0.05±0.003 | 0.13±0.004
Phi-3-V fine-tuned | - | 0.42±0.005 | 0.30±0.002 | 0.06±0.001 | 0.20±0.004 | 0.21±0.005 | 0.07±0.005 | 0.19±0.006
Phi-3-V fine-tuned | hard | 0.53±0.004 | 0.38±0.003 | 0.09±0.004 | 0.27±0.004 | 0.28±0.002 | 0.11±0.004 | 0.25±0.003
Phi-3-V fine-tuned | soft | 0.53±0.003 | 0.38±0.003 | 0.09±0.008 | 0.26±0.011 | 0.28±0.009 | 0.11±0.010 | 0.25±0.009
LLaVA pre-trained | - | 0.39 | 0.18 | 0.02 | 0.18 | 0.12 | 0.02 | 0.10
LLaVA in-context | - | 0.48±0.002 | 0.22±0.008 | 0.03±0.001 | 0.21±0.001 | 0.12±0.002 | 0.03±0.002 | 0.10±0.002
LLaVA in-context | hard | 0.47±0.002 | 0.29±0.010 | 0.05±0.001 | 0.22±0.001 | 0.20±0.002 | 0.04±0.000 | 0.16±0.002
LLaVA in-context | soft | 0.50±0.006 | 0.23±0.004 | 0.04±0.001 | 0.20±0.002 | 0.12±0.002 | 0.04±0.001 | 0.11±0.002
LLaVA fine-tuned | - | 0.45±0.003 | 0.33±0.003 | 0.07±0.002 | 0.23±0.003 | 0.24±0.001 | 0.08±0.001 | 0.21±0.001
LLaVA fine-tuned | hard | 0.49±0.011 | 0.35±0.009 | 0.08±0.005 | 0.24±0.009 | 0.25±0.011 | 0.09±0.007 | 0.22±0.008
LLaVA fine-tuned | soft | 0.49±0.006 | 0.35±0.004 | 0.08±0.002 | 0.25±0.002 | 0.25±0.003 | 0.09±0.000 | 0.22±0.003

In Figure 4, we present our pipeline based on click supervision, which comprises a click predictor and a frame masking heuristic that preserves the ROI and masks the rest of the frame. To train the click predictor, we first extract all the training frames in which a visual glitch is annotated through a click, along with the coordinates of each click. The coordinates are further normalized between 0 and 1 to make the click predictor invariant to distinct frame resolutions. Next, we train a ViT-based regression model to predict click coordinates. The ViT [22] backbone is pre-trained on ImageNet. During inference, we apply the regressor on each test frame to predict the clicks. Then, we mask the image area outside the ROI, which is defined as a round area of radius r around the predicted click coordinates. We consider two alternative masking operations, soft and hard. Hard masking replaces the masked pixels with zero. Soft masking weights the pixels with a 2D Gaussian centered in the predicted click location, with σ = r. The masking operation is performed after the input images are normalized.
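A minimal sketch of the two masking operations on a single frame, assuming (H, W, C) arrays and pixel-space click coordinates; the function names are ours:

```python
import numpy as np

def hard_mask(frame, click_xy, r=75):
    """Zero out all pixels outside a circular ROI of radius r around the click.
    frame: (H, W, C) array (already normalized); click_xy: (x, y) in pixels."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (xs - click_xy[0]) ** 2 + (ys - click_xy[1]) ** 2
    mask = (dist2 <= r ** 2).astype(frame.dtype)
    return frame * mask[..., None]

def soft_mask(frame, click_xy, r=75):
    """Attenuate pixels with a 2D Gaussian centered at the click, with sigma = r."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (xs - click_xy[0]) ** 2 + (ys - click_xy[1]) ** 2
    weights = np.exp(-dist2 / (2 * r ** 2)).astype(frame.dtype)
    return frame * weights[..., None]
```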

Hyperparameters. We consider two models pre-trained on ImageNet to predict clicks, a ViT-B and a ResNet-50. We replace their classification heads with a regression head to predict the two coordinates of a click. The click predictors are trained for 15 epochs with the AdamW optimizer and a learning rate of 9×10^-5. The learning rate is reduced on plateau by half, if the validation loss does not change for two consecutive epochs. An important hyperparameter for the masking heuristic is the mask radius r, which needs to be fixed to provide just enough context for the VLM. We tuned r between 50 and 150 pixels with a step of 5 pixels. The optimal radius is r=75.
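A sketch of the ViT-B click regressor with the reported optimizer and scheduler settings; the L1 training loss and the exact patience value are our assumptions (the paper only reports MAE as the evaluation metric and a plateau patience of two epochs):

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# ViT-B pre-trained on ImageNet, with the classification head replaced by a
# 2-output regression head for the normalized (x, y) click coordinates.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)

criterion = nn.L1Loss()  # assumed regression loss (MAE)
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

def training_step(images, clicks):
    """images: (B, 3, 224, 224) tensors; clicks: (B, 2) coordinates in [0, 1]."""
    optimizer.zero_grad()
    loss = criterion(model(images), clicks)
    loss.backward()
    optimizer.step()
    return loss.item()
```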

5 Experiments and Results

Research questions. Through our experiments, we aim to address the following research questions (RQs):

  1. Are the collected annotations useful to train XAI models for deepfake videos?

  2. What is the performance impact of click supervision?

  3. How many data samples are required to train XAI models for deepfake videos?

  4. Can the locations of artifacts be accurately predicted?

To address RQ1, we compare off-the-shelf (pre-trained) VLMs with VLMs based on two alternative training strategies, namely fine-tuning and in-context learning. To answer RQ2, in both types of learning frameworks, we harness the additional information provided through clicks and assess the performance gains brought by this supervision signal. To answer RQ3, we train the top-scoring VLM with training set sizes in the set {128, 256, 512, 1024, 2048, 4380}. To address RQ4, we report results with two click predictors that are both fine-tuned on ExDDV training data.

Evaluation measures. Although there are many measures to assess either the semantic similarity or the n-gram overlap of two text samples, such measures are not able to fully capture the similarity between texts. As a result, we evaluate the XAI models using a wide range of metrics, aiming to provide an extensive evaluation of the results. For semantic understanding, we employ Sentence-BERT [64] to embed both predicted and ground-truth descriptions, and compute the cosine similarity between the two. We also employ BERTScore [80] to compute the similarity at the token level. Every token in the ground-truth is greedily matched with a token in the prediction to compute a recall. Similarly, every token in the prediction is matched with a token in the ground-truth to compute the precision. These are combined into an F1 score, called BERTScore. To assess n-gram overlaps, we adopt the most popular evaluation measures used in image captioning: BLEU [60], METEOR [5] and ROUGE [42]. For these metrics, we set the maximum n-gram length to n=2. We evaluate click predictors in terms of the mean absolute error (MAE). We run each experiment three times and report the average scores and the standard deviations.
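A sketch of the two semantic measures using the sentence-transformers and bert-score packages; the specific Sentence-BERT checkpoint name is an assumption:

```python
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

def semantic_scores(predictions, references, sbert_name="all-MiniLM-L6-v2"):
    """Average Sentence-BERT cosine similarity and BERTScore F1 between
    predicted and ground-truth explanations (aligned list of strings)."""
    sbert = SentenceTransformer(sbert_name)
    pred_emb = sbert.encode(predictions, convert_to_tensor=True)
    ref_emb = sbert.encode(references, convert_to_tensor=True)
    cos = util.cos_sim(pred_emb, ref_emb).diagonal().mean().item()

    _, _, f1 = bert_score(predictions, references, lang="en")
    return {"sentence_bert": cos, "bert_score_f1": f1.mean().item()}
```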

[Figure 5]

Quantitative results. The results of our experiments are shown in Table 5. We present the results for all models (BLIP-2, Phi-3-Vision and LLaVA-1.5) and learning scenarios (pre-training, in-context learning and fine-tuning), with click supervision integrated via either soft or hard masking. Consistent with the inter-annotator agreement scores, we observe that all models yield better scores in terms of semantic measures than n-gram overlap measures. The gap can be explained by the fact that models generate varied outputs, often using alternative phrases and words to express the same concept.

All fine-tuned models surpass the pre-trained versions by large margins. Although the metrics indicate that the pre-trained LLaVA is close to the fine-tuned models, a visual inspection of its generated answers indicates that they are very generic, in many cases just describing the videos and not providing any information about their authenticity. The reported results offer strong evidence for RQ1, indicating that the collected annotations are useful to train XAI models, both via in-context learning and fine-tuning.

The reported scores also attest to the advantages of integrating click predictions via soft and hard masking. This is observable for all three models, although the gains are somewhat lower for LLaVA. Both masking strategies appear to be equally effective. In response to RQ2, we find that click supervision has a positive impact, boosting the performance of all tested VLMs.

In Figure 6, we showcase the performance of the fine-tuned LLaVA model for different training set sizes. The performance reaches a plateau after 2,000 training samples. This observation suggests that our dataset contains enough samples to train an explainable deepfake detection model, answering RQ3.

[Figure 6]

In Table 6, we report the mean absolute errors for both ViT and ResNet-50 click predictors. The error of ViT is slightly lower, representing an average offset of only 12 pixels w.r.t. the ground-truth coordinates. These results confirm that the regression models can accurately localize visual artifacts, thus providing a positive answer to RQ4.

Qualitative results. Besides the quantitative measurements, we also present qualitative results. In Figure 5, we illustrate some explanations provided by several variants of LLaVA. The examples include both relevant explanations and wrong ones, e.g. identifying artifacts in real videos. We present examples for additional versions of LLaVA in Figure 8 from the supplementary.

Table 6: Mean absolute errors of the ViT and ResNet-50 click predictors.

Model | ViT    | ResNet-50
MAE   | 0.0553 | 0.0595

In Figure 10 from the supplementary, we showcase some examples of how the ViT-based click predictor compares with the ground-truth click locations. We observe that the predictor is able to precisely locate visual artifacts.

6 Conclusion

In our work, we introduced a novel dataset for explainable deepfake detection in videos and made it publicly available. ExDDV consists of 5.4K manually annotated real and fake videos. In addition to the explanations for each video, the dataset also contains the locations of visual artifacts. We also explored different VLMs on the explainable deepfake detection task and evaluated their performance. The empirical results showed that the models are capable of learning to predict the source of visual errors in fake videos, while also detecting real videos. While the reported results are promising, we found that the tested VLMs score well below the inter-annotator agreement, suggesting that further exploration is required to build more capable models. In future work, we also aim to harness the difficulty labels, e.g. via curriculum learning, to boost performance.

With deepfake methods rapidly becoming more powerful and more accessible to the general public, we consider our work a stepping stone towards developing more robust and transparent detection models that mitigate the harms of deepfakes. We believe that our research will contribute to trustworthy AI systems that benefit society and reduce the skepticism around AI technology.

Acknowledgments. This work was supported by a grant of the Ministry of Research, Innovation and Digitization, CCCDI - UEFISCDI, project number PN-IV-P6-6.3-SOL-2024-2-0227, within PNCDI IV.

References

  • Abdin et al. [2024] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
  • Agarwal et al. [2020] Shruti Agarwal, Hany Farid, Tarek El-Gaaly, and Ser-Nam Lim. Detecting deep-fake videos from appearance and behavior. In Proceedings of WIFS, pages 1–6, 2020.
  • Amerini and Caldelli [2020] Irene Amerini and Roberto Caldelli. Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos. In Proceedings of IH&MMSec, pages 97–102, 2020.
  • Antol et al. [2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In Proceedings of ICCV, pages 2425–2433, 2015.
  • Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of ACL, pages 65–72, 2005.
  • Barrus [2024] Tyler Barrus. pyspellchecker - Pure Python Spell Checking Library, 2024. Accessed: 2025-03-05.
  • Benenson et al. [2019] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive object segmentation with human annotators. In Proceedings of CVPR, pages 11700–11709, 2019.
  • Bigham et al. [2010] Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, and Tom Yeh. VizWiz: nearly real-time answers to visual questions. In Proceedings of UIST, pages 333–342, 2010.
  • Bitton-Guetta et al. [2023] Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images. In Proceedings of ICCV, pages 2616–2627, 2023.
  • Bonettini et al. [2021] Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, and Stefano Tubaro. Video face manipulation detection through ensemble of CNNs. In Proceedings of ICPR, pages 5012–5019, 2021.
  • Cai et al. [2022] Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In Proceedings of DICTA, pages 1–10, 2022.
  • Callison-Burch et al. [2006] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of EACL, pages 249–256, 2006.
  • Chen et al. [2024] Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark. arXiv preprint arXiv:2405.19707, 2024.
  • Chen et al. [2022] Pengfei Chen, Xuehui Yu, Xumeng Han, Najmul Hassan, Kai Wang, Jiachen Li, Jian Zhao, Humphrey Shi, Zhenjun Han, and Qixiang Ye. Point-to-Box Network for Accurate Object Detection via Single Point Supervision. In Proceedings of ECCV, pages 51–67, 2022.
  • Chen et al. [2025] Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions. In Proceedings of AAAI, 2025.
  • Cheng et al. [2022] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. In Proceedings of CVPR, pages 2617–2626, 2022.
  • Choi et al. [2024] Jongwook Choi, Taehoon Kim, Yonghyun Jeong, Seungryul Baek, and Jongwon Choi. Exploiting style latent flows for generalizing deepfake video detection. In Proceedings of CVPR, pages 1133–1143, 2024.
  • Cozzolino et al. [2021] Davide Cozzolino, Andreas Rössler, Justus Thies, Matthias Nießner, and Luisa Verdoliva. ID-Reveal: Identity-aware DeepFake Video Detection. In Proceedings of ICCV, pages 15108–15117, 2021.
  • Croitoru et al. [2024] Florinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru, Nicolae Catalin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook. arXiv preprint arXiv:2411.19537, 2024.
  • Demir and Çiftçi [2024] Ilke Demir and Umur Aybars Çiftçi. How Do Deepfakes Move? Motion Magnification for Deepfake Source Detection. In Proceedings of WACV, pages 4768–4778, 2024.
  • Dolhansky et al. [2020] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397, 2020.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of ICLR, 2021.
  • Gao et al. [2015] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Proceedings of NeurIPS, pages 2296–2304, 2015.
  • Gu et al. [2022a] Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, and Lizhuang Ma. Delving into the local: Dynamic inconsistency learning for deepfake video detection. In Proceedings of AAAI, pages 744–752, 2022a.
  • Gu et al. [2022b] Zhihao Gu, Taiping Yao, Yang Chen, Shouhong Ding, and Lizhuang Ma. Hierarchical contrastive inconsistency learning for deepfake video detection. In Proceedings of ECCV, pages 596–613, 2022b.
  • Guan et al. [2022] Jiazhi Guan, Hang Zhou, Zhibin Hong, Errui Ding, Jingdong Wang, Chengbin Quan, and Youjian Zhao. Delving into sequential patches for deepfake detection. In Proceedings of NeurIPS, pages 4517–4530, 2022.
  • Güera and Delp [2018] David Güera and Edward J. Delp. Deepfake video detection using recurrent neural networks. In Proceedings of AVSS, pages 1–6, 2018.
  • Haliassos et al. [2021] Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Lips don't lie: A generalisable and robust approach to face forgery detection. In Proceedings of CVPR, pages 5037–5047, 2021.
  • Haliassos et al. [2022] Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, and Maja Pantic. Leveraging real talking faces via self-supervision for robust forgery detection. In Proceedings of CVPR, pages 14930–14942, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016.
  • Hu et al. [2022a] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of ICLR, 2022a.
  • Hu et al. [2022b] Juan Hu, Xin Liao, Jinwen Liang, Wenbo Zhou, and Zheng Qin. FInfer: Frame Inference-Based Deepfake Detection for High-Visual-Quality Videos. In Proceedings of AAAI, pages 951–959, 2022b.
  • Ishrak et al. [2024] Gazi Hasin Ishrak, Zalish Mahmud, MD Farabe, Tahera Khanom Tinni, Tanzim Reza, and Mohammad Zavid Parvez. Explainable Deepfake Video Detection using Convolutional Neural Network and CapsuleNet. arXiv preprint arXiv:2404.12841, 2024.
  • Jain et al. [2020] Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. Learning to faithfully rationalize by construction. In Proceedings of ACL, pages 4459–4473, 2020.
  • Jiang et al. [2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  • Jiang et al. [2020] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of CVPR, pages 2889–2898, 2020.
  • Khalid et al. [2021] Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S. Woo. FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. In Proceedings of NeurIPS, 2021.
  • Lei et al. [2016] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In Proceedings of EMNLP, pages 107–117, 2016.
  • Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of ICML, pages 12888–12900, 2022.
  • Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of ICML, pages 19730–19742, 2023.
  • Li et al. [2020] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of CVPR, pages 3207–3216, 2020.
  • Lin [2004] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of ACL, page 10, 2004.
  • Liu et al. [2024a] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434, 2024a.
  • Liu et al. [2023a] Baoping Liu, Bo Liu, Ming Ding, Tianqing Zhu, and Xin Yu. TI2Net: Temporal Identity Inconsistency Network for Deepfake Detection. In Proceedings of WACV, pages 4691–4700, 2023a.
  • Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Proceedings of NeurIPS, 2023b.
  • Liu et al. [2024b] Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, and Kai Yu. AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding. In Proceedings of ACM MM, pages 6696–6705, 2024b.
  • Liu et al. [2022] Yibing Liu, Haoliang Li, Yangyang Guo, Chenqi Kong, Jing Li, and Shiqi Wang. Rethinking Attention-Model Explainability through Faithfulness Violation Test. In Proceedings of ICML, pages 13807–13824, 2022.
  • Lundberg and Lee [2017] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of NeurIPS, pages 4765–4774, 2017.
  • Luo et al. [2024] Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Junchi Yan, and Yansheng Li. PointOBB: Learning Oriented Object Detection via Single Point Supervision. In Proceedings of CVPR, pages 16730–16740, 2024.
  • Ma et al. [2016] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. In Proceedings of AAAI, pages 3567–3573, 2016.
  • Malinowski and Fritz [2014] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of NeurIPS, pages 1682–1690, 2014.
  • Malinowski et al. [2015] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of ICCV, pages 1–9, 2015.
  • Masi et al. [2020] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, and Wael AbdAlmageed. Two-branch recurrent network for isolating deepfakes in videos. In Proceedings of ECCV, pages 667–684, 2020.
  • Minaee et al. [2024] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large Language Models: A Survey. arXiv preprint arXiv:2402.06196, 2024.
  • Mohankumar etal. [2020]AkashKumar Mohankumar, Preksha Nema, Sharan Narasimhan, MiteshM Khapra,BalajiVasan Srinivasan, and Balaraman Ravindran.Towards Transparent and Explainable Attention Models.In Proceedings of ACL, pages 4206–4216, 2020.
  • Montserrat etal. [2020]DanielMas Montserrat, Hanxiang Hao, SriK Yarlagadda, Sriram Baireddy, RuitingShao, János Horváth, Emily Bartusiak, Justin Yang, David Guera,Fengqing Zhu, etal.Deepfakes detection with automatic face weighting.In Proceedings of CVPR, pages 668–669, 2020.
  • Naveed etal. [2023]Humza Naveed, AsadUllah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, MuhammadUsman, Naveed Akhtar, Nick Barnes, and Ajmal Mian.A Comprehensive Overview of Large Language Models.arXiv preprint arXiv:2307.06435, 2023.
  • Ouyang etal. [2022]Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, PamelaMishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.Training language models to follow instructions with human feedback.In Proceedings of NeurIPS, pages 27730–27744, 2022.
  • Papadopoulos etal. [2017]DimP. Papadopoulos, JasperR.R. Uijlings, Frank Keller, and Vittorio Ferrari.Training object class detectors with click supervision.In Proceedings of CVPR, pages 6374–6383, 2017.
  • Papineni etal. [2002]Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.BLEU: A Method for Automatic Evaluation of Machine Translation.In Proceedings of ACL, pages 311–318, 2002.
  • Piergiovanni etal. [2022]AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, andAnelia Angelova.Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering.arXiv preprint arXiv:2205.00949, 2022.
  • Radford etal. [2018]Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.Improving Language Understanding by Generative Pre-Training.OpenAI, 2018.
  • Radford etal. [2021]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,etal.Learning transferable visual models from natural languagesupervision.In Proceedings of ICML, pages 8748–8763, 2021.
  • Reimers and Gurevych [2019]Nils Reimers and Iryna Gurevych.Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.In Proceedings of EMNLP, pages 3982–3992, 2019.
  • Ren etal. [2015]Mengye Ren, Ryan Kiros, and Richard Zemel.Exploring models and data for image question answering.In Proceedings of NeurIPS, pages 2953––2961, 2015.
  • Ribeiro etal. [2016]MarcoTulio Ribeiro, Sameer Singh, and Carlos Guestrin.”Why should I trust you?” Explaining the predictions of anyclassifier.In Proceedings of SIGKDD, pages 1135–1144, 2016.
  • Rossler etal. [2019]Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, JustusThies, and Matthias Nießner.FaceForensics++: Learning to detect manipulated facial images.In Proceedings of ICCV, pages 1–11, 2019.
  • Sabir etal. [2019]Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, andPrem Natarajan.Recurrent convolutional strategies for face manipulation detection invideos.Proceedings of CVPR, pages 80–87, 2019.
  • Selvaraju etal. [2020]RamprasaathR. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam,Devi Parikh, and Dhruv Batra.Grad-CAM: visual explanations from deep networks via gradient-basedlocalization.International Journal of Computer Vision, 128:336–359, 2020.
  • Stypułkowski etal. [2024]Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Ziȩba,Stavros Petridis, and Maja Pantic.Diffused Heads: Diffusion Models Beat GANs on Talking-FaceGeneration.In Proceedings of WACV, pages 5091–5100, 2024.
  • Tian etal. [2024]Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo.EMO: Emote Portrait Alive – Generating Expressive Portrait Videoswith Audio2Video Diffusion Model under Weak Conditions.In Proceedings of ECCV, pages 244–260, 2024.
  • Touvron etal. [2023]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-AnneLachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, EricHambro, Faisal Azhar, etal.LLaMA: Open and Efficient Foundation Language Models.arXiv preprint arXiv:2302.13971, 2023.
  • Vaswani etal. [2017]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,AidanN. Gomez, Lukasz Kaiser, and Illia Polosukhin.Attention is all you need.In Proceedings of NeurIPS, pages 6000–6010, 2017.
  • Wang etal. [2024]Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, ZhiweiJiang, Qing Gu, Xiao Han, and Wei Yang.V-Express: Conditional Dropout for Progressive Training of PortraitVideo Generation.arXiv preprint arXiv:2406.02511, 2024.
  • Wang and Chow [2023]Tianyi Wang and KamPui Chow.Noise Based Deepfake Detection via Multi-Head Relative-Interaction.In Proceedings of AAAI, pages 14548–14556, 2023.
  • Xu etal. [2024]Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, JingdongWang, Yao Yao, and Siyu Zhu.Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait ImageAnimation.arXiv preprint arXiv:2406.08801, 2024.
  • Xu etal. [2025]Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang,Yizhong Zhang, Xin Tong, and Baining Guo.VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time.In Proceedings of NeurIPS, pages 660–684, 2025.
  • Yan etal. [2024]Zhiyuan Yan, Yandan Zhao, Shen Chen, Xinghe Fu, Taiping Yao, Shouhong Ding, andLi Yuan.Generalizing Deepfake Video Detection with Plug-and-Play:Video-Level Blending and Spatiotemporal Adapter Tuning.arXiv preprint arXiv:2408.17065, 2024.
  • Yang etal. [2016]Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola.Stacked attention networks for image question answering.In Proceedings of CVPR, pages 21–29, 2016.
  • Zhang etal. [2020]Tianyi Zhang, Varsha Kishore, Felix Wu, KilianQ Weinberger, and Yoav Artzi.BERTScore: Evaluating Text Generation with BERT.In Proceedings of ICLR, 2020.
  • Zhang etal. [2024]Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, andFangyuan Zou.MimicMotion: High-Quality Human Motion Video Generation withConfidence-aware Pose Guidance.arXiv preprint arXiv:2406.19680, 2024.
  • Zhao etal. [2020]Yiru Zhao, Wanfeng Ge, Wenxin Li, Run Wang, Lei Zhao, and Jiang Ming.Capturing the persistence of facial expression features for deepfakevideo detection.In Proceedings of ICICS, pages 630–645, 2020.
  • Zheng etal. [2021]Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen.Exploring temporal coherence for more general video face forgerydetection.In Proceedings of ICCV, pages 15044–15054, 2021.
  • Zhong etal. [2019]Ruiqi Zhong, Steven Shao, and Kathleen McKeown.Fine-grained sentiment analysis with faithful attention.arXiv preprint arXiv:1908.06870, 2019.
  • Zi etal. [2020]Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang.WildDeepfake: A Challenging Real-World Dataset for DeepfakeDetection.In Proceedings of ACMMM, pages 2382–2390, 2020.

ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Supplementary Material

ExDDV: A New Dataset for Explainable Deepfake Detection in Video (7)

ExDDV: A New Dataset for Explainable Deepfake Detection in Video (8)

7 Additional results

In this section, we present additional data and results. Figure 7 shows more annotated samples from ExDDV, covering different difficulty levels, click locations and explanatory text lengths.
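
To make the structure of such an annotation concrete, the snippet below sketches how a single record could be loaded and its click overlaid on the corresponding frame. The JSON field names (video_path, label, difficulty, click, description) and the use of relative click coordinates are illustrative assumptions, not the exact schema released in our repository.

```python
import json
import cv2  # OpenCV, used to read a frame and draw the click

# Hypothetical annotation record; field names are illustrative only.
record = json.loads("""
{
  "video_path": "videos/example_fake.mp4",
  "label": "fake",
  "difficulty": "medium",
  "click": {"frame": 42, "x": 0.47, "y": 0.31},
  "description": "The lips are blurry and do not match the spoken words."
}
""")

cap = cv2.VideoCapture(record["video_path"])
cap.set(cv2.CAP_PROP_POS_FRAMES, record["click"]["frame"])
ok, frame = cap.read()
cap.release()

if ok:
    h, w = frame.shape[:2]
    # Clicks are stored as relative coordinates in this sketch.
    cx, cy = int(record["click"]["x"] * w), int(record["click"]["y"] * h)
    cv2.circle(frame, (cx, cy), radius=8, color=(0, 0, 255), thickness=2)
    cv2.imwrite("annotated_frame.png", frame)
```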

In Figure 9, we plot a bar chart of the video resolutions found in ExDDV. The chart clearly shows that the three most frequent resolutions occur significantly more often than the rest.
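
The resolution statistics behind Figure 9 can be recomputed with a short script. The sketch below, assuming the videos are stored locally as files readable by OpenCV (the glob pattern is a placeholder), counts the width-height pair of each video and plots the resulting bar chart.

```python
from collections import Counter
from glob import glob

import cv2
import matplotlib.pyplot as plt

counts = Counter()
for path in glob("ExDDV/videos/**/*.mp4", recursive=True):  # placeholder pattern
    cap = cv2.VideoCapture(path)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    cap.release()
    counts[f"{w}x{h}"] += 1

# Bar chart of the resolution distribution, most frequent first.
labels, values = zip(*counts.most_common())
plt.bar(labels, values)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Number of videos")
plt.tight_layout()
plt.savefig("resolution_distribution.png")
```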

Figure 8 contains a comprehensive diagram with qualitative samples for all training scenarios applied to LLaVA [45]. We observe that most explanations are meaningful, although there are cases in which some model versions fail.

In Figure 10, we compare the predictions of the click regressor with the ground-truth click locations, illustrating the precise localization performance of our model.
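
A simple way to quantify the agreement shown in Figure 10 is the mean Euclidean distance between predicted and ground-truth clicks. The sketch below assumes both sets of clicks are expressed as relative (x, y) coordinates in [0, 1]; it is an illustrative metric, not necessarily the exact evaluation script from our repository.

```python
import numpy as np

def mean_click_distance(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth clicks.

    Both arrays have shape (N, 2) and hold relative (x, y) coordinates
    in [0, 1], so the result is expressed in normalized image units.
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Toy example with three clicks.
pred = [[0.45, 0.30], [0.52, 0.61], [0.10, 0.80]]
gt = [[0.47, 0.31], [0.50, 0.58], [0.12, 0.83]]
print(mean_click_distance(pred, gt))  # ~0.03
```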

8 Training environments

We used multiple hardware environments for our experiments. For the in-context learning experiments, we used a Tesla V100-SXM2 GPU with 32GB of VRAM. Phi-3-Vision [1] and LLaVA [45] were fine-tuned on a single H100 GPU with 80GB of VRAM. BLIP-2 [40] was fine-tuned on an RTX 4090 GPU with 24GB of VRAM. The same environment as for BLIP-2 was used for the click predictors.
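
For reference, the snippet below sketches the kind of parameter-efficient fine-tuning setup that fits on a single GPU, using the Hugging Face transformers and peft libraries with the public llava-hf/llava-1.5-7b-hf checkpoint. The checkpoint name, target modules and LoRA hyperparameters are illustrative assumptions and do not necessarily match the configuration used in our experiments.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Public LLaVA-1.5 checkpoint; illustrative only, not necessarily the exact
# model variant or precision used in our experiments.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LoRA adapters on the attention projections, keeping the trainable
# parameter count small enough to fit on a single GPU.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```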

ExDDV: A New Dataset for Explainable Deepfake Detection in Video (9)
ExDDV: A New Dataset for Explainable Deepfake Detection in Video (10)
