IVY-FAKE is the first unified benchmark and explainable framework for detecting AI-generated images and videos. It provides over 150,000 annotated training samples and 18,700 evaluation examples, each paired with human-readable explanations. We introduce IVY-XDETECTOR, a vision-language model capable of multimodal AIGC detection with high accuracy and transparency. Our approach offers detailed spatial and temporal reasoning, enabling robust identification of synthetic content in real-world scenarios.
The rapid advancement of Artificial Intelligence Generated Content (AIGC) has produced hyper-realistic synthetic media, raising concerns about authenticity and integrity. Current detection methods are often black boxes that lack interpretability and do not support unified image and video detection, hindering transparency and real-world deployment. To address this, we introduce IVY-FAKE, a large-scale, unified dataset for explainable multimodal AIGC detection with over 150,000 annotated training samples and 18,700 evaluation examples, each with natural-language reasoning. We also propose IVY-XDETECTOR, a unified vision-language model for state-of-the-art explainable detection of both image and video content.
Advancements in AIGC, driven by models like DALL-E, Imagen, Stable Diffusion, and SORA, have led to highly realistic synthetic images and videos, posing challenges to content authenticity and public trust. Most current AIGC detection methods are binary classifiers with limited interpretability and often lack support for diverse generators or modalities. While Multimodal Large Language Models (MLLMs) show promise for explainable detection, existing benchmarks are inadequate, often lacking video data or sufficient annotation depth.
To overcome these limitations, we introduce IVY-FAKE, a comprehensive benchmark with diverse multimodal data (94,781 images, 54,967 videos for training) and rich, explainable annotations. Building on this, our IVY-XDETECTOR model excels at identifying and explaining spatial and temporal generative artifacts in both images and videos. Our key contributions are a unified vision-language detector and the first large-scale benchmark for explainable multimodal AIGC detection.
The IVY-FAKE dataset is designed for explainable multimodal AIGC detection, containing 94,781 training images, 54,967 training videos, and around 18,700 test samples in total. It covers diverse content categories (animals, objects, DeepFakes, etc.) and generator families (GANs, diffusion models, transformer-based models), and is kept current through ongoing collection of newly released AIGC.
Video data (approx. 110,000 clips) was sourced from public benchmarks such as GenVideo and LOKI and from web-crawled platforms, including outputs of models like SORA and Stable Video Diffusion. Image data (approx. 110,000 images) was collected similarly from public datasets such as FakeClue and WildFake and from web sources, covering both GAN- and diffusion-based generators. A stratified sampling strategy ensures balanced representation.
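A minimal sketch of how such stratified sampling could be implemented is shown below; the metadata fields (`generator`, `content_type`) and the per-stratum quota are illustrative assumptions, not details from the IVY-FAKE pipeline.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_keys=("generator", "content_type"),
                      per_stratum=500, seed=42):
    """Draw a balanced subset by taking up to `per_stratum` items from every
    (generator, content_type) combination found in the metadata."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        buckets[tuple(rec[k] for k in strata_keys)].append(rec)

    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:per_stratum])
    return balanced

# Example with toy metadata (field names are hypothetical):
records = [
    {"path": "vid_001.mp4", "generator": "SORA", "content_type": "animals"},
    {"path": "vid_002.mp4", "generator": "SVD", "content_type": "objects"},
    {"path": "vid_003.mp4", "generator": "SORA", "content_type": "animals"},
]
print(len(stratified_sample(records, per_stratum=1)))  # -> 2 (one item per stratum)
```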
Explainable annotations were generated with Gemini 2.5 Pro via a knowledge-distillation process, using a structured template that requires reasoning before the conclusion (`<think>...</think><conclusion>...</conclusion>`). Gemini was given the ground-truth label and a prompt of the form "This {file_type} is {label}. Explain the reason." Explanations are categorized into Spatial Features (8 sub-dimensions) and Temporal Features (4 sub-dimensions, video only). Unlike prior datasets, which are often limited in scope or modality, IVY-FAKE integrates explainable annotations across both image and video modalities.
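The following sketch shows how the distillation prompt and the structured `<think>`/`<conclusion>` output could be handled in code. Only the quoted prompt sentence comes from the description above; the appended response-format instruction, the function names, and the regex-based parsing are assumptions.

```python
import re

# The first sentence is the prompt quoted above; the response-format
# instruction appended here is an assumption about how the template is enforced.
PROMPT_TEMPLATE = (
    "This {file_type} is {label}. Explain the reason. "
    "Answer as <think>your step-by-step reasoning</think>"
    "<conclusion>real or AI-generated</conclusion>."
)

def build_annotation_prompt(file_type: str, label: str) -> str:
    """Compose the ground-truth-conditioned prompt sent to the teacher model."""
    return PROMPT_TEMPLATE.format(file_type=file_type, label=label)

def parse_annotation(response: str):
    """Split a teacher response into its reasoning and conclusion parts."""
    think = re.search(r"<think>(.*?)</think>", response, re.S)
    conclusion = re.search(r"<conclusion>(.*?)</conclusion>", response, re.S)
    if not (think and conclusion):
        return None  # malformed generations can be discarded or regenerated
    return {"reasoning": think.group(1).strip(),
            "conclusion": conclusion.group(1).strip()}

# Usage (teacher-model call omitted):
# prompt = build_annotation_prompt("video", "AI-generated")
# annotation = parse_annotation(teacher_response_text)
```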
Figure 2: Overview of the IVY-FAKE dataset creation process showing data domains, labeling prompts, and annotation structure.
This chapter details the IVY-XDETECTOR, a multimodal large language model designed for robust and explainable AIGC detection. The architecture, based on the LLaVA paradigm and initialized with Ivy-VL-LLaVA weights, consists of a Visual Encoder (SigLIP), a Visual Projector, and a Large Language Model. It supports dynamic high-resolution image inputs and processes video frames while preserving temporal information. The model is trained using a progressive three-stage framework: initial video understanding, fine-tuning for AIGC detection, and joint optimization for both detection accuracy and high-quality, human-understandable explanations.
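As a rough illustration of the LLaVA-style wiring described above (SigLIP visual encoder, visual projector, LLM), here is a hedged PyTorch sketch; the dimensions, the two-layer MLP projector, and the frame-token concatenation are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class IvyXDetectorSketch(nn.Module):
    """LLaVA-style wiring: a vision tower, a projector that maps visual tokens
    into the LLM embedding space, and a causal language model. Dimensions and
    the two-layer MLP projector are illustrative, not the released design."""

    def __init__(self, vision_tower: nn.Module, language_model: nn.Module,
                 vis_dim: int = 1152, llm_dim: int = 3072):
        super().__init__()
        self.vision_tower = vision_tower        # e.g. a SigLIP image encoder
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.language_model = language_model    # causal LLM backbone

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, 3, H, W); a single image has num_frames == 1.
        b, t = frames.shape[:2]
        vis_feats = self.vision_tower(frames.flatten(0, 1))   # (b*t, n_tok, vis_dim)
        vis_tokens = self.projector(vis_feats)                 # map into LLM space
        # Re-group tokens frame by frame so temporal order is preserved.
        vis_tokens = vis_tokens.view(b, t * vis_tokens.shape[1], -1)
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)   # visual context first
        return self.language_model(inputs_embeds=inputs)       # HF-style LM call
```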
The three-stage training pipeline for IVY-XDETECTOR, comprising general video understanding, detection instruction tuning, and interpretability instruction tuning.
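A compact way to express this progressive schedule is sketched below; the stage names follow the pipeline above, while the dataset identifiers and the choice of trainable modules per stage are placeholders, not the published recipe.

```python
import torch.nn as nn

# Progressive three-stage schedule mirroring the pipeline above. Dataset
# identifiers and trainable-module choices per stage are placeholders.
TRAINING_STAGES = [
    {"name": "stage1_video_understanding",      # general video-text alignment
     "datasets": ["general_video_instruction_data"],
     "trainable": ("projector", "language_model")},
    {"name": "stage2_detection_tuning",         # real-vs-fake instruction tuning
     "datasets": ["ivy_fake_detection_labels"],
     "trainable": ("vision_tower", "projector", "language_model")},
    {"name": "stage3_interpretability_tuning",  # joint detection + explanation
     "datasets": ["ivy_fake_detection_labels", "ivy_fake_explanations"],
     "trainable": ("vision_tower", "projector", "language_model")},
]

def set_trainable(model: nn.Module, prefixes) -> None:
    """Freeze all parameters, then unfreeze those whose names match a prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)

# Per stage: call set_trainable(model, stage["trainable"]) and fine-tune
# on stage["datasets"] before moving to the next stage.
```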
This section describes the extensive experiments conducted to evaluate the detection and explanation capabilities of the IVY-XDETECTOR. The model was tested on classification tasks (distinguishing real from fake content) for both images and videos, using metrics like accuracy and F1-score. Reasoning abilities were assessed by comparing model-generated explanations with reference annotations using ROUGE-L scores and an LLM-as-a-judge approach (evaluating completeness, relevance, detail, and explanation quality). The proposed method demonstrated superior performance on various benchmarks for both image and video content classification and produced more transparent reasoning compared to several leading multimodal models.
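For the automatic metrics, a minimal evaluation sketch is given below using scikit-learn and the `rouge-score` package; the record field names are assumptions, and the GPT-assisted judging of completeness, relevance, detail, and explanation quality is not reproduced here.

```python
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer  # pip install rouge-score

def evaluate_detection(examples):
    """examples: dicts with 'label'/'pred' in {'real', 'fake'} and
    'reference'/'explanation' free-text fields (field names are assumed)."""
    y_true = [e["label"] for e in examples]
    y_pred = [e["pred"] for e in examples]
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, pos_label="fake")

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(e["reference"], e["explanation"])["rougeL"].fmeasure
        for e in examples
    ) / len(examples)
    return {"Acc": acc, "F1": f1, "ROUGE-L": rouge_l}

# The GPT-assisted scores (completeness, relevance, detail, explanation quality)
# come from prompting a judge LLM with reference and candidate explanations.
```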
The paper introduces IVY-FAKE, the first unified, large-scale dataset for explainable AI-Generated Content (AIGC) detection across both images and videos, featuring an extensive set of annotated samples with natural-language reasoning. Accompanying it, the Ivy Explainable Detector (IVY-XDETECTOR) is proposed: a vision-language architecture that jointly detects and explains synthetic content. The model sets a new state of the art in AIGC detection and explainability, and the publicly released resources aim to provide a robust foundation for transparent and trustworthy multimodal analysis.
Showcase of video artifact detection.
Performance of multimodal large language models (MLLMs) on image and video AIGC detection and explanation. Automatic metrics are Acc, F1, ROUGE-L, and SIM; GPT-assisted scores are Com. (completeness), Rel. (relevance), Det. (detail), and Exp. (explanation quality). Models are ranked by their overall average (AVG) scores.
| # | Model | Org | Params | Date | Overall Auto (AVG) | Overall GPT (AVG) | Image Acc | Image F1 | Image ROUGE-L | Image SIM | Image Com. | Image Rel. | Image Det. | Image Exp. | Image GPT AVG | Video Acc | Video F1 | Video ROUGE-L | Video SIM | Video Com. | Video Rel. | Video Det. | Video Exp. | Video GPT AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ivy-xDet | Pi3AI | 3B | 2025-06-02 | 0.702 | 4.130 | 0.805 | 0.802 | 0.271 | 0.767 | 4.39 | 4.21 | 4.33 | 4.54 | 4.40 | 0.945 | 0.945 | 0.303 | 0.776 | 3.76 | 4.00 | 3.71 | 3.97 | 3.86 |
| 2 | GPT-4o | OpenAI | - | 2025-06-02 | 0.605 | 3.905 | 0.766 | 0.759 | 0.155 | 0.683 | 3.69 | 3.79 | 3.67 | 3.80 | 3.74 | 0.828 | 0.813 | 0.150 | 0.682 | 4.02 | 4.13 | 3.99 | 4.12 | 4.07 |
| 3 | Gemini 2.5 Flash | Google | - | 2025-06-02 | 0.587 | 3.590 | 0.668 | 0.656 | 0.223 | 0.709 | 3.29 | 3.33 | 3.27 | 3.33 | 3.30 | 0.787 | 0.784 | 0.188 | 0.678 | 3.85 | 3.93 | 3.82 | 3.92 | 3.88 |
| 4 | GPT-4o-mini | OpenAI | - | 2025-06-02 | 0.529 | 3.315 | 0.653 | 0.650 | 0.149 | 0.645 | 3.21 | 3.26 | 3.21 | 3.26 | 3.23 | 0.685 | 0.672 | 0.134 | 0.645 | 3.38 | 3.42 | 3.37 | 3.42 | 3.40 |
| 5 | InternVL3 | Shanghai AI Lab | 8B | 2025-06-02 | 0.513 | 3.095 | 0.614 | 0.605 | 0.159 | 0.680 | 3.04 | 3.07 | 3.04 | 3.07 | 3.05 | 0.632 | 0.616 | 0.165 | 0.629 | 3.13 | 3.16 | 3.12 | 3.16 | 3.14 |
| 6 | Qwen2.5-VL | Alibaba | 7B | 2025-06-02 | 0.484 | 2.830 | 0.556 | 0.516 | 0.199 | 0.660 | 2.73 | 2.77 | 2.73 | 2.77 | 2.75 | 0.589 | 0.527 | 0.207 | 0.621 | 2.89 | 2.94 | 2.89 | 2.89 | 2.91 |
| 7 | Phi-3.5-Vision | Microsoft | 4B | 2025-06-02 | 0.332 | 2.740 | 0.560 | 0.555 | 0.092 | 0.366 | 2.66 | 2.72 | 2.66 | 2.72 | 2.69 | 0.559 | 0.479 | 0.001 | 0.046 | 2.78 | 2.79 | 2.78 | 2.79 | 2.79 |
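The Overall columns are consistent with taking the mean of the image and video sub-scores: for Ivy-xDet, the image auto metrics average to about 0.661 and the video ones to about 0.742, whose mean is the reported 0.702, and (4.40 + 3.86) / 2 gives the 4.130 GPT-assisted average. The sketch below reproduces that aggregation; it is an inference from the table values, not an official scoring script.

```python
def overall_scores(image: dict, video: dict) -> dict:
    """Aggregate per-modality leaderboard metrics into the Overall columns."""
    auto_keys = ("Acc", "F1", "ROUGE-L", "SIM")
    img_auto = sum(image[k] for k in auto_keys) / len(auto_keys)
    vid_auto = sum(video[k] for k in auto_keys) / len(auto_keys)
    return {"Auto Metrics (AVG)": (img_auto + vid_auto) / 2,
            "GPT Assisted (AVG)": (image["GPT AVG"] + video["GPT AVG"]) / 2}

# Ivy-xDet row from the table above:
ivy_image = {"Acc": 0.805, "F1": 0.802, "ROUGE-L": 0.271, "SIM": 0.767, "GPT AVG": 4.40}
ivy_video = {"Acc": 0.945, "F1": 0.945, "ROUGE-L": 0.303, "SIM": 0.776, "GPT AVG": 3.86}
print(overall_scores(ivy_image, ivy_video))
# ≈ {'Auto Metrics (AVG)': 0.702, 'GPT Assisted (AVG)': 4.13}, matching the table up to rounding
```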
@article{jiang2025ivyfake,
title = {Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection},
author = {Dong, Wenhui and Jiang, Changjiang and Zhang, Zhonghao and Yu, Fengchang and Peng, Wei},
year = {2025},
}