
Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

1 π³ AI Lab
2 Wuhan University
3 Nanjing University
4 Stanford University
*Equal Contribution
Code | Dataset | 🏆 Leaderboard | arXiv | Huggingface Paper

Overview of the IVY-FAKE framework: by conducting an in-depth analysis of temporal and spatial artifacts, the framework enables explainable detection of AI-generated content.

IVY-FAKE Overview

IVY-FAKE is the first unified benchmark and explainable framework for detecting AI-generated images and videos. It provides over 150,000 annotated training samples and 18,700 evaluation examples, each paired with human-readable explanations. We introduce IVY-XDETECTOR, a vision-language model capable of multimodal AIGC detection with high accuracy and transparency. Our approach offers detailed spatial and temporal reasoning, enabling robust identification of synthetic content in real-world scenarios.

The rapid advancement of Artificial Intelligence Generated Content (AIGC) has produced hyper-realistic synthetic media, raising concerns about authenticity and integrity. Current detection methods are typically black-box classifiers that lack interpretability and do not handle images and videos in a unified way, which hinders transparency and real-world deployment. To address this, we introduce IVY-FAKE, a large-scale, unified dataset for explainable multimodal AIGC detection with over 150,000 annotated training samples and 18,700 evaluation examples, each accompanied by natural-language reasoning. We also propose IVY-XDETECTOR, a unified vision-language model that achieves state-of-the-art explainable detection of both image and video content.

Introduction

Advancements in AIGC, driven by models like DALL-E, Imagen, Stable Diffusion, and SORA, have led to highly realistic synthetic images and videos, posing challenges to content authenticity and public trust. Most current AIGC detection methods are binary classifiers with limited interpretability and often lack support for diverse generators or modalities. While Multimodal Large Language Models (MLLMs) show promise for explainable detection, existing benchmarks are inadequate, often lacking video data or sufficient annotation depth.

To overcome these limitations, we introduce IVY-FAKE, a comprehensive benchmark with diverse multimodal data (94,781 images, 54,967 videos for training) and rich, explainable annotations. Building on this, our IVY-XDETECTOR model excels at identifying and explaining spatial and temporal generative artifacts in both images and videos. Our key contributions are a unified vision-language detector and the first large-scale benchmark for explainable multimodal AIGC detection.

IVY-FAKE Dataset

The IVY-FAKE dataset is designed for explainable multimodal AIGC detection and contains 94,781 training images, 54,967 training videos, and around 18,700 test samples in total. It covers diverse content (animals, objects, DeepFakes, etc.) and generator families (GAN-, diffusion-, and transformer-based models), and is kept current through ongoing collection of newly released AIGC content.

Video data (approximately 110,000 clips) was sourced from public benchmarks such as GenVideo and LOKI and from web-crawled platforms, including outputs from models such as SORA and Stable Video Diffusion. Image data (approximately 110,000 images) was similarly collected from public datasets such as FakeClue and WildFake and from web sources, covering GAN- and diffusion-based generators. A stratified sampling strategy ensures balanced representation across sources and generators.
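A minimal sketch of such stratified sampling over a metadata table; the column names `source` and `generator` and the per-stratum quota are illustrative assumptions, not the released pipeline:

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, per_stratum: int, seed: int = 0) -> pd.DataFrame:
    """Draw up to `per_stratum` items from every (source, generator) stratum,
    keeping GAN-, diffusion-, and transformer-based generators balanced."""
    return (
        df.groupby(["source", "generator"], group_keys=False)
          .apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=seed))
          .reset_index(drop=True)
    )

# Hypothetical usage on a table with one row per image or video clip:
# balanced = stratified_sample(metadata, per_stratum=2000)
```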

Explainable annotations were generated with Gemini 2.5 Pro via a knowledge-distillation process, using a structured template that requires reasoning before the conclusion (<think>...</think><conclusion>...</conclusion>). Gemini was given the ground-truth label and a prompt of the form "This {file_type} is {label}. Explain the reason." Explanations are categorized into spatial features (8 sub-dimensions) and temporal features (4 sub-dimensions, video only). IVY-FAKE uniquely integrates explainable annotations across both image and video modalities, unlike prior datasets, which are often limited in scope or modality.
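As a concrete illustration, the labeling prompt and the structured response could be assembled and parsed roughly as follows; the helper names and the regular expression are assumptions for illustration, while the prompt wording and the <think>/<conclusion> template come from the description above:

```python
import re

def build_annotation_prompt(file_type: str, label: str) -> str:
    """Compose the ground-truth-conditioned prompt sent to the annotator model."""
    return (
        f"This {file_type} is {label}. Explain the reason. "
        "Answer as <think>step-by-step reasoning over spatial (and, for videos, "
        "temporal) artifacts</think><conclusion>real or fake</conclusion>."
    )

def parse_annotation(response: str) -> dict:
    """Split a structured response into its reasoning and conclusion parts."""
    match = re.search(r"<think>(.*?)</think>\s*<conclusion>(.*?)</conclusion>", response, re.S)
    if match is None:
        raise ValueError("response does not follow the <think>/<conclusion> template")
    return {"reasoning": match.group(1).strip(), "conclusion": match.group(2).strip()}

# Example:
# prompt = build_annotation_prompt("video", "AI-generated")
# record = parse_annotation(model_response)
```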


Figure 2: Overview of the IVY-FAKE dataset creation process showing data domains, labeling prompts, and annotation structure.

Detector: IVY-XDETECTOR

This section details IVY-XDETECTOR, a multimodal large language model designed for robust and explainable AIGC detection. The architecture, based on the LLaVA paradigm and initialized from Ivy-VL-LLaVA weights, consists of a visual encoder (SigLIP), a visual projector, and a large language model. It supports dynamic high-resolution image inputs and processes video frames while preserving temporal information. The model is trained with a progressive three-stage framework: initial video understanding, fine-tuning for AIGC detection, and joint optimization for both detection accuracy and high-quality, human-understandable explanations.
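A minimal PyTorch-style sketch of this encoder-projector-LLM composition; the class names, hidden sizes, and the two-layer MLP projector are assumptions for illustration, not the released IVY-XDETECTOR implementation:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps visual-encoder features into the LLM embedding space (two-layer MLP, assumed)."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.mlp(visual_tokens)

class AIGCDetectorVLM(nn.Module):
    """Skeleton of the visual encoder -> projector -> language model pipeline."""
    def __init__(self, vision_encoder: nn.Module, projector: VisualProjector, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP image encoder
        self.projector = projector
        self.llm = llm

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, C, H, W); a single image is treated as one frame.
        b, t = frames.shape[:2]
        visual_tokens = self.vision_encoder(frames.flatten(0, 1))   # (b*t, tokens, vision_dim)
        visual_embeds = self.projector(visual_tokens)                # (b*t, tokens, llm_dim)
        visual_embeds = visual_embeds.view(b, -1, visual_embeds.size(-1))  # keep frame order
        # Prepend visual embeddings to the text embeddings and decode with the language model.
        return self.llm(torch.cat([visual_embeds, text_embeds], dim=1))
```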


The three-stage training pipeline of IVY-XDETECTOR: general video understanding, AIGC detection instruction tuning, and interpretability instruction tuning.

Experiments

This section describes the extensive experiments conducted to evaluate the detection and explanation capabilities of the IVY-XDETECTOR. The model was tested on classification tasks (distinguishing real from fake content) for both images and videos, using metrics like accuracy and F1-score. Reasoning abilities were assessed by comparing model-generated explanations with reference annotations using ROUGE-L scores and an LLM-as-a-judge approach (evaluating completeness, relevance, detail, and explanation quality). The proposed method demonstrated superior performance on various benchmarks for both image and video content classification and produced more transparent reasoning compared to several leading multimodal models.
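A small sketch of how the automatic metrics could be computed with standard packages (`scikit-learn` for accuracy/F1 and `rouge-score` for ROUGE-L); the exact evaluation code and the GPT-assisted judging prompts are not reproduced here, so treat this as an approximation under those assumptions:

```python
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score, f1_score

def classification_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy and F1 for binary real-vs-fake predictions (labels: 'real' / 'fake')."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, pos_label="fake"),
    }

def explanation_rouge_l(references: list, explanations: list) -> float:
    """Mean ROUGE-L F-measure between model explanations and reference annotations."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, hyp)["rougeL"].fmeasure
        for ref, hyp in zip(references, explanations)
    ]
    return sum(scores) / len(scores)
```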

Conclusion

The paper introduces IVY-FAKE, the first unified, large-scale dataset for explainable AI-generated content (AIGC) detection across both images and videos, featuring extensively annotated samples with natural-language reasoning. Alongside it, the Ivy Explainable Detector (IVY-XDETECTOR) is proposed, a vision-language architecture that jointly detects and explains synthetic content. The model sets a new state of the art on AIGC detection and explainability benchmarks, and the publicly released resources aim to provide a robust foundation for transparent and trustworthy multimodal analysis.


A showcase of video artifact detection.

MLLM Leaderboard

Performance metrics for various Multi-Modal Large Language Models (MLLMs) on Image and Video tasks.
Models are ranked by the 'AVG' score.

Auto metrics: Acc, F1, ROUGE-L, and SIM. GPT-assisted metrics: Completeness (Com.), Relevance (Rel.), Detail (Det.), Explanation quality (Exp.), and their average (AVG).

| # | Model | Organization | Params | Date | Overall Auto AVG | Overall GPT AVG | Img Acc | Img F1 | Img ROUGE-L | Img SIM | Img Com. | Img Rel. | Img Det. | Img Exp. | Img GPT AVG | Vid Acc | Vid F1 | Vid ROUGE-L | Vid SIM | Vid Com. | Vid Rel. | Vid Det. | Vid Exp. | Vid GPT AVG |
|---|-------|--------------|--------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 1 | Ivy-xDet | Pi3AI | 3B | 2025-06-02 | 0.702 | 4.130 | 0.805 | 0.802 | 0.271 | 0.767 | 4.39 | 4.21 | 4.33 | 4.54 | 4.40 | 0.945 | 0.945 | 0.303 | 0.776 | 3.76 | 4.00 | 3.71 | 3.97 | 3.86 |
| 2 | GPT-4o | OpenAI | - | 2025-06-02 | 0.605 | 3.905 | 0.766 | 0.759 | 0.155 | 0.683 | 3.69 | 3.79 | 3.67 | 3.80 | 3.74 | 0.828 | 0.813 | 0.150 | 0.682 | 4.02 | 4.13 | 3.99 | 4.12 | 4.07 |
| 3 | Gemini 2.5 Flash | Google | - | 2025-06-02 | 0.587 | 3.590 | 0.668 | 0.656 | 0.223 | 0.709 | 3.29 | 3.33 | 3.27 | 3.33 | 3.30 | 0.787 | 0.784 | 0.188 | 0.678 | 3.85 | 3.93 | 3.82 | 3.92 | 3.88 |
| 4 | GPT-4o-mini | OpenAI | - | 2025-06-02 | 0.529 | 3.315 | 0.653 | 0.650 | 0.149 | 0.645 | 3.21 | 3.26 | 3.21 | 3.26 | 3.23 | 0.685 | 0.672 | 0.134 | 0.645 | 3.38 | 3.42 | 3.37 | 3.42 | 3.40 |
| 5 | InternVL3 | Shanghai AI Lab | 8B | 2025-06-02 | 0.513 | 3.095 | 0.614 | 0.605 | 0.159 | 0.680 | 3.04 | 3.07 | 3.04 | 3.07 | 3.05 | 0.632 | 0.616 | 0.165 | 0.629 | 3.13 | 3.16 | 3.12 | 3.16 | 3.14 |
| 6 | Qwen2.5-VL | Alibaba | 7B | 2025-06-02 | 0.484 | 2.830 | 0.556 | 0.516 | 0.199 | 0.660 | 2.73 | 2.77 | 2.73 | 2.77 | 2.75 | 0.589 | 0.527 | 0.207 | 0.621 | 2.89 | 2.94 | 2.89 | 2.89 | 2.91 |
| 7 | Phi-3.5-Vision | Microsoft | 4B | 2025-06-02 | 0.332 | 2.740 | 0.560 | 0.555 | 0.092 | 0.366 | 2.66 | 2.72 | 2.66 | 2.72 | 2.69 | 0.559 | 0.479 | 0.001 | 0.046 | 2.78 | 2.79 | 2.78 | 2.79 | 2.79 |

BibTeX

@article{jiang2025ivyfake,
  title     = {Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection},
  author    = {Dong, Wenhui and Jiang, Changjiang and Zhang, Zhonghao and Yu, Fengchang and Peng, Wei},
  year      = {2025},
}