IVY-FAKE is the first unified benchmark and explainable framework for detecting AI-generated images and videos. It provides over 100,000 annotated training samples and 5,000 evaluation examples, each paired with human-readable explanations. We introduce IVY-XDETECTOR, a vision-language model capable of multimodal AIGC detection with high accuracy and transparency. Our approach offers detailed spatial and temporal reasoning, enabling robust identification of synthetic content in real-world scenarios.
The rapid advancement of Artificial Intelligence Generated Content (AIGC) has produced hyper-realistic synthetic media, raising concerns about authenticity and integrity. Current detection methods are often black-box, lack interpretability, and don't support unified image and video detection, hindering transparency and deployment. To address this, we introduce IVY-FAKE, a large-scale, unified dataset for explainable multimodal AIGC detection with over 100,000 annotated training samples and 5,000 evaluation examples, each with natural-language reasoning. We also propose IVY-XDETECTOR, a unified vision-language model for state-of-the-art explainable detection of both image and video content.
Advancements in AIGC, driven by models like DALL-E, Imagen, Stable Diffusion, and SORA, have led to highly realistic synthetic images and videos, posing challenges to content authenticity and public trust. Most current AIGC detection methods are binary classifiers with limited interpretability and often lack support for diverse generators or modalities. While Multimodal Large Language Models (MLLMs) show promise for explainable detection, existing benchmarks are inadequate, often lacking video data or sufficient annotation depth.
To overcome these limitations, we introduce IVY-FAKE, a comprehensive benchmark with diverse multimodal data and rich, explainable annotations. Building on this, our IVY-XDETECTOR model excels at identifying and explaining spatial and temporal generative artifacts in both images and videos. Our key contributions are a unified vision-language detector and the first large-scale benchmark for explainable multimodal AIGC detection.
The IVY-FAKE dataset is designed for explainable multimodal AIGC detection and contains over 100,000 training samples and around 5,000 test samples. It features diverse content (animals, objects, DeepFakes, etc.) and sources (GANs, Diffusion models, Transformers), and it is kept current by collecting new AIGC content.
Video data was sourced from public benchmarks like GenVideo and LOKI, and web-crawled platforms, including outputs from models like SORA and Stable Video Diffusion. Image data was similarly collected from public datasets like FakeClue and WildFake, and web sources, covering GANs and Diffusion models. A stratified sampling strategy ensures balanced representation.
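As a rough illustration of stratified sampling, the sketch below groups samples by a stratum key and draws up to a fixed quota from each group. The choice of strata (here, modality and generator family), the quota, and the function name `stratified_sample` are assumptions made for the example; the source does not specify how the strata or quotas were defined.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum defined by `key`.

    Hypothetical helper: the strata and quota used for IVY-FAKE are not
    stated in the source, so this only shows the general technique.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = []
    for stratum in sorted(groups):  # sorted for a deterministic order
        members = groups[stratum]
        k = min(per_stratum, len(members))  # small strata are kept whole
        sample.extend(rng.sample(members, k))
    return sample


# Toy pool: one large stratum and one small stratum.
pool = [{"modality": "image", "source": "GAN", "id": i} for i in range(10)]
pool += [{"modality": "video", "source": "Diffusion", "id": i} for i in range(3)]

picked = stratified_sample(
    pool, key=lambda s: (s["modality"], s["source"]), per_stratum=5
)
```

Capping each stratum at the same quota keeps a large source from dominating the sample, while strata smaller than the quota are included in full.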
Explainable annotations were generated using Gemini 2.5 Pro with a knowledge distillation process and a structured template requiring reasoning before conclusion (<think>...</think><conclusion>...</conclusion>). Gemini was provided with ground-truth labels and a prompt like "This {file_type} is {label}. Explain the reason." Explanations are categorized into Spatial Features (8 sub-dimensions) and Temporal Features (4 sub-dimensions, video-only). IVY-FAKE uniquely integrates explainable annotations across both image and video modalities, unlike prior datasets, which are often limited in scope or modality.
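The prompt and the reasoning-before-conclusion template can be sketched in a few lines. The prompt string and the tag names come from the description above; the helper names `build_prompt` and `parse_annotation` are hypothetical, and the exact whitespace handling is an assumption, not the authors' pipeline.

```python
import re

# Prompt wording as quoted in the text; the placeholders are filled per sample.
PROMPT_TEMPLATE = "This {file_type} is {label}. Explain the reason."

# Reasoning must appear before the conclusion, matching the structured template.
_ANNOTATION_RE = re.compile(
    r"<think>(.*?)</think>\s*<conclusion>(.*?)</conclusion>", re.DOTALL
)


def build_prompt(file_type: str, label: str) -> str:
    """Fill the labeling prompt for one sample (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(file_type=file_type, label=label)


def parse_annotation(text: str):
    """Return (reasoning, conclusion), or None if the template is not followed."""
    match = _ANNOTATION_RE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


prompt = build_prompt("image", "fake")
parsed = parse_annotation(
    "<think>The hands show fused fingers.</think><conclusion>fake</conclusion>"
)
```

Returning `None` for malformed output makes it easy to filter out responses that skip the reasoning step or omit the conclusion.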
Table 1: Example of IVY-FAKE annotations.
This section details IVY-XDETECTOR, a multimodal large language model designed for robust and explainable AIGC detection. The architecture follows the Qwen2.5-VL paradigm and is initialized with Qwen2.5-VL-7B weights. It supports dynamic high-resolution image inputs and processes video frames while preserving temporal information. The model is trained using a progressive two-stage framework: initial video understanding, fine-tuning for AIGC detection, and joint optimization for both detection accuracy and high-quality, human-understandable explanations.
The two-stage training pipeline for IVY-XDETECTOR, including general video understanding, detection instruction tuning, and interpretability instruction tuning.
This section describes the extensive experiments conducted to evaluate the detection and explanation capabilities of the IVY-XDETECTOR. The model was tested on classification tasks (distinguishing real from fake content) for both images and videos, using metrics like accuracy and F1-score. Reasoning abilities were assessed by comparing model-generated explanations with reference annotations using ROUGE-L scores and an LLM-as-a-judge approach (evaluating completeness, relevance, detail, and explanation quality). The proposed method demonstrated superior performance on various benchmarks for both image and video content classification and produced more transparent reasoning compared to several leading multimodal models.
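To make the ROUGE-L comparison concrete, the sketch below scores a generated explanation against a reference using the longest common subsequence (LCS) of their tokens. It assumes lowercase whitespace tokenization and the balanced F-measure; the official evaluation scripts may use a different tokenizer, stemming, or weighting, so the numbers here are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 with whitespace tokenization (an assumption of this sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "the shadows point in inconsistent directions",
    "shadows fall in inconsistent directions across the scene",
)
```

Because the LCS preserves word order but allows gaps, ROUGE-L rewards explanations that mention the same artifacts in the same sequence as the reference, even when the wording in between differs.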
The paper introduces IVY-FAKE, the first unified, large-scale dataset for explainable AI-Generated Content (AIGC) detection across both images and videos, featuring extensive annotated samples with natural-language reasoning. Accompanying this, the Ivy Explainable Detector (IVY-XDETECTOR) is proposed, a vision-language architecture that jointly detects and explains synthetic content. The model achieves state-of-the-art results in AIGC detection and explainability, and the publicly released resources aim to provide a robust foundation for transparent and trustworthy multimodal analysis.
Showcase of video artifact detection
The official evaluation scripts for the IVY-FAKE benchmark can be found at:
https://github.com/Pi3AI/Ivy-Fake
These scripts support performance evaluation for both image and video modalities, including:
- Automatic metric computation (Acc, F1, ROUGE-L, BERTScore)
- GPT-assisted explanation grading
- Final ranking and reporting for MLLM benchmarks
Performance comparison of models on Image and Video tasks. "Auto Metrics" include Acc, F1, ROUGE-L, SIM. "GPT Assisted" includes Comprehensiveness, Relevance, Detail, Explanation.
| Model | Image: Auto Metrics (Acc/F1/ROUGE-L/SIM) | Image: GPT Assisted (Com./Rel./Det./Exp.) | Video: Auto Metrics (Acc/F1/ROUGE-L/SIM) | Video: GPT Assisted (Com./Rel./Det./Exp.) | Overall: Auto Metrics (Acc/F1/ROUGE-L/SIM) | Overall: GPT Assisted (Com./Rel./Det./Exp.) |
|---|---|---|---|---|---|---|
| Closed-source MLLMs | ||||||
| GPT-4o | 0.725/0.723/0.108/0.525 | 2.34/3.20/2.04/3.26 | 0.448/0.579/0.072/0.451 | 1.79/2.35/1.67/2.40 | 0.587/0.663/0.090/0.488 | 2.07/2.78/1.85/2.83 |
| Gemini-2.5-Flash | 0.747/0.737/0.263/0.733 | 3.94/4.11/4.04/4.09 | 0.810/0.811/0.246/0.723 | 4.00/4.37/4.03/4.36 | 0.779/0.776/0.254/0.728 | 3.97/4.24/4.04/4.22 |
| Open-source MLLMs | ||||||
| 7B-Parameters Models | ||||||
| InternVL3.5-8B | 0.605/0.602/0.194/0.680 | 2.83/3.49/2.69/3.32 | 0.574/0.588/0.188/0.664 | 2.75/3.35/2.68/3.28 | 0.589/0.596/0.191/0.672 | 2.79/3.42/2.69/3.30 |
| MiMo-VL-7B | 0.662/0.637/0.121/0.593 | 1.99/2.80/1.85/2.89 | 0.778/0.783/0.112/0.580 | 2.04/2.91/1.90/3.19 | 0.720/0.715/0.116/0.586 | 2.01/2.86/1.87/3.04 |
| Qwen2.5-VL-7B | 0.013/0.026/0.006/0.264 | 1.02/1.03/1.01/1.52 | 0.092/0.159/0.015/0.280 | 1.14/1.23/1.12/2.21 | 0.053/0.096/0.010/0.272 | 1.08/1.13/1.07/1.86 |
| LLaVA-OneVision-1.5-8B | 0.500/0.333/0.080/0.499 | 1.51/2.62/1.49/2.38 | 0.500/0.333/0.068/0.481 | 1.49/2.26/1.37/2.19 | 0.500/0.333/0.074/0.490 | 1.50/2.44/1.43/2.28 |
| MiniCPM-V-4.5 | 0.666/0.680/0.169/0.637 | 3.20/3.93/3.06/3.60 | 0.491/0.505/0.152/0.627 | 2.83/3.66/2.76/3.36 | 0.579/0.610/0.161/0.632 | 3.01/3.80/2.91/3.48 |
| 3B-Parameters Models | ||||||
| Qwen2.5-VL-3B | 0.641/0.612/0.023/0.391 | 1.19/1.33/1.18/3.28 | 0.689/0.686/0.017/0.381 | 1.42/1.56/1.40/3.68 | 0.665/0.652/0.020/0.386 | 1.31/1.45/1.29/3.48 |
| Gemma-3-4B-IT | 0.408/0.477/0.170/0.576 | 2.55/3.36/2.46/3.11 | 0.396/0.482/0.149/0.561 | 2.30/3.03/2.37/2.95 | 0.402/0.482/0.159/0.568 | 2.43/3.19/2.42/3.03 |
| InternVL3.5-2B | 0.602/0.573/0.177/0.648 | 2.62/3.29/2.51/3.13 | 0.435/0.459/0.159/0.631 | 2.46/3.16/2.42/2.99 | 0.518/0.518/0.168/0.640 | 2.54/3.22/2.47/3.06 |
| InternVL3.5-4B | 0.651/0.652/0.190/0.660 | 3.01/3.68/2.95/3.58 | 0.614/0.617/0.181/0.653 | 2.93/3.61/2.83/3.52 | 0.632/0.635/0.186/0.656 | 2.97/3.64/2.89/3.55 |
| Ivy-xDetector | 0.831/0.831/0.283/0.714 | 3.54/4.04/3.61/3.85 | 0.897/0.897/0.300/0.726 | 3.72/4.12/3.75/4.24 | 0.864/0.864/0.291/0.720 | 3.63/4.08/3.68/4.05 |
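
The Acc and F1 columns can be read with a small worked example. The sketch below computes accuracy and macro-averaged F1 for a degenerate detector that labels every sample as fake on a balanced set. The averaging scheme (macro) and the class balance are assumptions of this sketch, not details confirmed by the source; the resulting 0.500/0.333 pair is consistent with the LLaVA-OneVision-1.5-8B row above, which suggests, without proving, that the model predicts a single class there.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=("real", "fake")):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Balanced toy set and a detector that always answers "fake".
y_true = ["real"] * 50 + ["fake"] * 50
y_pred = ["fake"] * 100

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here the "fake" class scores an F1 of about 0.667 and the "real" class scores 0, so the macro average is about 0.333 even though accuracy sits at 0.5. Reading both columns together helps separate genuinely balanced detectors from ones that collapse to a single answer.
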
@article{jiang2025ivyfake,
title = {Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection},
author = {Changjiang Jiang and Wenhui Dong and Zhonghao Zhang and Chenyang Si and Fengchang Yu and Wei Peng and Xinbin Yuan and Yifei Bi and Ming Zhao and Zian Zhou and Caifeng Shan},
year = {2025},
url = {https://arxiv.org/abs/2506.00979}
}