Task-Oriented Omni-Modal Evaluation Harness

TOBench

A benchmark for evaluating real-world tool-using agents through closed-loop multimodal verification.

Overview

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise actions before producing final results.

TOBench introduces a benchmark and executable evaluation harness for task-oriented omni-modal tool use. It evaluates the full perceive–act–inspect–revise loop rather than isolated tool calls or final-answer matching.
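The perceive–act–inspect–revise loop can be illustrated with a minimal sketch. This is not the TOBench harness itself: the names `closed_loop`, `LoopResult`, and the three callbacks are hypothetical, chosen only to show how inspection feedback is routed back into the next action.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    artifact: str                 # last artifact produced by the agent
    passed: bool                  # True if inspection reported no issue
    rounds: int                   # number of act/inspect rounds used
    history: List[str] = field(default_factory=list)


def closed_loop(
    perceive: Callable[[], str],
    act: Callable[[str, str], str],
    inspect: Callable[[str], str],
    max_rounds: int = 3,
) -> LoopResult:
    """Hypothetical loop: perceive once, then act, inspect, and revise."""
    observation = perceive()      # read the task inputs
    feedback = ""                 # no feedback before the first attempt
    artifact = ""
    history: List[str] = []
    for round_no in range(1, max_rounds + 1):
        artifact = act(observation, feedback)   # produce or revise the artifact
        history.append(artifact)
        feedback = inspect(artifact)            # empty string means "no issue"
        if not feedback:
            return LoopResult(artifact, True, round_no, history)
    return LoopResult(artifact, False, max_rounds, history)


# Toy callbacks (assumptions for illustration only).
def toy_perceive() -> str:
    return "total=42"


def toy_act(observation: str, feedback: str) -> str:
    value = observation.split("=")[1]
    if "currency" in feedback:
        return f"Total: {value} USD"
    return f"Total: {value}"


def toy_inspect(artifact: str) -> str:
    return "" if artifact.endswith("USD") else "missing currency"


result = closed_loop(toy_perceive, toy_act, toy_inspect)
print(result.passed, result.rounds, result.artifact)
```

In this toy run the first artifact fails inspection and the second passes, which is the behavior that final-answer matching alone would not exercise.
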

100 Executable Tasks
27 MCP Servers
324 Tools
20 Scenario Slices

[Figure: TOBench construction framework]

Leaderboard

Modern agents still struggle with realistic omni-modal tool use.

The strongest evaluated model, Qwen3.5-Plus, reaches only 41.0% average task success, while the human benchmark reaches 94.0%.

Rank  Model             Avg.    Tool Calls  Tokens (k)  Cost ($)
-     Human Benchmark   94.00%  -           -           -
1     Qwen3.5-Plus      41.00%  25.0        559.1       0.17
2     Claude-Opus-4.6   32.00%  28.2        329.7       2.37
3     Gemini-3-Pro      32.00%  18.0        1300.5      2.62
4     Kimi-K2.5         31.00%  25.0        668.3       0.41
5     Gemini-3.1-Pro    30.00%  21.5        1506.6      3.03
6     Claude-Haiku-4.5  27.00%  22.9        244.0       0.27
7     GPT-5             26.80%  24.3        620.0       0.94

[Figure: TOBench model performance comparison]
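The leaderboard figures above can be compared with a few lines of arithmetic. The sketch below computes each model's gap to the human benchmark and a derived cost per successful task. The second quantity rests on an assumption that is not stated in the source: that the Cost ($) column is an average per attempted task. The function names are hypothetical.

```python
# Figures copied from the leaderboard: (model, average success %, cost in $).
HUMAN_SUCCESS = 94.0
ROWS = [
    ("Qwen3.5-Plus", 41.0, 0.17),
    ("Claude-Opus-4.6", 32.0, 2.37),
    ("Gemini-3-Pro", 32.0, 2.62),
    ("Kimi-K2.5", 31.0, 0.41),
    ("Gemini-3.1-Pro", 30.0, 3.03),
    ("Claude-Haiku-4.5", 27.0, 0.27),
    ("GPT-5", 26.8, 0.94),
]


def gap_to_human(success_pct: float) -> float:
    """Percentage-point gap between the human benchmark and a model."""
    return HUMAN_SUCCESS - success_pct


def cost_per_success(cost: float, success_pct: float) -> float:
    """Dollars per solved task.

    ASSUMPTION: `cost` is the average cost per attempted task; the source
    does not define the unit of the Cost ($) column.
    """
    return cost / (success_pct / 100.0)


for model, success, cost in ROWS:
    print(
        f"{model:<18} gap={gap_to_human(success):5.1f} pts  "
        f"$/success={cost_per_success(cost, success):6.2f}"
    )
```
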

Case Gallery

Representative task-oriented workflows.

TOBench tasks are built from realistic user needs, professional roles, multimodal assets, executable tools, and task-specific grounded verifiers.

Customer Service

E-commerce Support

Diagnose a customer issue from multimodal evidence, retrieve relevant order or product information, and produce a grounded service response.

Inputs: Images, receipts, web pages, documents
Tools: Search, browser, spreadsheet, file utilities
Verifier: Checks factual grounding, required fields, and response constraints

Office Workflow

Document & Spreadsheet Editing

Extract information from files, transform structured data, edit an office artifact, and inspect the produced result before submission.

Inputs: PDFs, Excel files, screenshots, text files
Tools: Filesystem, office tools, calculator, rendering tools
Verifier: Combines format checks, code checks, and artifact inspection

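A verifier that combines format checks, code checks, and artifact inspection might be composed as in the sketch below. It is an illustration under stated assumptions, not the TOBench verifier: the artifact is assumed to be a small CSV export, and `format_check`, `make_total_check`, `no_empty_cells`, and `run_verifier` are hypothetical names.

```python
import csv
import io
from typing import Callable, List, Tuple

# A check takes the artifact text and returns (passed, message).
Check = Callable[[str], Tuple[bool, str]]


def _rows(text: str) -> List[List[str]]:
    return list(csv.reader(io.StringIO(text)))


def format_check(text: str) -> Tuple[bool, str]:
    """Format check: the file must start with the expected header row."""
    rows = _rows(text)
    ok = bool(rows) and rows[0] == ["item", "amount"]
    return ok, ("header ok" if ok else "unexpected header")


def make_total_check(expected: float) -> Check:
    """Code check: recompute the total and compare it with the expected value."""
    def check(text: str) -> Tuple[bool, str]:
        try:
            total = sum(float(row[1]) for row in _rows(text)[1:])
        except (IndexError, ValueError):
            return False, "missing or non-numeric amount"
        return abs(total - expected) < 1e-9, f"total={total}"
    return check


def no_empty_cells(text: str) -> Tuple[bool, str]:
    """Artifact inspection: every cell in the produced file must be filled."""
    ok = all(cell.strip() for row in _rows(text) for cell in row)
    return ok, ("no empty cells" if ok else "empty cell found")


def run_verifier(text: str, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; the task passes only if all of them pass."""
    results = [check(text) for check in checks]
    return all(ok for ok, _ in results), [msg for _, msg in results]


checks = [format_check, make_total_check(42.5), no_empty_cells]
good = "item,amount\npaper,12.5\nink,30\n"
bad = "item,amount\npaper,12.5\nink,\n"
print(run_verifier(good, checks))
print(run_verifier(bad, checks))
```

Running all checks rather than stopping at the first failure keeps the report complete, which is useful when the feedback is meant to drive a revision step.
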
Intelligent Creation

Advertising Asset Generation

Create or edit visual assets under user requirements, inspect the generated artifact, and revise visual details when constraints are not met.

Inputs: Reference images, brand constraints, user requirements
Tools: Image generation, image processing, browser, file tools
Verifier: Evaluates semantic alignment, visual constraints, and output format

Multimodal Reasoning

Audio / Video Evidence Tasks

Process temporal or acoustic evidence, align observations with task requirements, and produce a verifiable action or answer.

Inputs: Audio clips, videos, transcripts, screenshots
Tools: ASR, media processing, retrieval, calculator
Verifier: Checks extracted evidence, reasoning consistency, and final artifacts

[Figure: TOBench task and tool distribution]

Citation

Cite TOBench

If you find TOBench useful for your research, please cite our work.

@article{tobench2026,
  title={TOBench: A Task-Oriented Omni-Modal Evaluation Harness for Real-World Tool-Using Agents},
  author={TOBench Team},
  year={2026}
}