Task-Oriented Omni-Modal Evaluation Harness

TOBench

A benchmark for evaluating real-world tool-using agents through closed-loop multimodal verification.


Overview

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise actions before producing final results.

TOBench introduces a benchmark and executable evaluation harness for task-oriented omni-modal tool use. It evaluates the full perceive–act–inspect–revise loop rather than isolated tool calls or final-answer matching.
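The perceive–act–inspect–revise loop can be pictured with a minimal Python sketch. Every name here (`run_closed_loop`, `act`, `inspect`, `max_revisions`, and the toy agent and inspector) is hypothetical and chosen for illustration; this is not the TOBench harness API, only one way such a closed loop might be structured.

```python
from typing import Callable, List, Optional, Tuple


def run_closed_loop(
    task: str,
    act: Callable[[str, List[str]], str],
    inspect: Callable[[str], List[str]],
    max_revisions: int = 3,
) -> Tuple[Optional[str], int]:
    """Act, inspect the artifact, and revise until no issues remain.

    Returns (artifact, attempts). The artifact is None when the
    revision budget is exhausted before inspection passes.
    """
    feedback: List[str] = []  # issues reported by the previous inspection
    for attempt in range(1, max_revisions + 1):
        artifact = act(task, feedback)  # perceive task + feedback, then act
        feedback = inspect(artifact)    # inspect the intermediate artifact
        if not feedback:                # nothing left to revise: submit
            return artifact, attempt
    return None, max_revisions


# Toy stand-ins for an agent and an inspector (illustrative only).
def toy_act(task: str, feedback: List[str]) -> str:
    draft = f"report for {task}"
    if "missing total" in feedback:
        draft += " | total: 42"
    return draft


def toy_inspect(artifact: str) -> List[str]:
    return [] if "total:" in artifact else ["missing total"]


result = run_closed_loop("Q3 invoice", toy_act, toy_inspect)
```

In this toy run the first draft fails inspection and the second one passes, so the loop returns the revised artifact after two attempts. An evaluation that only matched final answers would not distinguish an agent that inspected and revised from one that happened to get the first draft right.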

100 Executable Tasks

27 MCP Servers

324 Tools

20 Scenario Slices

Leaderboard

Modern agents still struggle with realistic omni-modal tool use.

The strongest evaluated model, Qwen3.5-Plus, reaches only 41.0% average task success, 53 percentage points below the human benchmark of 94.0%.

Rank	Model	Avg. Success	Tool Calls	Tokens (k)	Cost ($)
—	Human Benchmark	94.00%	—	—	—
1	Qwen3.5-Plus	41.00%	25.0	559.1	0.17
2	Claude-Opus-4.6	32.00%	28.2	329.7	2.37
3	Gemini-3-Pro	32.00%	18.0	1300.5	2.62
4	Kimi-K2.5	31.00%	25.0	668.3	0.41
5	Gemini-3.1-Pro	30.00%	21.5	1506.6	3.03
6	Claude-Haiku-4.5	27.00%	22.9	244.0	0.27
7	GPT-5	26.80%	24.3	620.0	0.94
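The table supports a few quick derived comparisons. The Python sketch below copies the success and cost columns from the leaderboard and computes each model's gap to the human benchmark and a rough "success points per dollar" ratio. That ratio is not a TOBench metric; it is an illustrative figure that assumes the Cost ($) column is measured the same way for every model.

```python
from typing import List, Tuple

# Rows copied from the leaderboard: (model, avg. success in %, cost in $).
LEADERBOARD: List[Tuple[str, float, float]] = [
    ("Qwen3.5-Plus", 41.0, 0.17),
    ("Claude-Opus-4.6", 32.0, 2.37),
    ("Gemini-3-Pro", 32.0, 2.62),
    ("Kimi-K2.5", 31.0, 0.41),
    ("Gemini-3.1-Pro", 30.0, 3.03),
    ("Claude-Haiku-4.5", 27.0, 0.27),
    ("GPT-5", 26.8, 0.94),
]
HUMAN_SUCCESS = 94.0  # human benchmark, avg. success in %


def gap_to_human(success: float) -> float:
    """Percentage-point gap between a model and the human benchmark."""
    return round(HUMAN_SUCCESS - success, 2)


def success_per_dollar(
    rows: List[Tuple[str, float, float]],
) -> List[Tuple[str, float]]:
    """Rank models by success points per dollar (illustrative ratio only)."""
    ratios = [(name, round(success / cost, 1)) for name, success, cost in rows]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


ranked = success_per_dollar(LEADERBOARD)
for name, ratio in ranked:
    print(f"{name:<18} {ratio:>7.1f} points per $")
```

Under this ratio the ordering differs from the success ranking: cheaper models move up relative to more expensive ones with similar success rates.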

Case Gallery

Representative task-oriented workflows.

TOBench tasks are built from realistic user needs, professional roles, multimodal assets, executable tools, and task-specific grounded verifiers.

Customer Service

E-commerce Support

Diagnose a customer issue from multimodal evidence, retrieve relevant order or product information, and produce a grounded service response.

Inputs: Images, receipts, web pages, documents
Tools: Search, browser, spreadsheet, file utilities
Verifier: Checks factual grounding, required fields, and response constraints

Office Workflow

Document & Spreadsheet Editing

Extract information from files, transform structured data, edit an office artifact, and inspect the produced result before submission.

Inputs: PDFs, Excel files, screenshots, text files
Tools: Filesystem, office tools, calculator, rendering tools
Verifier: Combines format checks, code checks, and artifact inspection
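A verifier that combines several kinds of checks might look like the following Python sketch. It is a hypothetical illustration, not a TOBench verifier: the CSV artifact, the expected columns (`item`, `amount`), the expected total of 30.0, and all function names are assumptions made up for the example.

```python
import csv
import io
from typing import Dict, List

Rows = List[Dict[str, str]]


def format_check(rows: Rows) -> bool:
    """Format check: the artifact has rows with exactly the expected columns."""
    return bool(rows) and set(rows[0].keys()) == {"item", "amount"}


def code_check(rows: Rows) -> bool:
    """Code check: a computed property (the column total) matches the task."""
    try:
        return sum(float(r["amount"]) for r in rows) == 30.0
    except (KeyError, TypeError, ValueError):
        return False


def inspection_check(rows: Rows) -> bool:
    """Artifact inspection: no cell is missing or blank."""
    return all(
        isinstance(v, str) and bool(v.strip()) for r in rows for v in r.values()
    )


def verify(csv_text: str) -> Dict[str, bool]:
    """Run every check and report each result plus an overall verdict."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    results = {
        "format": format_check(rows),
        "code": code_check(rows),
        "inspection": inspection_check(rows),
    }
    results["passed"] = all(results.values())
    return results


good = verify("item,amount\npen,10\nbook,20\n")
bad = verify("item,amount\npen,10\nbook,\n")
```

Reporting each check separately, rather than a single pass/fail flag, makes it easier to see whether an agent failed on structure, on computed content, or on the state of the final artifact.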

Intelligent Creation

Advertising Asset Generation

Create or edit visual assets under user requirements, inspect the generated artifact, and revise visual details when constraints are not met.

Inputs: Reference images, brand constraints, user requirements
Tools: Image generation, image processing, browser, file tools
Verifier: Evaluates semantic alignment, visual constraints, and output format

Multimodal Reasoning

Audio / Video Evidence Tasks

Process temporal or acoustic evidence, align observations with task requirements, and produce a verifiable action or answer.

Inputs: Audio clips, videos, transcripts, screenshots
Tools: ASR, media processing, retrieval, calculator
Verifier: Checks extracted evidence, reasoning consistency, and final artifacts

Citation

Cite TOBench

If you find TOBench useful for your research, please cite our work.

@article{tobench2026,
  title={TOBench: A Task-Oriented Omni-Modal Evaluation Harness for Real-World Tool-Using Agents},
  author={TOBench Team},
  year={2026}
}