2026-06-18 (updated 2026-06-18)

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL is a hierarchical multimodal skill library that keeps visual figures in the skill artifact instead of reducing them to text, targeting the weakness of computer-use agents on long-horizon tasks and unseen software. On the CUA-World and OSExpert-Eval benchmarks, a Claude Code CLI agent backed by Claude Opus 4.6 reached an average score of 0.456 with VISUALSKILL, a 15.3-point absolute lift over the no-skill baseline (0.303) and an 8.3-point lift over a matched text-only skill (0.373).
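The point lifts quoted above follow directly from the average scores given in the paper's abstract (0.303 with no skill, 0.373 with the matched text-only skill, 0.456 with VISUALSKILL). The short Python check below reproduces that arithmetic; the dictionary keys and the helper name `lift_in_points` are illustrative and not taken from the paper.

```python
# Average scores as reported in the abstract, on a 0-1 scale.
scores = {"no_skill": 0.303, "text_only": 0.373, "visualskill": 0.456}


def lift_in_points(new: float, old: float) -> float:
    """Absolute lift between two 0-1 scores, expressed in percentage points."""
    return round((new - old) * 100, 1)


lift_vs_no_skill = lift_in_points(scores["visualskill"], scores["no_skill"])
lift_vs_text_only = lift_in_points(scores["visualskill"], scores["text_only"])

print(f"vs. no skill:  +{lift_vs_no_skill} points")
print(f"vs. text-only: +{lift_vs_text_only} points")
```

These are absolute differences between average scores, not relative improvements.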

Source: arXiv Computational Linguistics

Authors: Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang


[Submitted on 16 Jun 2026]


Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.456 vs. 0.373). This provides direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at this https URL.
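The abstract describes the skill as a central index over per-topic files, read on demand through a load_topic tool. The following is a minimal Python sketch of that access pattern only, under stated assumptions: the on-disk layout (an `index.json` mapping topic names to a text file and a list of figure files), the `Topic` dataclass, and the demo topic `export_pdf` are all hypothetical, and the function is a plain local function, not the paper's MCP tool or its actual file format.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Topic:
    """Text and figure paths for one topic (hypothetical structure)."""
    text: str
    figures: list[Path]


def load_topic(skill_dir: Path, name: str) -> Topic:
    """Look up one topic in the central index and fetch only its files."""
    # Assumed layout: index.json maps topic name -> {"file": ..., "figures": [...]}.
    index = json.loads((skill_dir / "index.json").read_text(encoding="utf-8"))
    if name not in index:
        raise KeyError(f"unknown topic {name!r}; available: {sorted(index)}")
    entry = index[name]
    text = (skill_dir / entry["file"]).read_text(encoding="utf-8")
    figures = [skill_dir / fig for fig in entry.get("figures", [])]
    return Topic(text=text, figures=figures)


def build_demo_skill(root: Path) -> Path:
    """Create a tiny one-topic skill on disk, for illustration only."""
    (root / "topics").mkdir()
    (root / "figures").mkdir()
    (root / "topics" / "export_pdf.md").write_text(
        "Open File > Export, then choose PDF in the dialog.", encoding="utf-8"
    )
    # Placeholder bytes standing in for a real screenshot.
    (root / "figures" / "export_dialog.png").write_bytes(b"placeholder")
    index = {
        "export_pdf": {
            "file": "topics/export_pdf.md",
            "figures": ["figures/export_dialog.png"],
        }
    }
    (root / "index.json").write_text(json.dumps(index), encoding="utf-8")
    return root


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = build_demo_skill(Path(tmp))
    topic = load_topic(skill_dir, "export_pdf")
    demo_text = topic.text
    demo_figure_names = [path.name for path in topic.figures]

print(demo_text)
print(demo_figure_names)
```

In this sketch the agent only pays for the topic it requests: nothing but the index is read until a topic is asked for, which is one plausible reading of "on demand" in the abstract.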

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2606.18448 [cs.CL]

(or arXiv:2606.18448v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2606.18448

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ziyan Jiang

[v1] Tue, 16 Jun 2026 19:57:07 UTC (6,385 KB)
