2026-06-16 | Updated: 2026-06-16

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

PhoneHarness is a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. It integrates GUI, CLI, and host-side tool actions. On the annotated evaluation split it reaches a 75.0% pass rate, 12.9 percentage points above the strongest non-PhoneHarness settings, which suggests that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

Source: arXiv Computational Linguistics

Authors: Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu

[Submitted on 12 Jun 2026]

Abstract: Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.
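The abstract names three mechanisms (deterministic action routing, bounded GUI delegation, auditable execution traces) without giving their details. The following is a minimal Python sketch of how such a loop might be organized, not the authors' implementation. Every name in it (`Action`, `Harness`, `max_gui_steps`, the handler functions and the string results) is a hypothetical chosen for illustration, and the step budget is only one possible reading of "bounded".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical labels for the three action surfaces the abstract lists.
GUI, CLI, TOOL = "gui", "cli", "tool"


@dataclass
class Action:
    surface: str                      # one of GUI, CLI, TOOL
    name: str                         # illustrative action name, e.g. "tap"
    args: dict = field(default_factory=dict)


@dataclass
class Harness:
    handlers: Dict[str, Callable[[Action], str]]
    max_gui_steps: int = 3            # assumed budget for GUI delegation
    trace: List[dict] = field(default_factory=list)
    gui_steps: int = 0

    def execute(self, action: Action) -> str:
        # Deterministic routing: the surface field alone selects the handler.
        if action.surface not in self.handlers:
            result = "error: unknown surface"
        elif action.surface == GUI and self.gui_steps >= self.max_gui_steps:
            # Bounded delegation: refuse GUI actions beyond the budget.
            result = "error: gui budget exhausted"
        else:
            if action.surface == GUI:
                self.gui_steps += 1
            result = self.handlers[action.surface](action)
        # Auditable trace: every attempt is recorded, including refusals.
        self.trace.append({
            "step": len(self.trace),
            "surface": action.surface,
            "name": action.name,
            "args": action.args,
            "result": result,
        })
        return result


def demo() -> Harness:
    # Stub handlers stand in for real device, shell, and host-side calls.
    handlers = {
        GUI: lambda a: f"gui:{a.name}",
        CLI: lambda a: f"cli:{a.name}",
        TOOL: lambda a: f"tool:{a.name}",
    }
    harness = Harness(handlers=handlers, max_gui_steps=1)
    harness.execute(Action(CLI, "list_files"))
    harness.execute(Action(GUI, "tap", {"x": 10, "y": 20}))
    harness.execute(Action(GUI, "swipe"))   # exceeds the GUI budget of 1
    return harness
```

Calling `demo()` returns a harness whose `trace` holds one entry per attempted action, so a verifier could inspect which surface handled each step and whether any action was refused. A benchmark built on observable side effects would presumably check device or app state as well; that part is outside this sketch.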

Comments: Project Page: this https URL

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2606.14832 [cs.CL]

(or arXiv:2606.14832v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2606.14832

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Chenxin Li [view email] [v1] Fri, 12 Jun 2026 15:01:32 UTC (2,886 KB)
