2026-05-28 08:01 UTC | In-site rewrite | 2 min read | Updated: 2026-06-30 13:03 UTC

7B Model Beats o3 and GPT-5: Medical AI Agents Teach Models Where and How to Look

The LeapQuest team at Shanghai Innovation Institute, in collaboration with multiple universities, introduces a new medical AI paradigm that enables models to actively use visual tools during reasoning, transforming from passive input receivers to active evidence seekers. Two papers are accepted at ICML 2026.

Source: 量子位 | Author: 听雨

Key points

LeapQuest proposes Ophiuchus and MedScope for medical images and videos, adopting the Think with Images/Videos paradigm.
Ophiuchus-7B achieves an average score of 68.0 on 8 VQA benchmarks, surpassing o3 (62.2) and GPT-5 (59.9).
MedScope, trained with ClinVideoSuite and GA-GRPO, achieves open-source SOTA on video understanding tasks.
The new paradigm allows models to actively invoke segmentation, localization, and zoom-in tools during the reasoning chain, enabling evidence-driven visual reasoning.

A 7B-parameter medical AI model has outperformed OpenAI's o3 and GPT-5 on medical visual question answering (VQA) benchmarks. The research, led by the LeapQuest team at Shanghai Innovation Institute in collaboration with Zhejiang University, Shanghai Jiao Tong University, and Fudan University, comprises two papers accepted at ICML 2026. Together they introduce a paradigm in which models actively use visual tools during reasoning, shifting from passive input receivers to active evidence seekers.

For years, multimodal medical models have operated by encoding images or videos into visual features and then generating answers and explanations through large language models. However, this approach has a critical flaw: models can produce seemingly coherent explanations without truly identifying the key evidence—whether it's a tiny lesion, a subtle boundary change, or a fleeting surgical movement. To address this, the team developed Ophiuchus and MedScope, which implement the "Think with Images" and "Think with Videos" paradigms, respectively.

Ophiuchus turns the model into a visual agent that works with medical imaging tools. During reasoning, the model decides when to call an external tool: SAM2 for fine-grained segmentation, BiomedParse for locating medical structures from text prompts, or Zoom-in for magnifying critical regions. Each tool's output is fed back into the reasoning chain as an observation that drives the next decision. The tools are therefore part of the reasoning chain rather than an add-on: the model must learn when to call a tool and which one, how to interpret its output, and how to change strategy when a result is unreliable.

In evaluations, Ophiuchus-7B achieved an average score of 68.0 across 8 VQA benchmarks, ahead of OpenAI o3 (62.2), Gemini 2.5 Pro (61.8), and GPT-5 (59.9), a margin of 5.8 to 8.1 points. Its tool usage accuracy reached 97.9%. The results suggest that model size and language reasoning are not the only bottlenecks when a task requires fine-grained visual evidence.
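The tool-calling loop described for Ophiuchus can be pictured with a small sketch. This is not the team's implementation: `locate_stub` is a placeholder standing in for a text-prompted locator (the role BiomedParse plays in the article), `scripted_policy` is a hard-coded stand-in for the model's decisions, and the names `Step`, `Trace`, and `run_agent` are hypothetical. The point is only the control flow: the policy picks a tool, the observation is appended to the trace, and the next decision depends on it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive; assumed convention
Image = List[List[int]]          # toy image: a 2D grid of pixel values


@dataclass
class Step:
    tool: str
    args: Dict[str, Any]
    observation: Any


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def locate_stub(image: Image, prompt: str) -> Optional[Box]:
    """Placeholder locator: ignores the prompt and boxes all nonzero pixels."""
    pts = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    if not pts:
        return None
    xs, ys = zip(*pts)
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def zoom_in(image: Image, box: Box) -> Image:
    """Crop the image to the given box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def scripted_policy(question: str, steps: List[Step]) -> Dict[str, Any]:
    """Hard-coded stand-in for the model: locate, then zoom, then answer."""
    if not steps:
        return {"type": "tool", "tool": "locate", "args": {"prompt": "lesion"}}
    last = steps[-1]
    if last.tool == "locate":
        if last.observation is None:  # unreliable result: change strategy
            return {"type": "answer", "text": "no finding located"}
        return {"type": "tool", "tool": "zoom_in", "args": {"box": last.observation}}
    crop = last.observation
    return {"type": "answer",
            "text": f"finding visible in a {len(crop[0])}x{len(crop)} crop"}


def run_agent(image: Image, question: str,
              policy: Callable[[str, List[Step]], Dict[str, Any]],
              tools: Dict[str, Callable[..., Any]],
              max_steps: int = 4) -> Trace:
    """Alternate between policy decisions and tool calls until an answer."""
    trace = Trace()
    for _ in range(max_steps):
        action = policy(question, trace.steps)
        if action["type"] == "answer":
            trace.answer = action["text"]
            break
        observation = tools[action["tool"]](image, **action["args"])
        trace.steps.append(Step(action["tool"], action["args"], observation))
    return trace


# Toy 6x6 "scan" with a three-pixel finding.
image = [[0] * 6 for _ in range(6)]
image[3][2] = image[3][3] = image[4][2] = 1
tools = {"locate": locate_stub, "zoom_in": zoom_in}
trace = run_agent(image, "Is there a lesion?", scripted_policy, tools)
for step in trace.steps:
    print(step.tool, step.args, "->", step.observation)
print(trace.answer)
```

Because every tool call and observation is stored in `trace.steps`, the same structure doubles as an evidence log: a reader can replay which regions were inspected before the answer was produced.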

MedScope extends this paradigm to the more challenging domain of long clinical videos, where key evidence is not only fine-grained but also sparse in time: a surgical maneuver or a change in endoscopic view may last only a few seconds. MedScope mimics how clinicians watch such footage. It first builds a global understanding of the video, then returns to suspicious time windows, calling crop_video to extract segments and get_frame to pull key frames, and finally integrates these local observations into its answer. This makes the reasoning auditable: a reviewer can see which segments the model revisited and which frames it used to support its conclusion.

To train this behavior, the team built ClinVideoSuite, which comprises 635K dense captions with timestamps, 254K evidence-related QA pairs, 34K visual-CoT trajectories, and an interactive environment for reinforcement learning. Training follows a three-stage pipeline: clinical reasoning warm-up, visual-CoT cold-start SFT, and GA-GRPO (grounding-aware GRPO), whose evidence-modulated advantage encourages the model to retrieve video segments that actually support the answer.

On benchmarks such as SVU-31K and ClinVideo-Eval, MedScope achieves state-of-the-art results among open-source models in multi-granularity video understanding, fine-grained temporal reasoning, and grounded VQA. Ablation studies show that removing the evidence reward significantly reduces localization quality, indicating that answer-level supervision alone is insufficient for reliable evidence selection.
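The article does not give the GA-GRPO formula, so the following is only an illustration of what an evidence-modulated advantage could look like. It starts from the standard GRPO step (rewards normalized within a group of rollouts for the same question) and then applies a hypothetical modulation: positive advantages are scaled by how well the retrieved time window overlaps a reference window, measured by temporal IoU. The weighting rule, the `lam` parameter, and all function names are assumptions made for this sketch, not the paper's definition.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Window, gold: Window) -> float:
    """Intersection over union of two time windows."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO normalization: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def evidence_modulated_advantages(answer_rewards: List[float],
                                  evidence_scores: List[float],
                                  lam: float = 0.5) -> List[float]:
    """Hypothetical modulation: a correct answer earns full credit only if
    its evidence is well grounded. Negative advantages are left unchanged."""
    base = group_relative_advantages(answer_rewards)
    out = []
    for adv, score in zip(base, evidence_scores):
        if adv > 0:
            adv *= (1.0 - lam) + lam * score  # weight lies in [1 - lam, 1]
        out.append(adv)
    return out


# Four rollouts for one question; the reference evidence window is 40-50 s.
gold = (40.0, 50.0)
retrieved = [(41.0, 50.0), (10.0, 20.0), (40.0, 49.0), (0.0, 5.0)]
answer_rewards = [1.0, 1.0, 0.0, 0.0]  # 1.0 = correct final answer
evidence_scores = [temporal_iou(w, gold) for w in retrieved]
advantages = evidence_modulated_advantages(answer_rewards, evidence_scores)
for window, adv in zip(retrieved, advantages):
    print(window, round(adv, 3))
```

In this toy group, the first two rollouts both answer correctly, but the one that retrieved the 41-50 s window receives a larger advantage than the one that looked at 10-20 s. That is the kind of pressure an evidence term adds on top of answer-level reward.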

Together, Ophiuchus and MedScope define a new paradigm for medical multimodal intelligence: the model's reasoning is no longer just a sequence of language tokens but a closed loop of language, tools, image regions, video clips, and evidence feedback. This shift matters for clinical AI, where every conclusion requires an evidence chain. By actively seeking, verifying, and citing visual evidence before answering, these models move closer to real clinical visual reasoning, with fewer hallucinations, stronger interpretability, and a better fit for complex workflows. The change runs along three lines: from seeing passively to thinking with images and videos, from outputting answers to actively searching for evidence, and from language-only chains to multimodal chains of thought driven by visual evidence. In the team's framing, medical AI that "watches while thinking" has arrived.