2026-05-09原文

OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support

OncoAgent is an open-source, privacy-preserving clinical decision support system for oncology. It features a dual-tier LLM architecture (9B speed vs 27B deep reasoning), multi-agent LangGraph topology, Corrective RAG pipeline over 70+ NCCN and ESMO guidelines, and a three-layer reflexion safety validator with Zero-PHI policy. The system routes queries via complexity scoring and was fine-tuned on AMD Instinct MI300X, achieving 56x throughput acceleration. It supports on-premises deployment to ensure data sovereignty.

Article intelligence

EngineersAdvanced

Key points

Open-source, privacy-preserving oncology decision support system for on-premises deployment.
Dual-tier LLM: 9B speed-optimized and 27B deep-reasoning models, routed via additive complexity scorer.
Multi-agent LangGraph topology with 8 specialized nodes for decomposed clinical reasoning.
Zero-PHI policy and four-layer safety architecture (retrieval gate, confidence gate, reflexion critic, HITL interrupt).

Why it matters

This matters because open-source, privacy-preserving oncology decision support system for on-premises deployment.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Back to Articles

"OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support"

Team Article Published May 9, 2026

Upvote

Máximo López Chenlo

MaximoLopezChenlo

lablab-ai-amd-developer-hackathon

thumbnail: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/oncoagent-thumbnail.png authors:

user: oncoagent-research tags:

oncology

multi-agent

LangGraph

RAG

QLoRA

AMD

open-source

clinical-ai

healthcare

OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support

Technical preprint · May 2026 · OncoAgent Research Group

Abstract

We present OncoAgent, an open-source, privacy-preserving clinical decision support system for oncology. OncoAgent combines a dual-tier fine-tuned LLM architecture with a state-of-the-art multi-agent LangGraph topology, a four-stage Corrective RAG pipeline over 70+ physician-grade NCCN and ESMO guidelines, and a three-layer reflexion safety validator enforcing a strict Zero-PHI policy.

The system routes clinical queries through an additive complexity scorer to either a 9B parameter speed-optimised model (Tier 1) or a 27B deep-reasoning model (Tier 2), both fine-tuned via QLoRA on a corpus of 266,854 real and synthetically generated oncological cases using the Unsloth framework on AMD Instinct MI300X hardware (192 GB HBM3).

Sequence packing on MI300X enabled full-dataset fine-tuning in approximately 50 minutes — a 56× throughput acceleration over API-based generation. Post-fix, CRAG document grading achieved a 100% success rate with a mean RAG confidence score of 2.3+. The complete system is 100% open source and deployable on-premises, eliminating proprietary cloud API dependency and preserving patient data sovereignty.

Keywords: clinical decision support, oncology AI, multi-agent systems, retrieval-augmented generation, QLoRA, AMD ROCm, open-source healthcare AI, HITL safety, LangGraph, Corrective RAG

Introduction

Oncology is one of the most information-dense and cognitively demanding domains in clinical medicine. The volume, heterogeneity, and rapid evolution of evidence-based guidelines — from the National Comprehensive Cancer Network (NCCN) to the European Society for Medical Oncology (ESMO) — create a persistent knowledge gap between published evidence and bedside practice.

AI-assisted clinical decision support systems hold transformative potential for closing this gap, yet most commercially available systems fail in three critical ways:

Hallucinated recommendations not grounded in validated guidelines

Cloud API dependency that precludes on-premises deployment in privacy-sensitive hospital environments

Monolithic LLM architectures prone to context saturation under complex multi-comorbidity presentations

OncoAgent is designed around three core principles:

Architectural decomposition: Clinical reasoning is decomposed across eight specialised LangGraph nodes, each with a bounded, auditable function.

Grounded generation: All model outputs are anchored to a curated vector knowledge base through a four-stage retrieval pipeline with explicit relevance gating.

Hardware sovereignty: The full inference and training stack runs natively on AMD Instinct MI300X using ROCm and open-source frameworks — enabling hospital deployment without data exfiltration.

Related Work

2.1 Clinical LLMs and Decision Support

Large language models have demonstrated significant promise in clinical NLP tasks including diagnostic coding, literature summarisation, and patient communication. Domain-specific fine-tuning approaches — exemplified by BioMedLM, Med-PaLM 2, and ClinicalBERT — consistently improve performance on medical benchmarks over general-purpose models. OncoAgent extends this line of work by targeting the specific subdomain of oncological triage and treatment pathway recommendation, where hallucination consequences are most severe.

2.2 Multi-Agent Architectures

Decomposed multi-agent systems have emerged as a principled approach to complex reasoning tasks. OncoAgent synthesises four canonical SOTA patterns:

Claude Code pattern — deterministic safety harnesses separated from LLM reasoning

Hermes Agent pattern — structured tool-calling with per-session memory isolation

Corrective RAG (Shi et al., 2024) — document relevance grading and query reformulation

Reflexion (Shinn et al., 2023) — self-correcting generation via feedback-augmented retry loops

2.3 Retrieval-Augmented Generation in Medicine

Standard bi-encoder retrieval is ill-suited for clinical domains where terminological precision is critical (e.g., "tyrosine kinase inhibitor" vs. "TKI"). OncoAgent implements a multi-stage pipeline with cross-encoder re-ranking, and integrates Hypothetical Document Embeddings (HyDE; Gao et al., 2022) to resolve medical synonym mismatches by projecting natural language queries into the guideline embedding space.

System Architecture

3.1 Overview

OncoAgent is implemented as a stateful directed graph using LangGraph. The system state is represented as an immutable AgentState TypedDict containing 11 logical sections and approximately 30 typed keys. Each node appends to specific keys without mutating upstream data, preserving a complete audit trail.

The 8-node topology is:

Router → Ingestion → Corrective RAG → Specialist ↔ Critic → HITL Gate → Formatter → END ↓ Fallback → END

Key properties:

5 conditional edges

1 reflexion retry loop (max 2 iterations)

1 mandatory HITL interrupt for high-complexity or low-confidence outputs

3.2 Complexity Router and Model Tiering

Case complexity is quantified using a weighted additive model prior to specialist invocation:

S = w_cancer + w_stage + w_mutations + w_treatment

Where:

Factor Condition Weight

Cancer type Rare +0.40

Cancer type Unknown primary +0.30

Stage Stage IV +0.25

Stage Stage III +0.15

Mutations ≥2 identified +0.30

Mutations Single +0.15

Prior treatment Any keyword match +0.10

Decision boundary: S ≥ 0.5 → Tier 2 (Qwen 3.6-27B deep reasoning) · S

4. BF16 workaround (is_bf16_supported() returns False on ROCm despite hardware support)

training_args = TrainingArguments(fp16=True, ...)

Final deployment uses native BF16:

model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16)

5.4 Sequence Packing and Throughput Breakthrough

Sequence packing via packing=True in SFTConfig concatenates multiple short clinical records into single 2048-token sequences, eliminating padding overhead and drastically reducing forward pass count.

The combined effect of Unsloth kernels and sequence packing on the MI300X enabled full-dataset fine-tuning of the 266,854-sample corpus in approximately 50 minutes — against an initial 5-hour estimate — representing roughly a 6× training time compression. GPU utilisation peaked at ~70%, with consistent throughput at ~11.3 s/iteration.

Checkpoint-1000 results: Tier 1 adapter trained for 1,339 steps · training loss ≈ 0.05 · adapter size 187 MB · verified against 11-file manifest including adapter_model.safetensors, adapter_config.json, and tokenizer.json.

The system supports adaptive inference routing: local BF16 inference via the LocalModelManager singleton when ROCm is available, with graceful fallback to the Featherless.ai API for high availability.

Safety and Privacy Framework

6.1 Zero-PHI Policy

A dedicated Zero-PHI redaction node runs as the first processing step in the Ingestion node, before any text reaches an LLM. It identifies and replaces Protected Health Information (patient names, dates of birth, MRN numbers, addresses, facility identifiers) with clinically neutral placeholders. The redacted representation is stored in AgentState; the original text is discarded.

This ensures that no PHI reaches any downstream LLM call — local or remote — and satisfies HIPAA de-identification requirements by design rather than policy.

6.2 Layered Safety Architecture

The system's safety guarantees are enforced at four independent layers. A failure at any single layer does not compromise the overall posture.

Layer Mechanism Addresses

L1: Retrieval Gate Distance Gate (cosine threshold 0.10) Out-of-domain hallucinations

L2: Confidence Gate RAG confidence score < 0.3 → block Low-quality retrieval grounding

L3: Reflexion Critic Formatting + safety + LLM entailment (max 2 retries) Unsupported or unsafe Specialist outputs

L4: HITL Gate Mandatory clinician interrupt for Tier 2 / flagged cases High-complexity cases requiring expert judgment

Layers 1 and 2 operate at the retrieval layer. Layer 3 at the generation layer. Layer 4 at the deployment layer. All Layer 3 checks run as deterministic code — not LLM-controlled logic — preventing safety bypass via adversarial prompting.

Clinical Interface

The OncoAgent UI is implemented as a real-time streaming Gradio application in a ChatGPT-style conversational layout. It features:

Left sidebar: Session controls, KPI tiles, evidence source tabs

Main chat area: Live agentic reasoning updates as each node completes

Real-time transparency is achieved via LangGraph's .stream(stream_mode="updates") API, which emits {node_name: node_output} dictionaries as each node completes. The UI maps each node to a human-readable clinical label (e.g., corrective_rag → "Retrieving NCCN/ESMO guidelines"), providing clinicians with full pipeline visibility.

The rag_confidence score and retrieved source count are prominently surfaced, giving clinicians immediate visibility into the quality of guideline grounding behind each recommendation.

The interface was designed to WCAG 2.1 AA standards — Lucide-style inline SVG icons, slate-900/sky-500 dark theme, Figtree/Inter typography, prefers-reduced-motion media query, all transitions capped at 200 ms.

Results

Component Metric Value

Knowledge Base Guidelines ingested 70+

PDFs parsed 138 in < 60 s

Index parsing errors 0

CRAG Pipeline Document grading success rate (post-fix) 100%

RAG confidence score (uterine cancer test) 2.3+ (was 0.0 pre-fix)

Parallel grading latency (3–5 docs) < 5 s

Complexity Router Stage IV pancreatic + KRAS + BRCA2 Score = 0.80 → Tier 2 ✅

Training (Tier 1, 9B) Full 266k-sample training time ~50 min (vs. 5 hr estimate)

Steady-state throughput ~11.3–16 s/step

GPU utilisation (MI300X) ~70% peak

VRAM utilisation (Unsloth) ~64 GB / 192 GB

Training loss at checkpoint-1000 ~0.05

Synthetic data throughput (MI300X vs. API) 6,800 vs. 120 cases/hr (56× ↑)

Synthetic corpus rejection rate 0.65%

Graph Topology Compiled nodes verified 8 / 8

Module test suites passed 6 / 6

UI Browser timeouts during triage 0

UI rendering latency < 200 ms

Discussion

9.1 Hardware Sovereignty as a Clinical Requirement

The ability to run the complete OncoAgent stack — training, inference, RAG, and UI — on a single AMD MI300X instance without cloud API dependencies is not merely an engineering convenience. In hospital environments governed by HIPAA (US), GDPR (EU), and equivalent national frameworks, the legal and ethical obligation to maintain data within controlled infrastructure is absolute. OncoAgent demonstrates that SOTA multi-agent clinical AI is achievable within this constraint.

9.2 The Throughput Breakthrough

The 56× synthetic data generation acceleration (from ~120 to ~6,800 cases/hr) and the ~6× training time compression together represent a significant practical contribution to the feasibility of domain-specific fine-tuning in time-constrained settings. These results suggest that AMD's CDNA3 architecture, when paired with Unsloth's Triton kernel optimisations and SFT sequence packing, may be substantially underutilised by standard HuggingFace training pipelines — and that the performance gap can be closed without changes to the underlying model architecture.

9.3 Limitations

Several limitations warrant acknowledgement:

The training corpus relies on approximately 36% synthetically generated cases. Clinical accuracy validation against board-certified oncologist judgments has not yet been performed at scale.

The current knowledge base covers NCCN guidelines primarily in English; ESMO and non-English clinical corpora remain for future work.

The Tier 1 adapter represents checkpoint-1000 of a potentially longer trajectory; full convergence and downstream clinical benchmark evaluation (MedQA, USMLE-style oncology subsets) are planned for subsequent releases.

Conclusion

OncoAgent establishes a complete, open-source, privacy-preserving clinical decision support architecture for oncology that integrates SOTA multi-agent design patterns, domain-specific fine-tuning, and a four-stage grounded retrieval pipeline.

The system demonstrates that production-grade clinical AI does not require proprietary infrastructure: the full stack — including 266k-sample QLoRA fine-tuning, 70+ guideline RAG, eight-node LangGraph orchestration, three-layer reflexion safety validation, and real-time clinical streaming UI — runs on a single AMD Instinct MI300X instance under ROCm.

The architectural contributions — particularly the synthesis of Corrective RAG, Reflexion, and HITL gating into a single coherent safety stack — represent a replicable blueprint for domain-specific clinical AI deployments where hallucination consequences are life-critical.

All code, adapter weights, and the OncoCoT synthetic corpus will be released publicly on Hugging Face Spaces and GitHub.

References

Singhal, K. et al. (2023). Large language models encode clinical knowledge. Nature, 620, 172–180.

Nori, H. et al. (2023). Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv:2311.16452.

Wang, L. et al. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 186345.

Shi, W. et al. (2024). Corrective Retrieval Augmented Generation. arXiv:2401.15884.

Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.

Nogueira, R. and Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.

Gao, L. et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv:2212.10496.

Hu, E.J. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.

Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.

Han, S. et al. (2024). LangGraph: Building stateful multi-actor applications with LLMs. LangChain Technical Report.

OncoAgent is intended as a clinical decision support tool. All outputs require review by licensed medical professionals prior to any clinical application.