Harness-1: The 20B Retrieval Subagent That Beats GPT-5.4 at Search
Harness-1 is a compact retrieval agent that separates state management from the model, using an eight-tool interface and two-phase compression for efficient search.
Riya Bansal Last Updated : 24 Jun, 2026
Most search agents try to handle too many jobs at once. They generate new queries, remember what they have already explored, collect evidence, and decide what is relevant as the search keeps expanding. That can make the whole process messy, expensive, and hard to control.
Harness-1 takes a simpler approach. Built by researchers from UIUC, UC Berkeley, and Chroma, it separates the work of finding search terms from the work of tracking search progress. The result is a compact retrieval agent that is easier to reason about and performs far above what its size might suggest.
In this article, we take a closer look at Harness-1 and why its approach to retrieval agents matters.
Table of contents
Why Existing Search Agents Plateau?
What the Harness Actually Does?
The Eight-Tool Interface
The Cold Start Problem (And Its Solution)
How Training Works: SFT Then RL
Stage 1: Supervised Fine Tuning
Stage 2: Reinforcement Learning
Hands-On: Running Harness-1 Locally
Benchmark Results: Where It Stands
What Harness-1 Doesn’t Do?
Conclusion
Frequently Asked Questions
Why Existing Search Agents Plateau?
Most retrieval agents are trained end to end. The model produces queries, reads chunks, decides what matters, and keeps all that context in a growing transcript. The policy has to learn everything: search strategy, evidence tracking, deduplication, and stopping conditions.
The problem is that reinforcement learning then tries to improve all of this at once. Semantic search decisions (should I search for “merger date” or “acquisition year”?) get tangled with low-level bookkeeping (have I seen this chunk before?). RL ends up optimizing both, even though they do not share the same learning dynamics, and training gets messy.
The researchers call this the core design flaw. Their fix is clean: move state management out of the model and into a harness.
What the Harness Actually Does?
The stateful harness is the core idea. It wraps the model, runs each episode like a state machine, and maintains four persistent structures throughout:
A candidate pool holds all compressed, deduplicated documents from every search.
A curated set is the final output: up to 30 documents, each tagged with an importance flag (very_high, high, fair, low).
A full-text store contains everything retrieved, kept outside of the model prompt.
An evidence graph is a collection of auto-extracted entities, their bridge documents, and singleton leads.
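The four structures above can be pictured as one state object that lives outside the prompt. The following is a minimal sketch, not the project's actual code: the class name `HarnessState`, its field layout, and the `add_candidate` and `curate` methods are all hypothetical; only the 30-document cap and the four importance flags come from the description above.

```python
from dataclasses import dataclass, field

# Flags and cap as described in the article; everything else is illustrative.
IMPORTANCE_FLAGS = ("very_high", "high", "fair", "low")
MAX_CURATED = 30


@dataclass
class HarnessState:
    """Hypothetical container for the four persistent structures."""
    candidate_pool: dict = field(default_factory=dict)   # doc_id -> compressed text
    curated_set: dict = field(default_factory=dict)      # doc_id -> importance flag
    full_text_store: dict = field(default_factory=dict)  # doc_id -> full text (never prompted)
    evidence_graph: dict = field(default_factory=dict)   # entity -> set of doc_ids

    def add_candidate(self, doc_id, compressed, full_text):
        """Store the full text off-prompt and keep the first compressed copy."""
        self.full_text_store[doc_id] = full_text
        self.candidate_pool.setdefault(doc_id, compressed)

    def curate(self, doc_id, flag):
        """Add or re-flag a document, enforcing the flag set and the size cap."""
        if flag not in IMPORTANCE_FLAGS:
            raise ValueError(f"unknown importance flag: {flag!r}")
        if doc_id not in self.curated_set and len(self.curated_set) >= MAX_CURATED:
            raise ValueError("curated set is full")
        self.curated_set[doc_id] = flag


state = HarnessState()
state.add_candidate("d1", "short summary", "the full document text")
state.curate("d1", "fair")
state.curate("d1", "high")  # re-flagging an existing document is allowed
print(state.curated_set)
```

Keeping the cap and flag validation in the state object, rather than in the model, is the point of the design: the policy only has to decide what to curate, not how to keep the bookkeeping consistent.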
The evidence graph is the clever part. A regex extractor scans each retrieved document for proper nouns, years, and dates. Bridge documents, which contain two or more entities that frequently appear together, are flagged as very high priority. Singletons mark potential follow-up searches. At each turn, the harness presents this information to the model in a compact form.
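To make the bridge and singleton idea concrete, here is a small sketch of a regex-based evidence graph. The patterns and the exact definitions are assumptions for illustration: an entity counts as "shared" when it appears in more than one document, a bridge document is one that mentions at least two shared entities, and a singleton is an entity seen in exactly one document. The real extractor and thresholds may differ.

```python
import re
from collections import defaultdict

# Hypothetical patterns: capitalized multi-word names and four-digit years.
ENTITY_RE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_entities(text):
    """Return the set of proper-noun phrases and years found in text."""
    return set(ENTITY_RE.findall(text)) | set(YEAR_RE.findall(text))


def build_evidence_graph(docs):
    """docs maps doc_id -> text. Returns (bridge_doc_ids, singleton_entities)."""
    entity_to_docs = defaultdict(set)
    doc_entities = {}
    for doc_id, text in docs.items():
        entities = extract_entities(text)
        doc_entities[doc_id] = entities
        for entity in entities:
            entity_to_docs[entity].add(doc_id)

    # Assumption: an entity is "shared" if more than one document mentions it.
    shared = {e for e, ids in entity_to_docs.items() if len(ids) > 1}
    # Assumption: a bridge document mentions at least two shared entities.
    bridges = sorted(d for d, ents in doc_entities.items() if len(ents & shared) >= 2)
    # A singleton appears in exactly one document and is a lead for follow-up.
    singletons = sorted(e for e, ids in entity_to_docs.items() if len(ids) == 1)
    return bridges, singletons


docs = {
    "d1": "Acme Corp acquired Globex Inc in 2019.",
    "d2": "Globex Inc was founded by Jane Smith.",
    "d3": "Acme Corp reported record revenue in 2019.",
}
bridges, singletons = build_evidence_graph(docs)
print("bridges:", bridges)
print("singleton leads:", singletons)
```

In this toy corpus, "Jane Smith" appears only once, so it surfaces as a lead the agent could search for next.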
The Eight-Tool Interface
The model interacts with the harness through an eight-tool interface. On every turn, it emits exactly one action.
Two-phase compression is applied to the output of every search. The first phase uses Sentence-BM25 to rank all sentences and keep the top 4 from each chunk. The second phase is two-level deduplication: first by chunk ID, then by content fingerprint. The policy never sees raw retrieval output, only what survives both phases.
The design pays off by keeping the model's context clean: the policy reads mostly signal rather than noisy raw retrieval output.
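The two phases described above can be sketched in plain Python. This is an illustrative approximation, not the harness's implementation: the tokenizer, the sentence splitter, the BM25 parameters (`k1`, `b`), scoring sentences against the other sentences of the same chunk, and the SHA-1 fingerprint over normalized text are all assumptions. Only the top-4 selection and the two dedup levels come from the description.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Score each sentence against the query with plain Okapi BM25."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def compress_chunk(query, text, top_k=4):
    """Phase 1: keep the top_k sentences by BM25, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scores = bm25_scores(query, sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(best))


def fingerprint(text):
    """Hash of the normalized text, used to catch identical content."""
    return hashlib.sha1(" ".join(tokenize(text)).encode("utf-8")).hexdigest()


def compress_and_dedup(query, chunks):
    """chunks is a list of (chunk_id, text) pairs."""
    compressed = [(cid, compress_chunk(query, text)) for cid, text in chunks]
    seen_ids, seen_prints, kept = set(), set(), []
    for cid, text in compressed:
        fp = fingerprint(text)
        # Phase 2: drop repeats by chunk ID first, then by content fingerprint.
        if cid in seen_ids or fp in seen_prints:
            continue
        seen_ids.add(cid)
        seen_prints.add(fp)
        kept.append((cid, text))
    return kept


long_text = ("The merger closed in 2019. Weather was mild. Lunch was served. "
             "The merger date was disputed. Staff went home. The board approved the merger.")
chunks = [("c1", long_text), ("c1", long_text), ("c2", long_text),
          ("c3", "Revenue grew after the merger.")]
result = compress_and_dedup("merger date", chunks)
for cid, text in result:
    print(cid, "->", text)
```

The second `c1` is dropped by chunk ID and `c2` is dropped by fingerprint, since its content is identical to `c1` under a different ID.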
The Cold Start Problem (And Its Solution)
The first issue in retrieval training is how a policy learns to build a curated set from nothing. In its first few RL episodes the policy has no prior to refine, so its curation is close to random: it either throws everything into the curated set or curates nothing at all.
Harness-1 addresses this with warm-start seeding. After the first successful search, the harness automatically seeds the curated set with the top 8 reranked results, each tagged with the fair importance flag. The policy's job therefore becomes refinement (promoting strong documents and demoting or removing weak ones) rather than building the set from scratch.
This small change makes training considerably more stable and suggests that curation is easier to learn as refinement than as creation.
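Warm-start seeding is simple enough to sketch in a few lines. The function name, the `(doc_id, score)` input shape, and the dict representation of the curated set are hypothetical; the seed size of 8 and the `fair` flag follow the description above.

```python
def warm_start_curated_set(reranked_results, seed_size=8, seed_flag="fair"):
    """Seed the curated set with the top reranked results, all flagged the same.

    reranked_results is a list of (doc_id, reranker_score) pairs (assumed shape).
    """
    top = sorted(reranked_results, key=lambda r: r[1], reverse=True)[:seed_size]
    return {doc_id: seed_flag for doc_id, _ in top}


# Ten toy results with descending scores.
results = [(f"doc{i}", 1.0 - 0.1 * i) for i in range(10)]
curated = warm_start_curated_set(results)

# From here the policy only refines: promote strong documents, drop weak ones.
curated["doc0"] = "very_high"
del curated["doc7"]
print(curated)
```

Starting from a neutral flag gives the policy room to move documents in either direction, which is what turns curation into a refinement task.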
How Training Works: SFT Then RL
There are two stages in the training pipeline that do different kinds of work:
Stage 1: Supervised Fine Tuning
A teacher model (GPT-5.4) is run live inside the complete harness on a large, diverse set of queries. After the poorly performing trajectories are filtered out, 899 episodes remain. These demonstrate correct use of the interface and are used to teach the model how to call tools, structure actions, and update the curated set.
# LoRA configuration for SFT
lora_config = {
    "rank": 32,
    "target_modules": ["q_proj", "v_proj"],
    "base_model": "gpt-oss-20b",
    "epochs": 3,
    "checkpoint_for_rl": 550,  # step-550 initializes RL training
}
Stage 2: Reinforcement Learning
The second stage uses on-policy CISPO with a terminal-only reward and a cap of 40 turns per episode. The training data consists of SEC (financial filing) queries, yet the learned policy transferred to all eight benchmark domains. Two design choices in the reward function stand out:
The first is the separation of discovery and selection. Finding a relevant document and curating it are rewarded independently.
The second is a diversity bonus for tool use. This bonus matters more than you might expect.
Without the diversity bonus, the agent gets stuck in a loop: it repeatedly issues the same search query in slightly different forms, fills the curated set with near-duplicate items, and stalls at 0.53 curated recall. With the bonus, the agent learns to use grep_corpus, verify, and read_document in addition to search_corpus, and curated recall rises to 0.60 from this one change.
# Simplified reward structure
def compute_reward(episode):
    discovery_score = count_newly_found_relevant_docs(episode)
    selection_score = curated_recall(episode.final_curated_set)
    diversity_bonus = tool_diversity_score(episode.action_sequence)

    # Terminal reward only - no intermediate shaping
    return selection_score + 0.3 * discovery_score + 0.2 * diversity_bonus
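The simplified reward above leaves its helper functions undefined. The sketch below fills them in with one plausible set of definitions so the arithmetic can be followed end to end. These definitions are assumptions, not the authors' formulas: discovery is normalized as the fraction of relevant documents seen anywhere in the episode, and tool diversity is the fraction of the eight tools used at least once. Only the 0.3 and 0.2 weights come from the snippet above.

```python
def curated_recall(curated_ids, relevant_ids):
    """Fraction of relevant documents that ended up in the curated set."""
    if not relevant_ids:
        return 0.0
    return len(set(curated_ids) & set(relevant_ids)) / len(relevant_ids)


def discovery_score(seen_ids, relevant_ids):
    """Assumed form: fraction of relevant documents surfaced during the episode."""
    if not relevant_ids:
        return 0.0
    return len(set(seen_ids) & set(relevant_ids)) / len(relevant_ids)


def tool_diversity_score(actions, num_tools=8):
    """Assumed form: share of the available tools used at least once."""
    return len(set(actions)) / num_tools


def compute_reward(curated_ids, seen_ids, actions, relevant_ids):
    # Terminal reward only: computed once, after the episode ends.
    return (curated_recall(curated_ids, relevant_ids)
            + 0.3 * discovery_score(seen_ids, relevant_ids)
            + 0.2 * tool_diversity_score(actions))


relevant = {"a", "b", "c", "d"}
seen = {"a", "b", "c", "x"}          # three of four relevant documents were found
curated = {"a", "b"}                 # two of them were curated
actions = ["search_corpus", "search_corpus", "grep_corpus", "read_document"]

reward = compute_reward(curated, seen, actions, relevant)
print(round(reward, 3))
```

Because discovery and selection are scored separately, an agent that finds a relevant document but fails to curate it still earns partial credit, while repeating `search_corpus` adds nothing to the diversity term.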
Hands-On: Running Harness-1 Locally
Let’s try it out.
The repo currently uses uv for dependency management and vLLM for serving. You will need enough GPU VRAM to run a 20B model: a single A100 (80GB) works well, as do two A100s (40GB) with tensor parallelism.
# Clone the repository and install it
git clone https://github.com/pat-jj/harness-1.git
cd harness-1

# If you haven't installed uv, do it now
pip install uv

# Pull all dependencies including vLLM
uv sync --extra vllm
The --extra vllm flag pulls in vLLM and its CUDA dependencies, so the first sync may take some time. Do not skip this step: the inference script relies on the vLLM server and will not run without it.
The first time you launch the server, it downloads about 40GB of weights from Hugging Face and starts a local OpenAI-compatible server using uvicorn. Once uvicorn has started and the server is reachable at http://0.0.0.0:8000, you can send requests to the model.
uv run python inference/vllm_local_inference.py serve \
  --model pat-jj/harness-1 \
  --served-model-name harness-1
If you have two GPUs, add --tensor-parallel-size 2 to split the model across them. Without this option, a single 40GB GPU will run out of memory.
Once the server is running, you can send a search request directly to Harness-1. The request must be formatted as a structured query against a Chroma corpus. Here’s what a minimal test looks like, using the BrowseComp+ benchmark format:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="harness-1",
    messages=[
        {
            "role": "user",
            "content": "Search for documents about the 2024 EU AI Act enforcement timeline.",
        }
    ],
    max_tokens=512,
    temperature=0.0,  # deterministic for eval runs
)

# The model emits a structured tool action - parse it
action = response.choices[0].message.content
print(action)
The response is not narrative text. It is a structured action, e.g. fan_out_search(queries=["EU AI Act enforcement 2024", "AI Act timeline implementation"]). This is expected, since Harness-1 is a retrieval subagent rather than a chat model. The action is then passed to the harness, which executes it against your corpus.
After a full search episode completes, you can find the metrics that matter in the log file.
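If you want to inspect such an action yourself before handing it to the harness, a string in the call-like form shown above can be parsed with Python's standard `ast` module. This is a sketch under the assumption that the action is a single call with keyword arguments holding plain literals; the `parse_action` helper is hypothetical, and the repository's own parser may expect a different format.

```python
import ast


def parse_action(text):
    """Parse 'tool_name(key=literal, ...)' into (tool_name, kwargs).

    Uses ast.literal_eval on the argument values, so no code is executed.
    """
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a parseable action: {text!r}") from exc
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a tool call: {text!r}")
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("this sketch only supports keyword arguments")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


raw = 'fan_out_search(queries=["EU AI Act enforcement 2024", "AI Act timeline implementation"])'
tool, kwargs = parse_action(raw)
print(tool)
for query in kwargs["queries"]:
    print(" -", query)
```

Parsing with `ast` instead of `eval` means a malformed or unexpected model output raises an error rather than running as code.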
Benchmark Results: Where It Stands
Harness-1 was tested against eight different benchmarks, including web search, SEC financial filings, patents, and multi-hop question answering (QA).
Curated recall is the core metric: the fraction of all relevant documents that make it into the final curated set of up to 30 documents.
Model                Size       Curated Recall  Trajectory Recall
Harness-1            20B open   0.730           0.807
Tongyi DeepResearch  30B open   0.616           0.673
Context-1            20B open   0.603           0.756
Search-R1            32B open   0.289           0.289
Opus-4.6             frontier   0.764           0.794
GPT-5.4              frontier   0.709           0.752
Sonnet-4.6           frontier   0.688           0.725
Kimi-K2.5            frontier   0.647           0.794
What Harness-1 Doesn’t Do?
Harness-1 is a retrieval subagent: it returns a ranked document set and does not reason over, summarize, or synthesize an answer from those documents. The downstream answering model is out of scope.
The RL training was only conducted on SEC queries, but it is promising to see the transfer performance onto web-based, patent and multi-hop QA queries. However, we did not consider domain generalization as part o
[truncated for AI cost control]