AI News HubLIVE
站内改写

How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python

This tutorial explores AgentTrove, the largest open-source collection of agentic interaction traces with 1.7M rows. Learn to stream the dataset without full downloads, normalize agent turns, analyze trajectories, and export successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.

Article intelligence

EngineersAdvanced

Key points

  • Stream 1.7M agentic traces without downloading the full dataset
  • Normalize conversation structure across user, assistant, system, and tool roles
  • Analyze trajectory statistics and visualize key patterns
  • Export successful traces as a clean SFT dataset in ShareGPT format

Why it matters

This matters because stream 1.7M agentic traces without downloading the full dataset.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

In this tutorial, we explore AgentTrove, one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks. Also, we create a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.

Copy CodeCopiedUse a different Browser

!pip -q install "datasets>=2.19" pandas matplotlib pyarrow huggingface_hub import itertools, json, collections, textwrap, re, random, statistics import pandas as pd import matplotlib.pyplot as plt from datasets import load_dataset REPO = "open-thoughts/AgentTrove" random.seed(0) print(" Imports ready. Target dataset:", REPO) ds = load_dataset(REPO, split="train", streaming=True) print(" Streaming dataset opened.") first = next(iter(ds)) print("\n Columns present in a row:") for k in first.keys(): v = first[k] t = type(v).name preview = (str(v)[:70] + "…") if v is not None and len(str(v)) > 70 else v print(f" • {k:= 1.0 except (TypeError, ValueError): return False out_path = "agenttrove_clean_sft.jsonl" kept, scanned, SCAN, KEEP = 0, 0, 1500, 200 print(f"\n Scanning up to {SCAN} rows, keeping up to {KEEP} successful traces…") with open(out_path, "w") as f: for row in itertools.islice(load_dataset(REPO, split="train", streaming=True), SCAN): scanned += 1 if not is_success(row): continue turns = normalize_turns(row[TRACE_KEY]) conv = [{"from": r, "value": c} for r, c in turns if c.strip()] if len(conv) = KEEP: break print(f" Scanned {scanned} rows → wrote {kept} clean traces to '{out_path}'") def search_traces(keyword=None, source=None, limit=3, scan=3000): """Stream the dataset and yield-print traces matching filters.""" hits = 0 for row in itertools.islice(load_dataset(REPO, split="train", streaming=True), scan): if source and row.get("original_source") != source: continue if keyword: blob = " ".join(c for _, c in normalize_turns(row[TRACE_KEY])) if keyword.lower() not in blob.lower(): continue render_trace(row, max_chars=300) hits += 1 if hits >= limit: break if hits == 0: print("No matches in the scanned window — try increasing scan.") print("\n Searching for 'nl2bash' source traces:") search_traces(source="nl2bash", limit=2, scan=4000) print("\n Tutorial complete! Next ideas:") print(" • Increase N / SCAN for bigger analyses.") print(" • Filter by original_source (swesmith, codeforces, r2egym…) for a domain SFT set.") print(" • Feed agenttrove_clean_sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning.")

We define a success filter that retains traces marked as resolved, passed, correct, or positively rewarded. We then export successful trajectories into a clean ShareGPT-style JSONL file for downstream fine-tuning workflows. Also, we add a search utility to find traces by keyword or source, making the dataset easier to explore for specific agentic tasks.

In conclusion, we built a complete, hands-on pipeline to inspect, analyze, filter, and export data from AgentTrove in a Colab-friendly way. We started with streaming access, then progressively added schema detection, turn normalization, command extraction, trajectory rendering, statistical analysis, visualization, success-based filtering, and keyword or source-based search. This workflow helps us understand the internal structure of agentic traces and gives us a reusable foundation for preparing high-quality subsets for fine-tuning or evaluation. We also keep the process scalable by avoiding full dataset downloads and using streamed samples only when needed. Also, we demonstrated how AgentTrove can be used as more than a static dataset: we treated it as a rich source of agent behavior, tool usage, task outcomes, and training-ready conversations that can support future experiments in agent learning, workflow analysis, and domain-specific SFT dataset creation.

Check out the Full Codes with Notebook. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python appeared first on MarkTechPost.