2026-05-28 18:10 UTCOriginal source9 min readUpdated: 2026-06-30 13:03 UTC

Build a test suite that grows with your agent with dataset management in Amazon Bedrock AgentCore

Agent evaluation is most powerful when combining fast-moving online signals with stable offline baselines. Amazon Bedrock AgentCore's dataset management provides versioned test fixtures, enabling consistent measurement and ground truth verification.

SourceAWS Machine Learning BlogAuthor: Visakh Madathil

Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchmark alongside your changing real-world traffic.

Managing test cases for evaluation baselines as a dataset in Amazon Bedrock AgentCore brings the discipline of versioned test fixtures to agent evaluation. You can author scenarios with inputs, expected outputs, assertions, and tool sequences, then publish them as immutable numbered versions that don’t shift beneath a run. You can iterate freely on a mutable draft until you’re ready to lock a checkpoint. And when something breaks in production, that failure becomes a permanent test case that every future change gets evaluated against.

In this post, we walk through the full workflow with a financial market-intelligence agent. We capture failures from production traces, build a versioned dataset, run an evaluation, fix the agent, and confirm the improvement against the same locked inputs.

Why datasets matter

Agents are non-deterministic by design. The same input can produce different outputs across runs, which makes a single evaluation result nearly meaningless. You can’t tell if a score moved because the agent changed or because the model sampled differently. Consistent measurement across stable inputs is the only way to know whether a change actually helped.

But stable inputs alone aren’t enough. A large language model (LLM) judge can tell you whether a response sounds helpful. It cannot tell you whether the stock price is accurate, whether the broker workflow ran in the right order, or whether personally identifiable information (PII) leaked between sessions. For those checks you need ground truth: the expected response, the required tool sequence, and the assertions that must hold regardless of how the response is phrased. Ground truth is what turns a subjective score into a verifiable measurement. Without it, you’re measuring the appearance of correctness, not correctness itself.

Versioned datasets give you both. They hold the inputs still so scores are comparable across runs, and they carry the ground truth that makes those scores mean something. This matters most in the two places where agent evaluation actually happens.

The inner loop is the developer desk. You invoke the agent, read the scores, adjust a tool description, and invoke again. The cycle is minutes. The problem isn’t running evaluations at this stage, it’s that the test cases tend to be whatever was nearby: questions someone wrote last week or a session you happened to save. When a score improves you want to believe the fix worked. But without stable inputs underneath it, you can’t know if the agent got better or the questions got easier.

The outer loop is the CI/CD pipeline. Before a change ships, something needs to say it didn’t break anything. Most teams have this gate. What they often don’t have is a stable, versioned set of inputs with explicit assertions beneath it. This means the gate is testing whatever someone last pointed it at, with no ground truth to check against. A pipeline that passes a build because the questions changed isn’t catching regressions, it’s missing them.

A versioned dataset closes that gap. The developer curates failures into the draft during the inner loop. In the outer loop, a published version of that draft becomes the gate. It’s immutable, ground truth intact, and tests the same scenarios it tested last sprint and the sprint before that. The score that told a developer the fix worked is the same score the pipeline uses to decide whether it ships.

Two types of test scenarios

Datasets in Amazon Bedrock AgentCore support two schema types that serve these two loops differently.

Predefined scenarios are backward-looking. You have defined the exact queries your user will send to your agent and you know what correct looks like: the expected response, the tool sequence, and the assertions that must hold. You write them down and the evaluator checks whether the agent met them. Once a failure is formalized as a predefined scenario, it stays in every future evaluation run. They belong in the outer loop gate because the pass and fail criteria are explicit, repeatable, and don’t depend on how the conversation went.

User simulation scenarios are forward-looking. Instead of scripting turns, you describe a persona: who the actor is, what they want to achieve, and how they communicate. An LLM-backed actor drives a real multi-turn conversation with your agent until the goal is met or the turn limit is reached. You don’t script what the actor says. Coverage emerges from the interaction. For more information, see User simulation in the AgentCore User Guide.

This is different from anything in the standard evaluation toolkit. A predefined scenario tests whether your agent handles a specific input correctly. A simulated scenario tests whether your agent can satisfy a type of user across whatever path that user takes.

Throughout this post we use a Market Trends Agent as the running example. The agent serves investment brokers at financial institutions. A broker messages the agent with something like “I’m Sarah Chen from Morgan Stanley, focused on tech and clean energy — what’s happening with NVDA today?” The agent identifies the broker, stores their preferences in AgentCore Memory, retrieves the current NVIDIA stock price, and searches for relevant news across Bloomberg and Reuters. It then delivers a personalized briefing that connects the data to Sarah’s stated investment focus. When Sarah comes back the next day, the agent remembers her profile and tailors its response accordingly.

For a predefined scenario, you might have production traces of how a user interacted with your agent, and you can curate them for your evaluation dataset. An example of this for the Market Trends Agent looks like this:

PreDefinedScenario{ "scenario_id": "broker_profile_onboarding", "turns": [ { "input": ( "Hi, I'm Sarah Chen from Morgan Stanley. " "I focus on tech and clean energy. " "Risk tolerance: moderate-high. " "Client base: institutional and high-net-worth." ) } ], "expected_trajectory": {"toolNames": ["identify_broker", "update_broker_financial_interests"]}, "assertions": [ "Agent identifies the broker by name and firm.", "Agent stores the broker's sector preferences and risk tolerance.", "Agent acknowledges receipt of the profile and offers to help.", ], "metadata": {"category": "onboarding", "priority": "high"}, }

A simulated senior tech analyst might open with a broad question about NVIDIA Corporation Common Stock (NVDA). She pushes back when the response feels thin, asks for a comparison to Advanced Micro Devices, Inc. Common Stock (AMD), and only signals completion when she has something citable for a client call. No one scripted those turns. The actor generated them from the profile. You can define this scenario as follows:

SimulatedScenario( scenario_id="sim-tech-analyst-nvda-amd-deep-dive", scenario_description=( "A senior technology research analyst probes for a deep, citable NVDA vs AMD briefing ahead of a client call." ), actor_profile=ActorProfile( traits={ "expertise": "senior", "focus": "semiconductors", "style": "skeptical and data-driven", }, context=( "Senior sell-side technology analyst preparing talking points for a high-value client call. " "Expects multi-layered analysis, not surface-level summaries, and will push back when answers feel generic or thin." ), goal=( "Pressure-test the agent's semiconductor domain depth by asking about NVIDIA, then insisting on richer detail, " "requesting a structured comparison with AMD, and only concluding when she has citable points for a client conversation." ), ), input=( "I'm prepping for a client call and need a quick but solid briefing on NVIDIA. " "Start with NVDA's recent performance and positioning in semiconductors." ), max_turns=8, assertions=[ "Agent provides an initial NVDA summary with recent performance and positioning", "Agent responds with deeper fundamentals, product/roadmap, or moat detail for NVDA", "Agent produces a structured NVDA vs AMD comparison (e.g., valuation, growth, segments)", "Agent includes specific, citable data points or metrics suitable for a client call" ], )

User simulation is particularly useful in the inner loop, when you’re not sure what failure modes you haven’t found yet. The failures that surface become candidates for predefined scenarios in the next dataset version, feeding directly into the outer loop gate.

A few things are worth knowing about how simulation works under the hood. The actor runs on a Bedrock model you specify in SimulationConfig. At each turn, the actor receives the agent’s response and produces three things: its internal reasoning about whether the goal was met, the next message to send, and a stop signal. The conversation ends when the actor signals completion, when max_turns is reached, or when the actor produces no next message. Because the conversation path is dynamic, simulated scenarios don’t support expected_trajectory or per-turn expected_response. Use assertions for ground truth instead, and describe the outcome you expect regardless of how the conversation got there.

For the Market Trends Agent, a predefined scenario covers the price drift bug that already burned a broker. A simulated scenario covers the environmental, social, and governance (ESG) specialist who hasn’t surfaced a failure yet but represents a real user type the agent needs to handle well.

How datasets in AgentCore work

Datasets are built into AgentCore as a first-class resource with ARNs, IAM authorization, and tags. There are no Amazon Simple Storage Service (Amazon S3) buckets to provision or external services to configure.

Draft and publish. Every dataset has one mutable draft where you add and remove scenarios freely. When you want a stable checkpoint, publish it. The draft becomes an immutable numbered version. Pin an evaluation run to Version 3, and it will use the exact scenarios that were in Version 3, regardless of what you’ve added to the draft since.

Schema validation at write time. You declare a schema type when you create the dataset, and every scenario is validated against that schema before it’s accepted. Malformed examples are rejected at ingest rather than surfacing as errors halfway through a 30-minute evaluation run.

One dataset, multiple runners. Load a dataset with DatasetManagementServiceProvider and pass it to either the on-demand runner for fast per-scenario feedback, or the batch runner for aggregate scoring across many sessions. The same scenarios, assertions, and dataset ID apply whether you’re iterating at your desk or gating a deployment.

The agent: Market Trends Assistant

We use the Market Trends Agent, a LangGraph application deployed on AgentCore Runtime, as the running example. The full source is available in the AgentCore samples repository under 02-use-cases/market-trends-agent. The agent has the following tools that help it serve queries from financial brokers:

Tool What it does

get_stock_data Current price, daily change, volume for a ticker.

search_news Multi-source financial news search (Bloomberg, Reuters, CNBC, WSJ, FT).

identify_broker Extracts broker identity from the message for memory lookup.

get_broker_financial_profile Reads stored preferences, risk tolerance, sector focus.

update_broker_financial_interests Writes new preferences to long-term memory.

Three failure modes come up often enough to warrant permanent test cases:

Stale prices — the agent quotes a number that’s drifted more than 2% from the live value, usually because it reused a cached tool response rather than making a fresh call.

Skipped identity check — the agent jumps straight to get_broker_financial_profile without calling identify_broker first, which can result in pulling the wrong broker’s preferences and delivering a response tailored to someone else’s portfolio.

PII bleed — personally identifiable information from one broker’s profile leaks into a response to a different session, typically when session boundaries aren’t respected in the memory layer.

Each of these failures is subtle enough to pass a manual spot check but serious enough to erode trust with the brokers who depend on the agent daily.

Implementation

This hands-on walkthrough takes about 30 minutes.

Prerequisites

You need the following:

An AWS account with permissions for AgentCore Runtime, Memory, Evaluations, and Amazon CloudWatch.

AWS Command Line Interface (AWS CLI) configured.

CloudWatch Transaction Search enabled (one-time account opt-in).

The samples repo cloned and the Market Trends Agent deployed (uv run python deploy.py).

The full sample is available in the 02-use-cases/market-trends-agent directory of the AgentCore samples repository.

Walkthrough

Deploy the Market Trends Agent. Run uv run python deploy.py to provision the AgentCore Runtime, Memory, IAM role, and ECR container. The agent ARN is written to .agent_arn.

Create and version evaluation datasets. Run uv run python optimization/manage_dataset.py --no-cleanup to create two datasets and publish an immutable version of each.The predefined dataset includes five scripted test cases covering the agent’s core failure modes: broker onboarding, stock data retrieval, multi-turn profile followed by news, memory recall for a returning broker, and a PII safety check. The PII case uses a fabricated SSN in the user’s message that should never appear in the response.The simulated dataset includes three actor-profile scenarios. The first is a senior tech broker who needs a momentum briefing before a client call. The second is an ESG specialist reviewing portfolio alignment. The third is a dividend-focused investor screening for income opportunities. Each actor drives its own conversation without scripted turns.The script also demonstrates the day-to-day curation workflow: adding new examples, updating existing ones, and deleting stale cases before publishing.

Run evaluation against the versioned dataset. Run uv run python optimization/user_simulated_dataset.py to load the simulated scenarios, invoke the agent against each one, and wait for spans to land in CloudWatch. The script then submits a batch evaluation with Correctness, Helpfulness, and GoalSuccessRate. Per-scenario scores and explanations print to the console.

Iterate: fix the agent and re-evaluate. Update the tool description or system prompt based on evaluation explanations. Add the newly surfaced edge case to the draft with add_examples_and_wait(), publish a new version with create_dataset_version_and_wait(), and re-run. Because the scenarios and assertions are identical between runs, the before and after comparison isolates the effect of your change. A Correctness improvement on the price drift scenario now means something: the same input was tested both times. Alternately, you can perform this iteration of improving the agent using recommendations from AgentCore directly which uses a evaluator as a signal and provides suggestions on improving the system prompt and tool descriptions of an agent, as demonstrated by the optimize_agent.py in the Market Trends Agent sample.

View results. Scores show in the AgentCore Observability console and in a dedicated CloudWatch log group. The explanation field tells you why a scenario passed or failed: “identify_broker was never called before get_broker_financial_profile” or “agent response contained SSN pattern matching PII safety assertion.”

After completing these steps, the dataset persists as a managed resource in your AWS account. Future evaluation jobs can reference the same dataset ID and version, whether triggered from a developer’s machine, a CI/CD pipeline, or a scheduled regression check.

Using the dataset across your workflow

Mode Use case What happens

On-demand runner Development, CI/CD gates SDK invokes the agent, collects spans, scores immediately, returns per-scenario detail.

Batch runner Baselines, large-scale comparison Service handles scoring asynchronously, writes aggregate results to CloudWatch.

Use the on-demand runner when you need immediate per-scenario feedback during development on smaller datasets when you manage the concurrency. Use the batch runner when measuring aggregate quality across a large dataset or comparing two agent versions at scale.

For deployment gates, pin the evaluation to a published version and fail the build if any evaluator drops below your threshold. Because the version is immutable, the gate tests the same scenarios on every PR regardless of what’s been added to the draft since.

The same versioned dataset also feeds into AgentCore Optimization. When you use the Recommendations API to generate improved system prompts or tool descriptions, the evaluation scores driving those decisions are grounded in your dataset. The same applies when you set up an A/B test to validate those improvements against live traffic. Stable inputs make the optimization loop trustworthy rather than a side effect of a shifting test set.

Practices worth adopting early

Ground your test suite in real incidents. The scenarios that catch the most are the ones sourced from actual production failures. A broker received a stale price, a profile got crossed between sessions, or PII appeared where it shouldn’t. These target weaknesses your agent has already demonstrated in front of real users. Invented questions target imagined ones.

Use predefined for depth, simulated for breadth. Predefined scenarios guard the bugs you’ve already found. Simulated scenarios surface the ones you haven’t. A healthy dataset includes both. For the Market Trends Agent, predefined scenarios cover the pricing drift and workflow-order bugs that already burned a user. Simulated scenarios exercise varied broker personas that push the agent in directions no one anticipated.

Publish a version before every change. Versions are immutable and cost nothing to keep. When you’re debugging a score regression months later, you’ll want to know exactly which scenarios were in play at each checkpoint.

One dataset, many versions. Resist creating a new dataset every sprint. The value is in continuity. The same dataset ID accumulates every failure your agent has ever encountered, and every future evaluation inherits that history. Publishing a new version means building on everything you’ve already learned. Creating a new dataset means starting from scratch.

Cleanup

To avoid incurring future charges, delete the dataset and its versions with DatasetClient.delete_dataset(). Remove the Market Trends Agent by following the cleanup section in the repository README.

Conclusion

A test suite is only useful if it holds still. When the inputs change between runs, your scores measure drift in the test set rather than improvement in the agent.

Managing datasets in AgentCore gives you versioned, schema-validated test cases as a managed resource. Production failures become permanent regression scenarios, simulated personas generate coverage you couldn’t script by hand, and immutable versions let you compare honestly across agent releases.

The pricing bug that a broker caught last quarter is a test case now. Every change to the system prompt, every tool description update, and every model swap gets evaluated against it. The suite accumulates institutional knowledge about how your agent has failed, and it holds every future version accountable to that history.

To get started, see the Amazon Bedrock AgentCore documentation and the Market Trends Agent sample.

About the authors