2026-06-30 18:32 UTC | Original source | 4 min read | Updated: 2026-06-30 18:34 UTC

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

IBM Research introduces ScarfBench, an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java. The benchmark includes 34 applications, 102 framework implementations, and 204 migration tasks. Current top agents achieve less than 10% behavioral success, highlighting the difficulty of preserving behavior during migration.

Source: Hugging Face Blog

Article intelligence

Engineers | Advanced

Key points

ScarfBench evaluates AI agents on framework migration between Spring, Jakarta EE, and Quarkus, requiring build, deployment, and behavioral validation.
The benchmark comprises 34 applications, ~2,000 source and test files, and 1,331 expert-written tests.
Best current agents achieve under 10% behavioral success, and agents are often overconfident in their results.
Configuration and dependency management dominate migration effort, while environmental issues like Docker cache and port connectivity frequently cause failures.

Why it matters

This matters because ScarfBench measures whether a migrated application actually builds, deploys, and preserves behavior across Spring, Jakarta EE, and Quarkus, rather than comparing generated code against a reference implementation.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.

Enterprise Article

Published June 30, 2026

Raju Pavuluri (rpavuluri, ibm-research)

Rahul Krishna (rkrsn, ibm-research)

Srikanth Govindaraj Tamilselvam (stamilse, ibm-research)

Bridget M (brmcg, ibm-research)

Ashita Saxena (ashitasaxenaIBM, ibm-research)

George Safta (george-safta, ibm-research)

Advait Pavuluri (apavuluri, ibm-research)

Michele Merler (mimerler, ibm-research)

Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities.

Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains:

Can AI agents reliably modernize real-world enterprise applications?

Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.

To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.

ScarfBench focuses on migrations across three major Java ecosystems:

Spring

Jakarta EE

Quarkus

Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.

Why Migration Is Hard

Framework migration is much more than replacing annotations.

A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors. Small mistakes in any of these pieces can prevent successful deployment.

Figure: Spring → Jakarta Migration Example

Framework migration requires translating framework semantics, not just source code.
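As a rough illustration of why swapping annotations is not enough, the sketch below lists the four concerns named above next to commonly used Spring constructs and their usual Jakarta EE counterparts. The enum `MigrationConcern`, its rows, and the `needsMoreThanAnnotationSwap` flag are illustrative assumptions and judgment calls, not the benchmark's taxonomy and not a reproduction of the figure.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative, non-exhaustive Spring to Jakarta EE correspondences. */
enum MigrationConcern {
    DEPENDENCY_INJECTION(
            "@Autowired field in a @Service class",
            "@Inject field in an @ApplicationScoped CDI bean",
            false),
    PERSISTENCE_CONFIGURATION(
            "spring.datasource.* entries in application.properties",
            "META-INF/persistence.xml plus a server-managed data source",
            true),
    QUERIES(
            "Spring Data derived query such as findByName(...)",
            "explicit JPQL issued through an injected EntityManager",
            true),
    FRAMEWORK_DESCRIPTORS(
            "mostly auto-configured by Spring Boot",
            "descriptors such as beans.xml or web.xml where needed",
            true);

    final String springSide;
    final String jakartaSide;
    // Judgment call for illustration: true when swapping one annotation is not enough.
    final boolean needsMoreThanAnnotationSwap;

    MigrationConcern(String springSide, String jakartaSide, boolean needsMoreThanAnnotationSwap) {
        this.springSide = springSide;
        this.jakartaSide = jakartaSide;
        this.needsMoreThanAnnotationSwap = needsMoreThanAnnotationSwap;
    }

    /** Concerns that an annotation-only rewrite would miss. */
    static List<MigrationConcern> beyondAnnotations() {
        List<MigrationConcern> result = new ArrayList<>();
        for (MigrationConcern concern : values()) {
            if (concern.needsMoreThanAnnotationSwap) {
                result.add(concern);
            }
        }
        return result;
    }
}

final class MigrationConcernDemo {
    public static void main(String[] args) {
        for (MigrationConcern c : MigrationConcern.values()) {
            System.out.println(c + ": " + c.springSide + " -> " + c.jakartaSide);
        }
        System.out.println("Beyond annotations: " + MigrationConcern.beyondAnnotations());
    }
}
```

Even in this small table, three of the four rows involve restructuring code or configuration rather than renaming an annotation.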

Introducing ScarfBench

ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks.

Applications are required to:

Build successfully.

Deploy correctly.

Pass behavioral validation.

This provides a much more realistic measure of modernization quality.
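The three requirements can be read as a staged gate, where a later stage only counts if the earlier ones passed. The following is a minimal sketch of that idea; the names `Outcome` and `MigrationGate` are hypothetical, and treating behavioral validation as "every test passes" is an assumption, since the article does not state the exact pass criterion.

```java
/** Possible verdicts for one migration attempt (hypothetical naming). */
enum Outcome { BUILD_FAILED, DEPLOY_FAILED, TESTS_FAILED, SUCCESS }

final class MigrationGate {
    /**
     * Classifies one migration attempt by the first stage it fails.
     * Assumption: behavioral success means all tests pass and at least one test ran.
     */
    static Outcome classify(boolean built, boolean deployed, int testsPassed, int testsTotal) {
        if (!built) {
            return Outcome.BUILD_FAILED;
        }
        if (!deployed) {
            return Outcome.DEPLOY_FAILED;
        }
        if (testsTotal == 0 || testsPassed < testsTotal) {
            return Outcome.TESTS_FAILED;
        }
        return Outcome.SUCCESS;
    }

    public static void main(String[] args) {
        // An application that compiles and starts but fails one test is not a success.
        System.out.println(classify(true, true, 9, 10)); // TESTS_FAILED
    }
}
```

Under this reading, a migration that merely compiles is still classified as a failure until deployment and tests confirm it.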

Benchmark at a Glance

Metric                       Value
Applications                 34
Framework implementations    102
Migration tasks              204
Lines of code                ~151K
Source and test files        ~2,000
Expert-written tests         1,331
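One reading that is consistent with these counts (an inference from the numbers, not something the article states) is that each application exists in all three frameworks and every ordered source-to-target pair is a task. The sketch below checks that arithmetic; the class name `BenchmarkCounts` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BenchmarkCounts {
    static final List<String> FRAMEWORKS = List.of("Spring", "Jakarta EE", "Quarkus");

    /** All ordered (source, target) pairs with distinct frameworks. */
    static List<String> orderedPairs() {
        List<String> pairs = new ArrayList<>();
        for (String source : FRAMEWORKS) {
            for (String target : FRAMEWORKS) {
                if (!source.equals(target)) {
                    pairs.add(source + " -> " + target);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int applications = 34;
        System.out.println(applications * FRAMEWORKS.size());     // 102 implementations
        System.out.println(applications * orderedPairs().size()); // 204 migration tasks
    }
}
```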

ScarfBench includes both focused migration tasks and whole-application migrations.

Figure: ScarfBench Construction Pipeline

Starting from a JSR-based enterprise Java taxonomy, experts migrate each application to produce verified implementations across Spring, Jakarta EE, and Quarkus.

How Do Frontier Agents Perform?

We evaluated several state-of-the-art coding agents on ScarfBench.

Despite strong performance on traditional software engineering benchmarks, framework migration remains difficult. Success rates vary considerably across framework pairs, and whole-application migrations remain particularly challenging.

Figure: Current Leaderboard

Source: scarfbench.info/leaderboard

Even the strongest current agents achieve less than 10% behavioral success, illustrating the gap between generating compilable code and preserving application behavior.

Figure: Compile → Deploy → Test Progression

Compile success consistently exceeds deploy success, which in turn exceeds behavioral success. Build success alone significantly overestimates migration quality.
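To make the funnel concrete, here is a small sketch that computes per-stage rates and the gap between build success and behavioral success. The counts in `main` are placeholder numbers chosen for easy arithmetic, not ScarfBench results, and the class `MigrationFunnel` and the assumption that each stage is a subset of the previous one are illustrative.

```java
final class MigrationFunnel {
    final int tasks;
    final int compiled;
    final int deployed;
    final int behaviorallyCorrect;

    MigrationFunnel(int tasks, int compiled, int deployed, int behaviorallyCorrect) {
        // Assumption: a task cannot deploy without compiling, or pass tests without deploying.
        boolean nested = tasks > 0
                && compiled <= tasks
                && deployed <= compiled
                && behaviorallyCorrect <= deployed
                && behaviorallyCorrect >= 0;
        if (!nested) {
            throw new IllegalArgumentException("each stage must be a subset of the previous one");
        }
        this.tasks = tasks;
        this.compiled = compiled;
        this.deployed = deployed;
        this.behaviorallyCorrect = behaviorallyCorrect;
    }

    double percentOfTasks(int count) {
        return 100.0 * count / tasks;
    }

    /** Percentage points by which build success overstates behavioral success. */
    double buildOverestimate() {
        return percentOfTasks(compiled) - percentOfTasks(behaviorallyCorrect);
    }

    public static void main(String[] args) {
        MigrationFunnel funnel = new MigrationFunnel(100, 80, 40, 8); // placeholder counts
        System.out.println("compile %: " + funnel.percentOfTasks(funnel.compiled));
        System.out.println("deploy %: " + funnel.percentOfTasks(funnel.deployed));
        System.out.println("behavior %: " + funnel.percentOfTasks(funnel.behaviorallyCorrect));
        System.out.println("overestimate (points): " + funnel.buildOverestimate());
    }
}
```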

Figure: Migration Outcomes by Target Framework

Migration difficulty depends strongly on the target framework, with Jakarta EE proving particularly challenging.

What We Learned About AI Agents for Java Modernization

Beyond measuring success rates, ScarfBench helps us understand how agents behave during modernization.

Can Agents Reliably Tell When a Migration Is Complete?

A migrated application is only useful if it actually builds and runs.

We therefore compared agent-reported outcomes against independent build verification.

Finding: Agents Are Overconfident

Claude Code reported successful builds for 29 out of 30 whole applications.

Only 22 of those applications actually built successfully.

Meanwhile, the single application classified as failed by the agent ultimately built correctly.

This suggests that agent self-assessment should not be treated as a reliable signal of migration completion.

Independent build and test validation remains essential.
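The three figures above are enough to fill in a full comparison of agent claims against verified builds. The sketch below derives the remaining cells; only the four constants come from the article, and the class and method names are hypothetical.

```java
import java.util.Locale;

final class SelfReportCheck {
    // Figures quoted in the article for Claude Code on whole-application migrations.
    static final int TOTAL = 30;
    static final int REPORTED_SUCCESS = 29;
    static final int REPORTED_SUCCESS_AND_BUILT = 22;
    static final int REPORTED_FAILURE_AND_BUILT = 1;

    /** Applications the agent called successful that did not build. */
    static int falseSuccessClaims() {
        return REPORTED_SUCCESS - REPORTED_SUCCESS_AND_BUILT;
    }

    /** Applications that built, regardless of what the agent reported. */
    static int actuallyBuilt() {
        return REPORTED_SUCCESS_AND_BUILT + REPORTED_FAILURE_AND_BUILT;
    }

    /** Applications where the agent's verdict matched independent verification. */
    static int agreements() {
        int reportedFailure = TOTAL - REPORTED_SUCCESS;
        int reportedFailureAndFailed = reportedFailure - REPORTED_FAILURE_AND_BUILT;
        return REPORTED_SUCCESS_AND_BUILT + reportedFailureAndFailed;
    }

    static double percent(int part, int whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        System.out.println("False success claims: " + falseSuccessClaims()); // 7
        System.out.println("Actually built: " + actuallyBuilt());            // 23
        System.out.println("Verdicts that matched: " + agreements());        // 22
        System.out.println(String.format(Locale.ROOT, "Share of success claims that held: %.1f%%",
                percent(REPORTED_SUCCESS_AND_BUILT, REPORTED_SUCCESS)));
    }
}
```

In other words, 7 of the 29 success claims were wrong, so roughly three in four success reports held up under independent verification.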

How Do Agents Navigate Application Dependencies?

Framework migrations rarely affect a single file or layer.

Changes in configuration, services, databases, and web components often cascade across the application.

Finding: Migration Is Iterative Rather Than Linear

The most frequently visited layers were:

Configuration

Web

Database

Service

Common transitions included:

Configuration ↔ Web

Service ↔ Database

This suggests that migration is an iterative dependency-resolution process rather than a simple source-to-source transformation.
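One way such transitions could be tallied from an agent trajectory is sketched below. It assumes each step has already been labeled with a layer name, which the article does not describe, and the trace in `main` is invented for illustration; `LayerTransitions` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LayerTransitions {
    /** Counts undirected transitions between consecutive, distinct layers in a trace. */
    static Map<String, Integer> count(List<String> trace) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 1; i < trace.size(); i++) {
            String previous = trace.get(i - 1);
            String current = trace.get(i);
            if (previous.equals(current)) {
                continue; // staying in the same layer is not a transition
            }
            // Order the pair alphabetically so A -> B and B -> A share one key.
            String key = previous.compareTo(current) < 0
                    ? previous + " <-> " + current
                    : current + " <-> " + previous;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented trace, for illustration only.
        List<String> trace = List.of(
                "Configuration", "Web", "Configuration",
                "Service", "Database", "Service", "Configuration");
        System.out.println(count(trace));
    }
}
```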

Where Do Agents Spend Most of Their Effort?

We used layer revisit frequency as a proxy for migration effort. Layers that required repeated visits typically involved debugging, dependency resolution, or framework adaptation.

Finding: Configuration Dominates Migration Effort

Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts while resolving framework differences and dependency issues.
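A revisit metric of this kind could be computed as follows. The definition used here, the number of times an agent re-enters a layer after working somewhere else, is an assumption for illustration; the article does not give its exact formula, and `LayerRevisits` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LayerRevisits {
    /**
     * Returns, per layer, how often the trace comes back to it after leaving.
     * Consecutive steps in the same layer count as a single visit.
     */
    static Map<String, Integer> revisits(List<String> trace) {
        Map<String, Integer> entries = new LinkedHashMap<>();
        String previous = null;
        for (String layer : trace) {
            if (!layer.equals(previous)) {
                entries.merge(layer, 1, Integer::sum); // a new visit starts here
            }
            previous = layer;
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        entries.forEach((layer, visits) -> result.put(layer, visits - 1));
        return result;
    }

    public static void main(String[] args) {
        // Invented trace, for illustration only.
        List<String> trace = List.of(
                "Configuration", "Configuration", "Web",
                "Configuration", "Service", "Configuration");
        System.out.println(revisits(trace)); // {Configuration=2, Web=0, Service=0}
    }
}
```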

What Challenges Are Not About Code Transformation?

Not every migration issue originates from source code.

Finding: Environment and Tooling Matter

Agents frequently struggled with environmental issues, including:

Docker cache inconsistencies

Port connectivity problems

Maven wrapper and build tooling issues

These operational concerns often delayed validation even when the source-code migration itself was largely complete.
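Port connectivity problems in particular can be separated from application failures with a simple readiness probe before tests run. The sketch below uses only the standard `java.net.Socket` API; the class `PortProbe`, the retry policy, and the example port are assumptions and not part of the ScarfBench tooling.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    /** Polls the port a fixed number of times, pausing between attempts. */
    static boolean waitUntilOpen(String host, int port, int attempts, long pauseMillis) {
        for (int i = 0; i < attempts; i++) {
            if (isOpen(host, port, 1000)) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 8080 is only an example port; use whatever the deployed application exposes.
        System.out.println(waitUntilOpen("localhost", 8080, 5, 500));
    }
}
```

Running a probe like this before behavioral tests helps distinguish "the application never became reachable" from "the application responded incorrectly".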

Figure: Failure Mode Distribution

Modernization failures span build systems, deployment environments, dependency injection, databases, endpoints, assertions, and infrastructure.

Key Takeaway

The biggest challenge in framework modernization is not translating Java code.

It is managing the web of dependencies across configuration, infrastructure, and runtime environments.

While frontier agents can automate substantial portions of the migration process, reliable validation and architectural reasoning remain critical for achieving successful outcomes.

ScarfBench helps expose these challenges and provides a standardized way to measure progress toward truly autonomous application modernization.

Explore ScarfBench

ScarfBench is designed as an open resource for researchers and practitioners.

Resources include:

Benchmark dataset

Evaluation infrastructure

Public leaderboard

Documentation

Open-source code

Researchers can compare agent architectures and techniques. Practitioners can use ScarfBench to evaluate modernization solutions before deploying them in production environments.

Website: https://scarfbench.info

Dataset: https://huggingface.co/datasets/ibm-research/ScarfBench

Space: https://huggingface.co/spaces/ibm-research/ScarfBench

GitHub Repository: https://github.com/scarfbench/scarfbench

Leaderboard: https://scarfbench.info/leaderboard

Paper: https://arxiv.org/abs/2605.06754

Framework migration remains one of the largest unsolved problems in AI-assisted software engineering. We hope ScarfBench helps the community measure progress and accelerate the next generation of AI-assisted application modernization.

We invite researchers, practitioners, and framework communities to evaluate their agents, contribute new migration scenarios, and help advance the state of the art.
