
Teaching AI to Reason About Software

A team at AWS trained a small language model on Soteria's symbolic execution traces, beating a model four times its size at catching bugs in C. We explore what they did and why it matters for the future of AI-assisted software engineering.

Source: Hacker News AI · Author: giltho



Azalea Raad, CEO · 1 July 2026


A team at Amazon Web Services (AWS), working independently of Soteria, recently published research that uses Soteria to teach AI models to reason about code. Their headline result is striking: by continuing to train an 8-billion-parameter model on just a few thousand symbolic execution traces from Soteria, they made it better at catching bugs in C than a model four times its size.

Specifically, by training models on the rich semantic execution traces produced by Soteria, they showed significant improvements in the models’ ability to understand code and detect correctness violations. In other words, they showed that Soteria can teach AI how programs actually work!

In this post, we take a closer look at that work, explain why Soteria’s traces are uniquely valuable as AI training data, and explore what this means for the future of AI-assisted software engineering.

Today's AI training: Source Code → Model Training → Code Generation

With Soteria: Source Code → Symbolic Execution → Semantic Traces → Model Training → Program Reasoning → Better Code Generation

Modern AI coding assistants are transforming how, and how quickly, software is developed, allowing engineers to move faster than ever before. They can generate entire functions, implement complex APIs and produce substantial software systems from a short prompt.

Despite this progress, today’s AI systems still suffer from a fundamental limitation. They can generate code that looks correct, but they often struggle to understand whether that code is actually correct.

Large language models learn from enormous quantities of source code. They see billions of lines of software from repositories, documentation, tutorials and technical discussions. However, software engineering is not simply about producing syntactically valid code. A human software engineer develops an understanding of programs by observing how they run. We learn through debugging sessions, execution traces, test failures and careful inspection of program state.

This process teaches us something deeper than syntax. It teaches us semantics!

Furthermore, today’s frontier models (such as Mythos) have demonstrated just how powerful AI can be when applied to program analysis and bug detection. However, these models are extremely large and expensive to train and deploy. If we can teach smaller, more efficient models to reason about programs just as effectively, or further improve the reasoning capabilities of state-of-the-art models such as Mythos, we can make high-quality AI-assisted software verification both more capable and significantly more accessible.

The AWS researchers measured this gap directly. They built an evaluation of 500 verification tasks in C, covering memory safety, overflows, termination, reachability and data races, drawn from SV-COMP 2025, a software-verification benchmark. For each task, a model has to decide one thing: does the property hold, or is it violated?

Across 14 models from six families, a clear pattern emerged: models are very good at confirming that a property holds (most scoring above 90%) but much worse at detecting when one is violated. Four of the 14 models caught fewer than half of the real bugs, and accuracy fell off sharply as programs grew longer: a model dropped below 10% on programs of just 100–200 lines!
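The asymmetry described above is easy to miss when only overall accuracy is reported, so it helps to score the two classes separately. Here is a minimal sketch in C of that bookkeeping. The `Outcome` and `Score` types and both function names are our own illustration, not the format or code used in the paper.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one verification task: ground truth and model verdict. */
typedef struct {
    bool violated;            /* ground truth: is the property violated? */
    bool predicted_violated;  /* the model's answer */
} Outcome;

/* Per-class tallies, kept apart so a bias towards "holds" cannot hide. */
typedef struct {
    size_t holds_total, holds_correct;
    size_t violated_total, violated_correct;
} Score;

/* Count correct answers separately for safe and for violated programs. */
Score score_outcomes(const Outcome *outcomes, size_t n) {
    Score s = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        bool correct = (outcomes[i].violated == outcomes[i].predicted_violated);
        if (outcomes[i].violated) {
            s.violated_total++;
            if (correct) s.violated_correct++;
        } else {
            s.holds_total++;
            if (correct) s.holds_correct++;
        }
    }
    return s;
}

/* Whole-number percentage of correct answers in one class; 0 for an empty class. */
unsigned percent(size_t correct, size_t total) {
    return total == 0 ? 0u : (unsigned)(100 * correct / total);
}
```

A model that always answers "holds" scores perfectly on safe programs and detects no violations at all, which is why the two rates are worth reporting side by side.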

Here is the kind of program that trips models up:

    extern unsigned int nondet_uint(void);

    int main() {
        unsigned int x = nondet_uint();
        if (x > 0) {
            while (x != 0) {
                x = x - 2;
            }
        }
        return 0;
    }

Does this loop always terminate? When x is odd, the repeated subtraction wraps around past zero and the loop runs forever. One 675-billion-parameter model spotted the unsigned wrapping but still concluded, incorrectly, that the loop must reach zero. The code is short; it’s the reasoning about how it executes that’s hard.
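The reasoning behind that answer can be made concrete. Unsigned arithmetic in C wraps modulo a power of two, so subtracting 2 never changes whether x is odd or even: an odd x stays odd forever and can never equal zero. The sketch below is our own illustration (both function names are hypothetical). It pairs that parity argument with a bounded simulation of the loop for cross-checking on small inputs.

```c
#include <stdbool.h>

/* Subtracting 2 modulo a power of two preserves parity, so the loop
   `while (x != 0) x = x - 2;` reaches zero exactly when x starts even. */
bool loop_terminates(unsigned int x) {
    return x % 2u == 0u;
}

/* Bounded simulation of the same loop. Returns the number of iterations
   needed to reach zero, or -1 if zero is not reached within max_steps
   iterations. A result of -1 is only evidence, not a proof, of
   non-termination; the parity argument above supplies the proof. */
long steps_to_zero(unsigned int x, long max_steps) {
    long steps = 0;
    while (x != 0u) {
        if (steps == max_steps) {
            return -1;
        }
        x = x - 2u;  /* well-defined wraparound for unsigned types */
        steps++;
    }
    return steps;
}
```

For example, `steps_to_zero(6, 100)` returns 3, while `steps_to_zero(1, 1000)` gives up and returns -1, because the value wraps to the largest unsigned int and keeps stepping through odd numbers.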

That is the gap. AI has learned to write code. The harder problem is teaching it to understand how code runs, and once it does, to find bugs before they reach production.

This is where Soteria enters the picture. Soteria is a symbolic execution tool designed to reason about software behaviour at a deep semantic level. Rather than executing a program using a single concrete input, symbolic execution systematically explores many possible execution paths and reasons about what can happen across all of them. (You can read more about Soteria in our previous post.)
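To give a feel for what exploring paths means, here is a deliberately tiny sketch in C. It is not how Soteria represents program state. It assumes a toy setting with one unsigned variable whose path condition is kept as an inclusive range, and the `PathCond` type and `assume_*` helpers are hypothetical names. At a branch such as `if (x > 0)`, the symbolic executor follows both sides and narrows the condition on each.

```c
#include <limits.h>
#include <stdbool.h>

/* Toy path condition over one unsigned variable: lo <= x <= hi.
   Real symbolic execution tools track far richer constraints. */
typedef struct {
    unsigned int lo;
    unsigned int hi;
    bool feasible;  /* false when no value of x can follow this path */
} PathCond;

/* Path condition after taking the "true" side of the branch `x > k`. */
PathCond assume_gt(PathCond pc, unsigned int k) {
    if (!pc.feasible || k == UINT_MAX || pc.hi <= k) {
        pc.feasible = false;
        return pc;
    }
    if (pc.lo <= k) {
        pc.lo = k + 1u;
    }
    return pc;
}

/* Path condition after taking the "false" side of `x > k`, i.e. x <= k. */
PathCond assume_le(PathCond pc, unsigned int k) {
    if (!pc.feasible || pc.lo > k) {
        pc.feasible = false;
        return pc;
    }
    if (pc.hi > k) {
        pc.hi = k;
    }
    return pc;
}
```

Starting from an unconstrained x (the range 0 to UINT_MAX), `assume_gt` with k = 0 yields the range 1 to UINT_MAX for the path that enters the loop, and `assume_le` yields the single value 0 for the path that skips it. Every input falls on exactly one of the two paths, which is what lets this style of analysis cover all inputs rather than a single test case.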

Along the way, Soteria records a great deal: program states, symbolic values, path conditions, branch decisions and the precise circumstances under which a property is violated. It captures why a bug occurs, with traces that can be exported as machine-readable JSON or interactive HTML.

This transforms symbolic execution from a verification technology into a source of training data! Rather than learning solely from source code, AI models can learn from the reasoning process itself.

Below is a snippet of a Soteria trace for the program above. It shows the path taken through the program, the symbolic values of variables, and the conditions under which the loop does not terminate.

[Figure: Example of Soteria’s HTML trace output]

The researchers asked a simple question:

Can symbolic execution traces help AI reason about software more effectively?

To find out, they ran Soteria over open-source C code – filtered down from millions of files in the public CodeParrot dataset – and collected the resulting traces. They then used those traces for continued pretraining of Qwen3-8B, an open-weights model. Importantly, the training data came from arbitrary open-source code, not the benchmark, so any improvement reflects general skill rather than memorised answers.

The most exciting part of the story is that this approach works remarkably well! In the evaluation done by the AWS researchers, models trained using Soteria-generated traces demonstrated substantially stronger performance on software verification and correctness tasks.

The combination that worked best was training on Soteria’s bug traces, then letting the model reason step by step at inference time. It improved violation detection by 17.9 percentage points over the baseline! This is a super-additive effect: reasoning on its own worsened results (-1.4 points), and training on the traces alone improved results only modestly (+7.3 points). Together, they produced a much larger improvement than either could achieve alone.
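The super-additive claim is simple arithmetic on the three figures quoted above. If the two effects merely added up, we would expect -1.4 + 7.3 = 5.9 points; the measured 17.9 points exceeds that by 12.0. A small sketch in C makes the check explicit. The function names are ours, and values are held in tenths of a percentage point so that the integer arithmetic stays exact.

```c
/* All values are in tenths of a percentage point (e.g. 179 means 17.9 points). */

/* Improvement expected if two interventions simply added up. */
int additive_expectation(int effect_a, int effect_b) {
    return effect_a + effect_b;
}

/* How far the measured combined effect exceeds the additive expectation.
   A positive result indicates a super-additive interaction. */
int interaction(int combined, int effect_a, int effect_b) {
    return combined - additive_expectation(effect_a, effect_b);
}
```

With the figures from this post, `interaction(179, -14, 73)` evaluates to 120, i.e. 12.0 percentage points beyond what the two effects would give independently.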

Violation detection before and after training on Soteria traces (violations correctly detected):

Qwen3-8B baseline: 49%
Qwen3-8B + reasoning: 48%
Qwen3-32B (4× larger): 57%
Qwen3-8B + Soteria (~3k traces, + reasoning): 67%
Qwen3-32B (4× larger, + reasoning): 78%

Share of real violations correctly detected. Reasoning alone, without the traces, does nothing (48%); training the 8B model on ~3,000 Soteria traces and letting it reason lifts it past the plain 32B four times its size, though the 32B, given the same chance to reason, stays ahead. Figures from the paper (rounded); overall accuracy on safe programs is preserved.

Furthermore, the trained 8B model detected violations more reliably (67%) than the untrained Qwen3-32B (57%), a model four times its size (though with reasoning disabled for the larger model).

Perhaps even more importantly, the training generalised: the traces targeted memory safety and overflows, yet performance improved across all five property types, including termination, data races and reachability. The model was learning general reasoning patterns, not memorising bug shapes!

The first generation of coding AI learned from source code. This work shows that the next generation can also learn from program semantics, from records of how code actually executes.

That matters because the tasks we increasingly ask of AI, such as reviewing pull requests, debugging failures, and maintaining and refactoring large systems, all depend on reasoning about behaviour, not just producing plausible text. Symbolic execution is one of the few techniques that can generate this kind of reasoning data automatically, at scale, without human labelling.

That is why we believe Soteria will become much more than a verification platform. It will become part of the infrastructure used to train AI systems that understand software at a fundamentally deeper level, enabling them to reason about code correctness, explain its behaviour and find bugs before they reach production.

This will create a powerful virtuous cycle. AI generates code. Soteria analyses that code and produces semantic traces. Those traces help train better AI models. Better AI models generate higher-quality code, which in turn produces richer opportunities for reasoning, verification and further learning. Over time, this cycle will improve both software quality and AI capability simultaneously. Verification and AI will therefore become tightly integrated components of a single development workflow.

A semantic feedback loop: AI writes code → Soteria analyses code → Soteria generates semantic traces → AI learns program semantics → AI produces better code

This vision is already beginning to take shape, but we believe we are only scratching the surface. Our long-term goal is ambitious: to run Soteria at ecosystem scale across the entirety of crates.io (Rust’s package registry), generating the largest semantic dataset ever assembled for Rust code. By analysing every crate on crates.io, Soteria will accumulate an unprecedented collection of high-quality execution traces, capturing how millions of lines of real-world code behave rather than simply how they are written. The progress we have made so far gives us every reason to be optimistic!

The future of software engineering will be shaped by the interaction between AI and formal verification. We believe the most powerful systems will combine the creativity and scalability of modern language models with the rigour and precision of formal reasoning.

Soteria was built at that intersection.

We are only beginning to explore what becomes possible when machines learn not just how to write programs, but how to reason about them. And we think this future is arriving sooner than most realise!

We thank the AWS researchers who did the research presented in this blog post, and particularly Stefan Zetzsche for his help in reviewing this post.