2026-06-07 17:05 UTC | In-site rewrite | 5 min read | Updated: 2026-06-30 13:03 UTC

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

In this tutorial, we use GEPA as a reflective prompt-evolution framework to improve how a small language model solves multi-step arithmetic word problems. We start from a weak seed prompt, build a deterministic benchmark, and define a structured evaluator that returns actionable feedback. A multi-component setup evolves both the instruction field and the output-format rules together. We then compare the baseline and optimized prompts on a held-out validation set to check whether the gains generalize.

Source: MarkTechPost | Author: Sana Hassan

In this tutorial, we use GEPA as a reflective prompt-evolution framework to improve the way a language model solves arithmetic word problems. We begin with a weak seed prompt, create a small deterministic benchmark, define a structured evaluator, and pass actionable feedback to GEPA so it can understand why a candidate prompt fails. We also use a multi-component prompt setup in which both the instruction field and the output-format rules evolve together. By the end, we compare the baseline prompt with the optimized prompt on a held-out validation set and inspect how the evolutionary process improves performance.

Installing GEPA and LiteLLM and Configuring the Task and Reflection Models


    !pip install -q gepa litellm

    import os, re, json, random, getpass, textwrap

    import litellm
    import gepa.optimize_anything as oa
    from gepa.optimize_anything import (
        optimize_anything,
        GEPAConfig,
        EngineConfig,
        ReflectionConfig,
    )

    litellm.suppress_debug_info = True

    # Prompt for the key only when it is not already in the environment.
    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

    TASK_LM = "openai/gpt-4o-mini"    # model that solves the problems
    REFLECTION_LM = "openai/gpt-4.1"  # model that proposes improved prompts
    MAX_METRIC_CALLS = 100            # evaluation budget for the optimizer

We install GEPA and LiteLLM, then import the libraries needed for prompt optimization and model calls. We read the OpenAI API key from the environment, prompting for it with getpass only when it is missing, and define two models: a task model that solves the problems and a reflection model that improves the prompt. We also cap the run at 100 metric calls (MAX_METRIC_CALLS) to bound cost and run time.

Building a Deterministic Arithmetic Benchmark Dataset


    def make_problems(n, seed=0):
        """Generate n word problems whose integer answers are computed in code."""
        rng = random.Random(seed)
        out = []
        for _ in range(n):
            t = rng.choice(["discount", "travel", "wallet", "chain"])
            if t == "discount":
                unit = rng.choice([40, 60, 80, 120])
                qty = rng.choice([5, 6, 8, 10])
                disc = rng.choice([10, 20, 25, 50])
                total = unit * qty
                gold = total - total * disc // 100
                q = (f"A shop sells notebooks at {unit} rupees each. You buy {qty} "
                     f"notebooks and get a {disc}% discount on the total bill. "
                     f"How many rupees do you pay in total?")
            elif t == "travel":
                s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3])
                s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3])
                gold = s1 * h1 + s2 * h2
                q = (f"A car drives at {s1} km/h for {h1} hours, then at {s2} km/h "
                     f"for {h2} hours. What is the total distance travelled, in km?")
            elif t == "wallet":
                tens = rng.choice([3, 5, 7, 9])
                fifties = rng.choice([2, 4, 6])
                spent = rng.choice([50, 80, 110, 150])
                gold = tens * 10 + fifties * 50 - spent
                q = (f"You have {tens} ten-rupee notes and {fifties} fifty-rupee "
                     f"notes. You spend {spent} rupees. How many rupees are left?")
            else:
                x = rng.choice([6, 9, 12, 15])
                y = rng.choice([4, 7, 10])
                z = rng.choice([3, 8, 11])
                gold = x * 2 - y + z
                q = (f"Start with the number {x}. Double it, then subtract {y}, "
                     f"then add {z}. What number do you end with?")
            out.append({"question": q, "answer": gold})
        return out


    all_problems = make_problems(18, seed=42)
    random.Random(1).shuffle(all_problems)
    trainset = all_problems[:12]
    valset = all_problems[12:]
    print(f"Dataset: {len(trainset)} train / {len(valset)} val problems\n")

We create a small deterministic dataset of arithmetic word problems covering discounts, travel distance, wallet calculations, and chained operations. We compute the correct answer for each problem programmatically, which keeps the benchmark reliable and easy to evaluate. We generate 18 problems from a fixed seed, shuffle them with a second fixed seed, and split them into 12 training problems for optimization and 6 validation problems for testing generalization.
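Since the gold answers come from integer arithmetic, it can help to confirm offline that the parameter grids never produce a fractional discount, and to see whether any wallet problem ends below zero. The sketch below is a sanity check of our own, not part of the original notebook; the grids are copied from make_problems, and the variable names are illustrative.

```python
from itertools import product

# Parameter grids copied from make_problems in the tutorial.
UNITS, QTYS, DISCS = [40, 60, 80, 120], [5, 6, 8, 10], [10, 20, 25, 50]
TENS, FIFTIES, SPENT = [3, 5, 7, 9], [2, 4, 6], [50, 80, 110, 150]

# Discount cases where `total * disc // 100` would silently drop a fraction.
inexact_discounts = [
    (u, q, d)
    for u, q, d in product(UNITS, QTYS, DISCS)
    if (u * q * d) % 100 != 0
]

# Wallet cases where the amount spent exceeds the money held.
negative_wallets = [
    (t, f, s)
    for t, f, s in product(TENS, FIFTIES, SPENT)
    if t * 10 + f * 50 - s < 0
]

print("inexact discount combos:", inexact_discounts)
print("negative wallet combos :", negative_wallets)
```

Every discount combination divides evenly, so the floor division never changes an answer. One wallet combination (3 tens, 2 fifties, 150 spent) yields a gold answer of -20, which the evaluator's regex accepts because it allows a leading minus sign.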

Defining the Evaluator and Structured Feedback for GEPA


    def build_system_prompt(candidate: dict) -> str:
        # Join the two evolvable components into a single system prompt.
        return (f"{candidate['instructions']}\n\n"
                f"OUTPUT FORMAT RULES:\n{candidate['format_rules']}")


    def call_task_lm(system_prompt: str, question: str) -> str:
        # Try up to three times; the final failure is returned as text.
        for attempt in range(3):
            try:
                r = litellm.completion(
                    model=TASK_LM,
                    messages=[{"role": "system", "content": system_prompt},
                              {"role": "user", "content": question}],
                    temperature=0,
                    max_tokens=600,
                    timeout=60,
                )
                return r["choices"][0]["message"]["content"] or ""
            except Exception as e:
                if attempt == 2:
                    return f"[LM_ERROR] {e}"
        return ""


    def parse_answers(text: str):
        # fmt_val: integer after '####'; last_val: last integer anywhere in the text.
        formatted = re.search(r"####\s*(-?\d+)", text)
        all_nums = re.findall(r"-?\d+", text)
        fmt_val = int(formatted.group(1)) if formatted else None
        last_val = int(all_nums[-1]) if all_nums else None
        return fmt_val, last_val


    def evaluate(candidate: dict, example: dict):
        system = build_system_prompt(candidate)
        raw = call_task_lm(system, example["question"])
        gold = example["answer"]
        fmt_val, last_val = parse_answers(raw)

        if fmt_val is not None and fmt_val == gold:
            score, fb = 1.0, "Correct and correctly formatted."
        elif fmt_val is not None:
            score, fb = 0.0, (f"WRONG ANSWER. You output '#### {fmt_val}' but the "
                              f"correct answer is {gold}. Re-check the arithmetic and "
                              f"the order of the steps.")
        elif last_val == gold:
            score, fb = 0.5, (f"Right number ({gold}) but FORMAT VIOLATION: the final "
                              f"line was not exactly '#### {gold}'. Always end with a "
                              f"line of the form '#### <number>' and nothing else.")
        else:
            score, fb = 0.0, (f"WRONG. Correct answer is {gold}. The model's final "
                              f"number was {last_val}. Likely a multi-step reasoning "
                              f"slip; show each step and verify before answering.")

        oa.log(f"score={score} gold={gold} parsed_fmt={fmt_val} parsed_last={last_val}")
        side_info = {
            "feedback": fb,
            "problem": example["question"],
            "gold_answer": gold,
            "model_output": raw[:500],
        }
        return score, side_info


    def eval_set(candidate, dataset, label=""):
        # Returns (average score, fraction of answers both correct and formatted).
        scores, exact = [], 0
        for ex in dataset:
            s, _ = evaluate(candidate, ex)
            scores.append(s)
            if s == 1.0:
                exact += 1
        acc = exact / len(dataset)
        avg = sum(scores) / len(dataset)
        print(f" [{label}] avg_score={avg:.3f} exact_correct+formatted={exact}/{len(dataset)}")
        return avg, acc

We define how the candidate prompt is converted into a system prompt and how the task model receives each question. We also create the evaluator that parses the model output, checks whether the final answer follows the required #### format, and assigns a score. We return structured feedback as actionable side information so that GEPA can determine whether the issue is incorrect reasoning, poor formatting, or both.
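The scoring rubric can be exercised without any model call. The sketch below copies parse_answers from the tutorial and adds score_output, a hypothetical helper of our own that mirrors the three score levels in evaluate; the sample outputs are hand-written for illustration.

```python
import re


def parse_answers(text):
    """Return (integer after '####', last integer in the text); either may be None."""
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val


def score_output(raw, gold):
    """Offline mirror of the rubric (hypothetical helper): 1.0, 0.5, or 0.0."""
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None:
        # A '####' value is present, so it alone decides the score.
        return 1.0 if fmt_val == gold else 0.0
    # No '####' value: half credit only if the last number is the gold answer.
    return 0.5 if last_val == gold else 0.0


# Hand-written model outputs for a problem whose gold answer is 300.
samples = {
    "correct + formatted": "80 * 5 = 400; 25% off leaves 300\n#### 300",
    "correct, no marker": "You pay 300 rupees.",
    "formatted but wrong": "#### 280",
    "trailing number": "You pay 300 of the original 400.",
    "no number at all": "I am not sure.",
}
scores = {label: score_output(raw, gold=300) for label, raw in samples.items()}
for label, s in scores.items():
    print(f"{label:22s} -> {s}")
```

The "trailing number" case shows a limit of the half-credit branch: it looks only at the last integer in the output, so a correct answer followed by another number scores 0.0.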

Configuring GEPA and Running the Prompt Optimization


    seed_candidate = {
        "instructions": "Solve the math problem.",
        "format_rules": "Give the answer.",
    }

    print("=== BASELINE (seed prompt) ===")
    print("Train:")
    base_train = eval_set(seed_candidate, trainset, "train")
    print("Val: ")
    base_val = eval_set(seed_candidate, valset, "val")
    print()

    objective = (
        "Evolve a system prompt (the 'instructions' and 'format_rules' fields) so a "
        "small LLM reliably solves multi-step arithmetic word problems AND always "
        "ends with a line of exactly the form '#### <number>'. Maximize the score."
    )

    background = (
        "Scoring: 1.0 = correct number in the exact '#### <number>' format; 0.5 = correct "
        "number but wrong/missing format; 0.0 = wrong number. Common failures are (a) not "
        "emitting the '####' line, and (b) order-of-operations or multi-step slips. The "
        "winning prompt should force explicit step-by-step work, a verification step, and "
        "a strict final-answer line."
    )

    config = GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=MAX_METRIC_CALLS,
            max_workers=4,
            parallel=True,
            display_progress_bar=True,
            seed=0,
        ),
        reflection=ReflectionConfig(
            reflection_lm=REFLECTION_LM,
        ),
    )

    print("=== RUNNING GEPA (this calls the LLMs; ~1-4 min) ===")
    result = optimize_anything(
        seed_candidate=seed_candidate,
        evaluator=evaluate,
        dataset=trainset,
        valset=valset,
        objective=objective,
        background=background,
        config=config,
    )

We start with a weak seed prompt and evaluate its baseline performance on both the training and validation sets. We then define the optimization objective, background scoring rules, and GEPA configuration, including parallel evaluation and the reflection model. Finally, we run optimize_anything so GEPA can evolve the instruction and format-rule fields using the evaluator feedback.
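To make the multi-component idea concrete, the sketch below assembles the system prompt the task model would see for three candidates. build_system_prompt and the seed candidate are taken from the tutorial; illustrative_candidate is a hand-written example of what a stronger candidate might look like, not output from a GEPA run.

```python
def build_system_prompt(candidate):
    """Join the two evolvable fields into one system prompt (as in the tutorial)."""
    return (f"{candidate['instructions']}\n\n"
            f"OUTPUT FORMAT RULES:\n{candidate['format_rules']}")


seed_candidate = {
    "instructions": "Solve the math problem.",
    "format_rules": "Give the answer.",
}

# Hypothetical, hand-written candidate for illustration only (NOT GEPA output).
illustrative_candidate = {
    "instructions": ("Solve the word problem step by step. Write each arithmetic "
                     "operation on its own line and re-check the result."),
    "format_rules": ("End with one final line containing '####', a space, and the "
                     "integer answer, for example '#### 300'. Write nothing after it."),
}

# Each field is a separate component, so one can change while the other stays fixed.
mixed_candidate = {**seed_candidate,
                   "format_rules": illustrative_candidate["format_rules"]}

for name, cand in [("seed", seed_candidate),
                   ("mixed", mixed_candidate),
                   ("illustrative", illustrative_candidate)]:
    print(f"--- {name} ---\n{build_system_prompt(cand)}\n")
```

Keeping the fields separate means feedback about a missing '####' line can be addressed in format_rules without rewriting the reasoning instructions, and the reverse.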

Comparing the Baseline and GEPA-Optimized Prompts on the Validation Set


    best = result.best_candidate

    print("\n" + "=" * 78)
    print("OPTIMIZED CANDIDATE")
    print("=" * 78)
    print("\n--- instructions ---\n" + textwrap.fill(best["instructions"], 96))
    print("\n--- format_rules ---\n" + textwrap.fill(best["format_rules"], 96))

    print("\n" + "=" * 78)
    print("BEFORE vs AFTER (held-out validation set)")
    print("=" * 78)
    print("Seed prompt:")
    _ = eval_set(seed_candidate, valset, "val-seed")
    print("GEPA prompt:")
    best_val = eval_set(best, valset, "val-gepa")
    print(f"\nBaseline val avg_score : {base_val[0]:.3f}")
    print(f"GEPA val avg_score     : {best_val[0]:.3f}")

    print("\n" + "=" * 78)
    print("EVOLUTION HISTORY (candidate index -> val score, parents)")
    print("=" * 78)
    cands = getattr(result, "candidates", [])
    vscores = getattr(result, "val_aggregate_scores", [])
    parents = getattr(result, "parents", [None] * len(cands))
    for i, sc in enumerate(vscores):
        par = parents[i] if i < len(parents) else None
        # Guard the index in case fewer candidates than scores are exposed.
        tag = " <-- BEST" if i < len(cands) and cands[i] == best else ""
        print(f" cand {i:2d}: val_score={sc:.3f} parents={par}{tag}")

    print(f"\nTotal metric calls used : {getattr(result, 'total_metric_calls', 'n/a')}")
    print(f"Full validation evals : {getattr(result, 'num_full_val_evals', 'n/a')}")
    print("\nDone. Try raising MAX_METRIC_CALLS or swapping REFLECTION_LM for a stronger model.")

We extract the best prompt found by GEPA and print its optimized instruction and format-rule components. We compare the seed prompt and the GEPA-optimized prompt on the held-out validation set to check whether the improvement transfers to unseen examples. We also inspect the evolution history, validation scores, parent relationships, and total metric calls to understand how the prompt improved over the course of optimization.
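When reading the before/after numbers, it is worth remembering how coarse the validation score is: with six problems and per-problem scores of 0.0, 0.5, or 1.0, the average can only move in steps of 0.5/6. The sketch below uses a hypothetical helper, compare_runs, to put a gain in those terms. The input tuples are made-up placeholders in the (avg_score, exact_accuracy) shape returned by eval_set, not results from an actual run.

```python
def compare_runs(base, opt, n_examples):
    """Summarize the gain between two (avg_score, exact_accuracy) tuples.

    Hypothetical helper: assumes per-example scores are 0.0, 0.5, or 1.0,
    so the average changes in increments of 0.5 / n_examples.
    """
    step = 0.5 / n_examples
    delta_avg = opt[0] - base[0]
    return {
        "delta_avg": delta_avg,
        "delta_acc": opt[1] - base[1],
        "score_step": step,
        # How many half-point improvements the gain corresponds to.
        "half_point_steps": round(delta_avg / step),
    }


# Placeholder inputs for illustration only; substitute the tuples from eval_set.
placeholder_base = (0.25, 0.0)
placeholder_opt = (1.0, 1.0)
summary = compare_runs(placeholder_base, placeholder_opt, n_examples=6)
for key, value in summary.items():
    print(f"{key:17s}: {value}")
```

Because one problem changes the average by up to about 0.17 on a six-problem set, small differences between prompts are better confirmed on a larger validation split.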

In conclusion, we used GEPA to show how prompt optimization can move beyond manual trial and error. We created a complete workflow in which a task model solves examples, an evaluator scores the outputs, and a reflection model uses detailed feedback to propose better prompts. We also tested the optimized prompt on unseen validation problems, which helps us assess whether the improvement generalizes rather than merely fitting the training set. Overall, we built a practical example of reflective prompt evolution in which structured feedback, strict evaluation, and iterative refinement work together to produce a stronger, more reliable prompt.

