Microsoft's SkillOpt boosts GPT-5.5 by using nothing but a trained Markdown file

Microsoft and three Chinese universities have developed SkillOpt, a method that optimizes instruction documents for AI agents using principles from traditional model training. A simple Markdown file is enough to boost GPT-5.5 by about 23 points on procedural tasks, and the same file transfers across models and agent environments like Codex and Claude Code.

Source: The Decoder | Author: Jonathan Kemper

A simple Markdown file is apparently enough to boost GPT-5.5 by more than 20 points on procedural tasks. That's the promise of SkillOpt, a method from Microsoft and three Chinese universities that trains instruction documents for AI agents the same way model weights get trained.

These kinds of instruction documents, known as "skills," are already common in commercial products. Anthropic, for example, added a modular skill system to Claude last year that automatically loads topic-specific instructions, scripts, and resources depending on the task.

Skills typically bundle procedures, tool-use rules, output formats, and known failure patterns, and they've become a standard approach. Until now, according to the Microsoft team's paper, they were either written by hand, generated in a single pass by a language model, or loosely self-revised. None of these approaches behaves like a real optimizer, and none guarantees the skill actually improves.

[Figure: SkillOpt trains the skill document like model weights, only keeping changes that measurably pay off. | Image: Yang et al.]

The skill document becomes a trainable state

SkillOpt treats the skill document as an external, trainable state for a frozen target model. A second, separate language model acts as the optimizer. It reads logs from the agent's runs, spots recurring error and success patterns, and proposes limited edits to the skill: adding, deleting, or replacing individual passages. Each change is only accepted if it performs better on a held-out validation set.
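The accept-or-reject loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Edit` class, `apply_edit`, `gated_step`, and the toy scorer are all hypothetical names, and the sketch assumes edits are evaluated one at a time, which the article does not specify. In a real setup, `score` would run the frozen target model with the candidate skill on held-out validation tasks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded change to the skill text (hypothetical representation)."""
    op: str           # "add", "delete", or "replace"
    target: str = ""  # passage to delete or replace; unused for "add"
    text: str = ""    # new passage for "add" or "replace"


def apply_edit(skill: str, edit: Edit) -> str:
    """Return a new skill string with a single edit applied."""
    if edit.op == "add":
        return skill.rstrip("\n") + "\n" + edit.text + "\n"
    if edit.op == "delete":
        return skill.replace(edit.target, "", 1)
    if edit.op == "replace":
        return skill.replace(edit.target, edit.text, 1)
    raise ValueError(f"unknown op: {edit.op}")


def gated_step(
    skill: str,
    edits: List[Edit],
    score: Callable[[str], float],
) -> Tuple[str, float, List[Edit]]:
    """Keep an edit only if it strictly improves the validation score."""
    best = score(skill)
    rejected: List[Edit] = []
    for edit in edits:
        candidate = apply_edit(skill, edit)
        candidate_score = score(candidate)
        if candidate_score > best:
            skill, best = candidate, candidate_score
        else:
            rejected.append(edit)  # kept as a negative example
    return skill, best, rejected


def toy_score(skill: str) -> float:
    """Stand-in for a held-out validation run (assumption: a real scorer
    would execute the frozen agent on validation tasks)."""
    rules = ["check the worksheet structure", "write evaluated values"]
    return sum(rule in skill.lower() for rule in rules) / len(rules)
```

With a starting skill that contains neither rule, an edit that adds one of them is accepted, while a rewording that leaves the score unchanged is rejected and returned for later use.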

The authors map several deep learning concepts onto the text level. A kind of learning rate caps how many edits can land per step. A scheduler shrinks the step size across epochs. Rejected edits go into a buffer and serve as negative examples for later reflection. A slow update at the end of each epoch preserves stable edit directions across training rounds, similar to how gradient smoothing works in traditional training.
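Two of these controls, the edit cap and the rejected-edit buffer, are easy to picture in code. The sketch below is an illustration under assumptions: the linear decay in `edit_budget`, the default values, and the `RejectedBuffer` class are hypothetical, since the article does not give the paper's actual schedule or buffer format. The slow end-of-epoch update is not modeled here.

```python
from collections import deque
from typing import Deque, List, Tuple


def edit_budget(epoch: int, total_epochs: int,
                max_edits: int = 4, min_edits: int = 1) -> int:
    """Per-step cap on edits, decayed linearly across epochs.

    Plays the role of a learning rate plus scheduler; the linear
    shape and the defaults are assumptions for illustration.
    """
    remaining = max_edits * (total_epochs - epoch) // total_epochs
    return max(min_edits, remaining)


def clip_proposals(proposals: List[str], budget: int) -> List[str]:
    """Drop any proposed edits beyond the current budget."""
    return proposals[:budget]


class RejectedBuffer:
    """Bounded memory of rejected edits, replayed as negative examples."""

    def __init__(self, capacity: int = 8) -> None:
        # A deque with maxlen silently discards the oldest entry when full.
        self._items: Deque[Tuple[str, float]] = deque(maxlen=capacity)

    def add(self, summary: str, delta: float) -> None:
        """Record an edit summary and its change in validation score."""
        self._items.append((summary, delta))

    def as_prompt_section(self) -> str:
        """Format the buffer as text for the optimizer model's prompt."""
        if not self._items:
            return ""
        lines = ["Previously rejected edits (do not repeat):"]
        lines += [f"- {s} (validation change: {d:+.2f})"
                  for s, d in self._items]
        return "\n".join(lines)
```

Under these assumed defaults, four epochs yield budgets of 4, 3, 2, and 1 edits per step, so early rounds can reshape the skill while later rounds only fine-tune it.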

[Figure: The target model stays frozen while a second model suggests small skill edits that are only accepted after passing validation. | Image: Yang et al.]

What makes this practical is the clean split between training and deployment. The optimizer model only runs during training, and once that's done, it's out of the picture. At inference time, the target model simply receives a plain Markdown file of 300 to 2,000 tokens as context.

Beating every comparison method consistently

The authors tested their approach on six benchmarks covering search, spreadsheets, document analysis, math, and embodied action. Seven systems served as target models, including GPT-5.5 and the much smaller Qwen3.5-4B. Tasks ran in direct chat as well as in the agent environments Codex and Claude Code.

Across every combination, SkillOpt matches or beats the best comparison result. That holds against handwritten skills, one-shot LLM-generated skills, and specialized methods like Trace2Skill, TextGrad, GEPA, and EvoSkill. On GPT-5.5 in direct chat, the average across all six benchmarks jumps by about 23 points.

The biggest gains show up on tasks with strict format requirements and tool use, like spreadsheet editing. Smaller models benefit too, which the authors take as evidence that a well-trained skill delivers procedural knowledge these models lack in their weights.

[Figure: Across training rounds, the method typically picks skills that also perform well on unseen test data. | Image: Yang et al.]

Skills transfer across models and environments

One key finding is transferability. A skill trained on a larger model also improves smaller models in the same family. A spreadsheet skill trained in the Codex loop works unchanged in Claude Code, lifting performance there to the same level as a skill trained directly in Claude Code. A math skill optimized on olympiad problems still delivers gains on a related benchmark without any retraining.

The ablation studies explain why the method stays stable. Without a bounded edit budget, the skill drifts too far with each revision. Without the buffer for rejected edits, the optimizer repeats the same failed attempts.

Removing the slow update at epoch's end costs SpreadsheetBench more than twenty points, the largest drop in the entire experiment. Only the combination of bounded step size, validation gating, negative feedback, and long-term consolidation makes skill training behave like a controlled optimization process, the authors say.

Short, readable documents do the heavy lifting

The final skills stay compact: the finished documents rarely exceed 2,000 tokens, and the improvements result from just one to four accepted edits across four training epochs. On OfficeQA, the largest gain came from a single accepted change.

The learned rules read as if an experienced practitioner had jotted them down after a day working with the benchmark. For spreadsheets, the skill learns to check the worksheet structure first and write directly evaluated values into the entire target range instead of using Excel formulas.

For ALFWorld, it keeps a log of visited locations and avoids heading to the goal before picking up the target object. For document questions, it anchors the question to the right table row before accepting an answer. None of these rules refer to a specific task. They describe procedures.

The authors acknowledge that the method depends on reliable automatic scoring. For open-ended tasks where success is hard to measure, the validation step would need human or model-based judgments. SkillOpt also deliberately optimizes a single document rather than a skill library, which could become a bottleneck for highly varied domains.

Where SkillOpt fits in the self-improvement race

While most current self-improvement approaches eventually tweak model weights, SkillOpt takes a remarkably lean path. OpenClaw-RL, a framework from Princeton researchers, uses follow-up signals from every interaction, such as user responses or test results, as a live training source.

MetaClaw pulls compact behavioral rules from failed tasks and injects them into the prompt, updating weights only during idle phases via reinforcement learning. One parallel to SkillOpt: weaker models benefit the most in both cases because they lack procedural knowledge that a rule or skill can supply directly.

Other groups go further. AutoTTS lets a coding agent search for better reasoning control algorithms on its own, shifting the human role from designing rules to designing the environment. Meta's Hyperagents optimize the very mechanism they use to improve themselves. SkillOpt, by contrast, keeps the model frozen and changes nothing but a readable text file.