2026-06-22 02:08 UTCIn-site rewrite8 min readUpdated: 2026-06-22 03:02 UTC

Sakana Fugu: One Model to Command Them All

Sakana AI launches Fugu, a multi-agent system that dynamically orchestrates a diverse pool of top models via a single API, achieving frontier-level performance on complex tasks like coding and reasoning without vendor lock-in. Based on ICLR 2026 papers, Fugu learns to assemble and coordinate expert agents, offering two tiers: Fugu (balanced performance and latency) and Fugu Ultra (maximized answer quality). Benchmark results rival top models, with the added benefit of no export control risk. Not yet available in EU/EEA.

SourceHacker News AIAuthor: Finbarr

Sakana Fugu

One Model to Command Them All マルチエージェントを指揮する、一つのモデル

Frontier-level performance without single-vendor dependency. Fugu dynamically orchestrates the world's best models to tackle complex, multi-step tasks. Plug collective intelligence directly into your workflows today with a single API.

Sakana Fugu は、世界のトップモデル群を動的にオーケストレーションし、複数ステップに及ぶ複雑なタスクを自動的に解決します。高いパフォーマンスを実現するAPIを、あなたのワークフローに組み込みましょう。

Start Using Sakana Fugu

今すぐはじめる

See the technology 基盤技術を見る

Not yet available in the EU/EEA while we work toward compliance with GDPR and EU-specific regulations. GDPR等のEU/EEA固有規制への対応を進めており、現在はEU・EEA域内ではご利用いただけません。

What is Sakana Fugu ?

A Multi-Agent System, Delivered as One Model マルチエージェントを、一つのモデルAPIとして提供

Sakana Fugu achieves superior performance by dynamically coordinating and orchestrating a diverse pool of powerful models. Instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns.

Sakana Fugu は、強力で多様なモデル群を動的に組み合わせ、協調させることで高いパフォーマンスを実現します。人間が思い付かないようなモデルの編成や役割分担、処理の進め方など、効率よく学習しながら成果を発揮します。

One API to Access All in an Optimized Way 一つのAPIで、複数モデルを最適に活用

Access a coordinated pool of specialized models through one API. Fugu handles model selection and switching for each task, reducing API complexity while improving cost-performance.

専門特化型のモデル群を、一つのAPIから利用することができます。タスクごとのモデルの選択と切り替えは Sakana Fugu が担うため、APIまわりの煩雑さを抑えつつ、コストパフォーマンスを高められます。

Offering Superior Performance on Complex Tasks 複雑なタスクで優れたパフォーマンス

Built for coding, reasoning, and other quality-critical workflows, Fugu coordinates expert agents to tackle complex tasks with stronger, more reliable results.

Sakana Fugu は、コーディングや推論（リーズニング）など、高い品質が問われるワークフローのために設計されています。専門エージェントを連携させることで、複雑なタスクにもより確かで信頼できる答えを導きます。

Providing Flexibility in Agent Selection 柔軟なエージェント選択

Control which agents can participate in Fugu’s model pool. Opt out of specific providers or models to meet data, privacy, compliance, or organizational requirements.

Sakana Fugu のモデルプールに加えるエージェントを選ぶことができます。データ、プライバシー、コンプライアンス、または組織の要件を満たすために、特定のプロバイダーやモデルを除外することが可能です。

Tech Behind

Research-Driven Coordination for Multi-Agent Intelligence

マルチエージェントの知能を支える、

最新研究に基づく協調技術

Sakana Fugu is grounded in two ICLR 2026 papers on learned model orchestration: TRINITY and the Conductor. Together, they show how systems can learn to assemble, route, and coordinate expert agents for each task instead of relying on hand-designed workflows. For a deeper look at the ideas behind the system, explore our technical report .

Sakana Fugu は、モデルのオーケストレーションを学習で実現する2本のICLR 2026論文「TRINITY」と「Conductor」を基盤としています。これらの研究は、人手で設計したワークフローに頼るのではなく、タスクごとに専門エージェントをどう編成し、振り分け、連携させるかをシステム自身が学習できることを示しています。仕組みの詳細は、テクニカルレポートをご覧ください。

PAPER

TRINITY: An Evolved LLM Coordinator TRINITY：進化型LLMコーディネーター

Trinity uses a lightweight evolved coordinator to orchestrate multiple LLMs over several turns, assigning Thinker, Worker, or Verifier roles to adaptively delegate work across coding, math, reasoning, and knowledge tasks. TRINITY は、軽量な進化型コーディネーターが複数のLLMを複数ターンにわたって統括する仕組み。各モデルに「Thinker（思考役）」「Worker（実行役）」「Verifier（検証役）」の役割を割り当て、コーディング・数学・推論・知識といった幅広いタスクに応じて、作業を適応的に振り分ける。

PAPER

Learning to Orchestrate Agents in Natural Language with the Conductor Conductor による自然言語でのエージェント統率の学習

The Conductor is trained with reinforcement learning to discover natural-language coordination strategies, designing agent communication patterns and focused prompts that help diverse LLM pools outperform individual workers on challenging reasoning benchmarks. Conductor は強化学習によって訓練され、自然言語ベースの協調戦略を自ら見つけ出す。エージェント間のやり取りの型や、要点を絞ったプロンプトを設計することで、多様なLLMの集まりが、難度の高い推論ベンチマークで単体のモデルを上回る力を発揮。

How to Use

Unlock Multi-Agent Intelligence Through An API API を通じてマルチエージェント知能を解き放つ

Sakana Fugu comes in two models — Fugu and Fugu Ultra — both available through one OpenAI-compatible API. Pick the model that fits your workload, or switch between them without changing your integration.

Sakana Fugu には Fugu と Fugu Ultra の 2 つのモデルがあり、どちらも OpenAI 互換 API から利用できます。ワークロードに合うモデルを選んでも、連携を変えずに両者を切り替えてもかまいません。

Fugu

Balanced performance and latency 性能とレイテンシのバランス

Fugu balances strong performance with low latency, making it the ideal default for everyday work. Drop it into tools like Codex for coding and code review, or power responsive chatbot services — all behind a single endpoint. You can also opt specific agents out of its pool to meet data, privacy, and compliance constraints.

Sakana Fugu は高い性能と低レイテンシを両立し、日々の作業に最適な標準モデルです。Codex のようなツールに組み込んでコーディングやコードレビューに使ったり、応答性の高いチャットボットを動かしたり——すべてをひとつのエンドポイントで実現します。データ・プライバシー・コンプライアンスの制約に合わせて、プールから特定のエージェントを除外することもできます。

Fugu Ultra

Optimized for performance 性能に最適化

Fugu Ultra coordinates a deeper pool of expert agents to maximize answer quality on hard, high-stakes problems. Early users rely on it for Kaggle competitions, paper reproduction, cybersecurity analysis, and literature and patent investigations. Fugu Ultra は、より広い専門エージェントのプールを連携させ、難易度が高く重要な問題で回答品質を最大化します。先行ユーザーは、Kaggle コンペティション、論文の再現、サイバーセキュリティ分析、文献・特許調査などに活用しています。

Quantitative Results

Sakana Fugu の性能：定量評価

Our Fugu models surpass publicly accessible frontier models and are shoulder-to-shoulder with Fable 5 and Mythos Preview in various rigorous engineering, scientific, and reasoning benchmarks while delivering frontier capability without the risk of export controls. 二つのFuguモデルは、一般に利用できるフロンティアモデルを上回り、エンジニアリング・科学・推論のさまざまな難関ベンチマークでも、Fable 5やMythos Previewと肩を並べます。しかも、輸出規制のリスクを負うことなく、フロンティアレベルの実力を発揮します。

Performance comparison of Fugu models and baseline frontier models across a suite of coding, reasoning, scientific, and agentic benchmarks. For Fable 5 and Mythos Preview, we report the max of the two if both scores are available on the same benchmark. Neither of them is in Fugu’s agent pool as they are not publicly accessible. コーディング、リーズニング、科学、エージェント能力に関するベンチマーク群における、Fuguモデルとベースラインのフロンティアモデルの性能比較。Fable 5とMythos Previewについては、同一ベンチマークで両方のスコアが入手できる場合、その高い方を採用。なお、両モデルは一般提供されていないため、Fuguのエージェントプールには含まれていない。

Highest scores are shown in boldface; second-highest scores are underlined. 最高スコアは太字、2 番目に高いスコアは下線で示しています。

Benchmark Fugu Fugu Ultra

Opus 4.8 †

Gemini 3.1 Pro †

GPT 5.5 †

SWE Bench Pro *

59.0 73.7 69.2 54.2 58.6

TerminalBench 2.1 80.2 82.1 74.6 70.3 78.2

LiveCodeBench 92.9 93.2 87.8 88.5 85.3

LiveCodeBench Pro 87.8 90.8 84.8 82.9 88.4

Humanity’s Last Exam 47.2 50.0 49.8 44.4 41.4

CharXiv Reasoning 85.1 86.6 84.2 83.3 84.1

GPQA-D 95.5 95.5 92.0 94.3 93.6

SciCode 60.1 58.7 53.5 58.9 56.1

τ³ Banking 21.7 20.6 20.6 8.4 20.6

Long Context Reasoning 74.7 73.3 67.7 72.7 74.3

MRCRv2 86.6 93.6 87.9 84.9 94.8

We use the mini-swe-agent as the scaffolding for this task.
mini-swe-agent をスキャフォールドとして使用。

† We use model provider-reported scores for the baselines. † モデル提供元が公表したスコア。

Qualitative Results

Sakana Fugu の性能：定性的な例

These examples compare Sakana Fugu models with three frontier baselines — Gemini 3.1 Pro (high) , Opus 4.8 (max) , and GPT 5.5 (xhigh) . To keep the focus on behavior rather than brand-by-brand attribution, the baselines are anonymized as Model A , Model B , and Model C in each description. The mapping is intentionally not fixed across examples.

以下の例では、 Sakana Fugu を、 Gemini 3.1 Pro（high）、 Opus 4.8 （max）、 GPT 5.5（xhigh）の3つのフロンティアモデルと比較しています。個別モデルではなく挙動の違いに注目できるよう、ベースラインを Model A 、 Model B 、 Model C として匿名化しています。なお、どのモデルがA〜Cかは例ごとに変えています。

AutoResearch · BPB AutoResearch · BPB

⚠️

Video unavailable 動画を表示できません

demo0.mp4

Kana letters · Reading order かな文書 · 読み順推定

⚠️

Video unavailable 動画を表示できません

demo1.mp4

Rubik's Cube · Solver Rubik’s Cube · ソルバー生成

⚠️

Video unavailable 動画を表示できません

demo2.mp4

CAD · Mechanical iris CAD · メカニカルアイリス

⚠️

Video unavailable 動画を表示できません

demo3.mp4

Blindfold chess ブラインドフォールド・チェス

⚠️

Video unavailable 動画を表示できません

demo4.mp4

Time-series · Trading 時系列予測 · トレーディング

⚠️

Video unavailable 動画を表示できません

demo5.mp4

This experiment shows an AI agent autonomously improving a small GPT's training recipe. Using AutoResearch (Karpathy et al.) – which iteratively edits training code, runs experiments, and keeps only changes that lower validation bits-per-byte (BPB) – the agent ran 123 experiments over ~14 hours on a single H100 GPU. Each line traces a system's best BPB as experiments accumulate: Fugu-Ultra is in bold red (solid = mean over three seeds, dashed = best single run), with three frontier-model baselines (Model A, B, and C) faded behind it, and the callouts mark each new improvement the agent found on its own — spanning batch size, model depth, learning rates, and optimizer settings. Fugu-Ultra finishes with the best mean BPB (0.9774 ± 0.0019), ahead of Model C (0.9781), Model B (0.9793), and Model A (0.9822), and its best single run reaches 0.9748, leading every baseline. This suggests that orchestrating multiple strong models can outperform any individual frontier model on agentic ML research.

例1 — AutoResearch / LLM学習

AIエージェントに小規模なGPTの学習レシピを自律的に改善させる実験。学習コードを反復的に書き換え、実験を実行し、検証用 bits-per-byte（BPB）を下げた変更だけを残していくエージェント型フレームワーク AutoResearch（Karpathy et al.）を用い、エージェントは単一のH100 GPU上でおよそ14時間にわたり123回の実験を実施した。各線は、実験が積み重なるにつれて各システムが達成した最良のBPBの推移を表している。Fugu-Ultra は太い赤の線（実線＝3シードの平均、破線＝最良の単一実行）で示し、その背後に3つのフロンティアモデルのベースライン（Model A・B・C）を淡色で重ねている。吹き出しは、エージェントが自ら見つけた改善点をそれぞれ示しており、バッチサイズ、モデルの深さ、学習率、オプティマイザの設定など多岐にわたる。Fugu-Ultra は最終的に最良の平均BPB（0.9774 ± 0.0019）を達成し、Model C（0.9781）、Model B（0.9793）、Model A（0.9822）を上回った。最良の単一実行では 0.9748 に到達し、すべてのベースラインを上回っている。これらの結果は、複数の強力なモデルをオーケストレーションすることで、エージェント型のML研究において単体のフロンティアモデルを上回り得ることを示唆している。

This case study tests whether the reading order of classical Japanese kana letters (仮名消息) can be recovered — letters whose scattered chirashigaki ("scattered-writing") layout makes that genuinely hard even for trained readers of classical Japanese. Each model is given the character bounding boxes together with a rough set of reading-order rules, and writes code that outputs the order the characters should be read in; here it runs on a letter written in 1610 by Hōshun'in (芳春院, 1547–1617), scored by NED (a score based on normalized edit distance from an expert's ground-truth order, where 1.0 is a perfect match). Several frontier models were put through the identical pipeline, but none came close to Fugu-Ultra on this letter: Model A reached only NED 0.24 and Model B scored no better, both far below Fugu-Ultra's 0.80, while Model C produced no predictor at all. The clip shows the two extremes — each panel draws its predicted path in red over the expert's ground truth in green: Fugu-Ultra (top) traces the letter almost exactly, while Model A (bottom) jumps all over the page. (Letter held by the Keio Institute of Oriental Classics.)

例2 — 仮名消息の読み順推定

本ケーススタディは、仮名消息（古典日本語のかな書状）という歴史的資料における読み順の推定問題を対象とする。仮名消息は、文字を紙面に散らして記す「散らし書き」という形式で書かれているため、古文書を読み慣れた人でも文字の読み順を正しく判定することは難しい。そこで各モデルに対して、文字を囲む四角形（バウンディングボックス）と読み順の大まかなルールを与え、文字の読み順を推定するコードを出力させた。実験の対象には1610年に芳春院（ほうしゅんいん、1547–1617）が記した書状を選び、NED（専門家による正しい読み順との正規化編集距離にもとづくスコア。1.0が完全一致）で評価した。複数のフロンティアモデル（A-C）を同一のパイプラインに通したところ、Fugu-Ultraの結果は他のモデルを大きく引き離した。Model AはNED 0.24、Model Bもそれと大差なく、いずれもFugu-Ultraの0.80には遠く及ばない。さらにModel Cはまともなコードを一回も出力できなかった。モデルによる読み順の違いを可視化するために、専門家による正解の読み順（緑）の上に、推定した経路（赤）を描いて映像化した。Fugu-Ultra（上）が読み順をほぼ正確になぞる一方、Model A（下）は紙面全体をあちこち飛び回り、両者は大きく異なる結果を示している。図：芳春院消息（慶應義塾大学斯道文庫蔵）

In this benchmark, each of Fugu-Ultra and 3 frontier models is given a single prompt to write a Rubik's Cube solver from scratch in pure Python — no off-the-shelf solving libraries allowed — and the resulting program is run locally on a held-out set of 300 randomly scrambled cubes. Solution quality is measured by the number of moves a solution uses, where lower is better. Fugu-Ultra and the frontier Model A wrote solvers that ran and solved all 300 cubes, while Model B and Model C each shipped sop

[truncated for AI cost control]