JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines
JetBrains releases Mellum2 under Apache 2.0: a 12B MoE model trained on 10.6 trillion tokens for AI workflows. The post JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines appeared first on MarkTechPost.
JetBrains has released Mellum2 and open-sourced its weights under the Apache 2.0 license. The first version of Mellum was a completion-focused 4B dense model. Mellum2 is its successor: a general-purpose model specialized in software engineering. It covers code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.

The JetBrains team positions Mellum2 as a "focal model": a fast, specialized component inside larger AI systems, not a standalone replacement for frontier models.

Architecture

Mellum2 uses a Mixture-of-Experts (MoE) architecture with 12B total parameters and 2.5B active parameters per token. In MoE models, only a subset of parameters runs on each token. Here, the model has 64 experts and activates 8 per token. This keeps per-token compute equivalent to a 2.5B dense model, while the total parameter count provides higher capacity for specialization.

Key architectural details:

- Layers: 28
- Hidden size: 2304
- MoE experts: 64 total, 8 activated per token
- Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads
- Sliding Window Attention (SWA): applied to three of every four layers, with a window size of 1,024. Full attention runs on the remaining layer.
- Context length: 131,072 tokens
- Multi-Token Prediction (MTP) head: serves as an auxiliary pre-training objective and as a built-in draft model for speculative decoding
- Precision: bfloat16
- Vocabulary size: 98,304

The model handles natural language and code. It is not multimodal: there is no image or video input.

Pre-Training

Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum. The data mixture progressively shifts from diverse web content toward curated code and mathematical content across the three phases. Training used the Muon optimizer under FP8 hybrid precision, with a Warmup-Hold-Decay learning rate schedule that decays linearly to zero.
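The Warmup-Hold-Decay schedule named above can be sketched in a few lines. This is a minimal illustration of the general shape (linear warmup, constant hold, linear decay to zero), not JetBrains' training code. The function name and every numeric value in the usage note are assumptions, since the article does not state the peak learning rate or the length of each phase.

```python
def whd_learning_rate(step: int, total_steps: int, peak_lr: float,
                      warmup_steps: int, decay_steps: int) -> float:
    """Learning rate at `step` under a Warmup-Hold-Decay schedule.

    Phase lengths and peak_lr are illustrative parameters; the article
    does not report the values used for Mellum2.
    """
    if warmup_steps < 0 or decay_steps <= 0:
        raise ValueError("warmup_steps must be >= 0 and decay_steps > 0")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases exceed total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold: stay at peak_lr.
        return peak_lr
    # Decay: ramp linearly from peak_lr down to exactly 0 at total_steps.
    return peak_lr * (total_steps - step) / decay_steps
```

For example, with hypothetical settings `total_steps=1000`, `peak_lr=1e-3`, `warmup_steps=100`, and `decay_steps=200`, the rate climbs until step 100, stays at `1e-3` through step 800, and reaches zero at step 1000.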
After pre-training, the base model's context window was extended to 128K tokens using a layer-selective YaRN method before post-training began.

The Model Family

The JetBrains team released six checkpoints covering the full training pipeline:

| Checkpoint | Description |
| --- | --- |
| Mellum2-12B-A2.5B-Base-Pretrain | Base checkpoint before long-context extension |
| Mellum2-12B-A2.5B-Base | Final base model after context extension |
| Mellum2-12B-A2.5B-Instruct-SFT | Supervised fine-tuned instruction checkpoint |
| Mellum2-12B-A2.5B-Thinking-SFT | Supervised thinking checkpoint |
| Mellum2-12B-A2.5B-Instruct | RL-tuned instruction model |
| Mellum2-12B-A2.5B-Thinking | RL-tuned thinking model |

Post-training follows two stages: supervised fine-tuning (SFT), then reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks.

The Instruct variant answers directly, without an externalized chain of thought. Use it for low-latency tasks: direct answers, tool use, and instruction following.

The Thinking variant emits an explicit reasoning trace before its final answer. Use it for complex debugging, multi-step planning, or agentic flows where step-by-step reasoning matters.

Benchmark Results

All numbers below are self-reported by JetBrains. The comparison set is open-weight models in the 4B–14B range.
Coding:

| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) | Seed-Coder (8B) |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench v6 | 37.2 | 51.0 | 63.7 | 42.4 | 28.2 | 28.1 |
| EvalPlus | 78.4 | 69.4 | 71.8 | 74.1 | 67.3 | 73.8 |
| MultiPL-E | 67.1 | 51.0 | 67.1 | 71.5 | 36.1 | 77.0 |

Tool Use:

| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
| --- | --- | --- | --- | --- | --- |
| BFCL v3 | 66.3 | 64.1 | 70.5 | 52.7 | 41.9 |
| BFCL v4 | 44.2 | 52.0 | 60.6 | 38.8 | 19.8 |

Math:

| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
| --- | --- | --- | --- | --- | --- |
| AIME 2025+2026 | 41.7 | 38.3 | 58.3 | 33.3 | 40.0 |
| GSM-Plus | 80.5 | 85.2 | 87.9 | 86.6 | 85.8 |

Knowledge and Conversational:

| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
| --- | --- | --- | --- | --- | --- |
| MMLU-Redux | 78.1 | 87.5 | 91.1 | 85.9 | 71.8 |
| GPQA Diamond | 40.9 | 76.8 | 79.8 | 58.6 | 40.9 |
| IFEval | 75.8 | 82.1 | 83.9 | 67.3 | 83.2 |
| MixEval | 62.2 | 65.9 | 71.1 | 71.2 | 59.4 |

Benchmark notes:

- EvalPlus is the mean of HumanEval+ and MBPP+.
- AIME is the mean of AIME 2025 and AIME 2026 (30 questions each).
- BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, and memory.
- Seed-Coder (8B) does not support native tool calling, so BFCL scores are not listed for it.

Use Cases

JetBrains identifies four production scenarios where Mellum2's latency and efficiency profile is relevant:

- Routing and orchestration: In a multi-model system, a router analyzes incoming prompts and selects the appropriate model or tool for each task. Mellum2's low per-token compute makes it suitable for this high-frequency classification step.
- Low-latency RAG pipelines: Retrieval-Augmented Generation (RAG) systems retrieve relevant context, summarize it, and generate a response. Mellum2 handles retrieval summarization at lower latency than larger dense models.
- Sub-agents in complex workflows: Agent pipelines break tasks into steps: context gathering, planning, validation, and execution. Mellum2 can handle repetitive or latency-sensitive steps instead of routing every step through a single large frontier model.
- Private and local deployment: The Apache 2.0 license permits self-hosting, including for commercial use.
  Engineers can run Mellum2 on their own infrastructure, keeping code and data under their control.

Strengths and Limitations

Strengths:

- MoE design activates only 2.5B of 12B parameters per token, so per-token compute is equivalent to a 2.5B dense model
- MTP head enables speculative decoding without a separate draft model
- 131,072-token context window
- Full checkpoint set released: base pretrain, base, SFT, and RL-tuned variants for both Instruct and Thinking
- Apache 2.0 license permits commercial use, self-hosting, and fine-tuning
- Strong EvalPlus (78.4) and BFCL v3 (66.3) scores relative to the 4B–14B comparison set
- vLLM support, including optional tool calling via --tool-call-parser hermes

Limitations:

- Text and code only: no image or multimodal input
- LiveCodeBench v6 (37.2) trails Qwen3.5 9B (63.7) and Ministral 3 14B (42.4)
- GPQA Diamond (40.9) and MMLU-Redux (78.1) are below most models in the comparison set
- GSM-Plus (80.5) is below all comparable models listed
- Not designed for frontier-level tasks: JetBrains explicitly positions Mellum2 as a component model

Marktechpost's Visual Explainer

[Slide deck recapping the article: Overview, Architecture, Pre-Training, Model Family, Benchmarks, Use Cases, Strengths & Limitations, Quick Start.] Details that appear only in the slides:

- Release date: June 2, 2026
- Mellum2 was trained from scratch
- Design constraint: inference efficiency on commodity GPUs, validated by ablation
- Model weights: huggingface.co/JetBrains/mellum-2
- Technical report: arXiv:2605.31268

Getting Started

Serve Mellum2 with vLLM:

```shell
pip install vllm
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072
```

With tool calling enabled:

```shell
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Using the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the Instruct checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")

messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]

# Render the chat template and tokenize it into model-ready tensors.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
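Once the server is running with tool calling enabled, a client can send a function-calling request to vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default address, `http://localhost:8000`. The `run_tests` tool, its schema, and the helper function names are hypothetical and serve only as illustration.

```python
import json
import urllib.request

# Default address of `vllm serve`; adjust if the server uses another host or port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "JetBrains/Mellum2-12B-A2.5B-Instruct"

# Hypothetical tool, described in the OpenAI function-calling schema.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and report failures.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory."},
            },
            "required": ["path"],
        },
    },
}


def build_chat_request(prompt: str) -> dict:
    """Build a chat-completions payload that offers one tool to the model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
        "tool_choice": "auto",
        "max_tokens": 512,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs for tool calls in the first choice."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

With the server up, `extract_tool_calls(send_chat_request(build_chat_request("Run the tests in tests/unit")))` would return whatever tool calls the model chose to make, or an empty list if it answered in plain text.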