Qwen AI News

Qwen updates

PrismML Releases Bonsai 27B: 1-bit and Ternary Builds of Qwen3.6-27B That Run on Laptops and Phones

2026-07-14 22:51 UTC

PrismML just released Bonsai 27B, a low-bit representation of Qwen3.6-27B rather than a new pretrain; the architecture is unchanged. Two variants ship under Apache 2.0. Ternary Bonsai 27B uses {−1, 0, +1} weights at a true 1.71 bits per weight, for an ideal size of 5.9GB. 1-bit Bonsai 27B uses binary {−1, +1} weights at 1.125 bits per weight, for 3.9GB. The ternary build retains 94.6% of FP16 performance and the binary build retains 89.5%. Both are multimodal and support a 262K-token context. PrismML claims the 1-bit build is the first 27B-class model to fit on a phone.

Bonsai 27B is a low-bit representation of Qwen3.6-27B, not a new pretrain.
Two variants: ternary (1.71 bits/weight, 5.9GB) and binary (1.125 bits/weight, 3.9GB).
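
The relationship between bits per weight and file size is plain arithmetic, and a short sketch makes it easy to check. The parameter count below is an assumption ("27B" read literally as 27e9), as is the use of decimal gigabytes. Under those assumptions the results land slightly below the quoted 5.9GB and 3.9GB, which would be consistent with a true parameter count a little above 27 billion.

```python
def ideal_size_gb(params: float, bits_per_weight: float) -> float:
    """Ideal weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def implied_params(size_gb: float, bits_per_weight: float) -> float:
    """Parameter count implied by a quoted size and bit width."""
    return size_gb * 1e9 * 8 / bits_per_weight


NOMINAL_PARAMS = 27e9  # assumption: "27B" taken literally

# (name, bits per weight, size quoted in the announcement)
for name, bpw, quoted_gb in [("ternary", 1.71, 5.9), ("binary", 1.125, 3.9)]:
    size = ideal_size_gb(NOMINAL_PARAMS, bpw)
    params = implied_params(quoted_gb, bpw)
    print(f"{name}: {size:.2f} GB at 27e9 params; "
          f"quoted {quoted_gb} GB implies {params / 1e9:.1f}B params")
```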

Cost of Reasoning in non-English Languages: A Case Study on Japanese

2026-07-14 04:00 UTC

This paper investigates the feasibility of training a reasoning language model in Japanese. By applying GRPO to a Japanese continually pretrained model based on Qwen-3-Swallow-8B, the authors find that reasoning-language control is achievable, yet performance at best matches English-reasoning baselines. On Japanese cultural benchmarks, the model performs worse, indicating that reasoning in Japanese does not automatically improve culturally relevant tasks.

Explores training a reasoning model to reason in Japanese.
Developed a Japanese-reasoning variant of Qwen-3-Swallow-8B using GRPO.
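
For readers unfamiliar with GRPO, its core step is scoring each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that normalization step. The reward values are made up, and the choice of population standard deviation and the `eps` term are assumptions; the paper's actual reward design is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers above the group mean get a positive advantage and those below get a negative one; if every answer in the group earns the same reward, all advantages are zero and the group contributes no learning signal.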

Fixed three bugs that made Qwen3.5-122B a daily driver on Mac Studio

2026-07-11 22:54 UTC

After fixing three bugs related to prefix caching, the author achieved sub-second prefill times for long-context conversations with Qwen3.5-122B on a Mac Studio, turning a multi-minute wait into a seamless experience. The bugs were a timestamp in the system prompt, replies that were not saved when generation was interrupted, and junk checkpoint writes.

Qwen3.5-122B on Mac Studio had severe prefill latency due to hybrid attention's cache behavior.
Three bugs: timestamp in system prompt caused cache miss; interrupted replies not saved; junk checkpoints evicted good ones.
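
The timestamp bug is easy to reproduce in miniature. The sketch below is a generic block-hash prefix cache, not the author's implementation (the post's actual cache and checkpoint format are not described here); the block size, token lists, and function names are all illustrative assumptions. It shows why a value that changes at the very start of the prompt makes every later block unreusable.

```python
import hashlib

BLOCK = 4  # tokens per cache block; tiny on purpose, real systems use larger blocks


def block_keys(tokens):
    """One chained hash per full block, so each key covers every token before it."""
    keys = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(("\x00".join(tokens[i:i + BLOCK]) + "\x01").encode())
        keys.append(h.copy().hexdigest())
    return keys


def reusable_blocks(cache, tokens):
    """Count leading blocks already in the cache; stop at the first miss."""
    hits = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        hits += 1
    return hits


system = ["You", "are", "a", "helpful"]
turn1 = ["user:", "hi", "there", "!", "bot:", "hello", "friend", "!"]
turn2 = turn1 + ["user:", "and", "now", "?"]

# Stable system prompt: turn 2 reuses everything cached during turn 1.
cache = set(block_keys(system + turn1))
hits_stable = reusable_blocks(cache, system + turn2)

# Timestamped system prompt: the first block changes, so nothing is reusable.
cache = set(block_keys(["time", "12:00"] + system + turn1))
hits_stamped = reusable_blocks(cache, ["time", "12:01"] + system + turn2)

print(hits_stable, hits_stamped)  # prints "3 0"
```

Because each key is chained to everything before it, a single changed token near the start invalidates the whole conversation, which matches the multi-minute re-prefill described above.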

AI Models Overthink Problems—and It’s a Security Risk

2026-07-08 11:00 UTC

Research shows that large language models with reasoning capabilities can be tricked into 'overthinking' using logically inconsistent prompts, leading to a denial-of-service attack. Researchers from Zhejiang University and Alibaba developed an evolutionary algorithm that generates malicious prompts, causing outputs up to 26 times longer in leading models like DeepSeek-R1, Qwen3-Thinking, OpenAI o3, and Gemini 2.5 Flash.

Researchers demonstrate a new attack exploiting 'overthinking' in AI reasoning models, causing excessive computation.
An evolutionary algorithm corrupts prompts to produce outputs up to 26 times longer than normal.

NAVER LABS System Re-implementation for the IWSLT 2026 Instruction-Following Task

2026-07-08 04:00 UTC

NAVER LABS re-implements its IWSLT 2025 instruction-following pipeline for the IWSLT 2026 Shared Task (constrained condition, short audio track), adapting to mandated components: SeamlessM4T-v2-large as speech encoder and Qwen3-4B-Instruct as LLM backbone. The three-stage approach (projector alignment, text-only LoRA pre-training, multimodal merging) is preserved. Additionally, 100k synthetic instruction-following examples across ten speech-centric task types (10k per task) are constructed. The primary model achieves COMET 0.781 on EN-ZH speech translation and BERTScore-F1 0.346 on English SQA on the MCIF benchmark.

Re-implements NAVER LABS IWSLT 2025 pipeline for IWSLT 2026
Uses SeamlessM4T-v2-large and Qwen3-4B-Instruct as core components

Liquid AI Open-Sources Antidoom: A Final Token Preference Optimization (FTPO) Method that Reduces Doom Loops in Reasoning Models

2026-07-07 16:50 UTC

Liquid AI released Antidoom, an open-source method targeting doom loops in reasoning models. Using FTPO, it retrains only the token that starts a loop, reducing loop rates from 10.2% to 1.4% on LFM2.5-2.6B and from 22.9% to 1% on Qwen3.5-4B.

Antidoom reduces doom loops by retraining only the first loop-start token.
FTPO spreads probability across multiple coherent alternatives.

Reinforcement Learning for Data-Efficient Code-Switched ASR

2026-07-07 04:00 UTC

Researchers propose a reinforcement learning with verifiable rewards (RLVR) method to adapt audio-language models for code-switched automatic speech recognition. Using only 10% of the data, RLVR matches the performance of full-dataset supervised fine-tuning on Qwen2-Audio across 10 language pairs, with gains transferring zero-shot to human-recorded speech.

New RLVR method combines error rate and script fidelity rewards for code-switching ASR.
Achieves full-dataset LoRA SFT performance with only 10% of the data.

Out-of-Distribution Generalization of Risk Aversion in Language Models

2026-07-07 04:00 UTC

This paper investigates whether risk aversion trained in low-stakes scenarios can generalize to astronomically high-stakes scenarios, as a potential failsafe against AI misalignment. Introducing the RiskAverseOOD benchmark, initial experiments on Qwen3-8B show that learned risk aversion can partially generalize across 98 orders of magnitude, boosting cooperation rates from 2% to 70% (SFT), 52% (DPO), and 39% (activation steering). However, consistency is insufficient for a reliable failsafe.

Introduces RiskAverseOOD benchmark for measuring out-of-distribution generalization of risk aversion.
Uses SFT, DPO, and activation steering to train risk aversion in language models.

Oyster-II: Reinforcement Learning for Constructive Safety Alignment in Large Language Models

2026-07-07 04:00 UTC

Large language models (LLMs) face a persistent challenge in balancing safety, helpfulness, and trustworthiness. Traditional refusal-oriented alignment strategies mitigate harmful content but often fail to serve legitimate user needs. Oyster-II proposes a reinforcement learning (RL)-based constructive safety alignment framework, employing a Zero-RL paradigm combined with a multi-stage RL strategy. It addresses two critical limitations of Oyster-I's Supervised Fine-Tuning (SFT) scheme: insufficient safety generalization to out-of-distribution scenarios and safety chain-of-thought (CoT) over-generalization. Evaluated on extensive benchmarks, Oyster-II comprehensively surpasses Qwen3-14B and Oyster-I on safety dimensions, achieving cross-scale performance comparable to Qwen3-Max and Qwen3.5-397B.

Oyster-II improves upon Oyster-I by using reinforcement learning instead of supervised fine-tuning for safety alignment.
It introduces a Zero-RL paradigm combined with multi-stage reinforcement learning.

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

2026-07-07 00:00 UTC

LensVLM is an inference framework and post-training recipe that enables Vision Language Models (VLMs) to scan compressed images and selectively expand only relevant images to their uncompressed form via learned tools. Built on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3× effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1× effective compression across seven text QA benchmarks. It also generalizes to multimodal document and code understanding tasks, with accuracy gains increasing as compression grows.

VLMs process text as rendered images but compression makes characters indistinguishable.
LensVLM scans compressed images and selectively expands relevant content.

China’s AI companion rules: what Beijing is really going after

2026-07-06 11:00 UTC

China's new AI companion regulations take effect July 15, requiring anti-addiction systems, mandatory notifications, and instant exit mechanisms. Major apps like Doubao and Qwen have shut down companion features to comply, sparking user backlash over data loss.

China's Interim Measures for AI Anthropomorphic Interactive Services begin July 15, targeting emotionally engaging AI companions.
ByteDance and Alibaba disabled companion features due to design conflicts; users face data loss.

Chinese LLMs Doubao, Qwen to shut down personalized AI agents on July 15

2026-07-06 06:23 UTC

ByteDance's Doubao and Alibaba's Qwen LLMs announced they will shut down personalized AI agent features on July 15 to comply with new government regulations. The move aims to enhance safety, prevent abuse by third parties, and reduce investment in less viable businesses. The shutdown coincides with the enforcement of interim measures for AI-powered personified interactive services, which require anti-addiction systems, minor identity verification, and strict content review. Despite the removal, the AI agent market is expected to grow explosively.

Doubao and Qwen will disable personalized AI agents on July 15, with data retention until October 15.
The shutdown is driven by regulatory compliance and business optimization.

Modern VLMs Explained: How GPT-4o, Gemini, Claude Vision, and Qwen-VL Work

2026-07-06 05:14 UTC

Modern Vision Language Models (VLMs) combine vision and language to understand images and text. This article explains how GPT-4o, Gemini, Claude Vision, and Qwen-VL work, their key differences, strengths, and limitations.

Modern VLMs go beyond simple image captioning to analyze documents, charts, and answer visual questions.
GPT-4o excels in real-time multimodal interaction across text, images, audio, and video.

Show HN: An unmetered LLM API–$6/month, no token tracking, no limits

2026-07-06 01:22 UTC

Yolo-Auto launches a flat-rate $6/month API for unlimited access to Qwen3.6-35B-A3B. The service is OpenAI-compatible, stores no prompts or responses, and is designed for coding agents, automation, and high-volume workflows without per-token anxiety.

$6/month for unlimited Qwen3.6-35B-A3B access with no token counting or request caps.
Fully OpenAI-compatible API, works with Cursor, LangChain, and other tools.

Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents

2026-07-05 02:31 UTC

Junyang Lin, former Qwen technical lead, critiques hybrid thinking in Qwen3 and advocates for agentic thinking. He explains the difficulties in merging thinking and non-thinking modes, and why agentic RL requires decoupled infrastructure and high-quality environments to avoid reward hacking.

Junyang Lin stepped down as Qwen lead on March 3, 2026, now writes as independent researcher.
Qwen3's hybrid thinking mode was challenging to implement; later variants re-separated Instruct and Thinking.

World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments

2026-07-03 04:00 UTC

The study audits MedAgentBench v1/v2, finds a 41.7% silent-finish ceiling, and constructs MAB-v3 (508 tasks, 8.9% ceiling). Training Qwen3-8B reveals two structural barriers: a capability ceiling and a format-knowledge barrier. Pure RL achieves 18.2% pass@1 vs. 34.1% for rule-based SFT, a 15.9 pp gap entirely attributable to these barriers. A decision/format-knowledge/lookup taxonomy predicts RL learnability.

MedAgentBench v1/v2 has a 41.7% silent-finish ceiling, making inaction the dominant RL strategy
New MAB-v3 benchmark reduces the ceiling to 8.9% with 508 tasks

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

2026-07-02 04:00 UTC

Parameter-efficient fine-tuning (PEFT) reparameterizes weight updates in a fixed basis: low-rank adapters operate in the spatial domain, while spectral methods operate in a fixed Fourier domain. This paper introduces Fractional-Fourier Mixture of Experts, where each expert has a learnable fractional-Fourier order that interpolates between spatial and Fourier domains. Routing tokens through experts of different orders allows low-rank updates to be placed in their most compact domain, and the experts are naturally decorrelated, reducing interference and improving multi-task composition. The method adds negligible cost and outperforms strong baselines on LLaMA-3.1-8B and Qwen2.5-7B across various benchmarks.

Proposes Fractional-Fourier Mixture of Experts with learnable fractional-Fourier order per expert, enabling interpolation between spatial and Fourier domains.
Routing tokens to experts of different orders places each low-rank update in its most compact domain, and mutual incoherence of fractional-Fourier operators naturally decorrelates experts.

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

2026-07-01 04:00 UTC

This paper presents a benchmark for Arabic--Russian scientific translation, including a hybrid parallel corpus of about 27,000 sentence pairs compiled from scientific abstracts and general-domain texts. Three multilingual language models were fine-tuned using LoRA with various ranks. The Qwen2.5-7B model with QLoRA (rank 8) achieved the best results: BLEU 23.15, chrF 43.89, BERTScore 0.906, COMET 0.758, outperforming the zero-shot baseline by +4.36 BLEU and +0.051 COMET. Few-shot prompting did not improve performance, indicating the necessity of domain-specific fine-tuning. The models, corpus, and evaluation code are released publicly, aiming to lower language barriers for scientific knowledge exchange between Arabic and Russian speakers and contributing to UN SDGs 9 and 17.

Constructed a hybrid Arabic--Russian parallel corpus of ~27,000 sentence pairs from scientific abstracts and general texts.
Fine-tuned three multilingual models; Qwen2.5-7B with QLoRA (rank 8) achieved best translation performance.

The AI Model Accessibility Checker

2026-06-30 14:21 UTC

AIMAC, an initiative by the GAAD Foundation in partnership with ServiceNow, evaluated 37 top AI models on web page accessibility across 28 categories. OpenAI's GPT 5.4 Mini and GPT 5.3 Codex tied for first with a median AIMAC debt of 0.00. Alibaba's Qwen and Z.ai's GLM 4.7 Flash also performed well. Low contrast text is the most common accessibility issue in AI-generated pages, appearing in 84.2% of pages.

AIMAC evaluated 37 AI models on generating accessible web pages across 28 categories
OpenAI's GPT 5.4 Mini and GPT 5.3 Codex tied for first with 0.00 accessibility debt

Building Local AI Systems: Qwen3.6 + MCPs

2026-06-30 14:00 UTC

This article introduces how to build local AI systems using the Qwen3.6-35B-A3B model and the Model Context Protocol (MCP), covering model architecture, hardware requirements, deployment, and a practical GitHub developer assistant example.

MCP is an open protocol that allows AI models to call external tools through a standard interface, eliminating the need for custom integration code per model.
The Qwen3.6-35B-A3B uses a Mixture of Experts architecture with only 3B activated parameters, making it suitable for local deployment.
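
To make the "standard interface" concrete: MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and an arguments object. The sketch below only builds such a request; the tool name `list_issues` and its arguments are hypothetical and are not taken from the article's GitHub assistant.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value per request works


def tool_call_request(tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and arguments, for illustration only.
request = tool_call_request("list_issues", {"repo": "octocat/hello-world"})
print(json.dumps(request, indent=2))
```

The same message shape works for any MCP server regardless of which model produced the call, which is what removes the need for per-model integration code.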

Ornith-1.0: self-improving open-source models for agentic coding

2026-06-29 17:16 UTC

Ornith-1.0 is a family of open-source agentic coding models post-trained on Gemma 4 and Qwen 3.5, using reinforcement learning to jointly optimize scaffold and solution rollouts. Available in 9B, 35B MoE, and 397B MoE sizes, it achieves state-of-the-art results on coding benchmarks like Terminal-Bench, SWE-Bench, NL2Repo, and OpenClaw. MIT licensed, supports OpenAI-compatible API and tool calling.

Ornith-1.0 offers 9B (dense), 35B (MoE), and 397B (MoE) variants, achieving best-in-class performance among open-source models on multiple coding benchmarks.
Its self-improving RL framework jointly trains search scaffold and solution generation, enhancing search trajectory quality.

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding

2026-06-29 16:17 UTC

DeepReinforce releases Ornith-1.0, an open-weights (MIT) model series based on Gemma 4 and Qwen 3.5, achieving state-of-the-art performance on coding benchmarks among open-source models of comparable size. The author tests the 35B MoE variant with LM Studio and Pi, finding it proficient at handling multiple tool calls for agentic coding tasks.

Open-weights (MIT) model from DeepReinforce
Built on Gemma 4 and Qwen 3.5 with variants from 9B to 397B

Empero-AI/Qwythos-9B-Claude-Mythos-5-1M: A 1M-Context Reasoning Model Based on Qwen3.5

2026-06-29 05:53 UTC

Qwythos-9B is a full-parameter reasoning model developed by Empero AI, built on a deeply uncensored Qwen3.5-9B base and post-trained on over 500 million tokens of high-quality Claude Mythos and Fable traces with chain-of-thought generated in-house. It features a 1,048,576-token context window, significant improvements over the base model on MMLU and GSM8K (up to +34 points), native function calling, and tool-assisted self-correction. The model is deliberately uncensored and targets technically demanding domains such as cybersecurity, red-teaming, and biomedical fields.

Full-parameter fine-tune of Qwen3.5-9B with 500M+ tokens of high-quality post-training data.
Supports 1,048,576-token context window for whole-codebase reasoning and multi-document research.

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

2026-06-29 04:00 UTC

DMV-Bench is the first interactive benchmark for multimodal-agent visual memory, built on a home-furnishing e-commerce catalog of 1,000 product variants. Each product image carries a unique incidental cue; agents must recall cued products after long shopping sessions. The proposed DualMem architecture, maintaining parallel visual and verbal codes, outperforms baselines on Gemini 2.5 Flash and Qwen2.5-VL-7B.

DMV-Bench is the first interactive benchmark for visual memory, using incidental cues in 1,000 product images
DualMem architecture maintains visual and verbal codes in parallel, excelling in long-horizon tasks

Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

2026-06-29 04:00 UTC

Large language model (LLM) agents struggle to update facts in long-term interactions. Replacing full context with bounded memory drops accuracy from 92% to 77% even on frontier models. The gap scales with conversation length, not memory size. The authors introduce Supersede, a reinforcement learning environment that trains agents to prioritize current facts over superseded ones. Fine-tuning Qwen2.5-3B in this environment nearly doubles held-out accuracy (9.0% to 16.7%).

LLM agents fail to update facts in long-term memory, causing significant accuracy drops.
The memory-update gap is not due to model scale or memory capacity but grows with conversation length.

Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

2026-06-28 04:58 UTC

Liquid AI released LFM2.5-230M, its smallest model yet. The 230M-parameter, open-weight model runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5. Built on the LFM2 architecture, it targets tool use and data extraction, beating larger models like Qwen3.5-0.8B and Gemma 3 1B on instruction following.

Liquid AI's smallest model: LFM2.5-230M, 230M params, open-weight, based on LFM2 architecture.
On-device performance: 213 tok/s on Galaxy S25 Ultra, 42 tok/s on Raspberry Pi 5.

Using Local Coding Agents

2026-06-27 11:21 UTC

A tutorial on setting up a production-ready local coding agent using open-source tools and open-weight LLMs, with a focus on Qwen3.6 and Qwen-Code harness, covering motivation, setup, performance assessment, and alternatives.

Local coding agents offer transparency, privacy, and no subscription costs compared to services like Claude Code and Codex.
Qwen3.6 35B-A3B with Qwen-Code harness is a top-performing local setup.

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

2026-06-26 04:00 UTC

A new benchmark, Know2Guess, aims to evaluate LLMs' ability to distinguish between knowledge-based answering and guessing, considering data contamination. It includes 1,200 items across five domains and tests models like FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct. Qwen2.5-3B-Instruct shows best reliability but still has calibration issues.

Know2Guess benchmark contains 1,200 items across five domains with contamination metadata
Evaluation shows incomplete transition from answering to abstaining

Refusal Lives Downstream of Persona in Chat Models

2026-06-26 04:00 UTC

This paper shows that refusal in instruction-tuned chat models is gated by a compliant persona direction. Interventions on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct demonstrate that steering compliant persona suppresses refusal (e.g., Llama's refusal rate drops from 97% to 2%), and refusal direction only partially restores refusal in late layers. The findings indicate refusal is expressed downstream of persona computation.

Compliant persona steering reduces refusal rates drastically (97% to 2% in Llama).
Refusal direction partially restores refusal only in late layers, not early ones.

DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

2026-06-25 17:11 UTC

DeepReinforce released Ornith-1.0, an open-source coding model family built on Gemma 4 and Qwen 3.5. Instead of a fixed harness, the model learns its own scaffold during reinforcement learning. The 397B flagship reports 82.4 on SWE-Bench Verified, with all weights under the MIT license.

Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
The model learns its own scaffold during RL, jointly optimizing the harness and the solution.

Beyond Fable: Can a Local LLM Replace Cloud AI for Security Code Reviews

2026-06-25 12:05 UTC

Research shows that with proper scaffolding, a local LLM like Qwen3.6-35B-A3B can produce security findings comparable to frontier cloud models, but works best in a Source-local pipeline where the cloud designs and consolidates while the local model executes, keeping source code on-premises.

A local LLM (Qwen3.6-35B-A3B) found comparable vulnerability sets to cloud frontier models in under 90 minutes with zero human nudges.
The best practice is a Source-local pipeline: cloud for prompt engineering and consolidation, local for execution.

The Sequence AI of the Week #883: Qwen is Getting Into Robotics

2026-06-25 11:01 UTC

Qwen, one of the leading frontier model families, is adding embodied AI capabilities. Alibaba's Qwen-RobotSuite aims to bridge the gap between perception and action with three specialized models.

Until now, Qwen models have been confined to software, with no physical interaction.
Alibaba launched Qwen-RobotSuite with three models for navigation, manipulation, and world modeling.

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

2026-06-25 04:00 UTC

Dustin is a sparse verification framework for long-context speculative decoding that combines draft model lookahead signals with target model historical attention to identify critical tokens, achieving 27.85x self-attention speedup and 9.17x end-to-end decoding speedup at 32k sequence length on Qwen2.5-72B with negligible accuracy loss.

Speculative decoding for long-context LLMs is bottlenecked by KV cache loading during verification
Existing compression methods (static eviction or dynamic selection) fail to balance efficiency and accuracy

[AINews] It's Meta-Harness Summer

2026-06-25 02:14 UTC

A comprehensive roundup of AI developments, including the rise of meta-harness architectures, OpenAI's custom inference chip Jalapeño, the shift in agent UX from tool to coworker, Qwen-AgentWorld's open world models, progress in Chinese open models like GLM-5.2, and policy and talent dynamics reshaping the competitive landscape.

Meta-harness architectures gain attention, with Omnigent offering a standardized, pluggable open-source solution.
OpenAI announces Jalapeño, its first custom AI inference chip, accelerating vertical integration.

Qwen-AgentWorld Models

2026-06-24 13:57 UTC

Introduction of Qwen-AgentWorld models.

Overview of Qwen-AgentWorld models

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

2026-06-24 07:21 UTC

UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. It drafts whole token blocks in a single forward pass and conditions on target hidden features through KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, while NVIDIA reports up to 15x throughput on Blackwell at fixed interactivity. DFlash ships 20 checkpoints and supports SGLang, vLLM, and TensorRT-LLM.

DFlash drafts entire token blocks in a single forward pass, not one token at a time.
It injects target hidden features into every draft layer's KV cache, scaling acceptance length with depth.

Weight-Space Geometry of Offline Reasoning Training

2026-06-24 04:00 UTC

This paper investigates whether six offline RL losses (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) are mechanistically distinct in weight-space geometry when used for reasoning distillation. Using identical math rollouts from Qwen3-4B, they find SFT, RFT, and RIFT have nearly colinear deltas; DFT diverges; Offline GRPO adds orthogonal components; and DPO lies in a near-orthogonal subspace with highest accuracy but a mode-connectivity barrier.

SFT, RFT, and RIFT have cosine similarity >= 0.97 and comparable GSM8K accuracy (~87-88%).
DFT's update direction diverges more than any reward-weighted method.
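
The comparison behind these numbers can be pictured as follows: subtract the base weights from each fine-tuned checkpoint to get a delta, flatten it, and take the cosine between deltas. The sketch below uses toy four-element vectors and invented values purely to show the computation; it is not the paper's code, and real deltas span billions of parameters.

```python
import math


def delta(finetuned, base):
    """Weight update produced by training: fine-tuned minus base."""
    return [f - b for f, b in zip(finetuned, base)]


def cosine(u, v):
    """Cosine similarity between two flattened weight deltas."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy weights, invented for illustration.
base = [0.0, 0.0, 0.0, 0.0]
run_a = [1.0, 2.0, 0.0, 0.0]   # e.g. one loss
run_b = [2.0, 4.1, 0.0, 0.0]   # nearly the same direction, larger step
run_c = [0.0, 0.0, 3.0, -1.0]  # an orthogonal direction

print(cosine(delta(run_a, base), delta(run_b, base)))  # close to 1
print(cosine(delta(run_a, base), delta(run_c, base)))  # 0.0
```

A cosine near 1 means two losses move the weights along almost the same line, differing mainly in step size, while a cosine near 0 means the updates occupy nearly orthogonal subspaces.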

We got local models to triage the OpenClaw repo for FREE!*

2026-06-22 00:00 UTC

A maintainer of OpenClaw built a system using local open-weight models (Gemma, Qwen) in an agent harness to triage issues and pull requests in real time, achieving performance competitive with closed models while running on local hardware for minimal cost.

Local models like Gemma and Qwen can effectively classify GitHub issues and PRs for triage.
The system uses an agent harness with a read-only shell (reposhell) to safely inspect code.

VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline

2026-06-19 22:06 UTC

VibeThinker-3B is a compact 3B-parameter reasoning model that matches large models like DeepSeek V3.2 on math and code benchmarks, using an efficient post-training pipeline and test-time scaling.

VibeThinker-3B is a 3B dense model, MIT-licensed, built on Qwen2.5-Coder-3B for verifiable reasoning.
It scores 94.3 on AIME26, comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

2026-06-19 04:00 UTC

A study comparing Qwen 2.5 7B and XGBoost on clinical prediction reveals that LLM verbalized confidence is epistemically vacuous, an inverse difficulty effect exists, few-shot and SHAP interventions improve accuracy, and a cross-model calibrator reduces calibration error.

LLM verbalized confidence is nearly constant (0.856-0.937) regardless of accuracy, tracking prompt format.
An inverse difficulty effect: LLM accuracy drops when XGBoost is highly confident, but matches it at moderate uncertainty.

Speculation Is All You Need

2026-06-19 00:00 UTC

Modal is all-in on speculative decoding, arguing it's the single most important inference optimization, delivering 2-3x speedups. They released state-of-the-art DFlash speculators for Qwen models, achieving 5-20% extra speedups, and explain the theory, simulation, and math behind the acceleration.

Speculative decoding is the only engine optimization that matters for high-interactivity inference, delivering integral speedups.
Modal released new DFlash speculators for Qwen models, improving speed by 5-20% over strong baselines.
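
The math behind these speedups can be sketched with the standard simplified model of speculative decoding, which may differ from Modal's own analysis. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, and that each drafted token costs `draft_cost` target-model forward passes; all three parameter names are illustrative. Block drafters that produce all `k` tokens in one pass would have a different cost term.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k drafted tokens."""
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the target's own token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Wall-clock speedup over plain decoding under the simplified cost model."""
    return expected_tokens(alpha, k) / (1 + k * draft_cost)


# Illustrative values only, not figures from the article.
for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens(alpha, 4), 2), round(speedup(alpha, 4, 0.05), 2))
```

The model shows why acceptance rate dominates: raising `alpha` lifts the expected tokens per step, while a cheaper drafter only shrinks the denominator.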

We Got Anthropic's Glasswing at Home (Who Needs Mythos 5 or Fable 5?)

2026-06-18 13:49 UTC

Inspired by Anthropic's Glasswing, the author built Lucent, a staged source-code bug-hunter that runs on a local 27B Qwen model on a single RTX 3090. First run against hermes-agent: 1,342 static hits narrowed to 126 leads, then to 15, and finally 2 real bugs. Local read cost ~$1.62. The best moment: a reviewer agent caught that three earlier exploits were scored against an outdated threat model. Detailed pipeline, hardware, and lessons included.

Lucent: a staged pipeline (Rank, Hunt, Verify, Exploit) running on a single RTX 3090 with local models.
Speculative decoding via Lucebox achieves ~3.4× speedup on code (130 tok/s vs 38).

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

2026-06-18 04:00 UTC

JetFlow is a head-based speculative decoding framework that combines one-forward drafting efficiency with branch-wise causal conditioning, enabling larger draft budgets to yield longer accepted prefixes and higher speedups. It achieves up to 9.64x speedup on MATH-500 and 4.58x on conversational tasks with Qwen3 models.

JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, aligning candidate tree scores with autoregressive factorization.
It resolves the causality-efficiency dilemma by combining efficient one-forward propagation with branch-wise causal conditioning.

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

2026-06-18 04:00 UTC

This paper proposes a structural pruning framework for Mixture-of-Experts models by reformulating prune-ratio allocation as a channel-score coverage maximization problem, solved efficiently via attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show accuracy preservation under 50% or 25% structured pruning with 4-bit quantization, achieving 5.27× memory reduction on Qwen3-30B-A3B and outperforming baselines.

Observation: information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in important experts
Proposes a channel-level structural pruning framework that models prune-ratio allocation as a coverage maximization problem

Local Qwen isn't a worse Opus, it's a different tool

2026-06-18 03:04 UTC

The author, a founder of a small software business, shares real-world experience with local models like Qwen. He argues that while local models lag behind frontier models on benchmarks, they offer unique value in privacy, fixed costs, and vendor risk avoidance. He also candidly discusses limitations like infinite loops and hallucinations, warning against using them for unsupervised long-horizon tasks.

Local models and frontier models are different tools for different contexts.
The author demonstrates economic and privacy advantages through actual business use cases.

VL-MemKnG: Hybrid Memory with a Spatio-Temporal Knowledge Graph for Question Answering over Long Egocentric Navigation Trajectories

2026-06-17 04:00 UTC

This paper presents VL-MemKnG, a hybrid memory framework that combines a spatio-temporal knowledge graph with persistent segment-level contextual memory for question answering over long egocentric navigation videos. It improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55% on the WalkieKnowledgeT+ benchmark, outperforming methods including Gemini 2.5 Pro and Qwen 3.5+.

Proposes VL-MemKnG, a hybrid memory framework integrating spatio-temporal knowledge graph and segment-level contextual memory for long egocentric navigation video QA.
Introduces WalkieKnowledgeT+ benchmark with temporally distributed reasoning tasks requiring evidence aggregation across non-cooccurring moments.

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

2026-06-17 04:00 UTC

This paper describes the MLLP-VRAIN UPV system for IWSLT 2026 Simultaneous Speech Translation, using Parakeet and Qwen 3.5 models with adaptive black-box policies to improve quality-latency trade-offs. The system participates in all language directions and introduces a context track for En→De, It, Zh using ASR word-boosting and RAG. Results show a +5.82 XCOMET-XL improvement on MCIF En→De, with additional +1.03 from context processing.

MLLP-VRAIN UPV system uses Parakeet and Qwen 3.5 for cascaded SimulST.
Adaptive black-box policies with relaxations optimize quality-latency trade-offs.

Native Coding Agent Optimized for Local LLM and DeepSeek v4 with Vector Memory

2026-06-16 22:36 UTC

cwcode is a Go-based terminal coding agent that works with DeepSeek V4 Pro, Qwen3.6-27B, and other models. It offers file editing, sub-agents, semantic memory, and autonomous recovery. Key features: low cost (~$0.40/hour), high cache hit ratio (>85%), hash-anchored edits, checkpoint/rewind, and no SaaS lock-in.

Go-based terminal coding agent supporting DeepSeek V4 Pro, Qwen3.6-27B, and other models.
Hash-anchored edits and a sticky prefix cache reduce token usage and cost.

Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

2026-06-16 16:51 UTC

The Qwen team has released Qwen-RobotSuite, a set of three embodied AI models targeting manipulation, world modeling, and navigation. RobotManip is a Vision-Language-Action model built on Qwen3.5-4B that uses a unified alignment framework to scale manipulation data. RobotWorld is a language-conditioned video world model with a 60-layer MMDiT that predicts future video frames. RobotNav is a navigation model built on Qwen3-VL with a parameterized interface for multiple task modes. The suite achieves state-of-the-art results across several benchmarks.

Qwen-RobotSuite comprises three independent models: RobotManip, RobotWorld, and RobotNav.
RobotManip addresses heterogeneous manipulation data via a unified alignment framework, achieving SOTA on OOD benchmarks like LIBERO-Plus and RoboTwin-C2R Hard.

Show HN: HashMeterAi – Private AI Token Real Usage Meter for All Models

2026-06-15 23:33 UTC

HashMeterAi is a local-first, private usage meter for AI coding tools. It tracks token usage across tools such as Claude Code, Codex, Kimi, and Qwen CLI, and presents a unified dashboard with metrics such as cost, processed tokens, focus time, and AI persona. It runs 100% offline and never sends data anywhere, reading usage from each tool's local transcripts.

Private, local-first usage meter for AI coding tools with zero network access.
Tracks real token usage across multiple tools (Claude Code, Codex, Kimi, etc.) from local logs.
