Within days of each other, Google and OpenAI separately exposed operations allegedly originating in China that use AI for fraud and covert influence campaigns. Both target US infrastructure and political debates.
Google and FBI jointly sue Chinese cybercrime network for using Gemini AI to defraud Americans.
OpenAI bans two ChatGPT clusters linked to China for manipulating US tech policy debates.
Moonshot AI has introduced Kimi Work, a local desktop AI agent for macOS and Windows that runs a swarm of up to 300 sub-agents on your machine. It drives your logged-in browser via WebBridge, reads local files, and schedules background jobs with a built-in cron engine. Based on the Kimi K2.6 MoE model (≈32B active parameters, 256K context), it targets knowledge workers by keeping data and execution local.
Kimi Work is a downloadable local desktop agent, not a cloud service, that directly accesses your files and browser sessions.
It supports up to 300 parallel sub-agents coordinated by the Kimi K2.6 model.
Pythagoras-Prover is a compute-efficient family of open-source Lean theorem provers, featuring autoregressive models (4B and 32B) and a diffusion-based prover (4B). It uses curriculum SFT with stratified data and dynamic proof filtering for training efficiency, and introduces Augmented Lean Formalisation (ALF) to expand verified corpora via self-distillation. The 4B model outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while the 32B model sets a new open-source SOTA at 93.0% and solves 93 PutnamBench problems.
Pythagoras-Prover includes autoregressive models at 4B and 32B parameters and a 4B diffusion-based prover that refines proofs iteratively.
Training efficiency is achieved via curriculum SFT with stratified difficulty levels and dynamic proof reasoning filtering within an 8k-token context.
The author developed Pakistan Notice Helper, a safety-focused AI tool for the Hugging Face Build Small Hackathon, designed to help people in Pakistan understand suspicious messages. The tool uses a small model (Qwen3.5 4B) to analyze text or screenshots, providing risk labels, explanations, and safe next steps. It supports English and Urdu, with the Urdu mode featuring a right-to-left layout and Urdu-language assessments. The article shares lessons on model selection, prompting, Urdu UX, and using Codex for rapid development.
Pakistan Notice Helper is a local AI safety tool for suspicious messages in Pakistan, supporting text and screenshots.
The final model choice was Qwen3.5 4B Q8 via llama.cpp, passing all high-risk scam and screenshot test cases.
Moonshot AI, the Chinese company behind the Kimi chatbot, is seeking a valuation of up to $30 billion in a new funding round, more than six times its valuation from late 2025.
Seedream AI Studio integrates ByteDance's Seedream image generation models (4.5/5.0/5.0 Lite/4.0) with Kling 2.1 video animation, offering one-stop text-to-image and image-to-video creation. It can be tried for free without sign-up, and offers multiple pricing plans aimed at e-commerce, social media, and creative professionals.
Supports Seedream 4.5/5.0/5.0 Lite/4.0 with one-click switching
Generated images can be directly animated into 5-15s videos via Kling 2.1
Large language models trained on English data often fail to express world knowledge reliably in other languages, a problem known as cross-lingual factual inconsistency. This paper introduces PolyFact, a large-scale parallel multilingual factual QA dataset with 100K Wikidata-grounded facts across 12 languages. In a comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and GRPO-based reinforcement learning on Qwen-2.5-7B and OLMo-2-1124-7B, GRPO consistently outperforms the other methods, improving cross-lingual consistency and generalization to unseen languages. Mechanistic analyses show GRPO reduces language specialization in MLP layers and attention heads, promoting shared representations. Code, models, and dataset are released.
PolyFact dataset: 100K Wikidata facts across 12 languages for cross-lingual QA.
GRPO reinforcement learning outperforms SFT and CPT for cross-lingual factual recall.
A scathing critique of the American AI industry, comparing it to an 'OnlyFans economy' where investors and companies blindly worship overhyped and overpriced models. The author argues that Chinese open-source models like Qwen 3.7 Max offer superior value and performance, urging developers to vote with their wallets and avoid paying the 'multiplier' for US frontier models.
The author criticizes the hypocrisy and hubris of US AI companies, especially Anthropic and OpenAI.
Chinese models like Qwen 3.7 Max match or exceed US models in practical use at a fraction of the cost.
This article details the configuration and memory calibration required to run the Qwen 3.6 35B MoE model at a 450,000 token context window on a single 32GB VRAM GPU (NVIDIA RTX 5090) using llama.cpp with TurboQuant and YaRN scaling. It covers model selection, quantization trade-offs, KV cache quantization, RoPE scaling, multimodal setup, replication guide, VRAM lifecycle management, and performance evaluation.
Run Qwen3.6-35B-A3B-Q6_K on a single RTX 5090 with 450K context using llama.cpp TurboQuant fork and YaRN scaling.
Achieve 450K context by compressing the KV cache to 3-bit (turbo3) and extending RoPE beyond the native 262K with YaRN, at the cost of perplexity and retrieval accuracy.
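As a rough illustration of the arithmetic behind these figures, the sketch below computes the YaRN scale factor needed to stretch a 262K native window to 450K, and estimates how KV cache size shrinks at lower bit widths. The native window is assumed to mean 262,144 tokens, and the layer count, KV head count, and head dimension are placeholder values, not the model's actual architecture. The formula also ignores quantization block overhead.

```python
def yarn_scale_factor(target_ctx: int, native_ctx: int) -> float:
    """Ratio by which RoPE positions must be stretched to reach target_ctx."""
    return target_ctx / native_ctx


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size: keys plus values for every attention layer."""
    values = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return values * bits_per_value / 8 / 2**30


TARGET_CTX = 450_000
NATIVE_CTX = 262_144  # assumed meaning of "native 262K"

# Hypothetical architecture values, for illustration only.
ARCH = dict(n_attn_layers=10, n_kv_heads=2, head_dim=256)

scale = yarn_scale_factor(TARGET_CTX, NATIVE_CTX)
fp16_gib = kv_cache_gib(TARGET_CTX, bits_per_value=16, **ARCH)
q3_gib = kv_cache_gib(TARGET_CTX, bits_per_value=3, **ARCH)

print(f"YaRN scale factor: {scale:.2f}")
print(f"KV cache at 16-bit: {fp16_gib:.2f} GiB, at 3-bit: {q3_gib:.2f} GiB")
```

Whatever the real architecture values are, the ratio between the 16-bit and 3-bit cache sizes in this simplified model is 16/3, which is why cache quantization matters so much at long context.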
This article is a field report from the second Build Small Hackathon, describing v2 of the 'Thousand Token Wood' simulation. In this version, each of the five woodland creature agents is powered by a different small language model (from OpenAI, OpenBMB, NVIDIA, and a fine-tuned Qwen). The player takes on the role of a shadow financier, able to lend, tip (truthfully or falsely), short, bribe, and broker alliances. The article details engineering challenges: serving layer heterogeneity (vLLM, CUDA toolkit), per-model quirks, a tolerant JSON parser, and a critical information asymmetry firewall to prevent secret flags from leaking into agent prompts. Persistent memory is handled via bounded summaries rather than raw history to avoid prompt inflation. Results show zero leaks, reliable fine-tuned 0.5B performance, and emergent behaviors from heterogeneous agents. Key takeaways: small models are reliable format generators but unreliable reasoners; heterogeneity adds value with manageable cost; secret information requires data-flow-level firewall; bounded memory keeps agents alive without compromising reasoning.
Each agent uses a different small model from different labs, making market behavior more realistic and emergent.
Information asymmetry is protected by a firewall design; tests prove the hidden truth flag never leaks into agent prompts.
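A data-flow-level firewall of the kind described can be sketched as follows. This is a minimal illustration, not the project's code: the `Tip` and `PublicTip` types, the field names, and `build_prompt` are all hypothetical. The idea is that prompt-building code only ever receives a type that has no secret field, so a leak would require bypassing the conversion function rather than forgetting a filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tip:
    """A tip as stored by the game engine, including the hidden truth flag."""
    sender: str
    text: str
    is_true: bool  # secret: must never reach an agent prompt


@dataclass(frozen=True)
class PublicTip:
    """The only view of a tip that prompt-building code is allowed to see."""
    sender: str
    text: str


def to_public(tip: Tip) -> PublicTip:
    """Firewall boundary: copy whitelisted fields and drop everything else."""
    return PublicTip(sender=tip.sender, text=tip.text)


def build_prompt(agent_name: str, tips: list[PublicTip]) -> str:
    """Render an agent prompt from public data only."""
    lines = [f"You are {agent_name}. Tips you have received:"]
    lines += [f"- {t.sender}: {t.text}" for t in tips]
    return "\n".join(lines)


secret_tip = Tip(sender="financier", text="Acorn prices will rise.", is_true=False)
prompt = build_prompt("Badger", [to_public(secret_tip)])
print(prompt)
```

A leak test in this style simply asserts that neither the flag name nor its value appears in any rendered prompt.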
Job Searcher is an AI-powered job search assistant for new grads. It analyzes resumes, generates LinkedIn search queries, and scores job postings across five dimensions: skills, experience, education, industry, and seniority. Built with a teacher-student model (DeepSeek V4 Pro and Qwen3-8B), it uses a curated dataset of 2,500 resumes and 10,000 job postings. Open-source and available on HuggingFace Spaces.
Automates LinkedIn job search with resume-based queries and multi-dimension scoring
Uses DeepSeek V4 Pro as teacher and Qwen3-8B as student
Unlike GPT-4o or Qwen3.5-Omni, Audio Interaction doesn't wait for a recording to end: it translates, transcribes, chats, and picks up everyday noises like coughing in a single stream. Code, model weights, and download instructions are available on GitHub under the Apache 2.0 open-source license, with the training data to follow.
The Audio Interaction model continuously listens to audio streams, making decisions every 0.4 seconds.
It can translate, transcribe, chat, and recognize everyday noises in a single stream.
OpenClaw, an open-source AI agent project, improved its security through transparency and community contributions, despite facing many false vulnerability reports. The article details changes such as trust model documentation, hardening, a plugin architecture, and partnerships with companies including NVIDIA, Microsoft, and Tencent.
A systematic study of parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B for building a domain-specific conversational assistant in telecommunications customer support. The research introduces a combinatorial synthetic data generation approach and evaluates 16 LoRA configurations, revealing a divergence between quantitative validation loss and qualitative human-aligned rankings, and provides an energy-performance trade-off analysis.
Combinatorial synthetic data generation using 52 industry terms produced 30,000 training examples across 1,560 scenarios.
Evaluation of 16 LoRA configurations showed that lowest validation loss (0.5024) ranked only 6th-7th in qualitative assessment, while highest loss (0.6807) ranked first.
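Combinatorial generation of this kind can be sketched as a Cartesian product over scenario axes. The axes and the template below are hypothetical stand-ins, since the summary does not say how the 52 terms were combined. Note only that 1,560 = 52 × 30, which would be consistent with each term being crossed with 30 contexts.

```python
from itertools import product

# Stand-ins for the 52 industry terms and for the (unspecified) scenario axes.
terms = ["roaming", "eSIM", "data cap"]
intents = ["billing question", "technical fault"]  # hypothetical axis
tones = ["neutral", "frustrated"]                  # hypothetical axis


def make_example(term: str, intent: str, tone: str) -> dict:
    """Fill a fixed template; a real pipeline would likely vary the wording."""
    return {
        "scenario": (term, intent, tone),
        "prompt": f"A {tone} customer has a {intent} about {term}.",
    }


# Every combination of the axes becomes one scenario.
scenarios = [make_example(*combo) for combo in product(terms, intents, tones)]

print(len(scenarios), "scenarios")
print(scenarios[0]["prompt"])
```

The appeal of the approach is that coverage is controlled by construction: adding one term or one axis value multiplies the scenario count instead of requiring new hand-written examples.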
This paper proposes a Variance-Aware Reward Framework using Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering. The method replaces weighted binary criterion aggregation and single Likert scoring with continuous analytical reward functions, providing richer optimization signals. On the heart subset of HealthBench, the best variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 over the Qwen3-14B base model, remaining competitive with GPT-OSS-120B.
Proposes a Variance-Aware Reward Framework with GRPO for heart-focused medical QA post-training.
Replaces binary criterion aggregation and Likert scoring with continuous analytical reward functions.
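Why continuous rewards give GRPO a richer signal can be seen from the group-relative advantage itself, which normalizes each sampled response's reward by the mean and standard deviation of its group. The sketch below uses that standard normalization with made-up reward values. It does not reproduce the paper's reward functions, which the summary does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same question, all judged "pass" by a binary criterion.
binary_group = [1.0, 1.0, 1.0, 1.0]
# The same four responses under a hypothetical continuous score.
continuous_group = [0.92, 0.85, 0.97, 0.71]

binary_adv = group_relative_advantages(binary_group)
continuous_adv = group_relative_advantages(continuous_group)

print("binary:    ", binary_adv)
print("continuous:", [round(a, 2) for a in continuous_adv])
```

When every response in a group gets the same binary score, the group has zero variance and every advantage is zero, so that prompt contributes no gradient. Continuous scores still rank the responses against each other.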
Researchers localized a neural subgraph responsible for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), finding that models discount the future less steeply than humans and that this preference is unstable across contexts, with steering vectors capable of modulating it.
Localized temporal preference subgraph in mid-to-upper layers
An audit of the DeepSWE benchmark reveals that deepseek-v4-pro's reported results (8% solve rate, $4.22 avg cost) are invalid due to multiple issues: cost inflated ~5x by ignoring cache pricing, all three reported failures were solved with the same model, OpenRouter privacy settings silently block DeepSeek, and the model received no reasoning/effort tuning unlike competitors.
Cost inflated ~5x: benchmark bills all input tokens at cache-miss rate, ignoring 78% cache hits at 99.2% discount.
All three 'failed' tasks were solved with the same model, deepseek-v4-pro, for ~$0.86 total.
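The cache-pricing effect can be checked with a short calculation. The sketch below uses only the two figures quoted above (78% cache hits, 99.2% discount on cached tokens) and covers input tokens alone. The function name is illustrative, and the audit's own ~5x figure may rest on details not given in this summary.

```python
def effective_input_cost_fraction(cache_hit_rate: float, cache_discount: float) -> float:
    """Fraction of the cache-miss price actually owed for input tokens."""
    miss_share = 1.0 - cache_hit_rate
    hit_share = cache_hit_rate * (1.0 - cache_discount)
    return miss_share + hit_share


fraction = effective_input_cost_fraction(cache_hit_rate=0.78, cache_discount=0.992)
input_inflation = 1.0 / fraction

print(f"Actual input cost is {fraction:.1%} of the cache-miss price")
print(f"Billing everything at the cache-miss rate overstates input cost {input_inflation:.1f}x")
```

Under these two numbers alone, input cost is overstated by a bit more than four times, which is in the same range as the reported inflation.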
SMAC-Talk extends the StarCraft Multi-Agent Challenge with a natural language communication channel to evaluate LLM-based agents in cooperative multi-agent settings. It features decentralized control, partial observability, long-horizon decision making, and scenarios with deceptive communicators. Benchmarking using Qwen3.5 models reveals how reasoning, memory, and scale affect coordination.
SMAC-Talk introduces a natural language channel for evaluating LLM agent coordination.
Includes deceptive communicator scenarios to test trust and robustness.
Qwen 3.7 Plus is Alibaba's proprietary reasoning model released in June 2026, scoring 53 on the Artificial Analysis Intelligence Index, far above average. However, it is expensive, slow, and very verbose. The model supports text, image, and video input with a 1M-token context window.
Intelligence score of 53, well above the average of 23 for comparable models.
Priced at $0.40/M input tokens and $1.16/M output tokens, placing it in the expensive range.
DigitalOcean announced on X that it is now a model provider on OpenRouter, offering DeepSeek V3.2, Kimi K2.6, and DeepSeek V4 Flash. The move signals the company's expansion from cloud infrastructure into AI inference.
DigitalOcean announced on X that it has become a model provider on OpenRouter
Initial models include DeepSeek V3.2, Kimi K2.6, and DeepSeek V4 Flash
A study probing Qwen3-14B hidden states shows that linear probes achieving 100% accuracy in classifying reasoning types (deductive, inductive, abductive) actually detect task format confounds (source, option count, response length) rather than genuine reasoning modes. After deconfounding, accuracy drops to chance, and causal steering shows no functional link. The findings urge routine format deconfounding in mechanistic interpretability.
Linear probes on LLM hidden states can achieve 100% accuracy in distinguishing reasoning types.
This accuracy disappears after controlling for task format confounds like source identity and option count.
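The failure mode can be reproduced in miniature. In the toy sketch below, a one-dimensional threshold probe stands in for a linear probe, and a single feature (response length) stands in for the hidden state. All numbers are synthetic and the setup is an assumption for illustration, not the study's protocol.

```python
import random

random.seed(0)


def make_dataset(n, confounded):
    """Each sample is (response_length, label); labels 0/1 are two reasoning types."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        if confounded:
            # Format leak: type-1 items come from a source with longer responses.
            length = random.gauss(200 if label else 100, 10)
        else:
            # Deconfounded: the length distribution is the same for both labels.
            length = random.gauss(150, 10)
        data.append((length, label))
    return data


def fit_threshold_probe(train):
    """1-D linear probe: threshold halfway between the two class means."""
    means = []
    for cls in (0, 1):
        xs = [x for x, y in train if y == cls]
        means.append(sum(xs) / len(xs))
    threshold = (means[0] + means[1]) / 2
    sign = 1 if means[1] >= means[0] else -1
    return threshold, sign


def accuracy(probe, test):
    threshold, sign = probe
    correct = sum((sign * (x - threshold) > 0) == bool(y) for x, y in test)
    return correct / len(test)


acc_confounded = accuracy(
    fit_threshold_probe(make_dataset(2000, True)), make_dataset(2000, True)
)
acc_deconfounded = accuracy(
    fit_threshold_probe(make_dataset(2000, False)), make_dataset(2000, False)
)

print(f"confounded: {acc_confounded:.2f}, deconfounded: {acc_deconfounded:.2f}")
```

The probe looks near-perfect while the format cue is present and falls to roughly chance once the cue is balanced across labels, even though nothing about "reasoning" was ever encoded in the feature.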
Dropstone 1.5 is an AI coding agent for the terminal, offering roughly 450 deep coding sessions per week for $15/month, about twice the usage Claude Code Pro provides for $20. It runs on DeepSeek and Kimi models hosted in the US, with no data stored. Its safety features require user permission for file writes, shell commands, and network calls.
$15/month for ~450 deep coding sessions per week, 2x Claude Code Pro's usage.
Uses DeepSeek V4 Flash, V4 Pro, and Kimi K2.6 models hosted in the US, no data stored.
Titan Network aggregates unused computing power from consumers' connected devices into a decentralized cloud, offering AI firms infrastructure at up to 75% lower cost. Clients include Tencent, Alibaba, and Kling AI. The company pays 80% of revenue from data tasks to individuals who share their devices and bandwidth.
Titan Network uses crowdsourced home devices for decentralized cloud AI.
Offers up to 75% cost savings over traditional cloud providers.
Alibaba's Qwen team released Qwen3.7-Plus, a multimodal agent model available via API on Bailian (Model Studio). It understands images and video, and adds capabilities including deep reasoning, self-programming, tool invocation, verification/testing, and autonomous iteration. Its preview ranked #16 in Vision Arena, making Alibaba the #5 vision lab.
Alibaba's Qwen team launched Qwen3.7-Plus, a multimodal agent model on the Bailian platform (Model Studio).
The model understands images and video and includes five agentic features: deep reasoning, self-programming, tool invocation, verification/testing, and autonomous iteration.
Proposes SENSE, which uses target model hidden states for semantic retrieval and soft-gated evaluation to improve robustness and efficiency of retrieval-based speculative decoding, achieving up to 4.09 mean acceptance length and 3.26x speedup on LLaMA and Qwen.
SENSE anchors retrieval on hidden states of target model for semantic alignment.
Soft-gated Evaluation validates semantic equivalence instead of surface forms.
BitsMoE is an efficient quantization framework for Mixture-of-Experts (MoE) large language models. It uses SVD to decompose each MoE layer into a shared basis and expert-specific spectral factors, preserving the shared basis without quantization to maintain cross-expert structure. An integer linear programming formulation minimizes reconstruction loss under a fixed bit budget. Experiments show that BitsMoE significantly reduces accuracy degradation in ultra-low-bit regimes, achieving 12.3× quantization speedup, 27.83 percentage point average accuracy improvement, and 1.76× decoding speedup over GPTQ on Qwen3-30B-A3B-Base at 2 bits.
Proposes BitsMoE, which leverages SVD decomposition of MoE layers for fine-grained quantization.
Uses integer linear programming for activation-aware mixed-precision bit allocation to minimize reconstruction loss.
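Bit allocation under a fixed budget can be illustrated on a tiny instance. The sketch below replaces the integer linear program with exhaustive search, allocates bits per expert, and uses a made-up error model (sensitivity divided by 4 to the power of the bit width). All three choices are assumptions for illustration and are not taken from the paper.

```python
from itertools import product

BIT_CHOICES = (1, 2, 3, 4)

# Hypothetical per-expert sensitivities: higher means quantization hurts more.
SENSITIVITY = [8.0, 1.0, 3.0, 0.5]


def reconstruction_error(expert: int, bits: int) -> float:
    """Toy error model: each extra bit cuts the error by a factor of 4."""
    return SENSITIVITY[expert] / 4**bits


def allocate_bits(n_experts: int, total_bits: int):
    """Exhaustively find the allocation with minimum total error within budget."""
    best_alloc, best_cost = None, float("inf")
    for alloc in product(BIT_CHOICES, repeat=n_experts):
        if sum(alloc) > total_bits:
            continue  # violates the bit budget
        cost = sum(reconstruction_error(i, b) for i, b in enumerate(alloc))
        if cost < best_cost:
            best_alloc, best_cost = alloc, cost
    return best_alloc, best_cost


# Budget: an average of 2 bits per expert.
mixed_alloc, mixed_cost = allocate_bits(len(SENSITIVITY), total_bits=8)
uniform_cost = sum(reconstruction_error(i, 2) for i in range(len(SENSITIVITY)))

print("mixed allocation:", mixed_alloc, "error:", mixed_cost)
print("uniform 2-bit error:", uniform_cost)
```

At the same average bit width, the search moves bits from the least sensitive expert to the most sensitive one and ends with lower total error than uniform 2-bit quantization. An ILP solver does the same job when the search space is too large to enumerate.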
NVIDIA launched Cosmos 3 (unified multimodal world model), Nemotron 3 Ultra (efficient 550B LLM), and RTX Spark (personal AI superchip). Also covered: MiniMax M3, Qwen3.7-Plus, JetBrains Mellum2, agent ecosystems, and infrastructure updates.
NVIDIA's Cosmos 3 uses a Mixture-of-Transformers architecture to unify language, image, video, audio, and action. Nemotron 3 Ultra is a 550B open-weight LLM claiming US SOTA with fast inference. RTX Spark is a personal AI computer with Grace+Blackwell at 1 petaflop FP4.
MiniMax M3 launched as an open-weight multimodal agent model with 1M context and strong coding benchmarks. Qwen3.7-Plus from Alibaba is a hybrid agent unifying GUI/CLI. JetBrains Mellum2 is a 12B MoE for ultra-low-latency developer workflows.
Together AI optimizes MiniMax M3 serving with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway, achieving 81–125% throughput improvements across concurrency levels.
MiniMax M3 combines coding, agentic workflows, and multimodal reasoning with a 1M-token context window.
Together AI's kernel team developed KV-block-major sparse attention and integrated MSA with paged attention.
MiniMax officially released MiniMax M3 on June 1, 2026, featuring MiniMax Sparse Attention (MSA) for a 1M-token context window, native image/video input, and desktop computer operation. The API is live now.
M3 introduces MSA, achieving >9× prefill and >15× decoding speedup at 1M-token context versus M2, with 1/20th per-token compute.
Scores 59.0% on SWE-Bench Pro, surpassing GPT-5.5 and Gemini 3.1 Pro.
Chinese AI company MiniMax has released its new model M3. It's billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.
MiniMax releases M3, the first open-weight model combining top coding performance, 1M-token context, and native multimodality.
The model challenges proprietary leaders in AI performance.
Chinese AI startup MiniMax released its flagship model M3, designed for coding agents and automated workflows. It processes up to 1M tokens, reduces computational costs by 20x, and outperforms OpenAI GPT-5.5 and Google Gemini on SWE-Bench Pro. The company also prepares for a Shanghai IPO and partners with Ant Group's Alipay for AI payment infrastructure.
MiniMax unveils M3 with 1M-token context and 20x cost reduction.
M3 beats OpenAI GPT-5.5 and Google Gemini 3.1 Pro on SWE-Bench Pro.
At the 2026 China AIGC Industry Summit, Wang Xiaoye, Technical Director of AWS Product Technology, pointed out that 87% of enterprises claim to have deployed AI at scale, but only 10% have gained actual value. He emphasized the huge gap between personal and enterprise-level agent deployment, and proposed that enterprises need to focus on five layers: compute, models, data & knowledge, agentic platform, and applications. He also noted that token costs are often high because too much useless information is fed to the model.
87% of enterprises have deployed AI, but only 10% see value
Personal and enterprise agent deployment are fundamentally different
This study introduces a multi-model paradigm to study synthetic deception via LoRA fine-tuning of five transformer models. Linear probes detect deception with near-perfect AUC in early layers, and logistic regression probes outperform MLP probes, supporting the Linear Representation Hypothesis. Probes generalize across domains with minimal loss. Different models exhibit distinct representational regimes: collapse in Pythia/Llama/Qwen versus high-dimensional preservation in Gemma-2. The results show that robust, domain-invariant deception representations can be rapidly entrenched through modest supervised fine-tuning, with implications for activation-based monitoring.
Linear probes on mean-pooled hidden states detect synthetic dishonesty with near-perfect AUC (≥0.99) as early as layers 1-3 in four architectures. Logistic regression consistently matches or outperforms MLP probes.
Probes trained on TruthfulQA generalize with near-zero loss (ΔAUC≈0) to held-out MMLU subjects. Late-layer representations show strong robustness to Gaussian noise.
PhyDrawGen is a neuro-symbolic pipeline that generates physically accurate diagrams from text. It uses an LLM to extract a scene graph, a deterministic solver to encode physics constraints, and a fine-tuned Qwen-VL model to iteratively correct violations. Evaluated on 1,449 problems, it outperforms GPT-5-image and Gemini models.
PhyDrawGen combines LLM, deterministic solver, and vision model for physically accurate diagram generation.
It addresses hallucinations of force vectors and conservation law violations.
At the 2026 China AIGC Industry Summit, Wang Xiaoye, Technical Director of Amazon Web Services, pointed out that 87% of enterprises claim to have deployed AI at scale, but only 10% have gained real production value. He emphasized that enterprise-grade Agent deployment must bridge four major gaps: model selection, construction complexity, usage threshold, and talent shortage. He introduced AWS's five-layer architecture—compute, model, data, harness platform, and agent applications—and products like Quick to help enterprises move from demo to production.
87% of enterprises deploy AI, but only 10% gain production value.
Enterprise-grade agents differ vastly from personal ones, requiring solutions for security, stability, and trust.
The article argues Chinese AI labs open source models not as a national strategy but as a commercial strategy to gain global attention and trust. Using DJI and Insta360 as examples, it emphasizes the importance of marketing on YouTube. Chinese labs lack international marketing capabilities, so open source is their only way into the global conversation. Future releases will include proprietary open source models and fine-tuned variants to set standards.
Chinese AI labs open source for global visibility and engagement, not due to government mandate.
They lack international marketing presence, so open source serves as PR and trust-building.
MiniMax, an AI startup focusing on multimodal models, went public on the Hong Kong Stock Exchange in January 2026. The company adheres to a dual strategy of large models + applications and ToC + ToB. Internally, it provides unlimited tokens to all employees, uses agents to automate workflows, and targets high-value tasks that humans dislike, significantly improving efficiency and flattening the organization. In the next 2-3 years, AI will deeply integrate with various industries.
MiniMax has been committed to next-generation AI since its founding, advocating 'Intelligence with Everyone' and dual driving of models/applications and ToC/ToB.
Internal practices: unlimited tokens for all, agent-assisted HR and coding, flatter organization, and 30% R&D efficiency boost.
A project demonstrates boosting Qwen3-30B inference speed from 0.09 to 14.03 tok/s on a 2017 MacBook Air by combining a human experimenter, Codex, llama.cpp, a local database, and IBM Quantum sampling. The QPU is used for candidate selection, not for running the model directly.
Runs Qwen3-30B on a 2017 MacBook Air (8GB RAM, CPU-only)
Hybrid quantum-classical optimization loop achieves 14.03 tok/s from 0.09 baseline
A new review paper argues that the real bottleneck for autonomous AI agents is the software layer around the language model: tools, memory, testing, and permissions. DeepSeek is building a dedicated 'Harness' team in Beijing, which the article presents as confirmation of the formula: model + harness = AI agent.
The paper claims the bottleneck for AI agents is the software harness, not the model.
Key components include tools, memory, testing, and permission boundaries.
LightSail Technology announced a strategic partnership with Tencent Travel Services to integrate its AI full-sensing wearable device into the mobility platform. The device previously topped JD.com's bestseller list and sold out; now a new pre-sale round is open with discounts.
LightSail Technology and Tencent Travel Services partner to integrate AI wearable into travel services.
The LightSail AI wearable topped JD.com's bestseller list for 8 consecutive days and sold out.
PPIO has been named to the '2026 Global AI 100' list by FeiFan Research, recognized at the FeiFan Awards – Annual AI Globalization Summit. The list honors AI-native companies with global vision. PPIO offers a global distributed computing infrastructure, full-stack cloud services, a model platform supporting DeepSeek, GLM, MiniMax, Kimi, Qwen, and an innovative Agent Sandbox. As of April 2026, PPIO has integrated over 4,800 distributed nodes, with daily token calls exceeding 1 trillion, over 570,000 developers, and Agent Sandbox business growing more than 50x since launch. PPIO was also designated as a pilot unit for Shanghai's Digital Overseas Service Platform and a GDA Pilot Service Station.
PPIO selected for '2026 Global AI 100', highlighting its leadership in AI globalization.
Provides global distributed computing infrastructure with full GPU coverage for training and inference.
The next wave of AI creation is hitting gaming. Tencent has unveiled 'Project Craft', an AI-powered game creation platform that lets users generate playable games through natural language, supports 2D and 3D, and comes with AIGC tools and free assets to slash the barrier to game development.
Tencent launches 'Project Craft', an AI game creation platform that generates playable games from natural language prompts
Supports both 2D and 3D games, with a full AIGC pipeline and over 20,000 free assets
Tencent has released Miora, an AI-powered creative studio that integrates image, video, UI/UX, and 3D generation. It features a memory system, multi-modal canvas, and customizable Skills, aiming to enable one person to have a whole creative studio.
Tencent launches Miora, a creative AI agent studio
Supports generation of images, videos, UI/UX, and 3D content
A comprehensive evaluation of 14 open-source safety guard models on a benchmark of 79,331 samples reveals that Qwen Guard (4B parameters) achieves the highest recall (83.97%), while larger models like Llama Guard (12B) miss up to 75% of unsafe content. Model size does not correlate with safety performance, and general-purpose guard models outperform specialized ones.
Qwen Guard (4B parameters) achieves the highest recall (83.97%) among 14 open-source safety guard models.
Larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content.
This paper presents RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized LLM built on Qwen2.5-0.5B using vocabulary injection and edge-first deployment. It achieves 35.9% mean accuracy on Arabic benchmarks, outperforming all same-class open models, and ties Falcon-H1-1.5B on COPA-ar at one-third the size. The quantized model is 398 MB and delivers 635 tokens/s on a single H100, enabling efficient edge deployment.
518M-parameter Arabic LLM built on Qwen2.5-0.5B with vocabulary injection of 27,032 Arabic tokens.
Achieves 35.9% mean accuracy on three Arabic benchmarks, surpassing all same-class open-source models.
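The 518M figure is consistent with simple embedding arithmetic. The check below assumes that Qwen2.5-0.5B has roughly 494M parameters, an embedding width of 896, and tied input/output embeddings, so that each injected token adds one embedding row. These base-model figures are recalled from the Qwen2.5 model card and should be treated as assumptions; only the token count comes from the summary.

```python
BASE_PARAMS = 494_000_000  # approximate Qwen2.5-0.5B size (assumption)
HIDDEN_SIZE = 896          # embedding width of Qwen2.5-0.5B (assumption)
NEW_TOKENS = 27_032        # injected Arabic tokens, from the summary

# With tied embeddings, each new token adds exactly one row of HIDDEN_SIZE weights.
added_params = NEW_TOKENS * HIDDEN_SIZE
total_params = BASE_PARAMS + added_params

print(f"added: {added_params / 1e6:.1f}M, total: {total_params / 1e6:.0f}M")
```

Under these assumptions the injected vocabulary accounts for about 24M parameters, which lands the total at roughly 518M.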
Recent work shows RL retains prior capabilities more effectively than SFT. This paper extends that finding to the mechanistic level, introducing differential circuit vulnerability as a measure of circuit degradation. On Qwen2.5-3B-Instruct for scientific QA, SFT adapts faster but causes greater circuit disruption and forgetting, while RL preserves circuits at the cost of slower adaptation. The results suggest circuit preservation explains RL's robustness against catastrophic forgetting.
SFT adapts quickly but disrupts internal circuits, leading to catastrophic forgetting.
RL preserves more of the base model's circuits, resulting in less forgetting but slower task adaptation.
At the 2026 China AIGC Industry Summit, Baidu's Miaoda product director Zhu Guangxiang described how AI has lowered the barrier to programming from writing code to chatting. According to Zhu, 87% of Miaoda users cannot code, an 8-year-old has built an OS, and one-person companies (OPCs) are landing million-dollar contracts. Vibe coding turns the demand side into the supply side, letting the people who need software build it themselves and enabling mass entrepreneurship.
Fourth programming revolution: natural language programming, massively expanding creators
87% of Miaoda users have no coding skills; OPCs are the largest user group (16% entrepreneurs)
NVIDIA researchers have introduced Polar, a rollout framework that trains language agents using reinforcement learning without modifying their agent harnesses. Polar places a model API proxy between the harness and the inference server, capturing token-level interactions and reconstructing trainer-ready trajectories. Using GRPO on a Qwen3.5-4B base model, Polar improves SWE-Bench Verified pass@1 by 22.6 points under the Codex harness, 4.8 points under Claude Code, and 6.2 points under Pi. The framework is registered as a NeMo Gym environment and released under the ProRL Agent Server repository.
Polar enables RL training on any agent harness via a model API proxy without modifying the harness code
Achieves up to 22.6 point improvement on SWE-Bench Verified using GRPO on Qwen3.5-4B across four coding harnesses
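The proxy idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Polar's implementation: `RecordingProxy`, `fake_backend`, and `unmodified_harness` are hypothetical names, and a real proxy would speak an HTTP model API rather than wrap a Python callable.

```python
from typing import Callable


class RecordingProxy:
    """Sits between an agent harness and a model backend, logging every call."""

    def __init__(self, backend: Callable[[list[int]], list[int]]):
        self._backend = backend
        self.steps: list[dict] = []

    def complete(self, prompt_tokens: list[int]) -> list[int]:
        completion = self._backend(prompt_tokens)
        # Copy both sides so later mutation by the harness cannot alter the log.
        self.steps.append(
            {"prompt": list(prompt_tokens), "completion": list(completion)}
        )
        return completion

    def trajectory(self) -> list[dict]:
        """Token-level record of the episode, ready to hand to a trainer."""
        return list(self.steps)


def fake_backend(prompt_tokens: list[int]) -> list[int]:
    """Stand-in for an inference server: returns the last token plus one."""
    return [prompt_tokens[-1] + 1]


def unmodified_harness(complete: Callable[[list[int]], list[int]]) -> list[int]:
    """A harness that only knows it was given a completion function."""
    context = [1, 2, 3]
    for _ in range(2):
        context += complete(context)
    return context


proxy = RecordingProxy(fake_backend)
final_context = unmodified_harness(proxy.complete)

print(final_context)
print(proxy.trajectory())
```

The harness runs exactly as it would against the real backend, and the training side gets exact prompt and completion tokens for each step without any harness changes.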
South Africa holds 88% of global platinum-group metals, hosts Africa's largest data center market, and sits at the center of a US-China AI infrastructure contest. Yet its draft AI policy, withdrawn after it was found to contain hallucinated references, fails to leverage these advantages for favorable terms. The article examines South Africa's structural leverage, three possible AI infrastructure futures (Chinese, US, local open-weight), and the need for binding governance provisions.
South Africa's platinum metals and renewable energy give it unique AI leverage, but the draft policy lacks minimum terms for hyperscalers, data sovereignty, or tech transfer conditions.
US and Chinese tech companies (Microsoft, Huawei) compete for AI infrastructure control in South Africa, while the policy does not specify what South Africa demands in return.
A new method called Self-Verified Distillation (SVD) enables LLMs to self-improve using only unlabeled prompts, without external feedback. The model generates candidate solutions, filters them through a three-stage verification cascade, and trains on the curated data. Experiments on Qwen3 models show significant gains across math, science, and coding benchmarks.
SVD uses cycle-consistency, factuality, and correctness checks to filter self-generated solutions.
More candidate samples and larger verification budgets yield higher-quality training data.
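The generate, filter, and train loop can be sketched as a cascade of checks applied in order. The three checks below are toy stand-ins on an arithmetic prompt, chosen only so the example runs end to end. The actual cycle-consistency, factuality, and correctness verifiers are model-based and are not described in enough detail here to reproduce.

```python
def parse(candidate: str):
    """Return (a, b, c) for a string of the form 'a + b = c', else None."""
    try:
        lhs, rhs = candidate.split("=")
        a, b = lhs.split("+")
        return int(a), int(b), int(rhs)
    except ValueError:
        return None


def is_well_formed(prompt: str, candidate: str) -> bool:
    """Toy stand-in for a factuality check: the statement must parse at all."""
    return parse(candidate) is not None


def is_cycle_consistent(prompt: str, candidate: str) -> bool:
    """Toy stand-in: the question recovered from the answer matches the prompt."""
    parsed = parse(candidate)
    a, b = (int(x) for x in prompt.split("+"))
    return parsed is not None and parsed[:2] == (a, b)


def is_correct(prompt: str, candidate: str) -> bool:
    """Toy stand-in for a correctness check: the arithmetic must hold."""
    parsed = parse(candidate)
    return parsed is not None and parsed[0] + parsed[1] == parsed[2]


def verification_cascade(prompt, candidates, checks):
    """Apply each check in turn, keeping only candidates that pass all of them."""
    survivors = list(candidates)
    for check in checks:
        survivors = [c for c in survivors if check(prompt, c)]
        if not survivors:
            break
    return survivors


prompt = "2 + 3"
candidates = ["2 + 3 = 5", "2 + 3 = 6", "3 + 3 = 6", "2 + 3 = five"]
kept = verification_cascade(
    prompt, candidates, [is_well_formed, is_cycle_consistent, is_correct]
)

print(kept)
```

Only the survivors would be added to the training set, which is why sampling more candidates per prompt tends to raise the yield of usable data.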