Perplexity AI open-sourced a Rust reimplementation of their Unigram tokenizer, achieving 5x lower latency than Hugging Face's tokenizers crate and reducing CPU utilization by 5-6x in production. The optimizations include double-array trie, bitmap packing, and huge pages.
Perplexity AI rewrote the Unigram tokenizer in Rust, achieving 5x lower p50 latency vs Hugging Face tokenizers crate.
Three optimizations: double-array trie, bitmap and cache-line packing, and huge pages.
Artificial Analysis and IBM launch ITBench-AA, a benchmark for agentic enterprise IT tasks focusing on Site Reliability Engineering. Frontier models score below 50%, with Claude Opus 4.7 leading at 47%. The benchmark evaluates models on Kubernetes incident response, requiring diagnosis from logs and traces.
Claude Opus 4.7 leads at 47%, with GPT-5.5 at 46% and Qwen3.7 Max at 42%.
All frontier models score below 50%, making ITBench-AA one of the least saturated agentic benchmarks.
This article details how to deploy a fully local voice conversation pipeline for the Reachy Mini robot, eliminating the need for cloud servers or API keys. It uses a cascaded approach combining VAD, STT, LLM, and TTS, with recommended defaults: llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT 0.6B v3 STT, and Qwen3-TTS. Various LLM options are provided, including local MLX, Transformers, vLLM, or remote Responses API.
Reachy Mini can now run conversations fully locally without a server.
The cascaded pipeline includes VAD, STT, LLM, and TTS, with swappable components.
This article clarifies often-confused AI agent terms like 'harness' (execution layer) and 'scaffold' (behavior-defining layer), explaining model, agent, tool use, sub-agents, and training concepts.
AI Agent = Model + Harness, where harness handles model calls and tool execution.
Scaffold is the behavior-defining layer around the model: prompts, tool descriptions, etc.
NVIDIA introduces Nemotron-Labs Diffusion language models that achieve up to 6.4x faster inference than autoregressive models while maintaining high accuracy by generating tokens in parallel and refining them iteratively. The models support three modes: autoregressive, diffusion, and self-speculation. The 8B model outperforms Qwen3 8B by 1.2% accuracy.
Nemotron-Labs Diffusion models offer three generation modes: autoregressive, diffusion, and self-speculation.
The 8B model achieves 2.6x TPF in diffusion mode and up to 6.4x with self-speculation.
A 3-billion-parameter specialized model outperformed all commercial frontier APIs on quality, cost, and stability in an enterprise OCR benchmark, at roughly fifty times lower cost. The result challenges the default assumption that larger models are always better, highlighting distributional alignment as a more decisive performance variable than parameter count.
A 3B parameter specialized model scored 0.911 on a domain-specific OCR benchmark, beating Claude Opus 4.6's 0.833.
The specialized model cost ~52x less to run than the closest frontier API.
The open-source movement is bringing AI breakthroughs to robotics, lowering barriers to entry. From the ROS framework to models from Nvidia, Hugging Face, and Alibaba, robots' ability to reason, decide, and act is becoming accessible to more people. However, tensions between commercial incentives and academic ideals present new challenges.
Open-source robotics software has evolved over decades; ROS set the infrastructure, and now open-source AI models are driving the evolution of robot 'brains'.
Companies like Nvidia, Hugging Face, and Alibaba have released open-source robotic AI tools and models, significantly lowering the entry barrier.
Allen AI released OlmoEarth v1.1, reducing compute costs up to 3x while maintaining v1 performance by merging tokens across resolutions. The new model family is designed for large-scale remote sensing analysis and is already deployed globally by partners.
OlmoEarth v1.1 cuts compute costs up to 3x compared to v1 with similar performance.
Efficiency gain comes from merging tokens for different resolutions into one, reducing sequence length.
Six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on Ettin ModernBERT encoders with a distillation recipe. The smallest (17M) to largest (1B) all outperform prior models on MTEB and NanoBEIR. Full training recipe, dataset, and models are open-sourced.
Six rerankers from 17M to 1B parameters, all SOTA for their size
Trained with pointwise MSE distillation from mxbai-rerank-large-v2
This article presents a parameter-efficient fine-tuning approach using LoRA and DoRA to adapt NVIDIA Cosmos Predict 2.5 for robot video generation on a single GPU. It covers data preparation, adapter initialization, training with rectified flow loss, inference, and evaluation metrics.
LoRA and DoRA enable efficient fine-tuning of large world models by injecting small trainable adapters, reducing memory and avoiding catastrophic forgetting.
Training uses 92 robot manipulation videos with rectified flow loss and MSE loss on non-conditioned frames.
IBM Research launches the Open Agent Leaderboard, an open benchmark for comparing full agent systems (not just models). It evaluates generality across six diverse benchmarks, reporting both quality and cost. Early results show general agents are competitive with specialized ones, and agent architecture is increasingly impactful. All code, data, and paper are open-source.
The Open Agent Leaderboard measures complete agent systems, including tools, planning, memory, and error recovery, not just the underlying model.
Six benchmarks span coding, customer service, technical support, and research tasks.
IBM releases Granite Embedding Multilingual R2, two multilingual embedding models (97M and 311M parameters) built on ModernBERT, supporting 32K token context, covering 200+ languages, and achieving top scores on MTEB multilingual retrieval benchmarks. The 97M model is the best sub-100M open model, and the 311M model ranks #2 among models under 500M parameters.
97M model scores 60.3 on MTEB Multilingual Retrieval, best among sub-100M models; 311M model scores 65.2, #2 under 500M parameters.
32K token context (64x R1), 200+ languages supported, with 52 languages and 9 programming languages explicitly tuned for retrieval.
A new open-source model GLiNER2-PII with 0.3B parameters achieves state-of-the-art performance on PII detection, surpassing OpenAI's Privacy Filter on the SPY benchmark. It recognizes 42 entity types and is trained on a multilingual synthetic corpus. The model is publicly available on Hugging Face.
Open-source 0.3B parameter model for PII detection
Outperforms OpenAI Privacy Filter on SPY benchmark
Understanding modern AI architectures is harder than ever. This article introduces a simple trick: replace 'huggingface.co' with 'hfviewer.com' in any Hugging Face model URL to instantly get an interactive visualization of the model's architecture. The tool, hfviewer, supports transformers, vision, and multimodal models with zero setup. Terminal commands and a browser extension offer even faster access.
Replace huggingface.co with hfviewer.com in any Hugging Face model URL to visualize architecture.
hfviewer converts model structures into interactive graphs, supporting multiple architectures.
This article explains how to separate CPU and GPU workloads to achieve a massive performance boost for inference. Continuous batching improves GPU utilization by tightly packing batches, but synchronous operation causes idle gaps where CPU and GPU wait for each other, accounting for nearly a quarter of total runtime. By using non-default CUDA streams and events for asynchronous batching, the CPU and GPU can work in parallel, eliminating idle time and providing a free 24% speedup. The article details CUDA streams, events, and their application to continuous batching, with reference to the implementation in the transformers library.
Synchronous continuous batching wastes about 24% of time due to GPU waiting for CPU.
Asynchronous batching uses non-default CUDA streams and events to parallelize CPU and GPU.
A malicious Hugging Face repository imitating an OpenAI release delivered infostealer malware to Windows machines, amassing around 244,000 downloads before removal. Researchers warn that public AI model registries pose supply chain risks as developers clone models into corporate environments.
A fake 'Open-OSS/privacy-filter' repository on Hugging Face imitated OpenAI's Privacy Filter, containing a malicious loader.py that installed credential-stealing malware.
The repository reached the trending top with 667 likes in under 18 hours, but downloads may have been artificially inflated by attackers.
This article examines AWS's infrastructure components for foundation model pre-training, post-training, and inference, including GPU instances, Elastic Fabric Adapter (EFA), Lustre file system, and UltraCluster/UltraServer architectures, emphasizing the role of open-source software in resource management and monitoring.
Foundation model scaling has evolved from pre-training alone to three regimes: pre-training, post-training, and test-time compute.
AWS offers multiple GPU generations from H100 to B300 with NVLink and EFA networking.
Unsloth, an open-source AI optimization library, has been officially welcomed into the PyTorch Ecosystem Landscape. The announcement highlights Unsloth's contributions to model training, quantization, and community impact, including over 250M model downloads and being the 10th most-followed organization on Hugging Face. Unsloth will continue its open-source work with closer collaboration with PyTorch.
Unsloth joins PyTorch Ecosystem, recognized for technical merit and community impact.
Unsloth offers tools for faster training (2x speed, 70% less VRAM), quantized models, and Unsloth Studio UI.
MachinaCheck is a multi-agent AI system built on AMD MI300X that generates CNC manufacturability reports from STEP files in 30 seconds. It runs fully on-premise to protect intellectual property, combining geometric parsing with LLM reasoning.
Traditional manual evaluation takes 30-60 minutes per drawing; MachinaCheck does it in 30 seconds
Leverages AMD MI300X's 192GB VRAM for fully local inference, ensuring customer IP stays private
Crusoe and NVIDIA Dynamo developed fastokens, an open-source Rust BPE tokenizer that achieves a 9.1× average speedup over HuggingFace and up to 40% faster TTFT on long-context workloads.
fastokens delivers 9.1× average speedup, up to 31× on long prompts.
Optimizations include parallel pre-tokenization, two-level caching, and dynamic memory management.
OncoAgent is an open-source, privacy-preserving clinical decision support system for oncology. It features a dual-tier LLM architecture (9B speed vs 27B deep reasoning), multi-agent LangGraph topology, Corrective RAG pipeline over 70+ NCCN and ESMO guidelines, and a three-layer reflexion safety validator with Zero-PHI policy. The system routes queries via complexity scoring and was fine-tuned on AMD Instinct MI300X, achieving 56x throughput acceleration. It supports on-premises deployment to ensure data sovereignty.
Open-source, privacy-preserving oncology decision support system for on-premises deployment.
Dual-tier LLM: 9B speed-optimized and 27B deep-reasoning models, routed via additive complexity scorer.
CyberSecQwen-4B is a small specialized model fine-tuned from Qwen3-4B-Instruct for defensive cybersecurity, addressing data privacy, cost, and offline deployment needs. It matches or surpasses the 8B Cisco Foundation-Sec-Instruct model on CTI-Bench benchmarks while running on a single consumer GPU. The article details training methodology, data sources, benchmark results, and future directions.
CyberSecQwen-4B achieves +8.7 points on CTI-MCQ and retains 97.3% of CTI-RCM accuracy compared to the 8B Cisco model, with half the parameters.
Runs on a single 12 GB consumer GPU, enabling sensitive data to stay on-premises, reducing API costs, and supporting air-gapped environments.
Allen AI releases EMO, a mixture-of-experts model pretrained end-to-end so that modular structure emerges directly from data without human-defined priors. EMO allows using only 12.5% of experts per task while retaining near full-model performance, and works as a strong general-purpose model when all experts are used. Unlike standard MoE, EMO's expert subsets degrade only slightly when selectively used.
EMO is a 1B-active, 14B-total-parameter MoE with 128 experts, 8 active per token.
Document-level routing constraints encourage expert clusters to form semantic domains (e.g., health, news) rather than low-level syntactic patterns.
gnucleus-ai has released an open-source FreeCAD dataset on Hugging Face, containing 100 parametric CAD models (shafts, bearings, flanges, etc.) with key parameters, images, and FCStd files, suitable for CAD generation tasks. The dataset is licensed under Apache-2.0, includes various mechanical parts, and supports 3D, image, and text modalities.
A detailed walkthrough of LoRA fine-tuning Qwen3-1.7B on the MedMCQA dataset using AMD MI300X with ROCm. The entire pipeline runs without CUDA, training takes ~5 minutes, and the model outputs both answer and explanation.
Leverages 192GB HBM3 memory of AMD MI300X for full fp16 training without quantization.
LoRA updates only ~0.14% of parameters (2.2M), training completes in ~5 minutes.
Learn how to deploy any HuggingFace model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.
Use Goose and Together's Dedicated Container Inference to deploy models with zero lag on release day.
Author deployed Netflix's void-model with a single session and prompt.
ServiceNow AI team's migration from vLLM V0 to V1 for RL training identified four backend fixes: processed logprobs, runtime defaults, inflight weight updates, and fp32 lm_head. They prioritized backend correctness before applying objective-side corrections, achieving full parity with V0 reference.
Migration objective: verify V1 returns expected logprobs and compare with V0 baseline
This article reviews ML Intern, an open-source ML assistant that goes beyond AutoML by supporting the entire workflow from dataset research to model deployment. It demonstrates a practical project: building a text classification model for customer support tickets, covering steps like dataset selection, smoke testing, and training plan approval.
ML Intern is an open-source assistant for the Hugging Face ecosystem, aiding in the full ML workflow.
The tool was tested on a customer support ticket classification task, showing dataset research, smoke testing, and training plan creation.
The article explores the rising cost of AI evaluation, especially for agent benchmarks, showing that evaluation has become a new compute bottleneck. Static benchmarks can be compressed 100-200x, but agent and training-in-the-loop benchmarks resist compression. Reliability demands repeated runs, multiplying costs. High evaluation costs risk concentrating validation power in well-funded labs.
AI evaluation costs have crossed an affordability threshold, with a single agent evaluation potentially costing tens of thousands of dollars.
Static benchmarks can be drastically compressed, but agent benchmarks achieve only 2-3.5x compression.
DeepInfra joins Hugging Face Hub as an Inference Provider, offering cost-effective serverless inference on over 100 models, starting with conversational and text-generation tasks, accessible via UI and SDKs.
DeepInfra is now an Inference Provider on Hugging Face Hub, providing serverless inference for 100+ models.
Initial support for models like DeepSeek V4, Kimi-K2.6, GLM-5.1, with more tasks (image, video) coming soon.
NVIDIA announces Nemotron 3 Nano Omni, a new omni-modal understanding model that processes text, images, video, and audio. Built on a hybrid Mamba-Transformer-MoE backbone with C-RADIOv4-H vision encoder and Parakeet-TDT-0.6B-v2 audio encoder, it achieves top benchmarks in document understanding, ASR, video understanding, and efficiency. Designed for real-world document analysis, automatic speech recognition, long audio-video understanding, agentic computer use, and general multimodal reasoning.
Nemotron 3 Nano Omni is a unified multimodal model supporting text, image, video, and audio input.
It uses a hybrid Mamba-Transformer-MoE architecture for efficient long-context processing.
This article explains how to use Scikit-LLM's text summarization feature to handle large volumes of text in machine learning pipelines. It covers building a custom Hugging Face summarizer transformer, integrating it into a scikit-learn pipeline with TF-IDF vectorization and a classifier, and demonstrates the process with code examples.
Scikit-LLM bridges traditional ML and LLMs, offering zero-shot classification and text summarization.
A custom HuggingFaceSummarizer class inherits from BaseEstimator and TransformerMixin to load a pretrained model and produce summaries.
This article demonstrates building three scalable web applications using OpenAI's newly released open-source Privacy Filter: Document Privacy Explorer, Image Anonymizer, and SmartRedact Paste. Each app showcases different capabilities of the model and leverages gradio.Server for efficient backend processing and custom frontends.
OpenAI released Privacy Filter, an open-source PII detector supporting 128k context and eight categories.
Three example apps: Document Privacy Explorer, Image Anonymizer, SmartRedact Paste.
DeepSeek released V4 with a 1M-token context window, optimized for long-running agentic workloads. Its hybrid attention mechanism (CSA and HCA) reduces KV cache to 2% of traditional GQA. Post-training additions include interleaved thinking across tool calls, a dedicated tool-call schema (|DSML|), and DSec sandbox for RL rollouts. The model achieves competitive agent benchmarks.
DeepSeek-V4 comes in two MoE checkpoints: Pro (1.6T total, 49B active) and Flash (284B total, 13B active), both with 1M context.
QIMMA (Arabic for 'summit') is a quality-first Arabic LLM leaderboard that validates benchmarks before evaluation, revealing systematic quality issues in widely-used Arabic benchmarks. It consolidates 109 subsets from 14 benchmarks across 7 domains, applies multi-model automated assessment and human review, and ranks models with a focus on native Arabic capability. The leaderboard is the first to include code evaluation for Arabic LLMs.
QIMMA applies rigorous quality validation to Arabic benchmarks before model evaluation, uncovering significant errors and cultural biases.
The leaderboard consolidates over 52,000 samples from 14 benchmarks, spanning cultural, STEM, legal, medical, safety, poetry, and coding domains.
This article examines the role of AI in cybersecurity, particularly how the Mythos model uses system-level capabilities to find and patch vulnerabilities. It emphasizes the structural advantage of openness for defense, advocates for semi-autonomous AI agents with human oversight, and notes that open ecosystems are better positioned to counter evolving attacks than proprietary systems.
Mythos demonstrates that combining large models, system scaffolding, and speed can effectively discover and patch software vulnerabilities.
Open code and tooling distribute defense tasks across a community, avoiding single points of failure.
Ecom-RLVE extends the RLVE framework from single-turn reasoning to multi-turn, tool-augmented e-commerce conversations, providing 8 verifiable environments (product discovery, substitution, cart building, returns, order tracking, policy QA, bundle planning, multi-intent journeys) with procedural problem generation, a 12-axis difficulty curriculum, and algorithmically verifiable rewards. A Qwen 3 8B model trained with DAPO over 300 steps shows early results that environment scaling and adaptive difficulty transfer to agentic, real-world task completion.
8 verifiable environments cover real-world shopping scenarios with rewards computed programmatically, not by humans or LLMs.
Adaptive difficulty curriculum adjusts 12 independent dimensions dynamically, keeping the agent at its capability frontier.
DeepSeek releases V2.5-1210 as the final version of the V2.5 series, introducing internet search, improved benchmarks in math, coding, writing, and roleplay, and open-sourcing the model on Hugging Face. The team thanks users and hints at next-gen foundation models.
DeepSeek V2.5-1210 marks the end of the V2.5 series with a significant update.
Internet search is now live on the web interface for real-time answers.
DeepSeek officially launched DeepSeek-V2.5, merging DeepSeek-V2-0628's general conversational abilities with DeepSeek-Coder-V2-0724's robust code processing. The model shows significant improvements in writing, instruction-following, and safety alignment, and is now available via web, API, and open-source on HuggingFace.
DeepSeek-V2.5 merges the general and code models into one, offering a streamlined experience.
Outperforms predecessors on most benchmarks, especially in Chinese content creation and Q&A.