Spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second. Covers the full process from launch, querying, cleanup, scaling to larger models, creating a chat UI, SSH debugging, and using as a coding agent backend, with a comparison to Inference Endpoints.
Use the 'hf jobs run' command with the vLLM Docker image and --expose 8000 to run a vLLM server on HF Jobs.
Endpoints are authenticated via Hugging Face tokens, requiring read access to the job's namespace, and support querying via curl or OpenAI Python client.
Ai2 compares its 7B transformer Olmo 3 and hybrid Olmo Hybrid, finding the hybrid excels on content words (nouns, verbs, adjectives) and tokens requiring context, but loses advantage on repeated tokens and closing brackets. Token-level loss filtering reveals architectural differences.
Hybrid models predict meaningful tokens (e.g., content words) better, but not repeated tokens.
Hybrids replace some attention layers with recurrent layers, which have fixed-size memory suited for tracking sequential state.
NVIDIA NeMo AutoModel builds on HuggingFace Transformers v5, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels to achieve 3.4-3.7x higher training throughput and 29-32% less GPU memory for fine-tuning MoE models, with no API changes.
NeMo AutoModel subclasses AutoModelForCausalLM, requiring only one import line change for performance gains.
On a 550B model, Expert Parallelism enables full fine-tuning across 16 nodes of H100s, where Transformers v5 runs out of memory.
CUGA is IBM's open-source agent harness that handles the plumbing of building agentic apps, leaving developers to write only a tool list and a prompt. This article walks through one example — an IBM Cloud advisor app — and explains how CUGA's planning, reflection, and policy system enable robust, production-ready agents.
CUGA abstracts away orchestration, state, and tool calls, letting developers focus on tools and prompts.
The cuga-apps repository contains two dozen single-file apps, each a working example that can be read and copied.
This article explores the Cross-Origin Storage (COS) API proposal, which enables web apps to share large files (like AI models and Wasm runtimes) across origins using cryptographic hashes instead of URLs. Using Transformers.js as an example, it highlights the redundancy caused by current cache partitioning and how COS addresses it with hash-based identification, flexible access control, and integrity verification.
Current browser caches are partitioned by origin, leading to redundant downloads of shared AI resources across different apps.
The Cross-Origin Storage (COS) API identifies files by cryptographic hash, enabling cross-origin sharing.
Hugging Face revamped the release process for huggingface_hub, using AI and open tools to ship weekly releases instead of monthly, while keeping a human in the loop for final review. The new pipeline costs about $0.25 per release and has improved release note quality and discovery of integration issues.
Release cadence improved from 4-6 weeks to weekly
AI drafts release notes, but deterministic verification ensures accuracy
PP-OCRv6 is PaddleOCR's latest universal OCR model family, scaling from 1.5M to 34.5M parameters across three tiers, supporting 50 languages. It delivers a +4.6 percentage point improvement in text detection Hmean and +5.1 in recognition accuracy over PP-OCRv5_server. New architecture includes PPLCNetV4 backbone, RepLKFPN for detection, and EncoderWithLightSVTR for recognition. Supports multiple inference backends: Paddle Inference, Transformers, and ONNX Runtime.
Three model tiers: tiny (1.5M), small (7.7M), medium (34.5M) for various deployment scenarios.
Supports 50 languages including Chinese, English, Japanese, and 46 Latin-script languages.
A maintainer of OpenClaw built a system using local open-weight models (Gemma, Qwen) in an agent harness to triage issues and pull requests in real-time, achieving competitive performance with closed models while running on local hardware for minimal cost.
Local models like Gemma and Qwen can effectively classify GitHub issues and PRs for triage.
The system uses an agent harness with a read-only shell (reposhell) to safely inspect code.
Deep research agents that combine private documents with web search can inadvertently leak sensitive information through their query logs. The MosaicLeaks benchmark quantifies this privacy risk and proposes a training method called Privacy-Aware Deep Research (PA-DR) that reduces information leakage by over 3x while maintaining task performance.
MosaicLeaks introduces a benchmark of multi-hop research chains that interleave private local documents and public web queries, measuring three levels of leakage: intent, answer, and full-information.
Standard training for task performance increases both success rate and leakage; training with PA-DR reduces answer/full-information leakage from 34.0% to 9.9% while keeping strict chain success at 58.7%.
LoRA is the most popular parameter-efficient fine-tuning (PEFT) technique, but research shows other methods can outperform it on certain tasks. This article introduces Hugging Face's PEFT library and its benchmarks, discussing how to choose the right PEFT technique based on specific needs, and points out that LoRA is not always the best choice.
LoRA dominates PEFT techniques but may not be optimal.
Hugging Face's PEFT library provides a unified API and benchmarks to help users choose.
A new benchmark harness evaluates the entire process of AI agents using software libraries, using Hugging Face Transformers as a case study. By measuring token usage, time, and error rates across different models and tooling tiers, the authors uncover tradeoffs between ease of use and resource consumption, providing insights for library maintainers and agent users.
Standard benchmarks only check final answers; this harness measures the entire process including token cost and errors.
Three tiers tested: bare install, cloned source, and packaged Skill – each with different overhead.
MolmoMotion is a new 3D motion forecasting model that predicts future 3D point trajectories of objects given a video frame, 3D points on an object, and a language instruction. It outperforms existing methods in robotics planning and controllable video generation. The model is accompanied by the MolmoMotion-1M dataset and PointMotionBench benchmark.
MolmoMotion uses language instructions to guide 3D motion forecasting, outperforming existing methods.
It offers autoregressive and flow-matching variants for deterministic and uncertain scenarios.
AWS's open-source SDK Strands Robots integrates LeRobot, enabling developers to train from Hub datasets and deploy policies on simulated or real robots through a single Agent workflow. This post walks through five steps with a runnable example on a laptop.
Strands Robots SDK exposes LeRobot as composable AgentTools, enabling end-to-end control from dataset to robot hardware.
Simulation and hardware share the same DatasetRecorder and LeRobotDataset format for seamless compatibility.
Z.AI introduces GLM-5.2, a flagship model for long-horizon tasks with a solid 1M-token context, advanced coding capabilities with flexible effort levels, and an open-source MIT license. It achieves top-tier performance on long-horizon coding benchmarks, rivaling closed-source models.
GLM-5.2 delivers a stable 1M-token context for long-horizon engineering tasks.
It leads open-source models on benchmarks like FrontierSWE and PostTrainBench, close to Opus 4.8.
The Agentic Resource Discovery (ARD) specification provides a discovery layer for AI agents to find tools, skills, and other agents dynamically, rather than relying on pre-installed configurations. Hugging Face has implemented a reference tool on the Hub.
ARD defines a standard for cataloging and searching agent capabilities across federated registries.
Hugging Face's Discover Tool implements ARD, enabling natural language search for Skills, MCP servers, and AI applications.
olmo-eval is a new evaluation workbench designed to support the iterative evaluation cycle during LLM development. Built on the OLMES standard, it offers flexible task definitions, swappable runtime policies, and detailed per-question comparison to help developers determine whether interventions are significant.
Designed for the repeated evaluation loop in model development, supporting quick benchmark addition, cross-checkpoint runs, and fine-grained results analysis.
Offers both lightweight and sandboxed run modes, automatically selecting based on benchmark needs, unlike tools like Harbor.
This article is the second part of the PyTorch profiling series, delving into the internals of nn.Linear layers, including transpose operations, bias-fused epilogue techniques, and the impact of torch.compile on a single linear layer. It then dissects the performance characteristics of a Multilayer Perceptron (MLP) with GeGLU activation, showcasing the scheduling and execution of GPU kernels.
nn.Linear fuses bias addition into the matrix multiplication kernel via an epilogue, avoiding extra memory accesses.
torch.compile offers no significant speedup for a single nn.Linear layer but eliminates CPU dispatch overhead.
Cohere has released North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters, designed for agentic software engineering tasks. It achieves competitive performance on coding benchmarks and is available under Apache 2.0 on Hugging Face.
30B MoE model with 3B active parameters, optimized for agentic coding.
Outperforms comparable open-source models on Artificial Analysis Coding Index.
An agent built a 3D website showcasing Paris monuments by chaining two Hugging Face Spaces (image generation and 3D Gaussian splat reconstruction) via their agents.md files, with no manual integration. The article highlights the 'building block economy' for multimedia AI, where models become composable components that agents can glue together, dramatically lowering the integration barrier.
A coding agent automatically generated images and 3D Gaussian splats by calling two Hugging Face Spaces, producing a 3D gallery of Paris monuments.
Each Gradio Space's agents.md file provides a complete API specification, enabling agents to use Spaces without manual integration.
NeuroBait is a fine-tuned AI model designed to help ADHD brains by providing dopamine sparks to overcome task initiation paralysis. Created from real observation of the author's wife, it uses warm, flowing prose to offer one tiny actionable step instead of overwhelming to-do lists. Built with LoRA on Gemma 3 12B and deployed on Hugging Face, it aims to help anyone feeling stuck, not just those with ADHD.
NeuroBait generates warm, flowing text to give a tiny actionable step, helping ADHD brains start tasks. It focuses on emotional barriers, not to-do lists.
Fine-tuned with LoRA on Gemma 3 12B using a small curated synthetic dataset derived from real ADHD friction.
This article explains how to migrate GitHub Actions CI to Hugging Face Jobs to overcome limitations of GitHub-hosted runners, such as slow speed and lack of GPU access. By setting up a dispatcher Space, a GitHub App, and modifying the runs-on label, CI jobs can run on Hugging Face infrastructure with CPU or GPU hardware, streaming logs in real-time. Trackio's experience shows a ~30% reduction in CPU job time.
GitHub Actions default runners are generic, slow, and lack GPU support.
Hugging Face Jobs provides serverless infrastructure with flexible hardware (CPU, T4, H200).
The author recounts how a bank run scenario that reliably crashed under a single model ceased to work when replaced with five heterogeneous small models from different labs. After multiple failed attempts to induce a crash via external shocks, the solution was to author a deterministic override at the settlement seam, making the crash a guaranteed outcome rather than an emergent hope.
A single-model economy produced a bank run crash; a heterogeneous council of five models hoarded instead of selling.
External shocks (rumor, inventory glut) failed to force a sell-off in the multi-model system.
The author developed Pakistan Notice Helper, a safety-focused AI tool for the Hugging Face Build Small Hackathon, designed to help people in Pakistan understand suspicious messages. The tool uses a small model (Qwen3.5 4B) to analyze text or screenshots, providing risk labels, explanations, and safe next steps. It supports English and Urdu, with the Urdu mode featuring a right-to-left layout and Urdu-language assessments. The article shares lessons on model selection, prompting, Urdu UX, and using Codex for rapid development.
Pakistan Notice Helper is a local AI safety tool for suspicious messages in Pakistan, supporting text and screenshots.
The final model choice was Qwen3.5 4B Q8 via llama.cpp, passing all high-risk scam and screenshot test cases.
OpenEnv is a tool for creating an agentic execution environment like terminals, browsers, or anything an agent can interact with. Today, we’re excited to announce that OpenEnv is becoming even more open, to make the future of training agents open source. Starting today, OpenEnv will be coordinated by a committee that so far includes Meta-PyTorch, Reflection, Unsloth, Modal, Prime Intellect, Nvidia, Mercor, Fleet AI, and Hugging Face. OpenEnv now lives at huggingface/OpenEnv. The project focuses on being an interoperability layer for RL environments, not a reward framework or trainer.
OpenEnv is an open-source tool for creating agentic execution environments.
It is now governed by a committee of major AI organizations including Meta-PyTorch, Reflection, Unsloth, etc.
In this article, the author describes the inspiration behind Mythograph Atelier, an AI art studio that creates personalized abstract paintings. The idea combines a museum visit's impact, the vision of dynamic AI-native apps, and the concept of a curious AI that asks questions to understand the user before generating art.
Mythograph Atelier is an AI art studio that creates abstract paintings with personal meaning.
The AI asks questions to understand the user's taste and emotions before generating art.
Her is a tool that analyzes Claude Code session traces, reconstructing events in plain English, flagging risky moves (deploys, config changes, secrets), and showing token usage. It runs entirely on the local GPU, no third-party AI API is called, and includes an 'Ask Her' assistant to answer questions from the trace.
Her reads Claude Code .jsonl session files, summarizing events and highlighting risks.
All processing is done locally on GPU, no third-party API calls ensuring privacy.
This article is a field report from the second Build Small Hackathon, describing v2 of the 'Thousand Token Wood' simulation. In this version, each of the five woodland creature agents is powered by a different small language model (from OpenAI, OpenBMB, NVIDIA, and a fine-tuned Qwen). The player takes on the role of a shadow financier, able to lend, tip (truthfully or falsely), short, bribe, and broker alliances. The article details engineering challenges: serving layer heterogeneity (vLLM, CUDA toolkit), per-model quirks, a tolerant JSON parser, and a critical information asymmetry firewall to prevent secret flags from leaking into agent prompts. Persistent memory is handled via bounded summaries rather than raw history to avoid prompt inflation. Results show zero leaks, reliable fine-tuned 0.5B performance, and emergent behaviors from heterogeneous agents. Key takeaways: small models are reliable format generators but unreliable reasoners; heterogeneity adds value with manageable cost; secret information requires data-flow-level firewall; bounded memory keeps agents alive without compromising reasoning.
Each agent uses a different small model from different labs, making market behavior more realistic and emergent.
Information asymmetry is protected by a firewall design; tests prove the hidden truth flag never leaks into agent prompts.
Job Searcher is an AI-powered job search assistant for new grads. It analyzes resumes, generates LinkedIn search queries, and scores job postings across five dimensions: skills, experience, education, industry, and seniority. Built with a teacher-student model (DeepSeek V4 Pro and Qwen3-8B), it uses a curated dataset of 2,500 resumes and 10,000 job postings. Open-source and available on HuggingFace Spaces.
Automates LinkedIn job search with resume-based queries and multi-dimension scoring
Uses DeepSeek V4 Pro as teacher and Qwen3-8B as student
Persona Atlas turns public figures into measurable behavioral portraits by researching them online, answering open-ended questions, and embedding answers for comparison. It focuses on thinking style rather than factual knowledge, using small models to capture personality as geometry.
Enter a name, and an AI agent researches the person from the open web.
Answers to ten open-ended prompts are embedded, enabling quantitative comparison.