Google DeepMind's Gemma 4 12B is an open-source model that processes text, images, and audio natively and runs on laptops with just 16 GB of RAM. It nearly matches the twice-as-large 26B model in benchmarks and ships under an Apache 2.0 license for commercial use.
Google DeepMind has released Gemma 4 12B, a 12-billion-parameter dense multimodal model that eliminates traditional encoders, feeding vision and audio directly into the LLM backbone. It runs locally on consumer laptops with 16 GB RAM, under the Apache 2.0 license. The model natively handles text, images, audio, and video, making it the first mid-sized Gemma with native audio input.
Encoder-free design: removes separate 550M vision and 300M audio encoders, using a lightweight 35M vision embedder and direct audio wave projection.
Achieves near-26B MoE performance with under half the memory footprint, running on 16 GB devices.
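As a rough sanity check on the 16 GB claim, the sketch below estimates weight storage for a 12-billion-parameter dense model at a few precisions. The precision options are assumptions (the summary does not say how the weights are stored), and the estimate ignores activations, KV cache, and runtime overhead.

```python
# Back-of-the-envelope weight memory for a dense model.
# Assumption: memory is roughly parameter count x bytes per parameter;
# activations, KV cache, and runtime overhead are ignored.

PARAMS = 12e9  # 12 billion parameters, per the summary above

# Hypothetical precision options; the source does not say which one is used.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


footprints = {name: weight_gb(PARAMS, b) for name, b in BYTES_PER_PARAM.items()}

for name, gb in footprints.items():
    verdict = "fits" if gb < 16 else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in 16 GB")
```

Under these assumptions, 16-bit weights alone would exceed 16 GB, so a laptop deployment would likely rely on 8-bit or lower precision.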
Ideogram releases version 4.0 of its text-to-image model as an open-weight model with native 2K resolution, bounding box control, and improved text rendering. On the DesignArena leaderboard, it ranks first among all open models; only closed systems from OpenAI and Google score higher. Commercial use requires a paid license.
For the first time, Google is giving website operators an opt-out toggle in Search Console for AI search features like AI Overviews and AI Mode, which together already reach more than 3.5 billion monthly users. New performance reports break out impressions separately. The move was prompted by the UK's Competition and Markets Authority (CMA), which sees website operators at a severe disadvantage.
Google adds opt-out toggle in Search Console for AI Overviews and AI Mode.
These AI features reach over 3.5 billion monthly users.
A Labour MP has taken legal action against Elon Musk’s AI company after saying its Grok tool helped a user produce fake sexualised pictures of her. Grok portrayed Jess Asato wearing a bikini after she criticised the creation of such non-consensual pictures.
Labour MP Jess Asato sues Musk's AI company over fake sexualised images generated by Grok.
Asato had previously criticised non-consensual image creation.
The article introduces the 'AI Bowtie' framework for managers to decide when to leverage AI in team workflows, avoiding the extremes of overuse or complete avoidance. It outlines five phases: research, synthesize, think (no AI), plan, and execute.
Use AI heavily in the exploration phase to generate diverse options.
Completely disable AI during the thinking phase to ensure human-led core decisions.
This article argues that artificial intelligence, especially large language models, should be understood as a form of computation rather than as artificial persons. It examines the role of loops, compositionality, and the agentic harness in enabling computation, and introduces the concept of "Verplankalkül" as an informal programming language.
LLMs perform computation through informal language rules, not just function approximation.
The power of computation comes from unbounded loops, which in AI are provided by the agentic harness.
When the conversation turns to AI infrastructure, it almost always lands on GPUs and TPUs. The New Stack sat down with Bhumik Patel of Arm and Mo Farhat of Google to talk about the chip that rarely makes the headlines anymore: the CPU, and why it’s getting more important, not less, as AI shifts from chatbots to agents.
CPUs orchestrate tool calls and memory management for AI agents.
Google's gVisor enables secure sandboxing with up to 300 sandboxes per second per cluster.
Research suggests people are working harder and less smartly with AI, but there are ways to turn emerging tech into a valuable tool. Experts advise limiting the number of tools in use, adhering to guidelines, and refining outputs to avoid cognitive fatigue.
AI can lead to increased workload and cognitive fatigue.
Focus on a limited set of AI tools that directly add value.
This article criticizes AI productivity tools like Google's Gemini Spark, arguing they solve problems created by tech companies while ignoring systemic economic issues like wage stagnation and job insecurity. The author contends that AI-driven productivity has not benefited workers and may exacerbate inequality without proper safety nets.
Google's Gemini Spark AI agent accesses personal information, raising privacy concerns.
AI productivity tools address problems manufactured by tech companies that blurred work-life boundaries.
This article applies lean manufacturing principles to AI inference, identifying seven wastes in LLM inference and proposing core principles like just-in-time context, standardized work, takt time, and prompt caching. A repo analysis agent case study shows a 13x cost reduction and 3.3x latency improvement.
Overuse of frontier models, RAG bloat, sequential blocking, and output defects are common inference wastes in AI engineering.
Lean inference principles include just-in-time context, standardized work, takt time budgeting, and prompt caching.
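One way to picture "just-in-time context" as a remedy for RAG bloat is a token budget on retrieved chunks. The sketch below is illustrative only: the `Chunk` type, the relevance scores, and the four-characters-per-token estimate are all assumptions, not details from the article.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance: float  # hypothetical retriever score, higher is better


def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def select_context(chunks: list[Chunk], budget_tokens: int) -> list[Chunk]:
    """Greedily keep the most relevant chunks that fit the token budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return selected


# Made-up repository snippets for illustration.
chunks = [
    Chunk("def parse(path): ..." * 10, relevance=0.9),
    Chunk("LICENSE text " * 200, relevance=0.1),
    Chunk("README overview " * 20, relevance=0.6),
]
picked = select_context(chunks, budget_tokens=150)
for chunk in picked:
    print(f"{chunk.relevance:.1f} -> {estimate_tokens(chunk.text)} tokens")
```

The large, low-relevance chunk is left out because it would exceed the budget, which is the kind of waste the lean framing targets.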
A practical framework for moving from simple SaaS to an AI-native platform, outlining five levels of AI integration: from MCP server and personal access tokens to embedded chat, conversation history, custom UI generation, and finally an agentic harness with planning and scheduling. The author shares insights from building multiple internal agents and retrofitting AI into existing flows.
Level 1: Expose API endpoints via MCP server without UI changes. Build prompt library and evals.
Level 2: Embed AI chat window in the dashboard with streaming and page context.
This article explores building custom agent harnesses using LangChain's create_agent and middleware. A harness is the scaffolding connecting a model to the real world; customizing it is key to agent usefulness. Middleware hooks into the agent loop at each step, enabling deterministic logic, tool lifecycle management, custom state, and stream handling. Task-harness fit determines effectiveness.
Agent = model + harness; harness determines usability.
create_agent provides the core loop; middleware enables customization at every step.
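The idea of middleware hooking into each step of the agent loop can be sketched without any framework. The code below is not LangChain's API: the `Middleware` class, the hook names, and `run_agent` are hypothetical stand-ins meant only to show the pattern of a core loop with pluggable hooks.

```python
from typing import Callable


class Middleware:
    """Hypothetical hook interface; not LangChain's actual API."""

    def before_model(self, state: dict) -> None:
        pass

    def after_model(self, state: dict) -> None:
        pass


class StepLimit(Middleware):
    """Deterministic logic: stop the loop after max_steps model calls."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps

    def before_model(self, state: dict) -> None:
        if state["steps"] >= self.max_steps:
            state["done"] = True


class CallLogger(Middleware):
    """Custom state: record every model output."""

    def after_model(self, state: dict) -> None:
        state.setdefault("log", []).append(state["last_output"])


def run_agent(model: Callable[[dict], str],
              middleware: list[Middleware],
              state: dict) -> dict:
    """Core loop: run hooks, call the model, repeat until done."""
    state.update(steps=0, done=False)
    while not state["done"]:
        for m in middleware:
            m.before_model(state)
        if state["done"]:
            break
        state["last_output"] = model(state)
        state["steps"] += 1
        if state["last_output"] == "FINAL":
            state["done"] = True
        for m in middleware:
            m.after_model(state)
    return state


def stub_model(state: dict) -> str:
    """Stand-in model that never finishes on its own."""
    return "call_tool"


result = run_agent(stub_model, [StepLimit(3), CallLogger()], {})
print(result["steps"], result["log"])
```

Here the step limit lives in the harness rather than in the model, which is one reading of the article's point that the harness determines usability.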
Microsoft's Copilot Health preview allows users to share medical records for personalized AI health advice. The author tested it and found mixed results due to technical glitches, while noting privacy protections and cautioning against relying on AI for medical decisions.
Microsoft Copilot Health uses personal medical records to tailor health advice.
Privacy measures include encryption, no training data use, and physician oversight.
Meta's internal TBD team, led by Wang, pushes for proprietary models and a startup culture amid company layoffs and employee protests over tracking software. Its Muse Spark model excels in visual understanding but lags in coding, with future focus on coding, agentic tasks, and video generation.
Wang advocates shifting Meta's focus to proprietary models over open-source.
TBD fosters a non-hierarchical startup culture with boba tea happy hours.
GitLab has laid off about 14% of its workforce, roughly 350 employees, as part of a restructuring plan announced last month. The company is exiting 22 countries, flattening management layers, and investing in infrastructure to handle increased traffic from AI workflows, with a sharper focus on R&D.
GitLab cuts approximately 14% of staff, about 350 employees.
Restructuring includes exiting 22 countries and flattening management.
Harmonic rebuilt their AI Scout using Deep Agents and LangSmith, achieving a 4x increase in user retention and transforming the tool from a rigid search interface to a trusted advisor that handles complex investment queries.
Scout V1 was a rigid LangGraph pipeline requiring extensive evals; V2 uses a single frontier model with two tool categories, simplifying architecture.
The new UX allows users to interact naturally, generating visualizations and search results that the agent can reference, creating a shared source of truth.
An experiment pitting 11 LLMs against each other in a 2D battle royale reveals that Grok 4.1 Fast dominates with the lowest cost per win, while Claude Sonnet 4.6 suffers from excessive cooperation. The findings highlight the impact of alignment tax on performance and the inadequacy of traditional benchmarks in predicting real-world task success.
Grok 4.1 Fast won 13 of 30 games at a cost of $0.97 per win.
Claude Sonnet 4.6 won 5 games but cost $26.78 per win, 27x more than Grok.
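The figures above can be checked with a few lines of arithmetic. The sketch assumes that cost per win means total spend divided by wins, which the summary does not state explicitly, so the implied totals are estimates.

```python
# Figures taken from the summary above.
TOTAL_GAMES = 30
results = {
    "Grok 4.1 Fast": {"wins": 13, "cost_per_win": 0.97},
    "Claude Sonnet 4.6": {"wins": 5, "cost_per_win": 26.78},
}


def implied_total_cost(wins: int, cost_per_win: float) -> float:
    """Assumption: cost per win = total spend / wins."""
    return wins * cost_per_win


for name, r in results.items():
    win_rate = r["wins"] / TOTAL_GAMES
    total = implied_total_cost(r["wins"], r["cost_per_win"])
    print(f"{name}: win rate {win_rate:.1%}, implied spend ${total:.2f}")

# Ratio behind the "27x" claim.
ratio = (results["Claude Sonnet 4.6"]["cost_per_win"]
         / results["Grok 4.1 Fast"]["cost_per_win"])
print(f"Cost-per-win ratio: {ratio:.1f}x")
```

The ratio comes out slightly above 27, consistent with the rounded figure in the summary.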
Cursor Enterprise introduces organizations, a way to manage multiple teams with separate budgets, security, and feature controls. The feature includes sandboxing, model access segmentation, and unified analytics.
Organizations allow managing multiple Cursor teams from one dashboard.
Features include sandboxing, segmented access, and unified analytics.
DeepLearning.AI and Red Hat offer a free intermediate course on efficient LLM inference with vLLM, taught by Cedric Clyburn. The course covers quantization, serving with vLLM, and benchmarking, with 9 video lessons, 3 code examples, and a quiz.
Learn to apply quantization to reduce model memory footprint and measure accuracy tradeoffs.
Serve models with vLLM using continuous batching, PagedAttention, and prefix caching.
Your AI agent works great in testing. Then you ship it, and something kinda breaks. A tool call loops forever, like it never learns. A retrieval step returns garbage and costs spike. You have no idea why, at all. That’s the agent observability problem. And if you’re building with LLMs, you need to solve it before production, not after. This post compares three top observability tools: LangSmith, Langfuse, and Arize. We set up each one, trace the same agent, and compare what you actually get.
Agent observability captures the full execution graph: every step, decision, LLM input/output, tool calls, token usage, latency, and evaluation scores.
LangSmith integrates natively with LangChain, providing deep tracing and a prompt playground for debugging.
Trilogy's AI Center of Excellence evaluated Fireworks AI as inference infrastructure to standardize open-weight model usage, reducing costs and enabling billion-token-scale agentic workflows.
Trilogy adopted Fireworks AI as the inference layer for enterprise open-weight models.
Reduced cost to ~1/5 of proprietary systems and eliminated rate limit issues.
The White House has issued an executive order requiring agencies like the Pentagon and CISA to strengthen cyber defense with AI tools within 30 days. AI developers can voluntarily submit models for security testing, but the order explicitly rules out mandatory approval. Given recent government pressure on AI companies, how voluntary this cooperation really is remains an open question.
Executive order requires the Pentagon and CISA to boost cybersecurity with AI within 30 days.
AI developers can voluntarily submit models for testing; no mandatory approval.
The UK's competition watchdog has ordered Google to change how it uses publishers' content in AI-powered search results, giving news websites the power to block their content from being used in AI summaries, with global ramifications.
CMA uses new powers to set bespoke rules for tech firms with 'strategic market status'.
New rules require Google to allow publishers to opt out of AI summaries.
Readers respond to Nesrine Malik's article on AI, highlighting the deeper problem of AI's relationship with evidence, including fabricated quotations and misplaced trust in machine-generated content.
AI's core issue is not bland prose but its inability to distinguish fact from fiction.
Writers caught using false quotes often trusted AI as a research aid without deceitful intent.
Impermeabiliza, a Valencia-based waterproofing specialist, integrates artificial intelligence to improve diagnosis, planning, and execution, ensuring durability of residential, industrial, and commercial structures while preventing leaks, moisture, and mold.
Impermeabiliza offers waterproofing solutions in Valencia and surroundings.
They use advanced systems to prevent leaks, moisture, and mold.
The European Commission aims to reduce dependence on foreign suppliers in cloud computing, AI and semiconductor production, ensuring no foreign government or company can use a 'kill switch' to disrupt essential tech services in Europe.
EU targets 'kill switch' risk from foreign tech providers.
Proposals focus on cloud, AI, and semiconductor dependencies.
AWS Deep Learning AMI and Deep Learning Containers now support SOCI snapshotter and index. SOCI enables efficient container image management through selective file downloading (lazy loading) and parallel pull modes, significantly reducing container startup times. This post explains how SOCI works, when to use each mode, and provides performance benchmarks.
SOCI (Seekable OCI) uses layer-based indexing to enable lazy loading, allowing containers to start with only necessary files, reducing cold start from 6m59s to 21s.
AWS DLAMI and DLC offer three pull mechanisms: standard Docker pull, SOCI parallel pull, and SOCI lazy loading, with trade-offs between speed and resource usage.
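To put the quoted cold-start figures in perspective, the sketch below converts them to seconds and computes the speedup. The `parse_duration` helper is a hypothetical utility written for this example, and the summary does not say which pull mechanism produced the 6m59s baseline.

```python
import re


def parse_duration(text: str) -> int:
    """Convert strings like '6m59s' or '21s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


# Cold-start figures quoted in the summary above.
baseline_s = parse_duration("6m59s")
lazy_s = parse_duration("21s")

speedup = baseline_s / lazy_s
reduction = 1 - lazy_s / baseline_s

print(f"{baseline_s} s -> {lazy_s} s")
print(f"Speedup: {speedup:.1f}x, reduction: {reduction:.0%}")
```

By this arithmetic the quoted numbers correspond to roughly a 20x faster cold start.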