This article explains why large context windows are not the same as agent memory, and how retrieval, compression, and summarization techniques fit together in an agent’s cognitive stack.
Context windows are a stateless scratchpad, not persistent memory.
Retrieval-augmented generation (RAG) fetches relevant data but may introduce contradictions.
Learn how to build a text clustering pipeline using large language model embeddings and HDBSCAN to automatically discover topics in unlabeled text data. Covers embedding generation with sentence-transformers, dimensionality reduction with UMAP, and clustering with HDBSCAN.
Generate text embeddings using a pre-trained sentence-transformers model
Reduce embedding dimensionality with UMAP for clustering
This article explains how to build AI agents that can browse and interact with real websites using Playwright, browser-use, and LangGraph. It covers Playwright's advantages over Selenium (30-50% faster, persistent WebSocket, built-in auto-waiting, realistic events), setup steps, dynamic page scraping, multi-step form filling, anti-bot detection handling, session persistence, and Docker deployment. Through code examples, readers will create a working browser agent that navigates sites, fills forms, extracts structured data, and uses an LLM for decision-making.
Playwright outperforms Selenium with persistent WebSocket connections, 30-50% faster operations, and built-in auto-waiting and realistic mouse/keyboard events.
Setup requires Python 3.10+, an OpenAI API key, and a few pip installs, including Playwright browser binaries.
Learn how to build a sentiment analysis pipeline using Scikit-LLM and open-source LLMs served through the Groq API, from setup to evaluation on the IMDB dataset.
Scikit-LLM bridges classical scikit-learn pipelines with modern LLM API calls
Use Groq API to serve open-source models like Llama 3.1 8B for zero-shot classification
This article demonstrates multi-label text classification using Scikit-LLM and large language models without labeled training data. It leverages Groq's free open-source LLM for zero-shot inference, using a scikit-learn-like workflow. Steps include setup, classifier initialization, loading the go_emotions dataset, and running predictions that assign multiple sentiment labels to single texts.
Scikit-LLM enables zero-shot multi-label classification via LLMs, no training needed.
Uses Groq's free API and llama-3.3-70b-versatile model for inference.
This tutorial shows how to build multimodal AI applications — image classification, image captioning, and speech transcription — that run entirely in the browser using Transformers.js, with no server or API key, ensuring user privacy. It includes detailed code examples and project structure.
Implement image classification, image captioning, and speech transcription in the browser.
All models run client-side using Transformers.js, data never leaves the device.
AgentOps is the operational framework for autonomous AI agents in production, covering observability, evaluation, cost governance, safety, and continuous improvement. This guide explains how AgentOps differs from traditional LLM monitoring, surveys the tooling ecosystem, provides a full working code example, and shows how to debug agent failures using session replay.
AgentOps provides operational rigor for autonomous agents, ensuring explainability, measurability, and alignment with business objectives.
The five pillars of AgentOps: observability, evaluation, cost governance, safety, and continuous improvement.
Learn how to perform text classification using locally hosted open-source LLMs like Llama 3, Mistral, and Gemma via Ollama and the Scikit-LLM Python library, all without API costs.
Install Ollama and pull open-source LLMs for local use.
Configure Scikit-LLM to route requests to local Ollama endpoint.
This article benchmarks three text classification approaches: TF-IDF with logistic regression, zero-shot BART, and scikit-LLM with a Groq-hosted LLM. On a synthetic customer support dataset, scikit-LLM achieves the highest accuracy (87%) while being faster than BART, making it ideal for small datasets that require deep linguistic understanding.
TF-IDF + logistic regression is fastest but least accurate (53%)
Zero-shot BART is slow with moderate accuracy (67%)
A structured six-step LLMOps roadmap covering observability, evaluation, cost control, and agent orchestration to build production-grade LLM systems. The LLMOps market is projected to grow from $1.97 billion in 2024 to $4.9 billion by 2028 at a 42% CAGR.
LLMOps differs from traditional MLOps in prompt versioning, non-deterministic output evaluation, and cost optimization.
Foundational skills required: Python, LLM fundamentals, cloud infrastructure, and version control discipline.
This article demonstrates how to implement a context pruning pipeline for long-running AI agents to manage conversational memory efficiently using semantic similarity. It covers using sentence transformer embedding models, computing similarities, and assembling a pruned context window.
Unbounded conversation history increases token costs and degrades reasoning in long-running agents.
A context pruning pipeline keeps the current prompt, most recent turn, and top-K semantically similar past turns.
This article provides a detailed walkthrough of how logits, temperature, and top-p sampling work together to control next-token prediction in large language models. It explains the role of logits as raw scores, how temperature and top-p shape the probability distribution, and how they form a sequential pipeline. Practical advice on choosing parameter values for different use cases is also provided.
Logits are raw, unnormalized scores from the final linear layer of a transformer, converted to probabilities via softmax.
Temperature scales logits before softmax, controlling randomness: high temperature flattens distribution for creativity, low temperature sharpens it for determinism.
This article teaches how to transform a basic tool-calling script into a resilient agent that gracefully handles failures from misbehaving tools, malformed model outputs, and unavailable services. Topics include an iterative agent loop with a safety cap, four categories of tool-calling failures, and designing informative error messages for model recovery.
Learn to build an iterative agent loop with a maximum iteration cap.
Understand the four distinct failure categories when agents call tools and how to handle each.
This article explains how to implement a hybrid search strategy for RAG systems by combining BM25 lexical search with semantic search and fusing rankings using Reciprocal Rank Fusion (RRF). It provides step-by-step Python code, including dataset loading, BM25 and semantic search functions, and the hybrid search integration. Experiments on a small dataset show reasonable results, outperforming either method alone.
Hybrid search combines BM25 lexical search and semantic search to cover each other's blind spots.
Reciprocal Rank Fusion (RRF) merges rankings from both search methods.
This article walks through building a context-aware semantic search engine that combines embedding-based similarity with structured metadata filtering, covering everything from generating embeddings to persisting the index.
Generate 384-dimensional embeddings using a local pretrained model
Non-deterministic agents are those where the same input can lead to distinct outputs across multiple runs. This article discusses using statistical guardrails to monitor and evaluate their behavior, ensuring reliability and safety.
Non-deterministic agents produce different outputs from the same input.
Statistical guardrails monitor agent behavior to prevent anomalous outputs.
This article explains Agentic RAG (Retrieval-Augmented Generation) at three difficulty levels: beginner, intermediate, and advanced. It covers the basic concept, technical architecture, and cutting-edge research, helping readers understand how this approach enhances traditional RAG with autonomous decision-making.
Agentic RAG combines retrieval and generation with an agent that decides when to fetch external knowledge.
The article is structured into three levels: simple analogy, technical implementation, and advanced research.
TurboQuant has recently been launched by Google as a novel algorithmic suite and library for applying advanced quantization and compression to large language models (LLMs) and vector search engines — an indispensable element of RAG systems.
TurboQuant is a new algorithmic suite and library from Google for LLM and vector search quantization and compression.
It optimizes vector search in RAG systems, improving efficiency.
This article explores context engineering for AI agents, focusing on treating the context window as a constrained resource, separating static and dynamic context, managing conversation history, designing retrieval as a budget decision, and evaluating context quality in production.
Treat the context window like RAM: finite, cleared between sessions, and optimal usage requires deliberate budgeting.
Separate static (cacheable) context from dynamic (task-specific) context to enable prefix caching and simplify debugging.
This article explains how to use Scikit-LLM's text summarization feature to handle large volumes of text in machine learning pipelines. It covers building a custom Hugging Face summarizer transformer, integrating it into a scikit-learn pipeline with TF-IDF vectorization and a classifier, and demonstrates the process with code examples.
Scikit-LLM bridges traditional ML and LLMs, offering zero-shot classification and text summarization.
A custom HuggingFaceSummarizer class inherits from BaseEstimator and TransformerMixin to load a pretrained model and produce summaries.
This article explains how to build a fully functional AI agent that runs locally on your machine using small language models, with no internet connection or API costs. It covers the concepts of AI agents and SLMs, the advantages of local deployment, setting up Ollama and Python libraries, step-by-step agent construction, adding memory and tools, and discusses the limitations of SLMs.
AI agents are programs that use language models to reason and take actions, going beyond simple chatbots.
Small language models like Phi-3 and Mistral 7B can run on standard hardware, offering privacy and zero API costs.
This guide walks through training a Scikit-learn classifier, building a FastAPI inference server, testing it locally, and deploying it to FastAPI Cloud. It uses the breast cancer dataset and a RandomForest model.
Set up project structure and install dependencies
Train a RandomForest model on the breast cancer dataset and save it with joblib
This article explains AI agent memory across three difficulty levels: the fundamental memory problem in stateless LLM agents, the main memory types (in-context, external), and scalable architectures including writing strategies, retrieval methods, decay handling, and multi-agent consistency. It provides practical insights for building agents that improve over time.
Stateless LLM agents have no persistent memory, making multi-step tasks and personalization difficult.
In-context memory uses the context window for immediate state; external memory uses retrieval (vector search, structured queries) for persistent storage.
Zero-shot text classification allows labeling text without task-specific training data by turning labels into natural language statements and using a pretrained model to check if the text supports them. This article covers how it works, using facebook/bart-large-mnli for single and multi-label classification, and improving results with custom hypothesis templates.
Zero-shot classification reframes labeling as a reasoning task by converting labels to natural language statements.
Easily implementable via Hugging Face pipeline with pretrained models like facebook/bart-large-mnli.