RAG-Anything Tutorial: Build a Multimodal Retrieval Pipeline for Text, Tables, Equations, and Images in Colab

This tutorial walks through building a multimodal retrieval pipeline using RAG-Anything in Google Colab. It covers environment setup, securely entering an OpenAI API key, generating a synthetic multimodal report with a chart and PDF, converting content into RAG-Anything's content_list format, inserting it into the retrieval system, and testing different retrieval modes (naive, local, global, hybrid).

Source: MarkTechPost. Author: Sana Hassan

In this tutorial, we build a RAG-Anything workflow and use it to explore how multimodal retrieval works across text, tables, equations, and images. We start by preparing the Colab environment, installing the required packages, and entering our OpenAI API key through a hidden prompt at runtime so that the key is never hard-coded in the notebook. We then create a synthetic multimodal report, generate a chart and PDF, convert the content into RAG-Anything’s direct content_list format, and insert it into the retrieval system. As we move through the tutorial, we configure OpenAI-based chat, vision, and embedding functions, initialize RAG-Anything, and test retrieval modes such as naive, local, global, and hybrid.
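
To make the idea of comparing retrieval modes concrete, here is a minimal sketch that asks one question once per mode and collects the answers. It assumes the retrieval object exposes an async `aquery(question, mode=...)` method, as the RAG-Anything README describes; `StubRAG`, `compare_modes`, and `run_demo` are hypothetical names, and the stub exists only so the sketch runs without an API key.

```python
import asyncio
from typing import Dict, Iterable

MODES = ("naive", "local", "global", "hybrid")


class StubRAG:
    """Hypothetical stand-in for a configured RAG-Anything instance."""

    async def aquery(self, query: str, mode: str = "hybrid") -> str:
        # A real instance would retrieve context and call the LLM here.
        return f"[{mode}] answer to: {query}"


async def compare_modes(rag, question: str, modes: Iterable[str] = MODES) -> Dict[str, str]:
    """Ask the same question once per retrieval mode and collect the answers."""
    answers: Dict[str, str] = {}
    for mode in modes:
        # Sequential on purpose: keeps request volume low and the output ordered.
        answers[mode] = await rag.aquery(question, mode=mode)
    return answers


def run_demo() -> Dict[str, str]:
    """Run the comparison against the stub from a plain Python script."""
    return asyncio.run(compare_modes(StubRAG(), "How did hybrid accuracy change?"))
```

Inside a notebook, where an event loop is already running, `await compare_modes(rag, question)` would be used directly instead of `asyncio.run`.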

Installing RAG-Anything Dependencies

    import os
    import re
    import sys
    import json
    import time
    import shutil
    import hashlib
    import asyncio
    import inspect
    import getpass
    import subprocess
    import importlib
    import importlib.metadata
    from pathlib import Path
    from typing import List, Dict, Any


    def run_shell(cmd, check=True):
        print(f"\n$ {cmd}")
        result = subprocess.run(cmd, shell=True, text=True)
        if check and result.returncode != 0:
            raise RuntimeError(f"Command failed: {cmd}")
        return result.returncode


    print("=" * 80)
    print("RAG-Anything Advanced Colab Tutorial")
    print("=" * 80)

    print("\n[1/10] Installing dependencies...")

    for module_name in list(sys.modules):
        if module_name == "PIL" or module_name.startswith("PIL."):
            del sys.modules[module_name]

    run_shell(
        'pip -q install -U '
        '"raganything[image,text]" '
        '"openai>=1.0.0" '
        '"python-dotenv" '
        '"reportlab" '
        '"pandas" '
        '"matplotlib" '
        '"tabulate"'
    )

    run_shell('pip -q install --no-cache-dir --force-reinstall "pillow==11.3.0"')

    for module_name in list(sys.modules):
        if module_name == "PIL" or module_name.startswith("PIL."):
            del sys.modules[module_name]

    importlib.invalidate_caches()

    try:
        print("Pillow version:", importlib.metadata.version("Pillow"))
    except Exception as e:
        print("Could not read Pillow version:", repr(e))

    print("\n[2/10] Importing libraries...")

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from IPython.display import display
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas
    from reportlab.lib.units import inch
    from openai import AsyncOpenAI
    from raganything import RAGAnything, RAGAnythingConfig
    from lightrag.utils import EmbeddingFunc

    print("Imports successful.")

We begin by setting up the complete Colab environment for the RAG-Anything workflow. We install the required libraries, force-reinstall Pillow 11.3.0 and purge any cached PIL modules from sys.modules so that the reinstalled version is the one imported, and then import the modules needed for plotting, PDF creation, OpenAI access, and RAG-Anything. We also define a reusable shell helper, run_shell, that raises an error when a command fails, so the setup remains clear and easy to rerun.
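
The module-purging step can be isolated into a small helper. The sketch below shows the same idea on a fake package so that nothing real is unloaded; `purge_modules` and the `fakepkg` names are hypothetical and are not part of the tutorial's code.

```python
import importlib
import sys
import types
from typing import List


def purge_modules(package: str) -> List[str]:
    """Remove a package and its submodules from sys.modules; return the removed names."""
    removed = [
        name for name in list(sys.modules)
        if name == package or name.startswith(package + ".")
    ]
    for name in removed:
        del sys.modules[name]
    # Let the import system notice files that changed on disk.
    importlib.invalidate_caches()
    return removed


# Demonstration with fake modules only.
sys.modules["fakepkg"] = types.ModuleType("fakepkg")
sys.modules["fakepkg.sub"] = types.ModuleType("fakepkg.sub")
sys.modules["fakepkg_other"] = types.ModuleType("fakepkg_other")

removed = purge_modules("fakepkg")
print(sorted(removed))
```

Matching on `package + "."` rather than a bare prefix leaves unrelated modules such as `fakepkg_other` alone. Purging only affects later imports: objects that were already imported keep pointing at the old module, so restarting the runtime remains the more reliable option when in doubt.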

Configuring Directories, Runtime Variables, and the OpenAI API Key

    print("\n[3/10] Preparing directories and runtime settings...")

    BASE_DIR = (
        Path("/content/raganything_advanced_tutorial")
        if Path("/content").exists()
        else Path.cwd() / "raganything_advanced_tutorial"
    )
    ASSET_DIR = BASE_DIR / "assets"
    OUTPUT_DIR = BASE_DIR / "output"
    WORKING_DIR = BASE_DIR / "rag_storage"
    LOG_DIR = BASE_DIR / "logs"

    RESET_STORAGE = True
    RUN_FULL_DOCUMENT_PARSE = False
    PARSER_FOR_FULL_PARSE = "mineru"
    PARSE_METHOD = "auto"

    for d in [BASE_DIR, ASSET_DIR, OUTPUT_DIR, WORKING_DIR, LOG_DIR]:
        d.mkdir(parents=True, exist_ok=True)

    if RESET_STORAGE and WORKING_DIR.exists():
        shutil.rmtree(WORKING_DIR)
        WORKING_DIR.mkdir(parents=True, exist_ok=True)

    os.environ["LOG_DIR"] = str(LOG_DIR)
    os.environ["SUMMARY_LANGUAGE"] = "English"
    os.environ["ENABLE_LLM_CACHE"] = "false"
    os.environ["ENABLE_LLM_CACHE_FOR_EXTRACT"] = "false"
    os.environ["MAX_ASYNC"] = "2"
    os.environ["CHUNK_SIZE"] = "900"
    os.environ["CHUNK_OVERLAP_SIZE"] = "120"
    os.environ["TIMEOUT"] = "240"

    for var in [
        "OPENAI_API_KEY",
        "OPENAI_ORG_ID",
        "OPENAI_ORGANIZATION",
        "OPENAI_PROJECT",
        "OPENAI_DEFAULT_HEADERS",
        "LLM_BINDING_API_KEY",
        "LLM_BINDING_HOST",
    ]:
        os.environ.pop(var, None)

    print(f"Base directory: {BASE_DIR}")
    print(f"Assets directory: {ASSET_DIR}")
    print(f"Storage directory: {WORKING_DIR}")

    print("\n[4/10] Entering OpenAI API key securely...")


    def clean_api_key(raw_value: str) -> str:
        raw_value = str(raw_value or "").strip()
        raw_value = raw_value.replace("Bearer ", "").replace("bearer ", "").strip()
        raw_value = raw_value.strip("'").strip('"').strip("`").strip()
        if "=" in raw_value:
            raw_value = raw_value.split("=", 1)[1].strip().strip("'").strip('"').strip("`")
        raw_value = re.sub(r"\s+", "", raw_value)
        raw_value = raw_value.encode("ascii", errors="ignore").decode("ascii").strip()
        return raw_value


    OPENAI_API_KEY_RAW = getpass.getpass("Paste your OpenAI API key here. Input is hidden: ")
    OPENAI_API_KEY = clean_api_key(OPENAI_API_KEY_RAW)

    if not OPENAI_API_KEY:
        raise ValueError(
            "No API key was captured. Paste the key into the hidden input box and press Enter."
        )

    print("Captured key length:", len(OPENAI_API_KEY))
    print("Captured key prefix:", OPENAI_API_KEY[:12] + "...")
    print("Captured key suffix:", "..." + OPENAI_API_KEY[-6:])

    LLM_MODEL = "gpt-4o-mini"
    VISION_MODEL = "gpt-4o-mini"
    EMBEDDING_MODEL = "text-embedding-3-small"
    EMBEDDING_DIM = 1536

    openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

    os.environ["LLM_MODEL"] = LLM_MODEL
    os.environ["VISION_MODEL"] = VISION_MODEL
    os.environ["EMBEDDING_MODEL"] = EMBEDDING_MODEL
    os.environ["EMBEDDING_DIM"] = str(EMBEDDING_DIM)

    print("Testing OpenAI chat API with the captured key...")
    try:
        test_response = await openai_client.chat.completions.create(
            model=LLM_MODEL,
            messages=[{"role": "user", "content": "Reply with exactly: ok"}],
            temperature=0,
        )
        print("Chat API test response:", test_response.choices[0].message.content)
    except Exception as e:
        raise RuntimeError(
            "The key was captured, but OpenAI rejected the request or the account/model access failed. "
            "Check billing, project permissions, and make sure this is an OpenAI Platform API key."
        ) from e

    print("\nTesting OpenAI embedding API...")
    try:
        test_embedding = await openai_client.embeddings.create(
            model=EMBEDDING_MODEL,
            input=["RAG-Anything embedding test"],
        )
        print("Embedding vector length:", len(test_embedding.data[0].embedding))
    except Exception as e:
        raise RuntimeError(
            "Chat worked, but embeddings failed. Make sure your API key has permission for embeddings."
        ) from e

    print("OpenAI API key is working.")
    print(f"Chat model: {LLM_MODEL}")
    print(f"Vision model: {VISION_MODEL}")
    print(f"Embedding model: {EMBEDDING_MODEL}")
    print(f"Embedding dimension: {EMBEDDING_DIM}")
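
As a rough illustration of what a chunk size of 900 with an overlap of 120 implies, the sketch below computes overlapping windows over a token sequence. It is a generic sliding-window calculation, not RAG-Anything's or LightRAG's actual chunker. `chunk_windows` and its parameters are hypothetical names, and the sketch assumes both settings are measured in tokens.

```python
from typing import List, Tuple


def chunk_windows(n_tokens: int, chunk_size: int = 900, overlap: int = 120) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for overlapping chunks over n_tokens tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    windows: List[Tuple[int, int]] = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break  # the last window already reaches the end of the text
        start += stride
    return windows


windows = chunk_windows(2000)
print(windows)  # [(0, 900), (780, 1680), (1560, 2000)]
```

With these settings each new chunk starts 780 tokens after the previous one, so neighbouring chunks share 120 tokens of context.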

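We prepare the working directories, output folders, logs, and runtime environment variables that RAG-Anything uses during execution, and we remove any OpenAI-related variables already present in the environment so that only the key entered here is used. We capture the OpenAI API key via a hidden input, clean the pasted value, and verify that both the chat and embedding calls work. Note that the cell prints the first 12 and last 6 characters of the key as a sanity check, so clear that output before sharing the notebook. We also define the chat, vision, and embedding models and the embedding dimension (1536) that power the rest of the tutorial.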
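
To see what the key-cleaning step does in practice, the sketch below replays the tutorial's clean_api_key logic on a few made-up inputs. It also adds a more conservative way to display a key: `mask_key` is a hypothetical helper that is not part of the tutorial, and every key string here is fake.

```python
import re


def clean_api_key(raw_value: str) -> str:
    """Same normalization steps as the tutorial's helper."""
    raw_value = str(raw_value or "").strip()
    raw_value = raw_value.replace("Bearer ", "").replace("bearer ", "").strip()
    raw_value = raw_value.strip("'").strip('"').strip("`").strip()
    if "=" in raw_value:
        # Handles pasted lines such as NAME="value".
        raw_value = raw_value.split("=", 1)[1].strip().strip("'").strip('"').strip("`")
    raw_value = re.sub(r"\s+", "", raw_value)
    return raw_value.encode("ascii", errors="ignore").decode("ascii").strip()


def mask_key(key: str, keep: int = 4) -> str:
    """Hypothetical helper: hide everything except the last `keep` characters."""
    if len(key) <= keep:
        return "*" * len(key)
    return "*" * (len(key) - keep) + key[-keep:]


samples = [
    "  sk-demo-1234  ",               # stray whitespace
    "Bearer sk-demo-1234",            # copied from an HTTP header
    'OPENAI_API_KEY="sk-demo-1234"',  # copied from a .env file
    "sk-demo-\n1234",                 # line break inside the paste
]
cleaned = [clean_api_key(s) for s in samples]
print(cleaned)
print(mask_key(cleaned[0]))
```

All four samples normalize to the same string, and the masked form reveals only the last four characters.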

Generating a Synthetic Multimodal Report

    print("\n[5/10] Creating a synthetic multimodal report...")

    monthly_data = pd.DataFrame(
        {
            "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
            "Query Volume": [1200, 1700, 2100, 2600, 3300, 4100],
            "Hybrid Accuracy": [0.71, 0.74, 0.79, 0.83, 0.87, 0.91],
            "Average Latency ms": [980, 920, 850, 790, 760, 730],
        }
    )

    table_md = monthly_data.to_markdown(index=False)

    plt.figure(figsize=(8, 4.8))
    plt.plot(monthly_data["Month"], monthly_data["Query Volume"], marker="o", label="Query Volume")
    plt.plot(monthly_data["Month"], monthly_data["Hybrid Accuracy"] * 4000, marker="s", label="Hybrid Accuracy scaled")
    plt.title("Multimodal RAG Usage and Quality Trend")
    plt.xlabel("Month")
    plt.ylabel("Volume / Scaled Accuracy")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.text(
        0.02,
        0.95,
        "Synthetic figure: usage rises while latency falls",
        transform=plt.gca().transAxes,
        fontsize=9,
        verticalalignment="top",
        bbox=dict(boxstyle="round", alpha=0.15),
    )

    chart_path = ASSET_DIR / "raganything_quality_trend.png"
    plt.tight_layout()
    plt.savefig(chart_path, dpi=180)
    plt.close()

    report_pdf_path = ASSET_DIR / "synthetic_multimodal_rag_report.pdf"

    c = canvas.Canvas(str(report_pdf_path), pagesize=letter)
    width, height = letter

    c.setFont("Helvetica-Bold", 18)
    c.drawString(0.8 * inch, height - 0.8 * inch, "Synthetic Multimodal RAG Evaluation Report")

    c.setFont("Helvetica", 10)
    intro_lines = [
        "This report evaluates a synthetic multimodal RAG pipeline for enterprise documents.",
        "The knowledge base includes text, tables, equations, and visual evidence.",
        "The central hypothesis is that hybrid retrieval improves answer quality when evidence spans modalities.",
    ]

    y = height - 1.25 * inch
    for line in intro_lines:
        c.drawString(0.8 * inch, y, line)
        y -= 0.22 * inch

    c.setFont("Helvetica-Bold", 12)
    c.drawString(0.8 * inch, y - 0.1 * inch, "Table 1. Monthly system measurements")
    y -= 0.4 * inch

    c.setFont("Courier", 7.5)
    for row in table_md.splitlines():
        c.drawString(0.8 * inch, y, row[:120])
        y -= 0.17 * inch

    c.setFont("Helvetica-Bold", 12)
    c.drawString(0.8 * inch, y - 0.15 * inch, "Equation 1. Weighted multimodal score")
    y -= 0.45 * inch

    c.setFont("Helvetica", 9)
    c.drawString(
        0.8 * inch,
        y,
        "Score(q, d) = alpha * Sim_text(q, d) + beta * Sim_graph(q, d) + gamma * Sim_visual(q, d)",
    )
    y -= 0.5 * inch

    c.drawImage(str(chart_path), 0.8 * inch, y - 2.8 * inch, width=6.5 * inch, height=2.6 * inch)

    c.showPage()

    c.setFont("Helvetica-Bold", 16)
    c.drawString(0.8 * inch, height - 0.8 * inch, "Interpretation and Findings")

    c.setFont("Helvetica", 10)
    findings = [
        "Hybrid retrieval combines semantic similarity with graph-based relationship navigation.",
        "The synthetic table shows accuracy improving from 0.71 to 0.91 over six months.",
        "The generated figure shows query volume increasing while latency gradually decreases.",
        "Equation-level retrieval is useful when the question depends on scoring logic rather than plain prose.",
        "A multimodal system should preserve page index, captions, footnotes, and local image paths for traceability.",
    ]

    y = height - 1.25 * inch
    for finding in findings:
        c.drawString(0.8 * inch, y, "- " + finding)
        y -= 0.28 * inch

    c.save()

    print(f"Created chart: {chart_path}")
    print(f"Created PDF: {report_pdf_path}")
    print("\nSynthetic table:")
    display(monthly_data)

We create a synthetic multimodal report that provides realistic content for testing in RAG-Anything. We build a small performance table, generate a chart, and export a PDF containing text, a table, an equation, and a figure. We use this controlled document to clearly observe how the system handles different content types.
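
Because the PDF is laid out by hand with a running vertical cursor, it can help to replay that arithmetic before adding more content to the page. The sketch below does so in inches, using the offsets from the PDF code. `layout_first_page` is a hypothetical helper, and the default of eight table lines assumes the Markdown table renders as one header line, one separator line, and six data rows.

```python
PAGE_HEIGHT_IN = 11.0  # US letter height in inches


def layout_first_page(n_intro_lines: int = 3, n_table_lines: int = 8) -> dict:
    """Replay the first page's cursor moves, measured in inches from the page bottom."""
    y = PAGE_HEIGHT_IN - 1.25      # baseline of the first intro line
    y -= 0.22 * n_intro_lines      # intro paragraph
    y -= 0.40                      # "Table 1" heading
    y -= 0.17 * n_table_lines      # table lines set in Courier
    y -= 0.45                      # "Equation 1" heading
    y -= 0.50                      # equation line
    image_bottom = y - 2.8         # lower edge passed to drawImage
    image_top = image_bottom + 2.6 # the image is drawn 2.6 inches tall
    return {
        "cursor": round(y, 2),
        "image_bottom": round(image_bottom, 2),
        "image_top": round(image_top, 2),
        "fits": image_bottom >= 0,
    }


layout = layout_first_page()
print(layout)
```

Under these assumptions the chart's lower edge sits about 3.58 inches above the bottom of the page, so the first page still has room. A much longer table would push `image_bottom` below zero and the figure would run off the page.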

Building the RAG-Anything content_list for Text, Tables, Equations, and Images

    print("\n[6/10] Building direct multimodal content_list...")

    content_list: List[Dict[str, Any]] = [
        {
            "type": "text",
            "text": (
                "This synthetic report evaluates a multimodal retrieval augmented generation system. "
                "The system indexes textual explanations, a structured performance table, a scoring equation, "
                "and a trend figure. The main goal is to answer questions whose evidence is distributed across "
                "several document modalities rather than one plain text passage."
            ),
            "page_idx": 0,
        },
        {
            "type": "table",
            "table_body": table_md,
            "table_caption": ["Table 1: Monthly query volume, hybrid accuracy, and average latency."],
            "table_footnote": ["Synthetic measurements created for a Colab tutorial."],
            "page_idx": 0,
        },
        {
            "type": "equation",
            "latex": r"Score(q,d)=\alpha \cdot Sim_{text}(q,d)+\beta \cdot Sim_{graph}(q,d)+\gamma \cdot Sim_{visual}(q,d)",
            "text": (
                "Weighted multimodal retrieval score. Alpha controls text similarity, beta controls graph relationship "
                "similarity, and gamma controls visual similarity."
            ),
            "page_idx": 0,
        },
        {
            "type": "image",
            "img_path": str(chart_path.resolve()),
            "image_caption": ["Figure 1: Multimodal RAG usage and quality trend."],
            "image_footnote": ["The line chart is synthetic and generated inside this tutoria

[truncated for AI cost control]
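
Before handing a content_list to the retrieval system, a quick structural check can catch malformed entries early. The sketch below validates the four block types used in this tutorial against the keys that appear in its content_list. `validate_content_list` and `REQUIRED_KEYS` are hypothetical names, the sample items are illustrative, and the absolute-path rule for images is an assumption based on the tutorial resolving chart_path before storing it.

```python
from pathlib import Path
from typing import Any, Dict, List

# Payload key expected for each block type, as used in this tutorial's content_list.
REQUIRED_KEYS = {
    "text": "text",
    "table": "table_body",
    "equation": "latex",
    "image": "img_path",
}


def validate_content_list(items: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    for i, item in enumerate(items):
        kind = item.get("type")
        if kind not in REQUIRED_KEYS:
            problems.append(f"item {i}: unknown type {kind!r}")
            continue
        key = REQUIRED_KEYS[kind]
        if not item.get(key):
            problems.append(f"item {i}: missing {key!r}")
        if not isinstance(item.get("page_idx"), int):
            problems.append(f"item {i}: page_idx should be an int")
        if kind == "image" and item.get(key) and not Path(item[key]).is_absolute():
            problems.append(f"item {i}: img_path should be absolute")
    return problems


sample = [
    {"type": "text", "text": "Illustrative paragraph.", "page_idx": 0},
    {"type": "image", "img_path": "chart.png", "page_idx": 0},  # relative path
    {"type": "audio", "page_idx": 0},                           # unsupported type
]
problems = validate_content_list(sample)
print(problems)
```

The check is deliberately narrow: it looks only at the payload key, `page_idx`, and the image path, and leaves captions and footnotes unchecked.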