2026-05-26 07:25 UTC | In-site rewrite | 6 min read | Updated: 2026-06-30 13:03 UTC

Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export

In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards. We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and image distributions, and visualize representative examples from each domain. We also build a lightweight reward function that checks exact, numeric, fractional, LaTeX, and symbolic answers, giving us a useful way to evaluate model outputs. Finally, we format prompts for vision-language models, optionally test SmolVLM on sample examples, and export the dataset into a GRPO-style structure for future multimodal RL training.

Source: MarkTechPost | Author: Sana Hassan



import subprocess, sys
subprocess.run([sys.executable, "-m", "pip", "-q", "install",
                "datasets>=3.0", "huggingface_hub>=0.24", "transformers>=4.45",
                "Pillow", "matplotlib", "pandas", "numpy", "sympy",
                "accelerate", "tqdm"], check=True)
import os, re, io, json, math, random, textwrap, hashlib, warnings
from collections import Counter
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import sympy as sp
from datasets import load_dataset
warnings.filterwarnings("ignore")
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 120)
DS_ID = "TuringEnterprises/Open-MM-RL"
ds = load_dataset(DS_ID, split="train")
print(f"Loaded {DS_ID} — {len(ds)} rows")
print("Features:", ds.features)
print("Row 0 keys:", list(ds[0].keys()))

We install all required libraries and import the core tools needed for dataset loading, analysis, visualization, symbolic math, and file handling. We set random seeds for reproducibility and configure pandas so that longer text fields display clearly. We then load the TuringEnterprises/Open-MM-RL dataset from Hugging Face and inspect its size, features, and first-row structure.


df = ds.remove_columns(["images"]).to_pandas()
df["n_images"] = [len(ex["images"]) for ex in ds]
df["q_len_chars"] = df["question"].str.len()
df["a_len_chars"] = df["answer"].str.len()
print("\n=== Domain ==="); print(df["domain"].value_counts())
print("\n=== Format ==="); print(df["format"].value_counts())
print("\n=== Sub-domain (top by domain) ===")
print(df.groupby("domain")["subDomain"].value_counts().head(15))
print(f"\nMean images/example: {df['n_images'].mean():.2f} max: {df['n_images'].max()}")
print(f"Median Q length: {df['q_len_chars'].median():.0f} "
      f"Median A length: {df['a_len_chars'].median():.0f}")
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
df["domain"].value_counts().plot.bar(ax=axes[0], color="#4C72B0")
axes[0].set_title("Examples per domain"); axes[0].set_ylabel("count")
df["format"].value_counts().plot.bar(ax=axes[1], color="#55A868")
axes[1].set_title("Image-format type"); axes[1].tick_params(axis='x', rotation=25)
df["n_images"].plot.hist(ax=axes[2], bins=range(1, df["n_images"].max() + 2),
                         color="#C44E52", edgecolor="white")
axes[2].set_title("Images per example"); axes[2].set_xlabel("n_images")
plt.tight_layout(); plt.show()
def img_stats(ex):
    sizes = [im.size for im in ex["images"]]
    modes = [im.mode for im in ex["images"]]
    return {
        "n_images": len(sizes),
        "min_w": min(w for w, h in sizes),
        "max_w": max(w for w, h in sizes),
        "min_h": min(h for w, h in sizes),
        "max_h": max(h for w, h in sizes),
        "modes": "|".join(sorted(set(modes))),
        "total_pixels": sum(w * h for w, h in sizes),
    }
img_df = pd.DataFrame([img_stats(ex) for ex in ds])
print("\n=== Image resolution stats ===")
print(img_df[["min_w", "max_w", "min_h", "max_h", "total_pixels"]].describe().round(0))
print("\nMode mix:", Counter("|".join(img_df["modes"]).split("|")))

We convert the dataset into a DataFrame after removing the image column, then calculate useful fields such as the number of images, question length, and answer length. We analyze domain counts, format distribution, sub-domain breakdowns, and basic text/image statistics. We also create charts to visualize the number of examples per domain, the image formats, and the distribution of images per example.


def show_example(ex, max_chars=600):
    print("=" * 80)
    print(f"id={ex['conversation_id']} {ex['domain']} / {ex['subDomain']}")
    print(f"format={ex['format']} n_images={len(ex['images'])}")
    print("-" * 80)
    q = ex["question"][:max_chars] + ("..." if len(ex["question"]) > max_chars else "")
    print("Q:", textwrap.fill(q, 100))
    print("-" * 80)
    print("A (gold):", ex["answer"])
    n = len(ex["images"])
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 5)) if n > 1 \
        else plt.subplots(1, 1, figsize=(6, 6))
    axes = np.atleast_1d(axes)
    for ax, im in zip(axes, ex["images"]):
        ax.imshow(im); ax.set_xticks([]); ax.set_yticks([])
        ax.set_title(f"{im.size[0]}×{im.size[1]} ({im.mode})")
    plt.tight_layout(); plt.show()
for dom in df["domain"].unique():
    idx = int(df[df["domain"] == dom].index[0])
    show_example(ds[idx])
LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")
df["latex_blocks_q"] = df["question"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
df["latex_blocks_a"] = df["answer"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
print("\n=== LaTeX blocks per field ===")
print(df[["latex_blocks_q", "latex_blocks_a"]].describe().round(2))
def classify_answer(a):
    s = (a or "").strip().strip("$ []").strip()
    s_no_dollar = s.replace("$", "")
    if re.fullmatch(r"-?\s*\d+(\.\d+)?\s*", s_no_dollar):
        return "integer/float"
    if any(t in s for t in ["\\sqrt", "\\frac", "\\pi", "^", "_", "\\kappa", "\\lceil"]):
        return "symbolic"
    if re.fullmatch(r"[-+0-9./()\s\\a-zA-Z{}]+", s) and any(c.isdigit() for c in s):
        return "numeric_expr"
    return "text"
df["answer_type"] = df["answer"].apply(classify_answer)
print("\n=== Answer-type breakdown ==="); print(df["answer_type"].value_counts())
print("\n=== Answer-type × domain ===")
print(pd.crosstab(df["domain"], df["answer_type"]))

We define a helper function to display one representative example from each domain, including its question, gold answer, and associated images. We use this visual inspection step to better understand how multimodal reasoning problems are structured across different domains. We then analyze LaTeX usage in questions and answers, classify answer types, and compare answer-type distributions across domains.
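To make the LaTeX-counting step concrete, here is a small standalone sketch that applies the same pattern to a few hand-written strings. The sample sentences and the `count_latex_blocks` helper name are illustrative assumptions and do not come from the dataset.

```python
import re

# Same pattern as in the tutorial: matches \[...\], \(...\), or $...$ spans.
LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")

def count_latex_blocks(text):
    """Return the number of LaTeX spans found in `text` (None counts as empty)."""
    return len(LATEX_PAT.findall(text or ""))

# Hypothetical sample strings, written only for this illustration.
samples = {
    "plain": "What is the area of the triangle?",
    "inline": r"Find $x$ such that \(x^2 = 4\).",
    "display": r"Evaluate \[ \int_0^1 x \, dx \].",
}
counts = {name: count_latex_blocks(s) for name, s in samples.items()}
print(counts)  # {'plain': 0, 'inline': 2, 'display': 1}
```

The lazy quantifiers (`+?`) keep each bracketed match as short as possible, so two separate `\(...\)` spans in one sentence are counted as two blocks rather than merged into one.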


EXTRACT_PATS = [
    r"\\boxed\{([^{}]+)\}",
    r"final\s+answer\s*[:=]\s*([^\n]+)",
    r"answer\s*[:=]\s*([^\n]+)",
]
def extract_final(text):
    if not text:
        return ""
    for p in EXTRACT_PATS:
        m = re.search(p, text, flags=re.IGNORECASE)
        if m:
            return m.group(1).strip().strip(".,;")
    lines = [l.strip() for l in str(text).strip().splitlines() if l.strip()]
    return lines[-1] if lines else ""
def latex_to_sympy(s):
    s = (s or "").strip().strip("$").strip()
    s = re.sub(r"^\\[\[\(]", "", s); s = re.sub(r"\\[\]\)]$", "", s)
    s = (s.replace("\\pi", "pi").replace("\\cdot", "*").replace("\\times", "*")
          .replace("\\,", "").replace("\\;", "").replace("\\!", ""))
    s = re.sub(r"\\frac\s*\{([^{}]+)\}\s*\{([^{}]+)\}", r"((\1)/(\2))", s)
    s = re.sub(r"\\sqrt\s*\{([^{}]+)\}", r"sqrt(\1)", s)
    s = s.replace("^", "**")
    s = re.sub(r"\\[a-zA-Z]+", "", s)
    s = s.replace("{", "(").replace("}", ")")
    return s
def grade(pred, gold, tol=1e-4):
    """Verifiable reward in [0,1]: exact > numeric > sympy-symbolic > partial."""
    if pred is None or gold is None:
        return 0.0
    p = extract_final(str(pred)).strip()
    g = str(gold).strip()
    norm = lambda x: re.sub(r"\s+", "", x.lower()).strip("$.,;[]()")
    if norm(p) == norm(g):
        return 1.0
    def to_float(x):
        try:
            return float(latex_to_sympy(x))
        except Exception:
            try:
                return float(sp.sympify(latex_to_sympy(x)).evalf())
            except Exception:
                return None
    fp, fg = to_float(p), to_float(g)
    if fp is not None and fg is not None:
        if abs(fp - fg) / max(1.0, abs(fg)) [...] r={grade(pred, gold)} (want {want})")
SYSTEM = ("You are a STEM expert solving multimodal reasoning problems. "
          "You will see a question and one or more figures. "
          "Reason step by step, then end with exactly one line:\n"
          "Final answer: ")
def build_prompt(ex):
    img_tags = "\n".join(f"[Image {i+1}]" for i in range(len(ex["images"])))
    return f"{SYSTEM}\n\n{img_tags}\n\nQuestion:\n{ex['question']}\n\nLet's think step by step."
print("\n=== Example prompt (truncated) ===")
print(build_prompt(ds[0])[:600], "...\n")

We build a verifiable reward function that extracts final answers and compares predictions against gold answers using exact, numeric, and symbolic matching. We also add a LaTeX-to-SymPy conversion helper, allowing mathematical expressions to be evaluated more reliably. We test the grader with sanity checks and then create a structured prompt format for vision-language model reasoning.
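As a compact illustration of the extract-then-compare idea, the sketch below implements a reduced reward using only the standard library. It is a simplified stand-in and not the tutorial's `grade` function: the names `extract_final_line`, `to_number`, and `simple_reward` are hypothetical, the symbolic (SymPy) and partial-credit tiers are omitted, and the `< tol` comparison is an assumption about how the relative-error check is meant to be used.

```python
import re
from fractions import Fraction

def extract_final_line(text):
    """Return the text after 'Final answer:' if present, else the last non-empty line."""
    m = re.search(r"final\s+answer\s*[:=]\s*([^\n]+)", text, flags=re.IGNORECASE)
    if m:
        return m.group(1).strip().strip(".,;")
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def to_number(s):
    """Parse '0.5', '1/2', or a simple \\frac{a}{b} into a float; None on failure."""
    s = s.strip().strip("$").strip()
    s = re.sub(r"\\frac\s*\{([^{}]+)\}\s*\{([^{}]+)\}", r"\1/\2", s)
    try:
        return float(Fraction(s.replace(" ", "")))
    except (ValueError, ZeroDivisionError):
        return None

def simple_reward(pred, gold, tol=1e-4):
    """1.0 for a normalized string match or a numeric match within tol, else 0.0."""
    p, g = extract_final_line(pred), gold.strip()
    squash = lambda x: re.sub(r"\s+", "", x.lower()).strip("$")
    if squash(p) == squash(g):
        return 1.0
    fp, fg = to_number(p), to_number(g)
    if fp is not None and fg is not None:
        # Assumed threshold test: relative error, floored at 1.0 in the denominator.
        if abs(fp - fg) / max(1.0, abs(fg)) < tol:
            return 1.0
    return 0.0

print(simple_reward("Some reasoning...\nFinal answer: 1/2", "0.5"))          # 1.0
print(simple_reward("Final answer: \\frac{1}{2}", "0.5"))                    # 1.0
print(simple_reward("I am not sure.", "42"))                                 # 0.0
```

Using `max(1.0, abs(fg))` in the denominator makes the check behave like an absolute tolerance for small gold values and a relative tolerance for large ones, which avoids division by near-zero gold answers.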


import torch
USE_VLM = torch.cuda.is_available()
print(f"CUDA available: {USE_VLM}")
if USE_VLM:
    try:
        from transformers import AutoProcessor, AutoModelForVision2Seq
        MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"
        print(f"Loading {MODEL_ID} (this takes ~1 min) ...")
        processor = AutoProcessor.from_pretrained(MODEL_ID)
        model = AutoModelForVision2Seq.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16, device_map="auto"
        )
        def vlm_solve(ex, max_new_tokens=512):
            imgs = [im.convert("RGB") for im in ex["images"]]
            content = [{"type": "image"} for _ in imgs]
            content.append({"type": "text", "text": build_prompt(ex)})
            text = processor.apply_chat_template(
                [{"role": "user", "content": content}], add_generation_prompt=True)
            inputs = processor(text=text, images=imgs, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
            return processor.batch_decode(
                out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
        rows, sample_idx = [], random.sample(range(len(ds)), 6)
        for i in sample_idx:
            ex = ds[i]
            try:
                pred = vlm_solve(ex)
                r = grade(pred, ex["answer"])
            except Exception as e:
                pred, r = f"", 0.0
            rows.append({"id": ex["conversation_id"], "domain": ex["domain"],
                         "reward": r, "pred_tail": pred[-200:]})
            print(f" id={ex['conversation_id']} {ex['domain']:9s} r={r:.2f}")
        res = pd.DataFrame(rows)
        print(f"\nMean reward over {len(res)} samples: {res['reward'].mean():.3f}")
        print(res.groupby("domain")["reward"].mean().rename("avg_reward"))
    except Exception as e:
        print(f"VLM run failed ({e}); reward & data pipeline remain usable.")
else:
    print("No GPU detected — skipping live VLM inference (Runtime → Change runtime type → GPU).")
out_dir = Path("/content/open_mm_rl_processed"); out_dir.mkdir(exist_ok=True, parents=True)
img_dir = out_dir / "images"; img_dir.mkdir(exist_ok=True)
records = []
for ex in ds:
    paths = []
    for j, im in enumerate(ex["images"]):
        p = img_dir / f"{ex['conversation_id']}_{j}.png"
        im.convert("RGB").save(p)
        paths.append(str(p))
    records.append({
        "id": ex["conversation_id"], "domain": ex["domain"],
        "subDomain": ex["subDomain"], "format": ex["format"],
        "prompt": build_prompt(ex), "gold": ex["answer"],
        "image_paths": paths,
    })
jsonl_path = out_dir / "data.jsonl"
with open(jsonl_path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
print(f"\nWrote {len(records)} records → {jsonl_path}")
print(f"Saved {sum(len(r['image_paths']) for r in records)} images under {img_dir}")
def mock_policy_samples(gold, K=4):
    """Stand-in for K policy rollouts. Replace with model.generate(do_sample=True)."""
    return [gold, "Final answer: 0", f"Final answer: {gold} (≈)",
            "I think the answer is unclear."][:K]
def grpo_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)
print("\n=== Mock GRPO rollouts for example 0 ===")
gold0 = ds[0]["answer"]
cands = mock_policy_samples(gold0, K=4)
rewards = [grade(c, gold0) for c in cands]
adv = grpo_advantages(rewards)
for c, r, a in zip(cands, rewards, adv):
    print(f" r={r:.2f} adv={a:+.2f} cand={c!r}")
print("\nDone. To turn this into real

[truncated for AI cost control]
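The `grpo_advantages` helper in the final code block standardizes rewards within one group of rollouts. The following standard-library sketch works through that normalization on two hand-picked reward groups; the function name `group_advantages` and the reward values are illustrative assumptions, not outputs from the dataset or a model.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std, matching NumPy's default (ddof=0)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward groups for one prompt with K=4 rollouts each.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])    # two correct, two wrong
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])  # all rollouts correct

print([round(a, 3) for a in mixed])  # [1.0, -1.0, 1.0, -1.0]
print(uniform)                       # [0.0, 0.0, 0.0, 0.0]
```

The second group shows why the `eps` term matters: when every rollout earns the same reward, the standard deviation is zero, and the advantages collapse to zero instead of raising a division error. Such a group contributes no learning signal, which is one reason a mix of easy and hard prompts tends to be useful for this style of training.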