
A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics

In this tutorial, we explore the FineWeb dataset through an advanced hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate useful analytics on domains, language scores, document lengths, and tokenizer efficiency.

Source: MarkTechPost | Author: Sana Hassan


```python
import subprocess, sys

def pip(*pkgs):
    # Install packages quietly into the interpreter that is running this script.
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")

import re, math, random, collections
from urllib.parse import urlparse

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from datasets import load_dataset

random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 90)
```

We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.


```python
N_DOCS = 3000
print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")

stream = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Pull records lazily and stop once N_DOCS have been collected.
docs = []
for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
    docs.append(doc)
    if i + 1 >= N_DOCS:
        break

df = pd.DataFrame(docs)
print("\nColumns:", list(df.columns))
print(df[["url", "language", "language_score", "token_count"]].head(5))

# Show every field of the first record, truncating long strings to 120 characters.
ex = docs[0]
print("\n--- Example record (fields) ---")
for k, v in ex.items():
    preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
    print(f"{k:>16}: {preview}")
```

We stream a fixed number of documents (N_DOCS = 3000) from the FineWeb sample-10BT subset without downloading the full dataset. We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print every field of one example record, truncating long strings to 120 characters, to better understand the dataset’s structure.
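The enumerate-and-break loop above can also be expressed with `itertools.islice`, which stops pulling from a lazy iterable after a fixed number of items. The sketch below uses a hypothetical stand-in generator (`fake_stream`) rather than the real dataset, so it runs offline; the helper name `take` is likewise an assumption made for illustration.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records lazily, forever."""
    i = 0
    while True:
        yield {"id": i, "text": f"document {i}"}
        i += 1

def take(stream, n):
    """Materialize only the first n records of a lazy iterable."""
    return list(islice(stream, n))

sample = take(fake_stream(), 5)
print(len(sample), sample[-1]["id"])
```

Because `islice` never asks the source for more than `n` items, the same pattern should cap how much data is fetched when the iterable is a streaming dataset.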


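```python
WORD = re.compile(r"\b\w+\b")

def gopher_quality(text):
    words = WORD.findall(text)
    n = len(words)
    # The lower bound was lost in extraction; 50 is restored from the published
    # Gopher rule (documents of 50 to 100,000 words).
    if n < 50 or n > 100_000:
        return False, "word_count_out_of_range"
    mean_len = sum(len(w) for w in words) / n
    # Lower bound likewise restored from Gopher (mean word length of 3 to 10).
    if mean_len < 3 or mean_len > 10:
        return False, "bad_mean_word_length"
    if (text.count("#") + text.count("...")) / n > 0.1:
        return False, "too_many_symbols"
    lines = text.split("\n")
    if lines and sum(l.lstrip().startswith(("•", "-", "*")) for l in lines) / len(lines) > 0.9:
        return False, "mostly_bullets"
    stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
    # Gopher requires at least two of these stop words; the threshold and the
    # reason label were lost in extraction and are restored on that basis.
    if len(stops & {w.lower() for w in words}) < 2:
        return False, "too_few_stopwords"
    return True, "ok"

def c4_quality(text):
    # NOTE: only the brace check of this function survived extraction. The line
    # split and the "lorem ipsum" check are an assumed, minimal stand-in based on
    # C4's published heuristics, not the author's original code.
    lines = [l for l in text.split("\n") if l.strip()]
    if "lorem ipsum" in text.lower():
        return False, "lorem_ipsum"
    if len(lines) > 0 and text.count("{") / max(len(lines), 1) > 0.5:
        return False, "too_many_braces"
    return True, "ok"

def fineweb_custom(text):
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > 0.3:
        return False, "duplicated_lines"
    # The 30-character cutoff was lost in extraction; it is restored from
    # FineWeb's published short-line rule.
    short_frac = sum(len(l) < 30 for l in lines) / len(lines)
    if short_frac > 0.67 and len(lines) > 5:
        return False, "list_like"
    return True, "ok"

# Record "kept" or the first failing reason (Gopher, then C4, then custom).
results = []
for d in docs:
    t = d["text"]
    g_ok, g_r = gopher_quality(t)
    c_ok, c_r = c4_quality(t)
    f_ok, f_r = fineweb_custom(t)
    reason = "kept" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
    results.append(reason)

filter_summary = pd.Series(results).value_counts()
print("\n--- Quality-filter outcomes on already-clean FineWeb data ---")
print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
print(filter_summary)
```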

We recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.
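The nested conditional that picks a rejection reason can be hard to read once more filters are added. As a sketch of one alternative, the helper below walks a list of checks in order and returns the first failing reason. The names `first_failure`, `non_empty`, and `no_placeholder` are hypothetical and exist only for this illustration; each check returns a `(passed, reason)` pair like the filters above.

```python
def first_failure(text, checks):
    """Run checks in order; return 'kept' or the first failing reason."""
    for check in checks:
        ok, reason = check(text)
        if not ok:
            return reason
    return "kept"

# Hypothetical toy checks, each returning (passed, reason).
def non_empty(text):
    return (True, "ok") if text.strip() else (False, "empty")

def no_placeholder(text):
    return (False, "lorem_ipsum") if "lorem ipsum" in text.lower() else (True, "ok")

checks = [non_empty, no_placeholder]
print(first_failure("A normal sentence.", checks))
print(first_failure("   ", checks))
print(first_failure("Lorem ipsum dolor sit amet.", checks))
```

Unlike the loop in the tutorial, which always evaluates all three filters, this version stops at the first failure. That saves work on rejected documents but reports only one reason per document, which is the same information the tutorial records.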


```python
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    # Overlapping k-word shingles; a text shorter than k words yields one shingle.
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}

NUM_PERM = 128
THRESHOLD = 0.7
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)

# Build one MinHash signature per document and index it in the LSH.
minhashes = {}
for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(d["text"]):
        m.update(s.encode("utf8"))
    minhashes[idx] = m
    lsh.insert(str(idx), m)

# Query each signature against the index; store each pair once, in sorted order.
dup_pairs = set()
for idx, m in minhashes.items():
    for cand in lsh.query(m):
        c = int(cand)
        if c != idx:
            dup_pairs.add(tuple(sorted((idx, c))))

print(f"\nFound {len(dup_pairs)} near-duplicate pairs (Jaccard ≥ {THRESHOLD}).")
if dup_pairs:
    a, b = next(iter(dup_pairs))
    j = minhashes[a].jaccard(minhashes[b])
    print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
    print(" DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
    print(" DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")
else:
    print("No near-dupes in this slice - expected, since FineWeb is dedup'd per crawl.")
```

We implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into 5-word shingles, generate 128-permutation MinHash signatures, and index them with locality-sensitive hashing (LSH) at a Jaccard threshold of 0.7. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.
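MinHash estimates the Jaccard similarity between shingle sets, so it can help to see the exact quantity on a pair small enough to check by hand. The sketch below reuses the same 5-word shingling idea with the standard library only; the `jaccard` helper and the two toy sentences are assumptions made for illustration, not FineWeb data.

```python
import re

WORD = re.compile(r"\b\w+\b")

def shingles(text, k=5):
    """Set of k-word shingles; a text shorter than k words yields one shingle."""
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Two toy sentences that differ in a single word ("dog" vs "cat").
doc_a = "the quick brown fox jumps over the lazy dog by the river"
doc_b = "the quick brown fox jumps over the lazy cat by the river"
sim = jaccard(shingles(doc_a), shingles(doc_b))
print(f"{sim:.3f}")
```

Each sentence has 8 shingles, 4 of which contain the changed word, so the sets share 4 of 12 distinct shingles and the similarity is 1/3. On very short texts a single edit therefore falls well below a 0.7 threshold; longer documents are much less sensitive to one changed word.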


```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
check = docs[:200]

# disallowed_special=() makes literal strings such as "<|endoftext|>" in web text
# encode as ordinary text instead of raising a ValueError.
recomputed = [len(enc.encode(d["text"], disallowed_special=())) for d in tqdm(check, desc="Tokenizing")]
stored = [d["token_count"] for d in check]
diffs = np.array(recomputed) - np.array(stored)

print("\n--- Verifying token_count field (gpt2) on 200 docs ---")
print(f"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens")
print(f"Exact matches: {(diffs == 0).mean()*100:.0f}% (small drift = tokenizer version)")

# Characters per token as a rough compression measure; clip avoids division by zero.
df["chars_per_token"] = df["text"].str.len() / df["token_count"].clip(lower=1)
print(f"Avg characters per token: {df['chars_per_token'].mean():.2f}")
```

We verify the dataset’s token_count field by recomputing GPT-2 token counts for the first 200 documents with the tiktoken library. We compare the recomputed counts with the stored values and report the mean absolute difference and the share of exact matches. We also calculate characters per token to understand tokenizer efficiency across the sampled documents.
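The two comparison statistics are simple enough to work through by hand. The sketch below computes them in plain Python on four made-up values; the numbers are toy inputs for illustration only and are not measurements from FineWeb, and `compare_counts` is a hypothetical helper name.

```python
def compare_counts(recomputed, stored):
    """Return (mean absolute difference, fraction of exact matches)."""
    if len(recomputed) != len(stored) or not stored:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [r - s for r, s in zip(recomputed, stored)]
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    exact = sum(d == 0 for d in diffs) / len(diffs)
    return mean_abs, exact

# Toy values for illustration only; not measured on FineWeb.
recomputed = [10, 12, 7, 30]
stored = [10, 11, 7, 30]
mean_abs, exact = compare_counts(recomputed, stored)
print(f"mean abs diff: {mean_abs:.2f}, exact matches: {exact:.0%}")
```

Here one of four documents is off by a single token, giving a mean absolute difference of 0.25 tokens and a 75% exact-match rate.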


```python
df["domain"] = df["url"].apply(
    lambda u: urlparse(u).netloc.replace("www.", "") if isinstance(u, str) else "?"
)
top_domains = df["domain"].value_counts().head(15)
print("\n--- Top 15 domains in sample ---")
print(top_domains)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Token counts, clipped at 4000 so the long tail does not flatten the histogram.
axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
axes[0, 0].set_title("Token count per document (gpt2)")
axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")

# Language-ID confidence, with the 0.65 cutoff marked.
axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff 0.65")
axes[0, 1].set_title("fastText English language score")
axes[0, 1].set_xlabel("score"); axes[0, 1].legend()

axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
axes[1, 0].set_title("Characters per token (compression)")
axes[1, 0].set_xlabel("chars / token")

top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
axes[1, 1].set_title("Top domains")

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"Docs streamed : {len(df):,}")
print(f"Total gpt2 tokens : {df['token_count'].sum():,}")
print(f"Median tokens/doc : {int(df['token_count'].median())}")
print(f"Unique domains : {df['domain'].nunique():,}")
print(f"Mean language_score : {df['language_score'].mean():.3f}")
print(f"Near-duplicate pairs : {len(dup_pairs)}")
print(f"Docs flagged by filters : {(pd.Series(results) != 'kept').sum()} / {len(results)}")

print("\nNext steps:")
print(" • Swap name='sample-10BT' for a real crawl, e.g. name='CC-MAIN-2024-10'")
print(" • Raise N_DOCS for stronger statistics")
print(" • Use the full datatrove pipeline to reproduce FineWeb end-to-end")
```

We extract domain names from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.
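One caveat about the domain extraction: `netloc.replace("www.", "")` removes the substring wherever it occurs in the host, and `netloc` keeps any port and the original letter case. A stricter variant, sketched below under the assumption that only a leading `www.` should be dropped, uses `urlparse(...).hostname`, which is lowercased and excludes the port. The helper name `domain_of` is hypothetical.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Lowercased host without a leading 'www.'; '?' when no host can be parsed."""
    if not isinstance(url, str):
        return "?"
    host = urlparse(url).hostname or "?"
    return host[4:] if host.startswith("www.") else host

for u in [
    "https://www.example.com/page",
    "http://awww.example.org/",
    "https://Example.COM:8080/x",
    None,
]:
    print(repr(u), "->", domain_of(u))
```

With the original expression, the second URL would become `aexample.org`; the variant leaves it as `awww.example.org` and also folds `Example.COM:8080` into `example.com`, so the same site is not counted under several spellings.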

In conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, searched for near-duplicate text, and validated token-level metadata against the GPT-2 tokenizer. The same workflow can be scaled to larger FineWeb crawls, extended to deeper corpus analysis, and used as a starting point for designing preprocessing pipelines for LLM dataset preparation.
