2026-07-03 21:25 UTC · In-site rewrite · 6 min read · Updated: 2026-07-03 21:39 UTC

Designing a Schema-Guided Invoice Intelligence Pipeline with lift-pdf for Accounts-Payable Extraction, Validation, and Ledger Generation

In this tutorial, we build an end-to-end accounts-payable extraction pipeline with lift-pdf, using synthetic invoice PDFs as controlled test documents and a structured JSON schema as the target output format. Instead of treating invoice parsing as a simple OCR task, we frame it as schema-guided document understanding: we generate realistic invoices, define fields such as vendor identity, billing party, PO number, line items, tax, total amount, balance due, and payment status, and then ask the model to extract those values directly from the rendered PDF layout. We also include practical extraction traps that appear in real finance workflows, such as distinguishing bill-to from ship-to, separating subtotal from after-tax total, returning null for absent values, and correctly marking partially paid invoices as unpaid when a balance remains. Through GPU-aware model loading, optional 4-bit quantization, PDF generation and extraction, scoring, and ledger construction, we turn this tutorial into a compact yet realistic demonstration of document intelligence for invoice mining.

Source: MarkTechPost · Author: Sana Hassan


# Runtime controls
N_DOCS = 3
FORCE_FULL_PRECISION = False
FORCE_4BIT = False
SHOW_FIRST_PAGE = True
RUN_ON_REAL_PDF = False
REAL_PDF_URL = ""
REAL_PDF_PAGES = "0-1"
PIN_PILLOW = True
PILLOW_VERSION = "11.3.0"

import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def pip(*pkgs, upgrade=False):
    """Install without invoking a shell (so '[hf]' is never glob-expanded)."""
    args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
    print(" pip install", *pkgs)
    subprocess.run(args, check=False)

print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")
pip("lift-pdf[hf]")
pip("bitsandbytes", "accelerate", upgrade=True)

if PIN_PILLOW:
    pip(f"pillow=={PILLOW_VERSION}")
    # If an older Pillow is already imported, the pin only takes effect after a restart.
    if "PIL" in sys.modules:
        import PIL
        if getattr(PIL, "__version__", "") != PILLOW_VERSION:
            print(f" Pinned Pillow {PILLOW_VERSION} on disk, but a stale "
                  f"{getattr(PIL, '__version__', '?')} is loaded in memory — restarting runtime.")
            print(" Just re-run the cell(s) after Colab reconnects.")
            os.kill(os.getpid(), 9)

print(" …install finished.\n")
import torch

We begin by defining the runtime controls that decide how many invoices we process, whether we force 4-bit or full-precision loading, whether we preview the generated PDF, and whether we later test a real invoice. We install the core dependencies for PDF generation, rendering, tabular analysis, plotting, and lift-pdf inference. We also pin Pillow to a fixed version to work around a known Colab compatibility issue among Pillow, torchvision, and Transformers; if a different Pillow is already loaded in memory, the cell restarts the runtime so the pinned version takes effect. This setup gives us a reproducible environment before we load any model or generate any document.
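The stale-module check behind the Pillow pin can be isolated as a small helper. The sketch below is only an illustration under stated assumptions, not the tutorial's code: the names `loaded_version` and `needs_restart` and the stand-in module `fake_pinned_pkg` are hypothetical, and the helper only inspects `sys.modules` without installing or restarting anything.

```python
import sys
import types


def loaded_version(module_name):
    """Return __version__ of an already-imported module, or None if unavailable."""
    mod = sys.modules.get(module_name)
    if mod is None:
        return None
    return getattr(mod, "__version__", None)


def needs_restart(module_name, pinned_version):
    """True when a version other than the pinned one is already in memory.

    Unlike the tutorial cell, a loaded module with no __version__ attribute
    is treated here as "no restart needed" (a simplifying assumption).
    """
    current = loaded_version(module_name)
    return current is not None and current != pinned_version


# Demo with a hypothetical stand-in module, so no real package is required.
fake = types.ModuleType("fake_pinned_pkg")
fake.__version__ = "1.0"
sys.modules["fake_pinned_pkg"] = fake

stale = needs_restart("fake_pinned_pkg", "2.0")      # loaded 1.0, pinned 2.0
matching = needs_restart("fake_pinned_pkg", "1.0")   # loaded version equals the pin
not_loaded = needs_restart("never_imported_pkg_xyz", "1.0")
print(stale, matching, not_loaded)
```

Only the first case calls for a restart: a package that has not been imported yet will pick up the pinned version from disk on first import.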


def detect_gpu():
    if not torch.cuda.is_available():
        raise SystemExit(
            "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
            "(A100 is best; L4/T4 also work).\n"
        )
    p = torch.cuda.get_device_properties(0)
    cc = torch.cuda.get_device_capability(0)
    return p.name, p.total_memory / 1e9, cc

def enable_4bit(compute_dtype):
    """Load lift's weights in 4-bit NF4 whatever transformers Auto* class it uses internally."""
    import inspect, functools, transformers
    from transformers import BitsAndBytesConfig
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )

    def patch(cls):
        # Fetch the raw descriptor so the underlying function can be wrapped.
        try:
            cm = inspect.getattr_static(cls, "from_pretrained")
            orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
        except Exception:
            return

        @functools.wraps(orig)
        def inner(cls_, *args, **kwargs):
            # Inject the quantization settings only when the caller did not set them.
            kwargs.setdefault("quantization_config", bnb)
            kwargs.setdefault("device_map", {"": 0})
            model = orig(cls_, *args, **kwargs)
            # Turn later .to()/.cuda() calls on the quantized model into no-ops.
            try:
                model.to = lambda *a, **k: model
                model.cuda = lambda *a, **k: model
            except Exception:
                pass
            return model

        cls.from_pretrained = classmethod(inner)

    for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                 "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
        c = getattr(transformers, name, None)
        if c is not None:
            patch(c)
    try:
        from transformers.modeling_utils import PreTrainedModel
        patch(PreTrainedModel)
    except Exception:
        pass

print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
# NOTE: the source lost the text between "vram <" and ">= 8" during extraction.
# The VRAM cutoff (and any further condition) is unknown and is left as "…";
# reading the ">= 8" test as a compute-capability check on cc[0] is an assumption.
use_4bit = FORCE_4BIT or (vram < …)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16
print(f" GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f" Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")

os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
    enable_4bit(compute_dtype)

from lift import extract
from lift.model import InferenceManager

print(" Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")
print(f" ✓ model ready in {time.time() - _t:.0f}s\n")

def run_lift(pdf_path, schema, page_range=None):
    kw = {"model": MODEL}
    if page_range:
        kw["page_range"] = page_range
    result = extract(pdf_path, schema, **kw)
    return getattr(result, "extraction", None)

We prepare the GPU-aware inference backend and decide whether the model should run in full precision or 4-bit NF4 quantization based on available VRAM. We patch the Hugging Face model-loading path so lift can transparently load the checkpoint with a BitsAndBytes quantization configuration when needed. We initialize the InferenceManager once and reuse it across all invoices, avoiding repeated model-loading overhead. Finally, we wrap lift.extract() inside a small helper so each PDF can be mined with the same schema and optional page range.
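The patching idea described above, wrapping a class's `from_pretrained` so that missing keyword arguments fall back to injected defaults, can be tried without a GPU or the Transformers library. The following is a minimal sketch under stated assumptions: `DummyModel` and `inject_defaults` are hypothetical stand-ins, and the string `"nf4-config"` merely takes the place of a real quantization config object.

```python
import functools
import inspect


class DummyModel:
    """Hypothetical stand-in for a model class; not part of any real library."""

    def __init__(self, name, **options):
        self.name = name
        self.options = options

    @classmethod
    def from_pretrained(cls, name, **kwargs):
        return cls(name, **kwargs)


def inject_defaults(cls, **defaults):
    """Wrap cls.from_pretrained so absent keyword arguments take `defaults`."""
    raw = inspect.getattr_static(cls, "from_pretrained")
    # A classmethod descriptor exposes the plain function via __func__.
    orig = raw.__func__ if isinstance(raw, classmethod) else raw

    @functools.wraps(orig)
    def inner(cls_, *args, **kwargs):
        for key, value in defaults.items():
            kwargs.setdefault(key, value)  # explicit caller values win
        return orig(cls_, *args, **kwargs)

    cls.from_pretrained = classmethod(inner)


inject_defaults(DummyModel, quantization_config="nf4-config", device_map={"": 0})

implicit = DummyModel.from_pretrained("demo")
explicit = DummyModel.from_pretrained("demo", device_map="auto")
print(implicit.options)
print(explicit.options)
```

Using `setdefault` rather than plain assignment is the design choice that keeps the patch transparent: code that already passes its own `device_map` or `quantization_config` is left alone.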


DOCS = [
    dict(
        invoice_number="INV-2026-0412",
        invoice_date="2026-05-04",
        due_date="2026-06-03",
        vendor_name="Cloudworks Inc.",
        vendor_address="500 Market St, Suite 900, San Francisco, CA 94105, USA",
        bill_to_name="Acme Robotics LLC",
        bill_to_address="12 Foundry Rd, Pittsburgh, PA 15222, USA",
        ship_to_name="Acme Robotics — Warehouse 4",
        ship_to_address="88 Dockside Blvd, Newark, NJ 07114, USA",
        po_number=None,
        discount_amount=None,
        currency_code="USD",
        currency_symbol="$",
        tax_rate=0.085,
        amount_paid=0.00,
        line_items=[
            ("Cloud Compute — Standard tier (monthly)", 3, 240.00),
            ("Object Storage — 2 TB", 1, 46.00),
            ("Priority Support add-on", 1, 99.00),
        ],
        notes="Payment due within 30 days. Late payments accrue 1.5% monthly interest.",
    ),
    dict(
        invoice_number="INV-ND-2026-118",
        invoice_date="2026-04-18",
        due_date="2026-05-18",
        vendor_name="Nordic Design Studio Oy",
        vendor_address="Eteläranta 12, 00130 Helsinki, Finland",
        bill_to_name="Helsinki Media Oy",
        bill_to_address="Mannerheimintie 4, 00100 Helsinki, Finland",
        ship_to_name=None,
        ship_to_address=None,
        po_number="PO-HM-5589",
        discount_amount=785.00,
        currency_code="EUR",
        currency_symbol="€",
        tax_rate=0.24,
        amount_paid=8760.60,
        line_items=[
            ("Brand identity design package", 1, 4200.00),
            ("Web UI design — 12 screens", 12, 180.00),
            ("Custom illustration set", 1, 850.00),
            ("Design-system documentation", 1, 640.00),
        ],
        notes="Paid in full — thank you. All amounts in EUR.",
    ),
    dict(
        invoice_number="INV-BR-4471",
        invoice_date="2026-06-01",
        due_date="2026-07-15",
        vendor_name="BuildRight Contractors Inc.",
        vendor_address="740 Industrial Way, Austin, TX 78744, USA",
        bill_to_name="Sunrise Property Group",
        bill_to_address="9 Lakeview Terrace, Austin, TX 78703, USA",
        ship_to_name="Sunrise Property Group — Lot 14 site office",
        ship_to_address="Parcel 14, Mesa Ridge Development, Austin, TX 78737, USA",
        po_number="PO-SPG-2211",
        discount_amount=None,
        currency_code="USD",
        currency_symbol="$",
        tax_rate=0.07,
        amount_paid=15000.00,
        line_items=[
            ("Site preparation and grading", 1, 18500.00),
            ("Foundation concrete pour (Phase 1)", 1, 27400.00),
        ],
        notes="A 15,000 USD deposit has been received. Remaining balance due by the date above.",
    ),
][:N_DOCS]

def compute(d):
    """Derive every money figure once, so PDF text and ground truth are guaranteed identical."""
    items = [(desc, q, up, round(q * up, 2)) for (desc, q, up) in d["line_items"]]
    subtotal = round(sum(t for *_, t in items), 2)
    disc = d.get("discount_amount")
    taxable = round(subtotal - (disc or 0.0), 2)
    tax = round(taxable * d["tax_rate"], 2)
    total = round(taxable + tax, 2)
    paid = round(d.get("amount_paid", 0.0), 2)
    balance = round(total - paid, 2)
    return dict(items=items, subtotal=subtotal, discount=disc, tax=tax, total=total,
                amount_paid=paid, balance=balance, is_paid=(balance …

# […] Source text is missing here: the end of compute(), the ground_truth() helper,
# and the beginning of render_pdf() were lost in extraction.

             … Status: {status}", BODY),
             Spacer(1, 16),
             Paragraph("Notes", LBL),
             Paragraph(d["notes"], BODY)]
    SimpleDocTemplate(path, pagesize=LETTER,
                      topMargin=0.7 * inch, bottomMargin=0.7 * inch,
                      leftMargin=0.8 * inch, rightMargin=0.8 * inch).build(story)

print("STEP 3/7 · Generating synthetic invoice PDFs…")
CORPUS = []
for i, d in enumerate(DOCS):
    path = f"/content/invoice_{i}.pdf" if os.path.isdir("/content") else f"invoice_{i}.pdf"
    render_pdf(d, path)
    CORPUS.append((d, ground_truth(d), path))
    print(f" ✓ {os.path.basename(path)} — {d['vendor_name']} → {d['bill_to_name']}")
print()

if SHOW_FIRST_PAGE:
    try:
        import pypdfium2 as pdfium, matplotlib.pyplot as plt
        pg = pdfium.PdfDocument(CORPUS[0][2])[0]
        img = pg.render(scale=2.0).to_pil()
        plt.figure(figsize=(6.4, 8.3)); plt.imshow(img); plt.axis("off")
        plt.title("What lift reads — page 1 of invoice_0.pdf", fontsize=10); plt.show()
    except Exception as e:
        print(" page preview skipped:", e, "\n")

We define the synthetic invoice records and derive every money figure once in compute(), so the text printed in the PDF and the ground truth stay identical. We render each synthetic invoice into a realistic one-page PDF using ReportLab, including headers, invoice metadata, billing and shipping blocks, line-item tables, totals, payment status, and notes. We intentionally preserve layout elements that make invoice extraction difficult, such as separate bill-to and ship-to sections and subtotal versus total fields. We then generate the PDF corpus and optionally preview the first page using pypdfium2 and Matplotlib. This step creates the actual visual documents that lift reads during extraction.
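As a check on the arithmetic behind the printed totals, the sketch below recomputes two of the synthetic invoices from their line items, following the same order of operations as the tutorial's compute(): discount first, then tax on the discounted amount, then balance after payments. The function name `invoice_totals` is hypothetical, and the `tolerance` used to decide paid status is an assumption, since the original comparison did not survive in the source.

```python
def invoice_totals(line_items, tax_rate, discount=None, amount_paid=0.0, tolerance=0.005):
    """Derive subtotal, tax, total, and balance from (description, qty, unit_price) rows.

    `tolerance` is an assumed cutoff for treating a balance as settled.
    """
    line_totals = [round(qty * unit_price, 2) for _, qty, unit_price in line_items]
    subtotal = round(sum(line_totals), 2)
    taxable = round(subtotal - (discount or 0.0), 2)   # discount applies before tax
    tax = round(taxable * tax_rate, 2)
    total = round(taxable + tax, 2)
    balance = round(total - amount_paid, 2)
    return {
        "subtotal": subtotal,
        "tax": tax,
        "total": total,
        "balance": balance,
        "is_paid": balance <= tolerance,
    }


# Nordic Design Studio invoice: discounted, taxed at 24%, paid in full.
nordic = invoice_totals(
    [("Brand identity design package", 1, 4200.00),
     ("Web UI design, 12 screens", 12, 180.00),
     ("Custom illustration set", 1, 850.00),
     ("Design-system documentation", 1, 640.00)],
    tax_rate=0.24, discount=785.00, amount_paid=8760.60,
)

# BuildRight invoice: no discount, taxed at 7%, deposit only.
buildright = invoice_totals(
    [("Site preparation and grading", 1, 18500.00),
     ("Foundation concrete pour (Phase 1)", 1, 27400.00)],
    tax_rate=0.07, amount_paid=15000.00,
)

print(nordic)
print(buildright)
```

The second invoice is the partial-payment trap: a deposit has been received, yet a balance remains, so the expected payment status is unpaid.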


SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice's unique identifier / number"},
        "invoice_date": {"type": "string", "description": "Date the invoice was issued (as printed)"},
        "due_date": {"type": "string", "description": "Date payment is due"},
        "vendor": {
            "type": "object",
            "description": "The party that ISSUED the invoice (the seller / supplier)",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            }},
        "customer_name": {"type": "string",
            "description": "The party the invoice is billed TO (the 'Bill To' party) — "
                           "not the vendor, and not the 'Ship To' party if it differs"},
        "purchase_order_number": {"type": "string",
            "description": "The PO number referenced on the invoice. "
                           "Return null if no purchase-order number appears"},
        "currency": {"type": "string",
            "description": "ISO 4217 currency code of the amounts, e.g. USD or EUR"},
        "line_items": {
            "type": "array",
            "description": "Every billed line item, in order",
            "items": {"type": "object", "properties": {
                "description": {"type": "string"},
                "quantity": {"type": "number"},
                "unit_price": {"type": "number"},
                "line_total": {"type": "number", "description": "quantity × unit_price for this line"},
            }}},
        "subtotal": {"type": "number", "description": "Sum of line totals BEFORE tax and discount"},
        "discount_amount": {"type": "number",
            "description": "Total discount applied. Return null if no discount is shown"},
        "tax_amount": {"type": "number", "description": "Total tax / VAT charged"},
        "total_amount": {"type": "number",
            "description": "The grand total the customer owes, AFTER tax and any discount — "
                           "NOT the pre-tax subtotal and NOT the tax line"},
        "amount_paid": {"type": "number", "descriptio

[truncated for AI cost control]