Report: Building an Evaluation Pipeline for PDF → Structured‑Data Extraction (Invoices & Expenses)
1. Why dedicated evaluation matters #
- Regulatory/audit requirements demand field‑level accuracy; even a single wrong “Total” or “Tax” can flow straight into the GL.
- Vision‑language models (VLMs) now rival traditional OCR, but quality is highly variable across layouts, scan quality and handwriting. A repeatable eval harness is the only way to know when a model is truly “good enough.” (OmniAI)
2. Off‑the‑shelf evaluation frameworks you can stand up quickly #
Framework (budget‑friendly) | What it gives you out‑of‑the‑box | How to apply it to invoice extraction | Typical effort |
---|---|---|---|
structured‑evals (Pydantic Evals) (PyPI) | Evaluate JSON‑mode outputs against a Pydantic schema; plug in field‑specific metrics. | 1. Define an `Invoice(BaseModel)` with the exact fields you want. 2. Register metrics: `exact_match`, numeric tolerance, date parsing. 3. Feed predicted JSON + ground truth to the Evaluator ⇒ instant per‑field scores (see the sketch below the table). | ½ day to wire up; works in a notebook/CI runner. |
Transformer Lab – Basic Metrics plugin (transformerlab.ai) | Point‑and‑click checks: “Is valid JSON?”, “Contains dates?”, etc. | Upload a CSV of `(input_pdf_path, model_output)`; add rules like Is Valid JSON and regexes for money/date formats. | Minutes; no code. |
Transformer Lab – LLM‑as‑Judge plugin (DeepEval) (transformerlab.ai) | Uses a judge LLM to grade faithfulness, hallucination, bias. | Prompt the judge with: “Given the source OCR text chunk and the extracted JSON, score each field for correctness.” Automates fuzzy checks when “Total” appears twice, formatting varies, etc. | 1–2 hrs to craft prompt & schema. |
Transformer Lab – EleutherAI Harness plugin (transformerlab.ai) | Runs the classic LM Harness battery of tasks. | Not invoice‑specific, but a useful sanity check if you plan to fine‑tune: confirm you didn’t degrade the model’s general ability to read numbers or dates. | Drop‑in. |
Transformer Lab – Objective Metrics & Red‑Teaming plugins (transformerlab.ai) | Numeric deltas, robustness, PII leakage, prompt‑injection stress tests. | Upload adversarial docs (e.g., totals hidden in a watermark, vendor address inside a table); check that the pipeline neither misses data nor leaks sensitive info. | A few sample docs + plugin config. |
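A minimal sketch of the schema‑plus‑metrics idea in plain Pydantic (the `Invoice` fields, file names and metric choices here are illustrative assumptions; adapt them to whatever your structured‑evals / Pydantic Evals setup defines):

```python
# Minimal sketch of schema-validated, field-level scoring with plain Pydantic.
# The Invoice fields, file names and metric choices are illustrative assumptions.
from datetime import date
from decimal import Decimal
from pathlib import Path

from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: Decimal
    unit_price: Decimal


class Invoice(BaseModel):
    vendor: str
    invoice_date: date
    total: Decimal
    tax: Decimal
    line_items: list[LineItem]


def score_fields(pred: Invoice, gt: Invoice) -> dict[str, float]:
    """Per-field 0/1 scores using the metrics recommended in section 3."""
    return {
        "vendor": float(pred.vendor.strip().lower() == gt.vendor.strip().lower()),
        "invoice_date": float(pred.invoice_date == gt.invoice_date),
        "total": float(abs(pred.total - gt.total) <= Decimal("0.01")),
        "tax": float(abs(pred.tax - gt.tax) <= Decimal("0.01")),
    }


# Usage: parse the model's JSON output and the hand-labelled ground truth.
pred = Invoice.model_validate_json(Path("pred_0001.json").read_text())
gt = Invoice.model_validate_json(Path("gt_0001.json").read_text())
print(score_fields(pred, gt))
```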
Tip: Run Pydantic Evals in your CI for every model/prompt change; run the heavier “LLM‑as‑Judge” nightly on a sample batch.
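For the LLM‑as‑Judge pass, the judging prompt can stay short. A hedged sketch (the model name, prompt wording and response schema are assumptions, not the DeepEval plugin’s exact configuration):

```python
# Hedged sketch of an LLM-as-judge check for extracted invoice fields.
# The model name, prompt wording and response schema are assumptions.
import json

from openai import OpenAI

JUDGE_PROMPT = (
    "You are auditing an invoice-extraction system. Given the source OCR text "
    "chunk and the extracted JSON, score each field for correctness from 0.0 "
    "to 1.0 and list any hallucinated fields. Respond with JSON of the form "
    '{"scores": {"<field>": <score>}, "hallucinations": ["<field>"]}.'
)

client = OpenAI()


def judge(ocr_text: str, extracted: dict) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any judge model with JSON mode will do
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"OCR TEXT:\n{ocr_text}\n\nEXTRACTED JSON:\n{json.dumps(extracted)}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```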
3. Choosing metrics that correlate with business risk #
Field type | Recommended metric(s) | Threshold suggestion |
---|---|---|
Currency / quantities | abs_diff ≤ $0.01 OR exact string | ≥ 99 % per field |
Vendor / GL coding | exact_match (case‑insensitive) | ≥ 98 % |
Dates | parsed_date == parsed_gt_date | ≥ 99.5 % |
Line‑items array | F1 on each line (IDs unordered) | ≥ 97 % |
Full JSON | mean(field_score) for dashboard trend | track regression ≤ 0.5 pt |
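Most of these metrics are one‑liners; line‑item F1 with unordered matching needs a few more. A hedged sketch (the match criteria are assumptions to tune per document type):

```python
# Sketch of unordered line-item F1. The match criteria (description string match,
# exact quantity, unit price within $0.01) are assumptions to tune per doc type.
from decimal import Decimal


def items_match(pred: dict, gt: dict, tol: Decimal = Decimal("0.01")) -> bool:
    return (
        pred["description"].strip().lower() == gt["description"].strip().lower()
        and Decimal(str(pred["quantity"])) == Decimal(str(gt["quantity"]))
        and abs(Decimal(str(pred["unit_price"])) - Decimal(str(gt["unit_price"]))) <= tol
    )


def line_item_f1(pred_items: list[dict], gt_items: list[dict]) -> float:
    unmatched_gt = list(gt_items)
    tp = 0
    for p in pred_items:
        for g in unmatched_gt:
            if items_match(p, g):
                unmatched_gt.remove(g)
                tp += 1
                break
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gt_items) if gt_items else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```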
4. Novel extraction strategies worth prototyping #
- Multi‑pass, self‑verification loop (a minimal sketch follows this list)
  - Pass 1: a cheap model produces draft JSON + per‑token confidence.
  - Pass 2: an LLM (e.g., GPT‑4o or Gemini Flash) receives source text + draft + schema → returns “fixed” JSON and a list of deltas.
  - Flag to a human only if confidence < τ or the evaluator score drops.
- Layout‑aware hybrid
  - Combine classical OCR for bounding boxes with a VLM for semantics (LayoutLMv3 shows strong gains on invoices). (arXiv)
- Benchmark‑guided provider routing
  - Use the OmniAI benchmark’s JSON‑accuracy table to choose the cheapest provider that still meets > 90 % field accuracy for your doc type; fall back to a premium model when the Basic Eval fails. (OmniAI)
- Synthetic data augmentation
  - Use an “Infinite PDF Generator” or similar to create vendor/format variants; this lets you reach hundreds of labelled pages without costly annotation. (Same pipeline OmniAI used.) (OmniAI)
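A minimal sketch of the multi‑pass loop, with the model calls passed in as plain callables (`draft_extract`, `verify_and_fix`, `evaluator_score` and the threshold are hypothetical placeholders for your own cheap model, verifier and scorer):

```python
# Sketch of the multi-pass, self-verification loop. The callables are
# hypothetical stand-ins for your own cheap model, verifier and evaluator.
from typing import Callable


def extract_with_verification(
    ocr_text: str,
    draft_extract: Callable[[str], tuple[dict, float]],       # pass 1: returns (draft_json, confidence)
    verify_and_fix: Callable[[str, dict], tuple[dict, list]], # pass 2: returns (fixed_json, deltas)
    evaluator_score: Callable[[dict], float],                 # e.g. a judge-LLM or rule-based score
    tau: float = 0.85,                                        # assumption: tuned confidence threshold
) -> tuple[dict, list, bool]:
    draft, confidence = draft_extract(ocr_text)
    fixed, deltas = verify_and_fix(ocr_text, draft)
    # Route to a human only if confidence is low or verification made the score worse.
    needs_human = confidence < tau or evaluator_score(fixed) < evaluator_score(draft)
    return fixed, deltas, needs_human
```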
5. Should we fine‑tune a model? #
Option | Data you need | Compute | Cost/time | When it pays off |
---|---|---|---|---|
Prompt‑only (zero‑shot / few‑shot) | 0–20 golden PDFs | none | Instant | If documents are fairly uniform. |
LoRA fine‑tune on LayoutLMv3 or Qwen2‑VL | ~500 annotated pages (≈ 5 hrs of labelling with SuperAnnotate/Label Studio) | Single 24 GB GPU or AWS g5.xlarge; LoRA ≈ 4 GB VRAM | 3–4 hrs wall‑clock; <$50 cloud | You see a stubborn 3–5 % error rate that prompt tweaks can’t fix. |
Full fine‑tune / continued pre‑train | 5 k – 50 k pages, maybe synthetic | Multi‑GPU (Transformer Lab now supports this) (transformerlab.ai) | Days; $$$ | Only for at‑scale product or exotic layouts. |
Transformer Lab’s LoRA trainer wizard (GUI) plus its dataset‑generation plugin can walk you through the whole flow and track metrics to W&B. (transformerlab.ai)
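If the LoRA row wins, the training setup itself is small. A hedged sketch with the Hugging Face peft library (rank, alpha, dropout, target modules and label count are assumptions to tune; Qwen2‑VL would need its own target modules):

```python
# Hedged sketch of a LoRA setup with Hugging Face peft on LayoutLMv3.
# Rank, alpha, dropout, target modules and num_labels are assumptions to tune.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForTokenClassification

base = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=13,  # assumption: BIO tags covering ~6 invoice fields
)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in the encoder
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1 % of weights, consistent with the ~4 GB VRAM estimate
```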
6. Practical next steps (1‑week PoC) #
Day | Task |
---|---|
1 | Collect 50 representative PDF invoices/receipts; hand‑label ground‑truth JSON (or reuse existing AP system exports). |
2 | Stand up structured‑evals in a notebook; verify you can score one document. |
3 | Pipe your current OCR → LLM chain into the eval harness; baseline numbers. |
4 | Install Transformer Lab; run Basic Metrics + LLM‑as‑Judge on the same batch; compare. |
5 | Prototype a self‑verification second pass; measure delta in evaluator scores. |
6–7 | Decide: acceptable accuracy met? → move to small pilot. Otherwise, scope a LoRA fine‑tune (label 500 docs, estimate $50 compute). |
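For the Day‑3 baseline, a short aggregation script over the golden set is enough. A hedged sketch (the `golden_set` layout and `run_current_pipeline` are assumptions; `Invoice` and `score_fields` come from the section‑2 sketch):

```python
# Sketch of the Day-3 baseline: score every golden document, report per-field accuracy.
# The golden_set layout and run_current_pipeline() are assumptions; Invoice and
# score_fields() come from the section-2 sketch.
from collections import defaultdict
from pathlib import Path


def baseline(run_current_pipeline, golden_dir: str = "golden_set") -> dict[str, float]:
    per_field: dict[str, list[float]] = defaultdict(list)
    for gt_path in Path(golden_dir).glob("*.gt.json"):
        pdf_path = gt_path.with_name(gt_path.name.replace(".gt.json", ".pdf"))
        pred = run_current_pipeline(pdf_path)  # your existing OCR -> LLM chain, returning an Invoice
        gt = Invoice.model_validate_json(gt_path.read_text())
        for field, score in score_fields(pred, gt).items():
            per_field[field].append(score)
    return {field: sum(scores) / len(scores) for field, scores in per_field.items()}
```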
7. Key take‑aways #
- Start small but instrumented – a lightweight Pydantic eval plus 50 golden docs gives you a living metric you can track in CI.
- Transformer Lab’s plugin ecosystem lets non‑ML engineers run deeper evals (judge‑LLM, robustness, red‑teaming) without writing code.
- Multi‑pass + self‑verification is often cheaper than jumping straight to expensive fine‑tuning.
- Fine‑tune only when evaluation shows a persistent accuracy gap that matters to the business and can’t be solved with prompt/heuristic fixes.
With a one‑week investment you’ll have a reproducible scoreboard, the ability to A/B any new model/provider, and clear data to justify (or skip) further fine‑tuning.