ChatGPT o3 report on eval and training



# Report: Building an Evaluation Pipeline for PDF → Structured-Data Extraction (Invoices & Expenses)


## 1. Why dedicated evaluation matters


## 2. Off-the-shelf evaluation frameworks you can stand up quickly

| Framework (budget-friendly) | What it gives you out-of-the-box | How to apply it to invoice extraction | Typical effort |
|---|---|---|---|
| structured-evals (Pydantic Evals) (PyPI) | Evaluate JSON-mode outputs against a Pydantic schema; plug in field-specific metrics. | 1. Define an Invoice(BaseModel) with the exact fields you want. 2. Register metrics: exact match, numeric tolerance, date parsing. 3. Feed predicted JSON + ground truth to the evaluator ⇒ instant per-field scores (sketched below). | ½ day to wire up; works in a notebook/CI runner. |
| Transformer Lab – Basic Metrics plugin (transformerlab.ai) | Point-and-click checks: “Is valid JSON?”, “Contains dates?”, etc. | Upload a CSV of (input_pdf_path, model_output); add rules like Is Valid JSON and regexes for money/date formats. | Minutes; no code. |
| Transformer Lab – LLM-as-Judge plugin (DeepEval) (transformerlab.ai) | Uses a judge LLM to grade faithfulness, hallucination, bias. | Prompt the judge with: “Given the source OCR text chunk and the extracted JSON, score each field for correctness.” Automates fuzzy checks when “Total” appears twice, formatting varies, etc. | 1–2 hrs to craft the prompt & schema. |
| Transformer Lab – EleutherAI Harness plugin (transformerlab.ai) | Runs the classic LM Harness battery of tasks. | Not invoice-specific, but a useful sanity check if you plan to fine-tune: confirm you didn’t degrade the model’s general ability to read numbers or dates. | Drop-in. |
| Transformer Lab – Objective Metrics & Red-Teaming plugins (transformerlab.ai) | Numeric deltas, robustness, PII leakage, prompt-injection stress tests. | Upload adversarial docs (e.g., totals hidden in a watermark, vendor address inside a table); check that the pipeline neither misses data nor leaks sensitive info. | A few sample docs + plugin config. |
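To show the shape of the first row in code, here is a minimal per-document scoring sketch using plain Pydantic v2. The field names and the score_invoice helper are illustrative and are not the structured-evals / Pydantic Evals API; swap in the framework's own evaluator once it is wired up.

```python
# Minimal per-document scoring sketch using plain Pydantic v2.
# The Invoice fields and score_invoice() are illustrative, not the
# structured-evals API.
from datetime import date
from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float


class Invoice(BaseModel):
    vendor: str
    invoice_date: date
    total: float
    line_items: list[LineItem]


def score_invoice(predicted_json: str, truth: Invoice) -> dict[str, float]:
    """Return a 1.0/0.0 score per field; schema violations score everything 0."""
    try:
        pred = Invoice.model_validate_json(predicted_json)
    except ValidationError:
        return {name: 0.0 for name in Invoice.model_fields}
    return {
        "vendor": float(pred.vendor.strip().lower() == truth.vendor.strip().lower()),
        "invoice_date": float(pred.invoice_date == truth.invoice_date),
        "total": float(abs(pred.total - truth.total) <= 0.01),
        # Order-sensitive here; see the unordered line-item F1 in section 3.
        "line_items": float(pred.line_items == truth.line_items),
    }
```

Averaging these per-field scores across a golden set gives the dashboard-level mean(field_score) number used in section 3.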

Tip: Run Pydantic Evals in your CI for every model/prompt change; run the heavier “LLM-as-Judge” nightly on a sample batch.
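To make the CI gate concrete, here is a hedged pytest-style sketch that fails the build when the mean field score regresses by more than the 0.5 pt budget suggested in section 3. The file paths and JSON keys are placeholders for whatever your eval run writes out, not part of any framework's output format.

```python
# CI regression gate (pytest): compare the latest eval run against a
# checked-in baseline. Paths and the "mean_field_score" key are placeholders.
import json
from pathlib import Path

REGRESSION_BUDGET = 0.5  # percentage points


def test_no_field_score_regression():
    baseline = json.loads(Path("eval/baseline.json").read_text())
    latest = json.loads(Path("eval/latest_run.json").read_text())
    drop = baseline["mean_field_score"] - latest["mean_field_score"]
    assert drop <= REGRESSION_BUDGET, f"Mean field score regressed by {drop:.2f} pts"
```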


## 3. Choosing metrics that correlate with business risk

| Field type | Recommended metric(s) | Threshold suggestion |
|---|---|---|
| Currency / quantities | abs_diff ≤ $0.01 OR exact string | ≥ 99 % per field |
| Vendor / GL coding | exact_match (case-insensitive) | ≥ 98 % |
| Dates | parsed_date == parsed_gt_date | ≥ 99.5 % |
| Line-items array | F1 on each line (IDs unordered) | ≥ 97 % |
| Full JSON | mean(field_score) for dashboard trend | track regression ≤ 0.5 pt |
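A sketch of what these per-field checks can look like in code. The tolerances mirror the table; the unordered line-item matching rule is one reasonable choice, not a standard.

```python
# Field-level metric sketch mirroring the table above; the exact matching
# rules (especially for line items) are illustrative.
from datetime import datetime


def currency_match(pred: float, truth: float, tol: float = 0.01) -> bool:
    """Currency / quantity fields: pass if within one cent."""
    return abs(pred - truth) <= tol


def vendor_match(pred: str, truth: str) -> bool:
    """Vendor / GL coding: case-insensitive exact match."""
    return pred.strip().lower() == truth.strip().lower()


def date_match(pred: str, truth: str, fmt: str = "%Y-%m-%d") -> bool:
    """Dates: compare parsed values rather than raw strings."""
    return datetime.strptime(pred, fmt).date() == datetime.strptime(truth, fmt).date()


def line_item_f1(pred_items: list[dict], truth_items: list[dict]) -> float:
    """Unordered line-item F1: a predicted line counts if an identical line exists."""
    if not pred_items and not truth_items:
        return 1.0
    matched = sum(1 for item in pred_items if item in truth_items)
    precision = matched / len(pred_items) if pred_items else 0.0
    recall = matched / len(truth_items) if truth_items else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```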

## 4. Novel extraction strategies worth prototyping

  1. Multi-pass, self-verification loop (see the sketch after this list)

    • Pass 1: a cheap model produces draft JSON + per-token confidence.

    • Pass 2: a stronger LLM (e.g., GPT-4o or Gemini Flash) receives the source text + draft + schema → returns “fixed” JSON and a list of deltas.

    • Flag for human review only if confidence < τ or the evaluator score drops.

  2. Layout-aware hybrid

    • Combine classical OCR for bounding boxes with a VLM for semantics (LayoutLMv3 shows strong gains on invoices). (arXiv)

  3. Benchmark-guided provider routing

    • Use the OmniAI benchmark’s JSON-accuracy table to choose the cheapest provider that still meets > 90 % field accuracy for your doc type; fall back to a premium model when the Basic Eval fails. (OmniAI)

  4. Synthetic data augmentation

    • Use an “Infinite PDF Generator” or similar to create vendor/format variants; this lets you reach hundreds of labelled pages without costly annotation. (Same pipeline OmniAI used.) (OmniAI)
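A minimal sketch of the multi-pass loop from item 1, assuming you wrap your real provider calls in two functions supplied by the caller. The escalation rule and the τ value are illustrative, not tuned, and nothing here is tied to a specific SDK.

```python
# Two-pass extract-then-verify sketch. draft_fn / verify_fn stand in for your
# real provider calls (cheap drafting model, stronger verifier model).
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.90  # tau: below this, always escalate to a human


@dataclass
class Draft:
    data: dict          # draft structured output from pass 1
    confidence: float   # e.g., the minimum per-field confidence


def extract_with_verification(ocr_text: str, schema_fields: list[str],
                              draft_fn, verify_fn) -> tuple[dict, bool]:
    """Return (final_json, needs_human_review)."""
    # Pass 1: cheap model drafts JSON plus a confidence signal.
    draft: Draft = draft_fn(ocr_text, schema_fields)

    # Pass 2: stronger model sees source + draft + schema and returns fixes.
    fixed: dict = verify_fn(ocr_text, draft.data, schema_fields)

    # Which fields did the verifier change?
    deltas = [f for f in schema_fields if draft.data.get(f) != fixed.get(f)]

    # Escalation rule is illustrative: shaky pass-1 confidence, or the verifier
    # touched the money field. Swap in your evaluator score if you track one.
    needs_review = draft.confidence < CONFIDENCE_THRESHOLD or "total" in deltas
    return fixed, needs_review
```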

## 5. Should we fine-tune a model?

| Option | Data you need | Compute | Cost/time | When it pays off |
|---|---|---|---|---|
| Prompt-only (zero-shot / few-shot) | 0–20 golden PDFs | None | Instant | If documents are fairly uniform. |
| LoRA fine-tune on LayoutLMv3 or Qwen2-VL | ~500 annotated pages (≈ 5 hrs of labelling with SuperAnnotate/Label Studio) | Single 24 GB GPU or AWS g5.xlarge; LoRA ≈ 4 GB VRAM | 3–4 hrs wall-clock; < $50 cloud | You see a stubborn 3–5 % error rate that prompt tweaks can’t fix. |
| Full fine-tune / continued pre-train | 5 k–50 k pages, maybe synthetic | Multi-GPU (Transformer Lab now supports this) (transformerlab.ai) | Days; $$$ | Only for an at-scale product or exotic layouts. |

Transformer Lab’s LoRA trainer wizard (GUI) plus its dataset‑generation plugin can walk you through the whole flow and track metrics to W&B. (transformerlab.ai)
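If the LoRA route wins, the code path outside the GUI is also short. A hedged sketch using Hugging Face peft with a LayoutLMv3 token-classification head: the rank, alpha, label count, and target_modules values are illustrative starting points, and module names differ for other backbones such as Qwen2-VL.

```python
# Hedged LoRA setup sketch with peft; hyperparameters are starting points,
# not tuned values, and target_modules names depend on the backbone.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForTokenClassification

base = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=9,  # e.g., BIO tags for the invoice fields you extract (illustrative)
)

lora_cfg = LoraConfig(
    r=16,                               # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in LayoutLMv3
    task_type="TOKEN_CLS",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically a small fraction of the base weights
```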


## 6. Practical next steps (1-week PoC)

| Day | Task |
|---|---|
| 1 | Collect 50 representative PDF invoices/receipts; hand-label ground-truth JSON (or reuse existing AP system exports). |
| 2 | Stand up structured-evals in a notebook; verify you can score one document. |
| 3 | Pipe your current OCR → LLM chain into the eval harness; record baseline numbers. |
| 4 | Install Transformer Lab; run Basic Metrics + LLM-as-Judge on the same batch; compare. |
| 5 | Prototype a self-verification second pass; measure the delta in evaluator scores. |
| 6–7 | Decide: acceptable accuracy met? → move to a small pilot. Otherwise, scope a LoRA fine-tune (label 500 docs, estimate $50 of compute). |

## 7. Key take-aways

  1. Start small but instrumented – a lightweight Pydantic eval plus 50 golden docs gives you a living metric you can track in CI.

  2. Transformer Lab’s plugin ecosystem lets non‑ML engineers run deeper evals (judge‑LLM, robustness, red‑teaming) without writing code.

  3. Multi‑pass + self‑verification is often cheaper than jumping straight to expensive fine‑tuning.

  4. Fine‑tune only when evaluation shows a persistent accuracy gap that matters to the business and can’t be solved with prompt/heuristic fixes.


With a one‑week investment you’ll have a reproducible scoreboard, the ability to A/B any new model/provider, and clear data to justify (or skip) further fine‑tuning.