Deepseek report on eval and extraction · luftyluft's blog
Evaluation Framework for Accurate OCR/VLM Data Extraction in Financial Documents
Prepared for Accounting Team Proof-of-Concept (POC)

1. Introduction #

This report outlines evaluation techniques and novel strategies for ensuring high accuracy in structured data extraction from financial documents (invoices/expenses) using OCR/Vision-Language Models (VLMs). Given the critical nature of financial data, we prioritize methods that are accurate, auditable, and cost-effective.

We explore Pydantic Evals, Mozilla's Transformer Lab, and novel multi-pass refinement techniques, and assess whether fine-tuning a model is viable for this POC.


2. Evaluation Techniques for Structured Data Extraction #

Based on Transformer Lab’s Evaluation Methods, we adapt the following for financial document processing:

(A) Exact Match (EM) & Fuzzy Matching #
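A minimal sketch of how these two checks could be scored against ground truth, using only the standard library (the 0.9 similarity threshold is an illustrative assumption, not a recommendation):

```python
from difflib import SequenceMatcher

def exact_match(predicted: str, expected: str) -> bool:
    # Strict equality after normalizing whitespace and case.
    return predicted.strip().lower() == expected.strip().lower()

def fuzzy_match(predicted: str, expected: str, threshold: float = 0.9) -> bool:
    # SequenceMatcher ratio is in [0, 1]; 0.9 is an illustrative threshold
    # that tolerates single-character OCR confusions on short fields.
    a, b = predicted.strip().lower(), expected.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

# A common I/l OCR confusion still matches fuzzily but fails exact match:
assert fuzzy_match("lnvoice #1042", "Invoice #1042")
assert not exact_match("lnvoice #1042", "Invoice #1042")
```

In practice, exact match suits IDs and dates, while fuzzy match suits free-text fields like vendor names.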

(B) Confidence Score Thresholding #
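As a sketch of the thresholding idea, assuming the extraction step returns a per-field confidence (the 0.85 cutoff and the field/result structure are assumptions for illustration):

```python
# Route fields below a confidence threshold to manual review instead of
# accepting them automatically. 0.85 is an illustrative starting point.
ACCEPT_THRESHOLD = 0.85

def triage(extraction: dict) -> tuple[dict, list]:
    """Split fields into auto-accepted values and field names needing review."""
    accepted, needs_review = {}, []
    for field, result in extraction.items():
        if result["confidence"] >= ACCEPT_THRESHOLD:
            accepted[field] = result["value"]
        else:
            needs_review.append(field)
    return accepted, needs_review

accepted, flagged = triage({
    "total":  {"value": "1204.50",   "confidence": 0.97},
    "vendor": {"value": "ACME Crop", "confidence": 0.62},  # likely OCR error
})
```

Tuning the threshold against a labeled sample (see the HITL section) is what makes this defensible for accounting use.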

(C) Multi-Model Consensus Voting #
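A possible shape for consensus voting, where the same field is extracted by several models and only a majority value is auto-accepted (the two-vote minimum is an assumed default):

```python
from collections import Counter

def consensus(field_values, min_votes=2):
    # Majority vote across model outputs; normalize whitespace first so
    # trivial formatting differences still agree.
    normalized = [v.strip() for v in field_values]
    value, votes = Counter(normalized).most_common(1)[0]
    return value if votes >= min_votes else None  # None => escalate to review

# Three models extract the invoice total; two agree, so their value wins.
assert consensus(["1204.50", "1204.50", "1,204.50"]) == "1204.50"
assert consensus(["a", "b", "c"]) is None
```

Returning `None` on disagreement dovetails with the HITL sampling below: disputed fields become the review queue.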

(D) Template-Based Validation #
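One way to express template validation is a Pydantic schema that every extraction must pass before entering the accounting system (the field names here are illustrative; the actual invoice schema is a POC decision):

```python
# Sketch of template-based validation with Pydantic v2. Field names and
# constraints are assumptions to be replaced by the POC's real schema.
from datetime import date
from decimal import Decimal
from pydantic import BaseModel, Field, ValidationError

class InvoiceLineItem(BaseModel):
    description: str
    amount: Decimal = Field(ge=0)  # negative line amounts are rejected

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor: str
    line_items: list[InvoiceLineItem]
    total: Decimal = Field(ge=0)

raw = {
    "invoice_number": "INV-1042",
    "invoice_date": "2024-03-15",
    "vendor": "ACME Corp",
    "line_items": [{"description": "Widgets", "amount": "1204.50"}],
    "total": "1204.50",
}
try:
    invoice = Invoice.model_validate(raw)
except ValidationError as exc:
    print(exc)  # Flag the document for review instead of silently accepting it.
```

A `ValidationError` here is an evaluation signal in its own right: schema-failure rate per model is a cheap, fully automatic metric.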

(E) Human-in-the-Loop (HITL) Sampling #
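The sampling side of HITL can be as simple as auditing a fixed, reproducible fraction of auto-accepted documents (the 5% rate and fixed seed are illustrative choices):

```python
import random

def sample_for_audit(doc_ids, rate=0.05, seed=0):
    # Audit ~5% of auto-accepted documents; a fixed seed keeps the sample
    # reproducible so auditors and engineers see the same set.
    rng = random.Random(seed)
    k = max(1, round(len(doc_ids) * rate))
    return rng.sample(doc_ids, k)
```

Audit findings then feed back into the confidence threshold and fuzzy-match settings above.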


3. Novel Strategies for Improving Extraction Accuracy #

(A) Multi-Pass Refinement Workflow #

  1. First Pass: Fast extraction (Gemini 2.0 Flash for raw text).
  2. Second Pass: Structured parsing (Mistral OCR + JSON schema).
  3. Third Pass: Error correction (LayoutLMv3 for layout-aware validation).

Example: If the first pass misses a line item, the second pass cross-references bounding boxes, and the third pass validates totals.
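The control flow of the three passes can be sketched as follows; the pass functions below are stand-ins for the real model calls (Gemini 2.0 Flash, Mistral OCR, LayoutLMv3), and only the escalation logic is the point:

```python
def refine(document, passes, validate):
    # Run passes in order, stopping at the first result that validates;
    # later (slower, costlier) passes only run when earlier ones fail.
    result = {}
    for run_pass in passes:
        result = run_pass(document, result)
        if validate(result):
            break
    return result

def totals_agree(result):
    # Cheap structural check: line items must sum to the stated total.
    items = result.get("line_items", [])
    return bool(items) and sum(items) == result.get("total")

# Simulated passes: the fast pass misses a line item, the structured pass
# recovers it, so the third pass never needs to run.
fast       = lambda doc, prev: {"line_items": [100.0], "total": 150.0}
structured = lambda doc, prev: {"line_items": [100.0, 50.0], "total": 150.0}
layout_fix = lambda doc, prev: prev  # would reconcile via bounding boxes

result = refine(b"...", [fast, structured, layout_fix], totals_agree)
```

Gating each escalation on a validator keeps the average cost close to the fast pass alone.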

(B) Hybrid OCR/VLM Pipelines #
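One plausible hybrid arrangement, sketched with placeholder helpers (all three functions stand in for real integrations and are assumptions): text-native PDFs go to cheap OCR, scans and complex layouts to a VLM.

```python
# Route clean, text-native documents to plain OCR and scanned/complex
# layouts to a VLM. All helpers are placeholders, not real integrations.
def is_text_native(doc):
    return doc.get("has_text_layer", False)

def ocr_extract(doc):
    return "ocr:" + doc["name"]   # cheap, deterministic path

def vlm_extract(doc):
    return "vlm:" + doc["name"]   # layout-aware, costlier path

def extract(doc):
    return ocr_extract(doc) if is_text_native(doc) else vlm_extract(doc)
```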

(C) Low-Confidence Auto-Correction #
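A sketch of the auto-correction loop, assuming per-field confidences and a second, stronger extractor to retry with (`retry_extract` is a placeholder):

```python
def auto_correct(fields, retry_extract, threshold=0.85):
    # Re-run only the fields below the confidence threshold through a
    # second extractor; high-confidence fields are left untouched.
    corrected = dict(fields)
    for name, (value, conf) in fields.items():
        if conf < threshold:
            corrected[name] = retry_extract(name)
    return corrected

fields = {"total": ("1204.50", 0.97), "vendor": ("ACME Crop", 0.60)}
retry  = lambda name: ("ACME Corp", 0.95)  # stand-in for a slower model call
fixed  = auto_correct(fields, retry)
```

Retrying only low-confidence fields, rather than whole documents, keeps the second model's cost proportional to the error rate.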


4. Should We Fine-Tune a Model? #

Pros: #

Cons: #

Recommendation: #


5. Tooling Recommendations #

Tool            | Use Case                             | Cost
Pydantic Evals  | Schema validation + fuzzy matching   | Free (OSS)
Transformer Lab | Model evaluation + fine-tuning       | Free (OSS)
Label Studio    | Human-in-the-loop auditing           | Free (OSS)
Pulse API       | Auto-template learning (if scalable) | Paid
LayoutLMv3      | Layout-aware extraction              | Free (Microsoft)

6. Next Steps for POC #

  1. Implement Pydantic Evals for schema validation.
  2. Benchmark Gemini vs. Mistral OCR on 50 sample invoices.
  3. Set up multi-pass refinement for low-confidence extractions.
  4. Evaluate fine-tuning need after initial testing.
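For step 2's benchmark, a per-field accuracy tally over the 50 sample invoices might look like this (the dict-per-invoice structure is an assumption):

```python
def field_accuracy(predictions, ground_truth):
    # predictions / ground_truth: one dict of field -> value per invoice.
    # Returns accuracy per field name across the whole sample set.
    counts = {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            correct, total = counts.get(field, (0, 0))
            counts[field] = (correct + (pred.get(field) == expected), total + 1)
    return {f: c / t for f, (c, t) in counts.items()}
```

Reporting accuracy per field (rather than one overall number) shows which fields each of Gemini and Mistral OCR struggles with.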

Budget-Friendly Priority: Start with multi-model consensus + HITL sampling before investing in fine-tuning.


Appendix: Relevant Tools from Hacker News
