Okay, let's break down how to approach evaluating the accuracy of your invoice and expense data extraction system. This report will cover evaluation techniques (referencing Transformer Lab), relevant frameworks (Pydantic Evals), novel extraction strategies, considerations for model training, and a brief review of the mentioned tools.
Report: Evaluating Structured Data Extraction Accuracy for Invoice/Expense Processing
Executive Summary
This report outlines strategies for evaluating the accuracy of structured data extraction from PDF invoices and expense documents using OCR/VLM tools, tailored for an accounting team operating under a Proof-of-Concept (PoC) budget. Achieving high accuracy is paramount for financial data. We recommend establishing a "gold standard" dataset and employing a suite of evaluation techniques, focusing on field-level accuracy. Frameworks like Pydantic Evals and libraries inspired by Transformer Lab can streamline this process. Novel multi-pass refinement strategies can improve robustness. While model fine-tuning is possible, it involves significant effort and should likely be deferred post-PoC. Leveraging existing specialized tools alongside robust evaluation of your current pipeline is the recommended path forward.
1. Introduction
Your team is developing a tool to automate data extraction from invoices and expense reports using a combination of OCR and Large Language Models (LLMs) / Vision Language Models (VLMs) like Mistral OCR, OpenAI, and Gemini Pro. Initial results are promising, but the financial nature of the data necessitates rigorous accuracy validation. This report provides guidance on implementing evaluation tools and techniques within a PoC budget, leveraging modern frameworks like Pydantic Evals and concepts from Transformer Lab, to ensure the reliability of the extracted structured data (e.g., JSON conforming to a schema).
2. The Core Challenge: Defining and Measuring Accuracy for Structured Data
Evaluating LLM/VLM output for structured data extraction differs from evaluating free-form text generation. Accuracy isn't just about fluency; it's about correctness at the field level. Key challenges include:
- Layout Variability: Invoices/expenses lack a universal standard format.
- Data Types: Extracting different types (dates, currency amounts, vendor names, line items) requires different validation.
- Ambiguity: Information may be missing, poorly scanned, or presented unclearly.
- Completeness: Were all required fields extracted? Were all line items captured?
"Accuracy" in this context means:
- Correctness: Is the extracted value for invoice_total identical to the value in the PDF?
- Completeness: Were all required fields (e.g., vendor_name, invoice_date, total_amount) found?
- Formatting: Are dates, numbers, and currencies in the expected format?
- Structure: Does the output strictly adhere to the predefined JSON schema? (An illustrative target record follows this list.)
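For concreteness, a target record might look like the following. This example is purely illustrative: the field names, nesting, and formats are assumptions to replace with your team's actual schema.

```json
{
  "vendor_name": "ABC Incorporated",
  "invoice_number": "INV-2024-0042",
  "invoice_date": "2024-03-15",
  "currency": "USD",
  "subtotal": 1000.00,
  "tax_amount": 80.00,
  "total_amount": 1080.00,
  "line_items": [
    {"description": "Consulting services", "quantity": 10, "unit_price": 100.00, "amount": 1000.00}
  ]
}
```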
3. Evaluation Frameworks and Techniques
A systematic evaluation process is crucial. The foundation is a "Gold Standard" Dataset:
- Create: Manually (and meticulously) extract and verify data from a representative sample of your typical invoices and expense documents (different vendors, formats, scan qualities). Store this ground truth data in the target JSON structure.
- Size: For a PoC, even 25-50 diverse documents can provide initial insights, though more is always better.
- Purpose: This dataset serves as the benchmark against which your model's extractions are compared.
Techniques (Inspired by Transformer Lab's list & applied to structured data):
Here’s how techniques, like those listed in the Transformer Lab documentation, can be applied to evaluate your structured JSON output against the gold standard:
- a. Exact Match:
- Concept: Checks if the extracted value for a specific field is identical to the ground truth value.
- Application: Ideal for fields like Invoice Numbers, PO Numbers, Dates (after normalization), Currency Symbols, Quantities, and Total Amounts.
- Implementation: Simple string comparison: extracted_value == ground_truth_value. (A combined sketch of the exact, fuzzy, and regex checks follows this list.)
- b. Fuzzy Match (e.g., Levenshtein Distance, Jaro-Winkler):
- Concept: Measures the similarity between two strings, allowing for minor differences (typos, abbreviations). Calculates an edit distance or similarity score.
- Application: Useful for Vendor Names (e.g., "ABC Inc." vs "ABC Incorporated"), Addresses, or Line Item Descriptions where minor variations might be acceptable depending on requirements.
- Implementation: Use libraries like fuzzywuzzy or rapidfuzz in Python. Define an acceptable similarity threshold (e.g., > 0.9).
- c. Semantic Similarity (Using Embeddings):
- Concept: Uses language model embeddings to determine if two pieces of text have similar meanings, even if worded differently.
- Application: Less critical for exact data extraction accuracy (like amounts), but could be relevant if you were trying to categorize line items or match vendor names against a known database where names might be significantly different but refer to the same entity. Probably overkill for core PoC evaluation of financial fields.
- Implementation: Requires embedding models (e.g., from sentence-transformers) and calculating cosine similarity between embeddings.
- d. Regular Expression (Regex) Matching:
- Concept: Uses patterns to validate the format of extracted data, even if not checking against ground truth content directly.
- Application: Crucial for ensuring Dates (YYYY-MM-DD, DD/MM/YYYY), Amounts (^\$?[\d,]+(\.\d{2})$), Phone Numbers, Tax IDs, etc., are extracted in the correct structure. Can also be used to extract data if the format is highly consistent (though VLMs aim to replace this).
- Implementation: Use Python's re module.
- e. JSON Schema Validation:
- Concept: Ensures the overall structure of the output JSON conforms to your predefined schema (correct field names, data types - string, number, boolean, array - presence of required fields).
- Application: Foundational check. Ensures the VLM is producing structurally valid output that your downstream systems can parse.
- Implementation: Libraries like jsonschema in Python. Pydantic models inherently perform this validation if you parse the output into them.
- f. Manual Review / Human-in-the-Loop:
- Concept: Human experts review the model's extractions, comparing them to the source PDF, especially for flagged errors or low-confidence results.
- Application: Essential for creating the gold standard, understanding nuanced errors, and providing qualitative feedback. Can be integrated into the workflow for critical documents or exceptions.
- Implementation: Requires a simple UI or process for reviewers.
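The sketch below shows roughly how the automated checks above (Exact Match, Fuzzy Match, Regex format validation) could be implemented as plain Python functions. It assumes rapidfuzz is installed; the thresholds and patterns are illustrative defaults, not tuned recommendations.

```python
# Minimal sketches of the automated checks above; thresholds and patterns are
# illustrative assumptions, not recommendations.
import re

from rapidfuzz import fuzz  # pip install rapidfuzz


def exact_match(extracted, expected) -> bool:
    # a. Exact Match: the extracted value must equal the ground truth exactly.
    return extracted == expected


def fuzzy_score(extracted: str, expected: str) -> float:
    # b. Fuzzy Match: similarity score from 0 to 100 (a 0.9 threshold corresponds to 90).
    return fuzz.token_sort_ratio(extracted, expected)


# d. Regex format checks (content-independent).
AMOUNT_PATTERN = re.compile(r"^\$?[\d,]+(\.\d{2})$")   # same pattern as above
ISO_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # YYYY-MM-DD


def has_valid_format(value: str, pattern: re.Pattern) -> bool:
    return bool(pattern.fullmatch(value))


if __name__ == "__main__":
    print(exact_match("INV-2024-0042", "INV-2024-0042"))    # True
    print(fuzzy_score("ABC Inc.", "ABC Incorporated"))      # prints a similarity score
    print(has_valid_format("$1,080.00", AMOUNT_PATTERN))    # True
    print(has_valid_format("2024-03-15", ISO_DATE_PATTERN)) # True
```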
Leveraging Pydantic Evals and Transformer Lab Concepts:
- Pydantic Evals: This framework is particularly well-suited because your target output is structured JSON.
- Define your desired invoice/expense structure as a Pydantic model. This is your schema.
- Use Pydantic Evals to:
- Automatically validate if the LLM output parses correctly into your Pydantic model (covers JSON Schema Validation).
- Implement custom evaluation functions (metrics) that compare fields in the parsed Pydantic object against your gold standard Pydantic object (easily implementing Exact Match, Fuzzy Match).
- Calculate aggregate scores (e.g., field-level accuracy across the dataset).
- Its tight integration with Pydantic makes validation and comparison intuitive.
- Transformer Lab: While perhaps more research-oriented, the principles are valuable. It emphasizes modular evaluation components. You can build your evaluation suite using similar ideas: create reusable Python functions for each metric (Exact Match, Fuzzy Match, Regex Format Check) and combine them into a comprehensive evaluation script that runs against your gold standard dataset. Libraries used within Transformer Lab or inspired by it can provide implementations for specific metrics.
Recommendation: Start with Pydantic Evals due to its direct applicability to structured data validation and Pydantic models. Use it to implement Exact Match, Fuzzy Match (via custom functions), and Schema Validation.
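As a minimal sketch of what that looks like in practice, the snippet below defines a simplified (assumed) invoice schema as a Pydantic model, validates a model output against it, and computes per-field correctness against a gold record. The same comparison logic could be wrapped into custom Pydantic Evals evaluators and aggregated across your dataset.

```python
# A minimal sketch, assuming a simplified invoice schema; adapt the model and the
# per-field rules to your real schema. Uses Pydantic v2 and rapidfuzz.
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, ValidationError
from rapidfuzz import fuzz


class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: date
    total_amount: Decimal


def evaluate_extraction(raw_json: str, gold: Invoice) -> dict[str, bool]:
    """Parse the VLM output and compare each field against the gold standard."""
    try:
        extracted = Invoice.model_validate_json(raw_json)  # schema validation (technique e)
    except ValidationError:
        # Structurally invalid output counts as a miss on every field.
        return {name: False for name in Invoice.model_fields}

    return {
        "vendor_name": fuzz.token_sort_ratio(extracted.vendor_name, gold.vendor_name) >= 90,  # fuzzy
        "invoice_number": extracted.invoice_number == gold.invoice_number,  # exact match
        "invoice_date": extracted.invoice_date == gold.invoice_date,
        "total_amount": extracted.total_amount == gold.total_amount,
    }


def field_accuracy(per_doc_results: list[dict[str, bool]]) -> dict[str, float]:
    """Aggregate field-level accuracy across the gold standard set."""
    fields = per_doc_results[0].keys()
    return {f: sum(r[f] for r in per_doc_results) / len(per_doc_results) for f in fields}
```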
4. Novel Data Extraction Strategies (Iterative Refinement)
Instead of a single extraction pass, consider multi-step approaches to improve accuracy and handle uncertainty:
- Confidence Scoring: Modify prompts to ask the VLM to provide a confidence score (e.g., 1-5 or 0-1) for each extracted field. Low-confidence fields (< threshold) can be flagged for review or re-processing. Caveat: LLMs can be poorly calibrated and express high confidence even when wrong.
- Multi-Pass Refinement:
- Initial Extraction: Perform the standard extraction into the JSON schema.
- Validation & Cross-Checking: Programmatically check for inconsistencies (e.g., sum of line item totals doesn't match invoice total; tax amount doesn't match tax rate * subtotal). Also, check for low confidence scores.
- Targeted Re-Extraction: For inconsistent or low-confidence fields, send a new, targeted prompt back to the VLM. Instead of "Extract data," ask "Based on the document, what is the value for the 'Invoice Total'? Look near the bottom right." You might even include bounding box information from the OCR/VLM if available.
- Self-Correction / Reflection: Prompt the LLM to review its own extracted JSON output against the raw OCR text (or even the image again) and identify potential errors or inconsistencies based on common sense rules (e.g., "Does the total amount seem reasonable given the line items?").
- Hybrid Approach: Use the VLM for complex fields (Vendor Name, Line Items) but potentially use rule-based methods (Regex) for highly consistent fields like Dates or Invoice Numbers if they follow predictable patterns across many documents, using the VLM as a fallback.
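A minimal sketch of the validation and cross-checking step follows. It assumes the extraction is available as a dict with hypothetical field names (subtotal, tax_amount, total_amount, line_items) and that the prompt also returned a per-field confidence between 0 and 1; the 0.7 threshold is an arbitrary starting point.

```python
# Sketch of the validation & cross-checking pass. Field names, the confidence
# format, and the 0.7 threshold are assumptions to adapt.
from decimal import Decimal


def fields_to_review(invoice: dict, confidence: dict[str, float],
                     min_confidence: float = 0.7) -> list[str]:
    """Return the fields to flag for manual review or targeted re-extraction."""
    flagged = []

    # Arithmetic consistency: line item amounts should sum to the subtotal.
    line_sum = sum(Decimal(str(item["amount"])) for item in invoice.get("line_items", []))
    if line_sum != Decimal(str(invoice["subtotal"])):
        flagged.append("subtotal")

    # Totals consistency: subtotal + tax should equal the invoice total.
    expected_total = Decimal(str(invoice["subtotal"])) + Decimal(str(invoice["tax_amount"]))
    if expected_total != Decimal(str(invoice["total_amount"])):
        flagged.append("total_amount")

    # Confidence check: flag anything the model itself was unsure about.
    flagged.extend(field for field, score in confidence.items() if score < min_confidence)

    return sorted(set(flagged))
```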
5. Model Training and Fine-tuning
- Is it feasible for PoC? Generally no, not within a tight PoC budget unless you have existing expertise and infrastructure.
- Effort Involved:
- Data Curation: Requires a significantly larger dataset (hundreds to thousands of examples) than needed for evaluation. Each example needs the source document and the meticulously verified structured output (the "gold standard" data).
- Annotation: This is the most time-consuming part – accurately labeling or structuring the data for every document in the training set.
- Technical Expertise: Requires knowledge of fine-tuning frameworks (e.g., Hugging Face transformers, trl), hyperparameter tuning, and evaluation during training.
- Compute Resources: Fine-tuning requires GPU resources, which incur costs.
- Deployment: Making the fine-tuned model available requires hosting and integration.
- When is it useful?
- When general-purpose models consistently fail on specific, recurring patterns unique to your document set.
- To improve performance on very specific, non-standard layouts.
- To potentially reduce inference costs or latency compared to very large general models (though this isn't guaranteed).
- Recommendation: Focus first on prompt engineering, selecting the best base model (OpenAI, Gemini, Mistral, etc.), and implementing robust evaluation and refinement strategies. Fine-tuning is a more advanced optimization step, typically considered after exhausting other options or when scaling up.
6. Review of Mentioned Tools/Services
- OmniOCR / OmniAI: Focuses on benchmarking VLM vs. traditional OCR, particularly with structured outputs as an evaluation method. Their findings might help you choose a base VLM or understand its limitations compared to older OCR tech, especially on complex/messy documents.
- LayoutLM (v1-v3): Foundational models from Microsoft specifically designed to understand document layout by combining text and visual information. While you might not use LayoutLM directly (unless fine-tuning), the models you are using (like Gemini, potentially OpenAI's models) likely incorporate similar layout-aware principles. Understanding this helps appreciate why VLMs work better than simple text OCR on complex documents.
- olmOCR: An open-source option based on Qwen2-VL, fine-tuned on many PDFs. Its focus on converting PDFs to markdown while preserving structure (tables, etc.) is relevant. Being open-source and potentially cost-effective makes it interesting, but evaluate its accuracy on your specific task (JSON extraction, not just markdown conversion) against your gold standard.
- Pulse API: A commercial, production-focused service explicitly designed for extracting structured JSON from documents, supporting custom schemas. Claims to handle complex layouts and multimodal content well. This represents a potential "buy" solution instead of "build." Could be worth evaluating alongside your in-house approach if budget permits a trial/comparison.
- JigsawStack vOCR: Another commercial service claiming high accuracy, multilingual support, native structured output/retrieval, and context understanding, contrasting itself with Mistral OCR. Like Pulse API, it's a potential alternative to evaluate, especially if your current pipeline struggles. Their comparison points (bounding boxes, native structured output) highlight features that can be important.
7. Recommendations for Your PoC
- Build Gold Standard: Prioritize creating a small (25-50 docs) but diverse and accurately annotated gold standard dataset representing your typical invoices/expenses.
- Define Schema: Formalize your target JSON structure using a Pydantic model.
- Implement Basic Evaluation: Use Pydantic Evals to:
- Validate schema adherence.
- Implement Exact Match checks for critical fields (IDs, Dates, Totals).
- Implement Fuzzy Match checks for names/addresses (using fuzzywuzzy within a custom Pydantic Eval function).
- Calculate field-level accuracy metrics across your gold standard set.
- Iterate on Prompts: Experiment heavily with different prompts for your chosen VLMs (Gemini Pro, OpenAI) to maximize accuracy using your evaluation suite for feedback. Test providing the schema definition in the prompt.
- Explore Confidence & Flags: Modify prompts to request confidence scores. Implement logic to flag low-confidence extractions for manual review. This is a practical first step towards refinement.
- Consider Targeted Re-Extraction: If time permits, experiment with a simple multi-pass approach: identify a key field often missed (e.g., Invoice Date) and write a specific secondary prompt to ask only for that field if it was missing in the first pass (see the sketch after this list).
- Defer Fine-Tuning: Do not invest time in fine-tuning during the PoC phase.
- Evaluate Alternatives (Optional): If your current approach (Mistral OCR + VLM) shows significant gaps after prompt tuning and basic refinement, and budget allows, consider running your gold standard dataset through a specialized API like Pulse API or JigsawStack vOCR for comparison. Focus evaluation on the final structured JSON output.
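Following on the targeted re-extraction recommendation, a secondary prompt can be as simple as the sketch below. The wording and location hints are assumptions to adjust for your documents; the key idea is asking for one missing field rather than re-running the full extraction.

```python
# Sketch of a targeted re-extraction prompt for a single missing field.
# Hints and wording are illustrative assumptions.
FIELD_HINTS = {
    "invoice_date": "It is usually printed near the invoice number at the top of the document.",
    "total_amount": "It usually appears near the bottom right, labelled 'Total' or 'Amount Due'.",
}


def build_reextraction_prompt(field_name: str, document_text: str) -> str:
    hint = FIELD_HINTS.get(field_name, "")
    return (
        f"The field '{field_name}' was missing from a previous extraction. {hint}\n"
        "Using only the document text below, return a JSON object with exactly one key, "
        f"'{field_name}'. If the value is genuinely absent, return null.\n\n"
        f"--- DOCUMENT TEXT ---\n{document_text}"
    )
```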
8. Conclusion
Guaranteeing accuracy for financial data extraction is challenging but achievable through systematic evaluation. By establishing a gold standard dataset, leveraging frameworks like Pydantic Evals for structured comparisons (Exact Match, Fuzzy Match, Schema Validation), and potentially implementing simple refinement loops (confidence scoring, targeted re-extraction), you can gain confidence in your PoC results. Focus on optimizing your current pipeline through prompt engineering and robust evaluation before considering more resource-intensive steps like fine-tuning or switching wholesale to external commercial tools without comparative data. This structured approach will provide clear metrics on accuracy and guide further development efforts effectively within your budget constraints.