CIBMTR Field Extraction Comparison & LLM Training Analysis

Comparing entries extracted on the Final Review page against typical inputs from 20 filled-out forms

  • Fields Extracted (Final Review): 52
  • Fields with Patterns (Filled Forms): 6
  • Conflict Resolutions: 23
  • Total CIBMTR Fields: 157

📊 What This Analysis Shows

1. Final Review Extracted Entries

These are fields extracted from uploaded PDFs on the Final Review page, including:

  • Auto-filled fields: Automatically extracted by models (52 fields)
  • Manual corrections: Fields corrected by users
  • Conflict resolutions: Fields where multiple values existed and were resolved (23 fields)

2. Typical Inputs from Filled Forms

These are answer patterns extracted from 20 professionally filled-out CIBMTR forms, showing:

  • Answer variations: Different ways the same question can be answered
  • Format patterns: Date formats (MM/DD/YYYY, M/D/YYYY), number formats, text variations
  • Value types: What types of answers are expected (dates, numbers, codes, text)
  • 6 fields with extracted patterns: Q1, Q2, Q3, Q5, Q63, Q97 (a sketch of this pattern structure follows this list)
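For illustration, these mined patterns can be represented as a mapping from question ID to observed value types, formats, and example answers. A minimal sketch in Python: the dictionary layout and the Q1 entry are assumptions; only the field list and the Q97 values come from the analysis above.

```python
# Illustrative shape of the pattern data mined from the 20 filled forms.
# Q97's example values appear in the forms; the Q1 entry and the overall
# structure are assumed for this sketch.
TYPICAL_PATTERNS = {
    "Q1":  {"value_type": "date", "formats": ["MM/DD/YYYY", "M/D/YYYY"],
            "examples": []},
    "Q97": {"value_type": "number", "formats": ["integer"],
            "examples": ["315", "459", "460", "474"]},
    # Q2, Q3, Q5, and Q63 would follow the same structure
}
```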

3. How This Refines LLM Extraction

By comparing extracted values with typical patterns (sketched in code after this list), we can:

  • Constrain answer options: Limit the LLM to valid answer formats (e.g., dates must be MM/DD/YYYY)
  • Improve accuracy: Train models to recognize correct patterns (e.g., Q97 accepts numeric values: 315, 459, 460, 474)
  • Reduce errors: Filter out invalid formats automatically
  • Guide extraction: Provide examples of what to look for in prompts
  • Limit answer scope: Instead of asking open-ended questions, provide constrained choices based on observed patterns
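A minimal sketch of that comparison, assuming the pattern structure shown earlier. The three-way result mirrors the legend of the table below (exact, partial, or no match); the substring heuristic used for partial matches is an assumption.

```python
def match_status(extracted: str, observed_examples: list[str]) -> str:
    """Classify an extracted value against answers observed in the filled forms."""
    value = extracted.strip()
    if value in observed_examples:
        return "Exact match"
    # Heuristic: treat substring overlap in either direction as a partial match
    if any(value in ex or ex in value for ex in observed_examples):
        return "Partial match"
    return "No match"

print(match_status("459", ["315", "459", "460", "474"]))  # Exact match
```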

🔍 Field-by-Field Comparison

Legend: Exact match | Partial match | No match

Field ID | Final Review Extracted Value | Typical Input Patterns | Match Status | LLM Constraint

📄 Per-File Entry Comparison

Showing entries extracted from each of the 20 filled-out PDF forms

PDF File | Patient ID | Fields Extracted | Sample Fields | Answer Variations

🤖 LLM Training Recommendations

1. Constrain Answer Formats

Based on typical inputs, LLMs should be constrained to accept only valid formats:
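A minimal sketch of such a constraint table. The Q97 value set comes from the observed patterns; the Q1 date constraint, the dictionary layout, and the function name are assumptions.

```python
import re

# Illustrative constraints: Q97's allowed values are observed above;
# the Q1 date pattern (MM/DD/YYYY or M/D/YYYY) is assumed for this sketch.
FIELD_CONSTRAINTS = {
    "Q1": {"patterns": [r"\d{1,2}/\d{1,2}/\d{4}"]},          # M/D/YYYY or MM/DD/YYYY
    "Q97": {"allowed_values": {"315", "459", "460", "474"}},  # observed numeric codes
}

def matches_constraint(field_id: str, value: str) -> bool:
    """Return True if an extracted value fits the formats observed for the field."""
    spec = FIELD_CONSTRAINTS.get(field_id)
    if spec is None:
        return True  # no pattern observed for this field; accept as-is
    if "allowed_values" in spec:
        return value.strip() in spec["allowed_values"]
    return any(re.fullmatch(p, value.strip()) for p in spec["patterns"])
```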


2. Provide Answer Examples in Prompts

Include example answers in prompts to guide extraction:
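One way to do this is to inject the observed answers as few-shot examples in the extraction prompt. A minimal sketch: the prompt wording and function name are assumptions, and only the Q97 examples come from the filled forms.

```python
# Observed answers per field: Q97's values come from the patterns above;
# the Q1 dates are assumed placeholders in the MM/DD/YYYY and M/D/YYYY formats.
EXAMPLE_ANSWERS = {
    "Q97": ["315", "459", "460", "474"],
    "Q1": ["03/15/2021", "3/5/2021"],
}

def build_prompt(field_id: str, question_text: str, document_text: str) -> str:
    """Build an extraction prompt that shows the model valid answer examples."""
    examples = "\n".join(f"- {e}" for e in EXAMPLE_ANSWERS.get(field_id, []))
    return (
        f"Extract the answer to: {question_text}\n"
        f"Valid answers look like:\n{examples}\n"
        f"Respond with only the answer, in one of the formats shown.\n\n"
        f"Document:\n{document_text}"
    )
```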


3. Validation Rules

Implement validation to reject invalid formats:
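A minimal sketch of a post-extraction validator for date fields, using Python's datetime parsing; the function name and the choice of formats are assumptions based on the MM/DD/YYYY and M/D/YYYY patterns noted above.

```python
from datetime import datetime

def is_valid_date(value: str) -> bool:
    """Accept only dates matching the formats observed in the filled forms."""
    # strptime with %m/%d/%Y accepts both zero-padded (MM/DD/YYYY)
    # and unpadded (M/D/YYYY) month and day values.
    try:
        datetime.strptime(value.strip(), "%m/%d/%Y")
        return True
    except ValueError:
        return False

assert is_valid_date("03/15/2021")      # MM/DD/YYYY
assert is_valid_date("3/5/2021")        # M/D/YYYY
assert not is_valid_date("2021-03-15")  # ISO format rejected
```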


4. Constraining Number of Answers

Instead of open-ended extraction, limit LLM responses to observed patterns:
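A minimal sketch of turning an open-ended question into a constrained choice. Q97's option set is observed in the forms, while the question text, the UNKNOWN fallback, and the function name are assumptions.

```python
# Answer sets observed across the 20 filled forms (Q97 shown; others analogous).
OBSERVED_CHOICES = {"Q97": ["315", "459", "460", "474"]}

def constrained_question(field_id: str, question_text: str) -> str:
    """Rewrite a question so the model must pick from observed answers."""
    choices = OBSERVED_CHOICES.get(field_id)
    if not choices:
        return question_text  # no observed set; fall back to open-ended extraction
    return (
        f"{question_text}\n"
        f"Answer with exactly one of: {', '.join(choices)}. "
        f"If none applies, answer UNKNOWN."
    )
```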