Extracting Invoice Data Using Regex and LLMs
Automate invoice data entry with 98% accuracy using pattern matching and AI
The Invoice Data Entry Problem
Manual invoice entry is one of the most time-consuming bookkeeping tasks. Processing 100 invoices per month can consume 15-20 hours of manual data entry. With regex-powered AI, you can reduce this to under 1 hour while improving accuracy.
Critical Invoice Fields to Extract
Every invoice contains similar data points that regex can reliably identify:
1. Invoice Number Patterns
Common Formats:
- INV-12345 → Pattern: INV-\d{5}
- Invoice #A-2025-001 → Pattern: Invoice #[A-Z]-\d{4}-\d{3}
- #2025110001 → Pattern: #\d{10}
- SI-Nov-2025-123 → Pattern: SI-[A-Za-z]{3}-\d{4}-\d+
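As a minimal sketch, the four formats above can be combined into one compiled pattern in Python. The function name `find_invoice_number` is illustrative only, and real invoices will likely need vendor-specific additions.

```python
import re

# The four formats listed above, joined into a single alternation.
INVOICE_NUMBER = re.compile(
    r"INV-\d{5}"
    r"|Invoice #[A-Z]-\d{4}-\d{3}"
    r"|#\d{10}"
    r"|SI-[A-Za-z]{3}-\d{4}-\d+"
)

def find_invoice_number(text):
    """Return the first invoice-number-like string in text, or None."""
    match = INVOICE_NUMBER.search(text)
    return match.group(0) if match else None
```

Note that `INV-\d{5}` also matches the first five digits of a longer number such as INV-123456. Adding `\b` at the end of each alternative is one way to tighten this if it matters for your data.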
2. Date Extraction
Invoices use various date formats. Regex patterns for each:
| Format | Pattern | Example |
|---|---|---|
| MM/DD/YYYY | \d{2}/\d{2}/\d{4} | 11/15/2025 |
| DD-MM-YYYY | \d{2}-\d{2}-\d{4} | 15-11-2025 |
| Month DD, YYYY | [A-Za-z]+ \d{1,2}, \d{4} | November 15, 2025 |
| YYYY-MM-DD | \d{4}-\d{2}-\d{2} | 2025-11-15 |
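A regex match only shows that a string looks like a date. One way to confirm it is a real calendar date is to pair each pattern with a `strptime` format, as in the sketch below. The `\b` anchors and the ordering of the list are assumptions added here, not part of the table.

```python
import re
from datetime import datetime

# (regex, strptime format) pairs mirroring the table rows above.
DATE_FORMATS = [
    (r"\b\d{4}-\d{2}-\d{2}\b", "%Y-%m-%d"),
    (r"\b\d{2}/\d{2}/\d{4}\b", "%m/%d/%Y"),
    (r"\b\d{2}-\d{2}-\d{4}\b", "%d-%m-%Y"),
    (r"\b[A-Za-z]+ \d{1,2}, \d{4}\b", "%B %d, %Y"),
]

def find_date(text):
    """Return the first candidate that parses as a real date, or None."""
    for pattern, fmt in DATE_FORMATS:
        for candidate in re.findall(pattern, text):
            try:
                return datetime.strptime(candidate, fmt).date()
            except ValueError:
                continue  # looked like a date but is not one (e.g. month 13)
    return None
```

The slash and dash formats are ambiguous on their own (11/12/2025 could be either order), so the format string encodes an assumption about the vendor's convention. `%B` expects full English month names under the default locale; abbreviated names such as "Nov" would need `%b`.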
3. Amount Extraction
Currency amounts come in many formats:
- With currency symbol: \$[\d,]+\.\d{2} → $1,234.56
- Without symbol: \b\d+\.\d{2}\b → 1234.56
- With thousands separator: \$?[\d,]+\.\d{2} → $1,234.56 or 1,234.56
- Total line: (?i)total:?\s*\$?([\d,]+\.\d{2})
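Once an amount is matched it usually needs to become a number. The sketch below uses the total-line pattern as written and converts the captured text to `Decimal` to avoid floating-point rounding. Because the pattern also matches the word "total" inside "Subtotal", this version takes the last match, on the assumption that the grand total appears after the subtotal.

```python
import re
from decimal import Decimal

TOTAL_LINE = re.compile(r"(?i)total:?\s*\$?([\d,]+\.\d{2})")

def parse_amount(raw):
    """Turn '$1,234.56' or '1234.56' into Decimal('1234.56')."""
    return Decimal(raw.replace("$", "").replace(",", ""))

def find_total(text):
    """Return the amount from the last 'total' match, or None."""
    matches = TOTAL_LINE.findall(text)
    return parse_amount(matches[-1]) if matches else None
```

If your invoices list the total before the subtotal, prefixing the pattern with `\b` is an alternative that skips "Subtotal" entirely.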
4. Vendor Information
Extract vendor details:
- Company name: Often in first few lines, all caps
- Address: Street, City, State ZIP pattern
- Tax ID: \d{2}-\d{7} (EIN format)
- Website: www\.[a-z0-9-]+\.(com|net|org)
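The Tax ID and website patterns translate directly to code; the company name is harder, so this sketch simply guesses the first non-empty line, following the "first few lines" heuristic above. The `\b` anchors, the case-insensitive flag, and the field names in the returned dictionary are assumptions made for illustration.

```python
import re

EIN = re.compile(r"\b\d{2}-\d{7}\b")
WEBSITE = re.compile(r"www\.[a-z0-9-]+\.(?:com|net|org)", re.IGNORECASE)

def find_vendor_fields(text):
    """Return a rough guess at vendor name, tax ID, and website."""
    ein = EIN.search(text)
    site = WEBSITE.search(text)
    # Heuristic: treat the first non-empty line as the company name.
    first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), None)
    return {
        "name_guess": first_line,
        "tax_id": ein.group(0) if ein else None,
        "website": site.group(0).lower() if site else None,
    }
```

The name guess is the field most likely to be wrong, which is why the workflow in the next section hands vendor name extraction to the AI step.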
AI + Regex Workflow for Invoices
Step-by-Step Process
- Convert PDF to text: Use OCR or PDF extraction tool (many AI platforms include this)
- Apply regex pre-extraction: Pull out obvious patterns: dates, amounts, invoice numbers
- AI prompt with extracted data:
  "From this invoice text, I've extracted:
  - Invoice number: [regex result]
  - Date: [regex result]
  - Total: [regex result]
  Please verify these are correct and extract:
  1. Vendor name
  2. Billing address
  3. Line items with descriptions and amounts
  4. Tax amount
  5. Payment terms
  Return in JSON format."
- AI processes and structures data: Returns clean JSON with all invoice fields
- Regex validation of AI output: Verify amounts match pattern, dates are valid, invoice number format correct
- Import to accounting system: Direct API integration or CSV import
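The pre-extraction, prompt, and validation steps above can be sketched roughly as follows. No specific AI platform is assumed: `call_llm` is a hypothetical callable you would supply (prompt string in, JSON string out), and the `total` key in the reply is likewise an assumption about how the model names its fields.

```python
import json
import re

def pre_extract(text):
    """Regex pre-extraction: pull the obvious fields before calling the model."""
    def first(pattern):
        match = re.search(pattern, text)
        return match.group(1) if match else None

    return {
        "invoice_number": first(r"(INV-\d{5})"),
        "date": first(r"(\d{2}/\d{2}/\d{4})"),
        # \b keeps "Subtotal" from matching (an addition to the earlier pattern).
        "total": first(r"(?i)\btotal:?\s*\$?([\d,]+\.\d{2})"),
    }

def build_prompt(text, fields):
    """Fill the prompt template with the regex results."""
    return (
        "From this invoice text, I've extracted:\n"
        f"- Invoice number: {fields['invoice_number']}\n"
        f"- Date: {fields['date']}\n"
        f"- Total: {fields['total']}\n"
        "Please verify these are correct and extract:\n"
        "1. Vendor name\n2. Billing address\n"
        "3. Line items with descriptions and amounts\n"
        "4. Tax amount\n5. Payment terms\n"
        "Return in JSON format.\n\n"
        f"Invoice text:\n{text}"
    )

def extract_invoice(text, call_llm):
    """call_llm is a hypothetical callable: prompt string in, JSON string out."""
    fields = pre_extract(text)
    result = json.loads(call_llm(build_prompt(text, fields)))
    # Regex validation of the model's output before it goes anywhere else.
    total = str(result.get("total", ""))
    result["total_format_ok"] = bool(re.fullmatch(r"\$?[\d,]+\.\d{2}", total))
    return result
```

In practice `json.loads` will raise if the model wraps its JSON in extra prose, so a production version would want to catch that and retry or flag the invoice for review.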
Advanced Extraction Patterns
Line Items
Extract individual line items from invoices:
Pattern: ^(.+?)\s+(\d+)\s+\$?([\d,]+\.\d{2})\s+\$?([\d,]+\.\d{2})$
Matches:
- Office Supplies 5 $10.00 $50.00
- Consulting Hours 10 $150.00 $1,500.00
Groups:
1. Description
2. Quantity
3. Unit price
4. Line total
Tax Amounts
Find sales tax or VAT:
Pattern: (?i)((?:sales?\s+)?tax|vat):?\s*\$?([\d,]+\.\d{2})
Matches:
- Sales Tax: $45.67
- VAT $123.45
- Tax $5.00
The "sales" prefix is optional here so that a bare "Tax" label also matches.
Payment Terms
Pattern: (?i)(net|due in)\s+(\d+)(?:\s+(days?|months?))?
Matches:
- Net 30 days
- Due in 15 days
- Net 60
The unit is optional here so that shorthand such as "Net 60" also matches.
Real-World Success Story
Case Study: Construction Company
Challenge: 200 vendor invoices monthly, each requiring manual entry
Time: 25 hours per month
Solution: Regex + AI extraction system
Results:
- 95% of fields auto-extracted
- Time reduced to 2 hours (92% savings)
- Error rate dropped from 5% to 0.5%
- ROI: $2,000+ monthly in saved labor
Best Practices
1. Vendor-Specific Patterns
Create custom patterns for your top 20 vendors; they likely represent 80% of invoice volume.
2. Validation Regex
After extraction, validate:
- Amount format is correct
- Date is not in future
- Invoice number is unique
- Subtotal + tax = total (AI can calculate, regex can validate format)
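The four checks above might look like this in code. This is a sketch under assumptions: the invoice is a dictionary with the hypothetical keys `number`, `date`, `subtotal`, `tax`, and `total`, the date has already been parsed to a `date` object, and `seen_numbers` is whatever set of invoice numbers your system already holds.

```python
import re
from datetime import date
from decimal import Decimal

AMOUNT = re.compile(r"\$?[\d,]+\.\d{2}")

def to_decimal(raw):
    """Strip the currency symbol and thousands separators."""
    return Decimal(raw.replace("$", "").replace(",", ""))

def validate(invoice, seen_numbers, today=None):
    """Return a list of problems; an empty list means every check passed."""
    today = today or date.today()
    bad = [k for k in ("subtotal", "tax", "total") if not AMOUNT.fullmatch(invoice[k])]
    problems = [f"{k}: bad amount format" for k in bad]
    if invoice["date"] > today:
        problems.append("date is in the future")
    if invoice["number"] in seen_numbers:
        problems.append("duplicate invoice number")
    # Only do the arithmetic once all three amounts are well formed.
    if not bad:
        expected = to_decimal(invoice["subtotal"]) + to_decimal(invoice["tax"])
        if expected != to_decimal(invoice["total"]):
            problems.append("subtotal + tax does not equal total")
    return problems
```

Doing the sum in code with `Decimal` rather than asking the model to add keeps the arithmetic exact; the regex only guards the format going in.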
3. Confidence Scoring
Have the AI rate its confidence in each extraction. If the confidence is below 90%, flag the invoice for manual review.
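A simple way to apply the threshold, assuming the model has been asked to return a confidence between 0 and 1 for each field (the dictionary shape and the function name are hypothetical):

```python
REVIEW_THRESHOLD = 0.90

def fields_needing_review(confidences, threshold=REVIEW_THRESHOLD):
    """Return the names of fields whose confidence falls below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)
```

Model-reported confidence is not a calibrated probability, so it is worth comparing flagged and unflagged invoices against manual checks before relying on the 90% cutoff.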
Common Pitfalls to Avoid
- ❌ Making patterns too specific (won't catch variations)
- ❌ Making patterns too broad (false positives)
- ❌ Not testing on sample invoices first
- ❌ Forgetting case-insensitive matching
- ✅ Start with vendor-specific patterns
- ✅ Use AI to suggest patterns based on examples
- ✅ Validate extracted data before importing
Next in This Series
Continue learning:
- Date Format Standardization with Regex and AI
- Vendor Name Normalization Using Regex
- Receipt Parsing Automation with Regex