Data Cleaning with Regex Before AI Analysis
Prepare clean, structured data for AI using regex patterns—garbage in, garbage out no more
Why Data Cleaning Comes First
The old programming adage "garbage in, garbage out" applies doubly to AI-assisted bookkeeping. AI models work best with clean, structured data. Feed them messy, inconsistent data, and you'll get unreliable results—no matter how advanced the AI.
Regular expressions are the bookkeeper's power tool for data cleaning. Before sending your financial data to an LLM for analysis, use regex to remove noise, standardize formats, and ensure quality.
Common Data Cleaning Tasks
1. Remove Extra Whitespace
Bank exports often have irregular spacing:
Before: "VENDOR   NAME    $500.00"
Pattern: \s+
Replace: " " (single space)
After: "VENDOR NAME $500.00"
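The whitespace pattern above could be applied in a script roughly as follows. This is a minimal sketch using Python's standard re module; the function name collapse_whitespace is hypothetical, and the final .strip() is an added assumption that leading and trailing spaces should also go.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space, then trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("VENDOR   NAME \t  $500.00"))  # VENDOR NAME $500.00
```

Because \s also matches tabs and line breaks, this turns them into single spaces instead of deleting them, which keeps adjacent words apart.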
2. Standardize Currency Symbols
Before: "USD 100.00", "US$ 100.00", "100.00 USD"
A single replacement cannot move a trailing "USD" to the front, so use two steps:
Step 1 (leading code) Pattern: ^(?:USD|US\$)\s*
Replace: "$"
Step 2 (trailing code) Pattern: ^([\d,.]+)\s*USD$
Replace: "$" followed by group 1
After: "$100.00"
Makes all amounts consistent for AI processing
3. Remove Non-Printable Characters
Pattern: [^\x20-\x7E]+
Replace: ""
Removes: Tabs, line breaks, and every non-ASCII character (including accented letters such as "é")
Leaves: Only printable ASCII characters
Critical for clean CSV imports
4. Normalize Account Numbers
Before: "Account #1234", "ACCT 1234", "Acct# 1234"
Pattern: (?i)(?:account|acct)\.?\s*#?\s*(\d+)
Extract: Group 1 (just the number)
After: "1234"
Standardized for matching and lookups
Pre-Processing for AI Analysis
Clean Transaction Descriptions
Bank transaction descriptions contain clutter that confuses AI:
Example:
Raw: "SQ *COFFEE SHOP 123 MAIN ST CA SN:AB12CD34 CARD 1234"
Cleaning patterns:
1. Remove card info: CARD\s+\d{4} → ""
2. Remove serial: SN:[A-Z0-9]+ → ""
3. Extract vendor: SQ \*(.+?)(?:\s+\d+|$) → "COFFEE SHOP"
Clean: "COFFEE SHOP"
Now AI can accurately categorize without noise!
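The three cleaning patterns in this example can be chained in a short Python function. This is a sketch under assumptions: the name clean_description is hypothetical, and descriptions without the "SQ *" prefix are assumed to be kept whole once the card and serial codes are stripped.

```python
import re

CARD_RE = re.compile(r"CARD\s+\d{4}")              # 1. card info
SERIAL_RE = re.compile(r"SN:[A-Z0-9]+")            # 2. serial number
VENDOR_RE = re.compile(r"SQ \*(.+?)(?:\s+\d+|$)")  # 3. vendor after "SQ *"

def clean_description(raw: str) -> str:
    """Strip card and serial codes, then pull out the vendor name if present."""
    text = SERIAL_RE.sub("", CARD_RE.sub("", raw))
    match = VENDOR_RE.search(text)
    vendor = match.group(1) if match else text  # assumption: keep text when no "SQ *"
    return re.sub(r"\s+", " ", vendor).strip()

print(clean_description("SQ *COFFEE SHOP 123 MAIN ST CA SN:AB12CD34 CARD 1234"))
# COFFEE SHOP
```

Compiling the patterns once at module level keeps them named and reusable across every row of an export.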
Standardizing Field Formats
Phone Numbers
Before: "(760) 249-7680", "760-249-7680", "7602497680"
Pattern: \(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})
Replace: "$1-$2-$3"
After: "760-249-7680"
Consistent format for AI to process contact info
Email Addresses
Extract valid emails:
Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Validate: Use AI to flag domains that look misspelled (confirming that a domain actually exists requires a DNS lookup)
Clean: Convert to lowercase for consistency
EINs (Employer Identification Numbers)
Before: "12-3456789", "123456789", "12 3456789"
Pattern: (\d{2})[-\s]?(\d{7})
Replace: "$1-$2"
After: "12-3456789"
Properly formatted for IRS forms
Removing Duplicates
Duplicate Transaction Detection
Use regex to create unique identifiers:
Create hash from:
- Date: \d{2}/\d{2}/\d{4} (then converted to YYYY-MM-DD)
- Vendor: ^[A-Z\s]+
- Amount: \$[\d,]+\.\d{2} (then stripped of "$" and commas)
Combine: "2025-11-15|AMAZON|1234.56"
AI can then: "Find all transactions with identical hashes. These are likely duplicates. Flag for review."
Special Characters and Encoding
Remove Problem Characters
// Smart quotes to straight quotes
Pattern: [“”]
Replace: "
// Em dash to hyphen
Pattern: —
Replace: -
// Degree symbol to word
Pattern: °
Replace: deg
// Non-breaking space to regular space
Pattern: \u00A0
Replace: " "
Google Sheets Cleaning Functions
Comprehensive Cleaning Formula
=TRIM(
  REGEXREPLACE(
    REGEXREPLACE(
      REGEXREPLACE(A2, "[^\x20-\x7E]", ""),
      "\s+", " "
    ),
    "^\s+|\s+$", ""
  )
)
Chains three regex operations, innermost first:
1. Remove special (non-printable) characters
2. Collapse multiple spaces into one
3. Trim whitespace from the edges
Note: Sheets formulas do not support inline comments, so keep notes like these outside the formula.
AI-Guided Data Quality Checks
After regex cleaning, use AI to verify quality:
Quality Check Prompt:
"I cleaned this data using regex patterns. Validate quality:
1. All amounts match ^\$[\d,]+\.\d{2}$ ✓
2. All dates match ^\d{4}-\d{2}-\d{2}$ ✓
3. No duplicate whitespace ✓
Now check for:
- Logical inconsistencies
- Unlikely amounts (e.g., $0.00 transactions)
- Missing required fields
- Dates in wrong fiscal period"
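The three checkmarks in that prompt are claims you can confirm locally before sending anything to the AI. The sketch below is one way to do so in Python; the row layout (dicts with "date", "amount", and "description" keys) and the function name format_problems are assumptions for illustration only.

```python
import re

AMOUNT_RE = re.compile(r"^\$[\d,]+\.\d{2}$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DOUBLE_SPACE_RE = re.compile(r"\s{2,}")

def format_problems(rows):
    """Return (row_index, field, value) for every value that fails its check."""
    problems = []
    for i, row in enumerate(rows):
        if not AMOUNT_RE.fullmatch(row["amount"]):
            problems.append((i, "amount", row["amount"]))
        if not DATE_RE.fullmatch(row["date"]):
            problems.append((i, "date", row["date"]))
        if DOUBLE_SPACE_RE.search(row["description"]):
            problems.append((i, "description", row["description"]))
    return problems

rows = [
    {"date": "2025-11-15", "amount": "$1,234.56", "description": "AMAZON"},
    {"date": "11/15/2025", "amount": "1234.56", "description": "COFFEE  SHOP"},
]
for problem in format_problems(rows):
    print(problem)
```

An empty result means the format claims hold, so the AI can spend its effort on the judgment checks (odd amounts, missing fields, wrong fiscal period) that regex cannot make.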
Real-World Cleaning Workflow
Excel/CSV Import Preparation
- Export from bank (often messy format)
- Regex cleaning:
  - Remove extra spaces: \s+ → " "
  - Standardize amounts: Add $ and .00 where missing
  - Fix dates: Convert all to YYYY-MM-DD
  - Clean vendor names: Remove transaction codes
- AI validation: "Check this cleaned data for any remaining issues"
- Import to QuickBooks with confidence
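The regex cleaning step of this workflow might look like the following in Python. Treat it as a sketch under stated assumptions: the column names ("date", "description", "amount"), the MM/DD/YYYY input date format, and all function names are hypothetical, and negative amounts are rejected rather than handled.

```python
import re
from datetime import datetime
from decimal import Decimal

CARD_RE = re.compile(r"CARD\s+\d{4}")    # transaction code: card suffix
SERIAL_RE = re.compile(r"SN:[A-Z0-9]+")  # transaction code: serial number

def standardize_amount(raw: str) -> str:
    """Return the amount with a leading $ and exactly two decimals."""
    if "-" in raw or "(" in raw:
        raise ValueError(f"negative amounts are outside this sketch: {raw!r}")
    digits = re.sub(r"[^\d.]", "", raw)  # drop $, commas, "USD", spaces
    return f"${Decimal(digits):.2f}"

def standardize_date(raw: str) -> str:
    """Convert MM/DD/YYYY (assumed bank format) to YYYY-MM-DD."""
    raw = raw.strip()
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def clean_vendor(raw: str) -> str:
    """Remove transaction codes, then collapse extra spaces."""
    text = SERIAL_RE.sub("", CARD_RE.sub("", raw))
    return re.sub(r"\s+", " ", text).strip()

def clean_row(row: dict) -> dict:
    """Build a cleaned copy of one exported row; the input is not modified."""
    return {
        "date": standardize_date(row["date"]),
        "description": clean_vendor(row["description"]),
        "amount": standardize_amount(row["amount"]),
    }

raw = {"date": "11/15/2025", "description": "AMAZON   MKTP  CARD 1234", "amount": "1234.5"}
print(clean_row(raw))
```

Raising an error on an unrecognized date or a negative amount is deliberate here: a loud failure during cleaning is easier to trace than a silently wrong figure after import.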
Best Practices
- Clean early: Don't wait until reconciliation time
- Document patterns: Save regex for reuse
- Test thoroughly: Run on historical data first
- Validate with AI: Double-check cleaning didn't corrupt data
- Keep originals: Always maintain raw data backup
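One way to act on "document patterns" and "test thoroughly" together is to store each regex next to the examples it must handle. The sketch below uses the phone and EIN patterns from this article; the registry layout and the names PATTERNS, apply_pattern, and failing_cases are hypothetical.

```python
import re

# name -> (compiled pattern, replacement template, {input: expected output})
PATTERNS = {
    "phone": (
        re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})"),
        r"\1-\2-\3",
        {"(760) 249-7680": "760-249-7680", "7602497680": "760-249-7680"},
    ),
    "ein": (
        re.compile(r"(\d{2})[-\s]?(\d{7})"),
        r"\1-\2",
        {"123456789": "12-3456789", "12 3456789": "12-3456789"},
    ),
}

def apply_pattern(name: str, text: str) -> str:
    """Apply one saved pattern to a piece of text."""
    pattern, replacement, _ = PATTERNS[name]
    return pattern.sub(replacement, text)

def failing_cases():
    """Return (name, input, expected, actual) for every documented case that fails."""
    failures = []
    for name, (_, _, cases) in PATTERNS.items():
        for raw, expected in cases.items():
            actual = apply_pattern(name, raw)
            if actual != expected:
                failures.append((name, raw, expected, actual))
    return failures

print(failing_cases())  # an empty list means every documented case passes
```

Adding a few rows from historical exports to each case dictionary turns the registry into a regression check you can rerun whenever a pattern changes.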
Conclusion
Data cleaning is the unglamorous but essential foundation of AI-assisted bookkeeping. By using regex to systematically clean and standardize your financial data before AI analysis, you ensure accurate results, save debugging time, and build reliable automated workflows.
Remember: Clean data in = accurate insights out!