A hybrid OCR-LLM pipeline that extracts structured data from documents using Tesseract OCR and Google's Gemini Flash LLM. Images go through local OCR first; the LLM's reply is then validated against a Pydantic schema, so every JSON file the system writes conforms to that schema.
The system runs each image through Tesseract for fast, local OCR, then sends Gemini Flash a prompt that includes the OCR snippet and the JSON schema derived from the Pydantic model. Gemini must return JSON, which is validated with Pydantic; if validation fails, the request is retried, up to three attempts in total. The result combines cheap offline OCR for clean scans, LLM processing, and schema-validated JSON output.
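The prompt-building step described above could look roughly like the sketch below. The `DriverLicence` model, the `build_prompt` helper, the prompt wording, and the `max_chars` cap are all assumptions made for illustration (the field names follow the sample licence output in this README); only `model_json_schema()` is a documented Pydantic v2 call.

```python
import json
from datetime import date

from pydantic import BaseModel


class DriverLicence(BaseModel):
    """Hypothetical model; field names follow the sample licence output."""

    name: str
    dob: date
    license_number: str
    issuing_state: str
    expiry_date: date


def build_prompt(ocr_text: str, model: type[BaseModel], max_chars: int = 2000) -> str:
    """Combine an OCR snippet with the model's JSON schema into one prompt."""
    schema = json.dumps(model.model_json_schema(), indent=2)
    # max_chars is an assumed cap on the snippet, not a documented setting.
    snippet = ocr_text[:max_chars]
    return (
        "Extract the fields described by this JSON schema from the document.\n"
        "Respond with a single JSON object and nothing else.\n\n"
        f"JSON schema:\n{schema}\n\n"
        f"OCR text:\n{snippet}"
    )
```

Deriving the schema from the model keeps the prompt and the validator in sync: changing a field in the model changes both at once.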
Supported document types:
- Receipts - Extracts merchant info, items, prices, payment method
- Driver's Licenses - Extracts name, DOB, license number, state, expiry
- Resumes - Extracts contact info, skills, education, work experience
Features:
- 🔍 Tesseract OCR for local text extraction
- 🤖 Gemini Flash integration for AI-powered data extraction
- 📋 Pydantic schema validation ensures consistent JSON output
- 🔄 Automatic retry (up to 3 attempts) on validation failures
- 📁 Batch processing for folders of documents
- 🖼️ Multi-format support (PDF, PNG, JPG, JPEG, TIFF)
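As a rough illustration of batch processing over the supported formats, the sketch below collects the files a run would handle. The `collect_documents` helper and the choice to scan only the top level of a folder are assumptions, not the project's confirmed behaviour.

```python
from pathlib import Path

# Extensions taken from the supported-format list above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def collect_documents(path: Path) -> list[Path]:
    """Return the supported files at `path` (a single file or a folder)."""
    if path.is_file():
        return [path] if path.suffix.lower() in SUPPORTED_EXTENSIONS else []
    # Assumption: only the top level of the folder is scanned, not subfolders.
    return sorted(
        p
        for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Matching on the lower-cased suffix means files such as `scan.JPG` are picked up as well.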
Requirements:
- Python 3.13+
- Google AI API key (for Gemini Flash)
- Tesseract OCR installed on your system
- Clone the repository:
    git clone https://github.com/pushkar1713/ocr-llm-py.git
    cd ocr-llm-py
- Install dependencies:
    uv sync
- Set up environment variables. Create a .env file in the root directory:
    GEMINI_API_KEY=your_google_ai_api_key_here
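A minimal sketch of how the key might be read at startup is shown below. It assumes the variable is already present in the process environment (for example, exported in the shell or loaded from `.env` by whatever mechanism the project uses, which is not described here); `load_api_key` is a hypothetical helper name.

```python
import os
from collections.abc import Mapping
from typing import Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GEMINI_API_KEY, failing early with a clear message if it is unset."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; add it to .env or export it in the environment"
        )
    return key
```

Failing at startup avoids running OCR on a whole folder before discovering that the LLM call cannot be made.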
Process a single document:
    python main.py run --type receipt --path /path/to/document.pdf
Process a folder of documents:
    python main.py run --type licence --path /path/to/folder
Run without arguments for interactive prompts:
    python main.py run
The system will prompt you for:
- Document type (receipt, licence, resume)
- Path to file or folder
- Output directory (optional, defaults to ./output)
Process a receipt:
    python main.py run -t receipt -p shop_receipts/receipt_001.jpg
Process driver's licenses in bulk:
    python main.py run -t licence -p Drivers_license/ -o processed_licenses/
Process resume PDFs:
    python main.py run -t resume -p Resume/candidate_resume.pdf
The system generates JSON files in the specified output directory:
- Format: {filename}.p{page_number}.json
- Each page of a document gets its own JSON file
- Example: document.pdf → document.p0.json, document.p1.json
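The naming rule above can be expressed in a few lines. This is a sketch: `output_path` is a hypothetical helper, and it assumes `{filename}` means the file name without its extension and that pages are numbered from zero, as the `document.pdf` example suggests.

```python
from pathlib import Path


def output_path(source: Path, page_number: int, output_dir: Path = Path("output")) -> Path:
    """Build `{filename}.p{page_number}.json` inside the output directory."""
    return output_dir / f"{source.stem}.p{page_number}.json"
```

For example, `output_path(Path("document.pdf"), 1)` gives `output/document.p1.json`.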
Driver's License:
{
  "name": "Aaron Collins",
  "dob": "1956-10-09",
  "license_number": "MKWMJO89",
  "issuing_state": "Ireland",
  "expiry_date": "1967-04-14"
}
Project structure:
firstwork-assignment/
├── main.py              # CLI entry point
├── src/
│   ├── config.py        # Environment configuration
│   ├── models.py        # Pydantic data models
│   ├── pipeline.py      # Main processing pipeline
│   └── utils/
│       ├── ocr.py       # Tesseract OCR functions
│       └── llm.py       # Gemini LLM integration
├── output/              # Generated JSON files
├── pyproject.toml       # Project dependencies
└── README.md
How it works:
- Document Input - Accepts PDF or image files
- OCR Processing - Tesseract extracts text with confidence scoring
- LLM Enhancement - If OCR confidence is low, Gemini Flash processes both image and OCR text
- Schema Validation - Pydantic validates output against predefined models
- Retry Logic - Up to 3 attempts for failed validations
- JSON Output - Structured data saved as JSON files
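The confidence check in the OCR and LLM steps above might be sketched as follows. It assumes the OCR data has the shape of pytesseract's `image_to_data(..., output_type=Output.DICT)` result, where each box has a `conf` value and `-1` marks boxes without recognised text. The helper names and the threshold of 60 are hypothetical; the project's actual cut-off is not stated.

```python
def mean_confidence(ocr_data: dict) -> float:
    """Average word confidence (0-100), ignoring boxes that contain no text."""
    scores = [
        float(conf)
        for conf, text in zip(ocr_data["conf"], ocr_data["text"])
        if float(conf) >= 0 and text.strip()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical cut-off; the project's real threshold is not documented here.
LOW_CONFIDENCE_THRESHOLD = 60.0


def needs_image(ocr_data: dict) -> bool:
    """Decide whether to send the image to the LLM alongside the OCR text."""
    return mean_confidence(ocr_data) < LOW_CONFIDENCE_THRESHOLD
```

A page with no recognised words scores 0.0, so it is always treated as low confidence and the image is sent along.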
Configuration:
- DPI: 300 (configurable in pdf_to_images())
- Tesseract config: English language, PSM 6
- Model: gemini-2.5-flash
- Response format: JSON only
- Max retries: 3 attempts
Error handling:
- Validation errors trigger retry with error context
- API errors are logged with detailed messages
- File errors show clear path and permission issues
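Retrying with error context could be implemented along the lines of the sketch below. The `Receipt` model is a trimmed-down, hypothetical stand-in, `call_llm` is a placeholder for whatever function sends the prompt to Gemini, and the feedback wording is an assumption; `model_validate_json` and `ValidationError` are Pydantic v2 APIs.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class Receipt(BaseModel):
    """Hypothetical, trimmed-down model used only for this sketch."""

    merchant: str
    total: float


def extract_with_retry(
    call_llm: Callable[[str], str],
    prompt: str,
    model: type[BaseModel],
    max_retries: int = 3,
) -> BaseModel:
    """Call the LLM until its reply validates, feeding errors back each time."""
    current_prompt = prompt
    for attempt in range(1, max_retries + 1):
        reply = call_llm(current_prompt)
        try:
            # Raises ValidationError for malformed JSON and for schema mismatches.
            return model.model_validate_json(reply)
        except ValidationError as exc:
            if attempt == max_retries:
                raise  # out of attempts: surface the last validation error
            # Append the validation errors so the next attempt can correct them.
            current_prompt = (
                f"{prompt}\n\nYour previous reply failed validation:\n{exc}\n"
                "Return corrected JSON only."
            )
    raise ValueError("max_retries must be at least 1")
```

Passing `call_llm` in as a parameter keeps the retry logic testable without network access: a test can supply a function that returns canned replies.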
Dependencies:
- google-generativeai - Gemini AI integration
- pytesseract - Tesseract OCR wrapper
- pydantic - Data validation and serialization
- opencv-python-headless - Image processing
- pymupdf - PDF processing
- typer - CLI framework
- rich - Terminal formatting