Advanced Document Layout Analysis & Structured Output Generation
A powerful AI-driven tool for archivists to digitize, structure, and interact with documents using an advanced OCR vision model and an LLM.
Try InterPARES-Vision online: demos.dlnlp.ai/InterPARES/
No installation required - access the full functionality through your web browser!
InterPARES-Vision is an advanced OCR (Optical Character Recognition) and layout analysis tool designed specifically for archival documents. It combines a state-of-the-art AI vision model with a large language model to extract text, preserve document structure, generate machine-readable outputs, and let users interact with the text extracted from scanned documents and images.
- Document Structure Understanding: Identifies headings, paragraphs, tables, lists, and maintains proper reading order
- Interactive AI Chat: Ask questions about parsed documents and interact with extracted text using natural language
- Metadata Extraction: Request translations, summaries, and structured metadata through conversational queries
- Multi-format Output: Generate Markdown, JSON, and annotated visualizations
- Batch Processing: Handle multi-page PDFs with consistent quality
- Layout Detection: Identifies document regions including text blocks, tables, images, headings, and captions
- OCR Text Extraction: Extracts text with high accuracy, even from degraded or complex documents
- Structure Preservation: Maintains document hierarchy and reading order
- Multi-page PDF Support: Process entire PDF documents with page-by-page analysis
- Interactive Visualization: View detected layout regions overlaid on original documents
- Natural Language Queries: Ask questions about document content in plain language
- Metadata Generation: Extract structured metadata in JSON format for archival systems
- Translation Support: Request translations of document sections or entire documents
- Summarization: Get concise summaries and key information extraction
- Classification Assistance: Identify document types and suggest archival categories
- Markdown: Formatted text with preserved structure and hierarchy
- JSON: Structured data with bounding boxes, element types, and coordinates (see the example after this list)
- Annotated Images: Visual overlay showing detected layout regions with color-coded boxes
- Downloadable Results: ZIP package with all output formats and original files
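For the JSON output above, here is a minimal sketch of reading one page's results in Python; the file name and field names (`bbox`, `category`, `text`) are illustrative assumptions, not the exact schema.

```python
import json

# Hypothetical per-page JSON produced by the parser; the file name and field
# names (bbox, category, text) are assumptions for illustration only.
with open("page_1.json", encoding="utf-8") as f:
    elements = json.load(f)

for el in elements:
    x0, y0, x1, y1 = el["bbox"]             # region coordinates on the page image
    snippet = (el.get("text") or "")[:60]   # some regions (e.g. images) carry no text
    print(f'{el["category"]:<12} ({x0},{y0})-({x1},{y1})  {snippet!r}')
```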
| Format | Extensions | Description |
|---|---|---|
| PDF Documents | .pdf | Multi-page or single-page PDF files (processed page-by-page) |
| Images | .jpg, .jpeg, .png | Scanned images or photographs of documents |
💡 Best Results: Use high-resolution scans (200+ DPI) with good contrast and minimal skew.
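If a scan falls short of that, a small preprocessing sketch using Pillow (an assumed, optional dependency, not part of this tool) can improve contrast before upload. Note that tagging a higher DPI does not add detail; rescanning at higher resolution is always preferable.

```python
from PIL import Image, ImageOps

# Basic scan clean-up: convert to grayscale, stretch contrast, and record a 300 DPI tag.
# File names are examples; adapt to your scanning workflow.
img = Image.open("scan_raw.jpg")
img = ImageOps.autocontrast(img.convert("L"))
img.save("scan_clean.png", dpi=(300, 300))
```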
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for optimal performance)
- 8GB+ RAM (16GB+ recommended for large documents)
# Clone the repository
git clone https://github.com/UBC-NLP/InterPARES_vision.git
cd InterPARES_vision
# Install dependencies
pip install -r requirements.txt
# Install DotsOCR parser
pip install dots-ocr

# Start the application (default port: 7860)
python app.py 7860
The application will be available at http://localhost:7860 (or your specified port).
Option A: Use Example Documents
- Click on any thumbnail in the "Select Example Document" gallery
- Browse through available examples using Previous/Next buttons
Option B: Upload Your Own
- Click "π Upload PDF or Image" button
- Select a file from your computer (PDF, JPG, PNG)
For PDF files:
- Use the ⬅ Previous and Next ➡ buttons to browse pages
- View current position with page counter (e.g., "2 / 10")
| Mode | Description | Best For |
|---|---|---|
| prompt_layout_all_en | Full analysis: layout + OCR + reading order | Complex documents with mixed content |
| prompt_layout_only_en | Layout detection without text extraction | Understanding document organization |
| prompt_ocr | OCR-focused with minimal layout | Simple text documents |
💡 Recommendation: Start with prompt_layout_all_en for comprehensive analysis.
Click Parse to begin processing. The system will:
- Analyze document layout
- Extract text from detected regions
- Generate structured output in multiple formats
Results appear in three tabs:
- Markdown Render Preview: Human-readable formatted view
- Markdown Raw Text: Plain Markdown with formatting codes
- Current Page JSON: Structured data with coordinates and element types
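As a quick sanity check outside the UI, you can redraw the detected regions yourself. A sketch using Pillow; the field names (`bbox`, `category`) follow the illustrative schema shown earlier and may differ from the actual output.

```python
import json

from PIL import Image, ImageDraw

# Overlay detected regions on the original page image.
# File names and JSON field names are illustrative assumptions.
page = Image.open("page_1.png").convert("RGB")
draw = ImageDraw.Draw(page)

with open("page_1.json", encoding="utf-8") as f:
    for el in json.load(f):
        x0, y0, x1, y1 = el["bbox"]
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0, max(y0 - 14, 0)), el["category"], fill="red")

page.save("page_1_annotated.png")
```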
After parsing, use the AI chat feature:
Example Questions:
- "Extract the main keywords for archival indexing"
- "What is the document type and subject matter?"
- "Extract metadata in JSON format"
- "Translate the summary section into French"
- "List all dates, names, and locations mentioned"
Click ⬇️ Download Results to get a ZIP file containing:
- Layout images with annotations
- JSON files with structured data
- Markdown files with formatted text
- Original input file
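A small sketch for unpacking the archive and tallying the outputs by type (standard library only; the archive name and exact contents will vary):

```python
import zipfile
from pathlib import Path

# Extract a downloaded results archive and count files by extension.
archive = Path("interpares_results.zip")  # example name; use your actual download
out_dir = Path("results")

with zipfile.ZipFile(archive) as zf:
    zf.extractall(out_dir)

for ext in (".json", ".md", ".png", ".pdf"):
    count = len(list(out_dir.rglob(f"*{ext}")))
    print(f"{ext}: {count} file(s)")
```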
- Digitization Projects: Convert scanned documents to searchable, structured text
- Metadata Extraction: Automatically generate catalog records and finding aids
- Collection Assessment: Rapidly evaluate document content and significance
- Multilingual Access: Translate documents for broader accessibility
- Data Extraction: Pull structured information from historical records
- Classification Support: AI-assisted document type and subject identification
- ✅ Use consistent scan settings (200+ DPI) for optimal results
- ✅ Process similar document types together with the same prompt mode
- ✅ Review sample outputs (5-10%) from each batch for quality assurance
- ✅ Keep original scans alongside OCR outputs in your digital repository
- ✅ Document processing settings (tool version, prompt mode, date) in metadata (see the sketch after this list)
- ✅ Verify AI-generated metadata against professional archival standards
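For the record-keeping item above, a minimal sketch of a sidecar file that stores the processing settings next to each OCR output; the field names are illustrative, not a prescribed archival schema.

```python
import json
from datetime import date

# Sidecar record documenting how a file was processed.
# All field names are illustrative; adapt them to your archival standard.
processing_record = {
    "source_file": "box12_folder3_letter.pdf",
    "tool": "InterPARES-Vision",
    "prompt_mode": "prompt_layout_all_en",
    "processed_on": date.today().isoformat(),
    "reviewed": False,
}

with open("box12_folder3_letter.ocr-meta.json", "w", encoding="utf-8") as f:
    json.dump(processing_record, f, indent=2)
```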
| Model | Role | Description |
|---|---|---|
| dots.ocr | Vision/OCR | Advanced vision model for document layout analysis, text extraction, and structure recognition |
| Qwen3-4B-Instruct-2507-FP8 | Chat/Reader | Large Language Model for natural language interaction, summarization, and metadata extraction from parsed content |
Model Card: https://huggingface.co/rednote-hilab/dots.ocr/
- Performance depends heavily on input image resolution (200+ DPI recommended)
- Complex handwritten text may have lower recognition accuracy compared to printed text
- Very dense or overlapping layouts might require manual verification
- Processing speed scales with image size and complexity
Model Card: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507-FP8
- Context window limitations may apply for extremely long documents
- As with all LLMs, there is a potential for hallucination, especially with ambiguous input
- Inference speed depends on available GPU resources (FP8 quantization helps efficiency)
- Knowledge cutoff applies to information not contained within the provided document context
Default settings in app.py:
DEFAULT_CONFIG = {
    'ip': "127.0.0.1",          # host where the vLLM backend runs
    'port_vllm': 8001,          # port of the vLLM backend
    'min_pixels': MIN_PIXELS,   # lower bound on image resolution sent to the vision model
    'max_pixels': MAX_PIXELS,   # upper bound on image resolution sent to the vision model
    'test_images_dir': "./assets/showcase_origin",  # bundled example documents
}

The chat feature uses vLLM with an OpenAI-compatible API:
from langchain_openai import ChatOpenAI  # assumed import: these parameters match LangChain's OpenAI-compatible chat client

chat_client = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="Qwen3-4B-Instruct-2507-FP8",
    temperature=0.1,
    max_tokens=16000,
    streaming=True
)

- Live Demo: demos.dlnlp.ai/InterPARES/
- Issues: GitHub Issues