Advanced Document Layout Analysis & Structured Output Generation
A powerful AI-driven tool for archivists to digitize, structure, and interact with documents using an advanced OCR vision model and an LLM.
Try InterPARES-Vision online: demos.dlnlp.ai/InterPARES/
No installation required - access the full functionality through your web browser!
InterPARES-Vision is an advanced OCR (Optical Character Recognition) and layout analysis tool designed specifically for archival documents. It combines a state-of-the-art AI vision model with a large language model to extract text, preserve document structure, generate machine-readable outputs, and let users interact with the text extracted from scanned documents and images.
- Document Structure Understanding: Identifies headings, paragraphs, tables, lists, and maintains proper reading order
- Interactive AI Chat: Ask questions about parsed documents and interact with extracted text using natural language
- Metadata Extraction: Request translations, summaries, and structured metadata through conversational queries
- Multi-format Output: Generate Markdown, JSON, and annotated visualizations
- Batch Processing: Handle multi-page PDFs with consistent quality
- Layout Detection: Identifies document regions including text blocks, tables, images, headings, and captions
- OCR Text Extraction: Extracts text with high accuracy, even from degraded or complex documents
- Structure Preservation: Maintains document hierarchy and reading order
- Multi-page PDF Support: Process entire PDF documents with page-by-page analysis
- Interactive Visualization: View detected layout regions overlaid on original documents
- Natural Language Queries: Ask questions about document content in plain language
- Metadata Generation: Extract structured metadata in JSON format for archival systems
- Translation Support: Request translations of document sections or entire documents
- Summarization: Get concise summaries and key information extraction
- Classification Assistance: Identify document types and suggest archival categories
- Markdown: Formatted text with preserved structure and hierarchy
- JSON: Structured data with bounding boxes, element types, and coordinates (see the example after this list)
- Annotated Images: Visual overlay showing detected layout regions with color-coded boxes
- Downloadable Results: ZIP package with all output formats and original files
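For the JSON output above, here is a minimal sketch of reading one page's results in Python; the file name and field names (`bbox`, `category`, `text`) are illustrative assumptions, not the exact schema.

```python
import json

# Hypothetical per-page JSON produced by the parser; the file name and field
# names (bbox, category, text) are assumptions for illustration only.
with open("page_1.json", encoding="utf-8") as f:
    elements = json.load(f)

for el in elements:
    x0, y0, x1, y1 = el["bbox"]             # region coordinates on the page image
    snippet = (el.get("text") or "")[:60]   # some regions (e.g. images) carry no text
    print(f'{el["category"]:<12} ({x0},{y0})-({x1},{y1})  {snippet!r}')
```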
| Format | Extensions | Description |
|---|---|---|
| PDF Documents | .pdf | Multi-page or single-page PDF files (processed page-by-page) |
| Images | .jpg, .jpeg, .png | Scanned images or photographs of documents |
💡 Best Results: Use high-resolution scans (200+ DPI) with good contrast and minimal skew.
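If a scan falls short of that, a small preprocessing sketch using Pillow (an assumed, optional dependency, not part of this tool) can improve contrast before upload. Note that tagging a higher DPI does not add detail; rescanning at higher resolution is always preferable.

```python
from PIL import Image, ImageOps

# Basic scan clean-up: convert to grayscale, stretch contrast, and record a 300 DPI tag.
# File names are examples; adapt to your scanning workflow.
img = Image.open("scan_raw.jpg")
img = ImageOps.autocontrast(img.convert("L"))
img.save("scan_clean.png", dpi=(300, 300))
```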
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for optimal performance)
- 8GB+ RAM (16GB+ recommended for large documents)
# Clone the repository
git clone https://github.com/UBC-NLP/InterPARES_vision.git
cd InterPARES_vision
# Install dependencies
pip install -r requirements.txt
# Install DotsOCR parser
pip install dots-ocr

# Start the application (default port: 7860)
python app.py 7860
The application will be available at http://localhost:7860 (or your specified port).
Option A: Use Example Documents
- Click on any thumbnail in the "Select Example Document" gallery
- Browse through available examples using Previous/Next buttons
Option B: Upload Your Own
- Click "π Upload PDF or Image" button
- Select a file from your computer (PDF, JPG, PNG)
For PDF files:
- Use the ⬅ Previous and Next ➡ buttons to browse pages
- View current position with page counter (e.g., "2 / 10")
| Mode | Description | Best For |
|---|---|---|
| prompt_layout_all_en | Full analysis: layout + OCR + reading order | Complex documents with mixed content |
| prompt_layout_only_en | Layout detection without text extraction | Understanding document organization |
| prompt_ocr | OCR-focused with minimal layout | Simple text documents |
💡 Recommendation: Start with prompt_layout_all_en for comprehensive analysis.
Click Parse to begin processing. The system will:
- Analyze document layout
- Extract text from detected regions
- Generate structured output in multiple formats
Results appear in three tabs:
- Markdown Render Preview: Human-readable formatted view
- Markdown Raw Text: Plain Markdown with formatting codes
- Current Page JSON: Structured data with coordinates and element types
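As a quick sanity check outside the UI, you can redraw the detected regions yourself. A sketch using Pillow; the field names (`bbox`, `category`) follow the illustrative schema shown earlier and may differ from the actual output.

```python
import json

from PIL import Image, ImageDraw

# Overlay detected regions on the original page image.
# File names and JSON field names are illustrative assumptions.
page = Image.open("page_1.png").convert("RGB")
draw = ImageDraw.Draw(page)

with open("page_1.json", encoding="utf-8") as f:
    for el in json.load(f):
        x0, y0, x1, y1 = el["bbox"]
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0, max(y0 - 14, 0)), el["category"], fill="red")

page.save("page_1_annotated.png")
```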
After parsing, use the AI chat feature:
Example Questions:
- "Extract the main keywords for archival indexing"
- "What is the document type and subject matter?"
- "Extract metadata in JSON format"
- "Translate the summary section into French"
- "List all dates, names, and locations mentioned"
Click ⬇️ Download Results to get a ZIP file containing:
- Layout images with annotations
- JSON files with structured data
- Markdown files with formatted text
- Original input file
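A small sketch for unpacking the archive and tallying the outputs by type (standard library only; the archive name and exact contents will vary):

```python
import zipfile
from pathlib import Path

# Extract a downloaded results archive and count files by extension.
archive = Path("interpares_results.zip")  # example name; use your actual download
out_dir = Path("results")

with zipfile.ZipFile(archive) as zf:
    zf.extractall(out_dir)

for ext in (".json", ".md", ".png", ".pdf"):
    count = len(list(out_dir.rglob(f"*{ext}")))
    print(f"{ext}: {count} file(s)")
```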
- Digitization Projects: Convert scanned documents to searchable, structured text
- Metadata Extraction: Automatically generate catalog records and finding aids
- Collection Assessment: Rapidly evaluate document content and significance
- Multilingual Access: Translate documents for broader accessibility
- Data Extraction: Pull structured information from historical records
- Classification Support: AI-assisted document type and subject identification
- ✅ Use consistent scan settings (200+ DPI) for optimal results
- ✅ Process similar document types together with the same prompt mode
- ✅ Review sample outputs (5-10%) from each batch for quality assurance
- ✅ Keep original scans alongside OCR outputs in your digital repository
- ✅ Document processing settings (tool version, prompt mode, date) in metadata (see the sketch after this list)
- ✅ Verify AI-generated metadata against professional archival standards
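For the record-keeping item above, a minimal sketch of a sidecar file that stores the processing settings next to each OCR output; the field names are illustrative, not a prescribed archival schema.

```python
import json
from datetime import date

# Sidecar record documenting how a file was processed.
# All field names are illustrative; adapt them to your archival standard.
processing_record = {
    "source_file": "box12_folder3_letter.pdf",
    "tool": "InterPARES-Vision",
    "prompt_mode": "prompt_layout_all_en",
    "processed_on": date.today().isoformat(),
    "reviewed": False,
}

with open("box12_folder3_letter.ocr-meta.json", "w", encoding="utf-8") as f:
    json.dump(processing_record, f, indent=2)
```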
| Model | Role | Description |
|---|---|---|
| dots.ocr | Vision/OCR | Advanced vision model for document layout analysis, text extraction, and structure recognition |
| Qwen3-4B-Instruct-2507-FP8 | Chat/Reader | Large Language Model for natural language interaction, summarization, and metadata extraction from parsed content |
Model Card: https://huggingface.co/rednote-hilab/dots.ocr/
- Performance depends heavily on input image resolution (200+ DPI recommended)
- Complex handwritten text may have lower recognition accuracy compared to printed text
- Very dense or overlapping layouts might require manual verification
- Processing speed scales with image size and complexity
Model Card: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507-FP8
- Context window limitations may apply for extremely long documents
- As with all LLMs, there is a potential for hallucination, especially with ambiguous input
- Inference speed depends on available GPU resources (FP8 quantization helps efficiency)
- Knowledge cutoff applies to information not contained within the provided document context
Default settings in app.py:
DEFAULT_CONFIG = {
    'ip': "127.0.0.1",          # host where the vLLM backend runs
    'port_vllm': 8001,          # port of the vLLM backend
    'min_pixels': MIN_PIXELS,   # lower bound on image resolution sent to the vision model
    'max_pixels': MAX_PIXELS,   # upper bound on image resolution sent to the vision model
    'test_images_dir': "./assets/showcase_origin",  # bundled example documents
}

The chat feature uses vLLM with an OpenAI-compatible API:
from langchain_openai import ChatOpenAI  # assumed import: these parameters match LangChain's OpenAI-compatible chat client

chat_client = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="Qwen3-4B-Instruct-2507-FP8",
    temperature=0.1,
    max_tokens=16000,
    streaming=True
)

- Live Demo: demos.dlnlp.ai/InterPARES/
- Issues: GitHub Issues