Add markitdown --info FILE flag to emit document metadata as JSON (no full conversion) #2182

@fuleinist

Feature: markitdown --info FILE — emit document metadata without conversion

Summary

Add a lightweight --info flag that prints a structured JSON summary of a file without running the full conversion pipeline. This is useful for triage and routing in agent-driven document pipelines.

Motivation

When markitdown is used inside an AI agent harness, the agent often needs to make a routing decision before committing to a full conversion:

  • Is this file a DOCX, a PDF with images, or a scanned image that needs OCR?
  • Roughly how big is the conversion (page count, embedded image count, table count)?
  • Which converter path would be taken, and is it likely to fail?

Currently the only way to learn any of this is to call markitdown and either inspect the output or wait for a converter exception. A cheap metadata preflight lets agents:

  1. Skip work entirely for unsupported types (today these do return an error, but only after some setup work).
  2. Estimate cost/duration before invoking expensive converters (PDF, PPTX with embedded media).
  3. Surface a structured "document card" to the user before committing.

Proposed behaviour

markitdown --info path/to/file.docx
{
  "path": "path/to/file.docx",
  "size_bytes": 482113,
  "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "detected_converter": "DocxConverter",
  "page_count": 12,
  "image_count": 4,
  "table_count": 3,
  "estimated_tokens": 4800
}
  • page_count / image_count / table_count: only populated when the underlying converter can cheaply extract them; null otherwise.
  • estimated_tokens: a rough estimate derived from the converted character count (length of the existing DocumentConverterResult markdown / 4), or null if the file has not been converted yet.
  • Output is always JSON on stdout. --json remains the desired long-term flag for converted output (cf. Feature Request: Add JSON Output Option #2029), but --info can ship first because it does not require a full conversion pass.
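
To make the field rules above concrete, here is a minimal sketch of how such a record could be assembled using only the standard library. Everything in it is an assumption for illustration: build_info, estimate_tokens, render and the _KNOWN_TYPES table are hypothetical names, not existing markitdown APIs, and a real implementation would presumably query markitdown's own converter detection rather than a hard-coded extension table.

```python
import json
import os

# Hypothetical extension table (assumption). Only the DOCX entry is taken
# from the example in this issue; real detection would live in markitdown.
_KNOWN_TYPES = {
    ".docx": (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "DocxConverter",
    ),
}


def estimate_tokens(markdown):
    """Rough token estimate: character count / 4, or None without markdown."""
    if markdown is None:
        return None
    return len(markdown) // 4


def build_info(path, markdown=None):
    """Assemble the metadata record without running any converter."""
    ext = os.path.splitext(path)[1].lower()
    mime_type, converter = _KNOWN_TYPES.get(ext, (None, None))
    return {
        "path": path,
        "size_bytes": os.path.getsize(path),
        "mime_type": mime_type,
        "detected_converter": converter,
        # Counts stay None (JSON null) unless they can be extracted cheaply.
        "page_count": None,
        "image_count": None,
        "table_count": None,
        "estimated_tokens": estimate_tokens(markdown),
    }


def render(record):
    """Serialize the record as the JSON document written to stdout."""
    return json.dumps(record, indent=2)
```

In this sketch the counts are always null, because cheap per-format extraction is exactly the part that would need converter-specific work.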

Why a separate flag instead of extending --json?

--json is for the converted markdown. --info is for metadata about the source. They serve different lifecycle stages and different consumers (one feeds downstream prompts, the other feeds routing logic). Keeping them separate avoids overloading --json.

Backwards compatibility

Pure addition. Existing CLI flags and exit codes are untouched. --info exits 0 even when the file is unsupported (the JSON's detected_converter will be null and the converter error captured in a warning field) so that agents can still inspect the response.
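
A possible shape for the unsupported-file case, as a hedged sketch: the record keeps detected_converter as null, carries the error text in a warning field, and the process still exits 0. The function names unsupported_record and emit_info are hypothetical, and the exact set of fields in the error record is an assumption.

```python
import json


def unsupported_record(path, error):
    """Build the record for a file no converter accepts (hypothetical shape)."""
    return {
        "path": path,
        "detected_converter": None,
        # The converter error is captured as text instead of being raised.
        "warning": str(error),
    }


def emit_info(record, stream):
    """Write one JSON record to the stream and return the exit code."""
    stream.write(json.dumps(record) + "\n")
    # Always 0, so agents can parse the response even for unsupported files.
    return 0
```

A caller would pass sys.stdout as the stream and hand the return value to sys.exit.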

Stretch

  • --info --jsonl FILE1 FILE2 … for batch preflight (one record per line, mirrors jq -c shape).
  • Wire markitdown --info into the MCP markitdown tool as a separate inspect_document action so MCP-based agents don't need to spawn a subprocess just to peek.
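
For the batch preflight idea, the JSONL output could be produced roughly as follows. This is only a sketch: to_jsonl is a hypothetical helper, and the compact separators are chosen to mirror the one-record-per-line shape of jq -c.

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    lines = [json.dumps(r, separators=(",", ":")) for r in records]
    return "\n".join(lines) + "\n" if lines else ""
```

Each line parses on its own, so a consumer can stream records without waiting for the whole batch.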
