Handle YouTube transcript errors and improve metadata extraction #2178

Open
Sasivarnasarma wants to merge 1 commit into microsoft:main from Sasivarnasarma:fix/youtube-transcript-fallback

@Sasivarnasarma

Description

This pull request fixes an issue in YouTubeConverter where a failure to list transcripts crashes the entire converter. The conversion then falls back to HtmlConverter, which outputs raw, unreadable scraped HTML (such as a cookie wall) instead of clean video details.

Problem

In markitdown/converters/_youtube_converter.py, the initial call that retrieves the list of transcripts (ytt_api.list(video_id)) was executed outside the protective try/except block.

Transcript listing and fetching commonly fail due to:

  1. IP Blocking / Rate Limiting: Datacenter/CI IP addresses are regularly blocked or rate-limited by YouTube.
  2. Subtitles Disabled: The target video does not have any manual or auto-generated captions.
  3. Age Gate / Region Restrictions: The video requires authentication cookies.

When list() raises an exception under these conditions, the entire YouTubeConverter.convert method crashes. The orchestrator catches the crash and falls back to HtmlConverter, which yields a bad user experience (raw HTML of the YouTube sign-in page / cookie consent banner) instead of the successfully extracted video metadata (title, views, runtime, description).

Solution

  1. Wrapped the transcript listing operations (list() and language extraction) in the outer try/except block.
  2. If listing or fetching the transcript fails, the converter now handles the exception gracefully: it prints the error, sets the transcript text to the fallback string *(Transcript unavailable)*, and continues execution.
  3. This ensures the output Markdown still contains the successfully parsed video title, views, keywords, runtime, and description.

Wraps YouTube transcript listing and retrieval in a try/except block.
This prevents the converter from crashing and falling back to HtmlConverter
when transcripts are disabled, rate-limited, or blocked.

Instead, the converter now gracefully continues and returns the successfully
extracted video metadata and description.
@Sasivarnasarma

Author

@microsoft-github-policy-service agree
