Handle YouTube transcript errors and improve metadata extraction #2178
Open
Sasivarnasarma wants to merge 1 commit into
Wraps YouTube transcript listing and retrieval in a try/except block. This prevents the converter from crashing and falling back to HtmlConverter when transcripts are disabled, rate-limited, or blocked. Instead, the converter now gracefully continues and returns the successfully extracted video metadata and description.
Author
@microsoft-github-policy-service agree
Description
This pull request addresses an issue in `YouTubeConverter` where transcript listing failures crash the entire converter, causing it to fall back to `HtmlConverter` and output raw, unreadable scraped HTML (such as cookie walls) instead of clean video details.
Problem
In `markitdown/converters/_youtube_converter.py`, the initial call that retrieves the transcript list (`ytt_api.list(video_id)`) was executed outside the protective `try/except` block.
Transcript listing and fetching commonly fail when transcripts are disabled, rate-limited, or blocked.
When `list()` raises an exception under these conditions, the entire `YouTubeConverter.convert` method crashes. The orchestrator catches the crash and falls back to `HtmlConverter`, which yields a bad user experience (raw HTML of the YouTube sign-in page or cookie consent banner) instead of the successfully extracted video metadata (title, views, runtime, description).
Solution
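- Wraps the transcript listing (`list()` and language extraction) in the outer `try/except` block.
- When an exception is caught, the converter falls back to the placeholder `*(Transcript unavailable)*` and continues execution.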
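
The pattern described above can be sketched as follows. This is a minimal illustration, not the code from this pull request: `convert_video`, `list_transcripts`, `fetch_transcript`, and `blocked_listing` are hypothetical names, and the two callables stand in for the real transcript API calls so the example runs without network access.

```python
from typing import Callable, List


def convert_video(
    title: str,
    description: str,
    list_transcripts: Callable[[], List[str]],
    fetch_transcript: Callable[[str], str],
) -> str:
    """Build Markdown for a video; transcript failures never abort the conversion."""
    # Metadata is collected first and is kept no matter what happens below.
    parts = [f"# {title}", "", "### Description", description]

    try:
        # Listing AND fetching sit inside the same try block, so a failure
        # in either step is handled here instead of escaping the converter.
        languages = list_transcripts()
        transcript = fetch_transcript(languages[0])
    except Exception:
        # Disabled, rate-limited, or blocked transcripts all end up here.
        transcript = "*(Transcript unavailable)*"

    parts += ["", "### Transcript", transcript]
    return "\n".join(parts)


def blocked_listing() -> List[str]:
    # Hypothetical stand-in for a listing call that fails.
    raise RuntimeError("transcript listing blocked")


markdown = convert_video(
    "Example video",
    "An example description.",
    list_transcripts=blocked_listing,
    fetch_transcript=lambda lang: "unused",
)
print(markdown)
```

Because the exception is absorbed inside the converter, a caller that falls back to another converter on failure would never see it, and the metadata section is still returned.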