video disabled due to region lock shows Transcript/subtitle disabled. #213

Open
michaelthwan opened this issue Jun 28, 2023 · 11 comments
Labels
enhancement New feature or request

@michaelthwan

To Reproduce

Steps to reproduce the behavior:

What code / cli command are you executing?

A user tried extracting the transcript of this video:
https://www.youtube.com/watch?v=kZsVStYdmws
The video is available only in some regions (e.g. Hong Kong, Taiwan) and not in others (e.g. United States).
It therefore works locally (Hong Kong), but after deployment to a US server it shows:
Subtitles are disabled for this video

The following code reproduces it. It works when using a VPN for the HK region and fails from the US:

from youtube_transcript_api import YouTubeTranscriptApi

video_id = "kZsVStYdmws"
YouTubeTranscriptApi.list_transcripts(video_id)

Which Python version are you using?

Python 3.10.8

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.0

Expected behavior

Describe what you expected to happen.
I think it is okay that a transcript cannot be fetched from a region where the video is blocked, but the exception is confusing: I spent a while troubleshooting before I understood why it happened.

It is probably because the code reaches the raise TranscriptsDisabled part. Adding one more exception check there might therefore help.

    def _extract_captions_json(self, html, video_id):
        splitted_html = html.split('"captions":')

        if len(splitted_html) <= 1:
            if video_id.startswith('http://') or video_id.startswith('https://'):
                raise InvalidVideoId(video_id)
            if 'class="g-recaptcha"' in html:
                raise TooManyRequests(video_id)
            if '"playabilityStatus":' not in html:
                raise VideoUnavailable(video_id)

            # Here: add the new exception check

            raise TranscriptsDisabled(video_id)

Actual behaviour

It shows "Subtitles are disabled for this video" in a region where the video is blocked, even though subtitles are enabled.

@michaelthwan
Author

I will respect whether you fix it or not. Thanks for handling

@jdepoix
Owner

jdepoix commented Jun 28, 2023

Hi @michaelthwan,
thank you for reporting. I agree: this is not something we can do anything about, but a more descriptive error message would be nice. I am currently a bit short on time to implement this myself, but I will put it on the list and contributions will be very much welcome! 😊

@jdepoix jdepoix added the enhancement New feature or request label Jun 28, 2023
@crhowell
Contributor

crhowell commented Jul 11, 2023

@jdepoix I finally had some down time, taking a look at this issue.

As far as YouTube is concerned, this error is still "Video unavailable" as the main reason, but it comes with subreason text that reads: The uploader has not made this video available in your country

In the browser, in place of the video not loading due to a region lock we get a black background with white text showing:

Video unavailable
The uploader has not made this video available in your country

In the HTML we end up with this to search against

"playabilityStatus":{"status":"UNPLAYABLE","reason":"Video unavailable","errorScreen":{"playerErrorMessageRenderer":{"subreason":{"runs":[{"text":"The uploader has not made this video available in your country"}]}

We could add a new error class such as this, to keep it somewhat in line with what is in YouTube's response:

# file: youtube_transcript_api/_errors.py

class VideoUnplayable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = 'The video has not been made available in your country'

Though it would require another exact string match against the HTML, such as:

def _extract_captions_json(self, html, video_id):
    splitted_html = html.split('"captions":')
    
    if len(splitted_html) <= 1:
        if video_id.startswith('http://') or video_id.startswith('https://'):
            raise InvalidVideoId(video_id)
        if 'class="g-recaptcha"' in html:
            raise TooManyRequests(video_id)
        if '"playabilityStatus":' not in html:
            raise VideoUnavailable(video_id)     
        
        # add something like this
        if 'The uploader has not made this video available in your country' in html:
            raise VideoUnplayable(video_id)

It's a little fragile, but I think you've said before that technically this entire API is unofficial and could break at any time anyway. Let me know what you think. I could open a PR for this and probably add a test case or two while I have some downtime.

@crhowell
Contributor

@jdepoix Interestingly enough, we could also add an age-related error class. It seems we could get around the age restriction, since you can pull a transcript regardless of whether you are logged in or not, but doing that would require adding logic around my findings in #110. Until we have that workaround implemented, we could at least throw an appropriate error in much the same way as for the country/region lock, since the HTML to match on lives in the same spot and looks like this:

"playabilityStatus":{"status":"LOGIN_REQUIRED","reason":"Sign in to confirm your age","errorScreen":{"playerErrorMessageRenderer":{"subreason":{"runs":[{"text":"This video may be inappropriate for some users."}]}

This would let us also sign off #111 until a workaround is implemented.

@jdepoix
Owner

jdepoix commented Jul 23, 2023

Hi @crhowell, thanks for looking into this and sorry for the late reply!
It looks like the data in "playabilityStatus" could generally be useful for providing more helpful exceptions and error messages! We could add an exception type for each status (LoginRequired, VideoUnplayable) that renders playabilityStatus.reason and playabilityStatus.errorScreen.playerErrorMessageRenderer.subreason.runs as part of the error message. However, just looking for a natural-language string in the HTML is definitely too fragile, as it will probably be in a different language depending on the locale. But isn't this part of the JSON we are parsing in json.loads(splitted_html[1].split(',"videoDetails')[0].replace('\n', '')) anyway? In that case we could just check what the status is and throw the corresponding exception, passing in the reason/subreason. If it is not part of the JSON we are currently parsing, I guess we should find a way to parse it, since everything else will be very fragile.
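
A minimal sketch of the per-status exception idea, assuming the playabilityStatus object is already available as a parsed dict. The base class below is a simplified stand-in, not the library's real CouldNotRetrieveTranscript, and STATUS_TO_EXCEPTION and raise_for_status are hypothetical names used only for illustration. Falling back to TranscriptsDisabled for unknown statuses is an assumption that simply mirrors the current behaviour.

```python
class CouldNotRetrieveTranscript(Exception):
    """Simplified stand-in for the library's base exception (assumption)."""

    CAUSE_MESSAGE = ""

    def __init__(self, video_id, reason=None, subreasons=()):
        self.video_id = video_id
        self.reason = reason
        self.subreasons = list(subreasons)
        super().__init__(self._build_message())

    def _build_message(self):
        lines = [
            f"Could not retrieve a transcript for video {self.video_id}: "
            f"{self.CAUSE_MESSAGE}"
        ]
        if self.reason:
            lines.append(f"Reason given by YouTube: {self.reason}")
        # Subreason texts are passed through as-is, whatever their locale.
        lines.extend(f" - {text}" for text in self.subreasons)
        return "\n".join(lines)


class VideoUnplayable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The video is unplayable"


class LoginRequired(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The video requires a login"


class TranscriptsDisabled(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Subtitles are disabled for this video"


# Hypothetical mapping from the machine-readable status to an exception type.
STATUS_TO_EXCEPTION = {
    "UNPLAYABLE": VideoUnplayable,
    "LOGIN_REQUIRED": LoginRequired,
}


def raise_for_status(playability_status, video_id):
    """Raise the exception matching the status; unknown statuses fall back."""
    exception_class = STATUS_TO_EXCEPTION.get(playability_status.get("status"))
    if exception_class is None:
        raise TranscriptsDisabled(video_id)
    renderer = playability_status.get("errorScreen", {}).get(
        "playerErrorMessageRenderer", {}
    )
    runs = renderer.get("subreason", {}).get("runs", [])
    raise exception_class(
        video_id,
        reason=playability_status.get("reason"),
        subreasons=[run.get("text", "") for run in runs],
    )
```

Dispatching on the status value rather than on the reason text keeps the check independent of the locale, while the reason and subreason are still shown to the user for context.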

@crhowell
Contributor

crhowell commented Jul 23, 2023

@jdepoix Well, the logic in there branches based on whether or not splitted_html has an index 1.

Basically, we split the HTML on captions with html.split('"captions":'). If that list has a length less than or equal to 1, we will ALWAYS raise an exception and json.loads never runs.

Otherwise, if the list has more than one element, we try to parse the element at index 1.

But for these specific errors, from what I've inspected via a debug breakpoint, the list has only one element, so we never reach the json.loads side of the branch. We always raise the exception, which leaves us back with the fragile "in html" check.

Let me include a snippet of the full function logic

def _extract_captions_json(self, html, video_id):
    splitted_html = html.split('"captions":')
    if len(splitted_html) <= 1:
        if video_id.startswith('http://') or video_id.startswith('https://'):
            raise InvalidVideoId(video_id)
        if 'class="g-recaptcha"' in html:
            raise TooManyRequests(video_id)
        if '"playabilityStatus":' not in html:
            raise VideoUnavailable(video_id)
        # NOTE: this is where we end up for the errors in this issue.
        raise TranscriptsDisabled(video_id)

    captions_json = json.loads(
        splitted_html[1].split(',"videoDetails')[0].replace('\n', '')
    ).get('playerCaptionsTracklistRenderer')
    if captions_json is None:
        raise TranscriptsDisabled(video_id)

    if 'captionTracks' not in captions_json:
        raise NoTranscriptAvailable(video_id)

    return captions_json

Update
Confirmed that for both the age-restricted video and the country/region-locked video, len(splitted_html) will be 1.
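
The branching described above can be illustrated without any network access. This is only a sketch: the two HTML strings are shortened, made-up stand-ins for real YouTube responses, and has_captions_section is a hypothetical helper name.

```python
def has_captions_section(html):
    # str.split returns a one-element list when the separator is absent,
    # so a length of 1 means there is no '"captions":' key in the page.
    return len(html.split('"captions":')) > 1


# Made-up, shortened samples (assumptions, not real responses).
region_locked_html = (
    '{"playabilityStatus":{"status":"UNPLAYABLE","reason":"Video unavailable"}}'
)
with_captions_html = (
    '{"playabilityStatus":{"status":"OK"},'
    '"captions":{"playerCaptionsTracklistRenderer":{}},"videoDetails":{}}'
)

print(len(region_locked_html.split('"captions":')))  # 1
print(len(with_captions_html.split('"captions":')))  # 2
```

With only one element in the list, the function never reaches json.loads, which is why the region-locked case currently ends in TranscriptsDisabled.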

@michaelthwan
Copy link
Author

You guys are very helpful. Thank you @crhowell @jdepoix

@jdepoix
Owner

jdepoix commented Jul 26, 2023

Hi @crhowell, yeah, that makes sense, but this should be solvable 😊

if len(splitted_html) <= 1:
    if video_id.startswith('http://') or video_id.startswith('https://'):
        raise InvalidVideoId(video_id)
    if 'class="g-recaptcha"' in html:
        raise TooManyRequests(video_id)
    splitted_html = html.split('"playabilityStatus":')
    if len(splitted_html) <= 1:
        raise VideoUnavailable(video_id)

    playability_status_json = json.loads(
        splitted_html[1].split(',"WHAT_EVER_THE_NEXT_PROPERTY_IS')[0].replace('\n', '')
    )

    # ... handle playability_status_json ...

    # fallback if we don't know the status
    raise TranscriptsDisabled(video_id)
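
One possible way to avoid depending on whatever property follows "playabilityStatus" is to let the standard library's json.JSONDecoder.raw_decode parse exactly one JSON value starting right after the key. This is a sketch of an alternative technique, not what the library does; extract_playability_status is a hypothetical helper, and the sample HTML (including the "streamingData" key after the status object) is made up for illustration.

```python
import json


def extract_playability_status(html):
    """Return the parsed playabilityStatus object, or None if the key is absent."""
    marker = '"playabilityStatus":'
    start = html.find(marker)
    if start == -1:
        return None
    index = start + len(marker)
    # raw_decode does not skip leading whitespace, so skip it manually.
    while index < len(html) and html[index].isspace():
        index += 1
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it, so the next property name is irrelevant.
    status, _end = json.JSONDecoder().raw_decode(html, index)
    return status


# Made-up sample resembling the fragment quoted earlier in this thread.
sample_html = (
    'var ytInitialPlayerResponse = {"playabilityStatus":{"status":"UNPLAYABLE",'
    '"reason":"Video unavailable","errorScreen":{"playerErrorMessageRenderer":'
    '{"subreason":{"runs":[{"text":"The uploader has not made this video '
    'available in your country"}]}}}},"streamingData":{}};'
)

playability_status = extract_playability_status(sample_html)
print(playability_status["status"])
```

The result is a plain dict, so the status can be compared directly and the reason/subreason passed on to whichever exception is raised.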

@crhowell
Contributor

crhowell commented Jul 26, 2023

@jdepoix I can put together a first-pass PR for this; I already have a partial solution. I'll test it against the age/region error cases as well as the valid working cases, so we can see what kind of "reason" shows up when everything is working fine and transcripts are retrievable.

I'll tag you for review on it once submitted.

Update
PR #219

Note that this PR is a quick first pass. It is worth testing against more video IDs; I am sure there are some edge cases and more "status" values that we could map to custom errors.

I did a little bit of testing. Let me know what you do or don't like and we can tweak it as necessary. I still need to add a few tests for the helpers, so coverage dropped a tiny bit.

@mihailmariusiondev

I'm experiencing the same problem with this library. My setup:

  • VPS hosted on Hetzner (German IP): Unable to retrieve transcripts
  • Local machine (Spain IP): Can download transcripts without any problem

This suggests that the issue is indeed related to region-based restrictions or how YouTube is responding to requests from different geographic locations. The current error message ("Subtitles are disabled for this video") is misleading and made troubleshooting difficult.

I'm looking forward to the implementation of a more descriptive error handling system that can differentiate between truly disabled subtitles and region-based restrictions. This would greatly improve the user experience and make it easier to diagnose and handle these issues in our applications.

In the meantime, is there a recommended workaround for handling region-locked videos? Would using a proxy or VPN be a viable solution for production environments facing this issue?

@jdepoix
Owner

jdepoix commented Oct 21, 2024

Hi @mihailmariusiondev, this is most likely not an issue of your region being blocked, but of your cloud provider's IP being blocked. Have a look at #303 to find out more about this issue.

Repository owner deleted a comment from hatemmezlini Feb 7, 2025

4 participants