Yesterday I was working on a personal project where I needed to download the transcript of a YouTube video. I use the unofficial youtube-transcript-api Python library to fetch the transcript for any YouTube video. It worked fine locally, but as soon as I deployed my app on a cloud VM it started throwing an error.
The code to list transcripts for a video is shown below.
from youtube_transcript_api import YouTubeTranscriptApi
video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
The code throws the following error on the cloud VM.
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=w8rYQ40C9xo! This is most likely caused by:
Subtitles are disabled for this video
If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
As described in the GitHub issue https://github.com/jdepoix/youtube-transcript-api/issues/303, the only way to overcome this problem is to use a proxy. Most residential proxies cost somewhere between $5 and $10 per GB of data transfer. Because they are priced per GB, the cost varies with your usage. One of the commenters suggested using a Tor proxy. Tor provides anonymity by routing your internet traffic through a series of volunteer-operated servers, which can help you avoid being blocked by services like YouTube.
There is an open source project, https://github.com/dperson/torproxy, that allows you to run a Tor proxy in a Docker container.
You can run it as follows:
docker run -p 8118:8118 -p 9050:9050 -d dperson/torproxy
This runs the Tor SOCKS5 proxy on port 9050. The command also publishes port 8118, which the image uses for an HTTP proxy.
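Before pointing the app at the proxy, it can help to confirm that the container is actually accepting connections. The snippet below is a minimal sketch using only the standard library. The helper name `is_port_open` is my own, and it only checks that the TCP port is reachable, not that Tor has finished building a circuit.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 9050 is the SOCKS port published by the docker run command above
    print("SOCKS port reachable:", is_port_open("127.0.0.1", 9050))
```

If this prints False, check that the container is running and that the port mapping matches the one in the docker command.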
You can then change your code to use the proxy.
from youtube_transcript_api import YouTubeTranscriptApi
proxies = {
    'http': "socks5://127.0.0.1:9050",
    'https': "socks5://127.0.0.1:9050",
}
video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)
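Two details are worth noting about the proxies dictionary. The library makes its HTTP calls through requests, which needs the PySocks extra installed (`pip install "requests[socks]"`) to handle `socks5://` URLs. With the `socks5` scheme, requests resolves hostnames locally, while `socks5h` hands DNS resolution to the proxy. The helper below is a small sketch that builds the dictionary for either scheme. The name `tor_proxies` and its parameters are my own, not part of the library.

```python
def tor_proxies(host: str = "127.0.0.1", port: int = 9050, remote_dns: bool = True) -> dict:
    """Build a requests-style proxies dict pointing at a local Tor SOCKS proxy.

    remote_dns=True uses the socks5h scheme, so hostnames are resolved by the
    proxy rather than by the local machine.
    """
    scheme = "socks5h" if remote_dns else "socks5"
    url = f"{scheme}://{host}:{port}"
    # The same SOCKS endpoint handles both plain HTTP and HTTPS traffic
    return {"http": url, "https": url}


if __name__ == "__main__":
    print(tor_proxies())
    # Usage with the call shown above (requires the library and a running proxy):
    # YouTubeTranscriptApi.list_transcripts(video_id, proxies=tor_proxies())
```

Whether remote DNS matters depends on your setup. For this use case either scheme should reach YouTube through Tor, but `socks5h` keeps the DNS lookups off the VM's own network.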