r"""
Convert YouTube subtitles (vtt) to human-readable text.

Download only subtitles from YouTube with youtube-dl:
youtube-dl --skip-download --convert-subs vtt <video_url>

Note that default subtitle format provided by YouTube is ass, which is hard
to process with simple regex. Luckily youtube-dl can convert ass to vtt, which
is easier to process.

To convert all vtt files inside a directory:
find . -name "*.vtt" -exec python vtt2text.py {} \;
"""
import sys
import re


def remove_tags(text):
    """
    Remove vtt markup tags
    """
    tags = [
        r'</c>',
        r'<c(\.color\w+)?>',
        r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
    ]

    for pat in tags:
        text = re.sub(pat, '', text)

    # extract timestamp, only keep HH:MM
    text = re.sub(
        r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
        r'\g<1>',
        text
    )

    # blank out whitespace-only lines
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    return text


def remove_header(lines):
    """
    Remove vtt file header
    """
    pos = -1
    for mark in ('##', 'Language: en',):
        if mark in lines:
            pos = lines.index(mark)
    # keep everything after the last header mark found (or all lines if none)
    lines = lines[pos+1:]
    return lines


def merge_duplicates(lines):
    """
    Remove duplicated subtitles. Duplicates are always adjacent.
    """
    last_timestamp = ''
    last_cap = ''
    for line in lines:
        if line == "":
            continue
        if re.match(r'^\d{2}:\d{2}$', line):
            # timestamp line: emit only when the HH:MM value changes
            if line != last_timestamp:
                yield line
                last_timestamp = line
        else:
            # caption line: emit only when it differs from the previous caption
            if line != last_cap:
                yield line
                last_cap = line


def merge_short_lines(lines):
    """
    Join consecutive captions into lines of roughly 80 characters.
    """
    buffer = ''
    for line in lines:
        if line == "" or re.match(r'^\d{2}:\d{2}$', line):
            yield '\n' + line
            continue

        if len(line+buffer) < 80:
            buffer += ' ' + line
        else:
            yield buffer.strip()
            buffer = line
    yield buffer


def main():
    vtt_file_name = sys.argv[1]
    # the dot is escaped so that only a literal ".vtt" suffix is replaced
    txt_name = re.sub(r'\.vtt$', '.txt', vtt_file_name)
    with open(vtt_file_name) as f:
        text = f.read()
    text = remove_tags(text)
    lines = text.splitlines()
    lines = remove_header(lines)
    lines = merge_duplicates(lines)
    lines = list(lines)
    lines = merge_short_lines(lines)
    lines = list(lines)

    with open(txt_name, 'w') as f:
        for line in lines:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    main()
Does downloading subtitles still work?
youtube-dl -o ytdl-subs --skip-download --write-sub --sub-format vtt has no effect: no text files are written.
I had to use youtube-dl --write-auto-sub --convert-subs=srt --skip-download URL
See also (WIP): https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh
When I run this with the asterisk, the program converts only one file, not all of them.
Use a for loop, or:
find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
Reference: https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh
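The for-loop suggestion can also be written in Python instead of the shell. The following is only a sketch: it assumes the gist is saved as vtt2text.py in the current directory, and find_vtt_files and convert_all are hypothetical helper names that are not part of the gist.

```python
import subprocess
import sys
from pathlib import Path


def find_vtt_files(root="."):
    """Return every .vtt file under root (case-insensitive match), sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".vtt"
    )


def convert_all(root=".", script="vtt2text.py"):
    """Run the converter once per file, one process per .vtt file."""
    for path in find_vtt_files(root):
        # check=True stops at the first file the converter fails on
        subprocess.run([sys.executable, script, str(path)], check=True)
```

Calling convert_all(".") from a Python prompt started in the directory that holds the subtitles should then produce one .txt file per .vtt file, since the gist writes its output next to the input.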
find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
How do I run this? Sorry, I'm still learning; I feel like a script kiddie.
Well, you know what a script kiddie is, so you're halfway there! Not sure this is the place to have this conversation, so hit me up on Discord operat0r#1379 or 404.647.4250 -RMcCurdy.com
@claudchereji It's a command for a Linux terminal. It's also not hard to modify the Python script so that it handles multiple files.
I had trouble with international characters when running this script under Python 3 (it works under Python 2). It seems YouTube doesn't use UTF-8 for everything. Passing encoding='iso-8859-1' when opening the vtt file, which preserves the bytes, fixed this for me. I plan to fork the gist.
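One way to apply the encoding fix without breaking files that really are UTF-8 is to try UTF-8 first and fall back only on a decode error. This is a minimal sketch, not the code from the gist or from any fork; read_vtt is a hypothetical helper that would stand in for the open()/read() pair in main().

```python
def read_vtt(path, fallback="iso-8859-1"):
    """Read a subtitle file as UTF-8, falling back to a single-byte encoding."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        # With the default fallback every byte maps to some code point, so this
        # read does not raise, but text in other encodings may come out garbled.
        with open(path, encoding=fallback) as f:
            return f.read()
```

If the output should stay byte-for-byte faithful to the input, the .txt file would need to be written with the same encoding that was used for reading.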
My fork is at https://gist.github.com/xloem/f7ecb8668c14ef07718b4d3447ebe9a2 . This fork handles unexpected encodings and multiple vtt files (@claudchereji ). If people work on this further I request somebody make a git repository for it to track the work.
Kudos for the awesome work. Just a question: how do I make it remove the timestamp altogether? I don't even want the HH:MM.
Thanks
It looks like timestamp output is produced by line 66 in this file (yield line after matching a time format), not sure.
I am also seeking a way to remove the timestamp. I'm very new to python so I am struggling to follow where I can tweak the code without breaking it. But I think it's falling off somewhere because it's removing duplicates. I tried making another def later on with re.sub but no dice.
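Since the script reduces each timestamp to a standalone HH:MM line, one possible approach is to filter those lines out between merge_duplicates and merge_short_lines. The sketch below assumes exactly that line format; drop_timestamps is a hypothetical helper, not part of the gist.

```python
import re

# A line consisting only of HH:MM, the form in which the script emits timestamps.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}$")


def drop_timestamps(lines):
    """Yield caption lines only, skipping standalone HH:MM lines."""
    for line in lines:
        if not TIMESTAMP.match(line):
            yield line
```

In main() this would mean wrapping the existing call, for example lines = drop_timestamps(merge_duplicates(lines)). Because merge_short_lines starts a new paragraph at each timestamp, removing them first should yield one continuous block of text.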
Alternative is https://github.com/vuslatx/vtt-to-plain-text
Working great.
This looks like what I want but I am not sure of how to use it.
If you want to join me on a stream, we can walk through it and record a podcast/video for HackerPublicRadio.org! Just hit me up sometime: freeload01____yahoo.com
Thanks a lot for the script @glasslion.
Just found out this script after I made this one:
https://gist.github.com/arturmartins/1c78de3e8c21ffce81a17dc2f2181de4
Might be of help to some.
Would a command-line tool with interface below be welcome?
yt-text bZ6pA--F3D4 > subtitles.txt
or better with full URL?
yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt
Yes, it would be 😁
EDIT: For anyone interested, https://gist.github.com/epogrebnyak/ba87ba52f779f7ebd93b04b2af1059aa
Hi everyone, wrapped this script here: https://github.com/epogrebnyak/justsubs
Sample usage:
from justsubs import Video
subs = Video("KzWS7gJX5Z8").subtitles(language="en-uYU-mmqFLq8")
subs.download()
print(subs.get_text_blocks()[:10])
print(subs.get_plain_text()[:550])
It seems simply "en" does not work, need "en-uYU-mmqFLq8".
Also pip install justsubs should work
For YouTube subtitles, there were some timestamps and metadata remaining while using the script.
I've fixed it here:
https://gist.github.com/florentroques/c08bbe54fba42ec56c9d48229ed9c49b
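For readers who hit the same leftovers, a stricter filter is one option. The sketch below is not the code from the fork linked above. It assumes the leftovers are full cue timing lines (start --> end, with or without cue settings) and header lines beginning with WEBVTT, Kind: or Language:, and strip_metadata is a hypothetical helper name.

```python
import re

# Cue timing line: start --> end, optionally followed by cue settings.
# The hours field is optional in WebVTT timestamps.
CUE_TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
# Header lines assumed to appear at the top of YouTube vtt files.
HEADER = re.compile(r"^(WEBVTT|Kind:|Language:)")


def strip_metadata(lines):
    """Return lines without cue timing lines and assumed header lines."""
    return [
        line for line in lines
        if not CUE_TIMING.match(line) and not HEADER.match(line)
    ]
```

Note that this drops timestamps entirely rather than keeping HH:MM, and a caption that happens to start with "Language:" would be dropped too, so it trades some precision for simplicity.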