Subtitles On The Go

Published

October 8, 2022

Download the video from YouTube:

yt-dlp "https://www.youtube.com/watch?v=q4tpg6mkBTs"

Extract the audio (I changed the name of the file to nims.webm):

ffmpeg -i nims.webm -vn -ab 128k -ar 44100 -y nims.mp3
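
As an aside, yt-dlp can extract the audio itself with its -x flag, which would fold these two steps into one; something like this should work (the output template is my choice):

yt-dlp -x --audio-format mp3 -o "nims.%(ext)s" "https://www.youtube.com/watch?v=q4tpg6mkBTs"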

Set up OpenAI Whisper… OK, so this is non-trivial. For my workflow I'm renting an A100-80G Linux desktop from Paperspace. I logged into the machine and ran sudo apt update so I could install pip and git. Then I installed Whisper (https://github.com/openai/whisper):

pip install git+https://github.com/openai/whisper.git 
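
For reference, the system-level setup was roughly the following; exact package names may differ depending on the image, and note that Whisper needs ffmpeg on the machine to decode audio:

sudo apt update
sudo apt install -y python3-pip git ffmpeg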

There are numerous ways to get the audio file onto the other machine. I chose to send it over with Magic Wormhole.
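
With magic-wormhole installed on both ends (pip install magic-wormhole), it's roughly:

# on my laptop
wormhole send nims.mp3

# on the Paperspace machine, enter the code the sender prints
wormhole receive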

I ran the default script provided in the repo:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

I got a transcript in Japanese and pasted it into DeepL for a translation. It was decent, but with mistakes obvious even to someone who doesn't speak Japanese. I knew Whisper should be able to do translation itself, so I just had to figure it out.

Turns out the CLI has a lot of parameters, one of which is translation. The bad news was that my CLI installation wasn't working, so I had to figure out how to do it in Python. Or rather copy-paste it, which is what I did after searching the Whisper repo's Discussions page. With that I cut DeepL out of the pipeline, and what's more, the translation was even better.
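
The key bit is the task argument to transcribe, which is the same call my final script below uses:

import whisper

model = whisper.load_model("large")
# task="translate" tells Whisper to output English directly
# instead of a Japanese transcript
result = model.transcribe("nims.mp3", task="translate")
print(result["text"])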

Next I needed actual subtitles - another feature tied to the CLI installation. I ended up copying and modifying two utility functions from the Whisper repo into my own script to produce an SRT (subtitle) file. This is my final script:

import whisper

def format_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = '.'):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"

model = whisper.load_model("large")
result = model.transcribe("nims.mp3", task="translate")
#print(result["text"])

with open("script.srt", "w+") as f:

  for i, segment in enumerate(result['segments'], start=1):
    print(
            f"{i}\n"
            f"{format_timestamp(segment['start'], always_include_hours=True, decimal_marker=',')} --> "
            f"{format_timestamp(segment['end'], always_include_hours=True, decimal_marker=',')}\n"
            f"{segment['text'].strip().replace('-->', '->')}\n",
            file = f,
            flush = True,
        )
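
Each entry in the resulting script.srt then looks like this (the dialogue below is made up for illustration):

1
00:00:00,000 --> 00:00:04,000
First translated line of dialogue

2
00:00:04,000 --> 00:00:07,500
Second translated line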

I sent the script back to my computer and then called up ffmpeg to do the honors of generating the video with subtitles:

ffmpeg -i nims.webm -vf subtitles=nims_script.srt output.mp4
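
That re-encodes the video with the subtitles burned in. If you'd rather keep them as a selectable track, muxing everything into an MKV without re-encoding should also work (a sketch; I didn't need it here):

ffmpeg -i nims.webm -i nims_script.srt -c copy output.mkv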

There’s a popular AI YouTube channel, Two Minute Papers, where the host has his own catchphrase: “What a time to be alive!” That’s what I said after being able to watch and understand what was going on in the video. (For QA it helped that the subject and topic were obvious.) It took me a while to get the initial pipeline done, but now that it’s set, I can probably go from video to video with English subtitles in less than 5 minutes, depending on the size of the video.

What a time to be alive.