SRT and VTT are common file formats for displaying captions on videos across the internet. If you've ever uploaded a YouTube video, you may be familiar with attaching an SRT file to provide timestamped captions.
Both formats pair timestamped caption text with video, but VTT is the newer of the two and was built for the modern browser: it can add extra styling and context to captions, including fairly complex CSS for a custom caption look and feel, whereas SRT is much more limited in its abilities.
SRT and VTT files are great for making videos accessible, enabling more people than ever to watch videos and follow the audio. Additionally, with the rise of video platforms like YouTube, TikTok, and Instagram, it has become common for videos to be captioned by default for viewers who watch with the sound off. All in all, it is important to understand what SRT and VTT files can do if you are looking to be a content creator in any form.
First, we need the transcription of a conversation, along with information about which speaker was speaking when. To do this, we use Whisper API, a powerful speech-to-text API that lets us transcribe and diarize the audio in just a few lines of code.
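The exact request depends on the speech-to-text provider you use, so rather than guess at a specific SDK, here is the response shape the rest of this walkthrough assumes: word-level timestamps plus diarized speaker ranges. The field names below are illustrative assumptions, not any particular provider's schema.

```python
# Hypothetical shape of a diarized speech-to-text response.
# Field names ("words", "speakers", "start", "end") are assumptions chosen
# for this tutorial, not a specific provider's schema.
transcription = {
    # Each word with the time (in seconds) at which it was spoken.
    "words": [
        {"word": "Speech-to-text", "start": 0.4},
        {"word": "is", "start": 1.1},
        {"word": "the", "start": 1.3},
        {"word": "best!", "start": 1.5},
    ],
    # Diarization output: which speaker was active over which time range.
    # Note that these ranges can overlap.
    "speakers": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 10.0},
        {"speaker": "Speaker 2", "start": 8.0, "end": 12.0},
    ],
}
```

Everything that follows only relies on this structure, so you can adapt it to whatever your transcription service actually returns.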
Next is the most complex part of the code: separating the text out into speaker buckets. The basic idea of the code is that we have a bunch of speaker buckets, potentially overlapping. So, we could have:
Speaker 1: 0 seconds to 10 seconds
Speaker 2: 8 seconds to 12 seconds
We need to turn this into:
Speaker 1: 0 seconds to 8 seconds
Overlap: 8 seconds to 10 seconds
Speaker 2: 10 seconds to 12 seconds
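The splitting step above can be sketched as follows. The idea is to collect every segment boundary, then label each slice between consecutive boundaries by how many speakers are active in it (the function and tuple layout here are our own, not from a library):

```python
def split_overlaps(segments):
    """Turn possibly-overlapping (speaker, start, end) segments into
    non-overlapping buckets, labeling multi-speaker slices as "Overlap"."""
    # Every start/end time is a potential boundary between buckets.
    points = sorted({t for _, start, end in segments for t in (start, end)})
    buckets = []
    for start, end in zip(points, points[1:]):
        # A speaker is active in this slice if their range intersects it.
        active = [spk for spk, s, e in segments if s < end and e > start]
        if len(active) == 1:
            buckets.append((active[0], start, end))
        elif len(active) > 1:
            buckets.append(("Overlap", start, end))
        # Slices where nobody speaks are simply dropped.
    return buckets
```

Running this on the example above turns `[("Speaker 1", 0, 10), ("Speaker 2", 8, 12)]` into the three non-overlapping buckets listed.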
Next, we iterate through each word and check whether the word's start time falls within any of these now non-overlapping speaker buckets. If it does, we put that word into that bucket.
While it may look complex, we are just figuring out which bucket each word should be in.
By the end, the output of our bucketing should look something like the below.
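A minimal sketch of this word-bucketing step, assuming words arrive as `(text, start_time)` pairs and buckets as `(speaker, start, end)` tuples (these names and shapes are our own choices):

```python
def bucket_words(words, buckets):
    """Assign each word to the bucket whose time range contains its start time."""
    segments = [{"speaker": spk, "start": s, "end": e, "words": []}
                for spk, s, e in buckets]
    for text, start in words:
        for seg in segments:
            if seg["start"] <= start < seg["end"]:
                seg["words"].append(text)
                break
    # Join each bucket's words into a single caption line.
    for seg in segments:
        seg["text"] = " ".join(seg["words"])
    return segments

buckets = [("Speaker 1", 0, 8), ("Overlap", 8, 10), ("Speaker 2", 10, 12)]
words = [("Hello", 0.5), ("there", 1.2), ("sorry,", 8.3),
         ("go", 8.9), ("ahead", 10.4)]
segments = bucket_words(words, buckets)
```

Each resulting segment carries a speaker label, a start and end time, and the joined caption text, which is exactly what the SRT and VTT writers below need.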
Now that we've written our code for bucketing the text into different speaker segments, let's dive into the specific code for SRT files in Python. The SRT file is formatted as:
1
00:03:30,000 --> 00:03:40,000
Speech-to-text is the best!
More generally, it's:
[segment_number]
[hour_start]:[minute_start]:[second_start],[milliseconds_start] --> [hour_end]:[minute_end]:[second_end],[milliseconds_end]
[text]
[blank line]
Below is some code that walks through our speaker segments one by one. If a segment has nonzero length and contains text, we print it in the format above, with the hour, minute, second, and millisecond broken out.
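A sketch of that SRT writer, assuming segments are dicts with `speaker`, `start`, `end`, and `text` keys (a shape we chose for this tutorial):

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments):
    """Render segments as SRT: number, timestamp range, text, blank line."""
    lines = []
    number = 1
    for seg in segments:
        if seg["end"] <= seg["start"] or not seg["text"]:
            continue  # skip zero-length or empty segments
        lines.append(str(number))
        lines.append(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}")
        lines.append(seg["text"])
        lines.append("")  # blank line terminates each entry
        number += 1
    return "\n".join(lines)
```

Note that the segment numbers are renumbered as we go, so skipped empty segments don't leave gaps in the sequence.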
Now, let's dive into making a VTT file in Python. In many ways it is similar to the SRT format, but it is much richer in formatting options. VTT files also differ in a few small ways: they use a period instead of a comma before the milliseconds, they don't require a segment number, and the blank lines between entries are optional. For this section, we will show off the power of the VTT format by attaching speaker identification and overlap warnings to our audio segments, which an SRT file cannot do. We do this with the voice tag <v>, which labels a cue with its speaker, and the class tag <c>, which lets us style overlapping text with CSS. In the below code, we do almost the same thing we did for the SRT file, except we add a voice tag to incorporate the speaker information we have for each segment and, for any overlapping sections, mark them with an overlap class tag so the overlapping cue can be styled visually. Take a look at the Mozilla Docs to get a better idea of how WebVTT is used in practice.
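A sketch of the VTT writer, using the same segment-dict shape as the SRT version (the `overlap` class name is our own choice; your CSS would target it via `::cue(.overlap)`):

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (VTT uses a period before milliseconds)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def to_vtt(segments):
    """Render segments as WebVTT, tagging speakers and overlapping speech."""
    lines = ["WEBVTT", ""]  # every VTT file starts with this header
    for seg in segments:
        if seg["end"] <= seg["start"] or not seg["text"]:
            continue
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        if seg["speaker"] == "Overlap":
            # Class tag: lets CSS style overlapping speech distinctly.
            lines.append(f"<c.overlap>{seg['text']}</c>")
        else:
            # Voice tag: attaches the speaker's name to the cue.
            lines.append(f"<v {seg['speaker']}>{seg['text']}</v>")
        lines.append("")
    return "\n".join(lines)
```

The only structural differences from the SRT writer are the `WEBVTT` header, the period in the timestamps, the dropped segment numbers, and the `<v>`/`<c>` tags around the cue text.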
Now, finally, we can output our files with a simple function call, using the output from Whisper API speech-to-text and our handy VTT and SRT formatters.
At the very end, you can write the results of that function directly to a file with a .srt or .vtt extension, giving you a finished SRT or VTT file!
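Writing the rendered strings out can be as simple as this (the function name and `stem` parameter are our own):

```python
from pathlib import Path

def write_caption_files(srt_text, vtt_text, stem="captions"):
    """Save rendered caption strings as side-by-side .srt and .vtt files."""
    Path(f"{stem}.srt").write_text(srt_text, encoding="utf-8")
    Path(f"{stem}.vtt").write_text(vtt_text, encoding="utf-8")
```

Pass it the strings produced by your SRT and VTT formatters and you get both caption files ready to upload alongside your video.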
We have now put together both a VTT and an SRT file in Python using built-in functions and a powerful speech-to-text API. Now that you've created your first VTT and SRT files in Python, you should be able to build from here and create wonderful transcription experiences for your next video!