VTT and SRT Files For Videos Using Python

SRT and VTT are common file formats for displaying captions on videos on the internet. If you've ever uploaded a YouTube video, you may be familiar with uploading an SRT file to provide timestamped captions for a video.

What is the basic concept behind SRT and VTT files?

SRT and VTT files are used for providing captions alongside video. While VTT is the newer format, both SRT and VTT provide a mechanism for captioning. VTT is built for the modern browser and can add additional styling and context to captions, whereas SRT is much more limited in its abilities. In particular, VTT cues can be targeted with fairly complex CSS to give the captions a custom look and feel.
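
As a rough illustration of that styling ability (a minimal, hypothetical snippet, not taken from any particular video), a VTT cue can wrap text in a class tag, and a page can then style that class through the ::cue pseudo-element:

WEBVTT

00:00:00.000 --> 00:00:03.000
<c.shout>Hello World!</c>

/* CSS on the page hosting the video */
::cue(.shout) { color: red; text-transform: uppercase; }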

Why use SRT or VTT files?

SRT and VTT files are great for making videos accessible, enabling more people than ever to watch and understand video content. Additionally, with the rise of video platforms like YouTube, TikTok, and Instagram, it has become more common for videos to be accompanied by captions by default, for viewers who may not have their sound on. All in all, it is important to understand what SRT and VTT files can do if you are looking to be a content creator in any form.

Making SRT and VTT files from an audio file in Python

Speech-to-Text API

First, we need to get the transcription of a conversation, along with information about which speaker was speaking when. To do this, we use Whisper API, a powerful speech-to-text API that lets us transcribe and diarize the audio in just a few lines of code, as shown below.

import requests

url = "https://transcribe.whisperapi.com"
headers = {
  'Authorization': 'Bearer YOUR_API_KEY'
}
data = {
  "fileType": "wav",      # default is wav
  "diarization": "true",  # ask for speaker diarization in the response
  "url": "",              # URL pointing to the audio file to transcribe
  "initialPrompt": "",
  "language": "en",
  "task": "transcribe"
}
response = requests.post(url, headers=headers, data=data)

Speech-to-Text API Output Format

The typical output format looks something like the following. This is just an example; your specific output will differ.
{
  "language": "en",
  "text": "Hello World",
  "segments": [{
              "start": 0.0,
              "end": 3.0,
              "text": "Hello World",
              "whole_word_timestamps": 
                  [
                      {"word": "Hello", "start": 0.0, "end": 1.5, "timestamp": 1.5, "probability": 1.0},
                      {"word": " World", "start": 1.5, "end": 3.0, "timestamp": 3.0, "probability": 1.0},
                  ]
              }],
  "diarization": [
      {"startTime": 0.0, "stopTime": 3.0, "speaker": "SPEAKER_00"}
  ]
}
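
Since the response is JSON, we can load it with response.json() and walk the fields directly. A quick sanity check might look like the sketch below (the field names are taken from the example output above):

result = response.json()

print(result["text"])  # the full transcript

# Word-level timestamps inside each segment
for segment in result["segments"]:
    for word in segment["whole_word_timestamps"]:
        print(word["word"], word["start"], word["end"])

# Speaker ranges from diarization
for turn in result["diarization"]:
    print(turn["speaker"], turn["startTime"], turn["stopTime"])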

Combining Speech-to-Text with Diarization

Next is the most complex part of the code: separating the text out into speaker buckets. The basic idea of the code is that we have a bunch of speaker buckets, potentially overlapping. So, we could have:
Speaker 1: 0 seconds to 10 seconds
Speaker 2: 8 seconds to 12 seconds

We need to turn this into:
Speaker 1: 0 seconds to 8 seconds
Overlap: 8 seconds to 10 seconds
Speaker 2: 10 seconds to 12 seconds

Additionally, we need to iterate through each word and check whether the word's start time falls within any of these now non-overlapping speaker buckets. If it does, we put that word into that bucket.
While the code may look complex, we are just figuring out which bucket each word belongs in.

def findSpeakerIndex(word, speakerSegments, i):
  # Starting from index i, advance until we reach the speaker segment
  # whose time range the word's start time falls into.
  index = i
  while index < len(speakerSegments) and word["start"] > speakerSegments[index]["stopTime"]:
    index = index + 1
  return index


def separate_overlaps(speaker_ranges):
  # Sort the speaker ranges by their start time
  speaker_ranges = sorted(speaker_ranges, key=lambda x: x['startTime'])

  # Initialize an empty list to store the separated ranges
  separated_ranges = []

  # Iterate through the speaker ranges
  for i in range(len(speaker_ranges)):
      # If this is the first range, add it to the separated ranges list
      if i == 0:
          separated_ranges.append(speaker_ranges[i])
      else:
          # Get the previous range
          prev_range = separated_ranges[-1]

          # If the current range starts after the previous range ends, add it to the separated ranges list
          if speaker_ranges[i]['startTime'] >= prev_range['stopTime']:
              separated_ranges.append(speaker_ranges[i])
          else:
              # Otherwise, there is an overlap, so split the ranges
              # First, add the part of the previous range that doesn't overlap
              separated_ranges[-1]['stopTime'] = speaker_ranges[i]['startTime']
              # Then add the overlap as a new range
              overlap_range = {'speaker': 'OVERLAP',
                                'startTime': speaker_ranges[i]['startTime'],
                                'stopTime': min(speaker_ranges[i]['stopTime'], prev_range['stopTime'])}
              separated_ranges.append(overlap_range)
              # Finally, add the part of the current range that doesn't overlap
              non_overlap_range = {'speaker': speaker_ranges[i]['speaker'],
                                    'startTime': overlap_range['stopTime'],
                                    'stopTime': speaker_ranges[i]['stopTime']}
              separated_ranges.append(non_overlap_range)
  return separated_ranges


def getSpeakerSegments(res):
  currentSpeakerIndex=0
  speakerSegments=separate_overlaps(res["diarization"])
  speakerSegmentsResults = [{"text": "", "time": -1,
                             "speaker": speakerSegments[i]["speaker"],
                             "startTime": speakerSegments[i]["startTime"],
                             "stopTime": speakerSegments[i]["stopTime"]}
                            for i in range(len(speakerSegments))]


  for segment in res["segments"]:
    for word in segment["whole_word_timestamps"]:
      currentSpeakerIndex=findSpeakerIndex(word, speakerSegments, currentSpeakerIndex)
      if currentSpeakerIndex >= len(speakerSegmentsResults):
        continue
      thisSpeakerSegmentResults=speakerSegmentsResults[currentSpeakerIndex]
      # Record the start time of the first word that falls into this segment
      if thisSpeakerSegmentResults["time"] == -1:
        thisSpeakerSegmentResults["time"]=word["start"]
      thisSpeakerSegmentResults["text"]=thisSpeakerSegmentResults["text"]+word["word"]
  return speakerSegmentsResults

Our Word Bucketing End Results

By the end, the output of our bucketing should look something like the below.

[{"text": "Hello", "startTime": 0.0, "stopTime": 1.5, "speaker": "SPEAKER_00"}, {"text": "WhisperAPI", "startTime": 1.5, "stopTime": 2.0, "speaker": "OVERLAP"}, {"text": " How are you", "startTime": 2.0, "stopTime": 3.0, "speaker": "SPEAKER_01"}]

SRT File Writing in Python

Now that we've written our code for bucketing the text into different speaker segments, let's dive into the specific code for SRT files in Python. The SRT file is formatted as:
1
00:03:30,000 --> 00:03:40,000
Speech-to-text is the best!

More generally, it's:
[segment_number]
[hour_start]:[minute_start]:[second_start],[milliseconds_start] --> [hour_end]:[minute_end]:[second_end],[milliseconds_end]
[text]
[blank line]

Below is some code that takes our speaker segments and goes through them one by one. If a speaker segment contains text and has a recorded start time, we print it out in the required format, with the hour, minute, second, and millisecond broken out.

def format_duration(duration, character):
  # Break a duration in seconds into hours, minutes, seconds, and milliseconds,
  # separated by `character` (a comma for SRT, a period for VTT).
  hours = int(duration) // 3600
  minutes = int(duration) // 60 - hours * 60
  seconds = int(duration) % 60
  milliseconds = int(duration * 1000) - int(duration) * 1000
  return '{:02d}:{:02d}:{:02d}{}{:03d}'.format(hours, minutes, seconds, character, milliseconds)
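
To sanity-check the helper, a start time of 210.5 seconds formats like this:

format_duration(210.5, ",")   # '00:03:30,500' (SRT style)
format_duration(210.5, ".")   # '00:03:30.500' (VTT style)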

def srtFormat(res):
  speakerSegmentsResults=getSpeakerSegments(res)
  for i in range(len(speakerSegmentsResults)):
    if speakerSegmentsResults[i]["text"] != "" and speakerSegmentsResults[i]["time"] != -1:
      print(i+1)
      print(format_duration(float(speakerSegmentsResults[i]["startTime"]), ",") + " --> " + format_duration(float(speakerSegmentsResults[i]["stopTime"]), ","))
      print(speakerSegmentsResults[i]["text"])
      print()

VTT File Writing in Python

Now, let's dive into making a VTT file in Python. The VTT format is similar to the SRT format in many ways, but is much richer in formatting options. Additionally, VTT files use a period rather than a comma before the milliseconds, they don't require the segment number, and blank lines between entries are optional.

For this section, we will show the power of the VTT format by attaching speaker identification and overlap warnings to our audio segments, which we are unable to do in an SRT file. We do this with the voice tag <v> and the class tag <c>, which lets us customize how overlapping text is rendered with CSS.

In the code below, we do almost the same thing we did for the SRT file, except we add a voice tag to incorporate the speaker information we have for each segment and, for any overlapping sections, we mark them with an overlap class tag, allowing us to customize the visual look of the overlapping element. Take a look at the Mozilla Docs to get a better idea of how WebVTT is used in practice.

def vttFormat(res):
    print("WEBVTT")  # VTT files start with this header
    print()
    speakerSegmentsResults=getSpeakerSegments(res)
    for i in range(len(speakerSegmentsResults)):
        if speakerSegmentsResults[i]["text"] and speakerSegmentsResults[i]["time"] != -1:
            # Format the time stamps
            start_time = format_duration(float(speakerSegmentsResults[i]["startTime"]), ".")
            stop_time = format_duration(float(speakerSegmentsResults[i]["stopTime"]), ".")

            # Print the time range
            print(f"{start_time} --> {stop_time}")

            # Handle the speaker and overlap case
            speaker = speakerSegmentsResults[i].get("speaker", "")
            if speaker == "OVERLAP":
                print("<c.overlap>Overlapping conversation:")
                print(speakerSegmentsResults[i]["text"] + "</c>")  # Close the overlap tag
            elif speaker:
                print(f"<v {speaker}>", end="")  # Open the speaker tag
                print(speakerSegmentsResults[i]["text"] + f"</v>")  # Close the speaker tag
            else:
                print(speakerSegmentsResults[i]["text"])
            print()  # Blank line after each segment
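
For the example speaker segments shown earlier, the printed VTT output would look roughly like this:

WEBVTT

00:00:00.000 --> 00:00:01.500
<v SPEAKER_00>Hello</v>

00:00:01.500 --> 00:00:02.000
<c.overlap>Overlapping conversation:
WhisperAPI</c>

00:00:02.000 --> 00:00:03.000
<v SPEAKER_01> How are you</v>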

Printing Our Final Results

Now, finally, we can output our captions with a simple function call, utilizing the output from the Whisper API speech-to-text request and our handy SRT and VTT formatting functions.

srtFormat(response.json())
vttFormat(response.json())

At the very end, you can always redirect the output of these functions into a file with a .srt or .vtt extension, giving you a ready-to-use SRT or VTT file!
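
Since srtFormat and vttFormat print to standard output, one simple approach (a minimal sketch using Python's contextlib.redirect_stdout; the captions.srt and captions.vtt filenames are just placeholders) is to redirect those prints into files:

import contextlib

result = response.json()

# Redirect the printed SRT output into a .srt file
with open("captions.srt", "w") as srt_file, contextlib.redirect_stdout(srt_file):
    srtFormat(result)

# Redirect the printed VTT output into a .vtt file
with open("captions.vtt", "w") as vtt_file, contextlib.redirect_stdout(vtt_file):
    vttFormat(result)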

Conclusion

We have now put together both a VTT and an SRT file in Python using built-in functions and a powerful speech-to-text API. Now that you've put together your first VTT and SRT files in Python, you should be able to build from here and create wonderful transcription experiences for your next video!