Python Speech-to-Text Tutorial

Speech-to-text has grown in popularity with the rise of Large Language Models and Artificial Intelligence, and the need for voice-to-text has grown with it. However, getting started with speech-to-text can feel complicated, especially if you don't have a background in standing up scalable, accurate speech models.

Enter Whisper API: The Best Speech-to-Text API

Whisper API is an API offering based on the ASR solution that OpenAI released in 2022. OpenAI Whisper is a groundbreaking solution for speech-to-text, offering strong accuracy, speed, and affordability. For more information on its affordability and accuracy, check out this piece that dives into accuracy and pricing of Whisper API vs other services. Plus, Whisper is open-source, enabling users to spin up their own servers when they reach scale. When you are starting out, it often makes sense to use an external API. If you are thinking about bringing speech-to-text in-house, then open source may be the best option for your needs. If that's of interest, check out this piece on open source STT accuracy benchmarks.

This article walks through how to use an external API for speech-to-text in Python. If you are looking to stand up your own infrastructure, check out What is OpenAI Whisper for details on that.

Diving into Python Speech-to-Text

Here is a complete API call to do speech-to-text:

import requests

url = "https://transcribe.whisperapi.com"
headers = {
    "Authorization": "Bearer YOUR_API_KEY"
}
data = {
    "fileType": "wav",  # default is wav
    "diarization": "true",
    "numSpeakers": "",
    "url": "",
    "callbackURL": "",
    "initialPrompt": "",
    "language": "",
    "task": "transcribe"
}
# Open the audio in binary mode; the with block closes the file after the upload
with open("YOUR_FILE_PATH", "rb") as audio_file:
    response = requests.post(url, headers=headers, data=data, files={"file": audio_file})

Yes, that's really it! In fact, none of the empty-string parameters are necessary for the API call. The result of this API call will look something like:

{
  "language": "en",
  "text": "Hello World",
  "segments": [
    {
      "start": 0.0,
      "end": 3.0,
      "text": "Hello World",
      "whole_word_timestamps": [
        {"word": "Hello", "start": 0.0, "end": 1.5, "timestamp": 1.5, "probability": 1.0},
        {"word": " World", "start": 1.5, "end": 3.0, "timestamp": 3.0, "probability": 1.0}
      ]
    }
  ],
  "diarization": [
    {"startTime": 0.0, "stopTime": 3.0, "speaker": "SPEAKER_00"}
  ]
}

Breaking Down The Parameters for Python Speech-to-Text API

File

The API accepts either a url parameter or a file. Specifying the file uploads an audio file from your local computer to the API.

File Type

The format of the audio file sent to the API, such as WAV or MP3. The default is WAV.

Diarization

This parameter sets whether a diarization of the audio file should be returned. The diarization lists the different speaker segments and the time range of each segment.

Num Speakers

This parameter hints to the model how many speakers there are for diarization. If voices sound similar, this can help the model decide when to separate or group speakers.
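To make the diarization output concrete, here is a minimal sketch that labels each transcribed segment with the speaker whose range overlaps it the most. It assumes the response layout from the sample response earlier in this article (segments with start and end, diarization entries with startTime, stopTime, and speaker). The helpers overlap and assign_speakers are hypothetical names for illustration, not part of the API.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(result):
    """Pair each segment's text with the speaker that overlaps it the longest.

    Assumes the response layout from the sample response. Returns a list of
    (speaker, text) tuples; speaker is None when no diarization range overlaps.
    """
    labeled = []
    for segment in result.get("segments", []):
        best_speaker, best_overlap = None, 0.0
        for turn in result.get("diarization", []):
            shared = overlap(segment["start"], segment["end"],
                             turn["startTime"], turn["stopTime"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append((best_speaker, segment["text"]))
    return labeled


sample = {
    "segments": [{"start": 0.0, "end": 3.0, "text": "Hello World"}],
    "diarization": [{"startTime": 0.0, "stopTime": 3.0, "speaker": "SPEAKER_00"}],
}
print(assign_speakers(sample))  # [('SPEAKER_00', 'Hello World')]
```

Picking the speaker with the largest overlap is only one possible rule; for audio with frequent interruptions, matching at the word level may give cleaner results.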

URL

This is the other way to send an audio file to the API: provide a publicly accessible URL instead of uploading a file.

Callback URL

This parameter allows for the results of the API to be sent to an external URL, which is great for asynchronous workflows.
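As a rough sketch of the receiving side of such an asynchronous workflow, the snippet below uses only the Python standard library to accept a callback. It rests on assumptions the article does not confirm: that the API delivers the result as an HTTP POST whose body is the same JSON shown in the sample response, and that port 8000 is reachable from the API. The names parse_callback, CallbackHandler, and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(body: bytes) -> dict:
    """Decode a callback body, assuming it is UTF-8 encoded JSON."""
    return json.loads(body.decode("utf-8"))


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: results arrive as a POST request with a JSON body.
        length = int(self.headers.get("Content-Length", 0))
        result = parse_callback(self.rfile.read(length))
        print("Transcript:", result.get("text"))
        # Acknowledge receipt so the sender does not treat the delivery as failed.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block and handle callbacks on the given port (hypothetical default)."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Calling serve() starts the listener; the public address of that server is what you would pass as callbackURL. A production setup would also want authentication and HTTPS, which this sketch leaves out.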

Initial Prompt

This parameter lets you give the model a hint about the content. For example, if someone says DALL·E, the model might mistake that for Dolly. The initial prompt could provide context like "we are talking about DALL·E".

Language

This parameter tells the model the language of the speech, so that the language detection step can be skipped.

Task

This can be either transcribe (the default) or translate, which translates from the source language into English.
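Since the empty-string parameters are optional, one way to keep requests tidy is to build the form fields with a small helper that drops anything left empty. This is a minimal sketch: build_payload is a hypothetical helper, and the value format for numSpeakers (a numeric string here) is an assumption, since the article only shows it empty.

```python
VALID_TASKS = {"transcribe", "translate"}


def build_payload(**params):
    """Build the form fields for a request, omitting empty parameters."""
    task = params.get("task") or "transcribe"
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    # Keep only parameters that carry a value; empty strings and None are dropped.
    return {name: value for name, value in params.items() if value not in ("", None)}


payload = build_payload(
    fileType="mp3",
    diarization="true",
    numSpeakers="2",  # assumed format; the article only shows this field empty
    initialPrompt="We are talking about DALL·E",
    language="",  # left empty so the API detects the language itself
    task="translate",
)
print(payload)
```

The resulting dictionary can be passed as the data argument of requests.post in the call shown at the start of this article.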

Breaking Down The Response from the Python Speech-to-Text API

Language

This is the language the API detected during transcription.

Text

This is the text from the transcription of the audio from the API.

Segments

These are the chunks of audio that are individually transcribed by the API.
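Because each segment carries its own text and timing, segments map naturally onto subtitles. The sketch below converts segments into SRT subtitle text. It assumes the start, end, and text keys from the sample response; srt_timestamp and segments_to_srt are hypothetical helper names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as SRT subtitle text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text'].strip()}\n"
        )
    # SRT separates subtitle blocks with a blank line.
    return "\n".join(blocks)


segments = [{"start": 0.0, "end": 3.0, "text": "Hello World"}]
print(segments_to_srt(segments))
```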

Start and End

These are the starting and ending timestamps of a segment or word.

Whole Word Timestamps

These are the individual timestamps for each word within a segment.
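Word-level timestamps make it possible to jump to the moment a term is spoken. Here is a minimal sketch that looks up every occurrence of a word, assuming the whole_word_timestamps layout from the sample response; find_word_times is a hypothetical helper.

```python
def find_word_times(result, target):
    """Return (start, end) pairs for every occurrence of target in the transcript.

    Matching ignores case and surrounding whitespace, since words in the
    sample response can carry a leading space (for example " World").
    """
    wanted = target.strip().casefold()
    hits = []
    for segment in result.get("segments", []):
        for word in segment.get("whole_word_timestamps", []):
            if word["word"].strip().casefold() == wanted:
                hits.append((word["start"], word["end"]))
    return hits


sample = {
    "segments": [{
        "start": 0.0,
        "end": 3.0,
        "text": "Hello World",
        "whole_word_timestamps": [
            {"word": "Hello", "start": 0.0, "end": 1.5, "timestamp": 1.5, "probability": 1.0},
            {"word": " World", "start": 1.5, "end": 3.0, "timestamp": 3.0, "probability": 1.0},
        ],
    }]
}
print(find_word_times(sample, "world"))  # [(1.5, 3.0)]
```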

Probability

The likelihood that the word was transcribed correctly. This is a value between 0 and 1, inclusive.
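One practical use of the probability is to flag words that may need human review. The sketch below collects words whose probability falls under a cutoff. The 0.6 threshold is an arbitrary illustrative value, not a recommendation from the API, and low_confidence_words is a hypothetical helper.

```python
def low_confidence_words(result, threshold=0.6):
    """Return (word, probability) pairs for words scored below threshold."""
    flagged = []
    for segment in result.get("segments", []):
        for word in segment.get("whole_word_timestamps", []):
            if word["probability"] < threshold:
                flagged.append((word["word"].strip(), word["probability"]))
    return flagged


sample = {
    "segments": [{
        "whole_word_timestamps": [
            {"word": "Hello", "start": 0.0, "end": 1.5, "probability": 0.98},
            {"word": " Dolly", "start": 1.5, "end": 3.0, "probability": 0.42},
        ],
    }]
}
print(low_confidence_words(sample))  # [('Dolly', 0.42)]
```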

What Can I Do with Speech-to-Text APIs in Python?

Anything you'd like! Python is one of the most used programming languages in the world, and speech-to-text is becoming even more important with the recent advances in Artificial Intelligence and Large Language Models. With a speech-to-text API, you can create voice assistants, video transcribers and analyzers, translators, and more. There is no end to what you can build in Python, and now no end to the ways you can bring audio into your applications.

What are Some Open-Source Speech-to-Text Tools in Python?

Whisper

OpenAI Whisper was one of the more groundbreaking open-source additions to the ASR and speech-to-text market. It has enabled companies to spin up amazingly accurate and reliable speech-to-text offerings for commercial use.

Faster Whisper

Faster Whisper is an amazing improvement to the OpenAI model, enabling the same accuracy from the base model at much faster speeds via intelligent optimizations to the model.

Pyannote Audio

Pyannote Audio is a best-in-class open-source diarization library for speech. With great accuracy and active development, this is a great solution for augmenting existing ASR solutions.

Microsoft Presidio

Presidio is an open-source PII (Personally Identifiable Information) redaction library. It can augment an existing speech-to-text API by sanitizing transcripts, i.e. removing sensitive information.

PaddleNLP

PaddleNLP is a powerful library for sentiment analysis, text classification, and more, which can be great for post-processing audio transcription.

Concluding Speech-to-Text with Python

Speech-to-text in Python has become easier to accomplish with the rise of OpenAI Whisper's speech-to-text model and the many open-source ASR tools available to augment that experience. Whether you are a beginner or an advanced Python user, speech-to-text in Python is approachable given the ease of use offered by the speech-to-text APIs on the market.
