OpenAI released Whisper, a state-of-the-art AI model for speech transcription and translation, in September 2022. Since then, companies have been scrambling to keep up with it. This article will go over how the OpenAI Whisper model works, why it matters, and what you can do with it, including in-depth instructions for building your own self-hosted transcription API and using a third-party transcription API.
OpenAI Whisper is a tool created by OpenAI that can understand and transcribe spoken language, much like how Siri or Alexa works. This kind of tool is often referred to as an automatic speech recognition (ASR) system.

The way OpenAI Whisper works is a bit like a translator. It uses something called an encoder-decoder Transformer architecture. To simplify, imagine someone translating English to French: the 'encoder' part would understand the English sentence, and the 'decoder' part would then generate the French sentence. In the case of Whisper, the 'encoder' understands the spoken language (the audio), and the 'decoder' generates the written text.

To learn how to understand and transcribe spoken language, Whisper was trained on a huge amount of data from the internet - equivalent to continuously listening for over 77 years! This data is multilingual (it includes many different languages) and multitask (it covers different types of tasks, not just transcription).

The audio data that OpenAI Whisper learns from is processed in a specific way to make it easier for the system to understand - a bit like adjusting the settings on your TV to make the picture clearer. The audio is re-sampled to a standard rate (16,000 Hz, i.e., 16,000 samples per second), and then transformed into a visual representation (an '80-channel log-magnitude Mel spectrogram') that the system can learn from. This transformation is done in small chunks (25-millisecond windows) that slightly overlap (a 'stride' of 10 milliseconds) to ensure no part of the audio is missed.
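To make these preprocessing steps a little more concrete, here is a minimal sketch using helpers from the openai-whisper package; the file name "audio.mp3" is just a placeholder for your own audio.

```python
# A minimal sketch of Whisper's audio preprocessing, assuming the
# openai-whisper package is installed; "audio.mp3" is a placeholder file.
import whisper

audio = whisper.load_audio("audio.mp3")   # decoded and re-sampled to 16,000 Hz mono
audio = whisper.pad_or_trim(audio)        # padded or trimmed to a 30-second chunk
mel = whisper.log_mel_spectrogram(audio)  # 80-channel log-Mel spectrogram
print(mel.shape)                          # e.g. torch.Size([80, 3000])
```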
While it’s hard to know the intention for certain, it may be that affordable voice detection and transcription is a key to ChatGPT proliferating. The future of ChatGPT is likely in the form of some voice assistant. To have a voice assistant, you need good transcription, and OpenAI probably felt that the existing open-source transcription models were lacking. They then released the model to the public to build trust and to enable the public to improve the model, making transcription cheaper and faster. Since GPT is the core IP for OpenAI, OpenAI likely felt comfortable open-sourcing this complementary tool.
There are a lot of potential applications for OpenAI Whisper. Some of the most common use cases include:
- Voice transcription for chat applications
- Podcast transcription
- Customer Service or Sales transcription
- Internet video transcription and captioning
- Legal service transcription services
- Education transcription
- Healthcare/patient transcriptions
- And more
As OpenAI Whisper becomes faster, cheaper, and smaller, we could see it making its way into our mobile devices to start doing close to real-time transcriptions for everything in our day-to-day lives.
OpenAI released nine different models for Whisper: Tiny, Base, Small, and Medium models in both multilingual and English-only versions, plus a Large model that is available only as a multilingual model.
Which model should you use? It depends!
If you are looking to self-host, things like model size are important, along with the relative speed of the output. For example, only one instance of the Large model may fit in the memory of an NVIDIA T4 GPU, whereas multiple instances of the smaller models can be loaded onto the same card. Below is a table from OpenAI showing the sizes and relative speeds of the models.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
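As a quick sanity check when sizing hardware, a sketch like the following (assuming PyTorch with CUDA available) loads one of the model sizes above and reports how much GPU memory it occupies; the choice of "medium" here is purely illustrative.

```python
# A rough sketch for checking whether a given model size fits on your GPU;
# the model name is illustrative - swap in tiny/base/small/large as needed.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

if device == "cuda":
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    used = torch.cuda.memory_allocated(0) / 1e9
    print(f"GPU memory: {used:.1f} GB used of {total:.1f} GB total")
```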
For accuracy comparisons, see the benchmarking section below.
In general, we would recommend setting baseline expectations ahead of time for speed, hosting cost, and accuracy. Going in with those expectations allows you to pick the model that is right for your needs.

What makes Whisper different from other speech recognition tools? Whisper is trained on a large amount of weakly supervised data from the web, which makes it robust and able to generalize well to a variety of tasks and languages. It is also a multitask, multilingual model: it can transcribe, translate into English, and identify the spoken language across 96 different languages. Furthermore, Whisper performs well in a zero-shot setting, without the need for any dataset-specific fine-tuning. Given that Whisper is trained on a large set of data available on the internet, it likely performs quite well compared to other tools when transcribing content from internet sources.
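To illustrate the multitask side, here is a small sketch that detects the spoken language and then translates the audio into English, following the patterns in the Whisper README; "audio.mp3" is again a placeholder file.

```python
# A sketch of Whisper's multitask usage: language identification followed by
# translation into English. "audio.mp3" is a placeholder file name.
import whisper

model = whisper.load_model("base")  # a multilingual (non-.en) model is required

# Language identification on the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Translation: transcribe the audio directly into English text
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```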
OpenAI released their own relative accuracies per language for the Large model.
To benchmark the performance of the Whisper models relative to each other and industry benchmarks, Ryan Hileman did some testing of common datasets against Whisper models.
To do your own benchmarking, we would recommend looking at publicly available datasets and running your own tests where possible. Here is a link to publicly available datasets, which often come with reference transcripts to help you calculate the error rate of a particular transcription service.
If you are looking for a how-to on benchmarking, check out this tutorial on accuracy testing a transcription API.
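The standard metric for this kind of comparison is word error rate (WER). Here is a minimal sketch of computing it with the jiwer package (an assumption on our part - any WER implementation works); the reference and hypothesis strings are purely illustrative.

```python
# A minimal sketch of computing word error rate (WER) against a reference
# transcript, assuming the jiwer package is installed.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumped over a lazy dog"   # model output

error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")
```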
You can find a quick usage tutorial on the OpenAI Whisper GitHub here.
In general, the usage is quite simple:
```bash
pip install -U openai-whisper
```
With just one command, Whisper is installed on your machine.
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```
The above is four lines of code to transcribe an audio file and print the result: you load the model, transcribe the audio, and print the text. Easy!
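If you need more than the raw text, the same result dictionary also carries segment-level timestamps and the detected language. A small sketch (the file name and optional language hint are illustrative):

```python
# Printing per-segment timestamps from the transcription result.
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", language="en")  # language hint is optional

for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")
```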
Should you run Whisper on a GPU? Probably! Transcription will be much, much faster that way. However, GPUs can be quite expensive, so take that into consideration.
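A quick sketch of choosing the device explicitly; note that half-precision (fp16) inference only makes sense on a GPU, so this example turns it off when falling back to CPU.

```python
# A sketch of selecting GPU vs. CPU; fp16 is only worthwhile on GPU.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# fp16 defaults to True; on CPU, setting it to False avoids a warning
result = model.transcribe("audio.mp3", fp16=(device == "cuda"))
print(result["text"])
```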
To create an API endpoint, the easiest approach is likely to build a Docker container that serves Whisper, then deploy that container on a cloud service.
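For the serving layer inside the container, here is a minimal sketch using FastAPI (with python-multipart for file uploads); the route name, model size, and temp-file handling are our own illustrative choices, not part of Whisper itself.

```python
# A minimal sketch of a self-hosted transcription endpoint using FastAPI.
import tempfile

import whisper
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # pick the size that fits your hardware

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Write the upload to a temporary file so ffmpeg can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name
    result = model.transcribe(path)
    return {"text": result["text"]}
```

Serve it with something like `uvicorn main:app`, then containerize that process and deploy it as described below.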
Here is a containerized version of OpenAI Whisper. To deploy the container on a public cloud, you can choose Google Cloud's Container-Optimized OS. To do this, you need to understand how to run a virtual machine, host a container in Google Cloud, run a VM with a startup script, and run a VM with a GPU attached. The concepts you will need, with links to instructions, are as follows:
- Google Cloud Overview
- Deploying a Virtual Machine using a Startup Script
- Deploying a Virtual Machine with a GPU attached
- Artifact Registry in Google Cloud to host your container
- Running a VM with a container on GPU with a startup script
Check out this tutorial on creating your own Whisper API if you're looking for a bit of a deeper dive on the subject.
Are there other libraries built on top of Whisper? Yes! Here are a number of helpful projects worth checking out:
- WhisperX
- Faster Whisper
- Whisper Jax
- Stable Timestamps
- Pyannote Audio Diarization
- Insanely Fast Whisper
Yes, there are two main APIs you can use. First, there is an independent, more cost-effective API, WhisperAPI.com, built on the Whisper Small model via Faster Whisper, which adds features such as speaker diarization to enhance the experience. Second, there is the official API offered by OpenAI, which provides a number of models, including Large.
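For the official OpenAI API, a call looks roughly like the following sketch, assuming the openai Python package (v1 or later) is installed and an OPENAI_API_KEY is set in your environment; the file name is a placeholder.

```python
# A sketch of calling OpenAI's hosted Whisper endpoint via the openai package.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # the hosted Whisper model
        file=audio_file,
    )

print(transcript.text)
```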
Whisper is a piece of groundbreaking work from OpenAI. It will enable future innovations in the transcription API space and potentially unlock a future where we have a magical personal assistant in our pockets that works better than anything that exists today. While it’s unclear exactly where the innovation will go, it is great to see that a lot of people are already building on top of this tool and making improvements at an incredible pace.