OpenAI released Whisper, a state-of-the-art AI model for speech transcription and translation, in September 2022. Since then, companies have been scrambling to keep up with it. This article will go over how the OpenAI Whisper model works, why it matters, and what you can do with it, including in-depth instructions for building your own self-hosted transcription API and using a third-party transcription API.
OpenAI Whisper is a tool created by OpenAI that can understand and transcribe spoken language, much like how Siri or Alexa works. This kind of tool is often referred to as an automatic speech recognition (ASR) system.

The way OpenAI Whisper works is a bit like a translator. It uses something called an encoder-decoder Transformer architecture. To simplify, imagine someone translating English to French: the 'encoder' part would understand the English sentence, and the 'decoder' part would then generate the French sentence. In the case of Whisper, the 'encoder' understands the spoken language (the audio), and the 'decoder' generates the written text.

To learn how to understand and transcribe spoken language, Whisper was trained on a huge amount of data from the internet - equivalent to continuously listening for over 77 years! This data is multilingual (it includes many different languages) and multitask (it covers different types of tasks, not just transcription).

The audio data that OpenAI Whisper learns from is processed in a specific way to make it easier for the system to understand - a bit like adjusting the settings on your TV to make the picture clearer. The audio is re-sampled to a standard rate (16,000 Hz, i.e., 16,000 samples per second), and then transformed into a visual representation (an '80-channel log-magnitude Mel spectrogram') that the system can learn from. This transformation is done in small chunks (25-millisecond windows) that slightly overlap (a 'stride' of 10 milliseconds) to ensure no part of the audio is missed.
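To make these preprocessing steps a little more concrete, here is a minimal sketch using helpers from the openai-whisper package; the file name "audio.mp3" is just a placeholder for your own audio.

```python
# A minimal sketch of Whisper's audio preprocessing, assuming the
# openai-whisper package is installed; "audio.mp3" is a placeholder file.
import whisper

audio = whisper.load_audio("audio.mp3")   # decoded and re-sampled to 16,000 Hz mono
audio = whisper.pad_or_trim(audio)        # padded or trimmed to a 30-second chunk
mel = whisper.log_mel_spectrogram(audio)  # 80-channel log-Mel spectrogram
print(mel.shape)                          # e.g. torch.Size([80, 3000])
```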
While it’s hard to know the intention for certain, it may be that affordable voice detection and transcription is a key to ChatGPT proliferating. The future of ChatGPT is likely in the form of some voice assistant. To have a voice assistant, you need good transcription, and OpenAI probably felt that the existing open-source transcription models were lacking. They then released the model to the public to build trust and to enable the public to improve the model, making transcription cheaper and faster. Since GPT is the core IP for OpenAI, OpenAI likely felt comfortable open-sourcing this complementary tool.
There are a lot of potential applications for OpenAI Whisper. Some of the most common use cases include:
- Voice transcription for chat applications
- Podcast transcription
- Customer Service or Sales transcription
- Internet video transcription and captioning
- Legal service transcription services
- Education transcription
- Healthcare/patient transcriptions
- And more
As OpenAI Whisper becomes faster, cheaper, and smaller, we could see it making its way into our mobile devices to start doing close to real-time transcriptions for everything in our day-to-day lives.
OpenAI released nine different models for Whisper: Tiny, Base, Small, and Medium models in both multilingual and English-only versions, plus a Large model that is available only as a multilingual model.
Which model should you use? It depends!
If you are looking to self-host, things like model size are important, along with the relative speed of the output. For example, only one instance of the Large model may fit in the memory of an NVIDIA T4 GPU, whereas multiple instances of the smaller models can be loaded onto the same card. Below is a table from OpenAI showing the sizes and relative speeds of the models.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
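As a quick sanity check when sizing hardware, a sketch like the following (assuming PyTorch with CUDA available) loads one of the model sizes above and reports how much GPU memory it occupies; the choice of "medium" here is purely illustrative.

```python
# A rough sketch for checking whether a given model size fits on your GPU;
# the model name is illustrative - swap in tiny/base/small/large as needed.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

if device == "cuda":
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    used = torch.cuda.memory_allocated(0) / 1e9
    print(f"GPU memory: {used:.1f} GB used of {total:.1f} GB total")
```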
For accuracy comparisons, see the benchmarking section below.
In general, we would recommend setting baseline expectations ahead of time for speed, hosting cost, and accuracy. Going in with those expectations allows you to pick the model that is right for your needs.

What makes Whisper different from other speech recognition tools? Whisper is trained on a large amount of weakly supervised data from the web, which makes it robust and able to generalize well to a variety of tasks and languages. It is also a multitask, multilingual model: it can transcribe, translate into English, and identify the spoken language across 96 different languages. Furthermore, Whisper performs well in a zero-shot setting, without the need for any dataset-specific fine-tuning. Given that Whisper is trained on a large set of data available on the internet, it likely performs quite well compared to other tools when transcribing content from internet sources.
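To illustrate the multitask side, here is a small sketch that detects the spoken language and then translates the audio into English, following the patterns in the Whisper README; "audio.mp3" is again a placeholder file.

```python
# A sketch of Whisper's multitask usage: language identification followed by
# translation into English. "audio.mp3" is a placeholder file name.
import whisper

model = whisper.load_model("base")  # a multilingual (non-.en) model is required

# Language identification on the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Translation: transcribe the audio directly into English text
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```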
OpenAI released their own relative accuracies per language for the Large model.
To benchmark the performance of the Whisper models relative to each other and industry benchmarks, Ryan Hileman did some testing of common datasets against Whisper models.
To do your own benchmarking, we would recommend looking at publicly available datasets and running your own tests where possible. Here is a link to publicly available datasets, which often come with reference transcripts to help you calculate the error rate of a particular transcription service.
If you are looking for a how-to on benchmarking, check out this tutorial on accuracy testing a transcription API.
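The standard metric for this kind of comparison is word error rate (WER). Here is a minimal sketch of computing it with the jiwer package (an assumption on our part - any WER implementation works); the reference and hypothesis strings are purely illustrative.

```python
# A minimal sketch of computing word error rate (WER) against a reference
# transcript, assuming the jiwer package is installed.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumped over a lazy dog"   # model output

error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")
```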
You can find a quick usage tutorial on the OpenAI Whisper GitHub here.
In general, the usage is quite simple:
```bash
pip install -U openai-whisper
```
With just one command, Whisper is installed on your machine.
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```
The above is four lines of code to transcribe an audio file and print the result: you load the model, transcribe the audio, and print the text. Easy!
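If you need more than the raw text, the same result dictionary also carries segment-level timestamps and the detected language. A small sketch (the file name and optional language hint are illustrative):

```python
# Printing per-segment timestamps from the transcription result.
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", language="en")  # language hint is optional

for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")
```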
Should you run Whisper on a GPU? Probably! Transcription will be much, much faster that way. However, GPUs can be quite expensive, so take that into consideration.
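A quick sketch of choosing the device explicitly; note that half-precision (fp16) inference only makes sense on a GPU, so this example turns it off when falling back to CPU.

```python
# A sketch of selecting GPU vs. CPU; fp16 is only worthwhile on GPU.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# fp16 defaults to True; on CPU, setting it to False avoids a warning
result = model.transcribe("audio.mp3", fp16=(device == "cuda"))
print(result["text"])
```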
To create an API endpoint, the easiest approach is likely to build a Docker container that serves Whisper, then deploy that container on a cloud service.
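For the serving layer inside the container, here is a minimal sketch using FastAPI (with python-multipart for file uploads); the route name, model size, and temp-file handling are our own illustrative choices, not part of Whisper itself.

```python
# A minimal sketch of a self-hosted transcription endpoint using FastAPI.
import tempfile

import whisper
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # pick the size that fits your hardware

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Write the upload to a temporary file so ffmpeg can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name
    result = model.transcribe(path)
    return {"text": result["text"]}
```

Serve it with something like `uvicorn main:app`, then containerize that process and deploy it as described below.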
Here is a containerized version of OpenAI Whisper. To deploy the container on a public cloud, you can choose Google Cloud's Container-Optimized OS. To do this, you need to understand how to run a virtual machine, host a container in Google Cloud, run a VM with a startup script, and run a VM with a GPU attached. The concepts you will need, with links to instructions, are as follows:
- Google Cloud Overview
- Deploying a Virtual Machine using a Startup Script
- Deploying a Virtual Machine with a GPU attached
- Artifact Registry in Google Cloud to host your container
- Running a VM with a container on GPU with a startup script
Check out this tutorial on creating your own Whisper API if you're looking for a bit of a deeper dive on the subject.
Are there other libraries built on top of Whisper? Yes! Here are a number of helpful projects worth checking out:
- WhisperX
- Faster Whisper
- Whisper Jax
- Stable Timestamps
- Pyannote Audio Diarization
- Insanely Fast Whisper
Yes, there are two main APIs you can use. First, there is an independent, more cost-effective API, WhisperAPI.com, built on the Whisper Small model via Faster Whisper, which adds features such as speaker diarization to enhance the experience. Second, there is the official API offered by OpenAI, which provides a number of models, including Large.
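For the official OpenAI API, a call looks roughly like the following sketch, assuming the openai Python package (v1 or later) is installed and an OPENAI_API_KEY is set in your environment; the file name is a placeholder.

```python
# A sketch of calling OpenAI's hosted Whisper endpoint via the openai package.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # the hosted Whisper model
        file=audio_file,
    )

print(transcript.text)
```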
Whisper is a piece of groundbreaking work from OpenAI. It will enable future innovations in the transcription API space and potentially unlock a future where we have a magical personal assistant in our pockets that works better than anything that exists today. While it’s unclear exactly where the innovation will go, it is great to see that a lot of people are already building on top of this tool and making improvements at an incredible pace.