With the rise of Artificial Intelligence (AI) and Large Language Models (LLMs), speech-to-text is becoming more important than ever, as audio is a more natural interface than text for many use cases. Open source is often the backbone of innovation in any software area, and speech-to-text is no exception. But finding these speech-to-text offerings and, more importantly, evaluating them can be a cumbersome or even impossible task. This piece walks through the most prominent free, open-source speech-to-text offerings and benchmarks their accuracy against common audio datasets to better inform you of the capabilities on the market.
OpenAI Whisper was one of the more groundbreaking open-source additions to the ASR and speech-to-text market. It was trained on over 680,000 hours of diverse speech collected across the internet, enabling impressive zero-shot accuracy across languages. Whisper is one of the most performant open-source models on the market.
Faster Whisper is an impressive improvement on the OpenAI model, matching the accuracy of the original while running up to 4x faster with a model up to 3x smaller. Faster Whisper works by using CTranslate2, an open-source library that applies runtime optimizations to AI models, such as weight quantization, layer fusion, and batch reordering.
Insanely Fast Whisper is also based on the OpenAI model and pushes speed even further, with transcription up to 9x faster than even Faster Whisper. There is no significant benchmarking yet of its comparative accuracy or model size, but it is certainly a big step forward in the speech-to-text space.
DeepSpeech is an open-source project built by Mozilla and released under the Mozilla Public License. DeepSpeech can run on-device, from hardware as modest as a Raspberry Pi 4 all the way up to high-powered GPUs. The model is based on Baidu's groundbreaking Deep Speech research paper. DeepSpeech is capable of running on a CPU and can reach real-time transcription performance, especially when coupled with a GPU. Additionally, DeepSpeech is highly configurable, as it enables you to train your own model.
Kaldi is a highly extensible open-source project, with over 13k stars and 9k commits on GitHub. It is a great project for diving into speech-to-text as a researcher.
Another open-source toolkit, built directly on PyTorch, SpeechBrain is a no-brainer option for speech-to-text enthusiasts. Additionally, SpeechBrain has done a great job of offering speech recognition, speaker recognition, speech enhancement, speech separation, language modeling, and dialogue all in one, which makes it a great choice for conversational AI use cases. The SpeechBrain repository has over 7k stars and 1k forks on GitHub.
There are many ways to accuracy-test speech-to-text models; a common approach is to compare models against highly curated audio datasets so that the comparison is fair. Here, we will compare the top open-source speech-to-text tools against each other using the Common Voice and LibriSpeech datasets.
Common Voice is an amazing initiative from Mozilla to curate voices from people across the internet. Anyone can contribute to the project, either by providing a voice snippet or by listening to existing snippets and giving feedback on transcription quality. It is truly an experience curated by the community at large, which is exciting.
LibriSpeech is a common audio dataset derived from read audiobooks, carefully segmented and aligned to ensure accuracy of the dataset.
While the results across the following accuracy tests may not have been normalized the same way (test hardware may differ, dataset versions may differ, the languages tested may differ, etc.), they provide a great starting point for evaluating the tools.
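One part of normalization that matters a great deal is how transcripts are cleaned before scoring: punctuation and casing differences alone can inflate an error rate. As an illustration, here is a minimal text normalizer (a sketch only; each benchmark cited below applies its own, tool-specific normalization):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    superficial formatting differences don't count as word errors."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize("Hello, World!")` and `normalize("hello world")` produce the same string, so a transcript differing only in punctuation would score a 0% error rate rather than being penalized.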
The most common evaluation metric is the word error rate (WER) of each model. Let's compare each model's WER against Whisper Large on the Common Voice and LibriSpeech datasets.
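Concretely, WER counts the word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of reference words. A minimal implementation using word-level edit distance (a sketch for intuition, not the exact scoring code used in any of the benchmarks below):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # match or substitution
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match, cost 0)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of a six-word reference gives a WER of 1/6, about 16.7%:
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why very noisy audio can produce numbers that look impossible at first glance.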
Whisper vs. DeepSpeech

Common Voice:
Whisper: 9.0% Word Error Rate (WER) (source: Whisper accuracy testing)
DeepSpeech: 43.82% Word Error Rate (WER) (source: DeepSpeech accuracy testing)

LibriSpeech:
Whisper: 2.7% WER clean, 5.2% WER other (source: Whisper accuracy testing)
DeepSpeech: 7.27% WER clean, 21.45% WER other (source: DeepSpeech accuracy testing)

Winner: Whisper
Whisper vs. Kaldi

Common Voice:
Whisper: 9.0% Word Error Rate (WER) (source: Whisper accuracy testing)
Kaldi: 4.44% Word Error Rate (WER) (source: Common Voice Kaldi accuracy testing)

LibriSpeech:
Whisper: 2.7% WER clean, 5.2% WER other (source: Whisper accuracy testing)
Kaldi: 3.8% WER clean, 8.76% WER other (source: LibriSpeech Kaldi accuracy testing)

Winner: It depends on the audio dataset, but likely Whisper, given concerns that Kaldi's training data overlaps with Common Voice
Whisper vs. SpeechBrain

Common Voice:
Whisper: 9.0% Word Error Rate (WER) (source: Whisper accuracy testing)
SpeechBrain: 15.58% Word Error Rate (WER) (source: SpeechBrain accuracy testing)

LibriSpeech:
Whisper: 2.7% WER clean, 5.2% WER other (source: Whisper accuracy testing)
SpeechBrain: 2.46% WER clean, 5.77% WER other (source: SpeechBrain accuracy testing)

Winner: It depends a bit on the audio dataset, but overall Whisper
Whisper vs. Wav2vec 2.0

Common Voice:
Whisper: 9.0% Word Error Rate (WER) (source: Whisper accuracy testing)
Wav2vec 2.0: 16.1% Word Error Rate (WER)* (source: Common Voice Wav2vec 2.0 accuracy testing)
*Note: this result is for English only. Whisper received 10.0% WER on English-only in the same study.

LibriSpeech:
Whisper: 2.7% WER clean, 5.2% WER other (source: Whisper accuracy testing)
Wav2vec 2.0: 1.8% WER clean, 3.3% WER other (source: LibriSpeech Wav2vec 2.0 accuracy testing)

Winner: It depends on the audio dataset, but Whisper for Common Voice and most audio datasets
There are a ton of different speech-to-text offerings on the market, each with its own pros and cons. It can be difficult to evaluate these options against each other, as each offers its own point of differentiation. Whisper overall has solidified itself as a leader in the space, with strong zero-shot capability across many languages, dialects, background-noise conditions, and more. Additionally, Whisper has one of the most active development communities in the world; it seems that each month there is a new, state-of-the-art improvement to the core technology. As AI becomes increasingly popular, it will be more important than ever that these open-source tools continue to perform as well as, if not better than, private solutions. Hopefully, this piece helped clarify some of the valuable open-source offerings on the market and the accuracy you should expect from each one. If you're looking into paid STT services, check out this piece on the top STT APIs to better evaluate your options.