Word Error Rate (WER)

Word Error Rate (WER) measures how often a mistake is made when transcribing audio to text. It is the most common metric for estimating the overall accuracy of a speech-to-text service. This piece dives into the complexities of Word Error Rate, how it's used in the ASR market today, and the future of WER.

Why is Word Error Rate Important?

WER is the closest thing to an apples-to-apples comparison of different speech-to-text software and APIs. By feeding the same neutral audio source into each system, we can estimate how good each API is at audio transcription. While accuracy is just one input into evaluating a speech-to-text service, it is nonetheless one of the most important. With an objective number attached to accuracy, a team can go into the evaluation process with a clear key performance indicator (KPI) requirement for a speech-to-text API, and any API that meets that minimum threshold can be a contender for a production use case.

How is Word Error Rate Calculated?

WER = (S + D + I)/(S + D + C)

Where:
S is the number of substitutions (e.g. 'Dolly' transcribed in place of the actual text 'DALL·E')
D is the number of deletions (e.g. 'I speech-to-text' instead of the actual text 'I like speech-to-text')
I is the number of insertions (e.g. 'I really like speech-to-text' instead of the actual text 'I like speech-to-text')
C is the number of correctly predicted words.

The accuracy of a given speech-to-text transcription is just 1-WER. So, if the word error rate is 20%, then the STT service was 80% accurate.
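As a quick worked example (the sentences here are made up for illustration): suppose the reference transcript is 'the cat sat on the mat' and the system outputs 'the cat sat on a mat today'. That is 1 substitution ('a' for 'the'), 1 insertion ('today'), 0 deletions, and 5 correct words, so WER = (1 + 0 + 1)/(1 + 0 + 5) = 2/6 ≈ 33%, for an accuracy of roughly 67%.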

Before a word error rate is calculated, there is typically a normalization process that removes most punctuation, standardizes capitalization, and standardizes things like numbers (e.g. 'thirteen' and '13'), all to make sure that the S, D, I, and C counts are as fair as possible.
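As a minimal sketch of what that normalization can look like in Python (the exact rules vary between benchmarks, and the normalize helper below is a simplified, hypothetical example rather than any standard implementation):

import re

def normalize(text: str) -> str:
    # Lowercase so capitalization differences are not counted as errors
    text = text.lower()
    # Strip most punctuation
    text = re.sub(r"[^\w\s']", " ", text)
    # Standardize a few spelled-out numbers (a real normalizer covers many more)
    numbers = {"thirteen": "13", "two": "2"}
    words = [numbers.get(word, word) for word in text.split()]
    return " ".join(words)

print(normalize("Thirteen dogs, two cats."))  # -> "13 dogs 2 cats"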

If you are looking to calculate WER on your own, there are a number of great open source libraries that do just that, including jiwer.

pip install jiwer

from jiwer import wer

# The ground-truth transcript and the ASR system's output
reference = "speech to text"
hypothesis = "Speak my text"

# wer() returns the word error rate as a float (0.0 means a perfect transcription)
error = wer(reference, hypothesis)
print(error)
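
With jiwer's default settings (which do not lowercase or otherwise normalize the input), 'Speak' vs 'speech' and 'my' vs 'to' count as two substitutions out of three reference words, so the printed error is roughly 0.67. If you also want the underlying S, D, I, and C counts, recent versions of jiwer (3.x) expose them through process_words; here is a short sketch assuming that API:

from jiwer import process_words

out = process_words("speech to text", "Speak my text")

# These counts map directly onto the WER formula above
print(out.substitutions, out.deletions, out.insertions, out.hits)
print(out.wer)  # roughly 0.67 for this pair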

What are the Shortcomings of Word Error Rate?

Word Error Rate doesn't take into account the magnitude of the underlying errors. In the example above with 'Dolly' and 'DALL·E' (an OpenAI image model), while that is technically a mistake, it is a much more understandable one than DALL·E being transcribed as, say, 'elephant'. It is nearly impossible to assign a closeness score to word errors, which is why WER should always be taken with a grain of salt.

What Causes a Word Error?

Similar Sounding Words

As in the example above, 'Dolly' and 'DALL·E' can be mixed up. So can 'to', 'two', and 'too'. The correct spelling of these words depends on context, which machine learning systems are worse at using than a human would be, since they lean more heavily on the sounds than on the meaning of a phrase. Additionally, some words may be out of scope for a model. If a word has only just been coined, or industry-specific terms (such as DALL·E) that the model has never seen before are used, the chances of getting those words right are much smaller.

Tough to Decipher Dialogue

Sometimes dialogue is simply hard to understand. This can be caused by overlapping speakers, background noise, unfamiliar dialects, poor audio quality, and more. While humans can often cope with these conditions, systems trained under specific constraints can run into their limits.

What is the Word Error Rate for Modern ASR Tools?

If you're looking for accuracy benchmarks, this piece on free open source software accuracy benchmarks should help. Whisper Large is one of the most accurate open source models, with Kaldi, wav2vec 2.0, and SpeechBrain as other great, accurate open source options.

Additionally, here are some accuracy benchmarks for commercial offerings. Whisper Large (available commercially through OpenAI's API) is among the most accurate offerings there as well, with Amazon Transcribe and Azure Speech-to-Text as other great options on the market.

In general, offerings range from a WER as low as 2% to as high as 40%.

What is the Word Error Rate for Humans?

This really depends! An often-cited figure is that humans have a WER of about 4%. It's hard to understand every word in an audio transcription, especially when context is unclear, or audio quality isn't the best. However, one study goes even deeper on human WER and finds WERs as high as 6.8%. To put that into perspective, on the same datasets, Whisper Large makes almost 3x as many mistakes at 17.6% WER. While modern ASR tools have come a long way, they aren't quite as good as humans.

What are Typical Datasets for Testing Word Error Rate?

There are many common datasets for Word Error Rate, including:

Common Voice

Common Voice is a wonderful Mozilla project for gathering transcribed speech. It allows anyone to either contribute spoken audio or listen to existing recordings and give feedback on their transcriptions, helping build the ground-truth data.

LibriSpeech

LibriSpeech is 1,000 hours of audiobook audio, carefully segmented and distributed by OpenSLR, an organization devoted to making speech and language datasets publicly available.

Fleurs

Fleurs is a dataset that is great for multilingual evaluation, with roughly 12 hours of audio per language across 102 languages.

CHiME-6

CHiME-6 is a dataset of 20 recorded dinner parties, testing the ability of models to overcome background noise, overlapping dialogue, and more.

Switchboard

Switchboard, originally collected by Texas Instruments under DARPA sponsorship, is a dataset of roughly 2,400 two-sided telephone conversations between speakers across the US. The conversation topics were carefully assigned so that no speaker discussed the same topic more than once, and no two speakers were paired together more than once.

CallHome

CallHome was developed by the Linguistic Data Consortium (LDC) as a series of 120 thirty-minute phone calls between native English speakers, primarily calls between close friends or family.

As you can tell, depending on your transcription use case, any one of these comprehensive datasets could be a good one to use. If you are looking for speech in a crowded public space, CHiME-6 could be a great option. If you are looking for a diverse set of audio from around the world, Fleurs could be a great fit. Having benchmarks like these makes it possible to compare speech-to-text offerings and be as informed as possible when choosing a solution for your needs.
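
In practice, evaluating an offering against one of these benchmarks comes down to running the benchmark's reference transcripts and your system's outputs through the same WER calculation. Here is a minimal sketch, assuming you already have matching lists of reference and hypothesis strings for a dataset (jiwer accepts lists and aggregates the error counts across them):

from jiwer import wer

# Hypothetical transcripts; in a real benchmark the references come from the
# dataset and the hypotheses come from the ASR system under evaluation
references = [
    "the cat sat on the mat",
    "speech to text is useful",
]
hypotheses = [
    "the cat sat on a mat today",
    "speech to text is useful",
]

corpus_wer = wer(references, hypotheses)
print(f"Corpus WER: {corpus_wer:.2%}")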

Are There Other Ways to Measure ASR Mistakes than WER?

Yes, there are. Other metrics include:

Match Error Rate (MER)

This is the proportion of words that were incorrectly predicted or inserted, defined as:
(S + D + I) / (S + D + C + I)

Word Information Loss (WIL)

This is the proportion of word information that was lost in the transcription, defined as:
1 - (C/(S + D + C)) * (C/(C + S + I))

Word Information Preserved (WIP)

This is the proportion of word information preserved (simply 1 - WIL), defined as:
(C/(S + D + C)) * (C/(C + S + I))

Character Error Rate (CER)

This is WER at the character level: the same formula, just counting characters instead of words.
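
jiwer also exposes functions for each of these metrics alongside wer, which makes it easy to compare them on the same pair of transcripts; a quick sketch:

from jiwer import wer, mer, wil, wip, cer

reference = "speech to text"
hypothesis = "Speak my text"

# Each function takes the same (reference, hypothesis) arguments
print("WER:", wer(reference, hypothesis))
print("MER:", mer(reference, hypothesis))
print("WIL:", wil(reference, hypothesis))
print("WIP:", wip(reference, hypothesis))
print("CER:", cer(reference, hypothesis))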

While each of these metrics can be helpful, WER is the most commonly used metric in the market. Additionally, none of these other measurements materially address the underlying concerns with WER, such as the magnitude of individual errors.

Are ASR Word Error Rates Generally Improving?

Yes, absolutely! With each passing year, speech-to-text, particularly with open source models, only improves. As the community continues to share research on understanding and deriving insight from audio, speech-to-text and ASR technology in general should only get better. It is likely that in the near future machine transcription will be as good as human transcription. With the rise of Large Language Models (LLMs) and Artificial Intelligence (AI) in general, the context of the speech can be taken into account, which further improves the chances of an accurate transcription. Further, since LLMs are trained on large corpora, these systems may cover a wide variety of industry-specific terminology, helping avoid the dreaded word-similarity issue in today's speech-to-text offerings.

Conclusion

Word Error Rate is an important metric in evaluating speech-to-text offerings, since it provides a numerical evaluation that can be used to compare each offering against another. That being said, Word Error Rate is not necessarily perfect, as each ASR system performs better or worse depending on the variables of the underlying audio and each system can make mistakes in different ways. Overall, WER is the metric that is most likely to have continued influence in the industry, since it is objective and well-defined.