ASR (automatic speech recognition) is a class of technologies that takes audio input and outputs information about the contents of that audio. Typically, the output is a transcript. The transcript often includes timestamps for each word, metadata about who spoke which words (known as diarization), the sentiment of the speech, and more.
ASR is a broad category that is often broken down into subfields such as speech-to-text, diarization, and sentiment analysis. Newer technologies such as Large Language Models (LLMs) and text-to-speech have augmented the ASR experience in recent years.
Automatic Speech Recognition (ASR) is, at its core, a machine learning problem. Let’s look at one of the most common options on the market today: OpenAI’s Whisper.
Whisper is an encoder-decoder transformer: the encoder turns the input into an internal representation, and the decoder turns that representation into human-usable output. The input here is the audio from the audio file. The audio is broken into 30-second intervals and converted into a format the model can use, known as a log-Mel spectrogram. That sounds quite fancy, but let’s break the phrase into easier-to-digest component parts to better understand it.
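The fixed-length chunking step is simple enough to sketch directly. Below is a minimal numpy sketch; the constants mirror Whisper's published preprocessing (16 kHz mono audio, 30-second windows), while the function name and the example clips are illustrative.

```python
import numpy as np

SAMPLE_RATE = 16_000               # Whisper resamples all audio to 16 kHz
CHUNK_SAMPLES = 30 * SAMPLE_RATE   # each model input covers exactly 30 seconds

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Force a mono waveform to exactly 30 seconds before the spectrogram
    step: longer audio is cut, shorter audio is zero-padded (silence)."""
    if audio.shape[0] >= CHUNK_SAMPLES:
        return audio[:CHUNK_SAMPLES]
    return np.pad(audio, (0, CHUNK_SAMPLES - audio.shape[0]))

# A 45-second clip is trimmed; a 10-second clip is padded with silence.
long_clip = np.random.randn(45 * SAMPLE_RATE)
short_clip = np.random.randn(10 * SAMPLE_RATE)
print(pad_or_trim(long_clip).shape)   # (480000,)
print(pad_or_trim(short_clip).shape)  # (480000,)
```

In practice a long recording is processed as a sequence of these 30-second windows, one model pass per window.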
Spectrogram: At its core, a spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as it varies with time. It is effectively a two-dimensional plot, with one axis representing time and the other representing frequency, while the color or intensity at each point represents the amplitude (or power) of a particular frequency at a particular time.
Mel Scale: The "Mel" in log-Mel spectrogram refers to the mel scale, which is a scale of pitches judged by listeners to be equal in distance from one another. The purpose of the mel scale is to mimic the way humans perceive sound. Humans don't perceive frequencies linearly; we are better at detecting differences in lower frequencies than higher ones. The mel scale takes this into account by warping the frequency scale to be more representative of human hearing.
Logarithmic Scale: The "log" part refers to the use of a logarithmic scale to represent amplitude. In a log-mel spectrogram, the amplitude of the frequencies is converted to a logarithmic scale. This is done because our ears perceive loudness in a logarithmic fashion (which is why we measure sound in decibels, which are logarithmic). This conversion makes the spectrogram more aligned with human hearing perception.
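Both conversions reduce to short formulas. Below is a minimal numpy sketch using the standard mel formula and the decibel conversion; note that Whisper's real preprocessing builds a full 80-channel mel filterbank on top of a short-time Fourier transform, which this sketch omits.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """Standard mel-scale formula: equal steps in mels roughly match
    equal steps in perceived pitch."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz) / 700.0)

def amplitude_to_db(amplitude, ref=1.0):
    """Logarithmic amplitude in decibels, mirroring how we perceive
    loudness (the clamp avoids log of zero for silent bins)."""
    return 20.0 * np.log10(np.maximum(amplitude, 1e-10) / ref)

# The warp compresses high frequencies: a 100 Hz gap at the bottom of
# the range spans many more mels than the same gap near 7 kHz, matching
# our finer pitch resolution at low frequencies.
print(hz_to_mel(200) - hz_to_mel(100))    # large gap in mels
print(hz_to_mel(7200) - hz_to_mel(7100))  # much smaller gap in mels
```

A log-Mel spectrogram is simply an ordinary spectrogram with its frequency axis warped by the first function and its amplitudes compressed by the second.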
The model’s creators reasoned that the best way to decode speech is to get it into a format extremely close to how humans perceive it (that is, close to how our own brains decode language).
Next, this spectrogram is fed into the encoder, and the decoder then predicts the associated text, timestamps, and other metadata. Whisper was trained on over 680,000 hours of diverse audio from across the internet, making it highly robust across different languages, dialects, background noise, audio quality, and more.
Automatic Speech Recognition is used for a variety of use cases today. SRT and VTT files are quite common for video transcription, enabling captioning on the video media we consume across television, internet videos, social media, and more. Speech-to-text is also becoming more common on video calls, recording what was said in a given work meeting. With the rise of ChatGPT and Large Language Models in general, we can even use the resulting transcripts to summarize calls in bulk and discover patterns that are difficult to see across sessions. Finally, with the rise of text-to-speech models, it has become even easier to build AI assistants that interact directly with humans: speech-to-text transcribes the audio, an LLM comes up with a response, and a text-to-speech model turns that response back into audio.
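As an illustration of the captioning use case, here is a small sketch that renders ASR output as an SRT file. The segment triples are hypothetical, shaped like the per-utterance start/end/text data a model such as Whisper returns.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start_sec, end_sec, text) segments as SRT caption blocks:
    a 1-based index, a timestamp range, then the caption text."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

# Hypothetical transcription segments for a short clip.
segments = [(0.0, 2.5, "Hello and welcome."),
            (2.5, 6.08, "Today we talk about ASR.")]
print(to_srt(segments))
```

VTT output is nearly identical, differing mainly in its `WEBVTT` header and the use of `.` instead of `,` in timestamps.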
OpenAI has led the way when it comes to developments in the ASR space. The Whisper model has been a game changer for speech-to-text, giving developers a fast, cheap, and robust option for audio transcription and speech recognition. Additionally, OpenAI’s text-to-speech models have enabled a new voice interface for ChatGPT that can be more capable than Siri or Alexa, allowing richer interactions with AI agents.
Yes, speech-to-text is a form of automatic speech recognition technology. In many ways, speech-to-text is the cornerstone of the ASR industry: most other tooling under the ASR umbrella exists to augment what speech-to-text can do on its own.
When choosing whether to build or buy a speech-to-text option, you need to weigh the tradeoffs of building internally against what’s available on the market.
At a high level, ASR technologies trade off speed, accuracy, price, and feature set. OpenAI Whisper and its derivative commercial offerings are often the best options for those looking for an ASR solution. The cost to self-host Whisper, or to buy transcription time from a Whisper-based service, is often much lower than other offerings, and the model’s zero-shot performance is excellent. A common approach is to start with an external speech-to-text API, then move to self-hosting if costs grow too high; by using a model similar to the one you would self-host, you can be confident you can make the switch without degrading quality for your users. All in all, Whisper offers some of the best trade-offs between speed, price, and accuracy, and with the many open-source models for tasks like diarization, it is straightforward to add the additional functionality you need on a case-by-case basis.
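The start-with-an-API, self-host-later decision comes down to simple break-even arithmetic. The sketch below uses entirely hypothetical numbers; substitute your provider's current per-minute rate and your own infrastructure costs.

```python
# All figures are hypothetical placeholders for illustration only.
API_PRICE_PER_MINUTE = 0.006   # hosted-API price, $ per audio minute
SELF_HOST_MONTHLY = 450.0      # GPU instance cost, $ per month

def monthly_api_cost(audio_minutes: float) -> float:
    """Hosted-API bill for a month of transcription volume."""
    return audio_minutes * API_PRICE_PER_MINUTE

def break_even_minutes() -> float:
    """Monthly audio volume above which self-hosting is cheaper."""
    return SELF_HOST_MONTHLY / API_PRICE_PER_MINUTE

print(f"API cost for 10,000 audio min: ${monthly_api_cost(10_000):,.2f}")
print(f"Break-even volume: {break_even_minutes():,.0f} audio minutes/month")
```

With these placeholder figures, self-hosting only pays off above 75,000 audio minutes (1,250 hours) per month; a real comparison should also account for engineering time and GPU utilization.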
ASR is becoming a pivotal part of the LLM and AI revolution we’ve seen over the last two years, tying the way people communicate to the models that will improve the way we digest and incorporate data into our everyday lives. It also has broad applicability to everyday parts of our lives, such as the traditional media and video we consume daily. ASR is going to have a lasting impact on our society, and recent improvements to the technology will only help push the AI, machine learning, and ASR revolution forward.