Speech-to-Text (STT) APIs are becoming far more common among companies, especially with the push toward more artificial intelligence workflows and use cases. Running your own infrastructure for translation or transcription is time-consuming and rarely part of your core business, which is where state-of-the-art speech-to-text solutions come into play. However, there are many options, and choosing among them can feel impossible in a constantly changing landscape. This article walks through that landscape and aims to simplify the choice, providing a list of STT APIs and the reasons you'd choose one over another.
Cost - Cost can quickly balloon with speech-to-text APIs, as audio transcription is resource-intensive and the amount of audio being transcribed can be massive. Given this, cost should be a major consideration when it comes to the ROI of a transcription service. In general, it is best to find the most affordable solution that meets your minimum predetermined thresholds of speed and accuracy.
Speed - The importance of this evaluation point largely depends on your workflow. Are you looking for close to real-time transcription, or are you doing batch jobs that may not require a speedy return? Regardless, getting results back from an STT API quickly is always preferred. It is good to test speech-to-text APIs at different times of day, with different file types, audio lengths, and parameters to ensure that the speed consistently meets your minimum needs.
Reliability - You want an STT API that has great uptime and the ability to handle errors gracefully. The top speech-to-text APIs pride themselves on the ability to do this. Downtime for STT APIs can be devastating to around-the-clock workflows, especially as a business scales. It is best to integrate with multiple STT APIs, one as a primary and another as a backup in the case of downtime. However, the primary offering should still be highly reliable.
Accuracy - For almost all use cases, accuracy is one of the most important factors in choosing a speech-to-text API in 2024. The typical means by which an STT API is evaluated is word error rate (WER): the number of substituted, deleted, and inserted words in a transcript, divided by the number of words in the correct reference transcript. This is a great objective measure of STT API performance, allowing you to have as close to an apples-to-apples comparison as possible. However, it is important to test on audio as close to your exact use case as possible to prevent issues in the future.
Features - Advanced features are the most important potential differentiators of a speech-to-text API in 2024. A few of these features include:
Diarization for mapping speakers to text
Translation for transcription across languages
Initial Prompt for passing in keywords for spelling and context
Language Detection for automatically transcribing into the right language
Callback URLs for sending results to an external URL upon completion
Scalability - The underlying architecture of a speech-to-text API should scale to any volume you can reasonably expect to send it. Most STT jobs should be highly parallelized, making it possible to run dozens or hundreds of simultaneous transcriptions.
Ease-of-Use - It's important that speech-to-text APIs are easy to use. It should only require a few parameters to get started and should be compatible with most if not all programming languages. As a general rule of thumb, you should be able to get started with an STT API within a few minutes.
Support - If you ever have questions, comments, or concerns with an STT API, you should feel confident that your challenges will be resolved quickly. The best speech-to-text APIs offer great support not just for the top enterprise customers, but also for the companies or individuals just getting started with STT.
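The primary-and-backup setup described under Reliability can be sketched in a few lines. This is a minimal illustration, not any vendor's SDK: each provider is modeled as a plain function that takes audio bytes and returns a transcript, and `flaky_primary` and `stable_backup` are hypothetical stand-ins for real API clients.

```python
from typing import Callable, Sequence

# A provider is modeled as any callable that takes raw audio bytes and
# returns a transcript string. This is an assumption for illustration,
# not the interface of any real STT vendor.
Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(audio: bytes, providers: Sequence[Transcriber]) -> str:
    """Try each provider in order and return the first successful transcript."""
    errors = []
    for provider in providers:
        try:
            return provider(audio)
        except Exception as exc:  # a real integration would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


def flaky_primary(audio: bytes) -> str:
    # Hypothetical primary provider that is currently down.
    raise ConnectionError("primary is down")


def stable_backup(audio: bytes) -> str:
    # Hypothetical backup provider that always succeeds.
    return "hello world"


print(transcribe_with_fallback(b"fake-audio", [flaky_primary, stable_backup]))
```

Ordering the list from primary to backup keeps the routing decision in one place, so adding a third provider later is a one-line change.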
AI Assistant - With the rise of artificial intelligence in 2022 and 2023, 2024 is shaping up to be the year for AI assistants. Assistants need a mechanism to receive input from the user. Typically, this has been text input, but human speech is often the more natural medium for communication. With STT APIs, it has become easier than ever to incorporate speech-to-text into an AI workflow.
Sales - With the rise of artificial intelligence, it is becoming easier to augment or replace parts of the typical software sales workflow with an AI agent, augmented by speech-to-text. For example, AI agents may be capable of providing real-time insights to account executives to help close deals or alleviate client concerns. Additionally, AI agents may be capable of doing small parts of the sales process to reduce friction and improve productivity of the team.
Support - As users of an application or service have questions, it can be difficult to scale a human-to-human interaction model, which is where AI and speech-to-text workflows come in handy. With speech-to-text offerings, you can spin up voice support backed by a corpus of data about common support issues, allowing your users to talk with an AI assistant to receive support and escalating more complex issues to humans. As AI improves, the need for these escalations may decrease, allowing companies to scale support faster.
Voice Data Aggregation - There are a lot of examples of speech available in the world (call recordings, webinars, podcasts, videos, and more). With all this information available, speech-to-text APIs are the perfect tools for converting this speech to text, after which complex aggregations, summaries, or sentiment analyses can be done to provide helpful insight. For example, someone can summarize the reasons deals aren't closing for a particular product or service by transcribing sales calls.
Word Error Rate - This is the most objective measure of the accuracy of a speech-to-text API. To evaluate each API, pick a sample of audio (often from a pre-curated dataset) that already has a fully correct transcription. Then, run each candidate API against that sample and calculate the rate of mistakes being made (i.e. the word error rate). You can then pick the speech-to-text API with the lowest error rate, or find a tradeoff between error rate and the other variables. For more information on calculating error rates, you can look into Accuracy Testing a Transcription API.
Pricing - Take a look at the publicly listed prices for the various STT API options. Prices can range from about as low as $0.15/hr of audio all the way up to $1.20/hr of audio. In general, reducing costs for transcription and speech-to-text is always a good idea. Also, make sure that the pricing listed includes all of your needs. Is diarization included? Are the additional features you need included? Is the standard model offered by the service the one you need, or is there a more expensive model needed for your use case? Is the pricing scaled, or flat? These are all factors to consider.
Email Support - If you have questions, it's great to email support. A great speech-to-text API should have support that responds quickly with great answers on both technical and non-technical questions. For example, if you run into errors on starting or scaling your system, you want a company that cares about your issues and is able to quickly offer technical support to resolve the issue at hand.
Test the STT API - The absolute best way to evaluate an API is to test it yourself. Whisper API offers 30 hours of free credits to test the API. You should try to get started and see if the API delivers results at the speed and accuracy you require for your use case. Once you have tested the APIs you are looking into, you will have a good idea of the different STT API speeds, accuracy rates, and prices, at which point you can more objectively evaluate the solutions for your particular needs. Make sure to test the APIs with a typical production use case and workload, including similar audio lengths, dialects, languages, background noise, volume of audio, and more.
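The word error rate check described above can be sketched as follows. This is a minimal version that assumes both transcripts are plain strings, lowercases them, and splits on whitespace. A real evaluation would usually also normalize punctuation and numerals before comparing. The function name `word_error_rate` is chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


reference = "the quick brown fox jumps"
hypothesis = "the quick brown box"  # one substitution, one deletion
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}, accuracy: {1 - wer:.1%}")
```

Because insertions count as errors, the rate can exceed 100% when a transcript contains many extra words, so it is worth inspecting outliers rather than only averaging.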
Below is a comprehensive outlook on the speech-to-text market, with prices for the APIs taken from the publicly available pricing pages of each offering. Additionally, utilizing this study on LibriSpeech accuracy benchmarks for speech-to-text offerings and this benchmark for the other Whisper Models for LibriSpeech performance that were outside the original study, we are able to get fairly close to an apples-to-apples comparison between each speech-to-text offering for accuracy. Accuracy is calculated as 1 - Word Error Rate (WER) for the LibriSpeech test-other audio set.
Whisper API is the best tradeoff between cost and accuracy. Whisper API is the lowest-cost major speech-to-text vendor on the market. And, with a recent update to the Whisper Small model via Faster Whisper, Whisper API offers incredible speed and accuracy for everyday speech, which is what the OpenAI model was trained on. Further, out-of-the-box, Whisper API provides free diarization, the ability to pass in keywords, amazing speed, and amazing scalability.
Cost Per Hour of Audio: $0.17
Accuracy Percentage: 92.8%
The OpenAI API utilizes a larger model for their transcription as compared to Whisper API, which may result in more accurate transcription. Additionally, OpenAI offers a text-to-speech option for a conversational agent use case. Unfortunately, OpenAI does not appear to offer any form of diarization, which is a major limitation. Additionally, the API is more than twice as expensive as Whisper API and likely does not utilize Faster Whisper, a state-of-the-art improvement on the core model.
Cost Per Hour of Audio: $0.36
Accuracy Percentage: 96.7%
Deepgram has a wonderful range of services within speech-to-text, offering good accuracy and blazing speed for diarization. However, the number of languages supported may not be as comprehensive as other offerings (but the number is growing!). That being said, Deepgram does offer its own OpenAI Whisper option. The Deepgram Whisper offering is more expensive than Whisper API, as is Deepgram's standard API offering.
Cost Per Hour of Audio: $0.216
Accuracy Percentage: 87.2%
AssemblyAI offers an amazing suite of features for their speech-to-text API, including sentiment analysis, PII redaction, and more. Additionally, AssemblyAI offers good accuracy with good speed, making it a usable offering. The biggest challenge with AssemblyAI is pricing: the offering starts at $0.65/hr of audio, with many additional upsells that could bring the transcription pricing much higher. If you are okay with higher prices for your use case, AssemblyAI could be a good fit, but more budget-minded individuals may be concerned about the pricing for the service, especially as speech transcription needs scale.
Cost Per Hour of Audio: $0.65
Accuracy Percentage: 95.8%
The biggest benefit of Azure STT API is the scalability and vast number of languages supported. However, Azure's STT offering can be expensive, starting at $0.36/hr of audio and scaling quickly if the use case is real-time, custom, or involves diarization. The custom model offerings may be worthwhile to look into for specific use cases, as they enable some custom training of models. If your speech is not transcribed well by other services, this could be a great backup option for making the use case feasible.
Cost Per Hour of Audio: $0.36
Accuracy Percentage: 94.1%
Rev AI is an amazing option for highly accurate, English-language use cases. Rev AI supports 36+ languages and may not attain the same level of accuracy in languages other than English. The API is also quite expensive, starting at $1.20/hr of audio and scaling with additional features. Rev AI additionally has non-automated offerings for use cases that need high accuracy. If accuracy is the #1 concern for an English use case, Rev AI is one of the best offerings on the market.
Cost Per Hour of Audio: $1.20
Accuracy Percentage: 93.0%
Google Speech-to-Text delivers support for 125+ languages and the robustness you would expect from a Google product. However, the pricing starts at $1.44/hr of audio (or $0.96/hr of audio with data logging), making it a more expensive option, which the accuracy does not necessarily justify as compared to other models and STT APIs. Overall, Google Speech-to-Text is best suited to GCP customers who would like to keep everything within one ecosystem. Otherwise, the pricing may be a bit prohibitive for the average user.
Cost Per Hour of Audio: $0.96
Accuracy Percentage: 76.4%
Amazon Transcribe offers many languages of support and the scalability of AWS. However, the speed of the transcriptions could improve and the entry pricing is quite high at $1.44/hr of audio (going down to $0.468/hr of audio at scale). Overall, it is an offering that is best consumed by AWS users looking to keep everything within one ecosystem as opposed to a standalone offering, since the pricing is quite expensive for entry-level users.
Cost Per Hour of Audio: $0.468
Accuracy Percentage: 93.9%
IBM Watson Speech to Text API comes from a well-established company, but can be lacking when it comes to speed and accuracy. Additionally, the offering does not compete on price with the less expensive options. That being said, the offering starts out less expensive than other major cloud offerings, so it could be a good option if you want to pick a very large company as the provider.
Cost Per Hour of Audio: $0.60
Accuracy Percentage: 86.8%
Speechmatics has a good offering in the space, but may not have the best speed compared to other offerings. At $0.30/hr, Speechmatics has one of the more affordable options in the market and should not be discounted as a potential solution. However, with this cheaper offering, Speechmatics may retain some data and also may offer lower accuracy and lower speeds. The enhanced offering is more expensive at $1.04/hr of audio, but may alleviate some of these data privacy, speed, and accuracy concerns.
Cost Per Hour of Audio: $0.30
Accuracy Percentage: 96.4%
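The figures listed above can be combined with the earlier advice to pick the most affordable option that clears your accuracy threshold. The sketch below uses only the cost and accuracy numbers from this article. The 95% threshold and the 1,000 hours per month are hypothetical inputs you would replace with your own.

```python
from typing import List, NamedTuple


class Offering(NamedTuple):
    name: str
    cost_per_hour: float  # USD per hour of audio, as listed in this article
    accuracy_pct: float   # 1 - WER on LibriSpeech test-other, as listed here


# Note: the Google and Amazon figures are the discounted rates listed above,
# not the $1.44/hr entry prices mentioned in their descriptions.
OFFERINGS = [
    Offering("Whisper API", 0.17, 92.8),
    Offering("OpenAI", 0.36, 96.7),
    Offering("Deepgram", 0.216, 87.2),
    Offering("AssemblyAI", 0.65, 95.8),
    Offering("Azure", 0.36, 94.1),
    Offering("Rev AI", 1.20, 93.0),
    Offering("Google Speech-to-Text", 0.96, 76.4),
    Offering("Amazon Transcribe", 0.468, 93.9),
    Offering("IBM Watson", 0.60, 86.8),
    Offering("Speechmatics", 0.30, 96.4),
]


def cheapest_meeting_threshold(
    offerings: List[Offering], min_accuracy_pct: float
) -> List[Offering]:
    """Return offerings at or above the accuracy threshold, cheapest first."""
    eligible = [o for o in offerings if o.accuracy_pct >= min_accuracy_pct]
    return sorted(eligible, key=lambda o: o.cost_per_hour)


def monthly_cost(offering: Offering, hours_per_month: float) -> float:
    """Projected monthly spend at a flat per-hour rate."""
    return offering.cost_per_hour * hours_per_month


HOURS_PER_MONTH = 1000  # hypothetical workload
for offering in cheapest_meeting_threshold(OFFERINGS, min_accuracy_pct=95.0):
    cost = monthly_cost(offering, HOURS_PER_MONTH)
    print(f"{offering.name}: {offering.accuracy_pct}% accurate, ${cost:,.2f}/month")
```

This treats pricing as flat and ignores speed, add-on features, and volume discounts, so it is a starting point for a shortlist rather than a final answer.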
Speech-to-text offerings are becoming more common in today's market, especially with the entrance of a number of great open-source offerings, namely OpenAI Whisper and the associated improvements and community contributions. However, the mechanisms for evaluating a speech-to-text API have remained relatively constant, and even newcomers to the market can apply them. With AI applications that require a speech-to-text solution, it's more important now than ever to evaluate services in the market to match the cutting-edge offerings of new companies. There are a lot of great offerings in the space, but only a few that offer a great trade-off among speed, accuracy, and price. 2024 will be a great year for audio transcription, speech-to-text offerings, and diarization. Best of luck in evaluating solutions and finding one that matches your needs! In case you aren't convinced you'd like to use a paid third-party transcription service, check out this piece on open source speech-to-text accuracy benchmarks to be better informed on your options.