First, we need to make sure our environment has all the tools we need. In this case, we have two main categories of tools we are setting up:
We need to install all the necessary tools for the AI model, which encompass:
FFmpeg - an audio and video manipulation library (Whisper uses it to decode audio files)
setuptools-rust - A module dependency for the OpenAI Whisper library
openai-whisper - The actual repository for the OpenAI Whisper library
We need to install all the necessary tools for the Web Server elements which are:
FastAPI - A modern, fast (high-performance), web framework for building APIs with Python
Uvicorn - an ASGI web server implementation for Python
Gunicorn - a Python WSGI HTTP Server for UNIX
python-multipart - required by FastAPI to parse multipart/form-data requests, i.e., files uploaded to the server
Below are the commands we need to execute to get our system set up. Note: the brew install command can vary based on operating system.
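A typical install sequence looks like the following (this assumes macOS with Homebrew; substitute your system's package manager for the `brew` line):

```shell
# Install FFmpeg for audio decoding (varies by OS;
# e.g. `sudo apt-get install ffmpeg` on Debian/Ubuntu)
brew install ffmpeg

# Install the Whisper model and its Rust build dependency
pip install setuptools-rust
pip install openai-whisper

# Install the web server stack
pip install fastapi uvicorn gunicorn python-multipart
```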
Before diving into the code, it's important to understand how we are wrapping the OpenAI Whisper project with web servers.
WSGI (Web Server Gateway Interface) and ASGI (Asynchronous Server Gateway Interface) are two standards that define how a web server passes incoming requests to a Python application. The biggest difference is that WSGI handles requests sequentially (waiting for one to finish before starting the next), whereas ASGI handles requests asynchronously (tasks run concurrently and can finish in any order). ASGI can therefore theoretically scale much better as traffic grows.
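The difference can be sketched with Python's own `asyncio`, simulating I/O-bound requests with `asyncio.sleep` (this is an illustration of the concurrency model, not Uvicorn's internals):

```python
import asyncio
import time

async def handle_request(i):
    # Simulate an I/O-bound request (e.g., reading an uploaded file)
    await asyncio.sleep(0.1)
    return i

async def main():
    # WSGI-style: each request waits for the previous one to finish
    start = time.perf_counter()
    for i in range(5):
        await handle_request(i)
    sequential = time.perf_counter() - start

    # ASGI-style: all five requests are in flight at once
    start = time.perf_counter()
    await asyncio.gather(*(handle_request(i) for i in range(5)))
    concurrent = time.perf_counter() - start
    return sequential, concurrent

sequential, concurrent = asyncio.run(main())
print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

Five simulated requests take roughly five times as long sequentially as they do concurrently, which is exactly the advantage ASGI offers for I/O-heavy workloads like file uploads.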
FastAPI is the framework we use to actually write our API code. Gunicorn and Uvicorn receive incoming requests and hand them to FastAPI to process. Uvicorn is a newer server with the benefits of ASGI, so it should theoretically handle more concurrent traffic. In practice, Gunicorn is the more battle-tested process manager: it has production-grade support for multiple 'workers', meaning it can run several instances of the final application and balance requests across them. So, how do we get the best of both worlds? We can combine Gunicorn's worker management with Uvicorn's ASGI features by running Uvicorn as a Gunicorn worker class, which is what we're doing here. This also solves a compatibility problem: Gunicorn speaks WSGI while FastAPI speaks ASGI, so the two cannot communicate directly, and Uvicorn bridges them.
Below is a very basic setup of a FastAPI implementation with OpenAI Whisper. Essentially, we load our Whisper speech model on app startup, then set up an endpoint at the "/" location. The client sends us a file, which we write to our file system, pass to the speech model, and then we return the model's transcribed text back to the client.
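A minimal sketch of that setup might look like the following (the file name `MySampleSpeechToTextAPI.py` comes from the run step described in this article; the model size and temporary-file handling are assumptions):

```python
import os
import tempfile

import whisper  # provided by the openai-whisper package
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = None

@app.on_event("startup")
def load_model():
    # Load the Whisper model once, on app startup; "base" is a
    # smaller model that trades some accuracy for speed
    global model
    model = whisper.load_model("base")

@app.post("/")
async def transcribe(file: UploadFile):
    # Write the uploaded audio to disk so Whisper (via FFmpeg) can read it
    suffix = os.path.splitext(file.filename or "")[1]
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name
    try:
        result = model.transcribe(path)
    finally:
        os.remove(path)
    # Return the transcribed text to the client
    return {"text": result["text"]}
```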
Next, we run our application. We use Gunicorn to create one Uvicorn worker with a timeout of 60 seconds (to prevent slow requests from hanging the server). We bind this to port 3000 on localhost. We point Gunicorn at the Python file MySampleSpeechToTextAPI.py and the FastAPI application object named app. Note: a production-grade use case may need an NGINX reverse proxy in front of the Gunicorn instance.
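Concretely, the run command and a test request might look like this (the module name comes from the text above; `sample.mp3` is a placeholder for any audio file you have on hand):

```shell
# Start one Uvicorn-class worker under Gunicorn, bound to localhost:3000,
# with a 60-second timeout for slow transcription requests
gunicorn MySampleSpeechToTextAPI:app \
  --workers 1 \
  --worker-class uvicorn.workers.UvicornWorker \
  --timeout 60 \
  --bind 127.0.0.1:3000

# In another terminal: POST an audio file to the "/" endpoint
curl -X POST -F "file=@sample.mp3" http://127.0.0.1:3000/
```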
Congratulations, we just sent our first file to our API server to run our newly created OpenAI Whisper speech-to-text API!
We have only scratched the surface of setting up a production-grade speech-to-text API with OpenAI Whisper. In order to make this a production-grade application, we would need to create an autoscaling system that utilizes a GPU. There are many ways to accomplish this task, but a general way is to:
1. Utilize a cloud provider such as Google Cloud, AWS, or Azure.
2. Set up a template virtual machine that will run virtual machine setup steps on start, likely utilizing a cloud-init file.
3. Monitor metrics on our virtual machines and autoscale when we hit a certain usage threshold.
4. Put the autoscaling virtual machines behind a load balancer instance to handle incoming traffic.
Obviously, setting this system up may be incredibly complicated depending on needs. Take a look at this piece on OpenAI Whisper that goes a bit deeper on autoscaling learning resources.
Getting started with setting up an API with OpenAI Whisper is a fairly straightforward process, largely thanks to the wonderful work provided by OpenAI and other open source projects like FastAPI, Uvicorn, and Gunicorn. Scaling this system can be quite difficult, however, as security, scaling, error handling, and more must be considered. Oftentimes, it can be advantageous to start with a public speech-to-text API offering and, as the use case grows, bring the process in-house if the time investment in a production-ready speech-to-text API makes sense.