Create Your Own OpenAI Whisper Speech-to-Text API

OpenAI has released a revolutionary speech-to-text model called Whisper. Running the model locally is relatively straightforward, taking just a few lines of code. For most real-world use cases, however, you need to run transcriptions remotely and on-demand, which means spinning up an autoscaling system that can accept API requests and return results. Easier said than done! This piece walks through a basic implementation of a system that accepts API requests for OpenAI's speech-to-text model and returns results, with a section on scaling the system for a production use case.

Speech-to-Text API Package Setup

First, we need to make sure our environment has all the tools we need. In this case, we have two main categories of tools we are setting up:

AI Model

We need to install all the necessary tools for the AI model, which include:

FFmpeg - An audio and video manipulation library that Whisper uses to decode input files

setuptools-rust - A build dependency for the OpenAI Whisper library

openai-whisper - The OpenAI Whisper package itself

Web Server

We need to install all the necessary tools for the web server elements, which are:

FastAPI - A modern, high-performance web framework for building APIs with Python

Uvicorn - An ASGI web server implementation for Python

Gunicorn - A Python WSGI HTTP server for UNIX

python-multipart - Required by FastAPI to parse multipart form data, i.e. file uploads

Below are the commands we need to execute to get our system set up. Note: brew install applies to macOS; use your operating system's package manager to install FFmpeg elsewhere (e.g., apt install ffmpeg on Debian/Ubuntu).

brew install ffmpeg
pip install fastapi
pip install setuptools-rust
pip install -U openai-whisper
pip install uvicorn
pip install gunicorn
pip install python-multipart
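
Once everything is installed, a quick way to confirm the pieces are in place is to load the model once from Python. A minimal sanity check (the verify_setup.py file name is just an example):

# verify_setup.py - confirm FFmpeg is on the PATH and Whisper loads
import shutil

import whisper

assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
model = whisper.load_model("tiny")  # downloads the model weights on first run
print("Whisper 'tiny' model loaded successfully")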

Web Servers Overview

Before diving into the code, it's important to understand how we are wrapping the OpenAI Whisper project with web servers.

What's the difference between ASGI and WSGI?

WSGI (Web Server Gateway Interface) and ASGI (Asynchronous Server Gateway Interface) are two standard interfaces between Python web servers and Python applications. The biggest difference is that a WSGI application handles requests synchronously (each worker waits for one request to finish before starting another), whereas an ASGI application handles requests asynchronously (many requests can be in flight at once and can finish in any order). As a result, ASGI can theoretically scale much better as traffic grows.
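
To make the difference concrete, below is a sketch of the smallest possible application under each interface (illustrative only; our actual app further down is written with FastAPI):

# A WSGI application is a synchronous callable; the worker is blocked
# until it returns.
def wsgi_app(environ, start_response):
  start_response("200 OK", [("Content-Type", "text/plain")])
  return [b"hello from WSGI"]

# An ASGI application is an async callable; the event loop can serve
# other requests while this one awaits I/O.
async def asgi_app(scope, receive, send):
  await send({"type": "http.response.start", "status": 200,
              "headers": [(b"content-type", b"text/plain")]})
  await send({"type": "http.response.body", "body": b"hello from ASGI"})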

Why Do We Need FastAPI, Gunicorn, and Uvicorn?

FastAPI is the framework we use to actually write our API code. Gunicorn and Uvicorn receive the incoming requests and hand them to FastAPI to process. Uvicorn is a newer ASGI server, which should theoretically be able to handle more concurrent traffic. In practice, though, Gunicorn is the more battle-tested process manager: it has production-grade support for multiple 'workers', meaning it can run several instances of the application and balance requests across them. So, how do we get the best of both worlds? We can combine Gunicorn's worker management with Uvicorn's ASGI features by running Uvicorn workers inside Gunicorn, which is exactly what we do here. Additionally, Gunicorn speaks WSGI while FastAPI is an ASGI framework, so the two cannot communicate directly; the Uvicorn worker class bridges them.
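
For reference, the Gunicorn flags we use later can also be expressed as a config file, which Gunicorn picks up automatically when named gunicorn.conf.py. A sketch of the equivalent configuration:

# gunicorn.conf.py - equivalent to the command-line flags used later on
bind = "127.0.0.1:3000"  # listen on localhost port 3000
workers = 1              # number of worker processes
worker_class = "uvicorn.workers.UvicornWorker"  # run Uvicorn inside Gunicorn
timeout = 60             # restart a worker that is unresponsive for 60s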

Setting Up our OpenAI Whisper Speech-to-Text API with FastAPI

Below is a very basic FastAPI implementation wrapping OpenAI Whisper. Essentially, we load our OpenAI Whisper speech model on app startup, then set up an endpoint at the "/" route. The client sends us a file, which we write to our file system in chunks, pass to the AI speech model, and then return the model's text back to the client.

# MySampleSpeechToTextAPI.py
from fastapi import FastAPI, UploadFile
import whisper

app = FastAPI()
model = None

@app.on_event("startup")
async def startup_event():
  # Load the Whisper model once at startup so every request can reuse it
  global model
  model = whisper.load_model("tiny")

@app.post("/")
async def transcription(file: UploadFile):
  # Stream the upload to disk in 1 MB chunks to avoid holding the whole
  # file in memory. Note: the fixed filename means concurrent requests
  # would overwrite each other; see the variant below.
  with open("audio.wav", 'wb') as f:
    while contents := file.file.read(1024 * 1024):
      f.write(contents)
  file.file.close()
  # Run the model and return just the transcribed text
  return model.transcribe("audio.wav")["text"]
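
As noted in the comments, every request above writes to the same audio.wav, so concurrent requests would clobber each other's uploads. A minimal variant that gives each request its own temporary file (a sketch, assuming the same app and model globals; the /transcribe route name is just an example):

import os
import tempfile

@app.post("/transcribe")
async def transcription_safe(file: UploadFile):
  # Write each upload to its own unique temporary file to avoid collisions
  with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
    while contents := file.file.read(1024 * 1024):
      f.write(contents)
    path = f.name
  file.file.close()
  try:
    return model.transcribe(path)["text"]
  finally:
    os.remove(path)  # clean up the temporary file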

Running our OpenAI Whisper Speech-to-Text API with Gunicorn and Uvicorn

Next, we run our application. We use Gunicorn to spawn 1 Uvicorn worker with a timeout of 60 seconds (to recycle workers stuck on slow requests). We bind this to port 3000 on localhost, and we point Gunicorn at the FastAPI instance named app in the Python file MySampleSpeechToTextAPI.py. Note: a production-grade use case may need an NGINX reverse proxy in front of the Gunicorn instance.

gunicorn --timeout 60 -w 1 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:3000 MySampleSpeechToTextAPI:app

Accessing Our OpenAI Whisper Speech-to-Text API Via Curl

curl --request POST --url 'http://localhost:3000' -F "file=@YOUR_FILE_PATH"
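
The same request can also be made from Python with the requests library (pip install requests); YOUR_FILE_PATH is the same placeholder as in the curl command:

# client.py - POST an audio file to our transcription endpoint
import requests

with open("YOUR_FILE_PATH", "rb") as f:
  response = requests.post("http://localhost:3000", files={"file": f})
print(response.json())  # the transcribed text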

Congratulations, we just sent our first file to our API server to run our newly created OpenAI Whisper speech-to-text API!

Next Steps: Using a GPU and Scaling our Speech-to-Text API

We have only scratched the surface of setting up a production-grade speech-to-text API with OpenAI Whisper. To make this a production-grade application, we would need to create an autoscaling system that utilizes a GPU (see the device-selection sketch after this list). There are many ways to accomplish this, but a general approach is to:
1. Utilize a cloud provider such as Google Cloud, AWS, or Azure.
2. Set up a template virtual machine that runs its setup steps on boot, likely utilizing a cloud-init file.
3. Monitor virtual machine metrics and autoscale when we hit a certain usage threshold.
4. Put the autoscaling virtual machines behind a load balancer instance to handle incoming traffic.
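
On the GPU side, whisper.load_model accepts a device argument, so the startup code can prefer CUDA when a GPU is present. A minimal sketch, assuming PyTorch was installed with CUDA support:

import torch
import whisper

# Prefer the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("tiny", device=device)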

Obviously, setting this system up can be quite complicated depending on your needs. Take a look at this piece on OpenAI Whisper, which goes a bit deeper into autoscaling learning resources.

Conclusion on Setting up Our Own Speech-to-Text API using OpenAI Whisper

Getting started with setting up an API with OpenAI Whisper is a fairly straightforward process, largely thanks to the wonderful work of OpenAI and other open-source projects like FastAPI, Uvicorn, and Gunicorn. Scaling this system can be quite difficult, however, as security, autoscaling, error handling, and more must be considered. Oftentimes, it is advantageous to start with a public speech-to-text API offering and, as the use case grows, bring the process in-house if the time investment in a production-ready speech-to-text API makes sense.