How to Build a Streaming Open-Source Whisper WebSocket Service

In today’s fast-paced digital world, the ability to convert spoken language into text in real time has become essential for applications such as live captioning, voice-controlled interfaces, meeting transcription, and more. The need for low-latency, full-duplex communication in transcription services can effectively be met with a WebSocket-based architecture. In this article, we’ll explore the setup of a streaming transcription WebSocket service in Python using the open-source OpenAI Whisper library.

Why WebSockets?

WebSockets give us a bi-directional, real-time conversation between the client and the server. In this context, that conversation consists of audio chunk data from the client and partial transcription text back from the server. With this approach, we can transcribe most of the audio before the recording is complete, giving us a rich user experience with minimal delay. By the time the user stops recording, we only have to process a small amount of audio to finish the transcription, because everything up to the last 500 ms has already been processed.

Setting Up the Environment

Running this application requires a few dependencies to be installed on your system. The first is ffmpeg, which can be installed with Homebrew or the package manager of your choice. Make sure it is working by running ffmpeg --help before moving forward.

brew install ffmpeg
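Since Whisper shells out to the ffmpeg binary when loading audio, it can be worth verifying the binary is reachable before starting the server. Below is a small optional check; this is a sketch that is not part of the article's setup, and the check_ffmpeg name is just illustrative.

import shutil
import subprocess

def check_ffmpeg() -> None:
    # whisper.load_audio invokes the ffmpeg executable, so fail fast if it is missing
    if shutil.which('ffmpeg') is None:
        raise RuntimeError('ffmpeg not found on PATH; install it before starting the server')
    # Print the installed version line to make debugging easier
    result = subprocess.run(['ffmpeg', '-version'], capture_output=True, text=True)
    print(result.stdout.splitlines()[0])

if __name__ == '__main__':
    check_ffmpeg()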

The Build

First, we need to set up a Python WebSocket Flask server. This server will buffer audio data and return transcription text. We also need a simple web page that initiates the WebSocket connection, records audio, and forwards those audio slices to the WebSocket server.

To begin setting up the Flask server, let’s add our Python requirements to requirements.txt and install them with pip install -r requirements.txt:

Flask
gevent-websocket
flask_sockets
openai-whisper

Create a file called app.py to set up a WSGI server for serving the WebSocket handler.

from flask import Flask, render_template
from flask_sockets import Sockets

app = Flask(__name__)
sockets = Sockets(app)

# Keep this block at the bottom of app.py so every route above it is
# registered before the server starts.
if __name__ == "__main__":
    from gevent import pywsgi
    from geventwebsocket.handler import WebSocketHandler

    server = pywsgi.WSGIServer(('', 5003), app, handler_class=WebSocketHandler)
    server.serve_forever()

Next, let’s create the WebSocket handler in the same file, above the __main__ block so the route is registered before the server starts. We will handle either raw bytes or a base64-encoded string, converting the latter to bytes. Once we have the raw bytes, we convert them to an ndarray that Whisper can process.

# Additional imports for the handler (add these near the top of app.py)
import base64
import tempfile
import traceback

import whisper
from werkzeug.routing import Rule

model = whisper.load_model('base.en')


def process_wav_bytes(webm_bytes: bytes, sample_rate: int = 16000):
    # Write the incoming WAV bytes to a temporary file so Whisper (via ffmpeg) can read them
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=True) as temp_file:
        temp_file.write(webm_bytes)
        temp_file.flush()
        waveform = whisper.load_audio(temp_file.name, sr=sample_rate)
        return waveform


def transcribe_socket(ws):
    while not ws.closed:
        message = ws.receive()
        if message:
            print('message received', len(message), type(message))
            try:
                # The client may send raw bytes or a base64-encoded string
                if isinstance(message, str):
                    message = base64.b64decode(message)
                audio = process_wav_bytes(bytes(message)).reshape(1, -1)
                audio = whisper.pad_or_trim(audio)
                transcription = whisper.transcribe(model, audio)
                # Return the partial transcription text to the client
                ws.send(transcription.get('text', ''))
            except Exception:
                traceback.print_exc()


sockets.url_map.add(Rule('/transcribe', endpoint=transcribe_socket, websocket=True))
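Before wiring up the browser, you can sanity-check the endpoint with a small Python client. This is a rough sketch assuming the websocket-client package (pip install websocket-client) and a local 16 kHz mono WAV file named test.wav, neither of which is part of the article's setup.

import base64
import websocket  # provided by the websocket-client package

def send_test_audio(wav_path: str = 'test.wav') -> None:
    # Connect to the /transcribe route defined above
    ws = websocket.create_connection('ws://localhost:5003/transcribe')
    with open(wav_path, 'rb') as f:
        wav_bytes = f.read()
    # Send the audio as base64 text, mirroring what the browser client will do
    ws.send(base64.b64encode(wav_bytes).decode('ascii'))
    # Read one transcription message back from the server
    print('transcription:', ws.recv())
    ws.close()

if __name__ == '__main__':
    send_test_audio()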

Now that our backend server is set up, we can build a simple web page to send audio data. Note that if your audio playback and your recording are both happening in Google Chrome, the playback sound will not be captured in the recording.

Create the file index.html and either open it directly in your browser or serve it from your Flask server. If you decide to serve the file with Flask, save it to templates/index.html and add a new handler to our app.py file.

@app.route('/')
def index():
    return render_template('index.html')

Add the following code to index.html to create a simple page for forwarding audio and displaying the transcription text. We use RecordRTC to ensure the output is in audio/wav format with 1 channel at a 16,000 Hz sample rate. The timeSlice parameter is set to send audio chunks to our backend server every 500 ms.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Realtime WebSocket Audio Streaming</title>
<style>
    body {
        background-color: black;
        color: green;
    }
</style>
</head>
<body>
<h1>Realtime WebSocket Audio Streaming</h1>
<button id="startButton">Start Streaming</button>
<button id="stopButton">Stop Streaming</button>
<div id="responseContainer"></div>
<script src="https://www.WebRTC-Experiment.com/RecordRTC.js"></script>
<script>
    // Connect to the /transcribe route exposed by the Flask server
    let ws = new WebSocket('ws://localhost:5003/transcribe');

    // Append each transcription result returned by the server
    ws.onmessage = event => {
        let responseContainer = document.getElementById('responseContainer');
        responseContainer.innerHTML += `<p>${event.data}</p>`;
    };

    // Forward each recorded audio chunk to the server as base64 text
    let handleDataAvailable = (blob) => {
        if (blob.size > 0) {
            console.log('blob', blob);
            blobToBase64(blob).then(b64 => {
                ws.send(b64);
            });
        }
    };

    function blobToBase64(blob) {
        return new Promise((resolve, reject) => {
            const reader = new FileReader();
            reader.readAsDataURL(blob);
            reader.onload = () => {
                const base64String = reader.result.split(',')[1];
                resolve(base64String);
            };
            reader.onerror = (error) => reject(error);
        });
    }

    navigator.mediaDevices.getUserMedia({ audio: true })
        .then(stream => {
            let recorder = RecordRTC(stream, {
                type: 'audio',
                recorderType: StereoAudioRecorder,
                mimeType: 'audio/wav',
                timeSlice: 500,            // emit a chunk every 500 ms
                desiredSampRate: 16000,    // match Whisper's expected sample rate
                numberOfAudioChannels: 1,
                ondataavailable: handleDataAvailable
            });

            document.getElementById('startButton').addEventListener('click', () => {
                recorder.startRecording();
            });

            document.getElementById('stopButton').addEventListener('click', () => {
                recorder.stopRecording();
            });
        });

    ws.onopen = () => {
        console.log('WebSocket connection opened');
    };

    ws.onclose = () => {
        console.log('WebSocket connection closed');
    };
</script>
</body>
</html>

Lastly, let’s start our server and test the performance. Like most AI models, Whisper runs best on a GPU, but it will still work on most computers. For this test I used an M2 MacBook Pro.

python app.py
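If you do have a GPU available, Whisper’s load_model accepts a device argument, so you can pick the best available device when the model is loaded. A minimal sketch, assuming PyTorch is installed (it comes with openai-whisper):

import torch
import whisper

# Prefer a CUDA GPU when available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = whisper.load_model('base.en', device=device)
print(f'Whisper model loaded on {device}')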

Considerations

  • Testing optimized builds of Whisper such as whisper.cpp or insanely-fast-whisper could make this solution even faster.
  • Make sure you have a dedicated GPU when running in production so that speed and concurrency can keep up with your users.
  • Add some form of authentication to your WebSocket endpoint so you can control traffic. Some services require an auth message to be sent first, and only once the channel has been authenticated can the client start sending audio (see the sketch after this list).
  • Integrating additional AI libraries such as diart into this endpoint could improve prediction accuracy by merging and slicing the audio on silent sections. Diart also includes features like speaker diarization and segmentation.
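As a sketch of the auth-message pattern mentioned above, the handler could require a token as the first WebSocket message before accepting any audio. The EXPECTED_TOKEN value and the plain string comparison are illustrative assumptions, not part of the article's code:

EXPECTED_TOKEN = 'change-me'  # hypothetical shared secret, e.g. loaded from an environment variable

def authenticated_transcribe_socket(ws):
    # Require an auth message before any audio is processed
    token = ws.receive()
    if token != EXPECTED_TOKEN:
        ws.close()
        return
    ws.send('authenticated')
    # Delegate to the existing handler once the channel is authenticated
    transcribe_socket(ws)

You would then register this handler on the /transcribe rule in place of the unauthenticated one.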

A WebSocket-based streaming transcription service can unlock new possibilities for real-time audio processing applications. The code snippets provided offer a starting point for technical audiences to implement their own WebSocket Whisper servers.

Happy Coding!

References

insanely-fast-whisper: https://github.com/Vaibhavs10/insanely-fast-whisper/tree/main

whisper.cpp: https://github.com/ggerganov/whisper.cpp

diart: https://github.com/juanmc2005/diart

RecordRTC: https://www.npmjs.com/package/recordrtc

Contact

Open for contract projects as a Project Leader or Individual Contributor. Let’s chat!

LinkedIn: https://www.linkedin.com/in/davidrichards5/
Email: david.richards.tech (@) gmail.com