AI Audio Conversations With OpenAI Whisper
Goal: Create an AI Chat Bot that you can talk to
Tools: JavaScript (React), Python (Backend API Alternative), NodeJS (Backend API)
Frontend Client
Let's start by setting up our frontend to record audio and forward those recordings to our backend service. Once the backend receives the audio data, it can transcribe it with OpenAI's Whisper API. The transcription then becomes the prompt we send to the OpenAI Completions API, completing the loop from spoken audio to an answered question.
Here are a few resources I found to get started with browser audio recording:
Hark — Recognizes when someone starts and stops speaking by sampling the audio level of a stream on an interval. The library emits speaking and stopped_speaking events, so once you know when speech started and stopped you can send snippets of audio to your backend service for transcription.
https://github.com/latentflip/hark
React Speech Recognition — This library can be used to record audio and send it to the browser's native transcription service. Note that desktop Chrome works best with this service. The library could also be extended to use the Whisper API as the transcription backend.
https://www.npmjs.com/package/react-speech-recognition
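If the browser's native transcription is enough for your use case, a minimal sketch with react-speech-recognition's hook API might look like the following (the component name DictationDemo is purely illustrative):
import React from "react";
import SpeechRecognition, { useSpeechRecognition } from "react-speech-recognition";

// Minimal sketch: render the browser's native transcription as the user speaks
const DictationDemo = () => {
  const { transcript, listening, browserSupportsSpeechRecognition } = useSpeechRecognition();

  if (!browserSupportsSpeechRecognition) {
    return <p>This browser does not support speech recognition.</p>;
  }

  return (
    <div>
      <button onClick={() => SpeechRecognition.startListening({ continuous: true })}>Start</button>
      <button onClick={SpeechRecognition.stopListening}>Stop</button>
      <p>{listening ? "Listening..." : "Idle"}</p>
      <p>{transcript}</p>
    </div>
  );
};

export default DictationDemo;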
MediaRecorder — This built-in browser API can be used to record audio or video from the browser in the webm format. The webm format is already supported by OpenAI, so there is no need to convert it.
https://developer.mozilla.org/en-US/docs/Web/API/MediaRecorder
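The backend snippets later in this post assume a webm/opus recording, so it can be safer to request that container explicitly. Here is a minimal sketch of that idea; the fixed duration and the helper name recordWebmSnippet are just for illustration:
// Minimal sketch: record a few seconds of microphone audio to a webm/opus Blob
async function recordWebmSnippet(durationMs = 3000) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = "audio/webm;codecs=opus";
  // Check support before recording; not every browser offers webm
  if (!MediaRecorder.isTypeSupported(mimeType)) {
    throw new Error("webm/opus recording is not supported in this browser");
  }
  const recorder = new MediaRecorder(stream, { mimeType });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  return new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: mimeType }));
    recorder.start();
    setTimeout(() => recorder.stop(), durationMs);
  });
}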
import { useState } from "react";
import axios from "axios";
import hark from "hark";

let mediaRecorder;

export default () => {
  const [messages, setMessages] = useState([]);

  // Convert the recorded audio Blob to a base64 data URL and send it to the
  // backend /whisper endpoint for transcription
  const sendRecordedAudio = async (audioData) => {
    // first convert the audio data to base64
    const reader = new FileReader();
    reader.readAsDataURL(audioData);
    return new Promise((resolve, reject) => {
      reader.onloadend = function () {
        // Send base64 string data to the backend service
        axios
          .post("http://localhost:3000/whisper", { audio: reader.result })
          .then((res) => {
            resolve(res.data);
          })
          .catch((err) => {
            reject(err);
          });
      };
    });
  };

  // Send the transcribed text to the backend /completions endpoint
  const sendPrompt = async (prompt) => {
    return axios.post("http://localhost:3000/completions", { prompt });
  };

  const record = async () => {
    if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
      console.log("Starting to record");
      // Get audio stream
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: true,
        video: false,
      });
      // Generate the media recorder with the stream from media devices
      // Starting position is paused recording
      mediaRecorder = new MediaRecorder(stream);
      // Also pass the stream to hark to create speaking events
      const speech = hark(stream, {});
      // Start the recording when hark recognizes someone is speaking into the mic
      speech.on("speaking", function () {
        console.log("Speaking!");
        mediaRecorder.start();
      });
      // When hark recognizes the speaking has stopped we can stop recording
      // The stop action will cause ondataavailable() to be triggered
      speech.on("stopped_speaking", function () {
        console.log("Not Speaking");
        if (mediaRecorder.state === "recording") mediaRecorder.stop();
      });
      // Transcribe the recorded snippet, then ask the Completions API for a reply
      mediaRecorder.ondataavailable = (e) => {
        sendRecordedAudio(e.data).then((newMessage) => {
          sendPrompt(newMessage.text).then((aiRes) => {
            setMessages((prev) => [...prev, newMessage, { text: aiRes.data.message }]);
          });
        });
      };
    } else {
      console.log("recording not supported");
    }
  };

  const stopRecording = async () => {
    if (mediaRecorder) {
      if (mediaRecorder.state === "recording") mediaRecorder.stop();
      mediaRecorder.stream.getTracks().forEach((track) => track.stop());
    }
  };

  return (
    <div>
      <button onClick={record}>Record</button>
      <button onClick={stopRecording}>Stop</button>
      {messages.map((message, i) => (
        <p key={i}>{message.text}</p>
      ))}
    </div>
  );
};
Backend
Our backend service proxies the requests from the browser so the OpenAI API key is never exposed to the client. We will show two different options for the backend service, so you only need to run one of them.
Why are we writing the base64 data to a file and reading it back? The Whisper endpoint in the OpenAI library requires a file object, so to use the library we need to provide one. Alternatively, we could send the API request without the OpenAI library, like below.
// Turn a base64 data URI into multipart form data with a named file field
function createFormDataFromBase64(base64String, fieldName, fileName) {
  const byteString = atob(base64String.split(',')[1]);
  const mimeType = base64String.split(';')[0].split(':')[1];
  const arrayBuffer = new ArrayBuffer(byteString.length);
  const intArray = new Uint8Array(arrayBuffer);
  for (let i = 0; i < byteString.length; i += 1) {
    intArray[i] = byteString.charCodeAt(i);
  }
  const blob = new Blob([intArray], { type: mimeType });
  const formData = new FormData();
  formData.append(fieldName, blob, fileName);
  return formData;
}

// The transcription endpoint also expects the model name as a form field
const formData = createFormDataFromBase64(base64Str, 'file', 'audio.webm');
formData.append('model', 'whisper-1');

axios({
  method: 'post',
  url: 'https://api.openai.com/v1/audio/transcriptions',
  data: formData,
  headers: {
    'Content-Type': 'multipart/form-data',
    'Authorization': 'Bearer {OPEN_AI_API_KEY}',
  },
})
  .then(function (response) {
    // handle success
    console.log(response.data);
  })
  .catch(function (error) {
    // handle error
    console.log(error);
  });
NodeJS
const express = require('express')
const fs = require('fs')
const { Configuration, OpenAIApi } = require("openai");
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
const app = express()
const port = 3000
// Parse JSON request bodies; the limit is raised because base64 audio can be large
app.use(express.json({ limit: '25mb' }))

app.post('/whisper', (req, res) => {
  // Strip the data URI prefix and write the audio to a temporary webm file,
  // since the OpenAI library expects a file stream
  fs.writeFileSync(
    "/tmp/tmp.webm",
    Buffer.from(
      req.body.audio.replace("data:audio/webm;codecs=opus;base64,", ""),
      "base64"
    )
  );
  return openai
    .createTranscription(fs.createReadStream("/tmp/tmp.webm"), "whisper-1")
    .then((whisperRes) => {
      console.log("audio res", whisperRes.data.text);
      return res.json(whisperRes.data);
    })
    .catch((err) => {
      console.log(err);
      console.log(err.response.data.error);
      return res.status(500).json({ error: "transcription failed" });
    });
})
app.post('/completions', async (req, res) => {
  const completion = await openai.createCompletion({
    model: "text-davinci-003",
    prompt: req.body.prompt,
    max_tokens: 100,
    temperature: 0,
  });
  // Return only the generated text; the frontend reads it as data.message
  return res.json({ message: completion.data.choices[0].text });
})
app.listen(port, () => {
  console.log(`Example app listening on port ${port}`)
})
Python (Flask)
import os
import base64
import flask
import openai
from flask import Flask, request

openai.api_key = os.environ.get('OPENAI_API_KEY')
app = Flask(__name__)

@app.route('/whisper', methods=['POST'])
def whisper_api():
    json = request.json
    # Strip the data URI prefix before decoding the base64 audio
    audio_b64 = json['audio'].replace('data:audio/webm;codecs=opus;base64,', '')
    decoded_data = base64.b64decode(audio_b64)
    # Whisper expects a file object, so write the audio to a temporary file first
    with open('/tmp/audio.webm', 'wb') as f:
        f.write(decoded_data)
    audio_file = open('/tmp/audio.webm', 'rb')
    transcript = openai.Audio.transcribe("whisper-1", audio_file)
    return flask.jsonify({'text': transcript['text']})

@app.route('/completions', methods=['POST'])
def completion_api():
    json = request.json
    completion = openai.Completion.create(engine="text-davinci-003", prompt=json['prompt'])
    # Return only the generated text; the frontend reads it as data.message
    return flask.jsonify({'message': completion['choices'][0]['text']})
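Whichever backend you run, you can smoke-test it without the browser. The sketch below is only an illustration: it assumes the server is already listening on localhost:3000 and that a short recording named sample.webm sits next to the script.
// Quick smoke test for the backend, run with: node test.js
const fs = require('fs')
const axios = require('axios')

async function smokeTest() {
  // Build the same data URI string the browser's FileReader would produce
  const audioBase64 = fs.readFileSync('sample.webm').toString('base64')
  const dataUri = 'data:audio/webm;codecs=opus;base64,' + audioBase64

  // Transcribe the audio, then feed the text to the completions endpoint
  const whisperRes = await axios.post('http://localhost:3000/whisper', { audio: dataUri })
  console.log('Transcription:', whisperRes.data.text)

  const completionRes = await axios.post('http://localhost:3000/completions', {
    prompt: whisperRes.data.text,
  })
  console.log('AI reply:', completionRes.data.message)
}

smokeTest().catch((err) => console.error(err.message))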
Thank you for reading! Stay tuned for more.