May 10, 2024

Speech-to-text Multimodal Experience in NodeJS

Introduction

Large Language Models excel at text-related tasks. But what if you need to make a model multimodal? How can you teach a text model to process an audio file, for example?

There is a solution: combine two different models. One model transcribes the audio recording, and another processes the resulting text. The output of this processing is a description of what is happening in the recording.

This can be easily implemented using the text and audio models of the AI/ML API.

Choosing a Text Model in AI/ML API

Since the text model needs to strictly follow instructions, the best candidate for this would be an Instruct-model.

Browsing the models section, we can find one suited to our purposes. A good candidate is the Mixtral 8x7B Instruct model. For the speech-to-text model, I will choose Nova 2.

Obtaining a Token in AI/ML API

You can get the key from the AI/ML API website.

Implementation

Make sure that NodeJS is installed on your machine. If necessary, you can find installation instructions on the official NodeJS website.

For a clear demonstration of multimodality, you can create a web server that accepts the URL of an audio file along with a brief "type" of the recording, so the models can understand the context of the speech.

Preparation

You need to create a new project. To do this, create a new folder named aimlapi-multimodal-example in any convenient location and navigate into it.

mkdir aimlapi-multimodal-example
cd ./aimlapi-multimodal-example


Here, create a new project using npm and install the required dependencies:

npm init -y
npm i express axios


Create a source file that will contain all the necessary code, and open the project in your preferred IDE. In my case, I will be using VSCode.

touch ./index.js
code .


Importing Dependencies

To create the required functionality, you will need to use the AI/ML API. As a web server, any framework or module can be used, but for simplicity, I suggest using express.

AI/ML API supports usage via the OpenAI SDK, but you can also use plain HTTP requests with the Axios module:

const axios = require('axios').default;
const express = require('express');
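
For reference, the OpenAI SDK route could look like this. This is a minimal sketch assuming the openai npm package; the exact base URL (and whether it needs a version suffix such as /v1) should be checked against the AI/ML API docs:

// Alternative: point the OpenAI SDK at the AI/ML API (a sketch, not used below)
const OpenAI = require('openai');

const client = new OpenAI({
  baseURL: 'https://api.aimlapi.com', // assumed; verify in the API docs
  apiKey: '<YOUR TOKEN HERE>',
});

The rest of this walkthrough sticks with plain Axios requests.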

API Interfaces and Prompts

The next step is to create all the constants, an express application, and interfaces for accessing the APIs:

const TOKEN = '<YOUR TOKEN HERE>';
const PORT = 8080;
// Speech-to-text model name
const STT_MODEL = '#g1_nova-2-general';
// Large language model name
const LLM_MODEL = 'mistralai/Mixtral-8x7B-Instruct-v0.1';

// axios.create is a factory function, not a constructor
const api = axios.create({
  baseURL: 'https://api.aimlapi.com',
  headers: { Authorization: `Bearer ${TOKEN}` },
});
const app = express();

Text models operate with prompts. Therefore, you need to create prompts that instruct the model on how to process the transcription. There will be two prompts:

  • summary prompt: a detailed textual description of the audio file
  • context prompt: validation and editing of the description

Declare them in this manner:


const getSummaryPrompt =
  () => `Please provide a detailed report of the transcript I provide below, including key summary outcomes.
KEEP THESE RULES STRICTLY:
STRICTLY SPLIT OUTPUT IN PARAGRAPHS: Topic and the matter of discourse, Key outcomes, Ideas and Conclusions.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`;

const getContextPrompt = (
  type,
) => `Ensure integrity and quality of the given summary, it is the summary of a ${type}, edit it accordingly.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`;

These are template functions that return the required prompt string.
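
For example, calling the context template with a hypothetical recording type shows how the type is interpolated into the instruction:

// 'podcast' is just an illustrative value for the recording type
console.log(getContextPrompt('podcast'));
// -> "Ensure integrity and quality of the given summary, it is the summary of a podcast, edit it accordingly. ..."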

Express Endpoint

Our task will be handled by a GET HTTP endpoint at /summarize.

We declare it using express:

app.get('/summarize', async (req, res, next) => {
  // the handler body is assembled from the snippets below
});

Two parameters will be sent in the request: type and url. We will extract them from the request and perform basic validation.

const { type, url } = req.query;
if (!type || !url) {
  return res.status(400).send({ error: "'type' and 'url' parameters required" });
}

Next, we need to send a request to the speech-to-text API to obtain a transcription of the audio file:

const {
  data: {
    results: {
      channels: [
        {
          alternatives: [{ transcript }],
        },
      ],
    },
  },
} = await api.post('/stt', { model: STT_MODEL, url });

We are interested only in the first channel and its first alternative, so we ignore any other results and extract the transcript using a destructuring assignment.
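
Note that this deep destructuring throws a TypeError if the API returns no channels or alternatives. A more defensive sketch of the same extraction, using optional chaining:

const { data } = await api.post('/stt', { model: STT_MODEL, url });
const transcript = data?.results?.channels?.[0]?.alternatives?.[0]?.transcript;
if (!transcript) {
  // 502: the upstream speech-to-text service gave us nothing usable
  return res.status(502).send({ error: 'no transcript returned' });
}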

Next, we need to process the transcription using the AI/ML API. For this, we will use the chat/completions route.

const { data: summaryCompletion } = await api.post('/chat/completions', {
  model: LLM_MODEL,
  messages: [
    { role: 'system', content: getSummaryPrompt() },
    { role: 'user', content: transcript },
  ],
});

const { data: contextedCompletion } = await api.post('/chat/completions', {
  model: LLM_MODEL,
  messages: [
    { role: 'system', content: getContextPrompt(type) },
    { role: 'user', content: summaryCompletion.choices[0].message.content },
  ],
});


This passes the result through the model twice, improving its quality and eliminating some errors the model might have made on the first pass.
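
Since both calls share the same shape, they could be factored into a small helper. This is a sketch; complete is a helper name introduced here, not part of the API:

// Hypothetical helper: one system + user round trip, returns the reply text
const complete = async (system, user) => {
  const { data } = await api.post('/chat/completions', {
    model: LLM_MODEL,
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user },
    ],
  });
  return data.choices[0].message.content;
};

// The two passes then read as a simple pipeline:
const summary = await complete(getSummaryPrompt(), transcript);
const contexted = await complete(getContextPrompt(type), summary);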

Now we need to return the response, formatting it visually:

const response = `<pre style="font-family: sans-serif; white-space: pre-line;">${contextedCompletion.choices[0].message.content}</pre>`;
res.send(response);
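
One caveat: Express 4 (which npm i express installs at the time of writing) does not forward rejected promises from async handlers to its error middleware, so a failed API call would surface as an unhandled rejection. A minimal sketch of guarding the handler body:

app.get('/summarize', async (req, res, next) => {
  try {
    // ... the handler body assembled from the snippets above ...
  } catch (err) {
    next(err); // express replies with a 500 via its default error handler
  }
});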

With this, the processing of the /summarize request is complete. All that remains is to launch the web server:

app.listen(PORT, () => {
  console.log(`listening on http://127.0.0.1:${PORT}`);
});


Result

Launch the application using the command:

node ./index.js

You will see a console message confirming the server is running, along with its address. You can check the result in the browser by going to the server's address plus the API request path: http://127.0.0.1:8080/summarize.

You will immediately see an error:

{"error":"'type' and 'url' parameters required"}

This indicates that basic parameter validation is working. Now specify the necessary parameters in the URL for the request to be processed correctly:

http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&type=voice

This will return approximately the following result:

Summary:

* Speaker admires Mr. Rochester's beauty and devotion.
* Mr. Rochester is described as subdued and open to external influences.
* Speaker's admiration suggests a positive relationship.
* Use of language hints at Mr. Rochester's strength and control.

The text appears to be a fragmented transcription about a person named Mr. Rochester. The speaker expresses admiration for Mr. Rochester's beauty and will, describing him as subdued and devoted. The speaker's admiration and use of language suggest a positive relationship and impression of Mr. Rochester. The phrase "bowed to let might in" is unclear but may indicate Mr. Rochester's openness to external influences. The text's limited and fragmented nature makes definitive conclusions difficult, but the speaker's admiration and use of language hint at Mr. Rochester's strength and control.

Voila! We have created an application capable of transcribing an audio file and producing a brief description of it. It runs on a web server, so it can be used in completely different contexts. For example, instead of a browser, we can use the wget utility and see the result directly in the terminal:

wget -q -O - 'http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&type=speech'

Next steps

This is a simplified example application. In a production environment, you would need to make many changes, such as:

  • File streaming: The ability to send data streams rather than just file URLs can be useful.
  • Parameterization: Change the model based on business requirements.
  • Data Validation: It's important to validate user input to ensure accuracy (see the sketch after this list).
  • Containerization: Use containers to simplify deployment and scaling.
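
As a small example of the validation point above, the url parameter could be checked with the built-in URL constructor before being forwarded to the API. This is a sketch; isValidAudioUrl is a helper name introduced here, and it only verifies the URL is well-formed http(s), not that it points to real audio:

// Hypothetical helper: reject anything that is not a well-formed http(s) URL
const isValidAudioUrl = (value) => {
  try {
    const parsed = new URL(value);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    return false;
  }
};

if (!isValidAudioUrl(url)) {
  return res.status(400).send({ error: "'url' must be a valid http(s) URL" });
}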

Conclusion

Using text models through a multimodal approach opens up the possibility of solving tasks that previously seemed impossible. For example, we can transcribe YouTube videos, explain complex diagrams in simple language, or conduct an entire study by explaining the instructions to the model in simple human language.
