Send over Jina

In this example we'll build an audio-to-text app using Jina, DocArray and Whisper.

We will use:

  • DocArray >=0.30: To load and preprocess multimodal data such as images, text and audio.
  • Jina: To serve the model quickly and create a client.

Install packages

First, let's install the requirements:

pip install transformers
pip install openai-whisper
pip install jina
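
Before moving on, it can help to confirm that the packages are importable. The sketch below uses only the standard library; missing_modules is a hypothetical helper, not part of Jina or DocArray. Note that the openai-whisper package is imported under the name whisper.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names used in this example (openai-whisper is imported as `whisper`).
required = ["transformers", "whisper", "jina", "docarray"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```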

Import libraries

Let's import the necessary libraries:

import whisper
from jina import Executor, requests, Deployment
from docarray import BaseDoc, DocList
from docarray.typing import AudioUrl

Create schemas

Now we need to create the schema of our input and output documents. Since our input is an audio URL, our input schema should contain an AudioUrl:

class AudioURL(BaseDoc):
    audio: AudioUrl

For the output schema we would like to receive the transcribed text:

class Response(BaseDoc):
    text: str

Create Executor

To serve the model, we wrap it in a Jina Executor, which lets us expose it at the /transcribe endpoint:

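class WhisperExecutor(Executor):
    def __init__(self, device: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load the English-only "medium" Whisper checkpoint on the requested device
        self.model = whisper.load_model("medium.en", device=device)

    @requests(on='/transcribe')
    def transcribe(self, docs: DocList[AudioURL], **kwargs) -> DocList[Response]:
        response_docs = DocList[Response]()
        for doc in docs:
            # Whisper returns a dict; the 'text' key holds the full transcript
            transcribed_text = self.model.transcribe(str(doc.audio))['text']
            response_docs.append(Response(text=transcribed_text))
        return response_docs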
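
The per-document logic of the request handler can be checked without downloading Whisper weights or starting Jina. The sketch below is only a stand-in under stated assumptions: FakeModel, AudioItem and transcribe_all are hypothetical names used for illustration, and FakeModel merely mimics the shape of Whisper's result, a dict with a 'text' key.

```python
from dataclasses import dataclass


@dataclass
class AudioItem:
    """Hypothetical stand-in for the AudioURL document: just holds a path or URL."""
    audio: str


class FakeModel:
    """Hypothetical stub that mimics the result shape of a Whisper model."""

    def transcribe(self, path: str) -> dict:
        # A real Whisper model would decode the audio; the stub returns canned text.
        return {"text": f"transcript of {path}"}


def transcribe_all(model, docs):
    """Mirror the Executor loop: one transcript string per input document."""
    return [model.transcribe(str(doc.audio))["text"] for doc in docs]


texts = transcribe_all(FakeModel(), [AudioItem("a.mp3"), AudioItem("b.mp3")])
print(texts)
```

Because the handler only relies on the model exposing a transcribe method, the same loop can be tested quickly with a stub and then run unchanged against the real model.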

Deploy Executor and get results

Now we can leverage Jina's Deployment object to deploy this Executor, then send a request to the /transcribe endpoint.

Here we are using an audio file that says, "A man reading a book", saved as resources/audio.mp3:

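dep = Deployment(
    uses=WhisperExecutor, uses_with={'device': "cpu"}, port=12349, timeout_ready=-1
)

with dep:
    # Post the audio document to /transcribe and parse the reply as Response docs
    docs = dep.post(
        on='/transcribe',
        inputs=[AudioURL(audio='resources/audio.mp3')],
        return_type=DocList[Response],
    )
    print(docs[0].text)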

And we get the transcribed result:

A man reading a book
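
When checking a transcript against an expected sentence, exact string equality can be brittle, since speech-to-text output may differ in casing, punctuation or surrounding whitespace. A minimal sketch of a more tolerant comparison follows; normalize and same_transcript are hypothetical helpers, not part of Whisper or Jina.

```python
import string


def normalize(text: str) -> str:
    """Lower-case, drop ASCII punctuation, and collapse runs of whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def same_transcript(actual: str, expected: str) -> bool:
    """Compare two transcripts after normalization."""
    return normalize(actual) == normalize(expected)


print(same_transcript(" A man reading a book.", "A man reading a book"))
```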