At the heart of DocArray lies the concept of BaseDoc.
A BaseDoc is very similar to a Pydantic BaseModel; in fact, it is a specialized Pydantic BaseModel. It allows you to define custom Document schemas (or Models, in the Pydantic world) to represent your data.
Naming convention: When we refer to a BaseDoc, we mean a class that inherits from BaseDoc. When we refer to a Document, we mean an instance of a BaseDoc class.
Before going into detail about what we can do with BaseDoc and how to use it, let's see what it looks like in practice.
The following Python code defines a BannerDoc class that can be used to represent the data of a website banner:

    from docarray import BaseDoc
    from docarray.typing import ImageUrl


    class BannerDoc(BaseDoc):
        image_url: ImageUrl
        title: str
        description: str

You can then instantiate a BannerDoc object and access its attributes:

    banner = BannerDoc(
        image_url='https://example.com/image.png',
        title='Hello World',
        description='This is a banner',
    )

    assert banner.image_url == 'https://example.com/image.png'
    assert banner.title == 'Hello World'
    assert banner.description == 'This is a banner'

BaseDoc is a Pydantic BaseModel
The BaseDoc class inherits from Pydantic BaseModel. This means you can use all the features of BaseModel in your BaseDoc. BaseDoc:
- Will perform data validation: BaseDoc checks that the data you pass to it is valid and raises an error if it is not. What counts as "valid" is defined by the type used in the type hint itself; we will come back to this concept later.
- Can be configured using a nested Config class. See the Pydantic documentation for details on the configuration options Pydantic offers.
- Can be used as a drop-in replacement for BaseModel in your code and is compatible with tools that use Pydantic, like FastAPI.
Representing multimodal and nested data
Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos. A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple).
All of these elements are from different
modalities: the title and description are text, the thumbnail is an image, and the video itself is, well, a video.
DocArray lets you represent all of this multimodal data in a single object.
Let's first create a BaseDoc for each of the elements that compose the YouTube video.
First for the thumbnail image:

    from typing import Optional

    from docarray import BaseDoc
    from docarray.typing import ImageUrl, ImageBytes


    class ImageDoc(BaseDoc):
        url: ImageUrl
        # bytes are not always loaded in memory, so we make the field optional
        bytes: Optional[ImageBytes] = None

Then for the video itself:

    from typing import Optional

    from docarray import BaseDoc
    from docarray.typing import VideoUrl, VideoBytes


    class VideoDoc(BaseDoc):
        url: VideoUrl
        # bytes are not always loaded in memory, so we make the field optional
        bytes: Optional[VideoBytes] = None

Then for the title and description (which are text) we'll just use a str.
All the elements that compose a YouTube video are ready:

    from docarray import BaseDoc


    class YouTubeVideoDoc(BaseDoc):
        title: str
        description: str
        thumbnail: ImageDoc
        video: VideoDoc

We now have a YouTubeVideoDoc, which is a Pythonic representation of a YouTube video.
This representation can be used to send or store data. You can even use it directly to train a PyTorch machine learning model.
You see here that ImageDoc and VideoDoc are themselves BaseDoc classes, and that they are later used inside another BaseDoc.
This is what we call nested data representation.
A BaseDoc can be nested to represent any kind of data hierarchy.