Document
At the heart of DocArray lies the concept of BaseDoc.
A BaseDoc is very similar to a Pydantic BaseModel; in fact, it is a specialized Pydantic BaseModel. It allows you to define custom Document schemas (or Models in the Pydantic world) to represent your data.
Note
Naming convention: when we refer to a BaseDoc, we mean a class that inherits from BaseDoc.
When we refer to a Document, we mean an instance of a BaseDoc class.
Basic Doc usage
Before going into detail about what we can do with BaseDoc and how to use it, let's see what it looks like in practice.
The following Python code defines a BannerDoc class that can be used to represent the data of a website banner:
from docarray import BaseDoc
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image_url: ImageUrl
    title: str
    description: str
You can then instantiate a BannerDoc object and access its attributes:
banner = BannerDoc(
    image_url='https://example.com/image.png',
    title='Hello World',
    description='This is a banner',
)
assert banner.image_url == 'https://example.com/image.png'
assert banner.title == 'Hello World'
assert banner.description == 'This is a banner'
BaseDoc is a Pydantic BaseModel
The BaseDoc class inherits from Pydantic BaseModel. This means you can use all the features of BaseModel in your Doc class. BaseDoc:
- Will perform data validation: BaseDoc will check that the data you pass to it is valid. If not, it will raise an error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later.
- Can be configured using a nested Config class. See the Pydantic documentation for more detail on what kind of config Pydantic offers.
- Can be used as a drop-in replacement for BaseModel in your code and is compatible with tools that use Pydantic, like FastAPI.
Representing multimodal and nested data
Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos. A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple).
All of these elements are from different modalities: the title and description are text, the thumbnail is an image, and the video itself is, well, a video.
DocArray lets you represent all of this multimodal data in a single object.
Let's first create a BaseDoc for each of the elements that compose the YouTube video.
First for the thumbnail image:
from typing import Optional

from docarray import BaseDoc
from docarray.typing import ImageUrl, ImageBytes


class ImageDoc(BaseDoc):
    url: ImageUrl
    # bytes are not always loaded in memory, so we make it optional
    bytes: Optional[ImageBytes] = None
Then for the video itself:
from typing import Optional

from docarray import BaseDoc
from docarray.typing import VideoUrl, VideoBytes


class VideoDoc(BaseDoc):
    url: VideoUrl
    # bytes are not always loaded in memory, so we make it optional
    bytes: Optional[VideoBytes] = None
Then for the title and description (which are text) we'll just use a str type.
All the elements that compose a YouTube video are ready:
from docarray import BaseDoc


class YouTubeVideoDoc(BaseDoc):
    title: str
    description: str
    thumbnail: ImageDoc
    video: VideoDoc
We now have YouTubeVideoDoc, which is a Pythonic representation of a YouTube video.
This representation can be used to send or store data. You can even use it directly to train a PyTorch machine learning model.
Note
You see here that ImageDoc and VideoDoc are also BaseDocs, and they are later used inside another BaseDoc.
This is what we call nested data representation.
BaseDoc can be nested to represent any kind of data hierarchy.
Setting a Pydantic Config class
Documents support setting a custom configuration like any other Pydantic BaseModel.
Here is an example of how to extend the Config of a Document, depending on which version of Pydantic you are using.
See also: