Array of documents
DocArray allows users to represent and manipulate multimodal data to build AI applications such as neural search and generative AI.
However, in machine learning we often need to work with an array of documents, that is, an array of data points.
This section introduces the concept of AnyDocArray, which is an (abstract) collection of BaseDoc. The name of this library, DocArray, is derived from this concept and is short for DocumentArray.
We provide two concrete implementations of AnyDocArray:
- DocList, which is a Python list of BaseDocs
- DocVec, which is a column-based representation of BaseDocs
We will go into the difference between DocList and DocVec in the next section, but let's first focus on what they have in common.
The spirit of AnyDocArrays is to extend the BaseDoc and BaseModel concepts to the array level in a seamless way.
Before going into detail let's look at a code example.
DocList and DocVec are both AnyDocArrays. The following section will use DocList as an example, but the same applies to DocVec.
First you need to create a
Doc class, our data schema. Let's say you want to represent a banner with an image, a title and a description:
Let's instantiate several BannerDocs. You can then collect them into a DocList of BannerDocs:
docs here is an array-like collection of BannerDocs.
You can access documents inside it with the usual Python array API, or iterate over it:
The syntax DocList[BannerDoc] might surprise you in this context.
It is actually at the heart of DocArray, but we'll come back to it later and continue with this example for now.
As we said earlier,
DocList (or more generally
AnyDocArray) extends the
BaseDoc API at the array level.
What this means concretely is you can access your data at the Array level in just the same way you would access your data at the document level.
Let's see what that looks like, first at the document level and then at the Array level:
All the attributes of
BannerDoc are accessible at the Array level.
While this is true at runtime, static type analyzers like Mypy or IDEs like PyCharm will not be aware of it. This limitation is known and will be fixed in the future by the introduction of plugins for Mypy, PyCharm and VSCode.
This even works when you have a nested BaseDoc:
```python
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image: ImageUrl
    title: str
    description: str


class PageDoc(BaseDoc):
    banner: BannerDoc
    content: str


page1 = PageDoc(
    banner=BannerDoc(
        image='https://example.com/image1.png',
        title='Hello World',
        description='This is a banner',
    ),
    content='Hello world is the most used example in programming, but do you know that? ...',
)

page2 = PageDoc(
    banner=BannerDoc(
        image='https://example.com/image2.png',
        title='Bye Bye World',
        description='This is (distopic) banner',
    ),
    content='What if the most used example in programming was Bye Bye World, would programming be that much fun? ...',
)

docs = DocList[PageDoc]([page1, page2])

docs.summary()
```
```text
╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[PageDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ───────╮
│                              │
│   PageDoc                    │
│   ├── banner: BannerDoc      │
│   │   ├── image: ImageUrl    │
│   │   ├── title: str         │
│   │   └── description: str   │
│   └── content: str           │
│                              │
╰──────────────────────────────╯
```
docs.banner returns a nested DocList of BannerDocs.
You can even access the attributes of the nested
BaseDoc at the Array level:
This is just the same way that you would do it with BaseDoc:
As you have seen in the previous section,
AnyDocArray will expose the same attributes as the
BaseDocs it contains.
But this concept works if and only if all of the
BaseDocs in the
AnyDocArray have the same schema.
If one of your
BaseDocs has an attribute that the others don't, you will get an error if you try to access it at
the Array level.
To extend your schema to the Array level, AnyDocArray needs to contain homogeneous Documents.
This is where the custom syntax
DocList[DocType] comes into play.
DocList[DocType] creates a custom DocList that can only contain DocType Documents.
This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of BaseDocs rather than just an array of heterogeneous BaseDocs.
That said, AnyDocArray can also be used to create a heterogeneous AnyDocArray.
A DocList without a type parameter can be used to create a heterogeneous list of BaseDocs.
DocVec cannot store heterogeneous BaseDocs and always needs the DocVec[DocType] syntax.
The usage of a heterogeneous
DocList is similar to a normal Python list but still offers DocArray functionality
like serialization and sending over the wire. However, it won't be able to extend the API of your custom schema to the Array level.
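You can instantiate a heterogeneous DocList by using DocList without a type parameter. Passing documents of different types to a typed DocList, such as DocList[ImageDoc], is not possible.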
DocList and DocVec share almost everything that has been said in the previous sections, but they have some conceptual differences.
DocList is based on Python Lists.
You can append, extend, insert, pop, and so on. In DocList, data is individually owned by each BaseDoc; the DocList just collects references to the different Documents. Use DocList when you want to be able to rearrange or re-rank your data. One flaw of DocList is that none of the data is contiguous in memory, so you cannot leverage functions that require contiguous data without first copying the data into a contiguous array.
DocVec is a columnar data structure.
DocVec is always an array
of homogeneous Documents. The idea is that every attribute of the
BaseDoc will be stored in a contiguous array: a column.
This means that when you access the attribute of a
BaseDoc at the Array level, we don't collect the data under the hood
from all the documents (like
DocList) before giving it back to you. We just return the column that is stored in memory.
This really matters when you need to handle multimodal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication which is at the heart of Machine Learning, especially in Deep Learning.
Let's take an example to illustrate the difference:
Let's say you want to work with an Image, and that you have a function that takes a contiguous array of images as input (like a deep learning model).
Let's create a DocList of ImageDocs and pass it to the function:
When you call docs.image, DocList loops over the ten documents and collects the image attribute of each document in a list, much like a manual for loop that appends doc.image for every doc.
This means that if you call docs.image multiple times, each call collects the images from every document again, and you have to stack them again as well. This is not optimal.
Let's see how it will work with DocVec.
The first difference is that you don't need to stack docs.image yourself, because docs.image is already a contiguous array.
The second difference is that you just get the column and don't need to create it at each call.
One of the other main differences between both of them is how you can access documents inside them.
If you access a document inside a
DocList you will get a
BaseDoc instance, i.e. a document.
If you access a document inside a
DocVec you will get a document view. A document view is a view of the columnar data structure which
looks and behaves like a
BaseDoc instance. It is a
BaseDoc instance but with a different way to access the data.
When you make a change at the view level it will be reflected at the DocVec level, whereas a document taken from a DocList is a regular BaseDoc and not a view:
To summarize: you should use
DocVec when you need to work with contiguous data, and you should use
DocList when you need to rearrange
or extend your data.
Dealing with Optional fields
Nested optional field
By a "nested optional field" we mean a document that is contained within another document, and declared as Optional.
Using nested optional fields differs slightly between DocList and DocVec, so watch out. But in a nutshell:
When accessing a nested BaseDoc:
- DocList will return a list of documents if the field is optional and a DocList if the field is not optional
- DocVec will return a DocVec if all documents are there, or None if all docs are None. No mix of docs and None allowed!
- DocVec will behave the same for a tensor field instead of a BaseDoc
DocList with nested optional Field
Let's take an example to illustrate the exact behavior:
In this example
ArticleDoc has an optional field
image which is an
ImageDoc. This means that this field can either be None or be an ImageDoc.
Remember that for both DocList and DocVec, calling docs.image will return a list-like object of all the images of the documents.
For DocList this call will iterate over all the documents and collect the image attribute of each document in a sequence, and for DocVec it will return the already stacked column of the image attribute.
For DocList it will return a list of Optional[ImageDoc] instead of a DocList[ImageDoc], because the list can contain None and a DocList can't.
DocVec with nested optional Field
For DocVec it is a bit different. Indeed, a DocVec stores the data for each field as a contiguous column. This means that DocVec can create a column in only two cases: either all the data for a field is None or all the data is not None.
In the first case, the whole column will just be None. In the second case, the column will be a DocVec[ImageDoc]. If you try to mix documents and None, you will get an error.