Array of documents
DocArray allows users to represent and manipulate multimodal data to build AI applications such as neural search and generative AI.
As you have seen in the previous section, the fundamental building block of DocArray is the BaseDoc class, which represents a single document, i.e. a single data point.
However, in machine learning we often need to work with an array of documents, that is, an array of data points.
This section introduces the concept of AnyDocArray, which is an (abstract) collection of BaseDocs. The name of this library, DocArray, is derived from this concept and is short for DocumentArray.
AnyDocArray
AnyDocArray is an abstract class that represents an array of BaseDocs. It is not meant to be used directly, but to be subclassed.
We provide two concrete implementations of AnyDocArray:
- DocList, which is a Python list of BaseDocs
- DocVec, which is a column-based representation of BaseDocs
We will go into the difference between DocList and DocVec in the next section, but let's first focus on what they have in common.
The spirit of AnyDocArrays is to extend the BaseDoc and BaseModel concepts to the array level in a seamless way.
Example
Before going into detail, let's look at a code example.
Note
DocList and DocVec are both AnyDocArrays. The following section will use DocList as an example, but the same applies to DocVec.
First you need to create a Doc class, which is your data schema. Let's say you want to represent a banner with an image, a title and a description:
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image: ImageUrl
    title: str
    description: str
Let's instantiate several BannerDocs:
banner1 = BannerDoc(
    image='https://example.com/image1.png',
    title='Hello World',
    description='This is a banner',
)

banner2 = BannerDoc(
    image='https://example.com/image2.png',
    title='Bye Bye World',
    description='This is (distopic) banner',
)
You can now collect them into a DocList of BannerDocs:
╭──────── DocList Summary ────────╮
│ │
│ Type DocList[BannerDoc] │
│ Length 2 │
│ │
╰─────────────────────────────────╯
╭──── Document Schema ─────╮
│ │
│ BannerDoc │
│ ├── image: ImageUrl │
│ ├── title: str │
│ └── description: str │
│ │
╰──────────────────────────╯
docs here is an array-like collection of BannerDocs.
You can access documents inside it with the usual Python array API:
BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner')
or iterate over it:
BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner')
BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', description='This is (distopic) banner')
Note
The syntax DocList[BannerDoc] might surprise you in this context.
It is actually at the heart of DocArray, but we'll come back to it later and continue with this example for now.
As we said earlier, DocList (or more generally AnyDocArray) extends the BaseDoc API at the array level.
Concretely, this means that you can access your data at the Array level in just the same way you would access it at the document level.
Let's see what that looks like:
At the document level:
At the Array level:
Important
All the attributes of BannerDoc are accessible at the Array level.
Warning
Whereas this is true at runtime, static type analyzers like Mypy or IDEs like PyCharm will not be aware of it. This limitation is known and will be fixed in the future by the introduction of plugins for Mypy, PyCharm and VSCode.
This even works when you have a nested BaseDoc:
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image: ImageUrl
    title: str
    description: str


class PageDoc(BaseDoc):
    banner: BannerDoc
    content: str


page1 = PageDoc(
    banner=BannerDoc(
        image='https://example.com/image1.png',
        title='Hello World',
        description='This is a banner',
    ),
    content='Hello world is the most used example in programming, but do you know that? ...',
)

page2 = PageDoc(
    banner=BannerDoc(
        image='https://example.com/image2.png',
        title='Bye Bye World',
        description='This is (distopic) banner',
    ),
    content='What if the most used example in programming was Bye Bye World, would programming be that much fun? ...',
)

docs = DocList[PageDoc]([page1, page2])

docs.summary()
╭─────── DocList Summary ───────╮
│ │
│ Type DocList[PageDoc] │
│ Length 2 │
│ │
╰───────────────────────────────╯
╭────── Document Schema ───────╮
│ │
│ PageDoc │
│ ├── banner: BannerDoc │
│ │ ├── image: ImageUrl │
│ │ ├── title: str │
│ │ └── description: str │
│ └── content: str │
│ │
╰──────────────────────────────╯
Yes, docs.banner returns a nested DocList of BannerDocs!
You can even access the attributes of the nested BaseDoc at the Array level:
This is just the same way that you would do it with BaseDoc:
DocList[DocType] syntax
As you have seen in the previous section, AnyDocArray will expose the same attributes as the BaseDocs it contains.
But this concept works if and only if all of the BaseDocs in the AnyDocArray have the same schema.
If one of your BaseDocs has an attribute that the others don't, you will get an error if you try to access it at the Array level.
Note
To extend your schema to the Array level, AnyDocArray needs to contain homogeneous Documents.
This is where the custom syntax DocList[DocType] comes into play.
Note
DocList[DocType] creates a custom DocList that can only contain DocType Documents.
This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of homogeneous BaseDocs rather than just an array of heterogeneous BaseDocs.
That said, AnyDocArray can also be used to create a heterogeneous AnyDocArray:
Note
The default DocList can be used to create a heterogeneous list of BaseDocs.
Warning
DocVec cannot store heterogeneous BaseDocs and always needs the DocVec[DocType] syntax.
The usage of a heterogeneous DocList is similar to a normal Python list but still offers DocArray functionality like serialization and sending over the wire. However, it won't be able to extend the API of your custom schema to the Array level.
Here is how you can instantiate a heterogeneous DocList:
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl, AudioUrl


class ImageDoc(BaseDoc):
    url: ImageUrl


class AudioDoc(BaseDoc):
    url: AudioUrl


docs = DocList(
    [
        ImageDoc(url='https://example.com/image1.png'),
        AudioDoc(url='https://example.com/audio1.mp3'),
    ]
)
But this is not possible:
try:
    docs = DocList[ImageDoc](
        [
            ImageDoc(url='https://example.com/image1.png'),
            AudioDoc(url='https://example.com/audio1.mp3'),
        ]
    )
except ValueError as e:
    print(e)
ValueError: AudioDoc(
id='e286b10f58533f48a0928460f0206441',
url=AudioUrl('https://example.com/audio1.mp3', host_type='domain')
) is not a <class '__main__.ImageDoc'>
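To make the idea behind the DocList[DocType] syntax more concrete, here is a toy, standard-library-only sketch of a list that can be parametrized with an item type. TypedList, ImageItem and AudioItem are hypothetical names, and this is not how DocArray implements DocList; it only illustrates how a subscripted class can reject items of the wrong type while the unsubscripted class stays heterogeneous:

```python
from typing import Any, Dict, Iterable, Optional, Type


class TypedList(list):
    """Toy stand-in for the DocList idea; NOT DocArray's implementation."""

    item_type: Optional[type] = None  # None means "accept any item"
    _cache: Dict[type, Type["TypedList"]] = {}

    def __class_getitem__(cls, item_type: type) -> Type["TypedList"]:
        # TypedList[Foo] builds (and caches) a subclass bound to Foo.
        if item_type not in cls._cache:
            name = f"{cls.__name__}[{item_type.__name__}]"
            cls._cache[item_type] = type(name, (cls,), {"item_type": item_type})
        return cls._cache[item_type]

    def __init__(self, items: Iterable[Any] = ()) -> None:
        items = list(items)
        # This sketch validates only at construction time (not on append).
        if self.item_type is not None:
            for item in items:
                if not isinstance(item, self.item_type):
                    raise ValueError(f"{item!r} is not a {self.item_type}")
        super().__init__(items)


class ImageItem:
    pass


class AudioItem:
    pass


# Unparametrized: heterogeneous content is accepted.
mixed = TypedList([ImageItem(), AudioItem()])

# Parametrized: only ImageItem instances are accepted.
images = TypedList[ImageItem]([ImageItem(), ImageItem()])

try:
    TypedList[ImageItem]([ImageItem(), AudioItem()])
    rejected = False
except ValueError:
    rejected = True
```

Caching the generated subclass means that TypedList[ImageItem] always refers to the same class, so isinstance checks against it behave predictably.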
DocList vs DocVec
DocList and DocVec are both AnyDocArrays, but they have different use cases and differ in how they store data in memory.
They share almost everything that has been said in the previous sections, but they have some conceptual differences.
DocList is based on Python lists.
You can append, extend, insert, pop, and so on. In DocList, data is individually owned by each BaseDoc; the DocList just collects references to the different Documents. Use DocList when you want to be able to rearrange or re-rank your data. One drawback of DocList is that none of the data is contiguous in memory, so you cannot leverage functions that require contiguous data without first copying the data into a contiguous array.
DocVec is a columnar data structure. DocVec is always an array of homogeneous Documents. The idea is that every attribute of the BaseDoc will be stored in a contiguous array: a column.
This means that when you access the attribute of a BaseDoc at the Array level, we don't collect the data under the hood from all the documents (like DocList does) before giving it back to you. We just return the column that is stored in memory.
This really matters when you need to handle multimodal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication, which is at the heart of machine learning, especially deep learning.
Let's take an example to illustrate the difference.
Let's say you want to work with an image:
from docarray import BaseDoc
from docarray.typing import NdArray


class ImageDoc(BaseDoc):
    image: NdArray[
        3, 224, 224
    ] = None  # [3, 224, 224] just means that we know the shape of the tensor in advance
And that you have a function that takes a contiguous array of images as input (like a deep learning model):
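A hypothetical stand-in for such a function, only to make the calls that follow concrete. The name predict matches those calls; the body (a mean over the pixels of each image) is an assumption and not a real model:

```python
import numpy as np


def predict(images: np.ndarray) -> np.ndarray:
    """Hypothetical model stub: expects a batch of shape (batch, 3, 224, 224)."""
    if images.ndim != 4 or images.shape[1:] != (3, 224, 224):
        raise ValueError(f"expected (batch, 3, 224, 224), got {images.shape}")
    # Placeholder computation: one score per image (its mean pixel value).
    return images.reshape(images.shape[0], -1).mean(axis=1)


batch = np.zeros((10, 3, 224, 224))
scores = predict(batch)
```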
Let's create a DocList of ImageDocs and pass it to the function:
from docarray import DocList
import numpy as np

docs = DocList[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

predict(np.stack(docs.image))
...
predict(np.stack(docs.image))
When you call docs.image, DocList loops over the ten documents and collects the image attribute of each document in a list. It is similar to doing:
This means that if you call docs.image multiple times, under the hood you will collect the image from each document and stack them again each time. This is not optimal.
Let's see how it will work with DocVec:
from docarray import DocVec
import numpy as np

docs = DocVec[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

predict(docs.image)
...
predict(docs.image)
The first difference is that you don't need to call np.stack on docs.image because docs.image is already a contiguous array.
The second difference is that you just get the column and don't need to create it at each call.
Another main difference between the two is how you access documents inside them.
If you access a document inside a DocList you will get a BaseDoc instance, i.e. a document.
If you access a document inside a DocVec you will get a document view. A document view is a view of the columnar data structure which looks and behaves like a BaseDoc instance. It is a BaseDoc instance but with a different way to access the data.
When you make a change at the view level it will be reflected at the DocVec level:
from docarray import DocVec

docs = DocVec[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

my_doc = docs[0]

assert my_doc.is_view()  # True
whereas with DocList:
docs = DocList[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

my_doc = docs[0]

assert not my_doc.is_view()  # False
Note
To summarize: you should use DocVec when you need to work with contiguous data, and you should use DocList when you need to rearrange or extend your data.
Dealing with Optional fields
Both DocList and DocVec support nested optional fields but they behave slightly differently.
Nested optional field
By a "nested optional field" we mean a document that is contained within another document, and declared as Optional:
Using nested optional fields differs slightly between DocList and DocVec, so watch out. In a nutshell:
When accessing a nested BaseDoc:
- DocList will return a list of documents if the field is optional and a DocList if the field is not optional
- DocVec will return a DocVec if all documents are there, or None if all docs are None. No mix of docs and None allowed!
- DocVec will behave the same for a tensor field instead of a BaseDoc
DocList with nested optional Field
Let's take an example to illustrate the exact behavior:
from typing import Optional

from docarray import BaseDoc
from docarray.typing import NdArray
import numpy as np


class ImageDoc(BaseDoc):
    tensor: NdArray


class ArticleDoc(BaseDoc):
    image: Optional[ImageDoc] = None
    title: str
In this example ArticleDoc has an optional field image which is an ImageDoc. This means that this field can either be None or be an ImageDoc instance.
Remember that for both DocList and DocVec calling docs.image will return a list-like object of all the images of the documents.
For DocList this call will iterate over all the documents and collect the image attribute of each document in a sequence, and for DocVec it will return the already stacked column of the .image attribute.
For DocList it will return a list of Optional[ImageDoc] instead of a DocList[ImageDoc]. This is because the list can contain None and a DocList can't.
from docarray import DocList

docs = DocList[ArticleDoc](
    [
        ArticleDoc(image=ImageDoc(tensor=np.ones((3, 224, 224))), title="Hello"),
        ArticleDoc(image=None, title="World"),
    ]
)

assert docs.image == [ImageDoc(tensor=np.ones((3, 224, 224))), None]
DocVec with nested optional Field
For DocVec it is a bit different. Indeed, a DocVec stores the data for each field as a contiguous column. This means that DocVec can create a column in only two cases: either all the data for a field is None, or none of it is None.
In the first case the whole column will just be None. In the second case the column will be a DocVec[ImageDoc]:
from docarray import DocVec

docs = DocVec[ArticleDoc](
    [
        ArticleDoc(image=ImageDoc(tensor=np.zeros((3, 224, 224))), title="Hello")
        for _ in range(10)
    ]
)
assert (docs.image.tensor == np.zeros((3, 224, 224))).all()
Or it can be None:
But if you try a mix you will get an error:
try:
    docs = DocVec[ArticleDoc](
        [
            ArticleDoc(image=ImageDoc(tensor=np.ones((3, 224, 224))), title="Hello"),
            ArticleDoc(image=None, title="World"),
        ]
    )
except ValueError as e:
    print(e)
See also:
- First step of the representing section
- API Reference for the DocList class
- API Reference for the DocVec class
- The Storing section on how to store your data
- The Sending section on how to send your data