Array of documents

DocArray allows users to represent and manipulate multimodal data to build AI applications such as neural search and generative AI.

As you have seen in the previous section, the fundamental building block of DocArray is the BaseDoc class which represents a single document, a single datapoint.

However, in machine learning we often need to work with an array of documents, that is, an array of data points.

This section introduces the concept of AnyDocArray, which is an (abstract) collection of BaseDocs. The name of this library -- DocArray -- is derived from this concept and is short for DocumentArray.

AnyDocArray

AnyDocArray is an abstract class that represents an array of BaseDocs. It is not meant to be used directly, but to be subclassed.

We provide two concrete implementations of AnyDocArray:

DocList which is a Python list of BaseDocs
DocVec which is a column based representation of BaseDocs

We will go into the difference between DocList and DocVec in the next section, but let's first focus on what they have in common.

The spirit of AnyDocArrays is to extend the BaseDoc and BaseModel concepts to the array level in a seamless way.

Example

Before going into detail let's look at a code example.

Note

DocList and DocVec are both AnyDocArrays. The following section will use DocList as an example, but the same applies to DocVec.

First you need to create a Doc class, our data schema. Let's say you want to represent a banner with an image, a title and a description:

from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image: ImageUrl
    title: str
    description: str

Let's instantiate several BannerDocs:

banner1 = BannerDoc(
    image='https://example.com/image1.png',
    title='Hello World',
    description='This is a banner',
)

banner2 = BannerDoc(
    image='https://example.com/image2.png',
    title='Bye Bye World',
    description='This is (distopic) banner',
)

You can now collect them into a DocList of BannerDocs:

docs = DocList[BannerDoc]([banner1, banner2])

docs.summary()

╭──────── DocList Summary ────────╮
│                                 │
│   Type     DocList[BannerDoc]   │
│   Length   2                    │
│                                 │
╰─────────────────────────────────╯
╭──── Document Schema ─────╮
│                          │
│   BannerDoc              │
│   ├── image: ImageUrl    │
│   ├── title: str         │
│   └── description: str   │
│                          │
╰──────────────────────────╯

docs here is an array-like collection of BannerDoc.

You can access documents inside it with the usual Python array API:

print(docs[0])

BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner')

or iterate over it:

for doc in docs:
    print(doc)

BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner')
BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', description='This is (distopic) banner')

Note

The syntax DocList[BannerDoc] might surprise you in this context. It is actually at the heart of DocArray, but we'll come back to it later and continue with this example for now.

As we said earlier, DocList (or more generally AnyDocArray) extends the BaseDoc API at the array level.

Concretely, this means you can access your data at the Array level in the same way you would access it at the document level.

Let's see what that looks like:

At the document level:

print(banner1.image)

https://example.com/image1.png

At the Array level:

print(docs.image)

['https://example.com/image1.png', 'https://example.com/image2.png']

Important

All the attributes of BannerDoc are accessible at the Array level.
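To build intuition for array-level attribute access, here is a minimal sketch in plain Python. It is a toy model, not DocArray's actual implementation: `Banner` and `ToyDocList` are hypothetical names introduced only for illustration, and the sketch assumes that forwarding unknown attribute lookups to each item is enough to convey the idea.

```python
from dataclasses import dataclass


@dataclass
class Banner:  # hypothetical stand-in for a BaseDoc subclass
    image: str
    title: str


class ToyDocList(list):
    """Toy list that forwards unknown attribute lookups to its items."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails,
        # so regular list methods such as append keep working.
        if name.startswith('__'):
            raise AttributeError(name)
        return [getattr(item, name) for item in self]


banners = ToyDocList([Banner('a.png', 'Hello'), Banner('b.png', 'Bye')])
print(banners.title)  # ['Hello', 'Bye']
```

Because the toy list knows nothing about the fields at class-definition time, a static type checker cannot see `banners.title` either, which is the same kind of limitation described in the warning that follows.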

Warning

Whereas this is true at runtime, static type analyzers like Mypy or IDEs like PyCharm will not be aware of it. This limitation is known and will be fixed in the future by the introduction of plugins for Mypy, PyCharm and VSCode.

This even works when you have a nested BaseDoc:

from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image: ImageUrl
    title: str
    description: str


class PageDoc(BaseDoc):
    banner: BannerDoc
    content: str


page1 = PageDoc(
    banner=BannerDoc(
        image='https://example.com/image1.png',
        title='Hello World',
        description='This is a banner',
    ),
    content='Hello world is the most used example in programming, but do you know that? ...',
)

page2 = PageDoc(
    banner=BannerDoc(
        image='https://example.com/image2.png',
        title='Bye Bye World',
        description='This is (distopic) banner',
    ),
    content='What if the most used example in programming was Bye Bye World, would programming be that much fun? ...',
)

docs = DocList[PageDoc]([page1, page2])

docs.summary()

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[PageDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ───────╮
│                              │
│   PageDoc                    │
│   ├── banner: BannerDoc      │
│   │   ├── image: ImageUrl    │
│   │   ├── title: str         │
│   │   └── description: str   │
│   └── content: str           │
│                              │
╰──────────────────────────────╯

print(docs.banner)

<DocList[BannerDoc] (length=2)>

Yes, docs.banner returns a nested DocList of BannerDocs!

You can even access the attributes of the nested BaseDoc at the Array level:

print(docs.banner.image)

['https://example.com/image1.png', 'https://example.com/image2.png']

This is just the same way that you would do it with BaseDoc:

print(page1.banner.image)

'https://example.com/image1.png'

`DocList[DocType]` syntax

As you have seen in the previous section, AnyDocArray will expose the same attributes as the BaseDocs it contains.

But this concept works if and only if all of the BaseDocs in the AnyDocArray have the same schema.

If one of your BaseDocs has an attribute that the others don't, you will get an error if you try to access it at the Array level.

Note

To extend your schema to the Array level, AnyDocArray needs to contain homogeneous Documents.

This is where the custom syntax DocList[DocType] comes into play.

Note

DocList[DocType] creates a custom DocList that can only contain DocType Documents.

This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of a single kind of BaseDoc rather than just an array of heterogeneous BaseDocs.
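The idea of a parametrized container class can be sketched in plain Python with `__class_getitem__`. This is a toy illustration under stated assumptions, not how DocArray implements `DocList[DocType]`: `TypedList`, `item_type` and `_check` are hypothetical names, and only construction and `append` are validated here.

```python
class TypedList(list):
    """Toy list; TypedList[T] yields a subclass that only accepts T instances."""

    item_type = object  # the unparametrized list accepts anything

    def __class_getitem__(cls, item_type):
        # Build a new subclass that remembers the requested item type.
        name = f'{cls.__name__}[{item_type.__name__}]'
        return type(name, (cls,), {'item_type': item_type})

    def __init__(self, items=()):
        items = list(items)
        for item in items:
            self._check(item)
        super().__init__(items)

    def _check(self, item):
        if not isinstance(item, self.item_type):
            raise ValueError(f'{item!r} is not a {self.item_type}')

    def append(self, item):
        self._check(item)
        super().append(item)


ints = TypedList[int]([1, 2])
print(type(ints).__name__)  # TypedList[int]

mixed = TypedList([1, 'a'])  # the unparametrized list stays heterogeneous
```

In this sketch, `TypedList[int]([1, 'a'])` raises a `ValueError`, while the plain `TypedList` accepts mixed items.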

That said, AnyDocArray can also be used to create a heterogeneous AnyDocArray:

Note

The default DocList can be used to create a heterogeneous list of BaseDoc.

Warning

DocVec cannot store heterogeneous BaseDoc and always needs the DocVec[DocType] syntax.

The usage of a heterogeneous DocList is similar to a normal Python list but still offers DocArray functionality like serialization and sending over the wire. However, it won't be able to extend the API of your custom schema to the Array level.

Here is how you can instantiate a heterogeneous DocList:

from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl, AudioUrl


class ImageDoc(BaseDoc):
    url: ImageUrl


class AudioDoc(BaseDoc):
    url: AudioUrl


docs = DocList(
    [
        ImageDoc(url='https://example.com/image1.png'),
        AudioDoc(url='https://example.com/audio1.mp3'),
    ]
)

But this is not possible:

try:
    docs = DocList[ImageDoc](
        [
            ImageDoc(url='https://example.com/image1.png'),
            AudioDoc(url='https://example.com/audio1.mp3'),
        ]
    )
except ValueError as e:
    print(e)

ValueError: AudioDoc(
    id='e286b10f58533f48a0928460f0206441',
    url=AudioUrl('https://example.com/audio1.mp3', host_type='domain')
) is not a <class '__main__.ImageDoc'>

`DocList` vs `DocVec`

DocList and DocVec are both AnyDocArrays, but they have different use cases and differ in how they store data in memory.

They share almost everything that has been said in the previous sections, but they have some conceptual differences.

DocList is based on Python lists. You can append, extend, insert, pop, and so on. In DocList, data is individually owned by each BaseDoc; the DocList just collects references to the different Documents. Use DocList when you want to be able to rearrange or re-rank your data. One drawback of DocList is that none of the data is contiguous in memory, so you cannot leverage functions that require contiguous data without first copying the data into a contiguous array.

DocVec is a columnar data structure. DocVec is always an array of homogeneous Documents. The idea is that every attribute of the BaseDoc will be stored in a contiguous array: a column.

This means that when you access the attribute of a BaseDoc at the Array level, we don't collect the data under the hood from all the documents (like DocList) before giving it back to you. We just return the column that is stored in memory.
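The difference between the two layouts can be sketched with the standard library alone. This is a simplified model, not DocArray code: `rows`, `columns` and `row_column` are hypothetical names, and `array.array` stands in for a contiguous tensor column.

```python
from array import array

# Row-oriented layout: each record owns its own value (the DocList idea).
rows = [{'score': 0.1}, {'score': 0.5}, {'score': 0.9}]


def row_column(records, name):
    # The column is rebuilt on every call by visiting each record.
    return [record[name] for record in records]


# Column-oriented layout: one contiguous buffer per field (the DocVec idea).
columns = {'score': array('d', [0.1, 0.5, 0.9])}

first = row_column(rows, 'score')
second = row_column(rows, 'score')
print(first is second)  # False: a new list is built each time
print(columns['score'] is columns['score'])  # True: the stored column is returned
```

The row-oriented version pays the collection cost on every access, while the column-oriented version hands back the buffer it already holds.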

This really matters when you need to handle multimodal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication which is at the heart of Machine Learning, especially in Deep Learning.

Let's take an example to illustrate the difference:

Let's say you want to work with an Image:

from docarray import BaseDoc
from docarray.typing import NdArray


class ImageDoc(BaseDoc):
    # [3, 224, 224] means that the shape of the tensor is known in advance
    image: NdArray[3, 224, 224] = None

And that you have a function that takes a contiguous array of images as input (like a deep learning model):

def predict(image: NdArray['batch_size', 3, 224, 224]):
    ...

Let's create a DocList of ImageDocs and pass it to the function:

from docarray import DocList
import numpy as np

docs = DocList[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

predict(np.stack(docs.image))
...
predict(np.stack(docs.image))

When you call docs.image, DocList loops over the ten documents and collects the image attribute of each document in a list. It is similar to doing:

images = []
for doc in docs:
    images.append(doc.image)

This means that if you call docs.image multiple times, under the hood you will collect the image from each document and stack them several times. This is not optimal.

Let's see how it will work with DocVec:

from docarray import DocVec
import numpy as np

docs = DocVec[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

predict(docs.image)
...
predict(docs.image)

The first difference is that you don't need to call np.stack on docs.image because docs.image is already a contiguous array. The second difference is that you just get the column and don't need to create it at each call.

One of the other main differences between both of them is how you can access documents inside them.

If you access a document inside a DocList you will get a BaseDoc instance, i.e. a document.

If you access a document inside a DocVec you will get a document view. A document view is a view of the columnar data structure which looks and behaves like a BaseDoc instance. It is a BaseDoc instance but with a different way to access the data.

When you make a change at the view level it will be reflected at the DocVec level:

from docarray import DocVec

docs = DocVec[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

my_doc = docs[0]

assert my_doc.is_view()  # True

whereas with DocList:

docs = DocList[ImageDoc](
    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
)

my_doc = docs[0]

assert not my_doc.is_view()  # False
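The statement that a change made through a view is reflected in the columnar storage can be illustrated with a toy model. This is only a sketch of the concept, not DocArray's view implementation: `ToyView` and `columns` are hypothetical names, and plain Python lists stand in for the stored columns.

```python
class ToyView:
    """Toy row view: reads and writes go straight to the shared columns."""

    def __init__(self, columns, index):
        # Bypass our own __setattr__ while storing the bookkeeping fields.
        object.__setattr__(self, '_columns', columns)
        object.__setattr__(self, '_index', index)

    def __getattr__(self, name):
        # Only called for names that are not regular instance attributes.
        try:
            return self._columns[name][self._index]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Writing through the view mutates the underlying column in place.
        self._columns[name][self._index] = value


columns = {'title': ['Hello', 'World']}
view = ToyView(columns, 0)
view.title = 'Bye'
print(columns['title'])  # ['Bye', 'World']
```

The view owns no data of its own; it only remembers which row of the shared columns it points at.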

Note

To summarize: you should use DocVec when you need to work with contiguous data, and you should use DocList when you need to rearrange or extend your data.

Dealing with Optional fields

Both DocList and DocVec support nested optional fields but they behave slightly differently.

Nested optional field

By a "nested optional field" we mean a document that is contained within another document, and declared as Optional:

from typing import Optional
from docarray import BaseDoc


class MyDoc(BaseDoc):
    nested_doc: Optional[BaseDoc] = None

Using nested optional fields differs slightly between DocList and DocVec, so watch out. But in a nutshell:

When accessing a nested BaseDoc:

DocList will return a list of documents if the field is optional and a DocList if the field is not optional
DocVec will return a DocVec if all documents are there, or None if all docs are None. No mix of docs and None allowed!
DocVec will behave the same for a tensor field instead of a BaseDoc

DocList with nested optional Field

Let's take an example to illustrate the exact behavior:

from typing import Optional
from docarray import BaseDoc
from docarray.typing import NdArray
import numpy as np


class ImageDoc(BaseDoc):
    tensor: NdArray


class ArticleDoc(BaseDoc):
    image: Optional[ImageDoc] = None
    title: str

In this example ArticleDoc has an optional field image which is an ImageDoc. This means that this field can either be None or be an ImageDoc instance.

Remember that for both DocList and DocVec calling docs.image will return a list-like object of all the images of the documents.

For DocList this call will iterate over all the documents and collect the image attribute of each document in a sequence, and for DocVec it will return the already stacked column of the .image attribute.

For DocList it will return a list of Optional[ImageDoc] instead of a DocList[ImageDoc]. This is because the list can contain None, and a DocList can't.

from docarray import DocList


docs = DocList[ArticleDoc](
    [
        ArticleDoc(image=ImageDoc(tensor=np.ones((3, 224, 224))), title="Hello"),
        ArticleDoc(image=None, title="World"),
    ]
)

assert docs.image == [ImageDoc(tensor=np.ones((3, 224, 224))), None]

DocVec with nested optional Field

For DocVec it is a bit different. Indeed, a DocVec stores the data for each field as a contiguous column. This means that DocVec can create a column in only two cases: either all the data for a field is None or all the data is not None.

In the first case the whole column will just be None. In the second case the column will be a DocVec[ImageDoc].

from docarray import DocVec

docs = DocVec[ArticleDoc](
    [
        ArticleDoc(image=ImageDoc(tensor=np.zeros((3, 224, 224))), title="Hello")
        for _ in range(10)
    ]
)
assert (docs.image.tensor == np.zeros((3, 224, 224))).all()

Or it can be None:

docs = DocVec[ArticleDoc]([ArticleDoc(title="Hello") for _ in range(10)])
assert docs.image is None

But if you try a mix you will get an error:

try:
    docs = DocVec[ArticleDoc](
        [
            ArticleDoc(image=ImageDoc(tensor=np.ones((3, 224, 224))), title="Hello"),
            ArticleDoc(image=None, title="World"),
        ]
    )
except ValueError as e:
    print(e)

None is not a <class '__main__.ImageDoc'>
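The all-or-nothing rule for optional columns can be summarized with a small helper. This is a toy sketch, not DocArray's internal logic: `build_column` is a hypothetical function and its error message is invented for illustration.

```python
def build_column(values):
    """Toy all-or-nothing rule for an optional column."""
    missing = [value is None for value in values]
    if all(missing):
        return None  # the whole column is None
    if any(missing):
        # A mix of present and missing values cannot form one column.
        raise ValueError('cannot mix None and non-None values in one column')
    return list(values)  # every value is present


print(build_column([None, None]))  # None
print(build_column([1, 2]))  # [1, 2]
```

Calling `build_column([1, None])` raises a `ValueError`, mirroring the mixed case shown above.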
