
In-Memory Document Index

InMemoryExactNNIndex stores all Documents in DocLists in memory. It is a great starting point for small datasets, where you may not want to launch a database server.

For vector search and filtering, InMemoryExactNNIndex uses DocArray's find() and filter_docs() functions.

Basic usage

To see how to create an InMemoryExactNNIndex instance, add Documents, and perform searches, see the general user guide.

You can initialize the index as follows:

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: NdArray = None


docs = DocList[MyDoc](MyDoc() for _ in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

# or in one step:
doc_index = InMemoryExactNNIndex[MyDoc](docs)

Additionally, you can preserve your index as a binary file and instantiate a new one using this file:

# Save your existing index as a binary file
doc_index.persist('docs.bin')

# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')

Configuration

This section lays out the configurations and options that are specific to InMemoryExactNNIndex.

RuntimeConfig

The RuntimeConfig of InMemoryExactNNIndex contains only one entry: the default mapping from Python types to column configurations.

You can see in the section below how to override configurations for specific fields. If you want to set configurations globally, i.e. for all vector fields in your Documents, you can do that using RuntimeConfig:

from collections import defaultdict
from docarray.typing import AbstractTensor

# doc_index is the InMemoryExactNNIndex instance created above
doc_index.configure(
    default_column_config=defaultdict(
        dict,
        {
            AbstractTensor: {'space': 'cosine_sim'},
        },
    )
)

This will set the default configuration for all vector fields to the one specified in the example above.

For more information on these settings, see below.

Fields that are not vector fields (e.g. of type str or int etc.) do not offer any configuration.
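
The defaultdict(dict, {...}) pattern used above is plain Python: types listed in the mapping get the given settings, and any other type falls back to an empty dict. The sketch below illustrates only that lookup behaviour. The class FakeTensor is a hypothetical stand-in for a vector type such as AbstractTensor and is not part of DocArray, and the sketch does not show how the index consumes the mapping internally.

```python
from collections import defaultdict


class FakeTensor:
    """Hypothetical stand-in for a vector type such as AbstractTensor."""


# Same shape as the mapping passed to configure() above.
default_column_config = defaultdict(
    dict,
    {
        FakeTensor: {'space': 'cosine_sim'},
    },
)

# A type listed in the mapping returns its settings.
tensor_config = default_column_config[FakeTensor]

# A type that is not listed (e.g. str) falls back to an empty dict.
str_config = default_column_config[str]

print(tensor_config)  # {'space': 'cosine_sim'}
print(str_config)  # {}
```

This matches the statement that non-vector fields carry no configuration: looking them up simply yields an empty dict.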

Field-wise configurations

For a vector field you can adjust the space parameter. It can be one of:

  • 'cosine_sim' (default)
  • 'euclidean_dist'
  • 'sqeuclidean_dist'

You pass it using the field: Type = Field(...) syntax:

from docarray import BaseDoc
from docarray.typing import NdArray
from pydantic import Field


class Schema(BaseDoc):
    tensor_1: NdArray[100] = Field(space='euclidean_dist')
    tensor_2: NdArray[100] = Field(space='sqeuclidean_dist')

In the example above you can see how to configure two different vector fields, with two different sets of settings.
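
To make the three space values concrete, here is a minimal plain-Python sketch of the underlying measures, using their standard mathematical definitions. The function names mirror the option strings for readability only; they are illustrative helpers, not DocArray's implementation.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: higher means more similar, 1.0 for equal direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sqeuclidean_dist(a, b):
    """Squared Euclidean distance: lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean_dist(a, b):
    """Euclidean distance: square root of the squared distance."""
    return math.sqrt(sqeuclidean_dist(a, b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice as long

print(cosine_sim(a, b))  # 1.0
print(sqeuclidean_dist(a, b))  # 25.0
print(euclidean_dist(a, b))  # 5.0
```

Note that cosine similarity ignores vector length (a and b score 1.0), while both distance measures report that the two vectors are far apart.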

Nested index

When using the index, you can define multiple fields, including nested ones. In the following example, YouTubeVideoDoc has its own tensor field, which represents the description, as well as thumbnail and video fields that are Documents with tensors of their own.

import numpy as np
from docarray import BaseDoc
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import ImageUrl, VideoUrl, AnyTensor
from pydantic import Field


class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor: AnyTensor = Field(space='cosine_sim')


class VideoDoc(BaseDoc):
    url: VideoUrl
    tensor: AnyTensor = Field(space='cosine_sim')


class YouTubeVideoDoc(BaseDoc):
    title: str
    description: str
    thumbnail: ImageDoc
    video: VideoDoc
    tensor: AnyTensor = Field(space='cosine_sim')


doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]()
index_docs = [
    YouTubeVideoDoc(
        title=f'video {i+1}',
        description=f'this is video from author {10*i}',
        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
        tensor=np.ones(256),
    )
    for i in range(8)
]
doc_index.index(index_docs)

Search Documents

To search Documents, the InMemoryExactNNIndex uses DocArray's find function.
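
Conceptually, an exact nearest-neighbour search scores the query against every stored vector and keeps the best matches. The following plain-Python sketch shows that idea with cosine similarity. The helper exact_nn and the toy data are hypothetical and only illustrate the technique; this is not DocArray's find function.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_nn(query, vectors, limit):
    """Score every stored vector against the query and return the top `limit`."""
    scored = [(cosine_sim(query, vec), doc_id) for doc_id, vec in vectors.items()]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:limit]
    return [doc_id for _, doc_id in top], [score for score, _ in top]


vectors = {
    'a': [1.0, 0.0],
    'b': [0.0, 1.0],
    'c': [1.0, 1.0],
}
ids, scores = exact_nn([1.0, 0.0], vectors, limit=2)
print(ids)  # ['a', 'c']
```

Because every vector is compared, the result is exact, at the cost of work that grows linearly with the number of stored Documents.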

Use the search_field argument to specify which field the vector search runs on. To address a field inside nested data, join the field names with a double underscore (dunder). The following code performs a vector search on the tensor field of YouTubeVideoDoc, and then on the tensor fields of its thumbnail and video fields:

from docarray.typing import NdArray
from pydantic import parse_obj_as

# find by the youtubevideo tensor
query = parse_obj_as(NdArray, np.ones(256))
docs, scores = doc_index.find(query, search_field='tensor', limit=3)

# find by the thumbnail tensor
query = parse_obj_as(NdArray, np.ones(64))
docs, scores = doc_index.find(query, search_field='thumbnail__tensor', limit=3)

# find by the video tensor
query = parse_obj_as(NdArray, np.ones(128))
docs, scores = doc_index.find(query, search_field='video__tensor', limit=3)
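
A dunder path such as 'thumbnail__tensor' can be read as "take the thumbnail field, then its tensor field". The sketch below illustrates that reading with plain dataclasses. The helper resolve_field and the Thumbnail and Video classes are hypothetical and are not how DocArray resolves paths internally.

```python
from dataclasses import dataclass


@dataclass
class Thumbnail:
    tensor: list


@dataclass
class Video:
    title: str
    thumbnail: Thumbnail


def resolve_field(doc, search_field):
    """Follow a '__'-separated path through nested attributes."""
    value = doc
    for name in search_field.split('__'):
        value = getattr(value, name)
    return value


video = Video(title='video 1', thumbnail=Thumbnail(tensor=[1.0, 1.0]))
print(resolve_field(video, 'title'))  # video 1
print(resolve_field(video, 'thumbnail__tensor'))  # [1.0, 1.0]
```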

Filter Documents

To filter Documents, the InMemoryExactNNIndex uses DocArray's filter_docs() function.

You can filter your Documents with the filter() or filter_batched() method and a corresponding filter query. The query should follow the query language of DocArray's filter_docs() function.

In the following example let's filter for all the books that are cheaper than 29 dollars:

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex


class Book(BaseDoc):
    title: str
    price: int


books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
book_index = InMemoryExactNNIndex[Book](books)

# filter for books that are cheaper than 29 dollars
query = {'price': {'$lt': 29}}
cheap_books = book_index.filter(query)

assert len(cheap_books) == 3
for doc in cheap_books:
    doc.summary()

Output:
📄 Book : 1f7da15 ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ title 0       │
│ price: int           │ 0             │
╰──────────────────────┴───────────────╯
📄 Book : 63fd13a ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ title 1       │
│ price: int           │ 10            │
╰──────────────────────┴───────────────╯
📄 Book : 49b21de ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ title 2       │
│ price: int           │ 20            │
╰──────────────────────┴───────────────╯
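
To illustrate how such a query dict can be read, here is a toy evaluator over plain dicts. It covers only a handful of comparison operators, named after the MongoDB-style syntax used in the query above. The matches helper and the OPERATORS table are hypothetical and are not DocArray's filter_docs() implementation.

```python
import operator

# Comparison operators covered by this toy evaluator.
OPERATORS = {
    '$eq': operator.eq,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
}


def matches(record, query):
    """Return True if the record satisfies every condition in the query."""
    for field, conditions in query.items():
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](record[field], expected):
                return False
    return True


records = [{'title': f'title {i}', 'price': i * 10} for i in range(10)]
below_30 = [r['price'] for r in records if matches(r, {'price': {'$lt': 30}})]
up_to_30 = [r['price'] for r in records if matches(r, {'price': {'$lte': 30}})]

print(below_30)  # [0, 10, 20]
print(up_to_30)  # [0, 10, 20, 30]
```

The two results differ only at the boundary value: '$lt' excludes a price of exactly 30, while '$lte' includes it.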

Delete Documents

To delete a Document, together with its nested data, specify the id of the top-level Document.

Note

You can only delete Documents at the top level. Deletion of Documents on lower levels is not yet supported.

# delete a top-level Document (and its nested data) by id
del doc_index[index_docs[6].id]