Hnswlib Document Index

Note

To use HnswDocumentIndex, you need to install the extra dependency with the following command:

pip install "docarray[hnswlib]"

HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in hnswlib, and stores all other data in SQLite.

Production readiness

HnswDocumentIndex is a great starting point for small- to medium-sized datasets, but it is not battle-tested in production. If scalability, uptime, etc. are important to you, we recommend you eventually transition to one of our database-backed Document Index implementations.

Basic Usage

To see how to create a HnswDocumentIndex instance, add Documents, perform searches, and so on, see the general user guide.

Configuration

This section lays out the configurations and options that are specific to HnswDocumentIndex.

DBConfig

The DBConfig of HnswDocumentIndex expects only one argument: work_dir.

This is the location where all of the Index's data will be stored, namely the various HNSWLib indexes and the SQLite database.

You can pass this directly to the constructor:

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    embedding: NdArray[128]
    text: str


db = HnswDocumentIndex[MyDoc](work_dir='./path/to/db')

To load existing data, you can specify a directory that stores data from a previous session.

Hnswlib file lock

Hnswlib uses a file lock to prevent multiple processes from accessing the same index at the same time. This means that if you try to open an index that is already open in another process, you will get an error. To avoid this, you can specify a different work_dir for each process.
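
One way to follow this advice is to derive the directory from the process ID. The following is a minimal sketch, assuming a hypothetical helper per_process_work_dir (it is not part of docarray) and an arbitrary proc-<pid> naming scheme:

```python
import os
import tempfile
from pathlib import Path


def per_process_work_dir(base_dir: str) -> str:
    """Return a directory under base_dir that is unique to the current process.

    Hypothetical helper for illustration; it is not part of docarray.
    """
    work_dir = Path(base_dir) / f'proc-{os.getpid()}'
    # Create the directory if it does not exist yet.
    work_dir.mkdir(parents=True, exist_ok=True)
    return str(work_dir)


base_dir = tempfile.mkdtemp()
work_dir = per_process_work_dir(base_dir)
# The result would then be passed to the constructor, e.g.:
# db = HnswDocumentIndex[MyDoc](work_dir=work_dir)
```

Keep in mind that with this approach each process works on its own separate index, so data indexed by one process is not visible to the others.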

RuntimeConfig

The RuntimeConfig of HnswDocumentIndex contains only one entry: the default mapping from Python types to column configurations.

You can see in the section below how to override configurations for specific fields. If you want to set configurations globally, i.e. for all vector fields in your documents, you can do that using RuntimeConfig:

import numpy as np

db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db')

db.configure(
    default_column_config={
        np.ndarray: {
            'dim': -1,
            'index': True,
            'space': 'ip',
            'max_elements': 2048,
            'ef_construction': 100,
            'ef': 15,
            'M': 8,
            'allow_replace_deleted': True,
            'num_threads': 5,
        },
        None: {},
    }
)

This will set the default configuration for all vector fields to the one specified in the example above.

Note

Even if your vectors come from PyTorch or TensorFlow, you can (and should) still use the np.ndarray configuration. This is because all tensors are converted to np.ndarray under the hood.

Note

max_elements sets the initial maximum capacity of the index. The capacity is doubled every time the number of Documents in the index exceeds it. Expanding the capacity is an expensive operation, so it is worth choosing an appropriate max_elements value at init time.
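
To get a feel for how much a small max_elements can cost, the doubling rule described above can be simulated in plain Python. This is only a sketch of the arithmetic; the function name capacity_after_growth is made up for this example and nothing here calls Hnswlib:

```python
from typing import Tuple


def capacity_after_growth(max_elements: int, num_docs: int) -> Tuple[int, int]:
    """Return (number of expansions, final capacity) under the doubling rule."""
    if max_elements <= 0:
        raise ValueError('max_elements must be positive')
    capacity = max_elements
    expansions = 0
    # Double whenever the number of Documents exceeds the current capacity.
    while num_docs > capacity:
        capacity *= 2
        expansions += 1
    return expansions, capacity


print(capacity_after_growth(1024, 100_000))    # (7, 131072)
print(capacity_after_growth(131072, 100_000))  # (0, 131072)
```

Starting from 1024, indexing 100,000 Documents triggers seven expansions, whereas choosing a large enough value up front triggers none.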

For more information on these settings, see below.

Fields that are not vector fields (e.g. of type str or int etc.) do not offer any configuration, as they are simply stored as-is in a SQLite database.

Field-wise configurations

There are various settings that you can tweak for every vector field that you index into Hnswlib.

You pass all of those using the field: Type = Field(...) syntax:

from pydantic import Field


class Schema(BaseDoc):
    tens: NdArray[100] = Field(max_elements=12, space='cosine')
    tens_two: NdArray[10] = Field(M=4, space='ip')


db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db')

In the example above you can see how to configure two different vector fields, with two different sets of settings.

In this way, you can pass all options that Hnswlib supports:

| Keyword | Description | Default |
|---|---|---|
| max_elements | Maximum number of vectors that can be stored | 1024 |
| space | Vector space (distance metric) the index operates in. Supports 'l2', 'ip', and 'cosine'. Note: In contrast to the other backends, for HnswDocumentIndex 'cosine' refers to cosine distance, not cosine similarity. To transform one to the other, you can use: cos_sim = 1 - cos_dist. For more details see here. | 'l2' |
| index | Whether or not an index should be built for this field | True |
| ef_construction | Defines a construction time/accuracy trade-off | 200 |
| ef | Parameter controlling the query time/accuracy trade-off | 10 |
| M | Parameter that defines the maximum number of outgoing connections in the graph | 16 |
| allow_replace_deleted | Enables replacing deleted elements with newly added ones | True |
| num_threads | Sets the number of CPU threads to use | 1 |
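
The interplay between these defaults and field-wise settings can be pictured as a dictionary merge. The sketch below assumes that field-level settings take precedence over the defaults; effective_config is a hypothetical helper written for this illustration, not a docarray function:

```python
from typing import Any, Dict

# Defaults for vector fields, as documented for HnswDocumentIndex.
DEFAULT_VECTOR_CONFIG: Dict[str, Any] = {
    'max_elements': 1024,
    'space': 'l2',
    'index': True,
    'ef_construction': 200,
    'ef': 10,
    'M': 16,
    'allow_replace_deleted': True,
    'num_threads': 1,
}


def effective_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge field-level overrides into a copy of the defaults (assumed precedence)."""
    config = dict(defaults)
    config.update(overrides)
    return config


# Mirrors the two fields of the Schema example above.
tens = effective_config(DEFAULT_VECTOR_CONFIG, {'max_elements': 12, 'space': 'cosine'})
tens_two = effective_config(DEFAULT_VECTOR_CONFIG, {'M': 4, 'space': 'ip'})
print(tens['max_elements'], tens['space'], tens['M'])              # 12 cosine 16
print(tens_two['max_elements'], tens_two['space'], tens_two['M'])  # 1024 ip 4
```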

Note

In HnswDocumentIndex, space='cosine' refers to cosine distance, not to cosine similarity as it does for the other backends.
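
The relation cos_sim = 1 - cos_dist can be checked with a few lines of plain Python. This sketch computes both quantities by hand and does not call Hnswlib; the function names are chosen for this example only:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, i.e. 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def distance_to_similarity(cos_dist: float) -> float:
    """Convert a cosine distance back into a cosine similarity."""
    return 1.0 - cos_dist


# Orthogonal vectors: similarity 0, distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Parallel vectors: similarity 1, distance (approximately) 0.
parallel = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With a distance-based score, smaller values mean closer matches, which is the opposite of how similarity scores are read.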

You can find more details on the parameters here.

Nested Index

When using the index, you can define multiple fields, including nested ones. In the following example, YouTubeVideoDoc has a tensor field calculated from its description, plus thumbnail and video fields that each hold a nested document with its own tensor.

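import numpy as np
from pydantic import Field

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex
from docarray.typing import AnyTensor, ImageUrl, VideoUrl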


class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    tensor: AnyTensor = Field(space='cosine', dim=128)


class YouTubeVideoDoc(BaseDoc):
    title: str
    description: str
    thumbnail: ImageDoc
    video: VideoDoc
    tensor: AnyTensor = Field(space='cosine', dim=256)


doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2')
index_docs = [
    YouTubeVideoDoc(
        title=f'video {i+1}',
        description=f'this is video from author {10*i}',
        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
        tensor=np.ones(256),
    )
    for i in range(8)
]
doc_index.index(index_docs)

You can use the search_field argument to specify which field to use for the vector search. Use a double underscore (__) to address a field inside nested data. The following code performs vector search on the tensor field of YouTubeVideoDoc, and on the tensor fields of its thumbnail and video fields:

# example of finding by flat and nested fields
query_doc = YouTubeVideoDoc(
    title='video query',
    description='this is a query video',
    thumbnail=ImageDoc(url='http://example.ai/images/1024', tensor=np.ones(64)),
    video=VideoDoc(url='http://example.ai/videos/1024', tensor=np.ones(128)),
    tensor=np.ones(256),
)
# find by the youtubevideo tensor
docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
# find by the thumbnail tensor
docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
# find by the video tensor
docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
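
The double-underscore path used in search_field can be read as a chain of attribute lookups. The sketch below illustrates that naming convention with plain dataclasses; resolve_field and the Image and Video classes are hypothetical stand-ins, and this is not how docarray resolves fields internally:

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, List


@dataclass
class Image:
    tensor: List[float]


@dataclass
class Video:
    thumbnail: Image
    tensor: List[float]


def resolve_field(doc: Any, search_field: str) -> Any:
    """Follow a path such as 'thumbnail__tensor' one attribute at a time."""
    return reduce(getattr, search_field.split('__'), doc)


video = Video(thumbnail=Image(tensor=[1.0, 1.0]), tensor=[0.5, 0.5, 0.5])
print(resolve_field(video, 'tensor'))             # [0.5, 0.5, 0.5]
print(resolve_field(video, 'thumbnail__tensor'))  # [1.0, 1.0]
```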

To delete nested data, you need to specify the id.

Note

You can only delete a Doc at the top level. Deleting Docs at lower levels is not yet supported.

# example of deleting nested and flat index
del doc_index[index_docs[6].id]

Check here for nested data with subindex.

Update elements

In order to update a Document inside the index, you only need to reindex it with the updated attributes.

First, let's create a schema for our index:

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

Now we can instantiate our index and index some data:

# the embedding size must match the NdArray[128] declared in the schema
docs = DocList[MyDoc](
    [
        MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}')
        for i in range(100)
    ]
)
index = HnswDocumentIndex[MyDoc]()
index.index(docs)
assert index.num_docs() == 100

Now we can find relevant documents:

res = index.find(query=docs[0], search_field='embedding', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
    assert 'I am the first version' in doc.text

Then we update the text of all of these documents and reindex them:

for i, doc in enumerate(docs):
    doc.text = f'I am the second version of Document {i}'

index.index(docs)
assert index.num_docs() == 100

When we retrieve them again, we can see that their text attribute has been updated accordingly:

res = index.find(query=docs[0], search_field='embedding', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
    assert 'I am the second version' in doc.text
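
The behaviour shown above, where reindexing replaces existing entries instead of adding new ones, can be pictured with a toy in-memory model keyed by Document id. ToyIndex is purely illustrative; it is an assumption-laden sketch and says nothing about how HnswDocumentIndex stores data internally:

```python
from typing import Dict


class ToyIndex:
    """Toy stand-in for an index: maps a Document id to its text."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def index(self, docs: Dict[str, str]) -> None:
        # Entries with an id that already exists are overwritten, not duplicated.
        self._store.update(docs)

    def num_docs(self) -> int:
        return len(self._store)

    def get(self, doc_id: str) -> str:
        return self._store[doc_id]


toy = ToyIndex()
toy.index({f'doc-{i}': f'I am the first version of Document {i}' for i in range(100)})
toy.index({f'doc-{i}': f'I am the second version of Document {i}' for i in range(100)})
print(toy.num_docs())    # 100
print(toy.get('doc-0'))  # I am the second version of Document 0
```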