Hnswlib Document Index
Note
To use HnswDocumentIndex, you need to install the extra dependency with the following command:

pip install "docarray[hnswlib]"
HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in Hnswlib indexes, and stores all other data in SQLite.
Production readiness
HnswDocumentIndex is a great starting point for small- to medium-sized datasets, but it is not battle-tested in production. If scalability, uptime, etc. are important to you, we recommend you eventually transition to one of our database-backed Document Index implementations.
Basic Usage
To see how to create a HnswDocumentIndex instance, add Documents, perform searches, etc., see the general user guide.
Configuration
This section lays out the configurations and options that are specific to HnswDocumentIndex.
DBConfig
The DBConfig of HnswDocumentIndex expects only one argument: work_dir.
This is the location where all of the Index's data will be stored, namely the various Hnswlib indexes and the SQLite database.
You can pass this directly to the constructor:
from docarray import BaseDoc
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
class MyDoc(BaseDoc):
    embedding: NdArray[128]
    text: str
db = HnswDocumentIndex[MyDoc](work_dir='./path/to/db')
To load existing data, you can specify a directory that stores data from a previous session.
Hnswlib file lock
Hnswlib uses a file lock to prevent multiple processes from accessing the same index at the same time.
This means that if you try to open an index that is already open in another process, you will get an error.
To avoid this, you can specify a different work_dir for each process.
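One simple way to guarantee a distinct work_dir per process is to derive it from the process id (a sketch; any string that is unique per process works just as well):

```python
import os
import tempfile

# build a work_dir that is unique to the current process, so two
# processes never open the same Hnswlib index files concurrently
work_dir = os.path.join(tempfile.gettempdir(), f'hnsw_index_{os.getpid()}')
os.makedirs(work_dir, exist_ok=True)

# this path can then be passed to the constructor:
# db = HnswDocumentIndex[MyDoc](work_dir=work_dir)
```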
RuntimeConfig
The RuntimeConfig of HnswDocumentIndex contains only one entry: the default mapping from Python types to column configurations.
You can see in the section below how to override configurations for specific fields.
If you want to set configurations globally, i.e. for all vector fields in your documents, you can do that using RuntimeConfig:
import numpy as np
db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db')
db.configure(
    default_column_config={
        np.ndarray: {
            'dim': -1,
            'index': True,
            'space': 'ip',
            'max_elements': 2048,
            'ef_construction': 100,
            'ef': 15,
            'M': 8,
            'allow_replace_deleted': True,
            'num_threads': 5,
        },
        None: {},
    }
)
This will set the default configuration for all vector fields to the one specified in the example above.
Note
Even if your vectors come from PyTorch or TensorFlow, you can (and should) still use the np.ndarray configuration.
This is because all tensors are converted to np.ndarray under the hood.
Note
max_elements sets the initial maximum capacity of the index. However, the capacity of the index is doubled every time the number of Documents in the index exceeds it. Expanding the capacity is an expensive operation, so it is important to choose an appropriate max_elements value at initialization time.
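To illustrate the doubling behavior, here is a back-of-the-envelope sketch (not Hnswlib's actual resizing code) of how capacity grows as Documents are inserted:

```python
def capacity_after_inserts(initial_max_elements: int, num_docs: int) -> tuple:
    """Return (final capacity, number of resizes) after inserting num_docs items."""
    capacity = initial_max_elements
    resizes = 0
    while num_docs > capacity:
        capacity *= 2  # each resize doubles the capacity (an expensive operation)
        resizes += 1
    return capacity, resizes


# inserting 5000 Documents into an index initialized with max_elements=1024
# triggers three expensive resizes: 1024 -> 2048 -> 4096 -> 8192
capacity, resizes = capacity_after_inserts(1024, 5000)
```

Choosing max_elements=8192 up front would avoid all three resizes in this scenario.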
For more information on these settings, see below.
Fields that are not vector fields (e.g. of type str or int) do not offer any configuration, as they are simply stored as-is in a SQLite database.
Field-wise configurations
There are various settings that you can tweak for every vector field that you index into Hnswlib.
You pass all of those using the field: Type = Field(...) syntax:
from pydantic import Field
class Schema(BaseDoc):
    tens: NdArray[100] = Field(max_elements=12, space='cosine')
    tens_two: NdArray[10] = Field(M=4, space='ip')
db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db')
In the example above you can see how to configure two different vector fields, with two different sets of settings.
In this way, you can pass all options that Hnswlib supports:
| Keyword | Description | Default |
| --- | --- | --- |
| max_elements | Maximum number of vectors that can be stored | 1024 |
| space | Vector space (distance metric) the index operates in. Supports 'l2', 'ip', and 'cosine'. Note: In contrast to the other backends, for HnswDocumentIndex 'cosine' refers to cosine distance, not cosine similarity. To transform one into the other, you can use cos_sim = 1 - cos_dist. For more details see here. | 'l2' |
| index | Whether or not an index should be built for this field | True |
| ef_construction | Defines a construction time/accuracy trade-off | 200 |
| ef | Parameter controlling query time/accuracy trade-off | 10 |
| M | Parameter that defines the maximum number of outgoing connections in the graph | 16 |
| allow_replace_deleted | Enables replacing deleted elements with newly added ones | True |
| num_threads | Sets the number of CPU threads to use | 1 |
Note
In HnswDocumentIndex, space='cosine' refers to cosine distance, not to cosine similarity as in the other backends.
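Converting between the two is a one-liner; for example, with NumPy:

```python
import numpy as np

# example cosine *distances*, as HnswDocumentIndex returns them with space='cosine'
cos_dist = np.array([0.0, 0.25, 1.0, 2.0])

# convert to cosine *similarities*
cos_sim = 1 - cos_dist
```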
You can find more details on the parameters here.
Nested Index
When using the index, you can define multiple fields and their nested structure. In the following example, YouTubeVideoDoc includes a tensor field calculated based on the description. YouTubeVideoDoc also has thumbnail and video fields, each with their own tensor.
from docarray.typing import ImageUrl, VideoUrl, AnyTensor
class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    tensor: AnyTensor = Field(space='cosine', dim=128)


class YouTubeVideoDoc(BaseDoc):
    title: str
    description: str
    thumbnail: ImageDoc
    video: VideoDoc
    tensor: AnyTensor = Field(space='cosine', dim=256)
doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2')
index_docs = [
    YouTubeVideoDoc(
        title=f'video {i+1}',
        description=f'this is video from author {10*i}',
        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
        tensor=np.ones(256),
    )
    for i in range(8)
]
doc_index.index(index_docs)
You can use search_field to specify which field to use when performing the vector search, and the dunder operator to specify a field defined in the nested data. In the following code, you perform vector search on the tensor field of the YouTubeVideoDoc, or on the tensor field of the nested thumbnail and video documents:
# example of find on nested and flat index fields
query_doc = YouTubeVideoDoc(
    title='video query',
    description='this is a query video',
    thumbnail=ImageDoc(url='http://example.ai/images/1024', tensor=np.ones(64)),
    video=VideoDoc(url='http://example.ai/videos/1024', tensor=np.ones(128)),
    tensor=np.ones(256),
)
# find by the youtubevideo tensor
docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
# find by the thumbnail tensor
docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
# find by the video tensor
docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
To delete nested data, you need to specify the id.
Note
You can only delete Docs at the top level. Deletion of Docs at lower levels is not yet supported.
Check here for nested data with subindex.
Update elements
In order to update a Document inside the index, you only need to reindex it with the updated attributes.
First, let's create a schema for our Index:
import numpy as np
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from docarray.index import HnswDocumentIndex
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = DocList[MyDoc](
    [
        MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}')
        for i in range(100)
    ]
)
index = HnswDocumentIndex[MyDoc]()
index.index(docs)
assert index.num_docs() == 100
Now we can find relevant documents:
res = index.find(query=docs[0], search_field='embedding', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
    assert 'I am the first version' in doc.text
and update all of the text of these documents and reindex them:
for i, doc in enumerate(docs):
    doc.text = f'I am the second version of Document {i}'
index.index(docs)
assert index.num_docs() == 100
When we retrieve them again, we can see that their text attribute has been updated accordingly.