Annlite#
You can use Annlite as a document store for DocumentArray. It’s suitable for faster Document retrieval on embeddings, i.e. .match() and .find().
Tip
This feature requires annlite. You can install it via pip install "docarray[annlite]".
Usage#
You can instantiate a DocumentArray with Annlite storage like so:
from docarray import DocumentArray
da = DocumentArray(storage='annlite', config={'n_dim': 10})
Usage is the same as for an ordinary DocumentArray.
To access a previously persisted DocumentArray, specify the data_path in config.
from docarray import DocumentArray
da = DocumentArray(storage='annlite', config={'data_path': './data', 'n_dim': 10})
da.summary()
Note that specifying n_dim is mandatory before using Annlite as a backend for DocumentArray.
Other functions behave the same as in an in-memory DocumentArray.
Config#
The following configs can be set:
| Name | Description | Default |
|---|---|---|
| n_dim | Number of dimensions of embeddings to be stored and retrieved | This is always required |
| data_path | The data folder where the data is located | A random temp folder |
| metric | Distance metric to be used during search. Can be ‘cosine’, ‘dot’ or ‘euclidean’ | ‘cosine’ |
| | The size of the dynamic list for the nearest neighbors (used during the construction) | |
| | The size of the dynamic list for the nearest neighbors (used during the search) | |
| | The number of bidirectional links created for every new element during construction. | |
| | The output dimension of PCA model. Should be a positive number and less than | |
| | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| | Boolean flag indicating whether to store | True |
*You can check the default values in the AnnLite source code
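As a rough illustration of the three metric names listed in the table, the following plain-Python sketch computes each quantity for small vectors. The helper names are hypothetical, and the exact conventions AnnLite uses internally (for example, whether cosine is reported as a similarity or as a distance) are an assumption here, not taken from its source.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention); undefined for zero vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)


a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]  # same direction as a, different length
c = [0.0, 3.0, 0.0]  # orthogonal to a

print('cosine(a, b)    =', cosine_distance(a, b))
print('euclidean(a, b) =', euclidean_distance(a, b))
print('dot(a, c)       =', dot(a, c))
```

Cosine ignores vector length, while Euclidean and dot do not, which is worth keeping in mind when choosing a metric for unnormalized embeddings.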
Vector search with filter#
Search with .find can be restricted by user-defined filters.
Filters can be constructed following the guidelines provided in the AnnLite source repository.
Example of .find with a filter only#
Suppose you store Documents with a price tag in annlite and want to retrieve all Documents whose price is lower than or equal to some max_price value.
You can index such Documents as follows:
from docarray import Document, DocumentArray
import numpy as np

n_dim = 3
da = DocumentArray(
    storage='annlite',
    config={
        'n_dim': n_dim,
        # declare 'price' as a filterable column
        'columns': {'price': 'float'},
    },
)

with da:
    da.extend([Document(id=f'r{i}', tags={'price': i}) for i in range(10)])

print('\nIndexed Prices:\n')
for price in da[:, 'tags__price']:
    print(f'\t price={price}')
Then you can retrieve all Documents whose price is lower than or equal to max_price by applying the following filter:
max_price = 3
filter = {'price': {'$lte': max_price}}
results = da.find(filter=filter)

print('\n Returned examples that satisfy condition "price at most 3":\n')
for price in results[:, 'tags__price']:
    print(f'\t price={price}')
This would print:
Returned examples that satisfy condition "price at most 3":
price=0
price=1
price=2
price=3
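To make the meaning of such a filter concrete, here is a minimal plain-Python sketch that evaluates a filter dictionary against a Document's tags. It does not use AnnLite; the `matches` helper and the `OPERATORS` table are hypothetical, and only `$lte` is taken from the example above (the other operator names are assumptions modeled on common query syntaxes).

```python
import operator

# Hypothetical operator table; only '$lte' appears in the example above.
OPERATORS = {
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
}


def matches(tags, query):
    """Return True if every condition in `query` holds for `tags`."""
    for field, conditions in query.items():
        if field not in tags:
            return False
        for op_name, bound in conditions.items():
            if not OPERATORS[op_name](tags[field], bound):
                return False
    return True


# Plain dicts standing in for the ten indexed Documents.
docs = [{'id': f'r{i}', 'tags': {'price': i}} for i in range(10)]
query = {'price': {'$lte': 3}}

kept = [d['tags']['price'] for d in docs if matches(d['tags'], query)]
print(kept)  # [0, 1, 2, 3]
```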
Example of .find with query vector and filter#
Consider Documents with embeddings [0,0,0] up to [9,9,9], where the Document with embedding [i,i,i] has a tag price with value i. We can create such an example with the following code:
from docarray import Document, DocumentArray
import numpy as np

n_dim = 3
metric = 'euclidean'  # one of the metric values listed in the Config table

da = DocumentArray(
    storage='annlite',
    config={'n_dim': n_dim, 'columns': {'price': 'float'}, 'metric': metric},
)

with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
Suppose we want the nearest vectors to the embedding [8. 8. 8.], with the restriction that prices must satisfy a filter. For example, retrieved Documents must have a price value lower than or equal to max_price. We can encode this in annlite using filter = {'price': {'$lte': max_price}}.
Then the search with this filter can be implemented with the following code:
max_price = 7
n_limit = 4

np_query = np.ones(n_dim) * 8
print(f'\nQuery vector: \t{np_query}')

filter = {'price': {'$lte': max_price}}
results = da.find(np_query, filter=filter, limit=n_limit)

print('\nEmbeddings Nearest Neighbours with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This would print:
Query vector: [8. 8. 8.]
Embeddings Nearest Neighbours with "price" at most 7:
embedding=[7. 7. 7.], price=7
embedding=[6. 6. 6.], price=6
embedding=[5. 5. 5.], price=5
embedding=[4. 4. 4.], price=4
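The result above can be checked with a brute-force sketch in plain Python: keep only the entries that pass the price filter, rank them by Euclidean distance to the query, and take the first few. This only mirrors the expected outcome on this toy data. It is not how AnnLite performs the search internally, and the `filtered_nearest` helper is hypothetical.

```python
import math

# Plain dicts standing in for the Documents with embedding [i, i, i] and price i.
docs = [
    {'id': f'r{i}', 'embedding': [float(i)] * 3, 'price': i}
    for i in range(10)
]


def filtered_nearest(query, docs, max_price, limit):
    """Exhaustive search: apply the price filter, then rank by Euclidean distance."""
    candidates = [d for d in docs if d['price'] <= max_price]
    candidates.sort(key=lambda d: math.dist(query, d['embedding']))
    return candidates[:limit]


results = filtered_nearest([8.0, 8.0, 8.0], docs, max_price=7, limit=4)
for d in results:
    print(f"embedding={d['embedding']}, price={d['price']}")
```

Without the filter, the nearest entries to [8. 8. 8.] would include prices 8 and 9; the filter removes them before ranking, which is why the closest returned price is 7.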