Weaviate Document Index
Install dependencies
To use WeaviateDocumentIndex, you need to install extra dependencies with the following command:
This is the user guide for the WeaviateDocumentIndex, focusing on special features and configurations of Weaviate.
For general usage of a Document Index, see the general user guide.
1. Start Weaviate service
To use WeaviateDocumentIndex, DocArray needs to hook into a running Weaviate service. There are multiple ways to start a Weaviate instance, depending on your use case.
1.1. Options - Overview
Instance type | General use case | Configurability | Notes |
---|---|---|---|
Weaviate Cloud Services (WCS) | Development and production | Limited | Recommended for most users |
Embedded Weaviate | Experimentation | Limited | Experimental (as of Apr 2023) |
Docker-Compose | Development | Yes | Recommended for development + customizability |
Kubernetes | Production | Yes |
1.2. Instantiation instructions
1.2.1. WCS (managed instance)
Go to the WCS console and create an instance using the visual interface, following this guide.
Weaviate instances on WCS come pre-configured, so no further configuration is required.
1.2.2. Docker-Compose (self-managed)
Get a configuration file (docker-compose.yaml
). You can build it using this interface, or download it directly with:
curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?modules=standalone&runtime=docker-compose&weaviate_version=v<WEAVIATE_VERSION>"
Where v<WEAVIATE_VERSION>
is the actual version, such as v1.18.3
.
curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?modules=standalone&runtime=docker-compose&weaviate_version=v1.18.3"
1.2.2.1 Start up Weaviate with Docker-Compose
Then you can start up Weaviate by running from a shell:
1.2.2.2 Shut down Weaviate
Then you can shut down Weaviate by running from a shell:
Notes
Unless data persistence or backups are set up, shutting down the Docker instance will remove all its data.
See documentation on Persistent volume and Backups to prevent this if persistence is desired.
1.2.3. Embedded Weaviate (from the application)
With Embedded Weaviate, Weaviate database server can be launched from the client, using:
1.3. Authentication
Weaviate offers multiple authentication options, as well as authorization options.
With DocArray, you can use any of:
- Anonymous access (public instance),
- OIDC with username & password, and
- API-key based authentication.
To access a Weaviate instance. In general, Weaviate recommends using API-key based authentication for balance between security and ease of use. You can create, for example, read-only keys to distribute to certain users, while providing read/write keys to administrators.
See below for examples of connection to Weaviate for each scenario.
1.4. Connect to Weaviate
Public instance
If using Embedded Weaviate:
from docarray.index.backends.weaviate import EmbeddedOptions
dbconfig = WeaviateDocumentIndex.DBConfig(embedded_options=EmbeddedOptions())
For all other options:
dbconfig = WeaviateDocumentIndex.DBConfig(
host="http://localhost:8080"
) # Replace with your endpoint)
OIDC with username + password
To authenticate against a Weaviate instance with OIDC username & password:
dbconfig = WeaviateDocumentIndex.DBConfig(
username="username", # Replace with your username
password="password", # Replace with your password
host="http://localhost:8080", # Replace with your endpoint
)
# dbconfig = WeaviateDocumentIndex.DBConfig(
# username="username", # Replace with your username
# password="password", # Replace with your password
# host="http://localhost:8080", # Replace with your endpoint
# )
API key-based authentication
To authenticate against a Weaviate instance an API key:
dbconfig = WeaviateDocumentIndex.DBConfig(
auth_api_key="apikey", # Replace with your own API key
host="http://localhost:8080", # Replace with your endpoint
)
2. Configure Weaviate
2.1. Overview
WCS instances come pre-configured, and as such additional settings are not configurable outside of those chosen at creation, such as whether to enable authentication.
For other cases, such as Docker-Compose deployment, its settings can be modified through the configuration file, such as the docker-compose.yaml
file.
Some of the more commonly used settings include:
- Persistent volume: Set up data persistence so that data from inside the Docker container is not lost on shutdown
- Enabling a multi-node setup
- Backups
- Authentication (server-side)
- Modules enabled
And a list of environment variables is available on this page.
2.2. DocArray instantiation configuration options
Additionally, you can specify the below settings when you instantiate a configuration object in DocArray.
name | type | explanation | default | example |
---|---|---|---|---|
Category: General | ||||
host | str | Weaviate instance url | http://localhost:8080 | |
Category: Authentication | ||||
username | str | Username known to the specified authentication provider (e.g. WCS) | None | jp@weaviate.io |
password | str | Corresponding password | None | p@ssw0rd |
auth_api_key | str | API key known to the Weaviate instance | None | mys3cretk3y |
Category: Data schema | ||||
index_name | str | Class name to use to store the document | The document class name, e.g. MyDoc for WeaviateDocumentIndex[MyDoc] |
Document |
Category: Embedded Weaviate | ||||
embedded_options | EmbeddedOptions | Options for embedded weaviate | None |
The type EmbeddedOptions
can be specified as described here
2.3. Runtime configuration
Weaviate strongly recommends using batches to perform bulk operations such as importing data, as it will significantly impact performance. You can specify a batch configuration as in the below example, and pass it on as runtime configuration.
batch_config = {
"batch_size": 20,
"dynamic": False,
"timeout_retries": 3,
"num_workers": 1,
}
runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)
dbconfig = WeaviateDocumentIndex.DBConfig(
host="http://localhost:8080"
) # Replace with your endpoint and/or auth settings
store = WeaviateDocumentIndex[Document](db_config=dbconfig)
store.configure(runtimeconfig) # Batch settings being passed on
name | type | explanation | default |
---|---|---|---|
batch_config | Dict[str, Any] | dictionary to configure the weaviate client's batching logic | see below |
Read more:
3. Available column types
Python data types are mapped to Weaviate type according to the below conventions.
Python type | Weaviate type |
---|---|
docarray.typing.ID | string |
str | text |
int | int |
float | number |
bool | boolean |
np.ndarray | number[] |
AbstractTensor | number[] |
bytes | blob |
You can override this default mapping by passing a col_type
to the Field
of a schema.
For example to map str
to string
you can:
A list of available Weaviate data types is here.
4. Adding example data
Putting it together, we can add data below using Weaviate as the Document Index:
import numpy as np
from pydantic import Field
from docarray import BaseDoc
from docarray.typing import NdArray
from docarray.index.backends.weaviate import WeaviateDocumentIndex
# Define a document schema
class Document(BaseDoc):
text: str
embedding: NdArray[2] = Field(
dims=2, is_embedding=True
) # Embedding column -> vector representation of the document
file: NdArray[100] = Field(dims=100)
# Make a list of 3 docs to index
docs = [
Document(
text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1"
),
Document(
text="Hello world, how are you?",
embedding=np.array([3, 4]),
file=np.random.rand(100),
id="2",
),
Document(
text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut",
embedding=np.array([5, 6]),
file=np.random.rand(100),
id="3",
),
]
batch_config = {
"batch_size": 20,
"dynamic": False,
"timeout_retries": 3,
"num_workers": 1,
}
runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)
store = WeaviateDocumentIndex[Document](db_config=dbconfig)
store.configure(runtimeconfig) # Batch settings being passed on
store.index(docs)
4.1. Notes
- To use vector search, you need to specify
is_embedding
for exactly one field.- This is because Weaviate is configured to allow one vector per data object.
- If you would like to see Weaviate support multiple vectors per object, upvote the issue which will help to prioritize it.
- For a field to be considered as an embedding, its type needs to be of subclass
np.ndarray
orAbstractTensor
andis_embedding
needs to be set toTrue
.- If
is_embedding
is set toFalse
or not provided, the field will be treated as anumber[]
, and as a result, it will not be added to Weaviate's vector index.
- If
- It is possible to create a schema without specifying
is_embedding
for any field.- This will however mean that the document will not be vectorized and cannot be searched using vector search.
5. Query Builder/Hybrid Search
5.1. Text search
To perform a text search, follow the below syntax.
This will perform a text search for the word "hello" in the field "text" and return the first two results:
q = store.build_query().text_search("world", search_field="text").limit(2).build()
docs = store.execute_query(q)
docs
5.2. Vector similarity search
To perform a vector similarity search, follow the below syntax.
This will perform a vector similarity search for the vector [1, 2] and return the first two results:
5.3. Hybrid search
To perform a hybrid search, follow the below syntax.
This will perform a hybrid search for the word "hello" and the vector [1, 2] and return the first two results:
Note: Hybrid search searches through the object vector and all fields. Accordingly, the search_field
keyword it will have no effect.
q = store.build_query().text_search("world").find([1, 2]).limit(2).build()
docs = store.execute_query(q)
docs
5.4. GraphQL query
You can also perform a raw GraphQL query using any syntax as you might natively in Weaviate. This allows you to run any of the full range of queries that you might wish to.
The below will perform a GraphQL query to obtain the count of Document
objects.
graphql_query = """
{
Aggregate {
Document {
meta {
count
}
}
}
}
"""
store.execute_query(graphql_query)
Note that running a raw GraphQL query will return Weaviate-type responses, rather than a DocArray object type.
You can find the documentation for Weaviate's GraphQL API here.
6. Other notes
6.1. DocArray IDs vs Weaviate IDs
As you saw earlier, the id
field is a special field that is used to identify a document in BaseDoc
.
This is not the same as Weaviate's own id
, which is a reserved keyword and can't be used as a field name.
Accordingly, the DocArray document id is stored internally in Weaviate as docarrayid
.