Migration guide
If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.
DocArray >=0.30 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.
This gives the following advantages:
- Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
- Multi-modality: Easily store multiple modalities and multiple embeddings in the same document.
- Language agnostic: At their core, documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.
You may also be familiar with our old Document Stores for vector DB integration. They are now called Document Indexes and offer the following improvements:
- Hybrid search: You can now combine vector search with text search, and even filter by arbitrary fields.
- Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
- Increased flexibility: We strive to support any configuration or setting that you could perform through the DB's first-party client.
For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Changes to Document
- Document has been renamed to BaseDoc. BaseDoc cannot be used directly; it has to be extended. Each document class is therefore created through a dataclass-like interface.
- As a consequence, extending BaseDoc allows for a flexible schema, whereas the Document class in v1 only allowed a fixed schema with one of tensor, text and blob, plus the additional chunks and matches.
- Because of this added flexibility, DocArray cannot know in advance which fields your document class will provide. Therefore, various methods from v1 (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, some of those methods are provided at the typing level.
- v2 provides the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. LegacyDocument can be useful as a starting point for migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with the Document of DocArray <=0.21: none of the methods associated with Document are present. Only the schema of the data is similar.
Changes to DocumentArray
DocList
- The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.
DocVec
- Additionally, we have introduced the class DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
- DocVec is a container of Documents appropriate for performing computation that requires batches of data (e.g. matrix multiplication, distance calculation, deep learning forward pass).
- A DocVec has an interface similar to that of DocList, but its underlying implementation is column-based instead of row-based. Each field of the schema of the DocVec (the .doc_type, which is a BaseDoc) is stored in a column. If the field is a tensor, the data from all Documents is stored as a single doc_vec (torch/np/tf) tensor. If the tensor field is AnyTensor or a Union of tensor types, .tensor_type is used to determine the type of the doc_vec column.
Parameterized DocList
- Because your document schema is now flexible, and there are endless ways to design it, a DocList does not necessarily have to be homogeneous when you initialize it.
- If you want a homogeneous DocList, you can parameterize it at initialization time by giving the document class in square brackets, in the form DocList[MyDoc](...), where MyDoc stands for your own document class.
- Methods like .from_csv() or .pull() only work with parameterized DocLists.
Access attributes of your DocumentArray
- In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocumentArray instance.
- In v2 you don't have to use the plural; you just use the document's attribute name, since AnyDocArray exposes the same attributes as the BaseDocs it contains. This returns a list of type(attribute). However, this works if (and only if) all the BaseDocs in the AnyDocArray have the same schema. Therefore only this works:
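from typing import Optional

from docarray import BaseDoc, DocList


class Book(BaseDoc):
    title: str
    author: Optional[str] = None  # optional field, so None is a valid default


# parameterized DocList: every document follows the Book schema
docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail, because the DocList is not parameterized
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title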
Changes to Document Store
In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity.
DocArray >=0.30 DocIndex supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.
Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice,
in v2 you initialize a DocIndex object for the backend of your choice, such as one backed by HNSWLib.
In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.