Serialization

DocArray offers several serialization options for each of its main data classes: BaseDoc, DocList, and DocVec.

BaseDoc

You need to serialize a BaseDoc before you can store or send it.

Note

BaseDoc supports serialization to the Protobuf and JSON formats.

JSON

from typing import List
from docarray import BaseDoc


class MyDoc(BaseDoc):
    text: str
    tags: List[str]


doc = MyDoc(text='hello world', tags=['hello', 'world'])
json_str = doc.json()
new_doc = MyDoc.parse_raw(json_str)
assert doc == new_doc  # the JSON round trip preserves the document

Protobuf

from typing import List
from docarray import BaseDoc


class MyDoc(BaseDoc):
    text: str
    tags: List[str]


doc = MyDoc(text='hello world', tags=['hello', 'world'])
proto_message = doc.to_protobuf()
new_doc = MyDoc.from_protobuf(proto_message)
assert doc == new_doc  # the Protobuf round trip preserves the document

DocList

To send or store a DocList, you need to serialize it. DocList supports several serialization formats.

JSON

  • to_json() serializes a DocList to JSON. It returns the binary representation of the JSON object.
  • from_json() deserializes a DocList from JSON. It can load from either a str or binary representation of the JSON object.
from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

with open('simple-dl.json', 'wb') as f:
    json_dl = dl.to_json()
    print(json_dl)
    f.write(json_dl)

with open('simple-dl.json', 'r') as f:
    dl_load_from_json = DocList[SimpleDoc].from_json(f.read())
    print(dl_load_from_json)
b'[{"id":"5540e72d407ae81abb2390e9249ed066","text":"doc 0"},{"id":"fbe9f80d2fa03571e899a2887af1ac1b","text":"doc 1"}]'
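
The output above shows that to_json() produces a JSON array with one object per document. As a minimal sketch that uses only the standard library (no DocArray calls), the payload printed above can be inspected with json.loads, which accepts both bytes and str. The variable names here are illustrative only.

```python
import json

# Payload copied verbatim from the printed output above.
payload = (
    b'[{"id":"5540e72d407ae81abb2390e9249ed066","text":"doc 0"},'
    b'{"id":"fbe9f80d2fa03571e899a2887af1ac1b","text":"doc 1"}]'
)

# json.loads accepts the binary form directly ...
records = json.loads(payload)

# ... and the decoded str form gives the same result.
records_from_str = json.loads(payload.decode('utf-8'))

# Each array element is one document with its fields as keys.
texts = [record['text'] for record in records]
print(texts)
```

Because the payload is ordinary JSON, any consumer can read it without DocArray installed, although the typed DocList is only recovered through from_json().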

Protobuf

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

proto_message_dl = dl.to_protobuf()
dl_from_proto = DocList[SimpleDoc].from_protobuf(proto_message_dl)
print(type(proto_message_dl))
print(dl_from_proto)

Base64

When transferring data over the network, you can serialize the DocList to Base64. Base64 serialization supports both the pickle and protobuf protocols.

You can choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

base64_repr_dl = dl.to_base64(compress=None, protocol='pickle')

dl_from_base64 = DocList[SimpleDoc].from_base64(
    base64_repr_dl, compress=None, protocol='pickle'
)
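
To make the roles of protocol and compress more concrete, here is a conceptual sketch built only on the standard library. It is not DocArray's implementation: the helper names to_base64_sketch and from_base64_sketch are hypothetical, the documents are plain dicts, and only gzip is wired in. It shows the general pipeline (serialize, optionally compress, then Base64-encode) and why the same compress value has to be passed when loading.

```python
import base64
import gzip
import pickle

# Hypothetical stand-in for a list of documents (plain dicts, not DocArray objects).
docs = [{'text': f'doc {i}'} for i in range(2)]


def to_base64_sketch(obj, compress=None):
    """Pickle obj, optionally gzip the bytes, and return an ASCII Base64 string."""
    raw = pickle.dumps(obj)
    if compress == 'gzip':
        raw = gzip.compress(raw)
    return base64.b64encode(raw).decode('ascii')


def from_base64_sketch(data, compress=None):
    """Reverse the steps of to_base64_sketch in the opposite order."""
    raw = base64.b64decode(data)
    if compress == 'gzip':
        raw = gzip.decompress(raw)
    return pickle.loads(raw)


encoded = to_base64_sketch(docs, compress='gzip')
decoded = from_base64_sketch(encoded, compress='gzip')
print(decoded == docs)
```

The encoded value is plain ASCII text, which is what makes Base64 convenient for text-based channels. Decoding with a different compress setting than the one used for encoding would fail, so both sides need to agree on the options.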

Save binary

The save_binary() and load_binary() methods serialize your data and save it to, or load it from, a binary file.

You can choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

dl.save_binary('simple-dl.pickle', compress=None, protocol='pickle')

dl_from_binary = DocList[SimpleDoc].load_binary(
    'simple-dl.pickle', compress=None, protocol='pickle'
)

In the above snippet, the DocList is stored as the file simple-dl.pickle.

Bytes

The to_bytes() and from_bytes() methods serialize and deserialize your data in memory, without saving it to a file.

Note

save_binary() and load_binary() use these methods under the hood to prepare the data that is written to, or read from, a binary file. You can also use them directly to work with raw bytes.

As with binary files:

  • You can use protocol to choose between pickle and protobuf.
  • You can use compress to choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

bytes_dl = dl.to_bytes(protocol='pickle', compress=None)

dl_from_bytes = DocList[SimpleDoc].from_bytes(
    bytes_dl, compress=None, protocol='pickle'
)
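
When choosing a compression method, it can help to compare sizes on data that resembles your own. The following sketch uses only the standard library codecs (lz4 is a third-party package and is left out) on a pickled list of plain dicts. It does not call DocArray, the payload is a hypothetical stand-in, and the resulting sizes depend entirely on the data.

```python
import bz2
import gzip
import lzma
import pickle
import zlib

# Hypothetical payload: a pickled list of small, repetitive records.
payload = pickle.dumps([{'text': f'doc {i}'} for i in range(1000)])

# Map each method name to its (compress, decompress) pair from the standard library.
codecs = {
    'zlib': (zlib.compress, zlib.decompress),
    'gzip': (gzip.compress, gzip.decompress),
    'bz2': (bz2.compress, bz2.decompress),
    'lzma': (lzma.compress, lzma.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(payload)
    # Each codec must round-trip the payload exactly.
    assert decompress(blob) == payload
    sizes[name] = len(blob)

print('uncompressed:', len(payload))
for name, size in sizes.items():
    print(f'{name}: {size}')
```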

CSV

Use the dialect parameter to choose the dialect of the CSV format:

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

dl.to_csv('simple-dl.csv')
dl_from_csv = DocList[SimpleDoc].from_csv('simple-dl.csv')
print(dl_from_csv)
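
The snippet above relies on the default dialect. Assuming the dialect parameter accepts the dialect names of Python's csv module (such as 'excel', 'excel-tab', and 'unix'), the sketch below shows how those dialects differ. It uses only the standard library, and the rows and the render helper are hypothetical.

```python
import csv
import io

# Hypothetical rows resembling a header plus one document.
rows = [['id', 'text'], ['0', 'doc 0']]


def render(dialect):
    """Write the rows with the given csv dialect and return the raw text."""
    buf = io.StringIO(newline='')
    csv.writer(buf, dialect=dialect).writerows(rows)
    return buf.getvalue()


# 'excel' uses commas, 'excel-tab' uses tabs, 'unix' quotes every field.
for name in ('excel', 'excel-tab', 'unix'):
    print(name, repr(render(name)))
```

Whichever dialect is used for writing, the same one should be used for reading so that delimiters and quoting are interpreted consistently.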

pandas.DataFrame

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

df = dl.to_dataframe()
dl_from_dataframe = DocList[SimpleDoc].from_dataframe(df)
print(dl_from_dataframe)

DocVec

To send or store a DocVec, you need to serialize it with protobuf.

Note

We plan to add more serialization formats in the future, notably JSON.

Protobuf

  • to_protobuf() serializes a DocVec to protobuf. It returns a protobuf object of the docarray_pb2.DocVecProto class.
  • from_protobuf() deserializes a DocVec from protobuf. It accepts a protobuf message object and constructs a DocVec from it.

import numpy as np

from docarray import BaseDoc, DocVec
from docarray.typing import AnyTensor


class SimpleVecDoc(BaseDoc):
    tensor: AnyTensor


dv = DocVec[SimpleVecDoc]([SimpleVecDoc(tensor=np.ones(16)) for _ in range(8)])

proto_message_dv = dv.to_protobuf()

dv_from_proto = DocVec[SimpleVecDoc].from_protobuf(proto_message_dv)