Serialization

DocArray offers various serialization options for all of its main data classes: BaseDoc, DocList, and DocVec.

BaseDoc

You need to serialize a BaseDoc before you can store or send it.

Note

BaseDoc supports serialization to the protobuf and JSON formats.

JSON

from typing import List
from docarray import BaseDoc


class MyDoc(BaseDoc):
    text: str
    tags: List[str]


doc = MyDoc(text='hello world', tags=['hello', 'world'])
json_str = doc.json()
new_doc = MyDoc.parse_raw(json_str)
assert doc == new_doc  # True

Protobuf

from typing import List
from docarray import BaseDoc


class MyDoc(BaseDoc):
    text: str
    tags: List[str]


doc = MyDoc(text='hello world', tags=['hello', 'world'])
proto_message = doc.to_protobuf()
new_doc = MyDoc.from_protobuf(proto_message)
assert doc == new_doc  # True

DocList

To send or store a DocList, you need to serialize it. DocList supports several serialization formats.

JSON

  • to_json() serializes a DocList to JSON. In the example below, the returned value is a str that is encoded to bytes before being written to a file.
  • from_json() deserializes a DocList from JSON. It can load from either a str or a binary representation of the JSON object.
from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

with open('simple-dl.json', 'wb') as f:
    json_dl = dl.to_json()
    print(json_dl)
    f.write(json_dl.encode())

with open('simple-dl.json', 'r') as f:
    dl_load_from_json = DocList[SimpleDoc].from_json(f.read())
    print(dl_load_from_json)
'[{"id":"5540e72d407ae81abb2390e9249ed066","text":"doc 0"},{"id":"fbe9f80d2fa03571e899a2887af1ac1b","text":"doc 1"}]'

Protobuf

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

proto_message_dl = dl.to_protobuf()
dl_from_proto = DocList[SimpleDoc].from_protobuf(proto_message_dl)
print(type(proto_message_dl))
print(dl_from_proto)

Base64

When transferring data over the network, use Base64 format to serialize the DocList. Serializing a DocList in Base64 supports both the pickle and protobuf protocols. You can also choose different compression methods.

You can choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

base64_repr_dl = dl.to_base64(compress=None, protocol='pickle')

dl_from_base64 = DocList[SimpleDoc].from_base64(
    base64_repr_dl, compress=None, protocol='pickle'
)
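
To make the idea concrete, here is a minimal standard-library sketch of what a pickle-based Base64 representation involves: serialize an object to bytes, then encode those bytes as ASCII text. It is an illustration only and does not reproduce DocArray's internals. The helper names to_base64_sketch and from_base64_sketch are hypothetical, and plain dicts stand in for documents.

```python
import base64
import pickle


def to_base64_sketch(obj) -> str:
    """Pickle an object and encode the resulting bytes as Base64 text."""
    raw = pickle.dumps(obj)
    return base64.b64encode(raw).decode('ascii')


def from_base64_sketch(text: str):
    """Reverse to_base64_sketch: decode the Base64 text, then unpickle."""
    raw = base64.b64decode(text.encode('ascii'))
    return pickle.loads(raw)


# Plain dicts stand in for documents in this sketch.
docs = [{'text': f'doc {i}'} for i in range(2)]

encoded = to_base64_sketch(docs)
decoded = from_base64_sketch(encoded)
```

Because the encoded value is plain ASCII text, it can be embedded in text-based payloads such as JSON. As with any pickle-based format, only deserialize data from sources you trust.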

Save binary

save_binary() and load_binary() serialize your data and save it to a file, or load it back from one:

You can choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

dl.save_binary('simple-dl.pickle', compress=None, protocol='pickle')

dl_from_binary = DocList[SimpleDoc].load_binary(
    'simple-dl.pickle', compress=None, protocol='pickle'
)

In the above snippet, the DocList is stored as the file simple-dl.pickle.

Bytes

to_bytes() and from_bytes() just serialize and deserialize your data, without saving it to a file:

Note

save_binary() and load_binary() use these methods under the hood to write to and read from a binary file. You can also call them directly to work with the serialized bytes yourself.

Like working with binary files:

  • You can use protocol to choose between pickle and protobuf.
  • You can use multiple compression methods: lz4, bz2, lzma, zlib, and gzip.
from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

bytes_dl = dl.to_bytes(protocol='pickle', compress=None)

dl_from_bytes = DocList[SimpleDoc].from_bytes(
    bytes_dl, compress=None, protocol='pickle'
)
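
The compress argument has to match on both sides: bytes written with one method must be read back with the same one. The sketch below illustrates this with the standard-library codecs only. It is not DocArray's implementation; the COMPRESSORS table and the two helper functions are hypothetical names, and lz4 is left out because it needs a third-party package.

```python
import bz2
import gzip
import lzma
import zlib

# Hypothetical lookup table: compression name -> (compress, decompress).
# lz4 is omitted because it is not part of the standard library.
COMPRESSORS = {
    'bz2': (bz2.compress, bz2.decompress),
    'lzma': (lzma.compress, lzma.decompress),
    'zlib': (zlib.compress, zlib.decompress),
    'gzip': (gzip.compress, gzip.decompress),
}


def compress_bytes(data: bytes, compress=None) -> bytes:
    """Compress data with the named method, or return it unchanged for None."""
    if compress is None:
        return data
    return COMPRESSORS[compress][0](data)


def decompress_bytes(data: bytes, compress=None) -> bytes:
    """Undo compress_bytes; the same method name must be passed."""
    if compress is None:
        return data
    return COMPRESSORS[compress][1](data)


payload = b'doc 0, doc 1, ' * 100  # repetitive data compresses well

# Round-trip the payload with every method and record the packed size.
sizes = {}
for name in [None, *COMPRESSORS]:
    packed = compress_bytes(payload, compress=name)
    assert decompress_bytes(packed, compress=name) == payload
    sizes[name] = len(packed)
```

Comparing the entries of sizes gives a rough feel for the trade-off between methods on a given payload; the best choice depends on your data and on how much CPU time you can spend.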

CSV

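Use to_csv() and from_csv() to write and read a DocList as a CSV file. The dialect parameter selects the dialect of the CSV format; the example below relies on the default: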

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

dl.to_csv('simple-dl.csv')
dl_from_csv = DocList[SimpleDoc].from_csv('simple-dl.csv')
print(dl_from_csv)
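
To show what a CSV dialect changes in practice, the following sketch writes the same rows with three dialects from Python's csv module. It assumes that DocArray's dialect parameter accepts these standard dialect names, which this section does not state explicitly. The rows and the render helper are made up for illustration.

```python
import csv
import io

# Two illustrative rows: a header and one record.
rows = [['id', 'text'], ['1', 'doc 0']]


def render(dialect: str) -> str:
    """Write the rows to an in-memory buffer using the given csv dialect."""
    buffer = io.StringIO()
    csv.writer(buffer, dialect=dialect).writerows(rows)
    return buffer.getvalue()


excel = render('excel')          # comma-separated, \r\n line endings
excel_tab = render('excel-tab')  # tab-separated, \r\n line endings
unix = render('unix')            # every field quoted, \n line endings

print(repr(excel))
print(repr(excel_tab))
print(repr(unix))
```

The dialect controls the delimiter, the quoting rules, and the line terminator, so a file has to be read with the same dialect it was written with.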

Pandas.DataFrame

from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


dl = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

df = dl.to_dataframe()
dl_from_dataframe = DocList[SimpleDoc].from_dataframe(df)
print(dl_from_dataframe)

DocVec

For sending or storing data, DocVec offers an interface very similar to that of DocList.

Tensor type and (de)serialization

You can deserialize any serialized DocVec to any tensor type (NdArray, TorchTensor, or TensorFlowTensor) by passing the tensor_type=... parameter to the appropriate deserialization method. This is analogous to the tensor_type=... parameter in the DocVec constructor.

This means that you can choose at deserialization time if you are working with numpy, PyTorch, or TensorFlow tensors.

If no tensor_type is passed, the default is NdArray.

JSON

  • to_json() serializes a DocVec to JSON. In the example below, the returned value is a str that is encoded to bytes before being written to a file.
  • from_json() deserializes a DocVec from JSON. It can load from either a str or a binary representation of the JSON object.

In contrast to DocList's JSON format, DocVec.to_json() outputs a column-oriented JSON file:

import torch
from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor


class SimpleDoc(BaseDoc):
    text: str
    tensor: TorchTensor


dv = DocVec[SimpleDoc](
    [SimpleDoc(text=f'doc {i}', tensor=torch.rand(64)) for i in range(2)]
)

with open('simple-dv.json', 'wb') as f:
    json_dv = dv.to_json()
    print(json_dv)
    f.write(json_dv.encode())

with open('simple-dv.json', 'r') as f:
    dv_load_from_json = DocVec[SimpleDoc].from_json(f.read(), tensor_type=TorchTensor)
    print(dv_load_from_json)
'{"tensor_columns":{},"doc_columns":{},"docs_vec_columns":{},"any_columns":{"id":["005a208a0a9a368c16bf77913b710433","31d65f02cb94fc9756c57b0dbaac3a2c"],"text":["doc 0","doc 1"]}}'
<DocVec[SimpleDoc] (length=2)>
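
The difference between a row-oriented and a column-oriented layout can be sketched with plain Python and the json module. This is a simplified illustration, not DocArray's actual format: as the output above shows, the real DocVec JSON additionally groups its columns under keys such as tensor_columns and any_columns. The sample ids are made up.

```python
import json

# Row-oriented: one object per document (the shape a list of documents has).
rows = [
    {'id': 'a', 'text': 'doc 0'},
    {'id': 'b', 'text': 'doc 1'},
]

# Column-oriented: one list per field, holding that field for all documents.
columns = {key: [row[key] for row in rows] for key in rows[0]}

row_json = json.dumps(rows)
column_json = json.dumps(columns)

# Converting back: zip the columns together to rebuild one dict per document.
restored = [dict(zip(columns, values)) for values in zip(*columns.values())]

print(row_json)
print(column_json)
```

In the column-oriented form each field name appears once, and all values of a field sit next to each other, which mirrors how DocVec stores its data.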

Protobuf

  • to_protobuf() serializes a DocVec to protobuf. It returns a protobuf object of the docarray_pb2.DocVecProto class.
  • from_protobuf() deserializes a DocVec from protobuf. It accepts a protobuf message object to construct a DocVec.
import numpy as np

from docarray import BaseDoc, DocVec
from docarray.typing import AnyTensor


class SimpleVecDoc(BaseDoc):
    tensor: AnyTensor


dv = DocVec[SimpleVecDoc]([SimpleVecDoc(tensor=np.ones(16)) for _ in range(8)])

proto_message_dv = dv.to_protobuf()

dv_from_proto = DocVec[SimpleVecDoc].from_protobuf(proto_message_dv)

You can deserialize any DocVec protobuf message to any tensor type by passing the tensor_type=... parameter to from_protobuf().

This means that you can choose at deserialization time if you are working with numpy, PyTorch, or TensorFlow tensors.

If no tensor_type is passed, the default is NdArray.

import torch

from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor, NdArray, AnyTensor


class AnyTensorDoc(BaseDoc):
    tensor: AnyTensor


dv = DocVec[AnyTensorDoc](
    [AnyTensorDoc(tensor=torch.ones(16)) for _ in range(8)], tensor_type=TorchTensor
)

proto_message_dv = dv.to_protobuf()

# deserialize to torch
dv_from_proto_torch = DocVec[AnyTensorDoc].from_protobuf(
    proto_message_dv, tensor_type=TorchTensor
)
assert dv_from_proto_torch.tensor_type == TorchTensor
assert isinstance(dv_from_proto_torch.tensor, TorchTensor)

# deserialize to numpy (default)
dv_from_proto_numpy = DocVec[AnyTensorDoc].from_protobuf(proto_message_dv)
assert dv_from_proto_numpy.tensor_type == NdArray
assert isinstance(dv_from_proto_numpy.tensor, NdArray)

Note

Serialization to protobuf is not supported for union types involving BaseDoc types.

Base64

When transferring data over the network, use Base64 format to serialize the DocVec. Serializing a DocVec in Base64 supports both the pickle and protobuf protocols. You can also choose different compression methods.

You can choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor
import torch


class SimpleDoc(BaseDoc):
    text: str
    tensor: TorchTensor


dv = DocVec[SimpleDoc](
    [SimpleDoc(text=f'doc {i}', tensor=torch.rand(64)) for i in range(2)]
)

base64_repr_dv = dv.to_base64(compress=None, protocol='pickle')

dv_from_base64 = DocVec[SimpleDoc].from_base64(
    base64_repr_dv, compress=None, protocol='pickle', tensor_type=TorchTensor
)

Save binary

save_binary() and load_binary() serialize your data and save it to a file, or load it back from one:

You can choose between multiple compression methods: lz4, bz2, lzma, zlib, and gzip.

from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor
import torch


class SimpleDoc(BaseDoc):
    text: str
    tensor: TorchTensor


dv = DocVec[SimpleDoc](
    [SimpleDoc(text=f'doc {i}', tensor=torch.rand(64)) for i in range(2)]
)

dv.save_binary('simple-dv.pickle', compress=None, protocol='pickle')

dv_from_binary = DocVec[SimpleDoc].load_binary(
    'simple-dv.pickle', compress=None, protocol='pickle', tensor_type=TorchTensor
)

In the above snippet, the DocVec is stored as the file simple-dv.pickle.

Bytes

to_bytes() and from_bytes() just serialize and deserialize your data, without saving it to a file:

Note

save_binary() and load_binary() use these methods under the hood to write to and read from a binary file. You can also call them directly to work with the serialized bytes yourself.

Like working with binary files:

  • You can use protocol to choose between pickle and protobuf.
  • You can use multiple compression methods: lz4, bz2, lzma, zlib, and gzip.
from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor
import torch


class SimpleDoc(BaseDoc):
    text: str
    tensor: TorchTensor


dv = DocVec[SimpleDoc](
    [SimpleDoc(text=f'doc {i}', tensor=torch.rand(64)) for i in range(2)]
)

bytes_dv = dv.to_bytes(protocol='pickle', compress=None)

dv_from_bytes = DocVec[SimpleDoc].from_bytes(
    bytes_dv, compress=None, protocol='pickle', tensor_type=TorchTensor
)

CSV

Warning

DocVec does not support to_csv() or from_csv(). This is because CSV is a row-based format, while DocVec has a column-based data layout. To work around this, convert your DocVec to a DocList first.

from docarray import BaseDoc, DocList, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

dv.to_doc_list().to_csv('simple-dl.csv')
dv_from_csv = DocList[SimpleDoc].from_csv('simple-dl.csv').to_doc_vec()

For more details, see the DocList section on CSV serialization.

Pandas.DataFrame

from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor
import torch


class SimpleDoc(BaseDoc):
    text: str
    tensor: TorchTensor


dv = DocVec[SimpleDoc](
    [SimpleDoc(text=f'doc {i}', tensor=torch.rand(64)) for i in range(2)]
)

df = dv.to_dataframe()
dv_from_dataframe = DocVec[SimpleDoc].from_dataframe(df, tensor_type=TorchTensor)
print(dv_from_dataframe)