🔤 Text

DocArray supports many modalities, including text. This section shows how to load and handle text data with DocArray.

Tip

Check out our predefined TextDoc to get started and play around with our text features.

You can store text in DocArray like this:

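from typing import Optional

from docarray import BaseDoc


class MyText(BaseDoc):
    # Optional makes the None default consistent with the type annotation
    text: Optional[str] = None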


doc = MyText(text='Hello world!')

Text can include any type of character, including emojis:

doc.text = '👋   नमस्ते दुनिया!  你好世界！こんにちは世界！   Привет мир!'

Load text file

If your text data is too long to be written inline or if it is stored in a file, you can first define the URL as a TextUrl and then load the text data.

Let's first define a schema:

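from typing import Optional

from docarray import BaseDoc
from docarray.typing import TextUrl


class MyText(BaseDoc):
    # Both fields are optional so a document can be created from a URL alone
    text: Optional[str] = None
    url: Optional[TextUrl] = None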

Next, instantiate a MyText object with a url and load its content into the text field:

doc = MyText(
    url='https://www.w3.org/History/19921103-hypertext/hypertext/README.html',
)
doc.text = doc.url.load()

assert doc.text.startswith('<TITLE>Read Me</TITLE>')
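
As the assertion shows, the loaded string is the raw HTML of the page, tags included. If you want only the text between the tags, one option is Python's standard-library HTMLParser. The following is a minimal sketch, not part of DocArray: html_to_text is a hypothetical helper, and the sample string is made up apart from its '<TITLE>Read Me</TITLE>' prefix.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        # Called for every run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the text nodes of `html`, one per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return '\n'.join(parser.chunks)


# Made-up sample; in practice you would pass the string returned by doc.url.load()
sample = '<TITLE>Read Me</TITLE>\n<H1>Hello</H1>'
print(html_to_text(sample))
```

This sketch keeps every text node, so the contents of script or style elements would be included as well. Filter those out separately if the page contains them.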

Segment long texts

When you index or search text data, you often don't want to treat thousands of words as one huge string. A finer granularity is usually more useful, and nested fields provide it. For example, let's split some page content into sentences on '.':

from docarray import BaseDoc, DocList


class Sentence(BaseDoc):
    text: str


class Page(BaseDoc):
    content: DocList[Sentence]


long_text = 'First sentence. Second sentence. And many many more sentences.'
page = Page(content=[Sentence(text=t) for t in long_text.split('.')])

page.summary()

Output

📄 Page : 13d909a ...
└── 💠 content: DocList[Sentence]
    ├── 📄 Sentence : 6725382 ...
    │   ╭────────────────┬─────────────────────╮
    │   │ Attribute      │ Value               │
    │   ├────────────────┼─────────────────────┤
    │   │ text: str      │ First sentence      │
    │   ╰────────────────┴─────────────────────╯
    ├── 📄 Sentence : 17a934c ...
    │   ╭───────────────┬──────────────────────╮
    │   │ Attribute     │ Value                │
    │   ├───────────────┼──────────────────────┤
    │   │ text: str     │  Second sentence     │
    │   ╰───────────────┴──────────────────────╯
    └── ... 2 more Sentence documents
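
Note that a plain split on '.' keeps the space that follows each period and produces a trailing empty string, which is why the summary reports four Sentence documents for three sentences. A minimal sketch of a cleaner split, using only the standard library (split_sentences is a hypothetical helper name, not a DocArray function):

```python
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on '.', strip surrounding whitespace, and drop empty pieces."""
    pieces = (piece.strip() for piece in text.split('.'))
    return [piece for piece in pieces if piece]


long_text = 'First sentence. Second sentence. And many many more sentences.'
sentences = split_sentences(long_text)
print(sentences)
```

The resulting strings can be wrapped in Sentence(text=...) in the same way as in the example above. Splitting on '.' is still naive (abbreviations and decimal numbers will be cut), so treat this as an illustration of the nesting pattern, not as a sentence tokenizer.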

Getting started - Predefined TextDoc

To help you get started and experiment with your text data, DocArray provides a predefined TextDoc that covers all of the functionality described above:

class TextDoc(BaseDoc):
    text: Optional[str] = None
    url: Optional[TextUrl] = None
    embedding: Optional[AnyEmbedding] = None
    bytes_: Optional[bytes] = None