🔤 Text
DocArray supports many different modalities, including Text.
This section will show you how to load and handle text data using DocArray.
Tip
Check out our predefined TextDoc to get started and play around with our text features.
You can store text in DocArray like this:
from docarray import BaseDoc

class MyText(BaseDoc):
    text: str = None

doc = MyText(text='Hello world!')
Text can include any type of character, including emojis:
doc.text = '👋 नमस्ते दुनिया! 你好世界！こんにちは世界！ Привет мир!'
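As a small illustration of why this works, Python strings hold Unicode code points, so an emoji counts as one character even though it occupies several bytes once encoded. The sketch below uses only built-in string methods and assumes UTF-8 as the encoding; the variable names are illustrative.

```python
# A Python str stores Unicode code points, not bytes.
text = 'Hello 👋'

# Six ASCII characters plus one emoji code point.
num_characters = len(text)

# Encoding to UTF-8 turns the emoji into four bytes.
utf8_bytes = text.encode('utf-8')
num_bytes = len(utf8_bytes)

# Decoding with the same charset restores the original string.
round_trip = utf8_bytes.decode('utf-8')
```

Decoding with a different charset than the one used for encoding is the usual cause of garbled non-ASCII text.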
Load text file
If your text data is too long to be written inline, or if it is stored in a file, you can first define the URL as a TextUrl and then load the text data.
Let's first define a schema:
from docarray import BaseDoc
from docarray.typing import TextUrl

class MyText(BaseDoc):
    text: str = None
    url: TextUrl = None
Next, instantiate a MyText object with a url attribute and load its content into the text field.
doc = MyText(
    url='https://www.w3.org/History/19921103-hypertext/hypertext/README.html',
)
doc.text = doc.url.load()
assert doc.text.startswith('<TITLE>Read Me</TITLE>')
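To make the fetch-then-decode step concrete without network access, here is a minimal sketch that uses only the Python standard library. The helper `load_text`, its `charset` parameter, and the sample file are illustrative assumptions; this is not DocArray's implementation of loading a TextUrl, only an approximation of the idea.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def load_text(url: str, charset: str = 'utf-8') -> str:
    """Hypothetical helper: fetch the bytes behind a URL and decode them to str."""
    with urlopen(url) as response:
        return response.read().decode(charset)


# Write a small sample file so the example runs offline.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample.txt'
    sample_path.write_text('Hello world!', encoding='utf-8')
    # A file:// URL goes through the same urlopen call as an http(s):// URL.
    loaded_text = load_text(sample_path.as_uri())
```

The decoded result is a plain `str`, which is the kind of value that ends up in the text field of the schema.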
Segment long texts
When you index or search text data, you often don't want to treat thousands of words as one huge string.
Instead, a finer granularity is useful. You can achieve this with nested fields. For example, let's split some page content into sentences on '.':
from docarray import BaseDoc, DocList

class Sentence(BaseDoc):
    text: str

class Page(BaseDoc):
    content: DocList[Sentence]

long_text = 'First sentence. Second sentence. And many many more sentences.'
page = Page(content=[Sentence(text=t) for t in long_text.split('.')])
page.summary()
Output
📄 Page : 13d909a ...
└── 💠 content: DocList[Sentence]
    ├── 📄 Sentence : 6725382 ...
    │   ╭────────────────┬─────────────────────╮
    │   │ Attribute      │ Value               │
    │   ├────────────────┼─────────────────────┤
    │   │ text: str      │ First sentence      │
    │   ╰────────────────┴─────────────────────╯
    ├── 📄 Sentence : 17a934c ...
    │   ╭───────────────┬──────────────────────╮
    │   │ Attribute     │ Value                │
    │   ├───────────────┼──────────────────────┤
    │   │ text: str     │ Second sentence      │
    │   ╰───────────────┴──────────────────────╯
    └── ... 2 more Sentence documents
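Note that a plain `split('.')` keeps the leading space of each later sentence and yields an empty string after the final period, which is why the summary reports four Sentence documents for three sentences. A minimal sketch of a stricter splitter, using only built-in string methods, is shown below; `split_sentences` is a hypothetical helper name, not part of DocArray.

```python
from typing import List


def split_sentences(text: str, sep: str = '.') -> List[str]:
    """Hypothetical helper: split on `sep`, trim whitespace, drop empty pieces."""
    return [part.strip() for part in text.split(sep) if part.strip()]


long_text = 'First sentence. Second sentence. And many many more sentences.'

naive_parts = long_text.split('.')         # includes ' Second sentence' and ''
sentences = split_sentences(long_text)     # three clean sentences

# With the schema above, the cleaned pieces could then be wrapped as:
# page = Page(content=[Sentence(text=s) for s in sentences])
```

Splitting on a single character is still a rough heuristic; abbreviations and decimal numbers would need a more careful tokenizer.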
Getting started - Predefined TextDoc
To get started and play around with your text data, DocArray provides a predefined TextDoc, which includes all of the previously mentioned functionality: