📊 Table
DocArray supports many different modalities including tabular data. This section will show you how to load and handle tabular data using DocArray.
Load CSV table
A common way to store tabular data is via CSV (comma-separated values) files. You can load such data from a given CSV file into a DocList.
Let's take a look at the following example file, which includes data about books and their authors and year of publication:
title,author,year
Harry Potter and the Philosopher's Stone,J. K. Rowling,1997
Klara and the sun,Kazuo Ishiguro,2020
A little life,Hanya Yanagihara,2015
First, define the Document schema describing the data:
Next, load the content of the CSV file into a DocList instance of Books via .from_csv():
from docarray import DocList

docs = DocList[Book].from_csv(
    file_path='https://github.com/docarray/docarray/blob/main/tests/toydata/books.csv?raw=true'
)
docs.summary()
Output
The resulting DocList object contains three Books, since each row of the CSV file corresponds to one book and is assigned to one Book instance.
Save to CSV file
Conversely, you can store your DocList data in a .csv file using .to_csv():
Tabular data is often not the best choice to represent nested Documents. Hence, nested Documents will be stored flattened and can be accessed by their '__'-separated access paths.
Let's take a look at an example. We now want to store not only the book data but also book review data. To do so, we define a BookReview class that has a nested book attribute as well as the non-nested attributes n_ratings and stars:
from docarray import BaseDoc, DocList


class BookReview(BaseDoc):
    book: Book
    n_ratings: int
    stars: float


review_docs = DocList[BookReview](
    [BookReview(book=book, n_ratings=12345, stars=5) for book in docs]
)
review_docs.summary()
Output
╭───────── DocList Summary ─────────╮
│                                   │
│   Type     DocList[BookReview]    │
│   Length   3                      │
│                                   │
╰───────────────────────────────────╯
╭──── Document Schema ────╮
│                         │
│   BookReview            │
│   ├── book: Book        │
│   │   ├── title: str    │
│   │   ├── author: str   │
│   │   └── year: int     │
│   ├── n_ratings: int    │
│   └── stars: float      │
│                         │
╰─────────────────────────╯
As expected, when this DocList is saved with .to_csv(), all nested attributes are stored under their access paths:
id,book__id,book__title,book__author,book__year,n_ratings,stars
d6363aa3b78b4f4244fb976570a84ff7,8cd85fea52b3a3bc582cf56c9d612cbb,Harry Potter and the Philosopher's Stone,J. K. Rowling,1997,12345,5.0
5b53fff67e6b6cede5870f2ee09edb05,87b369b93593967226c525cf226e3325,Klara and the sun,Kazuo Ishiguro,2020,12345,5.0
addca0475756fc12cdec8faf8fb10d71,03194cec1b75927c2259b3c0fff1ab6f,A little life,Hanya Yanagihara,2015,12345,5.0
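To make the access-path convention concrete, the sketch below parses the first data row above with Python's standard csv module and regroups the '__'-separated column names into a nested dictionary. The helper `unflatten` is hypothetical and only illustrates the naming scheme; it is not DocArray's own loader.

```python
import csv
import io

csv_text = (
    "id,book__id,book__title,book__author,book__year,n_ratings,stars\n"
    "d6363aa3b78b4f4244fb976570a84ff7,8cd85fea52b3a3bc582cf56c9d612cbb,"
    "Harry Potter and the Philosopher's Stone,J. K. Rowling,1997,12345,5.0\n"
)


def unflatten(row, sep="__"):
    """Hypothetical helper: turn 'a__b' keys into nested dicts {'a': {'b': ...}}."""
    nested = {}
    for path, value in row.items():
        *parents, leaf = path.split(sep)
        target = nested
        for key in parents:
            # Create the intermediate dict on first use, then descend into it.
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested


row = next(csv.DictReader(io.StringIO(csv_text)))
review = unflatten(row)

# Values stay strings here; no type conversion is attempted in this sketch.
print(review["book"]["title"])
print(review["n_ratings"])
```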
Handle TSV tables
Not only can you load and save comma-separated values (CSV) data, but also tab-separated values (TSV), by adjusting the dialect parameter in .from_csv() and .to_csv().
The dialect defaults to 'excel', which refers to comma-separated values. For tab-separated values, you can use 'excel-tab'.
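Both names are also dialects registered in Python's standard csv module, so their settings can be inspected directly. A small sketch using only the standard library (the sample row is made up to mirror the output shown below):

```python
import csv
import io

# 'excel' and 'excel-tab' are dialects registered by the csv module.
comma = csv.get_dialect('excel')
tab = csv.get_dialect('excel-tab')
print(repr(comma.delimiter), repr(tab.delimiter))

# Parse a tab-separated string with the 'excel-tab' dialect.
tsv_text = "title\tauthor\tyear\nTitle1\tauthor1\t2020\n"
rows = list(csv.DictReader(io.StringIO(tsv_text), dialect='excel-tab'))
print(rows[0]['title'])
```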
Let's take a look at what this would look like with a tab-separated file:
docs = DocList[Book].from_csv(
    file_path='https://github.com/docarray/docarray/blob/main/tests/toydata/books.tsv?raw=true',
    dialect='excel-tab',
)
for doc in docs:
    doc.summary()
Output
📄 Book : c1ac9d4 ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ Title1        │
│ author: str          │ author1       │
│ year: int            │ 2020          │
╰──────────────────────┴───────────────╯
📄 Book : c1ac9d4 ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ Title1        │
│ author: str          │ author1       │
│ year: int            │ 2020          │
╰──────────────────────┴───────────────╯
Great! All the data is correctly read and stored in Book instances.
Other separators
If your values are separated by yet another separator, you can create your own class that inherits from csv.Dialect.
Within this class, you can define your dialect's behavior by setting the provided formatting parameters.
For instance, let's assume you have a semicolon-separated table:
Now, let's define our SemicolonSeparator class. Besides the delimiter parameter, we have to set some more formatting parameters such as doublequote and lineterminator.
import csv


class SemicolonSeparator(csv.Dialect):
    delimiter = ';'
    doublequote = True
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
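As a quick check that such a dialect behaves as intended, the sketch below parses and rewrites a semicolon-separated string with the standard library alone. The sample text is an assumption reconstructed from the output shown further down, not the actual contents of the file.

```python
import csv
import io


class SemicolonSeparator(csv.Dialect):
    delimiter = ';'
    doublequote = True
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL


# Assumed sample data; the real file is not reproduced here.
text = "title;author;year\r\nTitle1;author1;2020\r\nTitle2;author2;1234\r\n"

# newline='' keeps the '\r\n' line endings untouched for the csv module.
reader = csv.DictReader(io.StringIO(text, newline=''), dialect=SemicolonSeparator())
rows = list(reader)

# Writing with the same dialect reproduces the semicolon-separated layout.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(
    buffer, fieldnames=reader.fieldnames, dialect=SemicolonSeparator()
)
writer.writeheader()
writer.writerows(rows)
```

The lineterminator setting only affects writing; when reading, the csv module recognizes line endings on its own.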
Finally, you can load your data by setting the dialect parameter in .from_csv() to an instance of your SemicolonSeparator.
docs = DocList[Book].from_csv(
    file_path='https://github.com/docarray/docarray/blob/main/tests/toydata/books_semicolon_sep.csv?raw=true',
    dialect=SemicolonSeparator(),
)
for doc in docs:
    doc.summary()
Output
📄 Book : 321e9fd ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ Title1        │
│ author: str          │ author1       │
│ year: int            │ 2020          │
╰──────────────────────┴───────────────╯
📄 Book : 16d2097 ...
╭──────────────────────┬───────────────╮
│ Attribute            │ Value         │
├──────────────────────┼───────────────┤
│ title: str           │ Title2        │
│ author: str          │ author2       │
│ year: int            │ 1234          │
╰──────────────────────┴───────────────╯