Store on S3
When you want to use your DocList somewhere else, you can use the
.push() method to push the DocList to S3 and later use the
.pull() method to pull its content back.
Note
To store on S3, you need to install the extra dependency with the following line
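pip install "docarray[aws]"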
Push & pull
To store a DocList on S3, pass an S3 path starting with 's3://' to the function.
In the following demo, we use MinIO as a local S3 service. You can use the following docker-compose file to start the service in a Docker container.
version: "3"
services:
  minio:
    container_name: minio
    image: "minio/minio:RELEASE.2023-03-13T19-46-17Z"
    ports:
      - "9005:9000"
    command: server /data
Save the snippet above as docker-compose.yml and run the following line in the same folder as the file.
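docker-compose up

With MinIO running, the following example creates a bucket, pushes a DocList, and pulls it back: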
from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


if __name__ == '__main__':
    import boto3
    from botocore.client import Config

    BUCKET = 'tmp_bucket'

    # create a boto3 resource that points at the local MinIO service
    my_session = boto3.session.Session()
    s3 = my_session.resource(
        service_name='s3',
        region_name="us-east-1",
        use_ssl=False,
        endpoint_url="http://localhost:9005",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
        config=Config(signature_version="s3v4"),
    )
    # make a bucket
    s3.create_bucket(Bucket=BUCKET)

    docs = DocList[SimpleDoc]()
    docs.extend([SimpleDoc(text=f'doc {i}') for i in range(8)])

    # .push() and .pull() use the default boto3 client,
    # so point its defaults at the MinIO endpoint as well
    boto3.Session.client.__defaults__ = (
        "us-east-1",
        None,
        False,
        None,
        "http://localhost:9005",
        "minioadmin",
        "minioadmin",
        None,
        Config(signature_version="s3v4"),
    )

    docs.push(f's3://{BUCKET}/simple_docs')
    docs_pull = DocList[SimpleDoc].pull(f's3://{BUCKET}/simple_docs')
Under the bucket tmp_bucket, a file named simple_docs.docs is created to store the DocList.
Note
When using .push() and .pull(), DocList uses the default boto3 client. Make sure your default session is set up correctly.
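For instance, you can check that the object was created by listing the bucket's contents; this is a small sketch that assumes the s3 resource and BUCKET variable from the example above.

    # list the objects in the bucket; the key 'simple_docs.docs' should appear
    for obj in s3.Bucket(BUCKET).objects.all():
        print(obj.key)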
Push & pull with streaming
When you have a large number of documents to push and pull, you can use the streaming functions.
.push_stream() and
.pull_stream() help you stream the
DocList in order to reduce memory usage. You can also set multiple DocLists to pull from the same source. The usage is the same as streaming with local files; please refer to Push & Pull with streaming with local files.
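As a rough sketch, assuming the same MinIO setup, BUCKET variable, and SimpleDoc class as above, and the same streaming signatures as in the local-file example, streaming push and pull could look like this (the destination name streamed_docs is just an example):

    # push documents lazily from a generator instead of materializing them in memory
    docs_to_push = (SimpleDoc(text=f'doc {i}') for i in range(100_000))
    DocList[SimpleDoc].push_stream(docs_to_push, f's3://{BUCKET}/streamed_docs')

    # pull_stream yields documents one by one as they are downloaded
    for doc in DocList[SimpleDoc].pull_stream(f's3://{BUCKET}/streamed_docs'):
        print(doc.text)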
Delete
To delete the store, use the static .delete() method of the S3DocStore class.
from docarray import BaseDoc, DocList


class SimpleDoc(BaseDoc):
    text: str


if __name__ == '__main__':
    import boto3
    from botocore.client import Config

    BUCKET = 'tmp_bucket'

    # create a boto3 resource that points at the local MinIO service
    my_session = boto3.session.Session()
    s3 = my_session.resource(
        service_name='s3',
        region_name="us-east-1",
        use_ssl=False,
        endpoint_url="http://localhost:9005",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
        config=Config(signature_version="s3v4"),
    )
    # make a bucket
    s3.create_bucket(Bucket=BUCKET)

    docs = DocList[SimpleDoc]()
    docs.extend([SimpleDoc(text=f'doc {i}') for i in range(8)])

    # .push() and .pull() use the default boto3 client,
    # so point its defaults at the MinIO endpoint as well
    boto3.Session.client.__defaults__ = (
        "us-east-1",
        None,
        False,
        None,
        "http://localhost:9005",
        "minioadmin",
        "minioadmin",
        None,
        Config(signature_version="s3v4"),
    )

    docs.push(f's3://{BUCKET}/simple_docs')

    # delete the stored DocList from the bucket
    from docarray.store import S3DocStore

    success = S3DocStore.delete(f'{BUCKET}/simple_docs')