Datasets

Creating Datasets

deeplake.dataset Returns a Dataset object referencing either a new or existing dataset.
deeplake.empty Creates an empty dataset.
deeplake.like Creates a new dataset by copying the source dataset’s structure to a new location.
deeplake.ingest_classification Ingest a dataset of images from a local folder to a Deep Lake Dataset.
deeplake.ingest_coco Ingest images and annotations in COCO format to a Deep Lake Dataset.
deeplake.ingest_yolo Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset.
deeplake.ingest_kaggle Downloads a Kaggle dataset and stores it as a structured dataset at the destination.
deeplake.ingest_dataframe Convert pandas dataframe to a Deep Lake Dataset.
deeplake.ingest_huggingface Converts Hugging Face datasets to Deep Lake format.
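
As a rough sketch of how the creation functions are typically used, the helper below builds a small in-memory dataset with deeplake.empty and fills it one row at a time. It assumes the Deep Lake 3.x Python API. The helper name build_toy_dataset, the "mem://toy" path, and the tensor names are illustrative, and create_tensor is a Dataset method that is not listed in this section.

```python
def build_toy_dataset(path="mem://toy", num_rows=3):
    """Create an empty dataset at `path` and append `num_rows` rows.

    Assumes the Deep Lake 3.x API. The path and tensor names are
    illustrative choices, not required values.
    """
    # Imported lazily so the helper can be defined without the packages installed.
    import deeplake
    import numpy as np

    ds = deeplake.empty(path, overwrite=True)
    with ds:  # group the writes; the dataset is flushed when the block exits
        ds.create_tensor("values")
        ds.create_tensor("labels", htype="class_label")
        for i in range(num_rows):
            # Dataset.append takes one sample per tensor, keyed by tensor name.
            ds.append({
                "values": np.full((2, 2), i, dtype="float32"),
                "labels": np.uint32(i % 2),
            })
    return ds
```

Calling build_toy_dataset() should return a dataset with two tensors of equal length. Swapping the path for a local folder or a hub:// path would persist the data instead of keeping it in memory.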

Loading Datasets

deeplake.load Loads an existing dataset.

Deleting and Renaming Datasets

deeplake.delete Deletes a dataset at a given path.
deeplake.rename Renames dataset at old_path to new_path.

Copying Datasets

deeplake.copy Copies dataset at src to dest.
deeplake.deepcopy Copies dataset at src to dest including version control history.

Dataset Operations

Dataset.summary Prints a summary of the dataset.
Dataset.append Appends samples to multiple tensors at once.
Dataset.extend Appends multiple rows of samples to multiple tensors at once.
Dataset.query Returns a sliced Dataset with given query results.
Dataset.copy Copies this dataset or dataset view to dest.
Dataset.delete Deletes the entire dataset from the cache layers (if any) and the underlying storage.
Dataset.rename Renames the dataset to path.
Dataset.connect Connect a Deep Lake cloud dataset through a deeplake path.
Dataset.visualize Visualizes the dataset in the Jupyter notebook.
Dataset.pop Removes a sample from all the tensors of the dataset.
Dataset.rechunk Rewrites the underlying chunks to make their sizes optimal.
Dataset.flush Necessary operation after writes if caches are being used.
Dataset.clear_cache Flushes (see Dataset.flush()) the contents of the cache layers (if any) and then deletes the contents of all of those layers.
Dataset.size_approx Estimates the size in bytes of the dataset.
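
The difference between Dataset.extend and Dataset.pop can be sketched with a small helper. This is an illustration under assumptions rather than library code: the name add_and_trim is hypothetical, ds is expected to be a writable Deep Lake 3.x dataset whose tensors already exist and have equal length, and rows maps tensor names to equal-length lists of samples.

```python
def add_and_trim(ds, rows):
    """Extend `ds` with several rows at once, then remove the last row.

    `rows` maps tensor names to equal-length lists of samples. The
    tensors are assumed to exist in `ds` already.
    """
    lengths = {len(samples) for samples in rows.values()}
    if len(lengths) != 1:
        raise ValueError("every tensor needs the same number of new samples")

    with ds:
        ds.extend(rows)        # many rows, written to all listed tensors at once
        ds.pop(len(ds) - 1)    # remove the final sample from every tensor
    ds.summary()               # print tensor names, htypes, and shapes
    return len(ds)
```

The length check is done up front because extending tensors by different amounts would leave the dataset with rows of unequal length.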

Dataset Visualization

Dataset.visualize Visualizes the dataset in the Jupyter notebook.

Dataset Credentials

Dataset.add_creds_key Adds a new creds key to the dataset.
Dataset.populate_creds Populates the creds key added in add_creds_key with the given creds.
Dataset.update_creds_key Updates the name and/or management status of a creds key.
Dataset.change_creds_management
Dataset.get_creds_keys Returns the list of creds keys added to the dataset.

Dataset Properties

Dataset.tensors All tensors belonging to this group, including those within sub groups.
Dataset.groups All sub groups in this group.
Dataset.num_samples Returns the length of the smallest tensor.
Dataset.read_only Returns True if dataset is in read-only mode and False otherwise.
Dataset.info Returns the information about the dataset.
Dataset.max_len Returns the length of the longest tensor.
Dataset.min_len Returns the length of the shortest tensor.

Dataset Version Control

Dataset.commit Stores a snapshot of the current state of the dataset.
Dataset.diff Returns/displays the differences between commits/branches.
Dataset.checkout Checks out to a specific commit_id or branch.
Dataset.merge Merges the target_id into the current dataset.
Dataset.log Displays the details of all the past commits.
Dataset.reset Resets the uncommitted changes present in the branch.
Dataset.get_commit_details Get details of a particular commit.
Dataset.commit_id The last committed commit id of the dataset.
Dataset.branch The current branch of the dataset.
Dataset.pending_commit_id The commit_id of the next commit that will be made to the dataset.
Dataset.has_head_changes Returns True if currently at head node and uncommitted changes are present.
Dataset.commits Lists all the commits leading to the current dataset state.
Dataset.branches Lists all the branches of the dataset.
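
A typical branch, commit, and merge cycle built from the methods above might look like the following sketch. It assumes the Deep Lake 3.x API, where checkout accepts create=True to make a new branch and commit returns the new commit id. The helper name edit_on_branch and its edit callback are hypothetical.

```python
def edit_on_branch(ds, branch, edit, message):
    """Apply `edit(ds)` on a new branch, commit it, and merge it back.

    `edit` is any callable that modifies the dataset in place.
    Returns the id of the commit made on the new branch.
    """
    start = ds.branch                  # remember the branch we started on
    ds.checkout(branch, create=True)   # create the new branch and switch to it
    edit(ds)                           # make changes on the new branch
    commit_id = ds.commit(message)     # snapshot those changes
    ds.checkout(start)                 # return to the starting branch
    ds.merge(branch)                   # merge the new branch into it
    return commit_id
```

Keeping the edit on its own branch means the starting branch is untouched until the merge, so a failed edit can be abandoned without a reset.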

Dataset Views

A dataset view is a subset of a dataset that points to specific samples (indices) in an existing dataset. Dataset views can be created by indexing a dataset, filtering a dataset with Dataset.filter(), querying a dataset with Dataset.query(), or sampling a dataset with Dataset.sample_by(). Filtering uses user-defined functions or simplified expressions, whereas querying runs SQL-like queries written in the Tensor Query Language (TQL). See the TQL specification for the full syntax.

Dataset views can only be saved when a dataset has been committed and has no changes on the HEAD node, in order to preserve data lineage and prevent the underlying data from changing after the query or filter conditions have been evaluated.
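
The second half of that rule can be checked before saving, as in the sketch below. The helper name save_view_safely is hypothetical. It relies only on Dataset.has_head_changes and Dataset.save_view from this section, and it does not check whether the dataset has ever been committed.

```python
def save_view_safely(ds, view, view_id):
    """Save `view` under `view_id` only if `ds` has no uncommitted HEAD changes."""
    if ds.has_head_changes:
        # Saving now could let the underlying data drift from the view's conditions.
        raise RuntimeError("commit or reset the dataset before saving a view")
    return view.save_view(id=view_id)
```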

Example

>>> import deeplake
>>> # load dataset
>>> ds = deeplake.load("hub://activeloop/mnist-train")
>>> # filter dataset
>>> zeros = ds.filter("labels == 0")
>>> # save view
>>> zeros.save_view(id="zeros")
>>> # load_view
>>> zeros = ds.load_view(id="zeros")
>>> len(zeros)
5923

Dataset.query Returns a sliced Dataset with given query results.
Dataset.sample_by Returns a sliced Dataset with given weighted sampler applied.
Dataset.filter Filters the dataset in accordance with the filter function f(x: sample) -> bool.
Dataset.save_view Saves a dataset view as a virtual dataset (VDS).
Dataset.get_view Returns the dataset view corresponding to id.
Dataset.load_view Loads the view and returns the Dataset by id.
Dataset.delete_view Deletes the view with given view id.
Dataset.get_views Returns list of views stored in this Dataset.
Dataset.is_view Returns True if this dataset is a view and False otherwise.
Dataset.min_view Returns a view of the dataset in which all tensors are sliced to have the same length as the shortest tensor.
Dataset.max_view Returns a view of the dataset in which shorter tensors are padded with None values to have the same length as the longest tensor.
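
To make the difference between min_view and max_view concrete, here is a plain-Python model of the behavior described above. It does not call Deep Lake: tensors are stood in for by ordinary lists in a dict, and the function names simply mirror the properties they illustrate.

```python
def min_view(columns):
    """Truncate every column to the length of the shortest one."""
    n = min(len(samples) for samples in columns.values())
    return {name: list(samples[:n]) for name, samples in columns.items()}


def max_view(columns):
    """Pad every column with None up to the length of the longest one."""
    n = max(len(samples) for samples in columns.values())
    return {
        name: list(samples) + [None] * (n - len(samples))
        for name, samples in columns.items()
    }


# Two "tensors" of unequal length, as can happen after a partial append.
columns = {"images": ["a", "b", "c"], "labels": [0, 1]}

print(min_view(columns))  # {'images': ['a', 'b'], 'labels': [0, 1]}
print(max_view(columns))  # {'images': ['a', 'b', 'c'], 'labels': [0, 1, None]}
```

In this model the shortest and longest column lengths play the roles of Dataset.min_len and Dataset.max_len.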