Dataloader
Train your models using the new high-performance C++ dataloader.
See the dataloader method for how to create dataloaders from your datasets:
Dataset.dataloader | Returns a DeepLakeDataLoader object.
DeepLakeDataLoader
class deeplake.enterprise.DeepLakeDataLoader
batch(batch_size: int, drop_last: bool = False)
Returns a batched DeepLakeDataLoader object.
Parameters:
- batch_size (int) – Number of samples in each batch.
- drop_last (bool) – If True, the last batch is dropped when it contains fewer than batch_size samples. Defaults to False.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Raises: ValueError – If .batch() has already been called.
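To make the effect of drop_last concrete, the sketch below computes how many batches a dataset of a given size would yield. It is plain Python for illustration only: batches_per_epoch is a hypothetical helper and is not part of the Deep Lake API.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Return the number of batches for num_samples samples (illustrative helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


kept = batches_per_epoch(100, 32)                     # 4 batches: 32, 32, 32, 4
dropped = batches_per_epoch(100, 32, drop_last=True)  # 3 batches: 32, 32, 32
```

Dropping the last batch is mainly useful when downstream code assumes a fixed batch dimension.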
close()
Shuts down the workers and releases the resources.
numpy(num_workers: int = 0, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, decode_method: Optional[Dict[str, str]] = None, persistent_workers: bool = False)
Returns a DeepLakeDataLoader object.
Parameters:
- num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.
- tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.
- num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is determined automatically. Defaults to None.
- prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.
- persistent_workers (bool) – If True, the data loader does not shut down the worker processes after the dataset has been consumed once. Defaults to False.
- decode_method (Dict[str, str], Optional) – A dictionary of decode methods for each tensor. Defaults to None. Supported decode methods are:
  - 'numpy': Default behaviour. Returns samples as numpy arrays.
  - 'tobytes': Returns raw bytes of the samples.
  - 'pil': Returns samples as PIL images. Especially useful when the transformation uses torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Raises: ValueError – If .pytorch(), .tensorflow(), or .numpy() has already been called.
pytorch(num_workers: int = 0, collate_fn: Optional[Callable] = None, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, distributed: bool = False, return_index: bool = True, decode_method: Optional[Dict[str, str]] = None, persistent_workers: bool = False)
Returns a DeepLakeDataLoader object.
Parameters:
- num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.
- collate_fn (Callable, Optional) – Merges a list of samples to form a mini-batch of Tensor(s).
- tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.
- num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is determined automatically. Defaults to None.
- prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.
- distributed (bool) – Used for DDP training. Distributes different sections of the dataset to different ranks. Defaults to False.
- return_index (bool) – Whether the loader returns the sample index. Defaults to True.
- persistent_workers (bool) – If True, the data loader does not shut down the worker processes after the dataset has been consumed once. Defaults to False.
- decode_method (Dict[str, str], Optional) – A dictionary of decode methods for each tensor. Defaults to None. Supported decode methods are:
  - 'numpy': Default behaviour. Returns samples as numpy arrays.
  - 'tobytes': Returns raw bytes of the samples.
  - 'pil': Returns samples as PIL images. Especially useful when the transformation uses torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Raises: ValueError – If .pytorch(), .tensorflow(), or .numpy() has already been called.
Examples
>>> import deeplake
>>> from torchvision import transforms
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> tform = transforms.Compose([
...     transforms.RandomRotation(20), # Image augmentation
...     transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
...     transforms.Normalize([0.5], [0.5]),
... ])
...
>>> batch_size = 32
>>> # create dataloader by chaining with transform function and batch size and returns batch of pytorch tensors
>>> train_loader = ds_train.dataloader()\
...     .transform({'images': tform, 'labels': None})\
...     .batch(batch_size)\
...     .shuffle()\
...     .pytorch(decode_method={'images': 'pil'}) # return samples as PIL images for transforms
...
>>> # iterate over dataloader
>>> for i, sample in enumerate(train_loader):
...     pass
...
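The collate_fn parameter is described as merging a list of samples into a mini-batch. A minimal sketch of such a function follows, assuming each sample arrives as a dict keyed by tensor name; that sample layout and the helper name collate_to_lists are assumptions for illustration, not confirmed Deep Lake behaviour.

```python
from typing import Any, Dict, List


def collate_to_lists(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Merge per-sample dicts into one dict of lists, keyed by tensor name."""
    if not samples:
        return {}
    # Use the keys of the first sample; every sample is assumed to share them.
    batch: Dict[str, List[Any]] = {key: [] for key in samples[0]}
    for sample in samples:
        for key in batch:
            batch[key].append(sample[key])
    return batch


# Two toy samples standing in for what a loader might hand to collate_fn.
samples = [
    {"images": [0, 1], "labels": 3},
    {"images": [2, 3], "labels": 7},
]
batch = collate_to_lists(samples)
```

A function of this shape is what would be passed as collate_fn; a real one would typically stack the per-sample values into framework tensors instead of keeping Python lists.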
query(query_string: str)
Returns a sliced DeepLakeDataLoader object with the given query results. It lets you run SQL-like queries on a dataset and extract the results. See the Tensor Query Language documentation for the supported keywords.
Parameters: query_string (str) – An SQL string, extended with additional functionality, to run on the dataset object.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Examples
>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> query_ds_train = ds_train.dataloader().query("select * where labels != 5")

>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> query_ds_train = ds_train.dataloader().query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")
sample_by(weights: Union[str, list, tuple, numpy.ndarray], replace: Optional[bool] = True, size: Optional[int] = None)
Returns a sliced DeepLakeDataLoader with the given weighted sampler applied.
Parameters:
- weights (Union[str, list, tuple, np.ndarray]) – If a string, it is run as a TQL expression to calculate the weights. A list, tuple, or ndarray is treated as the list of per-sample weights.
- replace (Optional[bool]) – If True, samples can be repeated in the result view. Defaults to True.
- size (Optional[int]) – The length of the result view. Defaults to len(dataset).
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Examples
Sample the dataloader with labels == 5 twice as often as labels == 6:
>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.dataloader().sample_by("max_weight(labels == 5: 10, labels == 6: 5)")
Sample the dataloader treating the labels tensor as weights:
>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.dataloader().sample_by("labels")
Sample the dataloader with the given weights:
>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> weights = list()
>>> for i in range(0, len(ds_train)):
...     weights.append(i % 5)
...
>>> sampled_ds = ds_train.dataloader().sample_by(weights, replace=False)
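To illustrate what weighted sampling with and without replacement means for the weights, replace, and size parameters, here is a self-contained sketch using only the Python standard library. It is not Deep Lake's implementation: weighted_indices and its seed argument are hypothetical, and the without-replacement branch uses the standard Efraimidis-Spirakis key method as a stand-in.

```python
import random
from typing import List, Optional, Sequence


def weighted_indices(
    weights: Sequence[float],
    replace: bool = True,
    size: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[int]:
    """Draw sample indices with probability proportional to their weights."""
    rng = random.Random(seed)
    n = len(weights)
    size = n if size is None else size  # mirrors the documented default of len(dataset)
    if replace:
        # Indices may repeat; zero-weight indices are never drawn.
        return rng.choices(range(n), weights=weights, k=size)
    # Without replacement: give each positive-weight index the key u ** (1 / w)
    # and keep the indices with the largest keys (Efraimidis-Spirakis).
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0]
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:size]]


weights = [i % 5 for i in range(20)]  # same weighting pattern as the example above
with_repeats = weighted_indices(weights, seed=0)
unique_only = weighted_indices(weights, replace=False, size=10, seed=0)
```

In this sketch, requesting more indices than there are positive-weight samples with replace=False simply returns fewer indices; how the library handles that case is not stated in this reference.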
shuffle(shuffle: bool = True, buffer_size: int = 2048)
Returns a shuffled DeepLakeDataLoader object.
Parameters:
- shuffle (bool) – Whether to shuffle the elements. Defaults to True.
- buffer_size (int) – The size of the buffer used to shuffle the data, in MB. Defaults to 2048 MB. Increasing buffer_size increases the extent of shuffling.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Raises:
- ValueError – If .shuffle() has already been called.
- ValueError – If the dataset is a view and shuffle is True.
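The link between buffer size and the extent of shuffling can be illustrated with a generic shuffle buffer. The sketch below is an assumption about the general technique, not a description of the C++ loader's internals; buffered_shuffle is a hypothetical helper, and its buffer is counted in items rather than MB.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_items: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    if buffer_items < 1:
        raise ValueError("buffer_items must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_items:
            # Move a randomly chosen element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_items=16, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_items=1))
```

With a buffer of one item the order is unchanged, and with a buffer larger than the dataset the result is a full shuffle, which matches the statement that a larger buffer increases the extent of shuffling.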
tensorflow(num_workers: int = 0, collate_fn: Optional[Callable] = None, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, return_index: bool = True, decode_method: Optional[Dict[str, str]] = None, persistent_workers: bool = False)
Returns a DeepLakeDataLoader object.
Parameters:
- num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.
- collate_fn (Callable, Optional) – Merges a list of samples to form a mini-batch of Tensor(s).
- tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.
- num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is determined automatically. Defaults to None.
- prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.
- return_index (bool) – Whether the loader returns the sample index. Defaults to True.
- persistent_workers (bool) – If True, the data loader does not shut down the worker processes after the dataset has been consumed once. Defaults to False.
- decode_method (Dict[str, str], Optional) – A dictionary of decode methods for each tensor. Defaults to None. Supported decode methods are:
  - 'numpy': Default behaviour. Returns samples as numpy arrays.
  - 'tobytes': Returns raw bytes of the samples.
  - 'pil': Returns samples as PIL images. Especially useful when the transformation uses torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Raises: ValueError – If .pytorch(), .tensorflow(), or .numpy() has already been called.
Examples
>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> batch_size = 32
>>> # create dataloader by chaining batch size and shuffle, then request tensorflow output
>>> train_loader = ds_train.dataloader()\
...     .batch(batch_size)\
...     .shuffle()\
...     .tensorflow()
...
>>> # iterate over dataloader
>>> for i, sample in enumerate(train_loader):
...     pass
...
transform(transform: Union[Callable, Dict[str, Optional[Callable]]], **kwargs)
Returns a transformed DeepLakeDataLoader object.
Parameters:
- transform (Callable or Dict[str, Optional[Callable]]) – A function or dictionary of functions to apply to the data.
- kwargs – Additional arguments to be passed to transform. Only applicable if transform is a callable. Ignored if transform is a dictionary.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader
Raises: ValueError – If .transform() has already been called.
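The two accepted forms of transform can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: apply_transform is a hypothetical helper, samples are assumed to be dicts keyed by tensor name, and a None entry in the dictionary is assumed to mean "leave this tensor unchanged".

```python
from typing import Any, Callable, Dict, Optional, Union

Transform = Union[Callable[..., Any], Dict[str, Optional[Callable[[Any], Any]]]]


def apply_transform(sample: Dict[str, Any], transform: Transform, **kwargs: Any) -> Any:
    """Apply either a single callable or a per-tensor dict of callables."""
    if callable(transform):
        # Callable form: the whole sample is passed in, along with any kwargs.
        return transform(sample, **kwargs)
    # Dict form: kwargs are ignored; each listed tensor gets its own function.
    # Assumption: None means the tensor passes through untouched.
    return {
        name: sample[name] if fn is None else fn(sample[name])
        for name, fn in transform.items()
    }


sample = {"images": 2, "labels": 5}
per_tensor = apply_transform(sample, {"images": lambda x: x * 10, "labels": None})
whole_sample = apply_transform(sample, lambda s, scale: s["images"] * scale, scale=3)
```

The dictionary form keeps per-tensor logic separate, which is how the pytorch example earlier on this page pairs a torchvision pipeline with 'images' and None with 'labels'.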
-