# Datasets

`formed.integrations.datasets.workflow`

Workflow steps for Hugging Face Datasets integration.

This module provides workflow steps for loading, processing, and manipulating datasets using the Hugging Face Datasets library.

## Available Steps

- `datasets::load`: Load a dataset from disk or the Hugging Face Hub.
- `datasets::compose`: Compose multiple `Dataset` objects into a `DatasetDict`.
- `datasets::concatenate`: Concatenate multiple datasets into a single dataset.
- `datasets::train_test_split`: Split a dataset into train and test sets.
## DatasetFormat

Bases: `Generic[DatasetOrMappingT]`, `Format[DatasetOrMappingT]`

### identifier (property)

Get the unique identifier for this format.

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Format identifier string. |
### write

`write(artifact, directory)`

Source code in `src/formed/integrations/datasets/workflow.py`
### read

`read(directory)`

Source code in `src/formed/integrations/datasets/workflow.py`
### is_default_of (classmethod)

`is_default_of(obj)`

Check if this format is the default for the given object type.

| PARAMETER | DESCRIPTION |
|---|---|
| `obj` | Object to check. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if this format should be used by default for this type. |

Source code in `src/formed/workflow/format.py`
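As an illustration only (the real logic lives in `src/formed/workflow/format.py` and is not reproduced here), a default-format check of this shape typically tests the object's type against the types the format can serialize. The stand-in types below are assumptions, not the actual classes checked:

```python
class DatasetFormat:
    # Stand-in types used for illustration; the real format presumably
    # checks against the Hugging Face Dataset/DatasetDict classes
    # (an assumption, not confirmed by this page).
    _default_types = (dict, list)

    @classmethod
    def is_default_of(cls, obj) -> bool:
        # True when this format should handle obj by default.
        return isinstance(obj, cls._default_types)
```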
## load_dataset

`load_dataset(path, **kwargs)`

Load a dataset from disk or the Hugging Face Hub.

This step loads a dataset from a local path or downloads it from the Hugging Face Hub. The dataset can be either a `Dataset` or a `DatasetDict`.

| PARAMETER | DESCRIPTION |
|---|---|
| `path` | Path to the dataset (local or remote). |
| `**kwargs` | Additional keyword arguments forwarded to the underlying loading function. |

| RETURNS | DESCRIPTION |
|---|---|
| `Dataset` | Loaded `Dataset` or `DatasetDict`. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If the loaded object is not a `Dataset` or `DatasetDict`. |

Source code in `src/formed/integrations/datasets/workflow.py`
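The documented `ValueError` contract can be sketched as follows. `Dataset` and `DatasetDict` here are minimal stand-ins for the Hugging Face classes, and the helper name `validate_loaded` is hypothetical:

```python
class Dataset:
    """Stand-in for datasets.Dataset (illustration only)."""


class DatasetDict(dict):
    """Stand-in for datasets.DatasetDict (illustration only)."""


def validate_loaded(obj):
    # Mirror the documented contract: only a Dataset or DatasetDict
    # may come back from the loader; anything else is an error.
    if not isinstance(obj, (Dataset, DatasetDict)):
        raise ValueError(
            f"Expected Dataset or DatasetDict, got {type(obj).__name__}"
        )
    return obj
```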
## compose_datasetdict

`compose_datasetdict(**kwargs)`

Compose multiple `Dataset` objects into a single `DatasetDict`.

This step combines individual `Dataset` objects into a `DatasetDict`, filtering out any non-`Dataset` values.

| PARAMETER | DESCRIPTION |
|---|---|
| `**kwargs` | Named datasets to compose. Only `Dataset` instances are included. |

| RETURNS | DESCRIPTION |
|---|---|
| `DatasetDict` | `DatasetDict` containing all provided `Dataset` instances. |

Source code in `src/formed/integrations/datasets/workflow.py`
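A minimal sketch of the documented filtering behavior, using a stand-in `Dataset` class and a plain dict in place of a real `DatasetDict` (both assumptions for illustration):

```python
class Dataset:
    """Stand-in for datasets.Dataset (illustration only)."""


def compose_datasetdict(**kwargs):
    # Keep only the values that are Dataset instances, preserving
    # their keyword names as split names; everything else is dropped.
    return {name: ds for name, ds in kwargs.items() if isinstance(ds, Dataset)}
```

For example, `compose_datasetdict(train=Dataset(), note="not a dataset")` keeps only the `train` entry.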
## concatenate_datasets

`concatenate_datasets(dsets, **kwargs)`

Concatenate multiple datasets into a single dataset.

| PARAMETER | DESCRIPTION |
|---|---|
| `dsets` | List of datasets to concatenate. |
| `**kwargs` | Additional keyword arguments forwarded to the underlying concatenation function. |

| RETURNS | DESCRIPTION |
|---|---|
| `Dataset` | Concatenated dataset. |

Source code in `src/formed/integrations/datasets/workflow.py`
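The concatenation behavior can be sketched with plain row lists standing in for datasets; the real step presumably delegates to the Hugging Face library's own concatenation (an assumption based on the step name):

```python
def concatenate_rows(dsets):
    # Append the rows of each dataset, in order, into one flat list.
    out = []
    for d in dsets:
        out.extend(d)
    return out
```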
## train_test_split

`train_test_split(dataset, train_key="train", test_key="test", **kwargs)`

Split a dataset into train and test sets.

This step splits a `Dataset` or `DatasetDict` into training and test sets. For `DatasetDict` inputs, each split is performed independently.

| PARAMETER | DESCRIPTION |
|---|---|
| `dataset` | `Dataset` or `DatasetDict` to split. |
| `train_key` | Key name for the training split. Defaults to `"train"`. |
| `test_key` | Key name for the test split. Defaults to `"test"`. |
| `**kwargs` | Additional keyword arguments forwarded to the underlying split function. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Dataset]` | Dictionary with train and test splits. |

Source code in `src/formed/integrations/datasets/workflow.py`
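A hedged sketch of the split for a single dataset, using plain Python lists in place of a `Dataset`; the `test_size` and `seed` parameters and the shuffle-then-slice strategy are assumptions for illustration, not this step's documented behavior:

```python
import random


def train_test_split_rows(rows, test_size=0.25,
                          train_key="train", test_key="test", seed=0):
    # Shuffle a copy deterministically, then slice off the test portion.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return {test_key: shuffled[:n_test], train_key: shuffled[n_test:]}
```

The `train_key`/`test_key` parameters mirror the step signature above: they name the keys of the returned dictionary.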