
Datasets

formed.integrations.datasets.workflow

Workflow steps for Hugging Face Datasets integration.

This module provides workflow steps for loading, processing, and manipulating datasets using the Hugging Face Datasets library.

Available Steps
  • datasets::load: Load a dataset from disk or the Hugging Face Hub.
  • datasets::compose: Compose multiple Dataset objects into a DatasetDict.
  • datasets::concatenate: Concatenate multiple datasets into a single dataset.
  • datasets::train_test_split: Split a dataset into train and test sets.

DatasetFormat

Bases: Generic[DatasetOrMappingT], Format[DatasetOrMappingT]

Workflow format for serializing Dataset and DatasetDict artifacts to and from a directory on disk.

identifier property

identifier

Get the unique identifier for this format.

Returns:
    str: Format identifier string.

write

write(artifact, directory)

Save the artifact to disk: a mapping of datasets is written to per-key data.<key> subdirectories, while a single dataset is written to a data directory.
Source code in src/formed/integrations/datasets/workflow.py
def write(self, artifact: DatasetOrMappingT, directory: Path) -> None:
    if isinstance(artifact, Mapping):
        for key, dataset in artifact.items():
            dataset.save_to_disk(str(directory / f"data.{key}"))
    else:
        artifact.save_to_disk(str(directory / "data"))

read

read(directory)

Load the artifact from disk, returning a single dataset from the data directory or, if it is absent, a mapping reconstructed from the data.<key> subdirectories.
Source code in src/formed/integrations/datasets/workflow.py
def read(self, directory: Path) -> DatasetOrMappingT:
    if (directory / "data").exists():
        return cast(DatasetOrMappingT, datasets.load_from_disk(str(directory / "data")))
    return cast(
        DatasetOrMappingT,
        {datadir.name[5:]: datasets.load_from_disk(str(datadir)) for datadir in directory.glob("data.*")},
    )
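
The `data` versus `data.<key>` directory layout that `write` and `read` agree on can be sketched with plain JSON files standing in for `save_to_disk`/`load_from_disk`. This is a minimal illustration only; `write_like` and `read_like` are hypothetical helpers, not part of the library:

```python
import json
import tempfile
from pathlib import Path

def write_like(artifact, directory: Path) -> None:
    # Mirrors DatasetFormat.write: mappings fan out to data.<key>
    # subdirectories; a single dataset goes to one data directory.
    if isinstance(artifact, dict):
        for key, records in artifact.items():
            subdir = directory / f"data.{key}"
            subdir.mkdir(parents=True)
            (subdir / "records.json").write_text(json.dumps(records))
    else:
        subdir = directory / "data"
        subdir.mkdir(parents=True)
        (subdir / "records.json").write_text(json.dumps(artifact))

def read_like(directory: Path):
    # Mirrors DatasetFormat.read: a plain data directory wins;
    # otherwise data.* directories are collected into a dict,
    # stripping the "data." prefix (name[5:]) to recover each key.
    if (directory / "data").exists():
        return json.loads((directory / "data" / "records.json").read_text())
    return {
        d.name[5:]: json.loads((d / "records.json").read_text())
        for d in directory.glob("data.*")
    }

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_like({"train": [1, 2], "test": [3]}, root)
    roundtrip = read_like(root)
```

A mapping written through this layout reads back with its original keys, which is exactly what lets `read` reconstruct a DatasetDict-shaped result without any extra manifest file.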

is_default_of classmethod

is_default_of(obj)

Check if this format is the default for the given object type.

Parameters:
    obj (Any): Object to check.

Returns:
    bool: True if this format should be used by default for this type.

Source code in src/formed/workflow/format.py
@classmethod
def is_default_of(cls, obj: Any) -> bool:
    """Check if this format is the default for the given object type.

    Args:
        obj: Object to check.

    Returns:
        True if this format should be used by default for this type.

    """
    return False

load_dataset

load_dataset(path, **kwargs)

Load a dataset from disk or the Hugging Face Hub.

This step loads a dataset from a local path or downloads it from the Hugging Face Hub. The dataset can be either a Dataset or DatasetDict.

Parameters:
    path (str | PathLike): Path to the dataset (local or remote).
    **kwargs (Any): Additional arguments to pass to datasets.load_dataset.

Returns:
    Dataset: Loaded Dataset or DatasetDict.

Raises:
    ValueError: If the loaded object is not a Dataset or DatasetDict.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::load", cacheable=False, format=DatasetFormat())
def load_dataset(
    path: str | PathLike,
    **kwargs: Any,
) -> Dataset:
    """Load a dataset from disk or the Hugging Face Hub.

    This step loads a dataset from a local path or downloads it from the
    Hugging Face Hub. The dataset can be either a Dataset or DatasetDict.

    Args:
        path: Path to the dataset (local or remote).
        **kwargs: Additional arguments to pass to `datasets.load_dataset`.

    Returns:
        Loaded Dataset or DatasetDict.

    Raises:
        ValueError: If the loaded object is not a Dataset or DatasetDict.
    """
    with suppress(FileNotFoundError):
        path = minato.cached_path(path)
    if Path(path).exists():
        dataset = datasets.load_from_disk(str(path))
    else:
        dataset = cast(Dataset, datasets.load_dataset(str(path), **kwargs))
    if not isinstance(dataset, (datasets.Dataset, datasets.DatasetDict)):
        raise ValueError("Only Dataset or DatasetDict is supported")
    return dataset
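
The local-versus-Hub dispatch in the body above reduces to a path-existence check: an existing local path is read with `datasets.load_from_disk`, anything else is forwarded to `datasets.load_dataset` as a Hub dataset name. A hedged sketch of just that branching (`resolve_source` is an illustrative helper, not part of the module):

```python
from pathlib import Path

def resolve_source(path: str) -> str:
    # Mirrors load_dataset's branching after the optional
    # minato.cached_path resolution: existing paths come from disk,
    # everything else is treated as a Hugging Face Hub identifier.
    return "disk" if Path(path).exists() else "hub"
```

So `resolve_source(".")` reports "disk", while a Hub-style name such as "user/some-dataset" (assuming no such local directory exists) reports "hub".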

compose_datasetdict

compose_datasetdict(**kwargs)

Compose multiple Dataset objects into a single DatasetDict.

This step combines individual Dataset objects into a DatasetDict, filtering out any non-Dataset values.

Parameters:
    **kwargs (Dataset): Named datasets to compose. Only Dataset instances are included.

Returns:
    DatasetDict: DatasetDict containing all provided Dataset instances.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::compose", format=DatasetFormat())
def compose_datasetdict(**kwargs: Dataset) -> datasets.DatasetDict:
    """Compose multiple Dataset objects into a single DatasetDict.

    This step combines individual Dataset objects into a DatasetDict,
    filtering out any non-Dataset values.

    Args:
        **kwargs: Named datasets to compose. Only Dataset instances are included.

    Returns:
        DatasetDict containing all provided Dataset instances.
    """
    datasets_: dict[str, datasets.Dataset] = {
        key: dataset for key, dataset in kwargs.items() if isinstance(dataset, datasets.Dataset)
    }
    if len(datasets_) != len(kwargs):
        logger = use_step_logger(__name__)
        logger.warning(
            "Following keys are ignored since they are not Dataset instances: %s",
            set(kwargs) - set(datasets_),
        )
    return datasets.DatasetDict(**datasets_)
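
The filtering behavior can be sketched without the datasets library at all: keep only values of the expected type and report the rest. `FakeDataset` and `compose_like` are stand-ins for this illustration, not library names:

```python
class FakeDataset:
    """Stand-in for datasets.Dataset, used only in this sketch."""

def compose_like(**kwargs):
    # Mirrors compose_datasetdict: keep only Dataset instances and
    # collect the keys that were silently passed something else.
    kept = {k: v for k, v in kwargs.items() if isinstance(v, FakeDataset)}
    dropped = set(kwargs) - set(kept)
    return kept, dropped

kept, dropped = compose_like(
    train=FakeDataset(), test=FakeDataset(), note="not a dataset"
)
```

In the real step the dropped keys are not an error; they are logged as a warning through the step logger and simply excluded from the resulting DatasetDict.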

concatenate_datasets

concatenate_datasets(dsets, **kwargs)

Concatenate multiple datasets into a single dataset.

Parameters:
    dsets (list[Dataset]): List of datasets to concatenate.
    **kwargs (Any): Additional arguments to pass to datasets.concatenate_datasets.

Returns:
    Dataset: Concatenated dataset.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::concatenate", format=DatasetFormat())
def concatenate_datasets(dsets: list[datasets.Dataset], **kwargs: Any) -> datasets.Dataset:
    """Concatenate multiple datasets into a single dataset.

    Args:
        dsets: List of datasets to concatenate.
        **kwargs: Additional arguments to pass to `datasets.concatenate_datasets`.

    Returns:
        Concatenated dataset.
    """
    return cast(datasets.Dataset, datasets.concatenate_datasets(dsets, **kwargs))
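
Row-wise, the step behaves much like chaining the underlying record lists in order: rows of the first dataset, then the second, and so on. A purely illustrative stdlib analogue:

```python
from itertools import chain

# Two record lists standing in for two datasets with the same schema.
splits = [[{"text": "a"}, {"text": "b"}], [{"text": "c"}]]

# datasets.concatenate_datasets stacks rows in input order, much like
# flattening the list of record lists.
combined = list(chain.from_iterable(splits))
```

Note that the real function also requires the datasets to share compatible features (schemas), which a plain list chain does not check.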

train_test_split

train_test_split(
    dataset, train_key="train", test_key="test", **kwargs
)

Split a dataset into train and test sets.

This step splits a Dataset or DatasetDict into training and test sets. For DatasetDict inputs, each split is performed independently.

Parameters:
    dataset (Dataset): Dataset or DatasetDict to split.
    train_key (str, default 'train'): Key name for the training split.
    test_key (str, default 'test'): Key name for the test split.
    **kwargs (Any): Additional arguments to pass to train_test_split.

Returns:
    dict[str, Dataset]: Dictionary with train and test splits.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::train_test_split", format=DatasetFormat())
def train_test_split(
    dataset: Dataset,
    train_key: str = "train",
    test_key: str = "test",
    **kwargs: Any,
) -> dict[str, Dataset]:
    """Split a dataset into train and test sets.

    This step splits a Dataset or DatasetDict into training and test sets.
    For DatasetDict inputs, each split is performed independently.

    Args:
        dataset: Dataset or DatasetDict to split.
        train_key: Key name for the training split.
        test_key: Key name for the test split.
        **kwargs: Additional arguments to pass to `train_test_split`.

    Returns:
        Dictionary with train and test splits.
    """
    if isinstance(dataset, datasets.Dataset):
        split = dataset.train_test_split(**kwargs)
        return {train_key: split["train"], test_key: split["test"]}
    else:
        train_datasets: dict[str, datasets.Dataset] = {}
        test_datasets: dict[str, datasets.Dataset] = {}
        for key, dset in dataset.items():
            split = dset.train_test_split(**kwargs)
            train_datasets[str(key)] = split["train"]
            test_datasets[str(key)] = split["test"]
        return {
            train_key: datasets.DatasetDict(**train_datasets),
            test_key: datasets.DatasetDict(**test_datasets),
        }
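
The DatasetDict branch regroups results: each member is split independently, then all the train halves are gathered under `train_key` and all the test halves under `test_key`. A sketch with plain lists, using a deterministic slice where the real `Dataset.train_test_split` shuffles (`split_like` and `train_test_split_like` are illustrative names only):

```python
def split_like(records, test_size=0.25):
    # Deterministic stand-in for Dataset.train_test_split: the real
    # method shuffles; this slice only shows the output shape.
    n_test = max(1, int(len(records) * test_size))
    return {"train": records[:-n_test], "test": records[-n_test:]}

def train_test_split_like(dataset, train_key="train", test_key="test"):
    # Mirrors the step: a flat dataset yields {train_key, test_key};
    # a mapping splits each member independently and regroups the
    # per-key halves under the two top-level keys.
    if isinstance(dataset, list):
        split = split_like(dataset)
        return {train_key: split["train"], test_key: split["test"]}
    train, test = {}, {}
    for key, records in dataset.items():
        split = split_like(records)
        train[key] = split["train"]
        test[key] = split["test"]
    return {train_key: train, test_key: test}

result = train_test_split_like({"en": [1, 2, 3, 4], "ja": [5, 6, 7, 8]})
```

For the mapping input, `result["train"]` and `result["test"]` are themselves keyed by the original split names, matching how the step wraps them back into DatasetDict instances.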