
Datasets

formed.integrations.datasets.workflow

Workflow steps for Hugging Face Datasets integration.

This module provides workflow steps for loading, processing, and manipulating datasets using the Hugging Face Datasets library.

Available Steps
  • datasets::load: Load a dataset from disk or the Hugging Face Hub.
  • datasets::compose: Compose multiple Dataset objects into a DatasetDict.
  • datasets::concatenate: Concatenate multiple datasets into a single dataset.
  • datasets::train_test_split: Split a dataset into train and test sets.

DatasetFormat

Bases: Generic[DatasetOrMappingT], Format[DatasetOrMappingT]

Workflow format for serializing Dataset and DatasetDict artifacts to and from a directory on disk.

identifier property

identifier

Get the unique identifier for this format.

Returns:
    str: Format identifier string.

write

write(artifact, directory)

Save the artifact to disk: a mapping of datasets is written to per-key data.<key> subdirectories, while a single dataset is written to a data directory.
Source code in src/formed/integrations/datasets/workflow.py
def write(self, artifact: DatasetOrMappingT, directory: Path) -> None:
    if isinstance(artifact, Mapping):
        for key, dataset in artifact.items():
            dataset.save_to_disk(str(directory / f"data.{key}"))
    else:
        artifact.save_to_disk(str(directory / "data"))

read

read(directory)

Load the artifact from disk, returning a single dataset from the data directory or, if it is absent, a mapping reconstructed from the data.<key> subdirectories.
Source code in src/formed/integrations/datasets/workflow.py
def read(self, directory: Path) -> DatasetOrMappingT:
    if (directory / "data").exists():
        return cast(DatasetOrMappingT, datasets.load_from_disk(str(directory / "data")))
    return cast(
        DatasetOrMappingT,
        {datadir.name[5:]: datasets.load_from_disk(str(datadir)) for datadir in directory.glob("data.*")},
    )
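
The `data` versus `data.<key>` directory layout that `write` and `read` agree on can be sketched with plain JSON files standing in for `save_to_disk`/`load_from_disk`. This is a minimal illustration only; `write_like` and `read_like` are hypothetical helpers, not part of the library:

```python
import json
import tempfile
from pathlib import Path

def write_like(artifact, directory: Path) -> None:
    # Mirrors DatasetFormat.write: mappings fan out to data.<key>
    # subdirectories; a single dataset goes to one data directory.
    if isinstance(artifact, dict):
        for key, records in artifact.items():
            subdir = directory / f"data.{key}"
            subdir.mkdir(parents=True)
            (subdir / "records.json").write_text(json.dumps(records))
    else:
        subdir = directory / "data"
        subdir.mkdir(parents=True)
        (subdir / "records.json").write_text(json.dumps(artifact))

def read_like(directory: Path):
    # Mirrors DatasetFormat.read: a plain data directory wins;
    # otherwise data.* directories are collected into a dict,
    # stripping the "data." prefix (name[5:]) to recover each key.
    if (directory / "data").exists():
        return json.loads((directory / "data" / "records.json").read_text())
    return {
        d.name[5:]: json.loads((d / "records.json").read_text())
        for d in directory.glob("data.*")
    }

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_like({"train": [1, 2], "test": [3]}, root)
    roundtrip = read_like(root)
```

A mapping written through this layout reads back with its original keys, which is exactly what lets `read` reconstruct a DatasetDict-shaped result without any extra manifest file.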

is_default_of classmethod

is_default_of(obj)

Check if this format is the default for the given object type.

Parameters:
    obj (Any): Object to check.

Returns:
    bool: True if this format should be used by default for this type.

Source code in src/formed/workflow/format.py
@classmethod
def is_default_of(cls, obj: Any) -> bool:
    """Check if this format is the default for the given object type.

    Args:
        obj: Object to check.

    Returns:
        True if this format should be used by default for this type.

    """
    return False

load_dataset

load_dataset(path, **kwargs)

Load a dataset from disk or the Hugging Face Hub.

This step loads a dataset from a local path or downloads it from the Hugging Face Hub. The dataset can be either a Dataset or DatasetDict.

Parameters:
    path (str | PathLike): Path to the dataset (local or remote).
    **kwargs (Any): Additional arguments to pass to datasets.load_dataset.

Returns:
    Dataset: Loaded Dataset or DatasetDict.

Raises:
    ValueError: If the loaded object is not a Dataset or DatasetDict.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::load", cacheable=False, format=DatasetFormat())
def load_dataset(
    path: str | PathLike,
    **kwargs: Any,
) -> Dataset:
    """Load a dataset from disk or the Hugging Face Hub.

    This step loads a dataset from a local path or downloads it from the
    Hugging Face Hub. The dataset can be either a Dataset or DatasetDict.

    Args:
        path: Path to the dataset (local or remote).
        **kwargs: Additional arguments to pass to `datasets.load_dataset`.

    Returns:
        Loaded Dataset or DatasetDict.

    Raises:
        ValueError: If the loaded object is not a Dataset or DatasetDict.
    """
    with suppress(FileNotFoundError):
        path = minato.cached_path(path)
    if Path(path).exists():
        dataset = datasets.load_from_disk(str(path))
    else:
        dataset = cast(Dataset, datasets.load_dataset(str(path), **kwargs))
    if not isinstance(dataset, (datasets.Dataset, datasets.DatasetDict)):
        raise ValueError("Only Dataset or DatasetDict is supported")
    return dataset
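
The local-versus-Hub dispatch in the body above reduces to a path-existence check: an existing local path is read with `datasets.load_from_disk`, anything else is forwarded to `datasets.load_dataset` as a Hub dataset name. A hedged sketch of just that branching (`resolve_source` is an illustrative helper, not part of the module):

```python
from pathlib import Path

def resolve_source(path: str) -> str:
    # Mirrors load_dataset's branching after the optional
    # minato.cached_path resolution: existing paths come from disk,
    # everything else is treated as a Hugging Face Hub identifier.
    return "disk" if Path(path).exists() else "hub"
```

So `resolve_source(".")` reports "disk", while a Hub-style name such as "user/some-dataset" (assuming no such local directory exists) reports "hub".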

compose_datasetdict

compose_datasetdict(**kwargs)

Compose multiple Dataset objects into a single DatasetDict.

This step combines individual Dataset objects into a DatasetDict, filtering out any non-Dataset values.

Parameters:
    **kwargs (Dataset): Named datasets to compose. Only Dataset instances are included.

Returns:
    DatasetDict: DatasetDict containing all provided Dataset instances.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::compose", format=DatasetFormat())
def compose_datasetdict(**kwargs: Dataset) -> datasets.DatasetDict:
    """Compose multiple Dataset objects into a single DatasetDict.

    This step combines individual Dataset objects into a DatasetDict,
    filtering out any non-Dataset values.

    Args:
        **kwargs: Named datasets to compose. Only Dataset instances are included.

    Returns:
        DatasetDict containing all provided Dataset instances.
    """
    datasets_: dict[str, datasets.Dataset] = {
        key: dataset for key, dataset in kwargs.items() if isinstance(dataset, datasets.Dataset)
    }
    if len(datasets_) != len(kwargs):
        logger = use_step_logger(__name__)
        logger.warning(
            "Following keys are ignored since they are not Dataset instances: %s",
            set(kwargs) - set(datasets_),
        )
    return datasets.DatasetDict(**datasets_)
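
The filtering behavior can be sketched without the datasets library at all: keep only values of the expected type and report the rest. `FakeDataset` and `compose_like` are stand-ins for this illustration, not library names:

```python
class FakeDataset:
    """Stand-in for datasets.Dataset, used only in this sketch."""

def compose_like(**kwargs):
    # Mirrors compose_datasetdict: keep only Dataset instances and
    # collect the keys that were silently passed something else.
    kept = {k: v for k, v in kwargs.items() if isinstance(v, FakeDataset)}
    dropped = set(kwargs) - set(kept)
    return kept, dropped

kept, dropped = compose_like(
    train=FakeDataset(), test=FakeDataset(), note="not a dataset"
)
```

In the real step the dropped keys are not an error; they are logged as a warning through the step logger and simply excluded from the resulting DatasetDict.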

concatenate_datasets

concatenate_datasets(dsets, **kwargs)

Concatenate multiple datasets into a single dataset.

Parameters:
    dsets (list[Dataset]): List of datasets to concatenate.
    **kwargs (Any): Additional arguments to pass to datasets.concatenate_datasets.

Returns:
    Dataset: Concatenated dataset.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::concatenate", format=DatasetFormat())
def concatenate_datasets(dsets: list[datasets.Dataset], **kwargs: Any) -> datasets.Dataset:
    """Concatenate multiple datasets into a single dataset.

    Args:
        dsets: List of datasets to concatenate.
        **kwargs: Additional arguments to pass to `datasets.concatenate_datasets`.

    Returns:
        Concatenated dataset.
    """
    return cast(datasets.Dataset, datasets.concatenate_datasets(dsets, **kwargs))
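
Row-wise, the step behaves much like chaining the underlying record lists in order: rows of the first dataset, then the second, and so on. A purely illustrative stdlib analogue:

```python
from itertools import chain

# Two record lists standing in for two datasets with the same schema.
splits = [[{"text": "a"}, {"text": "b"}], [{"text": "c"}]]

# datasets.concatenate_datasets stacks rows in input order, much like
# flattening the list of record lists.
combined = list(chain.from_iterable(splits))
```

Note that the real function also requires the datasets to share compatible features (schemas), which a plain list chain does not check.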

train_test_split

train_test_split(
    dataset, train_key="train", test_key="test", **kwargs
)

Split a dataset into train and test sets.

This step splits a Dataset or DatasetDict into training and test sets. For DatasetDict inputs, each split is performed independently.

Parameters:
    dataset (Dataset): Dataset or DatasetDict to split.
    train_key (str, default 'train'): Key name for the training split.
    test_key (str, default 'test'): Key name for the test split.
    **kwargs (Any): Additional arguments to pass to train_test_split.

Returns:
    dict[str, Dataset]: Dictionary with train and test splits.

Source code in src/formed/integrations/datasets/workflow.py
@step("datasets::train_test_split", format=DatasetFormat())
def train_test_split(
    dataset: Dataset,
    train_key: str = "train",
    test_key: str = "test",
    **kwargs: Any,
) -> dict[str, Dataset]:
    """Split a dataset into train and test sets.

    This step splits a Dataset or DatasetDict into training and test sets.
    For DatasetDict inputs, each split is performed independently.

    Args:
        dataset: Dataset or DatasetDict to split.
        train_key: Key name for the training split.
        test_key: Key name for the test split.
        **kwargs: Additional arguments to pass to `train_test_split`.

    Returns:
        Dictionary with train and test splits.
    """
    if isinstance(dataset, datasets.Dataset):
        split = dataset.train_test_split(**kwargs)
        return {train_key: split["train"], test_key: split["test"]}
    else:
        train_datasets: dict[str, datasets.Dataset] = {}
        test_datasets: dict[str, datasets.Dataset] = {}
        for key, dset in dataset.items():
            split = dset.train_test_split(**kwargs)
            train_datasets[str(key)] = split["train"]
            test_datasets[str(key)] = split["test"]
        return {
            train_key: datasets.DatasetDict(**train_datasets),
            test_key: datasets.DatasetDict(**test_datasets),
        }
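
The DatasetDict branch regroups results: each member is split independently, then all the train halves are gathered under `train_key` and all the test halves under `test_key`. A sketch with plain lists, using a deterministic slice where the real `Dataset.train_test_split` shuffles (`split_like` and `train_test_split_like` are illustrative names only):

```python
def split_like(records, test_size=0.25):
    # Deterministic stand-in for Dataset.train_test_split: the real
    # method shuffles; this slice only shows the output shape.
    n_test = max(1, int(len(records) * test_size))
    return {"train": records[:-n_test], "test": records[-n_test:]}

def train_test_split_like(dataset, train_key="train", test_key="test"):
    # Mirrors the step: a flat dataset yields {train_key, test_key};
    # a mapping splits each member independently and regroups the
    # per-key halves under the two top-level keys.
    if isinstance(dataset, list):
        split = split_like(dataset)
        return {train_key: split["train"], test_key: split["test"]}
    train, test = {}, {}
    for key, records in dataset.items():
        split = split_like(records)
        train[key] = split["train"]
        test[key] = split["test"]
    return {train_key: train, test_key: test}

result = train_test_split_like({"en": [1, 2, 3, 4], "ja": [5, 6, 7, 8]})
```

For the mapping input, `result["train"]` and `result["test"]` are themselves keyed by the original split names, matching how the step wraps them back into DatasetDict instances.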