tabben.datasets

Dataset Class

The basic unit of work for this package is the OpenTabularDataset.

class tabben.datasets.OpenTabularDataset(data_dir: Union[str, bytes, os.PathLike], name: str, split: Union[str, Iterable[str]] = 'train', *, download=True, lazy=False, transform=None, target_transform=None)

Bases: torch.utils.data.dataset.Dataset

A tabular dataset from the benchmark.

__init__(data_dir: Union[str, bytes, os.PathLike], name: str, split: Union[str, Iterable[str]] = 'train', *, download=True, lazy=False, transform=None, target_transform=None)

Load and create a dataset with the given name (storing the dataset files in the data_dir) for the particular subset given by split.

Parameters
  • data_dir (path-like) – Directory to load/store the dataset files

  • name (str) – Name (primary key) of the dataset

  • split (str or iterable of str, default='train') – Subset split of the dataset to load

  • download (bool, default=True) – Whether to download the dataset files if not already present in data_dir

  • lazy (bool, default=False) –

    Whether to postpone loading the data into memory until the first access

    Not implemented yet!

  • transform (callable, optional) – Transform or function that will be applied to the input attributes vector

  • target_transform (callable, optional) – Transform or function that will be applied to the target variables
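Example

A minimal usage sketch. The dataset name 'arcene' is only a placeholder (use list_datasets() to see which names are actually available), and the (inputs, targets) item structure is inferred from the transform/target_transform parameters rather than guaranteed here.

    from torch.utils.data import DataLoader

    from tabben.datasets import OpenTabularDataset

    # download (if needed) and load the training split of a benchmark dataset
    ds = OpenTabularDataset('./data', 'arcene', split='train')

    # OpenTabularDataset subclasses torch.utils.data.Dataset, so it can be
    # wrapped in a standard DataLoader for batched iteration
    loader = DataLoader(ds, batch_size=32, shuffle=True)
    for inputs, targets in loader:
        ...  # training loop goes here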

property bibtex: Optional[str]

BibTeX for the dataset and any associated papers that the original dataset providers have asked to be cited. This is useful if you are doing research with this benchmark and want to cite the original datasets.

Returns

BibTeX if available, otherwise None

Return type

str or None

property categorical_attributes: Optional[Sequence[str]]

Labels/names of the categorical attributes of this dataset if available.

Returns

List of names of categorical attributes if available, otherwise None

Return type

sequence of str or None

dataframe() -> pandas.DataFrame

Create a pandas DataFrame consisting of both input attributes and output labels for this dataset (for this specific split).

Since pandas is not a required dependency, make sure you already have pandas installed before you call this method.

Returns

Dataframe containing the complete dataset for this split

Return type

pandas.DataFrame
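Example

A short sketch, assuming pandas is installed and ds is an OpenTabularDataset as constructed above:

    df = ds.dataframe()
    print(df.head())
    print(df.describe())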

has_extra(extra_name) -> bool

Check whether this dataset has a specific extra.

Parameters

extra_name (str) – Name of the extra to check

Returns

True if this dataset contains an extra with this name, otherwise False

Return type

bool

property has_extras: bool

Whether this dataset has “extras” metadata, which typically contains the mappings for categories from numbers to labels, license information, bibtex, data profiles, etc.

Returns

Whether this dataset has extras

Return type

bool
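Example

A short sketch combining has_extras and has_extra; the extra name 'license' is only an assumed example key:

    if ds.has_extras and ds.has_extra('license'):
        print('this dataset ships license information')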

property license: Optional[str]

License text for the dataset itself. (The tabben package is MIT-licensed, but the datasets themselves may not be as permissive. Particularly if you intend to use the datasets in a commercial setting, make sure to check the license of the datasets used.)

Returns

License text if available, otherwise None

Return type

str or None
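Example

A short sketch using the bibtex and license properties (ds is an OpenTabularDataset as constructed earlier):

    if ds.bibtex is not None:
        print(ds.bibtex)   # include this citation in your paper
    if ds.license is not None:
        print(ds.license)  # check the terms before commercial use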

property num_classes: int

Number of classes for this dataset if it is a classification task.

Returns

Number of classification classes

Return type

int

Raises

AttributeError – If called on a non-classification dataset

property num_inputs: int

Number of input attributes for this dataset.

Returns

Number of raw input attributes (without preprocessing or transforms)

Return type

int

property num_outputs: int

Number of output/target variables for this dataset.

Returns

Number of raw output variables (without preprocessing or transforms)

Return type

int

numpy() -> (numpy.ndarray, numpy.ndarray)

Return the input and output attributes as numpy arrays in the standard scikit-learn format of (inputs, outputs).

Returns

2-tuple of inputs and outputs as matrices/vectors

Return type

tuple of numpy.ndarray
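Example

A sketch of the scikit-learn workflow implied above; scikit-learn is assumed to be installed, the dataset name and the 'test' split name are placeholders, and the estimator choice is arbitrary:

    from sklearn.ensemble import RandomForestClassifier

    from tabben.datasets import OpenTabularDataset

    X_train, y_train = OpenTabularDataset('./data', 'arcene', split='train').numpy()
    X_test, y_test = OpenTabularDataset('./data', 'arcene', split='test').numpy()

    model = RandomForestClassifier().fit(X_train, y_train)
    print(model.score(X_test, y_test))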

property task: str

Task associated with this dataset.

See also

allowed_tasks

List of allowed/currently supported tasks for the benchmark
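Example

A sketch showing how the task and size properties above can drive model construction; the single linear layer is just for illustration:

    import torch.nn as nn

    if ds.task == 'classification':
        out_features = ds.num_classes
    else:  # e.g. regression
        out_features = ds.num_outputs

    model = nn.Linear(ds.num_inputs, out_features)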

Metadata

tabben.datasets.list_datasets()

List the tabular datasets available.

Returns

Sequence of names of all datasets in the benchmark

Return type

sequence of str
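Example

A short sketch listing every dataset in the benchmark:

    from tabben.datasets import list_datasets

    for name in list_datasets():
        print(name)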

tabben.datasets.allowed_tasks = {'classification', 'regression'}

Set of task names currently supported by datasets in the benchmark.

Collection of Datasets

class tabben.datasets.DatasetCollection(location, *names, split='train', download=True, lazy=True, transform=None, target_transform=None)

A collection of tabular datasets, providing some convenience methods to bulk load, evaluate, or extract metadata/extras from a set of datasets.

Many of the same attributes and methods for OpenTabularDataset are also available for DatasetCollection, although some of them are pluralized (e.g. task -> tasks, dataframe -> dataframes).

__init__(location, *names, split='train', download=True, lazy=True, transform=None, target_transform=None)

Load and create a collection of datasets (stored in or downloaded into location) for all of the datasets given by names, all using the same subset split.

Parameters
  • location (path-like) – Path to a directory where the dataset files are stored

  • *names (str) – Names (primary keys) of the datasets to include in this collection

  • split (str, default='train') – Name of the dataset subset

  • download (bool, default=True) – Whether to download the dataset files if not already present

  • transform (callable or list of callable or dict of callable, optional) – Transforms/functions to apply to the input attribute vectors (see below)

  • target_transform (callable or list of callable or dict of callable, optional) – Transforms/functions to apply to the output variables (see below)

Notes

The parameters transform and target_transform are optional, but can be specified as a single callable object, a sequence of callable objects, or a mapping from dataset names to callable objects. In each of these cases:

callable

The single callable object will be applied to all datasets.

sequence of callable

Based on the sequential order of the datasets, transforms are assigned to datasets starting at the beginning of the sequence until there are either no more datasets or no more transforms. Datasets not matched with an element of the sequence (i.e., when the number of datasets exceeds the length of the sequence) are not transformed.

mapping from name to callable

For each dataset in the collection, if it is a key in the mapping, then the corresponding callable will be applied to the examples for that dataset. Otherwise (if the name is not present as a key), no transform is applied.
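Example

A hedged sketch of two of the transform forms described above; the dataset names and the transform itself are placeholders:

    from tabben.datasets import DatasetCollection

    def scale(x):
        return x / 255.0  # placeholder transform

    # a single callable: applied to every dataset in the collection
    collection = DatasetCollection('./data', 'arcene', 'covertype', transform=scale)

    # a mapping from dataset name to callable: only 'arcene' is transformed
    collection = DatasetCollection(
        './data', 'arcene', 'covertype',
        transform={'arcene': scale},
    )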

classmethod match(location: os.PathLike, *, task: Optional[Union[str, Iterable[str]]] = None, outputs: Optional[Union[int, Iterable[int]]] = None, classes: Optional[Union[int, Iterable[int]]] = None, **kwargs)

Create a dataset collection consisting of all benchmark datasets that match all given conditions. This can be used, for example, to get a collection of all binary classification datasets.

Parameters
  • location (path-like) – Path to where datasets are stored/downloaded to

  • task (str or iterable of str, optional) – Task(s) that must be associated with the datasets

  • outputs (int or iterable of int (e.g. a range), optional) – Number of outputs that datasets must have

  • classes (int or iterable of int (e.g. a range), optional) –

    Number of classes that classification datasets must have

    Note: this will only filter out classification datasets that don’t have the correct number of classes. That is, if there are other tasks selected, they will not be filtered out by the classes filter.

  • **kwargs – All other keyword arguments are passed to the constructor.

Returns

Collection of datasets matching all specified conditions

Return type

DatasetCollection

Raises

ValueError – If classes is specified but classification datasets are excluded using task

Notes

To do this without requiring that datasets be available/already downloaded, this class method only supports filtering based on metadata that is located in the TOML metadata file, which does not include dataset extras.
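Example

A short sketch of the binary-classification case mentioned above:

    from tabben.datasets import DatasetCollection

    binary_classification = DatasetCollection.match(
        './data',
        task='classification',
        classes=2,
    )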

table(*columns)

Return selected attributes of the datasets in this collection as a pandas dataframe. Note that this does not return the data attributes, but the meta-attributes of the datasets themselves, such as the task, number of examples, and types of attributes.

Because pandas is an optional dependency, make sure you have the pandas package installed before calling this method.

Parameters

*columns (str) – Names of meta-attributes to include (see Notes below for a list of options)

Returns

Dataframe of meta attributes about the datasets in this collection

Return type

pandas.DataFrame

Notes

The list of currently supported meta-attribute names is not yet documented (work in progress).

Other Utilities

tabben.datasets.ensure_downloaded(data_dir: Union[str, bytes, os.PathLike], *datasets: str) -> None

Download the specified datasets (all available datasets if none are specified) into the data directory if they are not already present. This is useful when this package will later be used in an environment without Internet access, or for establishing a local shared cache.

Parameters
  • data_dir (path-like) – Directory to save the dataset files in

  • *datasets (str) – Names of datasets to download (if empty, all datasets will be downloaded)
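Example

A short sketch; the directory and dataset names are placeholders:

    from tabben.datasets import ensure_downloaded

    # pre-populate a shared cache with two specific datasets
    ensure_downloaded('/shared/tabben-data', 'arcene', 'covertype')

    # or download every dataset in the benchmark
    ensure_downloaded('/shared/tabben-data')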

tabben.datasets.register_dataset(name: str, task: str = 'classification', *, data_location: str, persist=False, **kwargs) -> None

Add new datasets to the benchmark at runtime (after package loading).

Parameters
  • name (str) – Name of the dataset (used as a primary index, cannot be 'all')

  • task (str) – Which task is associated with this dataset (see allowed_tasks)

  • persist (bool) – Whether to save this dataset so that it persists between restarts (only for this installation)

  • data_location (str) – URI string pointing to the NPZ file for this dataset

  • outputs (int, recommended, default=1) – Number of output variables

  • classes (int, recommended for classification tasks, default=2) – Number of classification classes

  • extras_location (str) – URI string pointing to a JSON file of “extras” metadata for this dataset

  • **kwargs – All other keyword arguments are stored as additional metadata in the TOML file

See also

validate_dataset_file

Validate the NPZ file before adding as a new dataset
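Example

A hedged sketch of registering a custom dataset; the name, URI, and metadata values are placeholders, and the NPZ file is assumed to already pass validate_dataset_file:

    from tabben.datasets import register_dataset

    register_dataset(
        'my-dataset',
        'classification',
        data_location='https://example.com/my-dataset.npz',
        outputs=1,
        classes=2,
        persist=False,
    )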

tabben.datasets.validate_dataset_file(filepath: Union[str, bytes, os.PathLike]) -> None

Validate an NPZ dataset file at a local path to make sure that the dataset it contains can be read as a valid dataset using this package. This is mainly intended for interactive use, e.g. checking a file at the REPL before registering it as a new dataset.

Parameters

filepath (str or path-like) – Filepath of the NPZ dataset file

Raises
  • FileNotFoundError – If the filepath does not exist

  • IOError – If the file cannot be read at all

  • DatasetFormatError – If there is an error with the format of the NPZ dataset file
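Example

A short sketch that validates a local file before registering it as a new dataset (the path is a placeholder):

    from tabben.datasets import DatasetFormatError, validate_dataset_file

    try:
        validate_dataset_file('./my-dataset.npz')
    except DatasetFormatError as err:
        print(f'not a usable dataset file: {err}')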

exception tabben.datasets.DatasetFormatError

An exception raised when an NPZ dataset file has an unexpected format (beyond the usual NPZ file format requirements).