tabben.datasets

Dataset Class

The basic unit of work for this package is the OpenTabularDataset.

class tabben.datasets.OpenTabularDataset(data_dir: Union[str, bytes, os.PathLike], name: str, split: Union[str, Iterable[str]] = 'train', *, download=True, lazy=False, transform=None, target_transform=None)

Bases: torch.utils.data.dataset.Dataset

A tabular dataset from the benchmark.

__init__(data_dir: Union[str, bytes, os.PathLike], name: str, split: Union[str, Iterable[str]] = 'train', *, download=True, lazy=False, transform=None, target_transform=None)

Load and create a dataset with the given name (storing the dataset files in the data_dir) for the particular subset given by split.

Parameters
  • data_dir (path-like) – Directory to load/store the dataset files

  • name (str) – Name (primary key) of the dataset

  • split (str or iterable of str, default='train') – Subset split of the dataset to load

  • download (bool, default=True) – Whether to download the dataset files if not already present in data_dir

  • lazy (bool, default=False) –

    Whether to postpone loading the data into memory until the first access

    Not implemented yet!

  • transform (callable, optional) – Transform or function that will be applied to the input attributes vector

  • target_transform (callable, optional) – Transform or function that will be applied to the target variables
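Example

A minimal usage sketch. The dataset name 'arcene' is only a placeholder (use list_datasets() to see which names are actually available), and the (inputs, targets) item structure is inferred from the transform/target_transform parameters rather than guaranteed here.

    from torch.utils.data import DataLoader

    from tabben.datasets import OpenTabularDataset

    # download (if needed) and load the training split of a benchmark dataset
    ds = OpenTabularDataset('./data', 'arcene', split='train')

    # OpenTabularDataset subclasses torch.utils.data.Dataset, so it can be
    # wrapped in a standard DataLoader for batched iteration
    loader = DataLoader(ds, batch_size=32, shuffle=True)
    for inputs, targets in loader:
        ...  # training loop goes here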

property bibtex: Optional[str]

BibTeX for the dataset and any associated papers that the original dataset providers have asked to be cited. This is useful if you are doing research with this benchmark and want to cite the original datasets.

Returns

BibTeX if available, otherwise None

Return type

str or None

property categorical_attributes: Optional[Sequence[str]]

Labels/names of the categorical attributes of this dataset if available.

Returns

List of names of categorical attributes if available, otherwise None

Return type

sequence of str or None

dataframe() -> pandas.DataFrame

Create a pandas DataFrame consisting of both input attributes and output labels for this dataset (for this specific split).

Since pandas is not a required dependency, make sure you already have pandas installed before you call this method.

Returns

Dataframe containing the complete dataset for this split

Return type

pandas.DataFrame
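Example

A short sketch, assuming pandas is installed and ds is an OpenTabularDataset as constructed above:

    df = ds.dataframe()
    print(df.head())
    print(df.describe())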

has_extra(extra_name) -> bool

Check whether this dataset has a specific extra.

Parameters

extra_name (str) – Name of the extra to check

Returns

True if this dataset contains an extra with this name, otherwise False

Return type

bool

property has_extras: bool

Whether this dataset has “extras” metadata, which typically contains the mappings for categories from numbers to labels, license information, bibtex, data profiles, etc.

Returns

Whether this dataset has extras

Return type

bool
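Example

A short sketch combining has_extras and has_extra; the extra name 'license' is only an assumed example key:

    if ds.has_extras and ds.has_extra('license'):
        print('this dataset ships license information')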

property license: Optional[str]

License text for the dataset itself. (The tabben package is MIT-licensed, but the datasets themselves may not be as permissive. Particularly if you intend to use the datasets in a commercial setting, make sure to check the license of the datasets used.)

Returns

License text if available, otherwise None

Return type

str or None
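Example

A short sketch using the bibtex and license properties (ds is an OpenTabularDataset as constructed earlier):

    if ds.bibtex is not None:
        print(ds.bibtex)   # include this citation in your paper
    if ds.license is not None:
        print(ds.license)  # check the terms before commercial use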

property num_classes: int

Number of classes for this dataset if it is a classification task.

Returns

Number of classification classes

Return type

int

Raises

AttributeError – If called on a non-classification dataset

property num_inputs: int

Number of input attributes for this dataset.

Returns

Number of raw input attributes (without preprocessing or transforms)

Return type

int

property num_outputs: int

Number of output/target variables for this dataset.

Returns

Number of raw output variables (without preprocessing or transforms)

Return type

int

numpy() -> (numpy.ndarray, numpy.ndarray)

Return the input and output attributes as numpy arrays in the standard scikit-learn format of (inputs, outputs).

Returns

2-tuple of inputs and outputs as matrices/vectors

Return type

tuple of numpy.ndarray
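Example

A sketch of the scikit-learn workflow implied above; scikit-learn is assumed to be installed, the dataset name and the 'test' split name are placeholders, and the estimator choice is arbitrary:

    from sklearn.ensemble import RandomForestClassifier

    from tabben.datasets import OpenTabularDataset

    X_train, y_train = OpenTabularDataset('./data', 'arcene', split='train').numpy()
    X_test, y_test = OpenTabularDataset('./data', 'arcene', split='test').numpy()

    model = RandomForestClassifier().fit(X_train, y_train)
    print(model.score(X_test, y_test))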

property task: str

Task associated with this dataset.

See also

allowed_tasks

List of allowed/currently supported tasks for the benchmark
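Example

A sketch showing how the task and size properties above can drive model construction; the single linear layer is just for illustration:

    import torch.nn as nn

    if ds.task == 'classification':
        out_features = ds.num_classes
    else:  # e.g. regression
        out_features = ds.num_outputs

    model = nn.Linear(ds.num_inputs, out_features)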

Metadata

tabben.datasets.list_datasets()

List the tabular datasets available.

Returns

Sequence of names of all datasets in the benchmark

Return type

sequence of str
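Example

A short sketch listing every dataset in the benchmark:

    from tabben.datasets import list_datasets

    for name in list_datasets():
        print(name)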

tabben.datasets.allowed_tasks = {'classification', 'regression'}

Set of task names currently supported by datasets in the benchmark.

Collection of Datasets

class tabben.datasets.DatasetCollection(location, *names, split='train', download=True, lazy=True, transform=None, target_transform=None)

A collection of tabular datasets, providing some convenience methods to bulk load, evaluate, or extract metadata/extras from a set of datasets.

Many of the same attributes and methods for OpenTabularDataset are also available for DatasetCollection, although some of them are pluralized (e.g. task -> tasks, dataframe -> dataframes).

__init__(location, *names, split='train', download=True, lazy=True, transform=None, target_transform=None)

Load and create a collection of datasets (stored in or downloaded into location) for all of the datasets given by names, all using the same subset split.

Parameters
  • location (path-like) – Path to a directory where the dataset files are stored

  • *names (str) – Names (primary keys) of the datasets to include in this collection

  • split (str, default='train') – Name of the dataset subset

  • download (bool, default=True) – Whether to download the dataset files if not already present

  • transform (callable or list of callable or dict of callable, optional) – Transforms/functions to apply to the input attribute vectors (see below)

  • target_transform (callable or list of callable or dict of callable, optional) – Transforms/functions to apply to the output variables (see below)

Notes

The parameters transform and target_transform are optional, but can be specified as a single callable object, a sequence of callable objects, or a mapping from dataset names to callable objects. In each of these cases:

callable

The single callable object will be applied to all datasets.

sequence of callable

Based on the sequential order of the datasets, transforms are assigned to datasets starting at the beginning of the sequence until there are either no more datasets or no more transforms. Datasets not matched with an element of the sequence (i.e., when the number of datasets exceeds the length of the sequence) are not transformed.

mapping from name to callable

For each dataset in the collection, if it is a key in the mapping, then the corresponding callable will be applied to the examples for that dataset. Otherwise (if the name is not present as a key), no transform is applied.
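Example

A hedged sketch of two of the transform forms described above; the dataset names and the transform itself are placeholders:

    from tabben.datasets import DatasetCollection

    def scale(x):
        return x / 255.0  # placeholder transform

    # a single callable: applied to every dataset in the collection
    collection = DatasetCollection('./data', 'arcene', 'covertype', transform=scale)

    # a mapping from dataset name to callable: only 'arcene' is transformed
    collection = DatasetCollection(
        './data', 'arcene', 'covertype',
        transform={'arcene': scale},
    )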

classmethod match(location: os.PathLike, *, task: Optional[Union[str, Iterable[str]]] = None, outputs: Optional[Union[int, Iterable[int]]] = None, classes: Optional[Union[int, Iterable[int]]] = None, **kwargs)

Create a dataset collection consisting of all benchmark datasets that match all given conditions. This can be used, for example, to get a collection of all binary classification datasets.

Parameters
  • location (path-like) – Path to where datasets are stored/downloaded to

  • task (str or iterable of str, optional) – Task(s) that must be associated with the datasets

  • outputs (int or iterable of int (e.g. a range), optional) – Number of outputs that datasets must have

  • classes (int or iterable of int (e.g. a range), optional) –

    Number of classes that classification datasets must have

    Note: this will only filter out classification datasets that don’t have the correct number of classes. That is, if there are other tasks selected, they will not be filtered out by the classes filter.

  • **kwargs – All other keyword arguments are passed to the constructor.

Returns

Collection of datasets matching all specified conditions

Return type

DatasetCollection

Raises

ValueError – If classes is specified but classification datasets are excluded using task

Notes

To do this without requiring that datasets be available/already downloaded, this class method only supports filtering based on metadata that is located in the TOML metadata file, which does not include dataset extras.
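Example

A short sketch of the binary-classification case mentioned above:

    from tabben.datasets import DatasetCollection

    binary_classification = DatasetCollection.match(
        './data',
        task='classification',
        classes=2,
    )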

table(*columns)

Return selected attributes of the datasets in this collection as a pandas dataframe. Note that this does not return the data attributes, but the meta-attributes of the datasets themselves, such as the task, number of examples, and types of attributes.

Because pandas is an optional dependency, make sure you have the pandas package installed before calling this method.

Parameters

*columns (str) – Names of meta-attributes to include (see Notes below for a list of options)

Returns

Dataframe of meta attributes about the datasets in this collection

Return type

pandas.DataFrame

Notes

The list of currently supported meta-attribute names is not yet documented (work in progress).

Other Utilities

tabben.datasets.ensure_downloaded(data_dir: Union[str, bytes, os.PathLike], *datasets: str) -> None

Download the specified datasets (all available datasets if none are specified) into the data directory if they are not already present. This is useful when this package will later be used in an environment without Internet access, or for establishing a local shared cache.

Parameters
  • data_dir (path-like) – Directory to save the dataset files in

  • *datasets (str) – Names of datasets to download (if empty, all datasets will be downloaded)
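Example

A short sketch; the directory and dataset names are placeholders:

    from tabben.datasets import ensure_downloaded

    # pre-populate a shared cache with two specific datasets
    ensure_downloaded('/shared/tabben-data', 'arcene', 'covertype')

    # or download every dataset in the benchmark
    ensure_downloaded('/shared/tabben-data')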

tabben.datasets.register_dataset(name: str, task: str = 'classification', *, data_location: str, persist=False, **kwargs) -> None

Add new datasets to the benchmark at runtime (after package loading).

Parameters
  • name (str) – Name of the dataset (used as a primary index, cannot be 'all')

  • task (str) – Which task is associated with this dataset (see allowed_tasks)

  • persist (bool) – Whether to save this dataset so that it persists between restarts (only for this installation)

  • data_location (str) – URI string pointing to the NPZ file for this dataset

  • outputs (int, recommended, default=1) – Number of output variables

  • classes (int, recommended for classification tasks, default=2) – Number of classification classes

  • extras_location (str) – URI string pointing to a JSON file of “extras” metadata for this dataset

  • **kwargs – All other keyword arguments are stored as additional metadata in the TOML file

See also

validate_dataset_file

Validate the NPZ file before adding as a new dataset
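Example

A hedged sketch of registering a custom dataset; the name, URI, and metadata values are placeholders, and the NPZ file is assumed to already pass validate_dataset_file:

    from tabben.datasets import register_dataset

    register_dataset(
        'my-dataset',
        'classification',
        data_location='https://example.com/my-dataset.npz',
        outputs=1,
        classes=2,
        persist=False,
    )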

tabben.datasets.validate_dataset_file(filepath: Union[str, bytes, os.PathLike]) -> None

Validate an NPZ dataset file at a local path to make sure that the dataset it contains can be read as a valid dataset using this package. This is mainly intended for interactive use, e.g. checking a file at the REPL before registering it as a new dataset.

Parameters

filepath (str or path-like) – Filepath of the NPZ dataset file

Raises
  • FileNotFoundError – If the filepath does not exist

  • IOError – If the file cannot be read at all

  • DatasetFormatError – If there is an error with the format of the NPZ dataset file
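Example

A short sketch that validates a local file before registering it as a new dataset (the path is a placeholder):

    from tabben.datasets import DatasetFormatError, validate_dataset_file

    try:
        validate_dataset_file('./my-dataset.npz')
    except DatasetFormatError as err:
        print(f'not a usable dataset file: {err}')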

exception tabben.datasets.DatasetFormatError

An exception raised when an NPZ dataset file has an unexpected format (beyond the usual NPZ file format requirements).