tabben.datasets¶
Dataset Class¶
The basic unit of work for this package is the OpenTabularDataset.
- class tabben.datasets.OpenTabularDataset(data_dir: Union[str, bytes, os.PathLike], name: str, split: Union[str, Iterable[str]] = 'train', *, download=True, lazy=False, transform=None, target_transform=None)¶
Bases: torch.utils.data.dataset.Dataset
A tabular dataset from the benchmark.
- __init__(data_dir: Union[str, bytes, os.PathLike], name: str, split: Union[str, Iterable[str]] = 'train', *, download=True, lazy=False, transform=None, target_transform=None)¶
Load and create a dataset with the given name (storing the dataset files in the data_dir) for the particular subset given by split.
- Parameters
data_dir (path-like) – Directory to load/store the dataset files
name (str) – Name (primary key) of the dataset
split (str or iterable of str, default='train') – Subset split of the dataset to load
download (bool, default=True) – Whether to download the dataset files if not already present in data_dir
lazy (bool, default=False) – Whether to postpone loading the data into memory until the first access (not implemented yet)
transform (callable, optional) – Transform or function that will be applied to the input attributes vector
target_transform (callable, optional) – Transform or function that will be applied to the target variables
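Example (a minimal sketch; the dataset name 'arcene' is illustrative — use list_datasets() to see the names actually available):

    from torch.utils.data import DataLoader
    from tabben.datasets import OpenTabularDataset

    # download (if needed) and load the training split of a benchmark dataset
    ds = OpenTabularDataset('./data', 'arcene', split='train', download=True)

    # OpenTabularDataset is a PyTorch Dataset, so it plugs directly into a DataLoader
    loader = DataLoader(ds, batch_size=32, shuffle=True)
    for inputs, targets in loader:
        ...  # standard PyTorch training loop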
- property bibtex: Optional[str]¶
Bibtex for the dataset and any associated papers that the original dataset providers have asked to be cited. This is useful if you are doing research with this benchmark and want to cite the original datasets.
- Returns
Bibtex if available, otherwise None
- Return type
str or None
- property categorical_attributes: Optional[Sequence[str]]¶
Labels/names of the categorical attributes of this dataset if available.
- Returns
List of names of categorical attributes if available, otherwise None
- Return type
sequence of str or None
- dataframe() pandas.DataFrame ¶
Create a pandas DataFrame consisting of both input attributes and output labels for this dataset (for this specific split).
Since pandas is not a required dependency, make sure you already have pandas installed before you call this method.
- Returns
Dataframe containing the complete dataset for this split
- Return type
pandas.DataFrame
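Example (a sketch; the dataset name is illustrative):

    ds = OpenTabularDataset('./data', 'arcene', split='test')
    df = ds.dataframe()  # inputs and labels together, one row per example
    print(df.head())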
- has_extra(extra_name) bool ¶
Check whether this dataset has a specific extra.
- Parameters
extra_name (str) – Name of the extra to check
- Returns
True if this dataset contains an extra with this name, otherwise False
- Return type
bool
- property has_extras: bool¶
Whether this dataset has “extras” metadata, which typically contains the mappings for categories from numbers to labels, license information, bibtex, data profiles, etc.
- Returns
Whether this dataset has extras
- Return type
bool
- property license: Optional[str]¶
License text for the dataset itself. (The tabben package is MIT-licensed, but the datasets themselves may not be as permissive. Particularly if you intend to use the datasets in a commercial setting, make sure to check the license of the datasets used.)
- Returns
License text if available, otherwise None
- Return type
str or None
- property num_classes: int¶
Number of classes for this dataset if it is a classification task.
- Returns
Number of classification classes
- Return type
int
- Raises
AttributeError – If called on a non-classification dataset
- property num_inputs: int¶
Number of input attributes for this dataset.
- Returns
Number of raw input attributes (without preprocessing or transforms)
- Return type
int
- property num_outputs: int¶
Number of output/target variables for this dataset.
- Returns
Number of raw output variables (without preprocessing or transforms)
- Return type
int
- numpy() (numpy.ndarray, numpy.ndarray) ¶
Return the input and output attributes as numpy arrays in the standard scikit-learn format of (inputs, outputs).
- Returns
2-tuple of inputs and outputs as matrices/vectors
- Return type
tuple of numpy.ndarray
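Example (a sketch; the dataset name and model choice are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from tabben.datasets import OpenTabularDataset

    train = OpenTabularDataset('./data', 'arcene', split='train')
    test = OpenTabularDataset('./data', 'arcene', split='test')

    X_train, y_train = train.numpy()
    X_test, y_test = test.numpy()

    # fit a scikit-learn estimator directly on the (inputs, outputs) arrays
    model = RandomForestClassifier().fit(X_train, y_train)
    print(model.score(X_test, y_test))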
- property task: str¶
Task associated with this dataset.
See also
allowed_tasks
List of allowed/currently supported tasks for the benchmark
Metadata¶
- tabben.datasets.list_datasets()¶
List the tabular datasets available.
- Returns
Sequence of names of all datasets in the benchmark
- Return type
sequence of str
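Example:

    from tabben.datasets import list_datasets

    for name in list_datasets():
        print(name)  # each name can be passed to OpenTabularDataset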
- tabben.datasets.allowed_tasks = {'classification', 'regression'}¶
The set of tasks currently supported by datasets in the benchmark.
Collection of Datasets¶
- class tabben.datasets.DatasetCollection(location, *names, split='train', download=True, lazy=True, transform=None, target_transform=None)¶
A collection of tabular datasets, providing some convenience methods to bulk load, evaluate, or extract metadata/extras from a set of datasets.
Many of the same attributes and methods for OpenTabularDataset are also available for DatasetCollection, although some of them are pluralized (e.g. task -> tasks, dataframe -> dataframes).
- __init__(location, *names, split='train', download=True, lazy=True, transform=None, target_transform=None)¶
Load and create a collection of datasets stored in (or downloaded to) location, consisting of all datasets given by names, all using the same subset split.
- Parameters
location (path-like) – Path to a directory where the dataset files are stored
*names (str) – Names (primary keys) of the datasets to include in this collection
split (str, default='train') – Name of the dataset subset
download (bool, default=True) – Whether to download the dataset files if not already present
lazy (bool, default=True) – Whether to postpone loading the data into memory until the first access
transform (callable or list of callable or dict of callable, optional) – Transforms/functions to apply to the input attribute vectors (see below)
target_transform (callable or list of callable or dict of callable, optional) – Transforms/functions to apply to the output variables (see below)
Notes
The parameters transform and target_transform are optional, but can be specified as a single callable object, a sequence of callable objects, or a mapping from dataset names to callable objects. In each of these cases:
- callable
The single callable object will be applied to all datasets.
- sequence of callable
Based on the sequential order of the datasets, transforms are assigned to datasets starting at the beginning of the sequence, until there are either no more datasets or no more transforms. Datasets not matched with an element of the sequence (i.e., when the number of datasets exceeds the length of the sequence) are not transformed.
- mapping from name to callable
For each dataset in the collection, if it is a key in the mapping, then the corresponding callable will be applied to the examples for that dataset. Otherwise (if the name is not present as a key), no transform is applied.
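Example (a sketch; the dataset names and scaling function are hypothetical):

    from tabben.datasets import DatasetCollection

    def scale_inputs(x):
        return x / 100.0  # placeholder per-example transform

    collection = DatasetCollection(
        './data', 'arcene', 'higgs',
        split='train',
        transform={'arcene': scale_inputs},  # mapping form: only 'arcene' is transformed
    )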
- classmethod match(location: os.PathLike, *, task: Optional[Union[str, Iterable[str]]] = None, outputs: Optional[Union[int, Iterable[int]]] = None, classes: Optional[Union[int, Iterable[int]]] = None, **kwargs)¶
Create a dataset collection consisting of all benchmark datasets that match all given conditions. This can be used, for example, to get a collection of all binary classification datasets.
- Parameters
location (path-like) – Path to where datasets are stored/downloaded to
task (str or iterable of str, optional) – Task(s) that must be associated with the datasets
outputs (int or range, optional) – Number of outputs that datasets must have
classes (int or range, optional) –
Number of classes that classification datasets must have
Note: this will only filter out classification datasets that don’t have the correct number of classes. That is, if there are other tasks selected, they will not be filtered out by the classes filter.
**kwargs – All other keyword arguments are passed to the constructor.
- Returns
Collection of datasets matching all specified conditions
- Return type
DatasetCollection
- Raises
ValueError – If classes is specified but classification datasets are excluded using task
Notes
To do this without requiring that datasets be available/already downloaded, this class method only supports filtering based on metadata that is located in the TOML metadata file, which does not include dataset extras.
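Example (a sketch: collect every binary classification dataset in the benchmark):

    from tabben.datasets import DatasetCollection

    binary = DatasetCollection.match('./data', task='classification', classes=2)
    print(binary.tasks)  # pluralized attribute, as noted above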
- table(*columns)¶
Return selected meta-attributes of the datasets in this collection as a pandas DataFrame (not the data attributes themselves, but meta-attributes of the datasets, such as task, number of examples, and types of attributes).
Because pandas is an optional dependency, make sure you have the pandas package installed before calling this method.
- Parameters
*columns (str) – Names of meta-attributes to include (see Notes below for a list of options)
- Returns
Dataframe of meta attributes about the datasets in this collection
- Return type
pandas.DataFrame
Notes
The list of supported meta-attribute names is not yet documented (work in progress).
Other Utilities¶
- tabben.datasets.ensure_downloaded(data_dir: Union[str, bytes, os.PathLike], *datasets: str) None ¶
Downloads the specified datasets (all available datasets if none specified) into the data directory if they are not already present. This is useful in situations where this package is used in an environment without Internet access or for establishing local shared caches.
- Parameters
data_dir (path-like) – Directory to save the dataset files in
*datasets (str) – Names of datasets to download (if empty, all datasets will be downloaded)
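Example (a sketch; the dataset names and cache path are illustrative):

    from tabben.datasets import ensure_downloaded

    # pre-download a few datasets into a shared cache before going offline
    ensure_downloaded('/shared/tabben-cache', 'arcene', 'higgs')

    # or download every dataset in the benchmark
    ensure_downloaded('/shared/tabben-cache')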
- tabben.datasets.register_dataset(name: str, task: str = 'classification', *, data_location: str, persist=False, **kwargs) None ¶
Add new datasets to the benchmark at runtime (after package loading).
- Parameters
name (str) – Name of the dataset (used as a primary index, cannot be 'all')
task (str) – Which task is associated with this dataset (see allowed_tasks)
persist (bool) – Whether to save this dataset so that it persists between restarts (only for this installation)
data_location (str) – URI string pointing to the NPZ file for this dataset
outputs (int, recommended, default=1) – Number of output variables
classes (int, recommended for classification tasks, default=2) – Number of classification classes
extras_location (str) – URI string pointing to a JSON file of “extras” metadata for this dataset
**kwargs – All other keyword arguments are stored as additional metadata in the TOML file
See also
validate_dataset_file
Validate the NPZ file before adding as a new dataset
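Example (a sketch; the dataset name, file path, and URI are hypothetical):

    from tabben.datasets import register_dataset, validate_dataset_file

    # optionally check a local copy of the NPZ file first
    validate_dataset_file('my-dataset.npz')

    register_dataset(
        'my-dataset',
        task='classification',
        data_location='https://example.com/my-dataset.npz',
        outputs=1,
        classes=2,
        persist=False,
    )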
- tabben.datasets.validate_dataset_file(filepath: Union[str, bytes, os.PathLike]) None ¶
Validate an NPZ dataset file at a local path to make sure that the dataset it contains can be read as a valid dataset by this package. This function is mainly intended for interactive use, e.g. at the REPL, before registering a new dataset.
- Parameters
filepath (str or path-like) – Filepath of the NPZ dataset file
- Raises
FileNotFoundError – If the filepath does not exist
IOError – If the file cannot be read at all
DatasetFormatError – If there is an error with the format of the NPZ dataset file
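Example (a sketch; the file path is hypothetical):

    from tabben.datasets import validate_dataset_file, DatasetFormatError

    try:
        validate_dataset_file('my-dataset.npz')
    except DatasetFormatError as err:
        print(f'not a valid tabben dataset file: {err}')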
- exception tabben.datasets.DatasetFormatError¶
An exception due to an NPZ dataset file having an unexpected format (in addition to the usual NPZ file format requirements).