Musk Dataset
The Musk dataset is a chemistry dataset for classifying whether a molecule smells like musk, using only measurements taken from a single conformation of that molecule. (We use version 2 of the dataset.)
A fuller description of the dataset is available on its UCI ML Repo page.
Data Preprocessing
The dataset contains measurements for multiple conformations of each molecule, and the label (musk or not) is determined by the identity of the molecule, not by the conformation. We use a custom 80-20 train-test split in which any given molecule occurs in only one split, so conformations of the same molecule never appear in both train and test. The train and test splits also have approximately the same class imbalance.
We don’t do any further preprocessing of the data itself.
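To make the split described above concrete, here is a minimal sketch of one way to build a molecule-level split that also roughly preserves class balance. It is not necessarily the procedure used to produce our split: the function name `split_by_molecule`, its parameters, and the toy molecule names are all hypothetical, and it assumes every conformation of a molecule carries the same label. It picks about 20% of the molecules from each class for the test set, so the 80-20 ratio and the class balance hold exactly over molecules and only approximately over conformations.

```python
import random
from collections import defaultdict


def split_by_molecule(molecule_ids, labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each molecule in exactly one split.

    molecule_ids[i] and labels[i] describe conformation i (hypothetical layout).
    """
    # Assumption: all conformations of a molecule share one label.
    molecule_label = {}
    for mol, lab in zip(molecule_ids, labels):
        molecule_label[mol] = lab

    # Group unique molecules by class so each class is sampled separately.
    by_class = defaultdict(list)
    for mol, lab in molecule_label.items():
        by_class[lab].append(mol)

    rng = random.Random(seed)
    test_molecules = set()
    for lab in sorted(by_class):
        mols = sorted(by_class[lab])  # sort first so the shuffle is reproducible
        rng.shuffle(mols)
        n_test = max(1, round(test_fraction * len(mols)))
        test_molecules.update(mols[:n_test])

    # Map the molecule-level assignment back to conformation indices.
    train_idx = [i for i, m in enumerate(molecule_ids) if m not in test_molecules]
    test_idx = [i for i, m in enumerate(molecule_ids) if m in test_molecules]
    return train_idx, test_idx


# Toy data (made up): 10 positive molecules with 3 conformations each,
# 20 negative molecules with 2 conformations each.
molecule_ids, labels = [], []
for k in range(10):
    molecule_ids += [f"pos{k}"] * 3
    labels += [1] * 3
for k in range(20):
    molecule_ids += [f"neg{k}"] * 2
    labels += [0] * 2

train_idx, test_idx = split_by_molecule(molecule_ids, labels)
train_mols = {molecule_ids[i] for i in train_idx}
test_mols = {molecule_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_mols & test_mols)
```

Because molecules differ in how many conformations they have, a split like this should be checked afterwards by comparing the fraction of positive conformations in each split.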