Benchmarking

Another common reason for the rejection of machine learning papers in the applied sciences is the lack of proper benchmarks. This section will be fairly short, as what counts as an appropriate benchmark differs from discipline to discipline.

However, any time we apply a superfancy deep neural network, we need to supply benchmarks to compare our model's performance against. These benchmarks should include established methods in the field as well as simpler machine learning methods such as a linear model, a support-vector machine, or a random forest.

In [1]:
import pandas as pd
penguins = pd.read_csv('../data/penguins_clean.csv')
penguins.head()
Out[1]:
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex Species
0 39.1 18.7 181.0 MALE Adelie Penguin (Pygoscelis adeliae)
1 39.5 17.4 186.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
2 40.3 18.0 195.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
3 36.7 19.3 193.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
4 39.3 20.6 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
In [2]:
from sklearn.model_selection import train_test_split
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]

X_train, X_test, y_train, y_test = train_test_split(
    penguins[features],
    penguins[target],
    stratify=penguins[target[0]],
    train_size=.7,
    random_state=42,
)
X_train
Out[2]:
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex
221 51.1 16.3 220.0 MALE
315 49.8 17.3 198.0 FEMALE
262 46.8 14.3 215.0 FEMALE
191 45.5 13.9 210.0 FEMALE
8 38.6 21.2 191.0 MALE
... ... ... ... ...
9 34.6 21.1 198.0 MALE
96 37.7 16.0 183.0 FEMALE
184 48.7 15.7 208.0 MALE
212 43.5 14.2 220.0 FEMALE
64 33.5 19.0 190.0 FEMALE

233 rows × 4 columns

Dummy Classifiers

One of the easiest ways to build a benchmark is to ensure that our model performs better than random guessing.

Tip: If our model is effectively as good as a coin flip, it's a bad model.
However, sometimes it isn't obvious how good or bad a model actually is. Take our penguin data: what counts as "random classification" on three classes that aren't equally distributed?
In [3]:
y_train.reset_index().groupby(["Species"]).count()
Out[3]:
index
Species
Adelie Penguin (Pygoscelis adeliae) 102
Chinstrap penguin (Pygoscelis antarctica) 47
Gentoo penguin (Pygoscelis papua) 84
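
To see these counts as proportions (and hence the accuracy a model that always predicts the majority class would achieve), we can normalise them. A minimal sketch, using the same training split as above:

In [ ]:
# Class proportions in the training split; the largest value is the accuracy
# that always predicting the majority class would achieve.
y_train[target[0]].value_counts(normalize=True)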

We can use scikit-learn's DummyClassifier (and DummyRegressor for regression tasks) to show what a model that ignores the input features entirely would predict. In recent scikit-learn versions the default strategy is "prior", which always predicts the most frequent class, so the score below is simply the share of Adelie penguins in the test set.

In [4]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[4]:
0.43564356435643564
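
The dummy estimators support several strategies, and it can be instructive to compare them. A minimal sketch, assuming the same split as above; the regression part uses made-up toy data purely to illustrate DummyRegressor:

In [ ]:
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Compare a few dummy classification strategies on the penguin split:
# "most_frequent" always predicts the majority class, "stratified" samples
# labels according to the training class frequencies, and "uniform" guesses
# each class with equal probability.
for strategy in ["most_frequent", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    print(f"{strategy:>13}: {dummy.score(X_test, y_test):.3f}")

# For regression tasks, DummyRegressor plays the same role, e.g. always
# predicting the mean of the training targets (toy data, illustration only).
rng = np.random.default_rng(42)
X_toy, y_toy = rng.normal(size=(100, 3)), rng.normal(size=100)
dummy_reg = DummyRegressor(strategy="mean").fit(X_toy[:70], y_toy[:70])
print("DummyRegressor R^2:", round(dummy_reg.score(X_toy[70:], y_toy[70:]), 3))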

Benchmark Datasets

Another great resource is benchmark datasets.

Most fields have started creating benchmark datasets to test new methods in a controlled environment. Unfortunately, it still happens that results are over-reported because models weren't adequately evaluated, as we saw in notebook 1. Nevertheless, because both the code and the data are available, it is easy to reproduce the results and quickly check how legitimate the reported scores are.

Examples: ImageNet for image classification, WeatherBench for data-driven weather forecasting, or the UCI Machine Learning Repository for general tabular tasks.

Domain Methods

Any method is stronger if it is verified against standard methods in the field.

A weather forecast post-processing method should be evaluated against a standard for forecast post-processing.

This is where domain expertise is important.

Linear and Standard Models

In addition to the Dummy methods, we also want to evaluate our fancy solutions against very simple models.

Personally, I like using linear models (e.g. logistic regression for classification) and random forests, as they are simple, fast, and well understood.

As an exercise, try implementing these baseline models and compare them against the support-vector machine with preprocessing; one possible starting point is sketched after the empty cell below.

In [ ]:
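
A minimal sketch of such a comparison, assuming the penguins_clean.csv data and the train/test split from above (the preprocessing in the earlier notebooks may differ in detail):

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Shared preprocessing: scale the numeric columns and one-hot encode "Sex",
# so every model sees identically prepared data.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
])

# Simple baselines alongside the "fancy" support-vector machine.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}

for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocessor), ("model", model)])
    pipe.fit(X_train, y_train[target[0]])
    print(f"{name:>19}: {pipe.score(X_test, y_test[target[0]]):.3f}")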