Another common reason for rejections of machine learning papers in applied science is the lack of proper benchmarks. This section will be fairly short, as appropriate benchmarks differ from discipline to discipline.
However, any time we apply a superfancy deep neural network, we need to supply a benchmark to compare its performance against. These benchmarks should include established methods in the field and simpler machine learning methods like a linear model, a support-vector machine or a random forest.
    import pandas as pd

    penguins = pd.read_csv('../data/penguins_clean.csv')
    penguins.head()
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Sex | Species |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 1 | 39.5 | 17.4 | 186.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 2 | 40.3 | 18.0 | 195.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 3 | 36.7 | 19.3 | 193.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 4 | 39.3 | 20.6 | 190.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
    from sklearn.model_selection import train_test_split

    num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
    cat_features = ["Sex"]
    features = num_features + cat_features
    target = ["Species"]

    X_train, X_test, y_train, y_test = train_test_split(
        penguins[features],
        penguins[target],
        stratify=penguins[target],
        train_size=.7,
        random_state=42,
    )
    X_train
[Output: a DataFrame with the columns Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm) and Sex; 233 rows × 4 columns]
One of the easiest ways to build a benchmark is to ensure that our model performs better than random.
| Species | Count |
|---|---|
| Adelie Penguin (Pygoscelis adeliae) | 102 |
| Chinstrap penguin (Pygoscelis antarctica) | 47 |
| Gentoo penguin (Pygoscelis papua) | 84 |
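To make "better than random" concrete, the class counts above can be turned into chance-level accuracies by hand. The sketch below is only an illustration: the shortened class names are my own labels, and the numbers assume that unseen data has the same class balance as these counts (which sum to 233, the size of the training split).

```python
# Class counts copied from the table above; the short names are illustrative.
class_counts = {
    "Adelie": 102,
    "Chinstrap": 47,
    "Gentoo": 84,
}

total = sum(class_counts.values())
priors = {name: count / total for name, count in class_counts.items()}

# Always predicting the most frequent class is correct whenever that class occurs.
majority_accuracy = max(priors.values())

# Guessing each class with probability equal to its frequency is correct
# with probability sum(p_i ** 2).
stratified_accuracy = sum(p ** 2 for p in priors.values())

# Guessing uniformly at random is correct with probability 1 / number of classes.
uniform_accuracy = 1 / len(class_counts)

print(f"majority class:   {majority_accuracy:.3f}")    # 0.438
print(f"stratified guess: {stratified_accuracy:.3f}")  # 0.362
print(f"uniform guess:    {uniform_accuracy:.3f}")     # 0.333
```

Any model we report should clearly beat the largest of these numbers, not just the 1-in-3 uniform guess.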
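We can use the DummyClassifier to show what a model that ignores the input features would predict. With its default settings in recent scikit-learn versions (strategy "prior"), it always predicts the most frequent class, so its score is the majority-class share rather than a truly random guess.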
    from sklearn.dummy import DummyClassifier

    clf = DummyClassifier()
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
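The dummy baseline depends on which strategy is chosen, so it can be worth reporting more than one. The following is a minimal sketch rather than the notebook's own code: it uses synthetic noise features and made-up labels with a class balance similar to the penguins, purely so that it runs without the CSV file. With the real data, X_train, X_test, y_train and y_test from the split above would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the penguin data (an assumption for illustration):
# the features are pure noise, the labels roughly follow the penguin class balance.
rng = np.random.default_rng(42)
X = rng.normal(size=(333, 3))
y = rng.choice(["Adelie", "Chinstrap", "Gentoo"], size=333, p=[0.44, 0.20, 0.36])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.7, random_state=42
)

# Fit one dummy model per strategy and record its test accuracy.
scores = {}
for strategy in ["most_frequent", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    scores[strategy] = dummy.score(X_test, y_test)

for strategy, score in scores.items():
    print(f"{strategy:>13}: {score:.2f}")
```

The "most_frequent" strategy usually gives the hardest dummy score to beat on imbalanced data, which makes it a sensible floor for accuracy.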
Another great tool to use is benchmark datasets.
Many fields have started creating benchmark datasets to test new methods in a controlled environment. Unfortunately, it still happens that results are over-reported because models weren't adequately evaluated, as seen in notebook 1. Nevertheless, when both the code and data are available, it's easy to reproduce the results, so we can quickly check how legitimate reported scores are.
- ImageNet in computer vision
- WeatherBench in meteorology
- ChestX-ray8 in medical imaging
Any method is stronger if it is verified against standard methods in the field.
A weather forecast post-processing method should be evaluated against a standard for forecast post-processing.
This is where domain expertise is important.
Linear and Standard Models
In addition to the Dummy methods, we also want to evaluate our fancy solutions against very simple models.
Personally, I like using:
As an exercise, try implementing baseline models to compare against the support-vector machine with preprocessing.
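One possible starting point for the exercise is sketched below. It is not the notebook's solution: the data are synthetic (generated with make_classification), the choice of logistic regression and a random forest as baselines is my assumption, and the support-vector machine here only uses standard scaling. For the penguin data, the categorical Sex column would additionally need encoding before any of these models can be fitted.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data as a stand-in (assumption); swap in the
# penguin training split when working through the exercise.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Baselines of increasing complexity, plus the model we actually care about.
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Evaluate every model with the same 5-fold cross-validation.
results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    results[name] = fold_scores.mean()

for name, score in results.items():
    print(f"{name:>20}: {score:.3f}")
```

Evaluating all models with the same cross-validation setup keeps the comparison fair. If the fancy model does not clearly beat the simple ones, the simple ones are usually the better thing to report.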