Benchmarking

Another common reason for the rejection of machine learning papers in the applied sciences is the lack of proper benchmarks. This section will be fairly short, as what counts as an appropriate benchmark differs from discipline to discipline.

However, any time we apply a superfancy deep neural network, we need to supply benchmarks to compare our model's performance against. These benchmarks should include established methods in the field as well as simpler machine learning methods such as a linear model, a support-vector machine, or a random forest.

In [1]:
import pandas as pd
penguins = pd.read_csv('../data/penguins_clean.csv')
penguins.head()
Out[1]:
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex Species
0 39.1 18.7 181.0 MALE Adelie Penguin (Pygoscelis adeliae)
1 39.5 17.4 186.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
2 40.3 18.0 195.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
3 36.7 19.3 193.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
4 39.3 20.6 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
In [2]:
from sklearn.model_selection import train_test_split
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]

X_train, X_test, y_train, y_test = train_test_split(
    penguins[features],
    penguins[target],
    stratify=penguins[target[0]],
    train_size=.7,
    random_state=42,
)
X_train
Out[2]:
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex
221 51.1 16.3 220.0 MALE
315 49.8 17.3 198.0 FEMALE
262 46.8 14.3 215.0 FEMALE
191 45.5 13.9 210.0 FEMALE
8 38.6 21.2 191.0 MALE
... ... ... ... ...
9 34.6 21.1 198.0 MALE
96 37.7 16.0 183.0 FEMALE
184 48.7 15.7 208.0 MALE
212 43.5 14.2 220.0 FEMALE
64 33.5 19.0 190.0 FEMALE

233 rows × 4 columns

Dummy Classifiers

One of the easiest ways to build a benchmark is to ensure that our model performs better than random guessing.

Tip: If our model is effectively as good as a coin flip, it's a bad model.
However, sometimes it isn't obvious how good or bad a model actually is. Take our penguin data: what counts as "random classification" on three classes that aren't equally distributed?
In [3]:
y_train.reset_index().groupby(["Species"]).count()
Out[3]:
index
Species
Adelie Penguin (Pygoscelis adeliae) 102
Chinstrap penguin (Pygoscelis antarctica) 47
Gentoo penguin (Pygoscelis papua) 84
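
To see these counts as proportions (and hence the accuracy a model that always predicts the majority class would achieve), we can normalise them. A minimal sketch, using the same training split as above:

In [ ]:
# Class proportions in the training split; the largest value is the accuracy
# that always predicting the majority class would achieve.
y_train[target[0]].value_counts(normalize=True)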

We can use scikit-learn's DummyClassifier (and DummyRegressor for regression tasks) to show what a model that ignores the input features entirely would predict. In recent scikit-learn versions the default strategy is "prior", which always predicts the most frequent class, so the score below is simply the share of Adelie penguins in the test set.

In [4]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[4]:
0.43564356435643564
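
The dummy estimators support several strategies, and it can be instructive to compare them. A minimal sketch, assuming the same split as above; the regression part uses made-up toy data purely to illustrate DummyRegressor:

In [ ]:
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Compare a few dummy classification strategies on the penguin split:
# "most_frequent" always predicts the majority class, "stratified" samples
# labels according to the training class frequencies, and "uniform" guesses
# each class with equal probability.
for strategy in ["most_frequent", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    print(f"{strategy:>13}: {dummy.score(X_test, y_test):.3f}")

# For regression tasks, DummyRegressor plays the same role, e.g. always
# predicting the mean of the training targets (toy data, illustration only).
rng = np.random.default_rng(42)
X_toy, y_toy = rng.normal(size=(100, 3)), rng.normal(size=100)
dummy_reg = DummyRegressor(strategy="mean").fit(X_toy[:70], y_toy[:70])
print("DummyRegressor R^2:", round(dummy_reg.score(X_toy[70:], y_toy[70:]), 3))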

Benchmark Datasets

Another great resource is benchmark datasets.

Most fields have started creating benchmark datasets to test new methods in a controlled environment. Unfortunately, it still happens that results are over-reported because models weren't adequately evaluated, as we saw in notebook 1. Nevertheless, because both the code and the data are available, it is easy to reproduce the results and quickly check how legitimate the reported scores are.

Examples: ImageNet for image classification, WeatherBench for data-driven weather forecasting, or the UCI Machine Learning Repository for general tabular tasks.

Domain Methods

Any method is stronger if it is verified against standard methods in the field.

A weather forecast post-processing method should be evaluated against a standard for forecast post-processing.

This is where domain expertise is important.

Linear and Standard Models

In addition to the Dummy methods, we also want to evaluate our fancy solutions against very simple models.

Personally, I like using linear models (e.g. logistic regression for classification) and random forests, as they are simple, fast, and well understood.

As an exercise, try implementing these baseline models and compare them against the support-vector machine with preprocessing; one possible starting point is sketched after the empty cell below.

In [ ]:
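
A minimal sketch of such a comparison, assuming the penguins_clean.csv data and the train/test split from above (the preprocessing in the earlier notebooks may differ in detail):

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Shared preprocessing: scale the numeric columns and one-hot encode "Sex",
# so every model sees identically prepared data.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
])

# Simple baselines alongside the "fancy" support-vector machine.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}

for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocessor), ("model", model)])
    pipe.fit(X_train, y_train[target[0]])
    print(f"{name:>19}: {pipe.score(X_test, y_test[target[0]]):.3f}")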