R2, MSE, L1, MAE... WTF?

The right metric makes or breaks your machine learning model. Especially in real-world applications!

Here's how to make your model perform 🦾

Kept intuitive, because we've all had enough classes.

(and there's a personal rec from a Stanford professor)

## 🎭 Classic metrics have many names

Here's the concept key:

L1 = ℓ1 = Mean Absolute Error = MAE = Lasso = Manhattan = Taxicab

L2 = ℓ2 = Mean Squared Error = MSE = Ridge = Euclidean = Tikhonov**

Math, physics, and stats clash in naming. 🤷

(** strictly speaking, the L2 norm is the √ of the summed squares; the √ of MSE is the RMSE)

## 📈 MSE is the default

Physicists and ML practitioners love it.

Just subtract your measurement from the expected/label value, square it to make it positive, and take the average of all those values.

In math symbols that is: Σᵢ (ŷᵢ - yᵢ)² / n
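That formula is a one-liner in plain Python; a minimal sketch with made-up toy numbers:

```python
# Mean Squared Error: average of squared differences between
# predictions (y_hat) and labels (y).
def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

# Toy example (invented numbers): predictions vs. labels.
predictions = [2.5, 0.0, 2.0, 8.0]
labels      = [3.0, -0.5, 2.0, 7.0]
print(mse(predictions, labels))  # -> 0.375
```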

But what does that do?!

### ✒ What it looks like

MSE basically forms the classic f(x) = x² parabola from school.

Find the distance between your expected and measured value on the horizontal x-axis and get your corresponding MSE value from the parabola above.

MSE gets big quick

### 📐 Geometry!

If you squint, this looks like the Pythagorean theorem for 🔻: a² + b² = c²

Sum a bunch of squares to get the longest side of 📐!

So if we take the square root of MSE, we get the direct path between two points (the averaging just rescales it by a factor of √n).

(Physicists often prefer the root MSE 👉 RMSE.)
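A quick numerical sanity check of that Pythagorean view, sketched with a made-up 3-4-5 error vector:

```python
import math

# Root Mean Squared Error: square root of the MSE.
def rmse(y_hat, y):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y))

errors = [3.0, 4.0]                 # per-dimension differences
print(math.hypot(*errors))          # Euclidean distance: 5.0

# RMSE divides by n before the root, so it is that distance / sqrt(n).
print(rmse([3.0, 4.0], [0.0, 0.0]) * math.sqrt(2))  # ≈ 5.0 again
```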

### 🎁 Free statistics!

I love it when I don't have to do statistics myself.

MSE secretly sneaks in the assumption that our errors are distributed like a bell curve (you may know it as the Gaussian or normal distribution): minimizing MSE is the same as maximum-likelihood estimation under Gaussian noise.

In our real world, this assumption works a LOT of the time.

### 🌟 Big & Small errors

In our ML model, we obviously want perfect performance.

But since that's not possible, we have to decide how to treat errors.

Imagine these errors:

• 0.1² = 0.01
• 1² = 1
• 10² = 100
• 100² = 10,000

Large errors & outliers explode when we square them!

### 🤖 The robot's verdict?

MSE is super fast to calculate. It's just a square, school kids learn that.

The model is kept in check because large errors explode harder.

A simple square is convex and ALWAYS has a derivative, and that is perfect for ML. No weird behaviours. Just vibes.

## 📣 Intermission!

If you work in natural language processing (NLP), you probably came across the Brier score.

If you squint, the Brier score is actually MSE applied to probabilities!

(looking at only one dimension that is)
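A minimal sketch of that squint, with hypothetical forecast probabilities (the numbers are invented):

```python
# Brier score: MSE applied to predicted probabilities vs. 0/1 outcomes.
def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

probs    = [0.9, 0.2, 0.7]  # hypothetical: probability each event happens
outcomes = [1,   0,   1]    # what actually happened
print(brier_score(probs, outcomes))  # ≈ 0.047, lower is better
```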

## ⁉ Why do we even have something else ⁉

MSE sounds awesome, but imagine if Google Maps always told you the straight-line flight distance. Through oceans, buildings, and fields? You might even consider switching to Bing.

The Mean Absolute Error, on the other hand, is very tidy about walking around obstacles.

### 📈 MAE is fast and clean

Often the right choice, but less loved by physicists.

Subtract your measurement from the label, take the absolute value to make it positive, and take the average of all those values.

In math symbols that is: Σᵢ |ŷᵢ - yᵢ| / n
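The same kind of one-liner sketch as for MSE, with made-up numbers where the last label is bumped to 17.0 to show how gently MAE treats an outlier:

```python
# Mean Absolute Error: average of absolute differences.
def mae(y_hat, y):
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

# Toy example (invented numbers) with one big outlier at the end.
predictions = [2.5, 0.0, 2.0, 8.0]
labels      = [3.0, -0.5, 2.0, 17.0]
print(mae(predictions, labels))  # -> 2.5 (MSE on the same data: 20.375)
```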

So, what does that do?!

### 📏 Linearity!

Simply making our errors positive and summing them means that every error counts linearly, no matter its size.

This is great in real-world ML with data on multiple scales.

A 1% error for

• 100 = 1
• 10,000 = 100

Under MSE, those same errors would count as 1 and 10,000.

### 🎢 Sparsity!

MAE gets an equal amount of benefit out of getting the small and large values right.

That means it is great for data with intermittent signals and a bunch of 0s in between. Sparse data!

Getting the 0s right is as important to the model as the spikes.

### 📉 Math gets in the way

Machine learning models really like it if things are differentiable, hence why MSE works so well.

MAE is great too, except for that pesky kink at x=0.

That kink can sometimes keep our model from learning.
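You can see the kink numerically; a tiny sketch using a symmetric finite-difference slope of |x|:

```python
# The slope of |x| jumps from -1 to +1 at x = 0: there is no single
# derivative there, which is what trips up gradient-based learning.
def abs_slope(x, h=1e-6):
    return (abs(x + h) - abs(x - h)) / (2 * h)

print(abs_slope(-0.5))  # ≈ -1.0 on the left of the kink
print(abs_slope(0.5))   # ≈ +1.0 on the right
print(abs_slope(0.0))   # 0.0: the symmetric estimate papers over the jump
```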

### 🤖 The robot's verdict on MAE?

It's blazing fast to calculate.

The linearity is great for data over multiple scales.

Sparsity is great for intermittent signals and getting those 0s right. This makes it a popular choice for regularization too.

## 🎓 You promised a Stanford professor!

Back at the Technical University of Denmark, I attended a talk by Stephen Boyd about Convex Optimization.

We had a chat about CVXPY, ML, and metrics, and he suggested checking out the Huber loss.

And yes, he's the guy Andrej Karpathy mentions.

### 🎉 Why the Huber loss?

It combines the best of MAE and MSE!

It is quadratic for small errors and linear for large errors.

That means outliers are handled well, it works over multiple scales, and ML loves the differentiable center.
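A minimal sketch of the standard piecewise Huber definition, with a hypothetical threshold delta = 1.0:

```python
# Huber loss: MSE-like inside |error| <= delta, MAE-like outside,
# and smooth at the joint (the pieces match in value and slope).
def huber(error, delta=1.0):
    if abs(error) <= delta:
        return 0.5 * error ** 2          # quadratic region
    return delta * (abs(error) - 0.5 * delta)  # linear region

print(huber(0.5))   # small error, quadratic region -> 0.125
print(huber(10.0))  # outlier, grows only linearly -> 9.5
```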

## Conclusion

• Mean Squared Error is a great default

• MSE uses neat statistics

• MSE is prone to outliers

• Mean Absolute Error is a great alternative

• MAE scales well and is robust

• MAE for ML struggles at x=0

• Huber loss combines both

• Choosing metrics is important