R2, MSE, L1, MAE... WTF?

The right metric makes or breaks your machine learning model, especially in real-world applications!

Here's how to make your model perform 🦾

Explained intuitively, because we've all sat through enough classes

(and there's a personal rec from a Stanford professor)

🎭 Classic metrics have many names

Here's the concept key:

L1 = ℓ1 = Mean Absolute Error = MAE = Lasso = Manhattan = Taxicab

L2 = ℓ2 = Mean Squared Error = MSE = Ridge = Euclidean = Tikhonov**

Math, physics, and stats clash in naming. 🤷

(** Strictly speaking, the L2 norm is the square root of the summed squared errors, so √(n · MSE))

📈 MSE is the default

Physicists and ML practitioners love it.

Just subtract your measurement from the expected/label value, square it to make it positive and take the average of all those values.

In math symbols that is: Σ (ŷₗ - yₗ)² / n
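Here's that recipe as a minimal NumPy sketch. The function name `mse` and the toy numbers are made up for illustration, and it assumes predictions and labels are arrays of the same length:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: subtract, square, average."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical toy values, chosen only for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Errors are -0.5, 0, 2, 1 -> squares 0.25, 0, 4, 1 -> mean 1.3125
print(mse(y_pred, y_true))
```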

But what does that do?!

✒ What it looks like

MSE basically forms the classic f(x) = x² parabola from school.

Find the distance between your expected and measured value on the horizontal x-axis and get your corresponding MSE value from the parabola above.

MSE gets big quick

[Figure: Errors in MSE. The parabola f(x) = x², plotted for errors from -4 to 4 with values from 0 to 16.]

📐 Geometry!

If you squint, this looks like the Pythagorean theorem for 🔻: a² + b² = c²

Sum a bunch of squares to get the square of the longest side of 📐!

So if we take the square root of the summed squares, we get the direct path between two points. The square root of MSE is that same distance, just scaled by 1/√n.

(Physicists often prefer the root MSE 👉 RMSE.)
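To make the geometry concrete, here's a small sketch with the classic 3-4-5 triangle. The two points are hypothetical; the only claim is that RMSE equals the Euclidean distance divided by √n:

```python
import numpy as np

# Hypothetical "label" and "prediction" with two coordinates each.
y_true = np.array([0.0, 0.0])
y_pred = np.array([3.0, 4.0])

diff = y_pred - y_true
euclidean = np.linalg.norm(diff)      # sqrt(3² + 4²), the direct path
rmse = np.sqrt(np.mean(diff ** 2))    # sqrt((3² + 4²) / 2)

# RMSE is the Euclidean distance scaled by 1/sqrt(n).
assert np.isclose(rmse, euclidean / np.sqrt(len(diff)))
print(euclidean, rmse)
```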

🎁 Free statistics!

I love it when I don't have to do statistics myself.

MSE secretly sneaks in the assumption that our errors are distributed like a bell curve (you may know her as Gaussian or normal distribution).

In the real world, this assumption holds surprisingly often.
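For the curious, here's where the bell curve hides. This is a sketch under one explicit assumption that the thread doesn't spell out: every label equals the prediction plus independent Gaussian noise with a fixed variance σ². The index l runs over the n samples, as in the MSE formula above, and `align*` needs the amsmath package.

```latex
% Assumption: y_l = \hat{y}_l + \varepsilon_l,
% with independent noise \varepsilon_l \sim \mathcal{N}(0, \sigma^2)
\begin{align*}
p(y \mid \hat{y})
  &= \prod_{l=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(\hat{y}_l - y_l)^2}{2\sigma^2}\right) \\
-\log p(y \mid \hat{y})
  &= \frac{n}{2}\log\left(2\pi\sigma^2\right)
     + \frac{1}{2\sigma^2}\sum_{l=1}^{n} (\hat{y}_l - y_l)^2 \\
  &= \frac{n}{2}\log\left(2\pi\sigma^2\right)
     + \frac{n}{2\sigma^2}\,\mathrm{MSE}
\end{align*}
```

The first term doesn't depend on the predictions, so under this assumption making the data as likely as possible is the same as minimising MSE.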

🌟 Big & Small errors

In our ML model, we obviously want perfect performance.

But since that's not possible, we have to decide how to treat errors.

Imagine these errors:

  • 0.1² = 0.01
  • 1² = 1
  • 10² = 100
  • 100² = 10000

Large errors & outliers explode when we square them!
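A quick sketch of how lopsided this gets, using the same four errors. The "share" variables are just illustrative names for how much of the total each error contributes:

```python
import numpy as np

errors = np.array([0.1, 1.0, 10.0, 100.0])

squared = errors ** 2
share_squared = squared / squared.sum()    # contribution to the summed squares
share_absolute = errors / errors.sum()     # contribution to the summed absolutes

print(squared)              # 0.01, 1, 100, 10000
print(share_squared[-1])    # the largest error alone is about 99% of the total
print(share_absolute[-1])   # without squaring it is about 90%
```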

🤖 The robot's verdict?

MSE is super fast to calculate. It's just a square, school kids learn that.

The model is kept in check because large errors are punished much harder than small ones.

A simple square is convex and ALWAYS has a derivative and that is perfect for ML. No weird behaviours. Just vibes.

📣 Intermission!

If you work in natural language processing (NLP), you have probably come across the Brier score.

If you squint, the Brier score is actually MSE applied to probabilities!

(looking at only one dimension that is)
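Here's that claim as a sketch for the binary case: predicted probabilities on one side, outcomes coded as 0 or 1 on the other. The function name `brier_score` and the numbers are hypothetical:

```python
import numpy as np

def brier_score(prob_pred, outcome):
    """Binary Brier score: MSE between predicted probabilities and 0/1 outcomes."""
    prob_pred = np.asarray(prob_pred, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return np.mean((prob_pred - outcome) ** 2)

# Hypothetical forecasts and what actually happened (1 = event occurred).
probs = [0.9, 0.2, 0.6]
outcomes = [1, 0, 0]

# Squared differences: 0.01, 0.04, 0.36 -> mean of those three
print(brier_score(probs, outcomes))
```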

⁉ Why do we even have something else ⁉

MSE sounds awesome, but imagine if Google Maps always told you the straight-line flight distance. Through oceans, buildings and fields? You might even consider switching to Bing.

The Mean Absolute Error, by contrast, neatly walks around the obstacles.

📈 MAE is fast and clean

Often the right choice, but less loved by physicists.

Subtract your measurement from the label, take the absolute value to make it positive, and average all those values.

In math symbols that is: Σ |ŷₗ - yₗ| / n
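The same recipe as a minimal NumPy sketch. The function name `mae` and the toy numbers are made up for illustration:

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error: subtract, take the absolute value, average."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean(np.abs(y_pred - y_true))

# Hypothetical toy values, chosen only for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Absolute errors are 0.5, 0, 2, 1 -> mean 0.875
print(mae(y_pred, y_true))
```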

So, what does that do?!

📏 Linearity!

Simply making our errors positive and summing them means every error counts linearly towards the total.

This is great in real-world ML with data on multiple scales.

A 1% error on a value of

  • 100 is an absolute error of 1
  • 10000 is an absolute error of 100

Squared, as in MSE, these would be 1 and 10000.
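You can check those numbers yourself. A small sketch, assuming both predictions are exactly 1% too high:

```python
import numpy as np

true_values = np.array([100.0, 10_000.0])
predictions = true_values * 1.01    # assumption: each prediction is 1% too high

abs_errors = np.abs(predictions - true_values)     # what MAE averages
sq_errors = (predictions - true_values) ** 2       # what MSE averages

print(abs_errors)   # roughly 1 and 100
print(sq_errors)    # roughly 1 and 10000
```

Same relative error, but after squaring the larger value outweighs the smaller one by a factor of 10000 instead of 100.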

[Figure: Errors in MAE. The function f(x) = |x|, plotted for errors from -4 to 4 with values from 0 to 4.]

🎢 Sparsity!

MAE gets an equal amount of benefit out of getting the small and large values right.

That means it is great for data with intermittent signals and mostly 0s otherwise. Sparse data!

Getting the 0s right is as important to the model as the spikes.
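One way to see this, as a sketch: ask which single constant prediction each metric prefers for a mostly-zero signal. The signal is hypothetical. The known result is that MSE is minimised by the mean and MAE by the median, and the grid search below just confirms it on this toy data:

```python
import numpy as np

# Hypothetical sparse signal: mostly zeros with a single spike.
signal = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

# Score every constant prediction c on a grid with both metrics.
candidates = np.linspace(0.0, 10.0, 101)
mse_scores = np.array([np.mean((signal - c) ** 2) for c in candidates])
mae_scores = np.array([np.mean(np.abs(signal - c)) for c in candidates])

best_for_mse = candidates[np.argmin(mse_scores)]   # lands on the mean
best_for_mae = candidates[np.argmin(mae_scores)]   # lands on the median

print(best_for_mse, np.mean(signal))
print(best_for_mae, np.median(signal))
```

MSE drags the prediction towards the spike and gets every 0 wrong, while MAE keeps the 0s exactly right.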

📉 Math gets in the way

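Machine learning models really like it when things are differentiable, which is why MSE works so well.

MAE is great too, except for that pesky kink at x=0.

At the kink the derivative is undefined, and everywhere else the slope stays the same size no matter how small the error gets. That can sometimes keep our model from settling on a solution.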
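A sketch of the slopes a gradient-based optimiser would see for single errors. One assumption to flag: at exactly 0 the derivative of |x| is undefined, and `np.sign` returning 0 there is just one convenient convention (a valid subgradient), not the only possible choice:

```python
import numpy as np

errors = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

grad_squared = 2 * errors          # derivative of x²: shrinks as the error shrinks
grad_absolute = np.sign(errors)    # slope of |x|: always -1 or +1 away from 0

print(grad_squared)    # -4, -1, 0, 1, 4
print(grad_absolute)   # -1, -1, 0, 1, 1
```

The squared error gives ever gentler nudges near the minimum, while the absolute error pushes just as hard at an error of 0.5 as at 2 and then flips sign abruptly.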

🤖 The robot's verdict on MAE?

It's blazing fast to calculate.

The linearity is great for data over multiple scales.

Sparsity is great for intermittent signals and getting those 0s right. This makes it a popular choice for regularization too.

🎓 You promised a Stanford professor!

Back at the Technical University of Denmark, I attended a talk by Stephen Boyd about Convex Optimization.

We had a chat about CVXPY, ML and metrics, and he suggested checking out the Huber loss.

And yes, it's the guy Andrej Karpathy mentions.

🎉 Why the Huber loss?

It combines the best of MAE and MSE!

It is quadratic for small errors and linear for large errors.

That means outliers are handled well and it works across large scales, while ML still gets the differentiable center it loves.
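Here's a minimal sketch of the Huber loss for single errors, using the common textbook definition: ½·x² while |x| ≤ δ, and δ·(|x| - ½·δ) beyond that. The function name `huber` and the default δ=1 are choices made for this example:

```python
import numpy as np

def huber(error, delta=1.0):
    """Huber loss per error: quadratic inside [-delta, delta], linear outside."""
    error = np.asarray(error, dtype=float)
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2                 # MSE-like center
    linear = delta * (abs_err - 0.5 * delta)     # MAE-like tails
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 4.0])
print(huber(errors))               # 0.125, 0.5, 3.5
print(huber(errors, delta=2.0))    # 0.125, 0.5, 6.0
print(errors ** 2)                 # 0.25, 1, 16 for comparison
```

The two pieces meet at |x| = δ with the same value and the same slope, so there is no kink. δ decides where "small" ends and "outlier" begins, which makes it a knob you have to pick for your data.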

[Figure: Huber Loss vs MSE. Curves for f(x) = x², f(x) = |x|, Huber δ=1 and Huber δ=2, plotted for errors from -4 to 4.]

Conclusion

  • Mean Squared Error is a great default
  • MSE uses neat statistics
  • MSE is sensitive to outliers

  • Mean Absolute Error is a great alternative
  • MAE scales well and is robust
  • MAE for ML struggles with the kink at x=0

  • Huber loss combines both

  • Choosing metrics is important