R2, MSE, L1, MAE... WTF?
The right metric makes or breaks your machine learning model, especially in real-world applications!
Here's how to make your model perform.
Explained intuitively, because we've all had enough classes.
(and there's a personal rec from a Stanford professor)
Classic metrics have many names
Here's the concept key:
L1 = ℓ1 = Mean Absolute Error = MAE = Lasso = Manhattan = Taxicab
L2 = ℓ2 = Mean Squared Error = MSE = Ridge = Euclidean = Tikhonov**
Math, physics, and stats clash in naming.
(** Strictly, the L2 norm is the square root of the sum of squares, i.e. the root of MSE without the 1/n.)
MSE is the default
Physicists and ML practitioners love it.
Just subtract your measurement from the expected/label value, square it to make it positive and take the average of all those values.
In math symbols that is: MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²
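Here's a minimal NumPy sketch (the numbers are just made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # labels
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # model outputs

# mean of the squared differences
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # 0.375
```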
But what does that do?!
What it looks like
MSE basically forms the classic f(x) = xΒ² parabola from school.
Find the distance between your expected and measured value on the horizontal x-axis and get your corresponding MSE value from the parabola above.
MSE gets big quick
[Plot: "Errors in MSE", showing the parabola f(x) = x² over errors from −4 to 4.]
Geometry!
If you squint, this looks like the Pythagorean theorem for right triangles: a² + b² = c²
Sum a bunch of squares to get the longest side of the triangle!
So if we take the square root of MSE, we get the direct path between two points.
(Physicists often prefer the root of MSE: RMSE.)
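A tiny sketch of that geometric reading (the two points a and b are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([4.0, 6.0, 2.0])

# sum the squared differences, then take the root: the straight-line distance
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)                   # 5.0
print(np.linalg.norm(a - b))  # same thing: the Euclidean (L2) distance

# RMSE is the same idea, just averaged before taking the root
rmse = np.sqrt(np.mean((a - b) ** 2))
```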
Free statistics!
I love it when I don't have to do statistics myself.
MSE secretly sneaks in the assumption that our errors are distributed like a bell curve (you may know her as the Gaussian or normal distribution).
In the real world, this assumption holds a LOT of the time.
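Why? If each error ŷᵢ − yᵢ comes from a Gaussian with variance σ², the likelihood of your data is proportional to Π exp(−(ŷᵢ − yᵢ)² / 2σ²). Take the log, flip the sign, and maximizing that likelihood is exactly minimizing Σ (ŷᵢ − yᵢ)². Minimizing MSE is maximum likelihood under Gaussian noise, for free.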
Big & small errors
In our ML model, we obviously want perfect performance.
But since that's not possible, we have to decide how to treat errors.
Imagine these errors: 0.1² = 0.01, 1² = 1, 10² = 100, 100² = 10000.
Large errors & outliers explode when we square them!
The robot's verdict?
MSE is super fast to calculate. It's just a square, school kids learn that.
The model is kept in check because large errors are punished much harder.
A simple square is convex and ALWAYS has a derivative, which is perfect for ML. No weird behaviours. Just vibes.
Intermission!
If you work in natural language processing (NLP), you have probably come across the Brier score.
If you squint, the Brier score is actually MSE applied to probabilities!
(looking at only one dimension that is)
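A minimal sketch of that connection (the probabilities and labels are invented):

```python
import numpy as np

p_pred = np.array([0.9, 0.2, 0.65, 0.1])  # predicted probability of the positive class
y_true = np.array([1,   0,   1,    0])    # what actually happened (0 or 1)

# Brier score: just MSE between probabilities and outcomes
brier = np.mean((p_pred - y_true) ** 2)
print(brier)  # 0.045625
```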
Why do we even have something else?
MSE sounds awesome, but imagine if Google Maps always gave you the flight distance. Through oceans, buildings, and fields? You might even consider switching to Bing.
The Mean Absolute Error, on the other hand, tidily walks around obstacles.
MAE is fast and clean
Often the right choice, but less loved by physicists.
Subtract your measurement from the label, take the absolute value to make it positive, and take the average of all those values.
In math symbols that is: MAE = (1/n) Σᵢ |ŷᵢ − yᵢ|
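And the matching NumPy sketch (same made-up numbers as before):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# mean of the absolute differences
mae = np.mean(np.abs(y_pred - y_true))
print(mae)  # 0.5
```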
So, what does that do?!
Linearity!
Simply making our errors positive and summing them means that every error counts linearly, no matter its size.
This is great in real-world ML with data on multiple scales.
A 1% error on a value of 100 and a value of 10,000 contributes 1 and 100 to MAE; in MSE this would be: 1 and 10,000.
[Plot: "Errors in MAE", showing the V-shaped f(x) = |x| over errors from −4 to 4.]
Sparsity!
MAE gets an equal amount of benefit out of getting the small and large values right.
That means it is great for data with intermittent signals and a bunch of 0s normally. Sparse data!
Getting the 0s right is as important to the model as the spikes.
Math gets in the way
Machine learning models really like it if things are differentiable, hence why MSE works so well.
MAE is great too, except for that pesky kink at x=0.
That kink can sometimes keep our model from learning properly.
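A quick sketch of what the optimizer actually sees (gradients of the two losses for a handful of example errors):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example errors

grad_mse = 2 * x       # derivative of x²: smooth, shrinks gently toward 0
grad_mae = np.sign(x)  # derivative of |x|: jumps from -1 to +1 at the kink
                       # (np.sign returns 0 at exactly 0; mathematically it's undefined there)
print(grad_mse)  # [-4. -1.  0.  1.  4.]
print(grad_mae)  # [-1. -1.  0.  1.  1.]
```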
The robot's verdict on MAE?
It's blazing fast to calculate.
The linearity is great for data over multiple scales.
Sparsity is great for intermittent signals and getting those 0s right. This makes it a popular choice for regularization too.
You promised a Stanford professor!
Back at the Technical University of Denmark, I attended a talk by Stephen Boyd about Convex Optimization.
We had a chat about CVXPY, ML, and metrics, and he suggested checking out the Huber loss.
And yes, he's the guy Andrej Karpathy mentions.
Why the Huber loss?
It combines the best of MAE and MSE!
It is quadratic for small errors and linear for large errors.
That means outliers are handled gracefully, it works well across scales, and ML loves the differentiable center.
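Here's a minimal sketch of the standard Huber formula (δ is the crossover point between the quadratic and linear regimes; the example errors are made up):

```python
import numpy as np

def huber(error, delta=1.0):
    """Quadratic inside |error| <= delta, linear outside."""
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([0.1, 1.0, 10.0, 100.0])
print(huber(errors, delta=1.0))  # -> 0.005, 0.5, 9.5, 99.5 (large errors grow linearly, not quadratically)
```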
[Plot: "Huber Loss vs MSE", showing f(x) = x², f(x) = |x|, and the Huber loss for δ = 1 and δ = 2 over errors from −4 to 4.]
Conclusion