My first encounter with missing data in machine learning?

Very first time I touched data!

XGBoost was just becoming usable back then, and scikit-learn decision trees didn’t handle NaNs yet!

These gaps in our data could wreak havoc on the accuracy of our models, leading to erroneous predictions and potential model breakdowns.

So what to do?

In my first attempts, I was just using simple imputation.

Replace the values with their average or the median. But I quickly tried getting fancier.

What if we can use trends?

Maybe even train a machine learning model to predict the missing value, or bring in expert knowledge?

I started reading the medical literature, a field that has been dealing with missing data and statistics for a long time.

You know what?

I talked to Daliana Liu about missing data as well!


But in the meantime, there are 3 best practices to deal with missing values:

1️⃣ You can remove the rows or fill in the values, and simply filling in the average is often the best and most robust choice.

Sometimes, when the number of missing values was relatively small, it made sense to eliminate the rows containing them. This was a straightforward approach, suitable when working with observations that weren't systematically missing. If a few days' worth of temperature data were missing in my daily weather dataset, I chose to remove those specific rows to maintain data integrity.

But removing data wasn't always the best option. In such cases, filling in the missing values with suitable replacements became my trusted strategy. These replacements could be constant values like zero, or measures of central tendency such as the mean or median. This method, backed by research, proved to be preferable to more complex techniques, especially in the realm of meteorology.
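Here is a minimal sketch of both options in pandas. The weather table, its column names, and the temperature values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily weather data with two missing temperature readings.
weather = pd.DataFrame(
    {
        "day": pd.date_range("2024-01-01", periods=6, freq="D"),
        "temperature_c": [2.0, np.nan, 3.0, 10.0, np.nan, 5.0],
    }
)

# Option A: drop the rows where the temperature is missing.
dropped = weather.dropna(subset=["temperature_c"])

# Option B: keep every row and fill the gaps with a central value.
mean_filled = weather["temperature_c"].fillna(weather["temperature_c"].mean())
median_filled = weather["temperature_c"].fillna(weather["temperature_c"].median())

print(len(weather), "rows before,", len(dropped), "rows after dropping")
print(mean_filled.tolist())
print(median_filled.tolist())
```

The 10.0 reading pulls the mean above the median here, which is why the median is often the safer fill when the data has outliers.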

2️⃣ Tree-based models that handle missing values!

Models like XGBoost can natively handle missing values and even glean information from the fact that it’s a missing value, because at each split they learn which branch the missing values should follow!

The fact that we can skip this pre-processing step and remove one modelling choice through an implicit mechanism is incredibly powerful!

3️⃣ Missingness as a feature

We can actually do another really neat trick in machine learning!

I have started calling this “missingness as a feature”.

Here I added an extra boolean column that simply flags, yes or no, whether the original value was a NaN.

Sometimes our data isn’t missing randomly, so our machine learning model can actually gain real information from this column!

Then I go back to 1️⃣ and fill those missing values, so a standard model can work its magic.

One example here is the Sea Surface Temperature in Earth modelling!

This data classically has missing values over land, so this boolean column effectively gives us a land-sea mask (LSM).

(Now, if we already have the LSM in our data, we clearly don’t need this!)
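A minimal pandas sketch of this trick, using a made-up handful of grid cells. The column names `sst_c` and `sst_missing` are my own choice for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical grid cells: sea surface temperature is undefined over land.
cells = pd.DataFrame({"sst_c": [18.5, np.nan, 21.0, np.nan, 19.5]})

# Step 1: record the missingness BEFORE filling, or the information is lost.
cells["sst_missing"] = cells["sst_c"].isna()

# Step 2: fill the gaps (here with the mean) so any standard model can train.
cells["sst_c"] = cells["sst_c"].fillna(cells["sst_c"].mean())

print(cells)
```

The order matters: once the values are filled, the flag can no longer be recovered. If you work in scikit-learn, `SimpleImputer(add_indicator=True)` offers a similar fill-plus-flag combination in one step.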

So many fun things to do with missing data!

I actually wrote about this in my newsletter last week, if you’re curious!