Real-world data is hard to work with.

That may be an understatement. Andrew Ng stated:

"Every company has messy data, and even the best of AI companies are not fully satisfied with their data."

But how can data be messy?

Messy Categories

On the one hand, data is naturally messy. Columns of categories contain entries that are similar but not quite the same. Detail often needs to be abstracted away so the category count does not explode, and, naturally, outliers are everywhere.

On the other hand, data entry is messy. How do you match a typo in a column when it was simply made by a tired minimum-wage worker?

In conjunction, these two sources of mess only complicate the issue.
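One common way to catch near-duplicate categories and simple typos is fuzzy string matching against a list of known-good values. The snippet below is a minimal sketch using Python's standard-library `difflib`. The category names, the `normalize_category` helper, and the 0.8 similarity cutoff are illustrative assumptions, not values from any particular data set.

```python
from difflib import get_close_matches

# Hypothetical canonical categories; in practice these come from your own schema.
CANONICAL = ["forest", "water", "urban", "farmland"]


def normalize_category(raw, canonical=CANONICAL, cutoff=0.8):
    """Map a raw entry to its closest canonical category.

    Returns None when no canonical value is similar enough, so that
    genuinely new categories and outliers are flagged instead of
    silently forced into an existing bucket.
    """
    cleaned = raw.strip().lower()  # remove stray whitespace and casing noise
    matches = get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


entries = ["Forest", "forrest", " water", "watr", "urbn", "desert"]
normalized = [normalize_category(entry) for entry in entries]
print(normalized)
```

The cutoff is a trade-off: set it too low and distinct categories get merged, set it too high and typos slip through. Entries that come back as `None` are good candidates for manual review.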

Messy Labels

Not all data is created equal.

Some data points in a data set are easier to label and identify than others. That makes these points cheaper to label, because lower-skilled and junior workers can handle them, so they tend to end up overrepresented in the labeled set.

In addition, data sets are often naturally imbalanced. There's more water on the surface of the Earth than forests.

These effects work together and can cause significant imbalances in a supervised machine learning problem.
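A simple first response to imbalance is to weight each class inversely to how often it occurs, so rare classes count for more in the loss. The sketch below uses only the standard library. The `balanced_class_weights` helper and the 70/30 split between "water" and "forest" labels are made up for illustration and are not real statistics.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count).

    A perfectly balanced data set yields a weight of 1.0 for every class;
    rare classes get weights above 1.0, common ones below 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label column with an illustrative 70/30 imbalance.
labels = ["water"] * 70 + ["forest"] * 30
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Weights like these can typically be passed to a loss function or a sampler. They correct for class frequency only, not for the bias introduced when easy examples are labeled more often than hard ones.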

Messy Conventions

Classic data collection and storage is often detrimental to machine learning applications.

The easiest way to store most data is sequentially: one data point after the other, row by column. Your fancy 3D convolutional neural network, however, expects the data as a neat 3D volume.
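To make the gap between sequential storage and a 3D volume concrete, here is a minimal pure-Python sketch of the index arithmetic. It assumes the flat data was written in row-major order (the last axis varies fastest), which is an assumption about the storage layout, and the `(depth, height, width)` shape is a toy example. In practice an array library's reshape function would do this for you.

```python
def flat_to_3d(index, shape):
    """Convert a flat, row-major index into (z, y, x) coordinates."""
    depth, height, width = shape
    if not 0 <= index < depth * height * width:
        raise IndexError("flat index out of range for the given shape")
    z, rest = divmod(index, height * width)  # each z-slice holds height*width values
    y, x = divmod(rest, width)               # each row holds width values
    return z, y, x


def reshape_to_volume(flat, shape):
    """Rebuild a nested depth x height x width list from sequential data."""
    depth, height, width = shape
    if len(flat) != depth * height * width:
        raise ValueError("data length does not match the requested shape")
    return [
        [
            [flat[(z * height + y) * width + x] for x in range(width)]
            for y in range(height)
        ]
        for z in range(depth)
    ]


# Toy example: 24 sequential values become a 2 x 3 x 4 volume.
shape = (2, 3, 4)
flat = list(range(24))
volume = reshape_to_volume(flat, shape)
print(flat_to_3d(23, shape), volume[1][2][3])
```

The length check matters: if the stored sequence does not match the expected shape, or was written in a different axis order, the reshape either fails or, worse, silently scrambles the spatial structure.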

It gets worse: digitization conventions often use specific data lengths like 101 or 701. Those are prime numbers. 121 latitudes? The prime factorization is 11 × 11.

These are terrible for ML systems that rely on spatial correlation and compression to learn!
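The problem with these lengths shows up as soon as a network tries to halve an axis repeatedly. The sketch below factorizes the lengths mentioned above and computes how much padding each would need. It assumes a hypothetical network with three stride-2 downsampling stages, so every axis should be a multiple of 8. Both helper names and that multiple are illustrative choices.

```python
def prime_factors(n):
    """Return the prime factorization of n as a sorted list (trial division)."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


def padded_length(n, multiple):
    """Smallest length >= n that is divisible by `multiple`."""
    return -(-n // multiple) * multiple  # ceiling division without floats


# Assumption: three halving stages, so each axis must divide by 2**3 = 8.
MULTIPLE = 2 ** 3

for length in (101, 701, 121):
    target = padded_length(length, MULTIPLE)
    print(length, prime_factors(length), "pad by", target - length, "to", target)
```

None of the three lengths contains a factor of 2, so none can be halved even once without padding or cropping. Padding is only one option; cropping or resampling to a friendlier grid may be more appropriate depending on the data.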

[Image: Atomic Essay Day 7 - Real-world Machine Learning is Hard]

This atomic essay was part of the October 2021 #Ship30for30 cohort, a 30-day daily writing challenge by Dickie Bush and Nicolas Cole.