Real-world data is hard to work with.
That may be an understatement. Andrew Ng put it this way: "Every company has messy data, and even the best of AI companies are not fully satisfied with their data."
But how can data be messy?
On the one hand, data is naturally messy. Categorical columns contain entries that are similar but not quite the same. Detail often has to be abstracted away to keep the category count from exploding, and outliers turn up everywhere.
On the other hand, data entry is messy. How do you match an entry against its category when it contains a typo made by a tired, minimum-wage worker?
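One common mitigation is fuzzy matching against a list of known categories. Here is a minimal sketch using Python's standard-library difflib; the category names are made up for illustration:

```python
from difflib import get_close_matches

# Hypothetical list of canonical category names.
categories = ["forest", "water", "urban", "grassland"]

def clean_label(raw: str, cutoff: float = 0.8) -> str:
    """Map a possibly misspelled label to its closest known category."""
    matches = get_close_matches(raw.lower().strip(), categories, n=1, cutoff=cutoff)
    # Keep the original entry if nothing is close enough.
    return matches[0] if matches else raw

print(clean_label("forrest"))  # prints "forest"
```

The cutoff is a judgment call: too low and distinct categories get merged, too high and real typos slip through.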
Taken together, these effects only complicate the issue.
Not all data is created equal.
Some data points in a data set are easier to label and identify than others, which makes them cheaper to label and lets more junior, lower-skilled workers handle them.
In addition, data sets are often naturally imbalanced. There's more water on the surface of the Earth than forests.
These effects compound and can cause significant imbalances in a supervised machine learning problem.
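A standard countermeasure is to weight each class inversely to its frequency during training. A sketch with hypothetical label counts:

```python
from collections import Counter

# Hypothetical labels from an imbalanced data set:
# far more "water" samples than "forest" ones.
labels = ["water"] * 90 + ["forest"] * 10

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Inverse-frequency weights, as in scikit-learn's "balanced" mode:
# weight = n_samples / (n_classes * count)
weights = {cls: n_samples / (n_classes * c) for cls, c in counts.items()}
print(weights)  # the rare class gets a proportionally larger weight
```

The rare "forest" class ends up weighted 5.0 versus roughly 0.56 for "water", so each forest sample contributes about nine times as much to the loss.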
Classic data collection and storage is often detrimental to machine learning applications.
The easiest way to store most data is sequentially: one data point after the other, row by column. Your fancy 3D convolutional neural network, however, expects the data as a neat 3D volume.
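Bridging the two layouts usually comes down to a reshape. A minimal numpy sketch with made-up dimensions:

```python
import numpy as np

# Hypothetical dimensions: 4 time steps, 5 latitudes, 6 longitudes.
t, lat, lon = 4, 5, 6

# Sequential storage: one long run of values, as it might come off disk.
flat = np.arange(t * lat * lon, dtype=np.float32)

# The 3D volume a convolutional network expects.
volume = flat.reshape(t, lat, lon)

print(volume.shape)  # (4, 5, 6)
```

Note that a plain reshape only works when the on-disk ordering matches the target axis order; otherwise a transpose is needed as well.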
It gets worse: digitization conventions often produce awkward data lengths like 101 or 701 samples. Those are prime numbers. 121 latitudes? The prime factors are 11x11.
These are terrible for ML systems that rely on spatial correlation and compression to learn!
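To see why, consider repeated 2x downsampling, the workhorse of spatial compression in CNNs. A power-of-two length halves cleanly all the way down; a prime length immediately leaves a remainder. A sketch, with padding up to the next power of two as one common fix (the sizes are illustrative):

```python
def halvings(n: int) -> int:
    """Count how many times n can be halved without remainder."""
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

print(halvings(101))  # prime length: 0 clean halvings
print(halvings(128))  # power of two: 7 clean halvings

# One common fix: pad the awkward axis up to the next power of two.
size = 101
padded = 1
while padded < size:
    padded *= 2
print(padded)  # 128
```

Padding wastes some memory and compute, but it spares every downsampling layer from dealing with odd remainders.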