My satellite image segmentation model wasn’t learning, and we were wasting a $60,000 contract.

Our classes were skewed 1 to 30!

That imbalance was the real obstacle for our model, but we fixed it with a simple trick.

In the early days of my career transition out of the oil and gas industry into the world of machine learning, I found myself facing a challenging project involving satellite image segmentation. It turned into a thrilling detective story, and one worth sharing.

We were tasked with creating a satellite image segmentation model, and as you may know, the Earth sciences come with an inherent problem: some classes in the imagery are far rarer than others, which presented a real challenge.

In the Earth sciences, it's common to encounter situations where the classes you want to identify are not evenly distributed. In our case, we had classes like "forest," "water bodies," and "urban areas." These are naturally imbalanced because, well, urban areas are far less common than, say, forests. The typical class ratio might be closer to 1 to 50, making it necessary to address the imbalance before any machine learning model can be useful!
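To make that concrete, here's a minimal sketch of how you could quantify the skew by counting per-class pixels across your label masks. The class ids, class names, and synthetic masks below are purely illustrative, not our actual dataset.

```python
import numpy as np

# Hypothetical class ids and names, just for illustration.
CLASS_NAMES = {0: "forest", 1: "water bodies", 2: "urban areas"}

def class_fractions(masks):
    """Return the fraction of pixels belonging to each class across all masks."""
    counts = np.zeros(len(CLASS_NAMES), dtype=np.int64)
    for mask in masks:
        counts += np.bincount(mask.ravel(), minlength=len(CLASS_NAMES))
    return {CLASS_NAMES[i]: c / counts.sum() for i, c in enumerate(counts)}

# Synthetic masks with a heavy "forest" majority (~1 to 30 urban-to-forest skew).
rng = np.random.default_rng(0)
masks = [rng.choice(3, size=(256, 256), p=[0.90, 0.07, 0.03]) for _ in range(10)]
print(class_fractions(masks))  # e.g. {'forest': ~0.90, 'water bodies': ~0.07, 'urban areas': ~0.03}
```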

It got worse!

Can't you just segment more data?

We didn’t have time to segment more data before the first milestone meeting. The model had to work with what we had!

We needed to tackle this class imbalance head-on, and one of the methods we explored was Random Undersampling. This technique involves randomly removing some of the majority class samples to balance the class distribution. In Earth sciences, this would mean reducing the number of samples from common classes like "forest" or "water bodies."

In our case, Random Undersampling proved to be really effective, but it came with a crucial insight: we didn't need to aim for a perfect 1 to 1 class ratio, which is a common misconception.

We aimed to maintain a ratio of roughly 1 to 7.

Narrowing the gap between the smallest and largest classes while still preserving some of the majority-class data was the right choice.
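For illustration, here's a rough sketch of what tile-level Random Undersampling toward a roughly 1 to 7 ratio could look like. The function, the tile and label representation, and the numbers are hypothetical, a simplified stand-in rather than our actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

def undersample_tiles(tiles, dominant_class, minority_class, majority_class, target_ratio=7):
    """Randomly drop majority-class tiles until majority:minority is at most target_ratio.

    `dominant_class[i]` is the class that dominates tile i; tiles dominated by
    other classes are kept untouched.
    """
    dominant_class = np.asarray(dominant_class)
    minority_idx = np.flatnonzero(dominant_class == minority_class)
    majority_idx = np.flatnonzero(dominant_class == majority_class)
    other_idx = np.flatnonzero(
        (dominant_class != minority_class) & (dominant_class != majority_class)
    )

    # Keep at most target_ratio majority tiles per minority tile.
    n_keep = min(len(majority_idx), target_ratio * len(minority_idx))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)

    keep = np.concatenate([minority_idx, kept_majority, other_idx])
    rng.shuffle(keep)
    return [tiles[i] for i in keep], dominant_class[keep]

# Made-up example: 10 "urban" tiles against 300 "forest" tiles (1 to 30)
# comes out at 10 against 70 (1 to 7).
tiles = [np.zeros((64, 64, 3)) for _ in range(310)]
dominant = [2] * 10 + [0] * 300          # 0 = forest, 2 = urban
kept_tiles, kept_dominant = undersample_tiles(tiles, dominant, minority_class=2, majority_class=0)
print(np.bincount(kept_dominant))        # [70,  0, 10]
```

Dropping whole tiles rather than individual pixels keeps the spatial context intact, which matters for a segmentation model.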

Should we try SMOTE?

Here's where things got interesting. While Random Undersampling was a practical approach, we noticed that Synthetic Minority Over-sampling Technique (SMOTE) was frequently mentioned in blog posts and research papers.

But it seemed to be conspicuously absent in real-world applications.

SMOTE is a method for oversampling the minority class by generating synthetic samples interpolated from existing data points. While it's a fascinating concept, its practical application often requires careful fine-tuning and doesn't always deliver the expected results, especially when dealing with satellite imagery.
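For the curious, here's a tiny sketch of what SMOTE looks like with the imbalanced-learn library, applied to per-pixel spectral features. Treating each pixel as a standalone feature vector is an assumption I'm making purely for illustration, and the bands and imbalance are synthetic.

```python
import numpy as np
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

rng = np.random.default_rng(0)

# Illustrative per-pixel data: X is (n_pixels, n_bands), y flags the rare class.
X = rng.normal(size=(10_000, 4))               # 4 made-up spectral bands
y = (rng.random(10_000) < 1 / 30).astype(int)  # roughly 1 to 30 imbalance

# SMOTE interpolates between a minority sample and its nearest minority
# neighbours to create synthetic points; by default it balances to 1 to 1.
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

print(np.bincount(y), "->", np.bincount(y_res))
```

The catch for imagery is that these interpolated samples carry no spatial context, so the synthetic points can look nothing like pixels from a real scene, which is part of why the technique needs so much care in this domain.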


In the real world, where data is often messy, noisy, and complex, Random Undersampling proved to be a faster and more robust solution. It didn't rely on complex modelling assumptions and was more forgiving when dealing with the intricacies of Earth science data.

Conclusion

In the end, our journey through class imbalance and satellite image segmentation taught me a valuable lesson:

Sometimes the simplest solutions are the most effective.

It's crucial to adapt machine learning techniques to the specific challenges of the Earth sciences. By embracing Random Undersampling and keeping our class ratios within a 1 to 7 range, we found a pragmatic way to create a reliable model that made a real impact in our work to improve weather forecasts in Europe.

It proves true again: as we navigate the world of machine learning and AI, practicality often trumps complexity. Real-world applications may require us to make choices that quietly compound into good results, even if they don't align with the latest buzzwords.