Getting to know the data

This tutorial uses the Palmer Penguins dataset.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

Let's dive into some quick exploration of the data!

In [1]:

import pandas as pd

In [2]:

penguins_raw = pd.read_csv("../data/penguins.csv")
penguins_raw.head()

Out[2]:

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	2007-11-11	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	2007-11-16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

This looks like a lot. Let's reduce this to some numerical columns and the species as our target column.

In [3]:

num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]
penguins = penguins_raw[features+target]
penguins

Out[3]:

	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Sex	Species
0	39.1	18.7	181.0	MALE	Adelie Penguin (Pygoscelis adeliae)
1	39.5	17.4	186.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
2	40.3	18.0	195.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
3	NaN	NaN	NaN	NaN	Adelie Penguin (Pygoscelis adeliae)
4	36.7	19.3	193.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
...	...	...	...	...	...
339	55.8	19.8	207.0	MALE	Chinstrap penguin (Pygoscelis antarctica)
340	43.5	18.1	202.0	FEMALE	Chinstrap penguin (Pygoscelis antarctica)
341	49.6	18.2	193.0	MALE	Chinstrap penguin (Pygoscelis antarctica)
342	50.8	19.0	210.0	MALE	Chinstrap penguin (Pygoscelis antarctica)
343	50.2	18.7	198.0	FEMALE	Chinstrap penguin (Pygoscelis antarctica)

344 rows × 5 columns

Data Visualization

That's much better. So now we can look at the data in detail.

In [4]:

import seaborn as sns

pairplot_figure = sns.pairplot(penguins, hue="Species")

[Figure: seaborn pair plot of the numerical features, colored by species]

Data cleaning

Looks like we're getting some good separation of the clusters.

So that means we can probably do some cleaning and get ready to build some good machine learning models.

In [5]:

penguins = penguins.dropna(axis='rows')
penguins

Out[5]:

	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Sex	Species
0	39.1	18.7	181.0	MALE	Adelie Penguin (Pygoscelis adeliae)
1	39.5	17.4	186.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
2	40.3	18.0	195.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
4	36.7	19.3	193.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
5	39.3	20.6	190.0	MALE	Adelie Penguin (Pygoscelis adeliae)
...	...	...	...	...	...
339	55.8	19.8	207.0	MALE	Chinstrap penguin (Pygoscelis antarctica)
340	43.5	18.1	202.0	FEMALE	Chinstrap penguin (Pygoscelis antarctica)
341	49.6	18.2	193.0	MALE	Chinstrap penguin (Pygoscelis antarctica)
342	50.8	19.0	210.0	MALE	Chinstrap penguin (Pygoscelis antarctica)
343	50.2	18.7	198.0	FEMALE	Chinstrap penguin (Pygoscelis antarctica)

334 rows × 5 columns

In [6]:

penguins.to_csv('../data/penguins_clean.csv', index=False)

Not too bad: it looks like we lost ten rows (344 down to 334). That's manageable; it's a toy dataset after all.
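A quick way to check how many rows `dropna` removes is to compare the row counts before and after the call. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in, not the penguin data), so the numbers it prints are not the tutorial's numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguin table, with a few missing values.
toy = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5, np.nan, 36.7],
    "Sex": ["MALE", "FEMALE", np.nan, np.nan],
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
})

missing_per_column = toy.isna().sum()     # missing values in each column
toy_clean = toy.dropna(axis="rows")       # same call as in the tutorial
rows_dropped = len(toy) - len(toy_clean)  # rows with at least one NaN

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Running the same three lines on `penguins` before and after cleaning would report the actual loss for this dataset.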

So let's build a small model to classify penguins!

Machine Learning

First we need to split the data.

This way we can test whether our model learned general rules about our data, or if it just memorized the training data. When a model memorizes the training data instead of learning rules that generalize, this is known as overfitting.

In [7]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target], train_size=.7)
X_train

Out[7]:

	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Sex
261	48.1	15.1	209.0	MALE
336	51.9	19.5	206.0	MALE
161	46.8	15.4	215.0	MALE
305	52.8	20.0	205.0	MALE
175	46.3	15.8	215.0	MALE
...	...	...	...	...
201	45.2	15.8	215.0	MALE
270	47.2	13.7	214.0	FEMALE
320	50.9	17.9	196.0	FEMALE
116	38.6	17.0	188.0	FEMALE
86	36.3	19.5	190.0	MALE

233 rows × 4 columns

In [8]:

y_train

Out[8]:

	Species
261	Gentoo penguin (Pygoscelis papua)
336	Chinstrap penguin (Pygoscelis antarctica)
161	Gentoo penguin (Pygoscelis papua)
305	Chinstrap penguin (Pygoscelis antarctica)
175	Gentoo penguin (Pygoscelis papua)
...	...
201	Gentoo penguin (Pygoscelis papua)
270	Gentoo penguin (Pygoscelis papua)
320	Chinstrap penguin (Pygoscelis antarctica)
116	Adelie Penguin (Pygoscelis adeliae)
86	Adelie Penguin (Pygoscelis adeliae)

233 rows × 1 columns
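The split above is random, so the rows that land in each subset change from run to run. As an optional variation (not what the tutorial does), `train_test_split` accepts `random_state` for a repeatable split and `stratify` to keep the class proportions similar in both subsets. The data below is a hypothetical toy table.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows per class, 3 classes.
toy = pd.DataFrame({
    "feature": range(30),
    "Species": ["Adelie"] * 10 + ["Chinstrap"] * 10 + ["Gentoo"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]],
    toy["Species"],
    train_size=0.7,
    random_state=42,          # fixed seed: the same split on every run
    stratify=toy["Species"],  # keep class proportions in both subsets
)

print(y_tr.value_counts())
print(y_te.value_counts())
```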

Now we can build a machine learning model.

Here we'll use the scikit-learn Pipeline. This makes it easy to fit preprocessors and models on the training data alone and then apply them cleanly to the test set without leakage.

Pre-processing

Tip: In science, any data-dependent step, such as feature selection or scaling, needs to be fitted after the split into training and test set.
Statistically and scientifically valid results come from proper treatment of our data. If we pre-process before splitting out a test set, information from the test set leaks into training, and we can overfit manually.
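To make the tip concrete, here is a minimal sketch of scaling without leakage: the scaler's statistics are computed from the training rows only and then reused for the test rows. The data is randomly generated for illustration and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=40.0, scale=5.0, size=(100, 1))  # hypothetical measurements

X_tr, X_te = train_test_split(X, train_size=0.7, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std come from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics, no refit

print("scaler mean:", scaler.mean_)
print("training mean:", X_tr.mean(axis=0))
```

Calling `fit_transform` on the full array before splitting would let the test rows influence the mean and standard deviation, which is the leakage the tip warns about.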

In [9]:

from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')

The ColumnTransformer is a neat tool that can apply your preprocessing steps to the right columns in your dataset.

In fact, you could also use a Pipeline instead of "just" a StandardScaler to use more sophisticated and complex preprocessing workflows that go beyond this toy project.
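As a sketch of that idea, the numerical branch below is a small Pipeline that imputes missing values with the median before scaling. Imputation is an assumption made for illustration only (the tutorial drops incomplete rows instead), and `num_pipeline` and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
cat_features = ["Sex"]

# Numerical branch: fill gaps with the column median, then standardize.
num_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", num_pipeline, num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
])

# Hypothetical rows with gaps in the numerical columns.
toy = pd.DataFrame({
    "Culmen Length (mm)": [39.1, np.nan, 46.8, 50.2],
    "Culmen Depth (mm)": [18.7, 17.4, 15.4, np.nan],
    "Sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
})

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # two scaled columns plus one column per Sex category
```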

In [10]:

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])

In [11]:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC()),
])
model

Out[11]:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['Culmen Length (mm)',
                                                   'Culmen Depth (mm)',
                                                   'Flipper Length (mm)']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Sex'])])),
                ('classifier', SVC())])


We can see a nice model representation here.

In the interactive notebook view, you can click on the different modules to see which arguments were passed into the pipeline; in our case, how the OneHotEncoder handles unknown values.
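Outside the interactive view, the same information is available programmatically. The sketch below rebuilds the pipeline so that it runs on its own, then reads nested arguments through `get_params()`, where each level is joined by a double underscore.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Rebuild the tutorial's pipeline so that this snippet is self-contained.
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
])
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", SVC()),
])

params = model.get_params()
print(params["preprocessor__cat__handle_unknown"])  # argument passed to the encoder
print(params["classifier__kernel"])                 # SVC default, not set explicitly
print(list(model.named_steps))                      # step names in order
```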

Model Training

Now it's time to fit our Support-Vector Machine to our training data.

In [12]:

model.fit(X_train, y_train[target[0]])

Out[12]:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['Culmen Length (mm)',
                                                   'Culmen Depth (mm)',
                                                   'Flipper Length (mm)']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Sex'])])),
                ('classifier', SVC())])


We can see that we get a decent score on the training data.

This metric only tells us how well the model performs on data it has already seen; it tells us nothing yet about generalization and actual "learning".

In [13]:

model.score(X_train, y_train)

Out[13]:

0.9957081545064378
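For a classifier, `score` reports mean accuracy, that is, the fraction of rows whose predicted label matches the true label. The sketch below checks this on two made-up clusters (hypothetical data and names) by comparing `score` with `accuracy_score` applied to the predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two hypothetical, well-separated clusters of 50 points each.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(5.0, 1.0, size=(50, 2)),
])
y = np.array(["a"] * 50 + ["b"] * 50)

clf = Pipeline(steps=[("scale", StandardScaler()), ("classifier", SVC())])
clf.fit(X, y)

score = clf.score(X, y)                     # built-in mean accuracy
manual = accuracy_score(y, clf.predict(X))  # the same number, computed by hand
print(score, manual)
```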

To evaluate how well our model learned, we check the model against the test data one final time.

Tip: It is possible to manually overfit a model to the test set, by tweaking the pipelines based on the test score.
This invalidates scientific results and must be avoided. Only evaluate on the test set once!

In [14]:

model.score(X_test, y_test)

Out[14]:

0.9801980198019802
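One common way to compare or tune pipelines without touching the test set is cross-validation on the training data; the test set is then scored once at the very end. This is a sketch of that workflow on randomly generated data, not a step from the tutorial, and all names in it are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two hypothetical clusters of 60 points each.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 2)),
    rng.normal(4.0, 1.0, size=(60, 2)),
])
y = np.array([0] * 60 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

clf = Pipeline(steps=[("scale", StandardScaler()), ("classifier", SVC())])

# Model selection uses the training data only: 5 folds, each held out once.
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)
print("cross-validation accuracy:", cv_scores.mean())

# Final, one-time evaluation on the untouched test set.
clf.fit(X_tr, y_tr)
final_score = clf.score(X_te, y_te)
print("test accuracy:", final_score)
```

Because the pipeline is passed to `cross_val_score` as a whole, the scaler is refitted inside each fold, so the held-out fold never influences the preprocessing.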

Increase citations, ease reviews and facilitate collaboration – Data Prep

© 2024 Jesper Dramsch All rights reserved.
