Getting to know the data¶
This tutorial uses the Palmer Penguins dataset.
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
Let's dive into some quick exploration of the data!
import pandas as pd
penguins_raw = pd.read_csv("../data/penguins.csv")
penguins_raw.head()
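If you'd like a quick overview of column types and missing values before picking columns, pandas' info() and describe() are handy:
penguins_raw.info()
penguins_raw.describe()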
This looks like a lot. Let's reduce this to a few numerical columns plus the sex, with the species as our target column.
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]
penguins = penguins_raw[features+target]
penguins
Data Visualization¶
That's much better. So now we can look at the data in detail.
import seaborn as sns
pairplot_figure = sns.pairplot(penguins, hue="Species")
Data Cleaning¶
Looks like we're getting some good separation of the clusters.
That means we can do some cleaning and get ready to build a good machine learning model.
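Before dropping anything, a quick check of how many values are actually missing per column doesn't hurt:
penguins.isna().sum()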
penguins = penguins.dropna(axis='rows')
penguins
penguins.to_csv('../data/penguins_clean.csv', index=False)
Not too bad, it looks like we lost two rows. That's manageable, it's a toy dataset after all.
So let's build a small model to classify penguins!
Machine Learning¶
First we need to split the data.
This way we can test whether our model learned general rules about our data, or whether it just memorized the training data. When a model fails to learn these general rules and only memorizes, this is known as overfitting.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target], train_size=.7)
X_train
y_train
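One thing to keep in mind: train_test_split shuffles randomly, so the exact split changes between runs. If you want a reproducible split that also preserves the species proportions, you could pass random_state and stratify, roughly like this:
X_train, X_test, y_train, y_test = train_test_split(
    penguins[features], penguins[target],
    train_size=.7,
    random_state=42,            # make the shuffle reproducible
    stratify=penguins[target],  # keep species proportions in both splits
)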
Now we can build a machine learning model.
Here we'll use a scikit-learn Pipeline. This makes it really easy for us to fit preprocessors and models on the training data alone and cleanly apply them to the test data set without leakage.
Pre-processing¶
Statistically and scientifically valid results come from proper treatment of our data. If we fit our pre-processing on the data before splitting out a test set, information from the test data leaks into the model, and we effectively overfit manually.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')
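If you want to see which categories the encoder picks up, you can fit it on the training data by hand (just an illustration, the pipeline below will do the fitting for us):
cat_transformer.fit(X_train[cat_features])
cat_transformer.categories_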
The ColumnTransformer is a neat tool that applies your preprocessing steps to the right columns in your dataset.
In fact, you could also use a Pipeline instead of "just" a StandardScaler to build more sophisticated preprocessing workflows that go beyond this toy project, as sketched below.
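As a rough sketch of what that could look like, here is a numeric pipeline that imputes missing values before scaling (we stick with the plain StandardScaler in this tutorial, so this is just an illustration):
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

num_transformer_alt = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing numeric values with the median
    ('scaler', StandardScaler()),                   # then standardize as before
])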
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features),
])
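If you're curious what the combined preprocessor produces, you can fit it on the training data and inspect the result (purely for illustration, the pipeline below does this for us):
transformed = preprocessor.fit_transform(X_train)
transformed.shape  # same number of rows, scaled numeric columns plus one-hot columns for Sex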
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC()),
])
model
We can see a nice model representation here.
You can click on the different steps, which will show you the arguments that were passed into the pipeline, in our case how unknown values are handled in the OneHotEncoder.
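You can also inspect the pipeline programmatically, for example:
model.named_steps['classifier']  # the SVC inside the pipeline
model.get_params()['preprocessor__cat__handle_unknown']  # should return 'ignore'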
Model Training¶
Now it's time to fit our Support-Vector Machine to our training data.
model.fit(X_train, y_train[target[0]])
We can see that we get a decent score on the training data.
This metric only tells us how well the model performs on the data it has seen; it doesn't tell us anything about generalization and actual "learning" yet.
model.score(X_train, y_train)
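If we'd like an estimate of generalization before touching the test set, one option is cross-validation on the training data, roughly like this:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train[target[0]], cv=5)
scores.mean(), scores.std()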
To evaluate how well our model learned, we check the model against the test data one final time.
It's important that we only do this once: repeatedly evaluating against the test set and tuning towards it invalidates scientific results and must be avoided. Only evaluate on the test set once!
model.score(X_test, y_test)
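For this one final evaluation, a per-class breakdown can be more informative than a single accuracy number, for example with scikit-learn's classification_report:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test[target[0]], y_pred))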