Understanding Machine Learning Pipelines — A Gentle Introduction

Simplifying Data Preprocessing with Sklearn Pipeline Class

Hargurjeet
4 min read · Nov 3, 2021
Photo by Sigmund on Unsplash

Introduction

Pipelines are a simple way to keep your data preprocessing and modelling code organized. Specifically, a pipeline bundles preprocessing and modelling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

  1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won’t need to manually keep track of your training and validation data at each step.
  2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
  3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won’t go into the many related concerns here, but pipelines can help.
  4. More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.

Table Of Contents

  1. About the Dataset
  2. Performing Train Test Split
  3. Preprocessing W/O pipeline
  4. Model Implementation W/O Pipelines
  5. Sklearn Pipeline Implementation
  6. Summary
  7. References

№1: About the Dataset

The dataset has been picked up from Kaggle and can be accessed from here. The data contains information from the 1990 California census.

The dataset contains the following columns:

  1. longitude: A measure of how far west a house is; a higher value is farther west
  2. latitude: A measure of how far north a house is; a higher value is farther north
  3. housingMedianAge: Median age of a house within a block; a lower number is a newer building
  4. totalRooms: Total number of rooms within a block
  5. totalBedrooms: Total number of bedrooms within a block
  6. population: Total number of people residing within a block
  7. households: Total number of households, a group of people residing within a home unit, for a block
  8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  9. medianHouseValue: Median house value for households within a block (measured in US Dollars)
  10. oceanProximity: Location of the house w.r.t ocean/sea

medianHouseValue is the target variable we want to predict.

№2: Performing Train Test Split

Before we preprocess the dataset, we need to download it from Kaggle and load it into a pandas DataFrame.
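
Here is a minimal loading sketch. It assumes the Kaggle CSV is saved as housing.csv in the working directory; note that the file itself uses snake_case column names (e.g. median_house_value rather than medianHouseValue).

    import pandas as pd

    # Assumes housing.csv was downloaded from Kaggle into the working directory.
    housing = pd.read_csv('housing.csv')
    housing.head()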

Now let us remove the columns that are not relevant and perform the train test split.
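
Which columns get dropped as irrelevant isn’t shown here, so the sketch below simply separates the target (median_house_value) from the features and does an 80/20 split; random_state=42 is an arbitrary choice for reproducibility.

    from sklearn.model_selection import train_test_split

    # Separate the target from the features; drop any columns you
    # consider irrelevant here before splitting.
    X = housing.drop('median_house_value', axis=1)
    y = housing['median_house_value']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)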

№3: Preprocessing W/O pipeline

Below is a hack to segregate the numerical and categorical columns
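
One common way to do this is with pandas’ select_dtypes (a sketch, continuing from the split above):

    # Numeric columns have a number dtype; everything else (here just
    # ocean_proximity) is treated as categorical.
    num_cols = X_train.select_dtypes(include='number').columns.tolist()
    cat_cols = X_train.select_dtypes(exclude='number').columns.tolist()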

Imputing Numeric Columns

Imputing is the process of filling up empty or NaN values within your dataset with a numeric value. The numeric value can be the mean, median, most frequent value, or a constant. You can read more about this here.
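
A sketch using sklearn’s SimpleImputer; the median strategy is my assumption (any of the strategies above would work the same way).

    from sklearn.impute import SimpleImputer

    # Fit on the training data only, then reuse the fitted imputer on
    # the test data to avoid leakage.
    num_imputer = SimpleImputer(strategy='median')
    X_train_num = num_imputer.fit_transform(X_train[num_cols])
    X_test_num = num_imputer.transform(X_test[num_cols])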

Scaling Numeric Columns

Now we scale all the numeric values to between 0 and 1. Two widely known techniques are MinMaxScaler and StandardScaler; I use MinMaxScaler in the current situation.
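
A minimal sketch, applied to the imputed numeric block from the previous step:

    from sklearn.preprocessing import MinMaxScaler

    # MinMaxScaler maps each column to the [0, 1] range.
    scaler = MinMaxScaler()
    X_train_num = scaler.fit_transform(X_train_num)
    X_test_num = scaler.transform(X_test_num)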

Imputing Categorical Columns

Similarly, we use an imputer to fill missing values in the categorical columns. As categorical columns contain text data, we cannot fill the empty values with the median, mean, etc., hence I used most_frequent as the strategy.
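
The same SimpleImputer class handles this; only the strategy changes:

    from sklearn.impute import SimpleImputer

    # most_frequent replaces missing entries with the column's mode,
    # which works for text data.
    cat_imputer = SimpleImputer(strategy='most_frequent')
    X_train_cat = cat_imputer.fit_transform(X_train[cat_cols])
    X_test_cat = cat_imputer.transform(X_test[cat_cols])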

Encoding Categorical Columns

Lastly, we encode the categorical columns, which basically creates a sparse matrix out of the categorical values.
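
A sketch with OneHotEncoder; handle_unknown='ignore' is my own addition so that a category seen only in the test set doesn’t raise an error.

    from sklearn.preprocessing import OneHotEncoder

    # fit_transform returns a sparse matrix by default.
    encoder = OneHotEncoder(handle_unknown='ignore')
    X_train_cat = encoder.fit_transform(X_train_cat)
    X_test_cat = encoder.transform(X_test_cat)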

№4: Model Implementation W/O Pipelines

Once the imputing, scaling, and one-hot encoding are applied, all the relevant columns are brought together.
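
A sketch of that recombination; for simplicity it densifies the encoded block with toarray(), which is fine here because ocean_proximity has only a handful of categories.

    import numpy as np

    # Stack the scaled numeric block and the encoded categorical block
    # side by side into one feature matrix.
    X_train_prepared = np.hstack([X_train_num, X_train_cat.toarray()])
    X_test_prepared = np.hstack([X_test_num, X_test_cat.toarray()])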

Now the linear model is trained and the MSE score is calculated on the test set.
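
A minimal sketch of that training and evaluation step:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    model = LinearRegression()
    model.fit(X_train_prepared, y_train)

    preds = model.predict(X_test_prepared)
    print('MSE:', mean_squared_error(y_test, preds))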

With that, we have completed an end-to-end process of implementing the ML model without pipelines.

№5: Sklearn Pipeline Implementation

First, we perform the train test split and segregate the numeric and categorical columns, exactly as in sections 2 and 3 above.

Pipeline Implementation

For the pipeline implementation, we import sklearn’s Pipeline class along with the ColumnTransformer class.
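
A sketch of how this can look, with one sub-pipeline per column type bundled by a ColumnTransformer (the step names are illustrative):

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    # Numeric columns: impute, then scale.
    numeric_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', MinMaxScaler()),
    ])

    # Categorical columns: impute, then one-hot encode.
    categorical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ])

    # ColumnTransformer routes each column list to its sub-pipeline.
    preprocessor = ColumnTransformer([
        ('num', numeric_pipe, num_cols),
        ('cat', categorical_pipe, cat_cols),
    ])

Every preprocessing decision from section 3 now lives in a single object.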

Model Implementation with Pipelines

We import the linear regression model and pass the training data to the pipeline and the model in a single step. Easy peasy 😆
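
Something like this (full_pipeline is an illustrative name):

    from sklearn.linear_model import LinearRegression

    # Preprocessing and the model become a single estimator; one fit
    # call runs every step in order.
    full_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', LinearRegression()),
    ])
    full_pipeline.fit(X_train, y_train)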

In the next step, we pass the test data and calculate the MAE.
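
A sketch of that evaluation; predict applies the fitted preprocessing to the raw test data automatically:

    from sklearn.metrics import mean_absolute_error

    preds = full_pipeline.predict(X_test)
    print('MAE:', mean_absolute_error(y_test, preds))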

Bonus — Implementing cross-validation

Once the pipeline is created, we can also calculate the cross-validation scores in a single step.
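
A sketch with cross_val_score; the 5 folds are my assumption, and the minus sign is needed because sklearn reports errors as negative scores.

    from sklearn.model_selection import cross_val_score

    # The pipeline re-fits the preprocessing inside every fold, so
    # there is no leakage between folds.
    scores = -cross_val_score(full_pipeline, X, y,
                              cv=5, scoring='neg_mean_absolute_error')
    print('MAE per fold:', scores)
    print('Average MAE:', scores.mean())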

№6: Summary

  • We imported the California housing dataset from Kaggle.
  • We implemented all the pre-processing steps (like filling missing values, scaling, encoding…etc) on the dataset.
  • We trained the ML model.
  • Then we repeated the preprocessing steps using ML pipelines.
  • We understood the benefits of pipeline implementation and the bonus tip (cross-validation).
  • We trained the ML model and received the MAE score.

№7: References

I really hope you guys learned something from this post. Feel free to 👏 if you like what you learnt. Let me know if there is anything you need my help with.

Feel free to reach out to me over LinkedIn.

