Understanding Machine Learning Pipelines — A Gentle Introduction
Simplifying Data Preprocessing with Sklearn Pipeline Class
Introduction
Pipelines are a simple way to keep your data preprocessing and modelling code organized. Specifically, a pipeline bundles preprocessing and modelling steps so you can use the whole bundle as if it were a single step.
Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
- Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won’t need to manually keep track of your training and validation data at each step.
- Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
- Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won’t go into the many related concerns here, but pipelines can help.
- More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.
Table Of Contents
- About the Dataset
- Performing Train Test Split
- Preprocessing W/O Pipelines
- Model Implementation W/O Pipelines
- Sklearn Pipeline Implementation
- Summary
- References
№1: About the Dataset
The dataset has been picked up from Kaggle and can be accessed from here. The data contains information from the 1990 California census.
The dataset contains the following columns:
- longitude: A measure of how far west a house is; a higher value is farther west
- latitude: A measure of how far north a house is; a higher value is farther north
- housingMedianAge: Median age of a house within a block; a lower number is a newer building
- totalRooms: Total number of rooms within a block
- totalBedrooms: Total number of bedrooms within a block
- population: Total number of people residing within a block
- households: Total number of households, a group of people residing within a home unit, for a block
- medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
- medianHouseValue: Median house value for households within a block (measured in US Dollars)
- oceanProximity: Location of the house w.r.t ocean/sea
medianHouseValue is the target variable.
№2: Performing Train Test Split
Before we preprocess the dataset, we need to download it from Kaggle and load it into a pandas DataFrame.
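The loading step can be sketched as follows (the filename `housing.csv` is an assumed path to the Kaggle download; the two-row inline sample stands in for the real CSV so the snippet runs on its own):

```python
import pandas as pd
from io import StringIO

# In practice: df = pd.read_csv("housing.csv")  # hypothetical path to the Kaggle download
# A tiny stand-in sample so the snippet is self-contained:
csv_data = StringIO(
    "longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,"
    "population,households,medianIncome,medianHouseValue,oceanProximity\n"
    "-122.23,37.88,41,880,129,322,126,8.3252,452600,NEAR BAY\n"
    "-122.22,37.86,21,7099,1106,2401,1138,8.3014,358500,NEAR BAY\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 10)
```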
Now let us remove the columns that are not relevant and perform the train-test split.
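A sketch of this step (the `test_size` and `random_state` values are illustrative choices, not necessarily the notebook's; the tiny frame stands in for the full dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame; in practice df is the full Kaggle dataset
df = pd.DataFrame({
    "medianIncome": [8.33, 8.30, 7.26, 5.64],
    "oceanProximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "medianHouseValue": [452600, 358500, 352100, 341300],
})

X = df.drop("medianHouseValue", axis=1)   # features
y = df["medianHouseValue"]                # target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (3, 2) (1, 2)
```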
№3: Preprocessing W/O Pipelines
Below is a hack to segregate the numerical and categorical columns
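One common way to do this is to split the columns by dtype with `select_dtypes` (a sketch on a stand-in frame):

```python
import pandas as pd

df = pd.DataFrame({
    "medianIncome": [8.33, 7.26],
    "totalRooms": [880, 7099],
    "oceanProximity": ["NEAR BAY", "INLAND"],
})

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(include="object").columns.tolist()
print(num_cols)  # ['medianIncome', 'totalRooms']
print(cat_cols)  # ['oceanProximity']
```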
Imputing Numeric Columns
Imputing is the process of filling empty or NaN values in your dataset. The fill value can be the mean, median, most frequent value, or a constant. You can read more about this here.
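For example, filling a missing `totalBedrooms` value with the column median via sklearn's `SimpleImputer` (a sketch; the `strategy` choice is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_num = pd.DataFrame({"totalBedrooms": [129.0, np.nan, 190.0]})

imputer = SimpleImputer(strategy="median")   # could also be "mean", "most_frequent", "constant"
X_filled = imputer.fit_transform(X_num)
print(X_filled)  # the NaN becomes 159.5, the median of 129 and 190
```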
Scaling Numeric Columns
Now we scale all the numeric values between 0 and 1. Two widely known techniques are MinMaxScaler and StandardScaler. I use MinMaxScaler in the current situation.
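A minimal sketch of MinMaxScaler, which rescales each column to the [0, 1] range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # 1 -> 0.0, 3 -> 0.5, 5 -> 1.0
```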
Imputing Categorical Columns
Similarly, we use an imputer to fill the missing values in the categorical columns. As categorical columns contain text data, we cannot fill the empty values with the median, mean, etc., hence I used most_frequent as the strategy.
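A sketch of the categorical imputation on a stand-in column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_cat = pd.DataFrame({"oceanProximity": ["INLAND", np.nan, "INLAND", "NEAR BAY"]})

imputer = SimpleImputer(strategy="most_frequent")   # fills with the modal category
X_filled = imputer.fit_transform(X_cat)
print(X_filled)  # the missing value becomes "INLAND"
```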
Encoding Categorical Columns
Lastly, we encode the categorical columns, which basically creates a sparse matrix out of the categorical values.
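A sketch using sklearn's `OneHotEncoder` (the `handle_unknown="ignore"` option is an illustrative choice that keeps unseen test categories from raising an error):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_cat = pd.DataFrame({"oceanProximity": ["INLAND", "NEAR BAY", "INLAND"]})

encoder = OneHotEncoder(handle_unknown="ignore")    # returns a sparse matrix by default
X_encoded = encoder.fit_transform(X_cat)
print(X_encoded.toarray())
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```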
№4: Model Implementation W/O Pipelines
Once imputing, scaling, and one-hot encoding are applied, all the relevant columns are brought together.
Now the linear model is trained and the MSE score is calculated on the test set.
With this, we complete the end-to-end process of implementing the ML model without pipelines.
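Put together, the tail of the no-pipeline flow looks roughly like this (a sketch: the small arrays stand in for the preprocessed feature matrices, which in practice are the scaled numeric and encoded categorical columns combined, e.g. with `np.hstack`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Stand-ins for the preprocessed train/test feature matrices and targets
X_train = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 1.0]])
y_train = np.array([1.0, 2.0, 3.0])
X_test = np.array([[0.25, 1.0]])
y_test = np.array([1.5])

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```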
№5: Sklearn Pipeline Implementation
First, we perform the train-test split and segregate the numeric and categorical columns, just as before.
Pipeline Implementation
For the pipeline implementation, we import the sklearn Pipeline class along with the ColumnTransformer class. This is how we implement this:
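A sketch of how the two classes combine (the step names and strategy choices are illustrative; `num_cols`/`cat_cols` are the lists from the segregation step):

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression

num_cols = ["medianIncome"]        # assumed numeric columns
cat_cols = ["oceanProximity"]      # assumed categorical columns

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer routes each column group to its own sub-pipeline
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols),
])

# The final pipeline bundles preprocessing and the model into one estimator
model = Pipeline([
    ("preprocess", preprocessor),
    ("regress", LinearRegression()),
])
```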
Model Implementation with Pipelines
We import the linear regression model and pass the training data to the pipeline and the model in a single step. Easy peasy 😆
In the next step, we pass the test data and calculate the MAE.
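Fitting, predicting, and scoring with the bundle can be sketched as below (the pipeline is rebuilt on stand-in data; in the notebook `X_train`/`X_test` come from the Kaggle split):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Stand-in data; in practice these come from the train-test split
X_train = pd.DataFrame({
    "medianIncome": [1.0, 2.0, 3.0, 4.0],
    "oceanProximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
y_train = pd.Series([100.0, 200.0, 300.0, 400.0])
X_test, y_test = X_train.iloc[:2], y_train.iloc[:2]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["medianIncome"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     ["oceanProximity"]),
])
model = Pipeline([("preprocess", preprocessor), ("regress", LinearRegression())])

model.fit(X_train, y_train)          # preprocessing + training in one call
mae = mean_absolute_error(y_test, model.predict(X_test))
print(mae)
```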
Bonus — Implementing cross-validation
Once the pipeline is created, we can also calculate cross-validation scores in a single step.
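For instance, with a simple numeric-only pipeline (a sketch; the `cv` and `scoring` choices are illustrative), each fold re-fits the preprocessing steps on its own training portion, so there is no leakage from the validation fold:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in data with a perfectly linear relationship
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
    ("model", LinearRegression()),
])

# One call: imputer and scaler are re-fit inside each fold
scores = cross_val_score(pipe, X, y, cv=3, scoring="neg_mean_absolute_error")
print(-scores)   # MAE per fold
```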
№6: Summary
- We imported the California housing dataset from Kaggle.
- We implemented all the preprocessing steps (filling missing values, scaling, encoding, etc.) on the dataset.
- We trained the ML model.
- We then repeated the preprocessing steps using an ML pipeline.
- We understood the benefits of pipeline implementation and the bonus tip (cross-validation).
- We trained the ML model and received the MAE score.
№7: References
- https://www.kaggle.com/alexisbcook/pipelines
- Notebook link — https://nbviewer.org/github/hargurjeet/MachineLearning/blob/master/ML_Pipelines.ipynb#Model-Implementation
I really hope you guys learned something from this post. Feel free to 👏 if you liked what you learned. Let me know if there is anything you need my help with.
Feel free to reach out to me over LinkedIn.