Machine Learning with Python: Implementing XGBoost and Random Forest

A self-paced guide to implementing ML models: learn to set up data transformation pipelines, tune your model's hyperparameters to achieve the best results, and finally submit your predictions to Kaggle

Hargurjeet
11 min read · Aug 9, 2021
Photo by Nicole De Khors from Burst

As the field of machine learning is ever evolving, keeping yourself up to date and learning continuously is key. Learning new skills can sometimes be time consuming. In this notebook I attempt a beginner-friendly approach to building a machine learning model from scratch: implementing data pipelines to perform all the required transformations before the data is passed to our ML model, and finally hyperparameter tuning to get the best results out of the model.

At the end we will learn to save and reload our trained model, so it can be reused at your convenience without the hassle of training it again, and to submit your results to Kaggle to benchmark your model's performance.

Exciting, let's begin!

Before we dive into the dataset I would like to take a few minutes to refresh a few concepts. If you are familiar with these, feel free to skip this section and jump directly to the Table of Contents.

Machine Learning Models (aka ML models)
The following is a list of commonly and widely used ML models (most are supervised; K-Means and dimensionality reduction algorithms are unsupervised):

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. kNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boosting algorithms
    1. GBM
    2. XGBoost
    3. LightGBM
    4. CatBoost

A deep dive into Random Forest and XGBoost is beyond the scope of this article, but I have provided a few resources in the reference section to help you get started.

DATA Pipelines
A data pipeline is a set of actions that ingest raw data from different sources and move it to a destination where it can be used for relevant purposes. A pipeline may also include filtering and features that provide resiliency against failure. In simple terms we can visualize a pipeline like this; the end product can then be fed into the ML model.

Source: https://hazelcast.com/glossary/data-pipeline/

Hyperparameter Tuning
Hyperparameter tuning is choosing a set of optimal hyperparameters for a learning algorithm. Every learning algorithm has a different set of parameters. For instance, Random Forest has the following parameters that can be tuned (the signature below is from the scikit-learn documentation):

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

The optimal hyperparameters help to avoid under-fitting (training and test error are both high) and over-fitting (training error is low but test error is high). You can get the intuition from this diagram:

Source: analyticsvidhya.com

I would recommend giving the article below a read to understand this in detail.

Table Of Contents

  1. About the Dataset
  2. Loading and Preprocessing dataset
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Data Cleaning, Splitting and Pipeline Implementation
  6. Implementing Random Forest
  7. Hyperparameter Tuning — Random Forest
  8. Implementing XGBoost
  9. Hyperparameter Tuning — XGBoost
  10. Sample Prediction and Saving & Recalling the model
  11. Summary
  12. Future Work
  13. References

№1: About the Dataset

One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk that the vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases “kicks”.

Kicked cars often result from tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars can be very costly to dealers after transportation costs, throw-away repair work, and market losses in reselling the vehicle.

Modelers who can figure out which cars have a higher risk of being kicked can provide real value to dealerships trying to provide the best inventory selection possible to their customers.

The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy).

  • All the variables in the data set are defined in the file Carvana_Data_Dictionary.txt
  • The data contains missing values
  • The dependent variable (IsBadBuy) is binary (C2)
  • There are 32 Independent variables (C3-C34)
  • The data set is split into 60% training and 40% testing.

Go to TOC

№2: Loading and Preprocessing dataset

We start by downloading all the required Python packages and importing the required libraries.
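The original setup cell isn't reproduced here; a minimal sketch, assuming the opendatasets, scikit-learn and xgboost packages used later in this walkthrough, might look like this:

```python
# Install the packages this walkthrough relies on (names assumed)
!pip install opendatasets scikit-learn xgboost --quiet

import os
import numpy as np
import pandas as pd
import opendatasets as od
```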

To download the dataset from Kaggle, I use the opendatasets library (imported as od).

To connect through the Kaggle API, enter your Kaggle username and API key when prompted.

Please read through this article to understand the process of getting your API key from Kaggle.

Below is the list of files downloaded. You can access all the downloaded filenames using os.listdir('<downloaded folder name>').
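A sketch of this step, assuming the competition slug DontGetKicked (opendatasets downloads into a folder of the same name):

```python
# Download the competition data; opendatasets prompts for your
# Kaggle username and API key on first use.
od.download('https://www.kaggle.com/c/DontGetKicked')

# Inspect the downloaded files
os.listdir('DontGetKicked')
```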

I use pandas to read the train and test datasets. Here I access a few records of the training dataset.
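Assuming the competition's usual file names, reading the data might look like this (train_df and test_df are the names used in the sketches that follow):

```python
# Read the train and test sets into pandas DataFrames
train_df = pd.read_csv('DontGetKicked/training.csv')
test_df = pd.read_csv('DontGetKicked/test.csv')

# Peek at a few records of the training set
train_df.head()
```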

Go to TOC

№3: Exploratory Data Analysis

3.1: Understanding the manufacture year of the vehicles

It is evident that although most vehicles were manufactured in 2005 and 2006, the percentage of those vehicles that turn out to be kicks is low compared to the manufacturing years 2001 and 2002, where the rate is close to 50%. Vehicles from the later years of 2008 and 2009 seem to have the minimum number that turn out bad.
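A sketch of the underlying computation, assuming the VehYear and IsBadBuy columns from the data dictionary:

```python
# Vehicle counts and kick rate (% of IsBadBuy == 1) per manufacturing year
year_summary = pd.DataFrame({
    'vehicles': train_df['VehYear'].value_counts().sort_index(),
    'kick_rate_%': train_df.groupby('VehYear')['IsBadBuy'].mean() * 100,
})
print(year_summary)
```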

3.2 Checking if the auction has any influence on a vehicle being bad

It is observed that at the MANHEIM auction most of the cars sold turned out good.

3.3 Understanding the Manufacturers

Cars from the manufacturers DODGE and FORD seem to have the maximum number of kicked cars.

3.4 Impact of a car's color on being kicked or not

There is a high probability that kicked cars were repainted. A high number of kicked cars are painted White, Silver, or Blue.

3.5 Impact of transmission type on kicked cars

A couple of observations here:

  • Most of the cars have AUTO transmission, and about 16% of these turned out to be kicks.
  • Data quality issue: for some cars the transmission type is recorded as ‘Manual’ instead of ‘MANUAL’.

Go to TOC

№4: Feature Engineering

Before performing feature engineering, let us first check the data quality and identify null values. Below are the details of the training set.

Below are the details of the test set.

Checking Duplicates

Both the training and test sets have no duplicates.

As we see above, many columns contain null values. It is important for us to understand which columns would help the model generalize better. The following columns seem irrelevant, as they contain specific details that may not help the model learn, so I won't pass them to the model (a sketch of the drop follows the list):

  • PurchDate (Date might not be relevant but Year would be)
  • WheelTypeID
  • Model
  • Trim
  • SubModel
  • Make
  • VNZIP1
  • VNST
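A sketch of the drop, keeping PurchDate for now so the purchase year can be extracted in the next step (train_df and test_df are the frames read earlier):

```python
# Columns judged too specific to help the model generalize
drop_cols = ['WheelTypeID', 'Model', 'Trim', 'SubModel',
             'Make', 'VNZIP1', 'VNST']

train_df = train_df.drop(columns=drop_cols)
test_df = test_df.drop(columns=drop_cols)
```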

Here I create a few additional columns from the existing ones to derive some additional features from the dataset.

The additional features help the model train better and also make it more generalized.
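As one example of such a derived feature, the purchase year can be extracted from PurchDate before the raw date is dropped (the exact features the original notebook derives may differ; PurchYear is an illustrative name):

```python
# Derive the purchase year, then drop the raw date column
for df in (train_df, test_df):
    df['PurchYear'] = pd.to_datetime(df['PurchDate']).dt.year

train_df = train_df.drop(columns='PurchDate')
test_df = test_df.drop(columns='PurchDate')
```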

Handling NaN

I observed one particular scenario where categorical values vary only because of a case-sensitivity issue.
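For example, the Transmission column holds both ‘Manual’ and ‘MANUAL’; normalizing the case collapses them into one value:

```python
# Collapse 'Manual' / 'MANUAL' (and similar variants) into a single value
train_df['Transmission'] = train_df['Transmission'].str.upper()
test_df['Transmission'] = test_df['Transmission'].str.upper()
```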

The target values appear highly imbalanced, which is not good for our machine learning model.
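A quick check of the class balance (in this dataset kicks are roughly a one-in-eight minority):

```python
# Fraction of each target class; IsBadBuy == 1 (kicks) is a small minority
train_df['IsBadBuy'].value_counts(normalize=True)
```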

Go to TOC

№5: Data Cleaning, Splitting and Pipeline Implementation

We perform the following tasks in this section:

  • Segregating the features and the target.
  • Identifying the numerical and categorical columns.
  • Splitting the dataset into training and test sets.
  • For numeric columns
    - Use KNNImputer to fill the missing values (SimpleImputer can also be used in case you are not familiar with the KNN imputer).
    - MinMaxScaler() to normalize the numerical values.
  • For categorical columns
    - Use SimpleImputer to fill in the missing values.
    - OneHotEncoder to encode all the categorical columns.
  • Finally, all the above steps are put inside a pipeline, as sketched below.
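A minimal sketch of the whole section, assuming the train_df prepared above (variable names are illustrative, and the split ratio is an assumption):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Segregate features and target
X = train_df.drop(columns=['IsBadBuy'])
y = train_df['IsBadBuy']

# Identify numeric and categorical columns
numeric_cols = X.select_dtypes(include='number').columns.tolist()
categorical_cols = X.select_dtypes(include='object').columns.tolist()

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Numeric: KNN-impute missing values, then scale to [0, 1]
numeric_pipe = Pipeline([
    ('imputer', KNNImputer()),
    ('scaler', MinMaxScaler()),
])

# Categorical: impute with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# All preprocessing steps combined into a single transformer
preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols),
])
```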

Go to TOC

№6: Implementing Random Forest

The training set accuracy is close to 100%! But we can't rely solely on training set accuracy; we must evaluate the model on the validation/test set too.
We can make predictions and compute accuracy in one step using model.score.
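A sketch of the fit-and-score step, reusing the preprocessor built in the previous section:

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = Pipeline([
    ('preprocess', preprocessor),
    ('rf', RandomForestClassifier(n_jobs=-1, random_state=42)),
])
rf_model.fit(X_train, y_train)

# score() predicts and computes accuracy in one step
print('Train accuracy:     ', rf_model.score(X_train, y_train))
print('Validation accuracy:', rf_model.score(X_val, y_val))
```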

It appears that the model has learned the training examples perfectly and doesn't generalize well to previously unseen examples. One possible reason is that, in resolving the class imbalance by oversampling, we induced some overlap between the training and test sets, hence the very high accuracy. This phenomenon is called “overfitting”, and reducing overfitting is one of the most important parts of any machine learning project.

I can now think of two possible solutions

  1. Hyperparameter tuning to overcome the overfitting. I will cover this in the next section.
  2. Instead of splitting the data into training and test sets, train on the entire set in one go; this validation strategy is called K-fold cross validation. To illustrate it with an example, I will implement XGBoost and apply cross validation on top of it.

Go to TOC

№7: Hyperparameter Tuning — Random Forest

As we saw in the previous section, our random forest classifier memorized all the training examples, leading to 100% training accuracy, while the validation accuracy was only marginally better than a dumb baseline model. This phenomenon is called overfitting, and in this section we'll look at some strategies for reducing it. The process of reducing overfitting is known as regularization.

By varying the following hyperparameters, we can prevent the trees from memorizing all the training examples, which may lead to better generalization.

Although the overall accuracy of the model is reduced, we have significantly reduced overfitting, as we can see from the correlation between the training and test results. I now pick, from the available list, the parameters that achieve the best accuracy.

Sklearn provides RandomizedSearchCV to run the model over the list of parameters instead of us doing it manually, and best_estimator_ helps us select the parameters on which the model performs best.
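A sketch of the search; the exact grid from the original notebook isn't reproduced here, so the ranges below are illustrative:

```python
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'rf__n_estimators': [100, 200, 500],
    'rf__max_depth': [5, 10, 20, None],
    'rf__max_leaf_nodes': [32, 128, 512, None],
    'rf__min_samples_leaf': [1, 5, 10],
}

search = RandomizedSearchCV(rf_model, param_distributions,
                            n_iter=20, cv=3, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # pipeline refit with the best parameters
print(search.best_params_)
```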

Accuracy

It is observed that although the accuracy has dropped significantly, the deviation between the training and test sets is minimal. Hence our model is fairly generalized.

The results captured can now be submitted to Kaggle. In the following block we create the output file, which can then be uploaded to Kaggle.
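A sketch of the output file creation, assuming the competition's RefId / IsBadBuy submission format:

```python
# Predict on the test set and write the submission file
submission = pd.DataFrame({
    'RefId': test_df['RefId'],
    'IsBadBuy': best_model.predict(test_df),
})
submission.to_csv('submission.csv', index=False)
```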

The csv file can now be downloaded and submitted to the Kaggle competition.

Kaggle Submission

  • Go to the competition page
  • Click on ‘Late Submission’
  • Upload the csv file and click on ‘Make Submission’. That's it, it is all that simple!

Go to TOC

№8: Implementing XGBoost

First, we set up our pipelines as usual.

Here we call the XGBoost classifier.

Now we fit the model to the entire dataset. Remember, we won't be performing a train/test split; instead we will use K-fold cross validation to evaluate our model.
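A sketch of the fit on the full training data, reusing the preprocessor and the X, y defined earlier:

```python
from xgboost import XGBClassifier

xgb_model = Pipeline([
    ('preprocess', preprocessor),
    ('xgb', XGBClassifier()),
])

# Fit on the entire training data; evaluation is done via K-fold below
xgb_model.fit(X, y)
print('Train accuracy:', xgb_model.score(X, y))
```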

Our model shows an accuracy of about 83% when tested on the training set. Let us now implement K-fold cross validation.

K Fold Cross Validation
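A sketch of the 5-fold loop that produced the accuracies below, keeping one fitted model per fold so their predictions can be averaged later:

```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
models = []

for train_idx, val_idx in kfold.split(X):
    X_tr, X_va = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_va = y.iloc[train_idx], y.iloc[val_idx]

    # A fresh pipeline per fold so each model is trained independently
    fold_model = Pipeline([
        ('preprocess', preprocessor),
        ('xgb', XGBClassifier()),
    ])
    fold_model.fit(X_tr, y_tr)
    models.append(fold_model)

    print(f'Train Accuracy: {fold_model.score(X_tr, y_tr)}, '
          f'Validation Accuracy: {fold_model.score(X_va, y_va)}')
```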

Train Accuracy: 0.7321576783743934, Validation Accuracy: 0.3810881537319845

Train Accuracy: 0.7148939078809894, Validation Accuracy: 0.39284458852478227

Train Accuracy: 0.7126871136889592, Validation Accuracy: 0.48498222864508067

Train Accuracy: 0.7391100565368954, Validation Accuracy: 0.43030113658555635

Train Accuracy: 0.7381556848806781, Validation Accuracy: 0.4356690883524725

As observed here, although the training-set accuracy is high, the validation set exhibits very low accuracy. So none of these models is good enough on its own, but we can average over them so that the errors are reduced. Let's define a function to average the predictions of the 5 different models.
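A sketch of the averaging function (predict_avg is an illustrative name):

```python
def predict_avg(models, inputs):
    """Average the predicted kick probabilities of all folds' models,
    then threshold at 0.5 to get the final class."""
    probs = np.mean([m.predict_proba(inputs)[:, 1] for m in models], axis=0)
    return (probs > 0.5).astype(int)
```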

Now we predict the test outcome from the averaged learning of all 5 models above, e.g. avg_preds = predict_avg(models, test_df).

Go to TOC

№9: Hyperparameter Tuning — XGBoost

We write a helper function to perform hyperparameter tuning along with K folds.
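A sketch of such a helper (test_params_kfold is an illustrative name):

```python
def test_params_kfold(n_splits, **params):
    """Train one XGBoost model per fold with the given hyperparameters
    and report the mean train/validation accuracy across folds."""
    train_scores, val_scores = [], []
    for train_idx, val_idx in KFold(n_splits, shuffle=True,
                                    random_state=42).split(X):
        model = Pipeline([
            ('preprocess', preprocessor),
            ('xgb', XGBClassifier(**params)),
        ])
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        train_scores.append(model.score(X.iloc[train_idx], y.iloc[train_idx]))
        val_scores.append(model.score(X.iloc[val_idx], y.iloc[val_idx]))
    print(f'Train: {np.mean(train_scores):.4f}, '
          f'Validation: {np.mean(val_scores):.4f}')

# Example: test_params_kfold(5, n_estimators=100, max_depth=4)
```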

Now, I start experimenting with the parameters. There are no hard and fast rules on which parameter to pick first; I usually prefer the order in which they appear in the library's documentation. The values to pass to these parameters come with experience, but a rule of thumb is to always start with smaller values.

After evaluating all the n_estimators values, it is observed that the model performs best when n_estimators=500. (I have excluded the output block intentionally; if you wish to see the results, I have provided a link to my notebook in the reference section.)

After evaluating n_estimators and max_depth together, it is observed that the model performs best when n_estimators=500 and max_depth=6.

After evaluating n_estimators, max_depth and learning_rate together, it is observed that the model performs best when n_estimators=500, max_depth=6 and learning_rate=0.9.

Summary

We have trained XGBoost with the following parameter values and recorded the respective training and validation accuracies:

  • n_estimators
    - n_estimators=10
    - n_estimators=100
    - n_estimators=240
    - n_estimators=500
  • max_depth
    - max_depth=2
    - max_depth=4
    - max_depth=6
  • learning_rate
    - learning_rate=0.01
    - learning_rate=0.1
    - learning_rate=0.3
    - learning_rate=0.9
    - learning_rate=0.99

Based on the results, we conclude the final model parameters as follows:
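Using the tuned values from above, the final model might be defined and refit like this:

```python
# Final XGBoost model with the tuned hyperparameters, refit on all data
final_model = Pipeline([
    ('preprocess', preprocessor),
    ('xgb', XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.9)),
])
final_model.fit(X, y)
```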

The csv file can now be downloaded and submitted to the Kaggle competition. You can revisit the detailed steps for submission in Section 7.

Go to TOC

№10: Sample Prediction and Saving & Recalling the model

Making Predictions on New Inputs

The output for our given sample is 0, and as per our model the probability of the output being zero is 63%.
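A sketch of scoring a single new input (the sample row here is illustrative):

```python
sample = X.iloc[[0]]  # a single-row input, kept as a DataFrame

final_model.predict(sample)        # e.g. array([0])
final_model.predict_proba(sample)  # e.g. array([[0.63, 0.37]]) -> P(0) = 63%
```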

Saving the model

We can save our trained model's learned parameters to disk, so that we needn't retrain the model from scratch each time we wish to use it. Along with the model, it's also important to save the imputers, scalers, encoders and even the column names; anything that will be required while generating predictions with the model should be saved.

We can use the joblib module to save and load Python objects on the disk.

The object can be loaded back using joblib.load
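A sketch with joblib (the file name is illustrative); because the saved object is the whole pipeline, the imputers, scaler and encoder travel with the model:

```python
import joblib

# Save the fitted pipeline (preprocessing included) to disk
joblib.dump(final_model, 'car_quality_model.joblib')

# Load it back later, ready to predict without retraining
loaded_model = joblib.load('car_quality_model.joblib')
```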

We can check whether the reloaded model behaves as we expect.

The accuracy matches that of the original model, hence we have saved and recalled the model successfully.

Go to TOC

№11: Summary

I summarize my entire notebook as follows:

  • We downloaded the car quality detection dataset from Kaggle.
  • We ran EDA and analysed the input features.
  • We then performed feature engineering and data cleaning to filter down to the relevant car data.
  • After this we took two different algorithms to build machine learning models:
    - Random Forest
    - XGBoost
  • We applied hyperparameter tuning to get the best out of the ML models and to generalize them in the best possible way.
  • The output file generated can be submitted to Kaggle to evaluate your results.

Go to TOC

№12: Future Work

  • Although I tried a couple of models, there are many more that can be tried, like decision trees and LightGBM.
  • Implement deep learning to get a better model.
  • Submit your results to the Kaggle competition and evaluate your model's performance on the leaderboard.

Go to TOC

№13: References

I took my inspiration from the following notebooks

Lastly, I would like to thank Aakash N S and Jovian.ml for providing the course on machine learning. If you want to get started, this course might be a good starting point.

If you liked my work and feel that you learned something today, please feel free to leave a few 👏; it keeps me motivated. For any queries, please put a note in the comment section.

Thank you, and have a great day!
