Learning k-Fold Cross-Validation

In this tutorial, I demonstrate the implementation of k-fold cross-validation on a supervised regression machine learning problem.

Hargurjeet
4 min read · Aug 30, 2021
Photo by Matthew Henry from Burst

In this tutorial, you will discover a gentle introduction to the k-fold cross-validation procedure for estimating the skill of machine learning models.

After completing this tutorial, you will know:

  • What k-fold cross-validation is.
  • Why and when k-fold cross-validation should be used.
  • How to implement a machine learning algorithm on the k-fold splits of a dataset.

Table Of Contents

  1. Understanding k-fold
  2. Implementing k-fold on a dataset
  3. Summary
  4. References

№1: Understanding k-fold

Before we dig into k-fold cross-validation, it is important to understand why we need this technique, and why it can be better suited than the standard train-test split for some classical machine learning datasets.

The traditional way of separating train and test data for a machine learning problem is scikit-learn's train_test_split.

Train test split involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model, instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
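As a quick illustration, a minimal train-test split with scikit-learn might look like this; the synthetic data and the 80/20 ratio are example choices, not taken from the notebook:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for any feature matrix X and target y
X, y = make_regression(n_samples=1000, n_features=5, random_state=42)

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```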

Hence, the train-test split is suitable for “sufficiently large” datasets.

Conversely, the train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance. The estimated performance could be overly optimistic (good) or overly pessimistic (bad).

If you have insufficient data, then a suitable alternate model evaluation procedure would be the k-fold cross-validation procedure.


Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

In k-fold cross-validation, the dataset is divided into k folds and the following series of steps is performed (a code sketch follows this list):

  • The dataset is shuffled randomly.
  • The dataset is split into k groups.
  • For each unique group:
    - Take one group as the hold-out or test dataset
    - Take the remaining groups as the training dataset
    - Fit a model on the training set and evaluate it on the test set
    - Retain the evaluation score and discard the model
  • Summarize the skill of the model using the sample of model evaluation scores
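Here is a minimal sketch of those steps using scikit-learn's KFold. The synthetic data, the choice of 5 folds, and the mean squared error metric are illustrative, not taken from the notebook:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data standing in for any regression dataset
X, y = make_regression(n_samples=500, n_features=5, random_state=42)

# Shuffle randomly, then split into k groups
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # One group is held out; the remaining groups are used for training
    model = RandomForestRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    # Retain the evaluation score and discard the model
    scores.append(mean_squared_error(y[test_idx], preds))

# Summarize the skill of the model across folds
print("MSE per fold:", scores)
print("Mean MSE:", np.mean(scores))
```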

I implement this series of steps in the regression problem below and derive a final model.


№2: Implementing k-fold on a dataset

Importing the dataset from Kaggle

To download the dataset from Kaggle, I use the opendatasets library (imported as od).

To connect to Kaggle, enter your Kaggle username and API key when prompted.

Please read through this article to understand the process of getting your API key from Kaggle.
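A minimal sketch of the download step; the dataset URL below is a placeholder, not the one used in the notebook:

```python
import opendatasets as od

# Placeholder URL -- replace with the actual Kaggle dataset link.
# od.download prompts for your Kaggle username and API key.
od.download("https://www.kaggle.com/datasets/<owner>/<dataset-name>")
```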

I use pandas to read the train and test datasets. Here I access a few records of the training dataset.
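A sketch of the loading step, assuming the download produced the usual train.csv and test.csv files (the file names are assumptions):

```python
import pandas as pd

# File names are assumptions -- adjust to the downloaded dataset's layout
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train.head()  # inspect the first few records
```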

Splitting the dataset into k folds
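A common pattern for this step (a sketch; the notebook's exact code may differ) is to shuffle the dataframe and tag each row with its fold number in a new kfold column:

```python
from sklearn.model_selection import KFold

# Mark every row with the fold it belongs to, starting unassigned
df_train["kfold"] = -1

# Shuffle the rows, then reset the index so positions and labels match
df_train = df_train.sample(frac=1, random_state=42).reset_index(drop=True)

kf = KFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(kf.split(df_train)):
    df_train.loc[valid_idx, "kfold"] = fold
```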

Let us check the value counts of the kfold column to ensure the split is even.
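Assuming the kfold column from the previous step, the check is a one-liner:

```python
# Each of the 5 folds should contain roughly one fifth of the rows
df_train["kfold"].value_counts()
```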

Great!!!

We have successfully sliced the data into k folds.

Implementing k-fold with a random forest regressor
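A sketch of the per-fold training loop. The target column name and the assumption that all remaining columns are numeric features are hypothetical; adjust them to the actual dataset:

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical schema: 'target' is the label, everything else but 'kfold' is a feature
features = [c for c in df_train.columns if c not in ("target", "kfold")]

test_preds = []
for fold in range(5):
    # Train on every row that is NOT in the current fold
    train_fold = df_train[df_train["kfold"] != fold]
    model = RandomForestRegressor(random_state=42)
    model.fit(train_fold[features], train_fold["target"])
    # Predict on the held-out test set with this fold's model
    test_preds.append(model.predict(df_test[features]))

# test_preds is now a list of 5 arrays, one per fold
```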

The result is the final prediction from all the folds: a list of 5 arrays of predictions on the test set, one array per fold.

Great!!! We have now implemented k-fold cross-validation and derived the final model results.


№3: Summary

I summarize my entire notebook as follows:

  • First, we understood k-fold cross-validation.
  • We learned why and when to use k-fold cross-validation.
  • We implemented k-fold cross-validation on a dataset available on Kaggle.
  • We implemented a random forest regressor on the k-fold splits.
  • We derived the final model and predictions.


№4: References

I really hope you learned something from this post. Feel free to 👏 if you liked what you learned. Let me know if there is anything you need my help with. Please feel free to reach out to me via LinkedIn for any help.
