Getting Started With Machine Learning: The Swedish Auto Insurance Dataset Is Ideal for Beginners

Hargurjeet
May 11, 2021

If you are getting started with machine learning and looking for a dataset to test your skills and understanding, you are at the right place. The Swedish auto insurance dataset is ideal for beginners because the volume of data is low (just 63 records) and you need only minimal feature engineering to understand how the feature relates to the label (the final output).

Table of Contents

  1. Introduction
  2. Loading the data
  3. Feature Analysis
  4. Data cleaning
  5. Applying the train/test split
  6. Training the ML model
  7. Cross validation to select the best ML model
  8. Model performance

Introduction

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims, in thousands of Swedish Kronor, given the total number of claims. It is a regression problem. The dataset comprises 63 observations with one input variable and one output variable. The variables are as follows:

  1. Number of claims.
  2. Total payment for all claims in thousands of Swedish Kronor.

Data load

The data can be downloaded from Kaggle or from my GitHub repo, whichever is more convenient.

Next, I clean the data to remove junk values and assign meaningful names to the columns.
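As a rough illustration of this load-and-clean step, the sketch below reads a small in-memory CSV with pandas. The file contents, the junk row and the column names No_of_claims and Total_payment are assumptions made for this example; the real file from Kaggle or GitHub may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file. The values are made up and
# the real file's header and junk rows may look different.
raw_csv = io.StringIO(
    "X,Y\n"
    "claims,payment\n"  # a junk descriptive row that should be removed
    "10,50.5\n"
    "20,98.1\n"
    "5,20.9\n"
)

df = pd.read_csv(raw_csv)

# Assign meaningful column names (names chosen for this sketch).
df.columns = ["No_of_claims", "Total_payment"]

# Keep only the rows whose first column parses as a number.
is_numeric = pd.to_numeric(df["No_of_claims"], errors="coerce").notna()
df = df[is_numeric].reset_index(drop=True)

print(df)
```

With a real file, the io.StringIO object would be replaced by the path to the downloaded CSV.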

The info() function helps us understand the datatype of each column and how many null values it contains.

To perform feature analysis, it is important to convert the columns to their proper datatypes. I used the pandas to_numeric function for this.
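A minimal sketch of the inspection and conversion described above, assuming the columns arrive as strings. The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up values stored as strings, as they might be after reading a file
# that contained non-numeric junk rows.
df = pd.DataFrame(
    {
        "No_of_claims": ["10", "20", "5"],
        "Total_payment": ["50.5", "98.1", "20.9"],
    }
)

df.info()  # before conversion: string columns and non-null counts

# Convert every column to a numeric dtype.
for column in df.columns:
    df[column] = pd.to_numeric(df[column])

df.info()  # after conversion: integer and float columns
```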

Feature Analysis

Let us look at the spread of values in both columns.

There seems to be a linear relationship between the number of claims and the total payment. Let us confirm this with another visual.

I observe a strong correlation between the claims and the payments.
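The strength of that relationship can also be checked numerically. The sketch below computes the Pearson correlation on synthetic data that stands in for the real dataset; the slope, intercept and noise level are arbitrary assumptions, so the printed value says nothing about the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (63 rows, roughly linear, arbitrary noise).
rng = np.random.default_rng(0)
claims = rng.integers(0, 125, size=63)
payment = 3.0 * claims + 20 + rng.normal(0, 30, size=63)
df = pd.DataFrame({"No_of_claims": claims, "Total_payment": payment})

# Pearson correlation between the feature and the label.
corr = df["No_of_claims"].corr(df["Total_payment"])
print(f"Correlation: {corr:.3f}")
```

A value close to 1 indicates a strong positive linear relationship.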

Before we split the data into training and testing sets, it is important to ensure that the ‘No of claims’ values are represented uniformly in both sets. Stratified sampling helps us achieve this. You can learn more about stratified and random sampling here.

Here I created a new column to help me categorize the data.
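One way to build such a category column is pandas' cut function, sketched below. The bin edges and labels are assumptions chosen for this example, and the claims values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the claims column: 63 evenly spaced values from 0 to 124.
df = pd.DataFrame({"No_of_claims": np.arange(63) * 2})

# Bucket the claims into four categories (bin edges are assumed for this sketch).
df["No_of_claims_category"] = pd.cut(
    df["No_of_claims"],
    bins=[-1, 20, 40, 60, np.inf],
    labels=[1, 2, 3, 4],
)

# Number of records that fall into each category.
category_counts = df["No_of_claims_category"].value_counts().sort_index()
print(category_counts)
```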

Data Cleaning

This step removes the junk values from the dataset.

Applying Train, Test split

I apply StratifiedShuffleSplit and ensure that No_of_claims_category is spread equally across the training and testing sets.

The results below show that the records are split uniformly between the training and testing sets.

After the split, we no longer need the ‘No_of_claims_category’ column.
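The split, the proportion check and the column drop described above can be sketched as follows. The data is synthetic, and the 80/20 ratio, bin edges and random seed are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the dataset (not the real values).
rng = np.random.default_rng(0)
claims = np.arange(63) * 2
df = pd.DataFrame(
    {
        "No_of_claims": claims,
        "Total_payment": 3.0 * claims + 20 + rng.normal(0, 30, size=63),
    }
)
df["No_of_claims_category"] = pd.cut(
    df["No_of_claims"], bins=[-1, 20, 40, 60, np.inf], labels=[1, 2, 3, 4]
)

# Stratify on the category column so both sets cover the full range of claims.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["No_of_claims_category"]))
train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# Compare the share of each category overall, in training and in testing.
proportions = pd.DataFrame(
    {
        "overall": df["No_of_claims_category"].value_counts(normalize=True),
        "train": train_set["No_of_claims_category"].value_counts(normalize=True),
        "test": test_set["No_of_claims_category"].value_counts(normalize=True),
    }
).sort_index()
print(proportions)

# The helper column has done its job and can be dropped.
train_set = train_set.drop(columns="No_of_claims_category")
test_set = test_set.drop(columns="No_of_claims_category")
```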

Training the ML Model

Selecting the right model can be a bit tricky, so I follow the cheat sheet published by scikit-learn, shown below.

In our case, we follow the path: Start → >50 samples → predicting a category? (no) → predicting a quantity? (yes) → <100k samples → linear regression → SVR → ridge regression

Linear Regression

I observe that the predictions are pretty poor. The RMSE and MAE are calculated below.

Both the RMSE and the MAE seem high.
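A minimal sketch of fitting a linear regression and computing both metrics with scikit-learn. The data is synthetic, so the printed numbers are illustrative only and do not reproduce the results discussed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for the training data (one feature, 63 rows).
rng = np.random.default_rng(0)
X = rng.integers(0, 125, size=(63, 1)).astype(float)
y = 3.0 * X[:, 0] + 20 + rng.normal(0, 30, size=63)

# Fit the model and predict on the same data it was trained on.
lin_reg = LinearRegression().fit(X, y)
predictions = lin_reg.predict(X)

# RMSE penalizes large errors more heavily than MAE does.
rmse = np.sqrt(mean_squared_error(y, predictions))
mae = mean_absolute_error(y, predictions)
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}")
```

Both metrics are in the same unit as the label, here thousands of Swedish Kronor.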

Cross validation

I run cross-validation to get a more reliable estimate of model performance. You can learn more about the importance of cross-validation here.
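Cross-validation for the linear model might look like the sketch below, which uses scikit-learn's cross_val_score. The number of folds (5) and the synthetic data are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.integers(0, 125, size=(63, 1)).astype(float)
y = 3.0 * X[:, 0] + 20 + rng.normal(0, 30, size=63)

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=5
)
rmse_scores = np.sqrt(-scores)

print("RMSE per fold:", rmse_scores.round(2))
print(f"Mean: {rmse_scores.mean():.2f}, Std: {rmse_scores.std():.2f}")
```

The mean gives the typical error across folds, and the standard deviation shows how much that estimate varies from fold to fold.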

Next, I evaluate a few other models to determine the best one.

SVR

The SVR seems to perform worse than the linear regression. Finally, I try ridge regression.

Ridge Regression

Judging by the average RMSE and its standard deviation, the model seems to perform slightly better than linear regression.
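The SVR and ridge candidates can be cross-validated in the same way. This sketch uses scikit-learn's default hyperparameters and synthetic data, both of which are assumptions, so the ranking it prints need not match the one reported above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.integers(0, 125, size=(63, 1)).astype(float)
y = 3.0 * X[:, 0] + 20 + rng.normal(0, 30, size=63)

# Candidate models with default settings (assumed for this sketch).
models = {"SVR": SVR(), "Ridge": Ridge(alpha=1.0)}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    rmse_scores = np.sqrt(-scores)
    # Store the mean and standard deviation of the per-fold RMSE.
    results[name] = (rmse_scores.mean(), rmse_scores.std())
    print(f"{name}: mean RMSE {results[name][0]:.2f}, std {results[name][1]:.2f}")
```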

Model Performance

Let us see how my predictions are performing against the labels.

Actual vs Predicted

I observe that the predictions neither underfit nor overfit the data, and they seem reasonably good.
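A side-by-side table is one simple way to make this comparison. The sketch below assumes ridge regression is the model that was kept and, for brevity, uses a plain random train_test_split rather than the stratified split; the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 125, size=(63, 1)).astype(float)
y = 3.0 * X[:, 0] + 20 + rng.normal(0, 30, size=63)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training rows only, then predict the held-out rows.
model = Ridge().fit(X_train, y_train)
comparison = pd.DataFrame(
    {"Actual": y_test, "Predicted": model.predict(X_test)}
)
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]
print(comparison.round(1))
```

Errors that are small and scattered around zero suggest the model is not systematically over- or under-predicting.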

I hope you found a few takeaways in this post. If you liked my work, please give me a 👏.

Thanks for reading. Happy Learning 😃
