Getting Started With Machine Learning: The Swedish Auto Insurance Dataset Is Ideal for Beginners
If you are getting started with machine learning and are looking for a dataset to test your skills and understanding, you are in the right place. The Swedish auto insurance dataset is ideal for beginners because the volume of data is low (just 63 records) and only minimal feature engineering is needed to understand how the input relates to the labels (the final output).
Table of Contents
- Introduction
- Loading the data
- Feature Analysis
- Data cleaning
- Applying Train, Test split
- Training the ML model
- Cross validation to select best ML model
- Model performance
Introduction
The Swedish Auto Insurance Dataset involves predicting the total payment for all claims, in thousands of Swedish Kronor, given the total number of claims. It is a regression problem. The dataset comprises 63 observations with one input variable and one output variable. The variables are as follows:
- Number of claims.
- Total payment for all claims in thousands of Swedish Kronor.
Data load
The data can be downloaded from Kaggle or from my GitHub repo, whichever is more convenient.
I clean the data to remove junk values and assign meaningful names to the columns.
The info() function helps us understand the datatype of each column and how many non-null values it holds.
To perform feature analysis, it is important to convert the columns to their proper datatypes. I have used the to_numeric function to do this.
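The loading and conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the post: the inline CSV is made-up stand-in data with one junk row, and the column names `No_of_claims` and `Total_payment` are assumed for the example.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded file: two columns plus one junk row.
# The real file name, header and layout may differ.
raw = io.StringIO(
    "X,Y\n"
    "12,60.0\n"
    "??,unknown\n"
    "20,95.5\n"
    "4,18.0\n"
)

df = pd.read_csv(raw)
df.columns = ["No_of_claims", "Total_payment"]  # assumed column names

# Coerce every column to numbers; unparseable junk becomes NaN and is dropped.
df = df.apply(pd.to_numeric, errors="coerce").dropna().reset_index(drop=True)

df.info()  # prints dtypes and non-null counts per column
```

With a real file you would pass its path to `pd.read_csv` instead of the in-memory buffer.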
Feature Analysis
Let us have a look at the spread of values in both columns.
There seems to be a linear relation between the number of claims and the total payment. Let us try to confirm this with another visual.
I observe a strong correlation between the claims and the payments.
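Besides the plots, the strength of the relation can be checked numerically with a Pearson correlation. The sketch below uses synthetic data that merely imitates a noisy linear relation; the values, slope and noise level are assumptions and are not the real records.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 63 records (NOT the real dataset values):
# payment grows roughly linearly with the number of claims, plus noise.
rng = np.random.default_rng(0)
claims = np.arange(63) * 2
payment = 3.0 * claims + 25 + rng.normal(0, 30, size=63)
df = pd.DataFrame({"No_of_claims": claims, "Total_payment": payment})

# Pearson correlation: values near 1 indicate a strong positive linear relation.
corr = df["No_of_claims"].corr(df["Total_payment"])
print(f"Pearson correlation: {corr:.2f}")
```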
Before we split the data into training and testing sets, it is important to ensure that the values of ‘No of claims’ are represented evenly in both sets. Stratified sampling helps us achieve this. You can learn more about stratified and random sampling here.
Here I created a new column to help me categorize the data.
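One way to build such a category column is to bin the number of claims with `pd.cut`. This is a sketch under assumptions: the bin edges and the synthetic claim counts are chosen for illustration only, and the column name `No_of_claims_category` follows the name used later in the post.

```python
import numpy as np
import pandas as pd

# Synthetic claim counts (not the real records).
df = pd.DataFrame({"No_of_claims": np.arange(63) * 2})

# Bin edges are an assumption; pick them so every bin holds enough records
# for a stratified split (at least two per bin).
bins = [-1, 10, 25, 50, np.inf]
df["No_of_claims_category"] = pd.cut(
    df["No_of_claims"], bins=bins, labels=[1, 2, 3, 4]
)

print(df["No_of_claims_category"].value_counts().sort_index())
```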
Data Cleaning
This step removes the junk values from the dataset.
Applying Train, Test split
I apply StratifiedShuffleSplit and make sure No_of_claims_category is spread in equal proportions across the training and testing sets.
The results below show that the records are split uniformly between the training and testing sets.
After the split we no longer need the ‘No_of_claims_category’ column.
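The split described above might look like the following sketch. It assumes an 80/20 split, a fixed `random_state`, synthetic stand-in data and the illustrative bin edges; none of these are confirmed by the post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in data (not the real records).
rng = np.random.default_rng(0)
claims = np.arange(63) * 2
df = pd.DataFrame({
    "No_of_claims": claims,
    "Total_payment": 3.0 * claims + 25 + rng.normal(0, 30, size=63),
})
df["No_of_claims_category"] = pd.cut(
    df["No_of_claims"], bins=[-1, 10, 25, 50, np.inf], labels=[1, 2, 3, 4]
).astype(int)

# One stratified split keyed on the category column (80/20 is an assumption).
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["No_of_claims_category"]))
train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# Category shares should be close in both sets.
print(train_set["No_of_claims_category"].value_counts(normalize=True).sort_index())
print(test_set["No_of_claims_category"].value_counts(normalize=True).sort_index())

# The helper column has done its job; drop it before training.
train_set = train_set.drop(columns="No_of_claims_category")
test_set = test_set.drop(columns="No_of_claims_category")
```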
Training the ML Model
Selecting the right model can be a bit tricky, so I follow the cheat sheet below, shared by scikit-learn.
In our case, we follow the path: Start → more than 50 samples → predicting a category (no) → predicting a quantity (yes) → fewer than 100k samples → linear regression → SVR → ridge regression
Linear Regression
I observe that the predictions are pretty poor. I calculate the RMSE and MAE below.
Both the RMSE and the MAE seem to be high.
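For reference, fitting a linear regression and computing both error metrics can be sketched as below. The training data here is synthetic and the variable names are assumptions, so the printed numbers will not match the ones in the post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic training data (not the real records): one feature, one target.
rng = np.random.default_rng(0)
X_train = (np.arange(50) * 2.5).reshape(-1, 1)  # number of claims
y_train = 3.0 * X_train.ravel() + 25 + rng.normal(0, 30, size=50)

lin_reg = LinearRegression().fit(X_train, y_train)
pred = lin_reg.predict(X_train)

rmse = np.sqrt(mean_squared_error(y_train, pred))  # penalizes large errors more
mae = mean_absolute_error(y_train, pred)           # average absolute error
print(f"RMSE: {rmse:.1f}, MAE: {mae:.1f}")
```

Both metrics are in the units of the target (thousands of Kronor), and RMSE is never smaller than MAE, so a large gap between them hints at a few very large errors.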
Cross validation
I run cross-validation to get a more reliable estimate of model performance than a single evaluation on the training data gives. You can learn more about the importance of cross-validation here.
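A minimal sketch of cross-validating the linear model is shown here, assuming 5 shuffled folds and synthetic stand-in data. scikit-learn scorers follow a "higher is better" convention, so the mean squared error comes back negated and has to be flipped before taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic training data (not the real records).
rng = np.random.default_rng(0)
X_train = (np.arange(50) * 2.5).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 25 + rng.normal(0, 30, size=50)

# 5 shuffled folds; the fold count and seed are assumptions.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X_train, y_train,
    scoring="neg_mean_squared_error", cv=cv,
)

rmse_scores = np.sqrt(-scores)  # one RMSE per fold
print(f"mean RMSE: {rmse_scores.mean():.1f}, std: {rmse_scores.std():.1f}")
```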
I evaluate a few other models to determine the best one.
SVR
The SVR seems to perform worse than linear regression. Finally, I try ridge regression.
Ridge Regression
Judging by the average RMSE and its standard deviation, ridge regression seems to perform slightly better than linear regression.
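The three models can be compared side by side with the same cross-validation folds. The sketch below uses synthetic data and default-style hyperparameters (`SVR()` and `Ridge(alpha=1.0)` are assumptions), so the ranking it prints is not guaranteed to match the one reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic training data (not the real records).
rng = np.random.default_rng(0)
X_train = (np.arange(50) * 2.5).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 25 + rng.normal(0, 30, size=50)

models = {
    "Linear regression": LinearRegression(),
    "SVR": SVR(),                 # default RBF kernel, untuned
    "Ridge": Ridge(alpha=1.0),    # alpha is an assumed value
}

# Reuse identical folds so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_train, y_train, scoring="neg_mean_squared_error", cv=cv
    )
    rmse = np.sqrt(-scores)
    results[name] = (rmse.mean(), rmse.std())
    print(f"{name}: mean RMSE {rmse.mean():.1f}, std {rmse.std():.1f}")
```

An untuned SVR is sensitive to the scale of the inputs and the target, which may be one reason it lags behind the linear models on data like this.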
Model Performance
Let us see how my predictions compare against the labels.
I observe that the predictions are neither underfitting nor overfitting the data and seem reasonably good.
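One rough way to back up this kind of judgment is to put test-set predictions next to their labels and compare the error on the training and testing sets. This is a sketch on synthetic data with an assumed ridge model and a plain random split, not the exact evaluation from the post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real records).
rng = np.random.default_rng(0)
X = (np.arange(63) * 2.0).reshape(-1, 1)
y = 3.0 * X.ravel() + 25 + rng.normal(0, 30, size=63)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# Side-by-side view of labels and predictions on unseen data.
comparison = pd.DataFrame({"label": y_test, "prediction": model.predict(X_test)})
print(comparison.head().round(1))

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"train RMSE: {train_rmse:.1f}, test RMSE: {test_rmse:.1f}")
```

A test RMSE far above the training RMSE would suggest overfitting, while two similarly high values would point to underfitting.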
I hope you found a few takeaways in this post. If you liked my work, please give me a 👏.
Thanks for reading. Happy Learning 😃