Best Known Techniques For Data Scientist To Handle Missing/Null Values In Any Tabular Dataset

As a data scientist, handling missing values can be quite intimidating at times. This blog will help you build the right mindset.

Hargurjeet
6 min read · Mar 11, 2022
Photo by Brett Jordan on Unsplash

What are missing values in a Dataset?

As the name suggests, missing values are values within the dataset that are displayed or represented as NaN or '' (blank), for known or unknown reasons.

While working with real-world datasets it is very common to come across missing values, and as a data scientist you are expected to overcome this issue by following the right technique while maintaining the underlying behaviour of the dataset.

Before we jump to the techniques, it is important for us to understand the underlying reasons for the missing values, such as:

— IoT data where sensors were offline for some time.

— Survey data where some questions were not answered.

— A Kaggle competition where the host wants to make the problem hard 😜.

— Online form or website allowing blanks or null values for non-mandatory fields.

— Values are just randomly missing.

Your understanding of the above behaviour determines which technique you should follow to handle the missing values. Hence the "why" part of the problem is very important.

For illustration purposes, I take the Housing Prices Competition dataset available on Kaggle. I access this dataset and read it into a data frame as follows.
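A minimal sketch of loading the data, assuming the competition CSVs have been downloaded locally as train.csv and test.csv (the file paths are placeholders):

```python
import pandas as pd

# Assumed local paths to the Kaggle Housing Prices competition files
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.shape, test_df.shape)
```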

Identifying null values

There are many ways to identify null values. I find the below two techniques the easiest.

  1. Using the isna() method — isna() returns a boolean mask marking the NaN values. You can run .sum() over it to get the count of null values per column.

2. Using the .info() method — It gives column-wise details such as the dtype along with the non-null counts. Both checks are sketched below.
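A quick sketch of both checks on the training frame loaded above:

```python
# Count of null values per column (showing only columns with at least one null)
null_counts = train_df.isna().sum()
print(null_counts[null_counts > 0].sort_values(ascending=False))

# Column-wise dtypes and non-null counts
train_df.info()
```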

Missing Values EDA

  1. Read both the training and test sets and compare the percentage of missing values.

In this case, the distribution of missing values in train and test seems to be uniform, which is quite abnormal. In a real-world scenario we wouldn't expect such symmetry; maybe these values were induced intentionally.
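One way to compare the percentage of missing values side by side (a sketch using the frames from earlier; the target column SalePrice is excluded since it exists only in train):

```python
# Percentage of missing values per column in train and test
train_pct = train_df.drop(columns=["SalePrice"]).isna().mean() * 100
test_pct = test_df.isna().mean() * 100

missing_pct = pd.DataFrame({"train_%": train_pct, "test_%": test_pct})
print(missing_pct[missing_pct.sum(axis=1) > 0].round(2))
```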

2. Understanding the number of missing values per observation.

Another approach is to look at the number of missing values in each row of the dataset. I select only the features where we observed missing values in the first place.
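A sketch of counting missing values per row, restricted to the columns found to contain nulls above:

```python
# Columns that contain at least one missing value
cols_with_na = train_df.columns[train_df.isna().any()]

# Number of missing values in each row across those columns
row_missing = train_df[cols_with_na].isna().sum(axis=1)
print(row_missing.value_counts().sort_index())
```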

An interesting thing to observe: there are records with as many as 13, 14, or 16 missing values. Maybe such sparsity of features might even help the model learn better 😄

3. Checking for imbalance in missing values for categorical features

Let us take a couple of categorical features and see the spread of missing values across the values of each feature.
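For example, with two categorical features from this dataset (an illustrative choice, here Alley and FireplaceQu), value_counts with dropna=False shows how the missing values sit alongside each category:

```python
# Spread of values (including NaN) for a couple of categorical features
for col in ["Alley", "FireplaceQu"]:  # illustrative choice of features
    print(train_df[col].value_counts(dropna=False), "\n")
```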

Some common FAQs

Q1: Should the EDA be performed before or after imputing the missing values? Which approach gives a better result?

Answer — It is best to do it before, because the filling itself might change the distribution.

Q2: Should the missing values be imputed after normalizing or before normalizing the data?

Answer — There is no single right answer to this. You can try both and check the results. Whenever in doubt, test it in your K-fold cross-validation loop and compare the results.

Imputation Techniques

Photo by Volodymyr Hryshchenko on Unsplash

Table of contents

  1. Do nothing
  2. Just Drop the NA columns
  3. Pandas Imputations
  4. Sklearn Imputations
  5. References

Step 1: Do nothing

Models like LightGBM and XGBoost can deal with missing values on their own. These models do not require imputation and simply treat the missing values as another kind of value.

For other types of models and for neural networks, you cannot leave the missing values as they are, and imputation is required.

Step 2: Just drop the NA columns

The model is not going to perform well if dropping the null values costs us most of the data. There are 3 ways of dropping null values:

  • Dropping by rows — In this technique, we drop every record (row) that contains a null value. This can be achieved using pandas as follows.
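A minimal sketch with pandas dropna along the rows:

```python
# Drop every row that contains at least one null value
dropped_rows_df = train_df.dropna(axis=0)
print(dropped_rows_df.shape)
```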

As observed above, using this technique all the rows get dropped, hence it is not suited to the current situation.

  • Dropping by columns — This would drop all the columns which have null values.
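The same call with axis=1 drops the columns instead:

```python
# Drop every column that contains at least one null value
dropped_cols_df = train_df.dropna(axis=1)
print(dropped_cols_df.shape)
```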

From the above results, it is observed that out of 82 columns, 47 columns have no null values and are retained, while the rest of the columns are dropped.

  • Dropping a subset based on a condition — This technique can be used to drop rows where a certain column contains null values.
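A sketch using the Alley feature from this dataset as the condition:

```python
# Drop only the rows where the Alley feature is null
alley_subset_df = train_df.dropna(subset=["Alley"])
print(alley_subset_df.shape)
```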

The above results show that the feature Alley has 91 non-null records, and the rest of the records are dropped.

Now you need to select the techniques that suit you best under the given circumstances.

Step 3: Pandas Imputation

  1. Using this technique you can fill the missing values with a fixed numeric value using fillna.
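A minimal sketch, filling every missing value with the constant 0:

```python
# Replace every missing value in the frame with a constant, e.g. 0
filled_df = train_df.fillna(0)
print(filled_df.isna().sum().sum())  # no nulls remain
```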

2. Filling missing values by mean/median/mode

You can also fill the missing values with the column mean using .mean(), and similarly with the median using .median(). The sketch below fills a feature with its mean and with its median.
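A sketch on the numeric LotFrontage feature; the two new columns are purely illustrative names for comparing the results:

```python
# Fill a numeric column with its mean ...
train_df["LotFrontage_mean"] = train_df["LotFrontage"].fillna(
    train_df["LotFrontage"].mean()
)

# ... or with its median
train_df["LotFrontage_median"] = train_df["LotFrontage"].fillna(
    train_df["LotFrontage"].median()
)
```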

3. Using GroupBy Fills

Using this technique you can fill the missing values with the mean/median computed within groups of another feature. For instance, we can fill in the missing values of LotFrontage using the average LotFrontage within each MSZoning group.

We now replace the missing values of the LotFrontage feature with these group averages.
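A sketch of the group-by fill using pandas transform:

```python
# Average LotFrontage within each MSZoning group, broadcast back to every row
group_means = train_df.groupby("MSZoning")["LotFrontage"].transform("mean")

# Replace missing LotFrontage values with the mean of their MSZoning group
train_df["LotFrontage"] = train_df["LotFrontage"].fillna(group_means)
```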

Step 4: SKLEARN Imputation

This would be the best technique for imputation compared to the previous ones. Sklearn provides imputer classes whose objects let us fit on the training folds and transform the validation fold separately within the cross-validation loop. This is not readily possible with the techniques mentioned above.

In real-world situations, you will want to fit and transform within your cross-validation loop to ensure no leakage.

  • SimpleImputer
  • IterativeImputer
  • KNNImputer

SimpleImputer

SimpleImputer has a strategy argument which can take the values 'mean', 'median', 'most_frequent' and 'constant'. We can choose the strategy based on our use case. Below I display how this can be implemented.
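A sketch applying SimpleImputer to the numeric columns only (the median strategy is an illustrative choice):

```python
from sklearn.impute import SimpleImputer

# Impute the numeric columns with their median values
num_cols = train_df.select_dtypes(include="number").columns
imputer = SimpleImputer(strategy="median")
train_df[num_cols] = imputer.fit_transform(train_df[num_cols])
```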

IterativeImputer

This strategy fills in the missing values using the other features in a round-robin fashion. In simple words, it iterates multiple times until it finds the best fit. It treats each feature with missing values as a target column and tries to predict it from the remaining features. The number of iterations is controlled by the max_iter parameter. It is implemented as follows.
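A sketch on the numeric columns; note that IterativeImputer is still experimental in scikit-learn, so the enabling import is required:

```python
# Enable the experimental IterativeImputer before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

num_cols = train_df.select_dtypes(include="number").columns
it_imputer = IterativeImputer(max_iter=10, random_state=0)
train_df[num_cols] = it_imputer.fit_transform(train_df[num_cols])
```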

KNN Imputer

This performs imputation for completing missing values using k-Nearest Neighbours. It is similar to IterativeImputer; the main difference is that each missing value is filled from the values of the k nearest neighbouring rows. You can set the number of neighbours using the n_neighbors argument.
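A sketch on the numeric columns with the default of 5 neighbours:

```python
from sklearn.impute import KNNImputer

# Fill each missing value from the 5 nearest rows in feature space
num_cols = train_df.select_dtypes(include="number").columns
knn_imputer = KNNImputer(n_neighbors=5)
train_df[num_cols] = knn_imputer.fit_transform(train_df[num_cols])
```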

References

This blog is inspired by the above-mentioned Kaggle notebook. I really hope you learned something from this post. Feel free to clap if you liked what you learnt. Let me know if there is anything you need my help with, and feel free to reach out over LinkedIn.

