Basic Missing Data Imputations and Limitations
Written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members of Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers an insight into state-of-the-art techniques combining drilling engineering and data science.
In today’s episode, some very common missing data imputation strategies and their corresponding limitations are discussed. Enjoy! :)

Enjoying great knowledge is just like enjoying delicious Peking Duck.
Introduction
Imputation refers to replacing missing data in a dataset with statistical estimates. The goal of imputation techniques is to produce a complete dataset that can be used to train machine learning models. In today’s blog post, we will cover common missing data imputation strategies for both numerical and categorical data.
What We'll Cover Today
Mean/median imputation for numerical data.
Frequent category | mode imputation for categorical data.
The corresponding limitations of these strategies.
Mean/Median Imputation
Mean/median imputation refers to replacing all occurrences of missing values (NA) within a variable by its mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution). Since the mean and median can only be calculated on numerical variables, these methods are suitable for continuous and discrete numerical variables only.
The assumptions made when using the mean/median imputation are that the data is missing completely at random, and the missing values are most likely to be very close to the most frequent/average observation of the dataset.
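As a quick sanity check when deciding between the mean and the median, one can look at the skewness of the variable. Below is a minimal sketch of this idea (the 0.5 cutoff is only an illustrative rule of thumb, not a hard rule):
def impute_center(series, skew_threshold=0.5):
    # Fill NAs with the mean if the distribution is roughly symmetric,
    # otherwise fall back to the median
    if abs(series.skew()) < skew_threshold:
        return series.fillna(series.mean())
    return series.fillna(series.median())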
We will use the famous Titanic dataset as an example. The dataset is available from Kaggle. I am using Google Colab as the IDE. First import the necessary dependencies:
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import plot, iplot
import plotly.figure_factory as ff
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
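Since the dataset path used below points to Google Drive, the Drive folder needs to be mounted in the Colab runtime first (assuming you downloaded the Kaggle file into your own Drive):
from google.colab import drive

# Make the Drive files visible under /content/drive
drive.mount('/content/drive')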
We are going to use the Plotly library to generate the plots. Then load the dataset, with only three specific columns:
df = pd.read_csv('/content/drive/MyDrive/titanic/train.csv', usecols=['Age', 'Fare', 'Survived'])
Take an initial look at the data:
df.head()
The DataFrame looks as follows:

Let’s check the percentage of missing values for these 3 columns:
df.isnull().mean()
The results are:
Survived 0.000000
Age 0.198653
Fare 0.000000
dtype: float64
The only column with missing data is the Age column. Let’s use mean/median imputation to fill in its missing values. First, find the median of the Age column:
age_median = df['Age'].median()
age_median
The result is:
28.0
Then find the mean of the Age column:
age_mean = df['Age'].mean()
age_mean
The result is:
29.69911764705882
Now let’s create two extra Age columns in the original DataFrame with its missing data filled by its median and mean, respectively:
df['Age_median'] = df['Age'].fillna(age_median)
df['Age_mean'] = np.round(df['Age'].fillna(age_mean), 1)
df.head(10)
The resulting DataFrame is:

To see the impacts of missing data imputations, let’s first look at the variance:
print('Original age variance: ', df['Age'].var())
print('Variance after median imputation: ', df['Age_median'].var())
print('Variance after mean imputation: ', df['Age_mean'].var())
The results are:
Original age variance: 211.01912474630802
Variance after median imputation: 169.51249827942345
Variance after mean imputation: 169.05044898422423
It can be seen that the variance has decreased noticeably after the imputation: replacing roughly 20% of the observations with a single central value shrinks the spread of the variable. Next, let’s plot the distribution of the variable before and after the imputation:
hist_data = [df['Age'].dropna(), df['Age_median'], df['Age_mean']]
group_labels = ['Age', 'Age_median', 'Age_mean']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
fig.show()
The plots are:

It can be seen that the mean/median imputation distorts the original distribution of the variable Age. The transformed variable shows more values around the mean/median values. Let's then take a look at the outliers:
fig = go.Figure()
fig.add_trace(go.Box(y=df['Age'], name='Age', marker_color = 'indianred'))
fig.add_trace(go.Box(y=df['Age_median'], name='Age_median', marker_color = 'lightseagreen'))
fig.add_trace(go.Box(y=df['Age_mean'], name='Age_mean', marker_color = 'burlywood'))
fig.show()
The plots are:

It can be seen that after the imputation we not only have more outliers at the higher Age values, but also new outliers at the lower values of Age.
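To make this visual impression more concrete, we could count how many points the usual 1.5 × IQR box-plot rule flags as outliers before and after imputation. This is a small additional sketch, not part of the original notebook output:
def count_box_outliers(series):
    # Count observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

for col in ['Age', 'Age_median', 'Age_mean']:
    print(col, count_box_outliers(df[col]))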
In conclusion, mean/median imputation distorts the original data towards its mean/median and can create additional outliers. Due to these limitations, it is recommended to apply mean/median imputation only to numerical variables with less than 5% missing data. However, in real-world applications this technique is widely applied to numerical variables with more than 5% missing data; in that case, we still need to understand its potential impacts.
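If you want to encode this rule of thumb in code, one possible sketch is a small helper that warns whenever the missing fraction exceeds a threshold before imputing (the 5% default simply reflects the guideline above):
def impute_with_check(series, strategy='median', max_missing=0.05):
    # Warn when the share of missing values exceeds the threshold
    frac = series.isnull().mean()
    if frac > max_missing:
        print(f'Warning: {frac:.1%} missing, imputation may distort the distribution')
    fill_value = series.mean() if strategy == 'mean' else series.median()
    return series.fillna(fill_value)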
Frequent Category | Mode Imputation
Mode imputation refers to replacing all the missing data (NA) within a variable by its mode, i.e., the most frequent value or most frequent category. Although the mode can be calculated for both numerical and categorical variables, in practice we only use this technique on categorical variables, because for numerical variables the mean or the median tends to better represent the average value of the population.
The assumptions made when using mode imputation are that the data is missing completely at random, and that the missing observations most likely look like the majority of the observations in the variable (i.e., the mode).
In this case, we will use the famous House Prices dataset for demonstration. This dataset is also available from Kaggle. Load the dataset, with only three specific columns:
df = pd.read_csv('/content/drive/MyDrive/house_price/train.csv', usecols=['BsmtQual', 'FireplaceQu', 'SalePrice'])
Take an initial look at the data:
df.head()
The DataFrame looks as follows:

Check the percentage of missing values for these 3 columns:
df.isnull().mean()
The results are:
BsmtQual 0.025342
FireplaceQu 0.472603
SalePrice 0.000000
dtype: float64
There are two columns with missing data, BsmtQual and FireplaceQu. Let’s apply mode imputation to fill the missing data for both of them, starting from BsmtQual. First take a look at the categorical distributions of BsmtQual before missing data imputation:
sorted_counts = df['BsmtQual'].value_counts().sort_values(ascending=False)
sorted_counts
The results are:
TA 649
Gd 618
Ex 121
Fa 35
Name: BsmtQual, dtype: int64
Now generate the plots:
px.bar(sorted_counts, labels={'index': 'BsmtQual', 'value': 'counts'})
The plots are:

Find the mode of BsmtQual:
df['BsmtQual'].mode()
The result is:
0 TA
dtype: object
Keep the original SalePrice values where BsmtQual equals TA for later use:
tmp = df[df['BsmtQual'] == 'TA']['SalePrice']
Fill the missing data with the mode:
df['BsmtQual'] = df['BsmtQual'].fillna('TA')
Let's plot the categorical distributions of BsmtQual again:
sorted_counts = df['BsmtQual'].value_counts().sort_values(ascending=False)
px.bar(sorted_counts, labels={'index': 'BsmtQual', 'value': 'counts'})
The plots are:

It can be seen that the overall effect on the distribution is not dramatic; the plot looks almost identical to the one before the imputation. Now let's check whether the distribution of SalePrice for the most frequent category changed after the imputation:
hist_data = [tmp.dropna(), df[df['BsmtQual']=='TA']['SalePrice'].dropna()]
group_labels = ['Original', 'After Imputation']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
fig.show()
The plots are:

It can be seen that the distribution of house prices for this category is quite similar before and after the imputation, because the number of missing values was small.
Let's perform the same procedures for FireplaceQu, which has over 47% of missing data. First check the categorical distributions before missing data imputation.
sorted_counts = df['FireplaceQu'].value_counts().sort_values(ascending=False)
px.bar(sorted_counts, labels={'index': 'FireplaceQu', 'value': 'counts'})
The plots are:

Find the mode of FireplaceQu:
df['FireplaceQu'].mode()
The result is:
0 Gd
dtype: object
Keep the original SalePrice values where FireplaceQu equals Gd for later use:
tmp = df[df['FireplaceQu'] == 'Gd']['SalePrice']
Fill the missing data with the mode:
df['FireplaceQu'] = df['FireplaceQu'].fillna('Gd')
Plot the categorical distributions of FireplaceQu again:
sorted_counts = df['FireplaceQu'].value_counts().sort_values(ascending=False)
px.bar(sorted_counts, labels={'index': 'FireplaceQu', 'value': 'counts'})
The plots are:

In this case, there is a big distortion of the original distribution of the data. Originally, nearly half of the observations in FireplaceQu were missing; now all of those values contain Gd. Let's again check whether the distribution of SalePrice for the most frequent category changed after the imputation:
hist_data = [tmp.dropna(), df[df['FireplaceQu']=='Gd']['SalePrice'].dropna()]
group_labels = ['Original', 'After Imputation']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
fig.show()
The plots are:

It can be seen that the average house value for the most frequent category has decreased dramatically after using the most frequent label to replace NAs. Thus, the importance of this particular label for predicting the house price is masked or changed by the imputation of missing values. In cases like this, when the percentage of missing values is so high (~50%), we are better off creating a new category to enclose the observations with NA.
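With pandas, putting the missing observations into their own category is a one-liner. Below is a minimal sketch applied to the raw column (i.e., before the mode fill above); the label 'Missing' is an arbitrary choice:
raw = pd.read_csv('/content/drive/MyDrive/house_price/train.csv', usecols=['FireplaceQu'])
# The NAs become their own category instead of being merged into Gd
raw['FireplaceQu'] = raw['FireplaceQu'].fillna('Missing')
raw['FireplaceQu'].value_counts()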
In conclusion, mode imputation can also distort the original data when the percentage of missing values is high. Therefore, it is likewise recommended to apply mode imputation only to categorical variables with less than 5% missing data. When the percentage of missing values is high, it is better to create a new category to enclose the observations with NA.
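For completeness, the same strategies are also available through scikit-learn's SimpleImputer, which is convenient when the imputation has to live inside a modeling pipeline. The sketch below is not part of the original notebook and reuses the Titanic columns from earlier:
from sklearn.impute import SimpleImputer

df = pd.read_csv('/content/drive/MyDrive/titanic/train.csv', usecols=['Age', 'Fare', 'Survived'])

# strategy='median' (or 'mean') reproduces the numerical imputation above
df[['Age']] = SimpleImputer(strategy='median').fit_transform(df[['Age']])

# For categorical variables, strategy='most_frequent' reproduces mode imputation,
# and strategy='constant' with fill_value='Missing' creates a new category instead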
Conclusions
In this article, we covered common missing data imputation strategies for both numerical and categorical data, and their corresponding limitations. We hope you enjoyed it! More techniques for preprocessing and feature selection will be covered in future episodes. Stay tuned!
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs any help on projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!