
Data Science in Drilling - Episode 18


How to Correctly Apply One-Hot Encoding?


written by Zeyu Yan, Ph.D., Head of Data Science from Nvicta AI


Data Science in Drilling is a multi-episode series written by the technical team members of Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of the Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.


In this episode, we will discuss how to correctly apply the one-hot encoding technique. Enjoy. :)



Introduction


One-hot encoding refers to the process by which categorical variables are converted into a numerical form, consisting of ones and zeros, that can be consumed by ML models to make predictions. In most machine learning cases, one-hot encoding is a necessary step.


Although one-hot encoding is a common technique in machine learning, there are still some details that need to be figured out. The first question is: if a categorical variable has k different categories, should we create k or k - 1 dummy variables for the one-hot encoding?


If we take the traffic light as a categorical variable, it has three different categories, namely green, yellow and red, so here k = 3. If we choose to use 3 (k) dummy variables for the one-hot encoding, the result will be as follows:

  • If the traffic light is green, it will be represented as (green = 1, yellow = 0, red = 0);

  • If the traffic light is yellow, it will be represented as (green = 0, yellow = 1, red = 0);

  • If the traffic light is red, it will be represented as (green = 0, yellow = 0, red = 1).

On the other hand, if we choose to use 2 (k - 1) dummy variables for the one-hot encoding, the result will be different:

  • If the traffic light is green, it will be represented as (green = 1, yellow = 0);

  • If the traffic light is yellow, it will be represented as (green = 0, yellow = 1);

  • If the traffic light is red, it will be represented as (green = 0, yellow = 0).
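As a quick sanity check, both encodings above can be reproduced with pandas (a minimal sketch; note that get_dummies with drop_first=True drops the alphabetically first category, which is green here, so the baseline category differs from the red baseline used above):

import pandas as pd

# A toy traffic-light variable with k = 3 categories
lights = pd.Series(['green', 'yellow', 'red'], name='light')

# k dummy variables: one column per category
print(pd.get_dummies(lights))

# k - 1 dummy variables: drop_first=True drops the alphabetically first
# category ('green'), which then becomes the all-zeros baseline
print(pd.get_dummies(lights, drop_first=True))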

It is seen that k - 1 dummy variables are actually enough for a categorical variable with k different categories: we use one fewer dimension but can still represent all the necessary information. In fact, in most cases it is better to encode a categorical variable into k - 1 dummy variables rather than k, because the k-th dummy variable is fully determined by the other k - 1 (the so-called dummy variable trap), which introduces multicollinearity into linear models. However, there are still some exceptions.


Unlike other machine learning algorithms, tree-based algorithms do not consider the entire set of features at each node of each tree. Instead, they randomly extract a subset of features at each split. Therefore, if we want all the categories of a categorical variable to be available to a tree-based algorithm, we need to encode the categorical variable into k dummy variables.
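For example, here is a toy sketch (with a made-up dataset and target) of keeping all k dummy variables for a tree-based model, so that every category has its own column when features are randomly sub-sampled at each split:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data for illustration only
df = pd.DataFrame({'light': ['green', 'yellow', 'red', 'green', 'red', 'yellow'],
                   'target': [1, 0, 0, 1, 0, 1]})

X = pd.get_dummies(df['light'])  # k dummies, no drop_first

# max_features=1 forces each split to consider a single randomly chosen dummy,
# which is why every category should keep its own column
model = RandomForestClassifier(n_estimators=10, max_features=1, random_state=42)
model.fit(X, df['target'])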


Both pandas and scikit-learn can perform the one-hot encoding transformation on a dataset. Their differences will also be discussed in this blog post.


Data Exploration


Google Colab will be used as the IDE for the following tutorial. First import the necessary dependencies:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

The classic Titanic dataset will be used in this tutorial; the entire dataset is available on Kaggle. I have already mounted my Google Drive in Colab, so the next step is to load the data:

train = pd.read_csv('/content/drive/MyDrive/test_data/titanic/train.csv')
test = pd.read_csv('/content/drive/MyDrive/test_data/titanic/test.csv')

Let's take a look at both the training and testing data:

train.head()

The training data looks as follows:


test.head()

The testing data looks as follows:


Let's identify the categorical columns:

train.dtypes[(train.dtypes == 'object') | (train.dtypes == 'category')].index.tolist()

The result is:

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

We will mainly focus on the Sex, Embarked and Cabin columns in this tutorial. Let's check the unique categories for these columns. For the Sex column in the training data:

train['Sex'].unique()

The result is:

array(['male', 'female'], dtype=object)

For the Sex column in the testing data:

test['Sex'].unique()

The result is:

array(['male', 'female'], dtype=object)

For the Embarked column in the training data:

train['Embarked'].unique()

The result is:

array(['S', 'C', 'Q', nan], dtype=object)

Notice that NaN is also considered a unique category. In general, missing value imputation should be performed before one-hot encoding, so the dataset shouldn't contain any NaN values when the transformation is applied. For the sake of simplicity, we won't fill the NaN values in the datasets in this tutorial.
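For reference, here is a minimal sketch of what such an imputation could look like (not applied in the rest of this tutorial), filling the missing Embarked values with the most frequent port:

# For illustration only; this tutorial keeps the NaN values
most_frequent_port = train['Embarked'].mode()[0]
train_imputed = train.copy()
train_imputed['Embarked'] = train_imputed['Embarked'].fillna(most_frequent_port)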


For the Embarked column in the testing data:

test['Embarked'].unique()

The result is:

array(['Q', 'S', 'C'], dtype=object)

For the Cabin column in the training data:

train['Cabin'].unique()

The result is:

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)

There are too many categorical values in the Cabin column. For the sake of simplicity, let's process the Cabin column by keeping only the first letter of each value, which reduces the number of categories:

train['Cabin'] = train['Cabin'].str[0]
test['Cabin'] = test['Cabin'].str[0]

Now check the Cabin column in the training data again:

train['Cabin'].unique()

The result is:

array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

Check the Cabin column in the testing data again:

test['Cabin'].unique()

The result is:

array([nan, 'B', 'E', 'A', 'C', 'D', 'F', 'G'], dtype=object)

One-Hot Encoding with Pandas


Into k Dummy Variables


Pandas has the built-in method get_dummies, which can perform one-hot encoding transformations. Let's first use the get_dummies method to transform categorical columns into k dummy variables. For the Sex column in the training set:

one_hot_encoded_sex_train = pd.get_dummies(train['Sex'])
pd.concat([train['Sex'], one_hot_encoded_sex_train], axis=1).head()

The result is:


For the Embarked column in the training set:

one_hot_encoded_embarked_train = pd.get_dummies(train['Embarked'])
pd.concat([train['Embarked'], one_hot_encoded_embarked_train], axis=1).head()

The result is:


For the Cabin column in the training set:

one_hot_encoded_cabin_train = pd.get_dummies(train['Cabin'])
pd.concat([train['Cabin'], one_hot_encoded_cabin_train], axis=1).head(10)

The result is:


It seems that the get_dummies method works pretty well on each individual column. The method can also be applied directly to the entire DataFrame:

one_hot_encoded_train = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Cabin'])
one_hot_encoded_train.head()

Here we specify the columns that we want to be transformed. The resulting DataFrame is as follows:

It can be seen that all the selected columns were one-hot encoded.


Into k - 1 Dummy Variables


It turns out that using the get_dummies method to transform categorical columns into k - 1 dummy variables is pretty easy: simply set the drop_first option to True. For the Sex column in the training set:

one_hot_encoded_sex_train = pd.get_dummies(train['Sex'], drop_first=True)
pd.concat([train['Sex'], one_hot_encoded_sex_train], axis=1).head()

The result is:


For the Embarked column in the training set:

one_hot_encoded_embarked_train = pd.get_dummies(train['Embarked'], drop_first=True)
pd.concat([train['Embarked'], one_hot_encoded_embarked_train], axis=1).head()

The result is:


One issue to notice is that both C and NaN are represented by (Q = 0, S = 0) in this case. If missing value imputation is performed before one-hot encoding, this ambiguity won't occur. Another way to deal with NaN values is through the dummy_na option of the get_dummies method. Let's try this option on the Embarked column:

one_hot_encoded_embarked_train = pd.get_dummies(train['Embarked'], drop_first=True, dummy_na=True)
pd.concat([train['Embarked'], one_hot_encoded_embarked_train], axis=1).head()

The result is:


It is seen that when the dummy_na option is set to True, NaN becomes a category.


Apply the get_dummies method to the entire training set with the drop_first option set to True:

one_hot_encoded_train = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Cabin'], drop_first=True)
one_hot_encoded_train.head()

The result is:


Although one-hot encoding through pandas is pretty straightforward and easy to apply, in practice we shouldn't use it in a machine learning pipeline. The reason is that this method processes the training set and the testing set separately: no information from the training set is retained and applied during the transformation of the testing set. The training set and the testing set may also end up with different numbers of features, which results in incompatibility when training and scoring with scikit-learn. One-hot encoding through pandas should only be used for quick data exploration purposes.
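To make this concrete, recall that the training set contains Cabin category 'T' while the testing set does not, so encoding the two sets separately yields mismatched feature spaces:

# The training set contains Cabin 'T' but the testing set does not, so
# encoding them separately produces different numbers of columns
cabin_train = pd.get_dummies(train['Cabin'])
cabin_test = pd.get_dummies(test['Cabin'])
print(cabin_train.columns.tolist())  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T']
print(cabin_test.columns.tolist())   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']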


One-Hot Encoding with Scikit-learn


Into k Dummy Variables

One-hot encoding can also be realized through the scikit-learn package, and this is the way to do it in practice. The first step is to create a one-hot encoder instance:

# Note: in scikit-learn >= 1.2, the sparse argument is renamed to sparse_output
oh_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

The handle_unknown option is set to 'ignore' so that categories not seen during fitting are encoded as all zeros at transform time. Since NaN appears in the training data, the encoder will learn it as a separate category, as shown below. Extract the three columns of interest and perform the one-hot encoding transformation:

temp = train[['Sex', 'Embarked', 'Cabin']]
one_hot_encoded_temp = oh_encoder.fit_transform(temp)

Take a look at the categorical information contained in the one-hot encoder:

oh_encoder.categories_

The results are:

[array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S', nan], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', nan], dtype=object)]

Let's take a look at the encoded data:

one_hot_encoded_temp

The result is:

array([[0., 1., 0., ..., 0., 0., 1.],
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.]])

It is seen that the data encoded through scikit-learn is a NumPy array. It can be converted into a DataFrame with the right column names as follows:

one_hot_encoded_temp = pd.DataFrame(one_hot_encoded_temp, columns=oh_encoder.get_feature_names_out())
one_hot_encoded_temp.head()

The result is:


To combine the encoded data with the training set:

train.drop(['Sex', 'Embarked', 'Cabin'], axis=1, inplace=True)
pd.concat([train, one_hot_encoded_temp], axis=1).head()

The result is:


The same encoder can be used to transform the testing data:

temp2 = test[['Sex', 'Embarked', 'Cabin']]
one_hot_encoded_temp2 = oh_encoder.transform(temp2)
one_hot_encoded_temp2 = pd.DataFrame(one_hot_encoded_temp2, columns=oh_encoder.get_feature_names_out())
test.drop(['Sex', 'Embarked', 'Cabin'], axis=1, inplace=True)
pd.concat([test, one_hot_encoded_temp2], axis=1).head()

The result is:


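As a quick sanity check of the handle_unknown='ignore' behavior, a category never seen during fitting (here a hypothetical port 'X') is simply encoded as all zeros within its dummy group:

# 'X' is a made-up, unseen Embarked value; all four Embarked dummies become 0
unseen = pd.DataFrame({'Sex': ['male'], 'Embarked': ['X'], 'Cabin': ['B']})
print(oh_encoder.transform(unseen))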
Into k - 1 Dummy Variables

To transform categorical columns into k - 1 dummy variables, the drop option can be set to 'first' in OneHotEncoder:

oh_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore', drop='first')

If we reload the data and use this new encoder to encode the training set:

temp = train[['Sex', 'Embarked', 'Cabin']]
one_hot_encoded_temp = oh_encoder.fit_transform(temp)
one_hot_encoded_temp = pd.DataFrame(one_hot_encoded_temp, columns=oh_encoder.get_feature_names_out())
one_hot_encoded_temp.head()

The result is:


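To verify which category was dropped for each feature, the fitted encoder's categories_ and drop_idx_ attributes can be inspected (a small sketch; with drop='first', the alphabetically first category of each feature becomes the all-zeros baseline):

# Print the baseline (dropped) category of each encoded feature
for feature, cats, idx in zip(['Sex', 'Embarked', 'Cabin'],
                              oh_encoder.categories_,
                              oh_encoder.drop_idx_):
    print(feature, '->', cats[idx])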
Conclusions


In this article, we covered how to correctly apply the one-hot encoding technique to datasets. One-hot encoding is a straightforward, easy-to-apply technique. However, as you may have already noticed, it expands the feature space, especially for categorical variables with many categories. The solution to this potential issue will be covered in future episodes. Stay tuned!


Get in Touch


Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.


If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!



