Basic Missing Value Handling in PySpark DataFrames
Written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members of Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into the state-of-the-art techniques combining drilling engineering and data science.
We will continue our journey in Spark. Let's get started!
As always, the first step is to import the necessary dependencies and load the data into a PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('app').getOrCreate()
data_path = '/content/drive/MyDrive/test_data/ContainsNull.csv'
df = spark.read.csv(data_path, inferSchema=True, header=True)
df.show()
Running the code above shows that the loaded DataFrame contains missing (null) values in both its string and numeric columns. Handling these missing values is the main topic of this blog post.
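If you don't have the CSV file handy, an equivalent DataFrame can be built in memory so you can follow along. The Id column name and all the values below are hypothetical stand-ins; only the Name and Sales columns are actually referenced later in this post:
# Hypothetical stand-in data mimicking ContainsNull.csv;
# None entries become nulls in the resulting DataFrame
sample_data = [
    ('emp1', 'John', None),
    ('emp2', None, None),
    ('emp3', None, 345.0),
    ('emp4', 'Cindy', 456.0),
]
df = spark.createDataFrame(sample_data, ['Id', 'Name', 'Sales'])
df.show()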
Dropping Missing Values
The easiest way of handling missing values is to drop every row that contains a missing value. This can be done as follows:
# Drop any row that contains missing data
df.na.drop().show()
Which is equivalent to:
df.na.drop(how='any').show()
Because how='any' is the default option.
As a result, all the rows containing missing values are dropped.
A threshold indicating the minimum number of non-null values can be passed to the drop function; only the rows with at least that many non-null values are kept. For example, say we want to keep the rows in the original DataFrame which have at least 2 non-null values:
# Has to have at least 2 NON-null values
df.na.drop(thresh=2).show()
As a result, only the rows with at least 2 non-null values are kept.
Now say that we only want to drop the rows whose Sales value is missing/null. This can be realized by using the subset parameter of the drop function:
df.na.drop(subset=['Sales']).show()
As a result, every row whose Sales value is null is dropped.
Now say we only want to drop the rows in which all values are missing/null. This can be realized using the how parameter of the drop function:
df.na.drop(how='all').show()
This time no rows are dropped, since there is no row in which all values are missing/null.
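The how and subset parameters can also be combined. As a quick sketch, the following drops only the rows where both the Name and Sales values are null, ignoring all other columns:
# Drop only the rows where BOTH Name and Sales are null
df.na.drop(how='all', subset=['Name', 'Sales']).show()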
Filling Missing Values
Besides dropping missing values, we can also fill missing values using different strategies. Take a look at the following command:
df.na.fill('NEW VALUE').show()
As a result, only the null values in the string-type columns have been replaced. Let's try another command:
df.na.fill(0).show()
This time, only the null values in the numeric columns have been replaced. Spark is smart enough to fill the missing values only in the columns whose data type matches the value we provide.
We can also tell Spark to only fill the missing values of a specific column. Say we only want to fill the missing values in the Name column:
df.na.fill('No Name', subset=['Name']).show()
As a result, only the missing values in the Name column are filled.
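If you want to fill several columns with different values in a single call, the fill function also accepts a dictionary mapping column names to fill values. A minimal sketch using the two columns from this post:
# Fill each column with its own value in a single call
df.na.fill({'Name': 'No Name', 'Sales': 0.0}).show()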
Now for a slightly more complicated requirement: we want to fill the missing values in the Sales column with the mean value of the entire column. Before fulfilling this requirement, let's first calculate the mean of the Sales column. To do so, we need to import a new function:
from pyspark.sql.functions import mean
Let's take a look at the mean of the Sales column as a DataFrame:
df.select(mean(df['Sales'])).show()
This returns a single-row DataFrame holding the mean of the Sales column. We now need to collect that mean value so it can be used in our fill function:
# collect() returns a list of Row objects
mean_val_row = df.select(mean(df['Sales'])).collect()
# Grab the first (and only) column of the first row
mean_sales = mean_val_row[0][0]
Now we are ready to use this mean value in our fill function:
df.na.fill(mean_sales, ['Sales']).show()
As a result, the missing values in the Sales column are filled with the column mean.
For those interested in achieving this in a single line of code, here it is:
df.na.fill(df.select(mean(df['Sales'])).collect()[0][0], ['Sales']).show()
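As an alternative, the Imputer transformer from Spark ML can perform the same mean imputation. Below is a minimal sketch; the Sales_filled output column name is our own choice, and the strategy parameter also accepts 'median':
from pyspark.ml.feature import Imputer

# Replace missing Sales values with the column mean,
# writing the result to a new Sales_filled column
imputer = Imputer(inputCols=['Sales'], outputCols=['Sales_filled'], strategy='mean')
imputer.fit(df).transform(df).show()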
Conclusions
In this article, we covered basic missing value handling in PySpark DataFrames. We will cover Spark in greater detail in upcoming episodes of this tutorial series. Stay tuned!
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs help with projects related to drilling automation and optimization, AI, or data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!