
Data Science in Drilling - Episode 23


Introduction to the Upcoming Spark Series


written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI


Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into the state-of-the-art techniques that combine drilling engineering and data science.


If you are enthusiastic about data science, you have almost certainly heard of Spark. Apache Spark is an open-source analytical processing engine for large-scale distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. Apache Spark has become an essential skill for data scientists, especially when dealing with big data. For more details about Spark's architecture and APIs, please refer to its official documentation.


We are preparing a series of Spark tutorials. This blog post, as the first one and a soft introduction, mainly focuses on how to painlessly set up a development environment for Spark using its Python API.


Environment Setup


Spark usually runs in cluster mode, which is what gives it its big data processing capabilities. However, that doesn't mean we cannot run Spark on a single node. Today we will use Google Colab to set up a single-node development environment and use Spark's Python API (PySpark) to test whether the environment actually works.


In my personal experience, setting up a single-node PySpark development environment on Google Colab is the most painless option. All you need to do is install the following dependencies:

!pip install pyspark py4j

After these two dependencies have been installed, you are good to go! Colab already has everything else ready for you. Try the following import to see if the environment actually works:

from pyspark.sql import SparkSession
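If you want an extra sanity check beyond the import, you can also print the installed PySpark version (a small optional step, not part of the original setup):

import pyspark

# Print the installed PySpark version to confirm the package is available
print(pyspark.__version__)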

If there were no errors during the import, the PySpark development environment works as expected. Now let's perform some simple operations using PySpark. If you are not familiar with these operations, don't worry; we will cover all of them in detail in this Spark tutorial series.


Creating DataFrames


As a data scientist, you must be very familiar with Pandas DataFrames. PySpark has its own DataFrame data structure as well. The nice thing is that PySpark DataFrames and Pandas DataFrames can be converted to each other very easily (we will show a quick example of this at the end of this section). For every Spark program, the first step is to create a SparkSession:

spark = SparkSession.builder.appName('Basics').getOrCreate()
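When no cluster is configured, PySpark falls back to local mode in a single-node setup like this one. If you prefer to be explicit about it, a minimal sketch (assuming you want to use all available cores locally) looks like this:

from pyspark.sql import SparkSession

# Explicitly run Spark locally on all available cores;
# 'Basics' is just the application name shown in the Spark UI.
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Basics') \
    .getOrCreate()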

Then let's load some JSON data into a PySpark DataFrame. I have already mounted my Google Drive to the Colab notebook and will read data from there:

data_path = '/content/drive/MyDrive/test_data/people.json'
df = spark.read.json(data_path)
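If you don't have a suitable JSON file handy, the people.json example file that ships with Spark (under examples/src/main/resources) works well; the snippets below assume a file with records of this shape:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Note that spark.read.json expects one JSON object per line by default.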

Now let's take a look at the PySpark DataFrame:

df.show()

The result is:
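With the three example records shown above, the printed table looks roughly like this:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+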


We can check the schema of the DataFrame:

df.printSchema()

The result is:
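For the same example data, the inferred schema would be:

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)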


We can retrieve the columns of the DataFrame:

df.columns

The result is:

['age', 'name']

We can also describe the DataFrame to get summary statistics, just as we can in Pandas:

df.describe().show()

The result is:


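As mentioned earlier, a PySpark DataFrame can be converted to and from a Pandas DataFrame very easily. A minimal sketch using the df created above:

# Convert the PySpark DataFrame to a Pandas DataFrame.
# This collects all rows to the driver, so only do it for data that fits in memory.
pandas_df = df.toPandas()

# Convert the Pandas DataFrame back into a PySpark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()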
Using SQL to Query Data


Spark is so powerful that we can use SQL directly to query data in a Spark DataFrame. To do so, the original DataFrame needs to be registered as a SQL temporary view:

df.createOrReplaceTempView('people')

Let's start with the simplest SQL query, which selects all the data from the original DataFrame:

sql_results = spark.sql('SELECT * FROM people')
sql_results.show()

The result is:


Let's try another query to find all the people whose age is 30:

spark.sql("SELECT * FROM people WHERE age = 30").show()

The result is:


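For reference, the same filter can also be expressed with the DataFrame API instead of SQL, using the df defined earlier:

# Equivalent to the SQL query above, using the DataFrame API
df.filter(df['age'] == 30).show()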
Conclusions


In this article, we provided a soft introduction to Spark, focusing on how to set up a development environment with Spark's Python API (PySpark). We will cover Spark in much greater detail in the upcoming episodes of this series. Stay tuned!


Get in Touch


Thank you for reading! Please let us know if you like this series or if you have any critiques. If this series was helpful to you, please follow us and share it with your friends.


If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!



