Database Manipulation Through Python
written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members of Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.
Enjoy this brand-new episode of the series!

Introduction
In the previous episode of this series, we covered data acquisition through REST APIs using Python, in both synchronous and asynchronous patterns. In this episode, we will cover another basic but very important way to retrieve data: querying the database directly, along with some other fundamental database operations, all through Python.
What We'll Cover Today
How to use SQL statements in SQLAlchemy to make database queries and convert the query results into Pandas DataFrames.
How to use pure Python code to make database queries and convert the query results into Pandas DataFrames based on SQLAlchemy's APIs.
How to use SQLAlchemy to check the properties of the columns of a table in the database conveniently.
How to insert, update and delete DB data through SQLAlchemy.
Overview of ORMs
When it comes to acquiring data from a relational database, the most straightforward way is to use SQL. SQL is one of the essential skills for data scientists, and if you are not familiar with it, there are many good online resources for learning it. However, SQL has a caveat: it was not designed to be mixed with application code, and naively splicing user input into SQL strings exposes the system to a security vulnerability known as SQL injection. To avoid SQL injection, an ORM (Object-Relational Mapper) is usually used to interact with databases from a programming language. ORMs are libraries that translate between SQL and the programming language's classes and objects. SQLAlchemy is the most popular ORM in Python because of its simplicity, speed and extra features, and it is what we will be using in this article.
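To see concretely why string formatting is dangerous and how bound parameters avoid it, here is a minimal sketch using an in-memory SQLite database (the users table and the payload are made up for this example): the driver escapes a bound parameter, so it is never interpreted as SQL.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database, purely for illustration.
engine = create_engine('sqlite://')

with engine.connect() as conn:
    conn.execute(text('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)'))
    conn.execute(text("INSERT INTO users (name) VALUES ('alice'), ('bob')"))

    payload = "alice' OR '1'='1"  # a typical injection attempt

    # Unsafe would be: f"SELECT * FROM users WHERE name = '{payload}'"
    # Safe: the bound parameter :name is escaped by the driver and
    # never interpreted as SQL, so the payload matches no row.
    rows = conn.execute(
        text('SELECT * FROM users WHERE name = :name'), {'name': payload}
    ).fetchall()
    print(rows)  # []
```

Had the payload been spliced into the SQL string directly, the `OR '1'='1'` clause would have matched every row in the table.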
Using SQL Statements in SQLAlchemy
In the previous episode, we created a PostgreSQL database through Docker for our own web server. There was one table in the database, called depth_data, which stored 10 records of depth data from drilling engineering. In this tutorial, we will be using the same database and table.
To use SQLAlchemy and connect it to a PostgreSQL database, the following libraries need to be installed:
pip install SQLAlchemy psycopg2-binary
Or if you are using Jupyter or Colab:
!pip install SQLAlchemy psycopg2-binary
First, an engine object needs to be created:
from sqlalchemy import create_engine
db_uri = 'postgresql://postgres:postgres@localhost:5433/postgres'
engine = create_engine(db_uri)
Recall that the db_uri format for PostgreSQL is:
f'postgresql://{username}:{password}@{host}:{port}/{db}'
The username, password, port number, etc. were set up in the docker-compose file from the previous episode. Instead of retrieving them as environment variables, we hard-code them in the db_uri for simplicity.
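In production, the credentials would typically come from the environment instead. A minimal sketch, assuming hypothetical DB_* variable names of our own choosing and falling back to the docker-compose defaults:

```python
import os

# Read connection settings from the environment, falling back to the
# docker-compose defaults used in this series. The DB_* names are our
# own convention, not a SQLAlchemy requirement.
username = os.environ.get('DB_USER', 'postgres')
password = os.environ.get('DB_PASSWORD', 'postgres')
host = os.environ.get('DB_HOST', 'localhost')
port = os.environ.get('DB_PORT', '5433')
db = os.environ.get('DB_NAME', 'postgres')

db_uri = f'postgresql://{username}:{password}@{host}:{port}/{db}'
print(db_uri)
```

This keeps secrets out of the source code while preserving the same connection string format.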
Now we are able to execute SQL statements through the engine object. First let's try to retrieve all the records from the depth_data table:
data = engine.execute('SELECT * FROM depth_data')
for record in data:
    print(record)
Here are the results:
(1, 15000.0, 56.243, 45.947, 76.448, 62.322, 87.409, 312.187)
(2, 15001.0, 43.597, 33.67, 83.843, 72.249, 4.22, 194.275)
(3, 15002.0, 49.442, 76.74, 106.511, 68.637, 67.067, 143.114)
(4, 15003.0, 32.94, 51.017, 95.515, 77.759, 90.089, 50.369)
(5, 15004.0, 44.429, 63.638, 174.767, 82.036, 11.661, 343.726)
(6, 15005.0, 56.903, 41.481, 187.984, 75.141, 26.373, 337.822)
(7, 15006.0, 63.375, 75.195, 137.234, 64.905, 70.027, 25.347)
(8, 15007.0, 62.239, 50.627, 138.014, 52.983, 18.162, 52.817)
(9, 15008.0, 40.621, 53.702, 156.886, 57.317, 9.211, 240.9)
(10, 15009.0, 62.763, 63.413, 123.597, 31.029, 66.287, 207.705)
Next, let's skip the first 3 records and fetch the following 5 using SQL statements:
data = engine.execute('SELECT * FROM depth_data OFFSET 3 LIMIT 5')
for record in data:
    print(record)
Here are the results:
(4, 15003.0, 32.94, 51.017, 95.515, 77.759, 90.089, 50.369)
(5, 15004.0, 44.429, 63.638, 174.767, 82.036, 11.661, 343.726)
(6, 15005.0, 56.903, 41.481, 187.984, 75.141, 26.373, 337.822)
(7, 15006.0, 63.375, 75.195, 137.234, 64.905, 70.027, 25.347)
(8, 15007.0, 62.239, 50.627, 138.014, 52.983, 18.162, 52.817)
Furthermore, we can query the database through SQL statements and save the results as a Pandas DataFrame:
import pandas as pd
conn = engine.connect()
df = pd.read_sql('SELECT * FROM depth_data', conn)
df.head()
We get the first five rows of the table as a DataFrame.
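The same pd.read_sql call works against any SQLAlchemy connection, not just PostgreSQL. A self-contained sketch using an in-memory SQLite table with simplified columns:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite stand-in for the article's PostgreSQL table,
# with only two columns for brevity.
engine = create_engine('sqlite://')
with engine.connect() as conn:
    conn.execute(text('CREATE TABLE depth_data (id INTEGER PRIMARY KEY, depth FLOAT)'))
    conn.execute(text('INSERT INTO depth_data (depth) VALUES (15000.0), (15001.0)'))

    # pd.read_sql runs the query and builds a DataFrame from the results.
    df = pd.read_sql('SELECT * FROM depth_data', conn)

print(df.shape)  # (2, 2)
```

The DataFrame's columns are taken directly from the table's column names.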
Using Pure Python Code
We have shown that SQL statements can be used directly in SQLAlchemy to make database queries and then save the query results as Pandas DataFrames. If one is not familiar with SQL and prefers to write pure Python code, that is also feasible.
An automap_base needs to be created to reflect the tables in the database:
from sqlalchemy.ext.automap import automap_base
Base = automap_base()
Base.prepare(engine, reflect=True)
Base.classes.keys()
All the table names in the database will be displayed as a list:
['depth_data']
Get the class corresponding to the depth_data table:
DepthData = Base.classes.depth_data
Then a session needs to be created based on the engine to make database queries:
from sqlalchemy.orm import Session
session = Session(engine)
Now we are ready to make the queries again. Let's first try to query only the id, rpm and mse columns, skip the first 3 records and limit to 5 records:
for row in session.query(DepthData.id, DepthData.rpm, DepthData.mse).offset(3).limit(5):
    print(row)
Here are the query results:
(4, 32.94, 77.759)
(5, 44.429, 82.036)
(6, 56.903, 75.141)
(7, 63.375, 64.905)
(8, 62.239, 52.983)
The session.query method returns a Query object, which can be iterated over to get each result. The .offset method skips records and the .limit method caps the number of records returned.
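An .order_by call can be added to the same chain to sort the results before paging. A self-contained sketch, using an in-memory SQLite database and a minimal declarative model in place of the reflected depth_data class (values made up for the example):

```python
from sqlalchemy import Column, Float, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# Minimal stand-in for the reflected depth_data class.
class DepthData(Base):
    __tablename__ = 'depth_data'
    id = Column(Integer, primary_key=True)
    depth = Column(Float)
    rpm = Column(Float)

engine = create_engine('sqlite://')  # in-memory SQLite for illustration
Base.metadata.create_all(engine)

session = Session(engine)
session.add_all([
    DepthData(depth=15000.0, rpm=56.243),
    DepthData(depth=15001.0, rpm=43.597),
    DepthData(depth=15002.0, rpm=49.442),
])
session.commit()

# order_by sorts the rows before limit is applied: here, the two
# largest rpm values in descending order.
rpms = [row.rpm for row in
        session.query(DepthData.rpm).order_by(DepthData.rpm.desc()).limit(2)]
print(rpms)  # [56.243, 49.442]
```

The same chaining works against the reflected DepthData class from the automap base.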
Then let's try to query the records with wob greater than 50:
depth_data_iterator = session.query(DepthData).filter(DepthData.wob > 50)
for depth_data in depth_data_iterator:
    print(depth_data.wob)
Here are the query results:
76.74
51.017
63.638
75.195
50.627
53.702
63.413
The .filter method is the main way to pass conditions to a query. Now let's say we want to query all the mse values which are greater than 70 but less than 80. There are two ways to implement this. The first is to chain .filter calls:
depth_data_iterator = session.query(DepthData).filter(DepthData.mse > 70).filter(DepthData.mse < 80)
for depth_data in depth_data_iterator:
    print(depth_data.mse)
The second way is to combine the .filter method with the and_ function:
from sqlalchemy import and_
depth_data_iterator = session.query(DepthData).filter(and_(DepthData.mse > 70, DepthData.mse < 80))
for depth_data in depth_data_iterator:
    print(depth_data.mse)
Results from both the queries are:
72.249
77.759
75.141
To apply an equality condition to the query, either the .filter or the .filter_by method can be used. Say we want to query the id of the record whose rop equals 95.515:
depth_data_iterator = session.query(DepthData).filter(DepthData.rop == 95.515)
for depth_data in depth_data_iterator:
    print(depth_data.id)
Using the .filter_by method:
depth_data_iterator = session.query(DepthData).filter_by(rop=95.515)
for depth_data in depth_data_iterator:
    print(depth_data.id)
Both queries return 4, the id of the matching record.
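For completeness, the counterpart of and_ is the or_ function, which combines conditions disjunctively. A self-contained sketch with a minimal model and made-up values:

```python
from sqlalchemy import Column, Float, Integer, create_engine, or_
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# Minimal stand-in for the reflected depth_data class (mse column only).
class DepthData(Base):
    __tablename__ = 'depth_data'
    id = Column(Integer, primary_key=True)
    mse = Column(Float)

engine = create_engine('sqlite://')  # in-memory SQLite for illustration
Base.metadata.create_all(engine)

session = Session(engine)
session.add_all([DepthData(mse=62.322), DepthData(mse=72.249), DepthData(mse=31.029)])
session.commit()

# or_ mirrors SQL's OR: keep rows whose mse is below 40 or above 70.
ids = sorted(row.id for row in session.query(DepthData).filter(
    or_(DepthData.mse < 40, DepthData.mse > 70)))
print(ids)  # [2, 3]
```

The and_ example in the article can be rewritten with or_ in exactly the same position inside .filter.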
Lastly, if we want to query all the records from the depth_data table and convert the results into a Pandas DataFrame without writing SQL statements, this can be achieved as follows:
df = pd.read_sql(session.query(DepthData).statement, conn)
df.head()
The result is exactly the same as the one generated through SQL statements in the previous section of this article.
Using Inspector to Check Column Properties
The inspect function provided by SQLAlchemy makes it really convenient to check the properties of the columns of a table in the database. First let's create an inspector bound to the engine:
from sqlalchemy import inspect
inspector = inspect(engine)
Here is how to get the table names in the database through the inspector:
inspector.get_table_names()
The result is:
['depth_data']
We are able to retrieve the properties of the columns of the depth_data table as follows:
columns = inspector.get_columns('depth_data')
print(columns)
The details of each column are listed:
[{'name': 'id', 'type': INTEGER(), 'nullable': False, 'default': "nextval('depth_data_id_seq'::regclass)", 'autoincrement': True, 'comment': None}, {'name': 'depth', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}, {'name': 'rpm', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}, {'name': 'wob', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}, {'name': 'rop', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}, {'name': 'mse', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}, {'name': 'inclination', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}, {'name': 'azimuth', 'type': DOUBLE_PRECISION(precision=53), 'nullable': False, 'default': None, 'autoincrement': False, 'comment': None}]
Insert, Update and Delete Data
We have already covered how to query data from the database using SQLAlchemy in the previous sections. Inserting, updating and deleting data with SQLAlchemy works in a similar fashion. First, let's insert one new row of depth data into the database:
# Insert data.
session.add(DepthData(
    depth=15010.0,
    rpm=48.372,
    wob=72.339,
    rop=90.275,
    mse=65.962,
    inclination=32.874,
    azimuth=67.278
))
session.commit()
Note that calling session.commit is necessary for the DB operation to take effect. Let's take another look at the data in the database:
data = engine.execute('SELECT * FROM depth_data')
for record in data:
    print(record)
The results are:
(1, 15000.0, 56.243, 45.947, 76.448, 62.322, 87.409, 312.187)
(2, 15001.0, 43.597, 33.67, 83.843, 72.249, 4.22, 194.275)
(3, 15002.0, 49.442, 76.74, 106.511, 68.637, 67.067, 143.114)
(4, 15003.0, 32.94, 51.017, 95.515, 77.759, 90.089, 50.369)
(5, 15004.0, 44.429, 63.638, 174.767, 82.036, 11.661, 343.726)
(6, 15005.0, 56.903, 41.481, 187.984, 75.141, 26.373, 337.822)
(7, 15006.0, 63.375, 75.195, 137.234, 64.905, 70.027, 25.347)
(8, 15007.0, 62.239, 50.627, 138.014, 52.983, 18.162, 52.817)
(9, 15008.0, 40.621, 53.702, 156.886, 57.317, 9.211, 240.9)
(10, 15009.0, 62.763, 63.413, 123.597, 31.029, 66.287, 207.705)
(11, 15010.0, 48.372, 72.339, 90.275, 65.962, 32.874, 67.278)
The new row has been successfully inserted into the database. Next, let's update the newly inserted row to give its rpm a different value:
# Update data.
record = session.query(DepthData).filter(DepthData.id == 11).first()
record.rpm = 50.124
session.commit()
Let's check if the update has taken effect:
record = session.query(DepthData).filter(DepthData.id == 11).first()
print(record.rpm)
Or, equivalently, using the .get method:
record = session.query(DepthData).get(11)
print(record.rpm)
Both queries return 50.124, as expected.
Lastly, let's delete the newly added row of data:
# Delete data.
number_of_rows_affected = session.query(DepthData).filter_by(id=11).delete()
session.commit()
Let's check the data in the DB again:
data = engine.execute('SELECT * FROM depth_data')
for record in data:
    print(record)
The results are:
(1, 15000.0, 56.243, 45.947, 76.448, 62.322, 87.409, 312.187)
(2, 15001.0, 43.597, 33.67, 83.843, 72.249, 4.22, 194.275)
(3, 15002.0, 49.442, 76.74, 106.511, 68.637, 67.067, 143.114)
(4, 15003.0, 32.94, 51.017, 95.515, 77.759, 90.089, 50.369)
(5, 15004.0, 44.429, 63.638, 174.767, 82.036, 11.661, 343.726)
(6, 15005.0, 56.903, 41.481, 187.984, 75.141, 26.373, 337.822)
(7, 15006.0, 63.375, 75.195, 137.234, 64.905, 70.027, 25.347)
(8, 15007.0, 62.239, 50.627, 138.014, 52.983, 18.162, 52.817)
(9, 15008.0, 40.621, 53.702, 156.886, 57.317, 9.211, 240.9)
(10, 15009.0, 62.763, 63.413, 123.597, 31.029, 66.287, 207.705)
The newly added row is gone, as expected.
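One related method worth knowing: if a pending change should be discarded rather than committed, session.rollback undoes everything since the last commit. A self-contained sketch with a minimal model (values made up for the example):

```python
from sqlalchemy import Column, Float, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# Minimal stand-in for the reflected depth_data class.
class DepthData(Base):
    __tablename__ = 'depth_data'
    id = Column(Integer, primary_key=True)
    depth = Column(Float)

engine = create_engine('sqlite://')  # in-memory SQLite for illustration
Base.metadata.create_all(engine)

session = Session(engine)
session.add(DepthData(depth=15000.0))
session.commit()

session.query(DepthData).delete()   # pending delete...
session.rollback()                  # ...discarded before commit

count = session.query(DepthData).count()
print(count)  # 1 -- the row survived the rolled-back delete
```

This is why forgetting session.commit makes an insert, update or delete appear to "not work": the change is eventually rolled back instead of persisted.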
Conclusions
In this article, we mainly covered how to use the SQLAlchemy ORM to query a PostgreSQL database and save the query results as Pandas DataFrames, in a synchronous pattern. We also covered how to use SQLAlchemy to conveniently check the properties of a table's columns, as well as how to perform insert, update and delete operations. How to perform these DB operations in an asynchronous pattern will be covered in future episodes of this series. Stay tuned!
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs any help with projects related to drilling automation and optimization, AI, or data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!