Data Acquisition through REST APIs Using Python
Written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members of Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.
Welcome to this brand new episode of this series!

Enjoying great knowledge is just like enjoying delicious sushi.
Introduction
In Episode 1 of this series, we covered how to load data files stored in AWS S3 buckets into in-memory Pandas DataFrames. AWS services may seem like an advanced topic if you are not familiar with them, but don't worry: major AWS services will be covered in detail in future episodes of this series. For this episode, let's return to the basics and focus on one of the most common and most important ways of acquiring data, namely through REST API calls.
What We'll Cover Today
How to build a simple web server to serve data through REST APIs using FastAPI, SQLModel, SQLAlchemy (async), PostgreSQL and Docker.
How to use Python's requests library to retrieve data through REST APIs in a synchronous pattern.
How to use Python's AIOHTTP library together with Python's asyncio to retrieve data through REST APIs in an asynchronous pattern.
Overview of REST API
To explain what a REST API is in one sentence: "A REST API is a Web API (or Web Service) conforming to the REST architectural style." You can read more about the REST architectural style and other characteristics of REST APIs through this useful link. Essentially, a REST API serves data as HTTP responses when provided with the right credentials and parameters through HTTP requests.
Build A Simple Web Server
In this tutorial, we build our own simple web server using FastAPI, SQLModel, SQLAlchemy (async), PostgreSQL and Docker to serve data through REST APIs. FastAPI is a modern, high-performance Python web framework for building APIs. PostgreSQL is a powerful, open-source relational database. Docker is an open-source platform for building, deploying, and managing containerized applications. SQLModel and SQLAlchemy are two Python ORMs for interacting with relational databases. If you are not familiar with this tech stack, don't worry: each of these tools will be discussed in detail in a future episode of this series. Stay tuned!
First, let's have an overview of the file structure of the web server project under its project root folder:

All the dependencies for the project are listed in the requirements.txt file and can be installed through:
pip install -r requirements.txt
To keep this project simple and minimal, the docker-compose.yml file is only used to run the PostgreSQL database as a Docker container. The FastAPI-based web server will run locally instead of inside a Docker container, so the reader doesn't need to worry about the communication and interactions between Docker containers, which are not the main focus of this article.
Here is the docker-compose.yml file:
version: '3.8'
services:
  db:
    image: postgres
    container_name: postgres
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - '5433:5432'
    volumes:
      - ./postgres-data:/var/lib/postgresql/data
      - ./sql/create_tables.sql:/docker-entrypoint-initdb.d/create_tables.sql
The postgres container maps the database's default port, 5432, to port 5433 of the local machine, so that the web server running on the local machine can connect to the database through port 5433. To start the PostgreSQL database service as a Docker container, run the following command in a terminal:
docker compose up --build
As in Episode 1, we again use depth data from drilling engineering as an example, but with fewer fields this time. The schema of the depth_data table, which will be stored in the PostgreSQL database, is defined in the depth_data.py file inside the models package:
from sqlmodel import Field, SQLModel
from typing import Optional


class DepthData(SQLModel, table=True):
    __tablename__ = 'depth_data'

    id: Optional[int] = Field(default=None, primary_key=True)
    depth: float
    rpm: float
    wob: float
    rop: float
    mse: float
    inclination: float
    azimuth: float
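Because SQLModel classes are built on top of Pydantic, a row can be instantiated and inspected as a plain Python object before it ever touches the database. A minimal sketch, assuming the DepthData class above is importable from the models package and using arbitrary sample values:
from models.depth_data import DepthData

# The field values below are arbitrary sample numbers, not real drilling data
row = DepthData(depth=15000.0, rpm=56.2, wob=45.9, rop=76.4,
                mse=62.3, inclination=87.4, azimuth=312.2)
print(row.dict())  # id is None until the row is persisted and assigned a primary key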
The create_tables.sql file is used to initialize the database with 10 rows of depth data:
CREATE TABLE IF NOT EXISTS depth_data (
id SERIAL PRIMARY KEY,
depth DECIMAL NOT NULL,
rpm DECIMAL NOT NULL,
wob DECIMAL NOT NULL,
rop DECIMAL NOT NULL,
mse DECIMAL NOT NULL,
inclination DECIMAL NOT NULL,
azimuth DECIMAL NOT NULL
);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15000.0, 56.243, 45.947, 76.448, 62.322, 87.409, 312.187);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15001.0, 43.597, 33.67, 83.843, 72.249, 4.22, 194.275);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15002.0, 49.442, 76.74, 106.511, 68.637, 67.067, 143.114);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15003.0, 32.94, 51.017, 95.515, 77.759, 90.089, 50.369);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15004.0, 44.429, 63.638, 174.767, 82.036, 11.661, 343.726);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15005.0, 56.903, 41.481, 187.984, 75.141, 26.373, 337.822);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15006.0, 63.375, 75.195, 137.234, 64.905, 70.027, 25.347);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15007.0, 62.239, 50.627, 138.014, 52.983, 18.162, 52.817);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15008.0, 40.621, 53.702, 156.886, 57.317, 9.211, 240.9);
INSERT INTO depth_data (depth, rpm, wob, rop, mse, inclination, azimuth) VALUES (15009.0, 62.763, 63.413, 123.597, 31.029, 66.287, 207.705);
The database.py file provides utility functions for initializing the database and generating (async) database sessions:
from sqlmodel import SQLModel
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

DATABASE_URL = 'postgresql+asyncpg://postgres:postgres@localhost:5433/postgres'

engine = create_async_engine(DATABASE_URL, echo=True, future=True)


async def init_db():
    async with engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)


async def get_session() -> AsyncSession:
    async_session = sessionmaker(
        engine, class_=AsyncSession, expire_on_commit=False
    )
    async with async_session() as session:
        yield session
For security purposes, sensitive information such as the database URL should be set as an environment variable and retrieved through something like the os.environ.get method. For simplicity, we hard-code it here.
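For reference, here is a minimal sketch of that pattern. The environment variable name DATABASE_URL is our own choice, and the hard-coded URL is kept only as a fallback for local development:
import os

# Prefer the environment variable; fall back to the local development URL if it is not set
DATABASE_URL = os.environ.get(
    'DATABASE_URL',
    'postgresql+asyncpg://postgres:postgres@localhost:5433/postgres'
)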
Finally, the REST APIs are defined in main.py file:
from fastapi import Depends, FastAPI
from sqlalchemy.future import select
from sqlalchemy.ext.asyncio import AsyncSession
from typing import Optional

from .database import get_session, init_db
from .models.depth_data import DepthData

app = FastAPI()


@app.on_event("startup")
async def on_startup():
    await init_db()


@app.get("/depth-data", response_model=list[DepthData])
async def get_depth_data(
    session: AsyncSession = Depends(get_session),
    skip: int = 0,
    limit: Optional[int] = None
):
    statement = select(DepthData).offset(skip)
    if limit is not None:
        statement = statement.limit(limit)
    result = await session.execute(statement)
    return result.scalars().all()
In our case, the web server only serves one REST API for the user to retrieve depth data from the database. It accepts two query parameters, skip and limit, which control how many records to skip and how many to return.
In real scenarios, authentication is an indispensable part of a web server. For example, a JWT authentication system requires HTTP requests to carry valid access tokens in order to be processed, and when the access tokens have expired, it is possible to 'refresh' them with valid refresh tokens. Due to space limitations, authentication will be covered in a separate episode. The REST APIs from our own web server don't require any authentication for now.
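For reference only, since our server does not check tokens yet, a JWT-protected request from the client side would typically carry the access token in an Authorization header. A minimal sketch with requests, where the token value is just a placeholder that a real login endpoint would provide:
import requests

# '<access_token>' is a placeholder; a real token would be obtained from a login endpoint
headers = {'Authorization': 'Bearer <access_token>'}
response = requests.get('http://127.0.0.1:8000/depth-data', headers=headers)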
Now with everything ready, let's start our web server by running the following command in terminal from the project root directory:
uvicorn app.main:app --reload
By default, uvicorn serves the application on port 8000. One advantage of FastAPI is its built-in Swagger UI support. Once the web server has started, the Swagger UI for all the APIs can be accessed through:
http://127.0.0.1:8000/docs
Something like this will pop up in your browser:

Instead of using tools like Postman, all the APIs can be tested directly through the Swagger UI. That's pretty cool and convenient, but it is not the main focus of this article. Our goal is to use Python to retrieve data through REST APIs.
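As a side note, the Swagger UI is generated from the OpenAPI schema that FastAPI exposes at /openapi.json by default, so the same schema can also be fetched programmatically. A small sketch:
import requests

# FastAPI serves the machine-readable API schema at /openapi.json by default
schema = requests.get('http://127.0.0.1:8000/openapi.json').json()
print(schema['paths'].keys())  # should include '/depth-data'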
Data Retrieval in A Synchronous Pattern
To make synchronous HTTP requests in Python, the requests library will be used in this tutorial. If you don't have it, install it through:
pip install requests
Or if you are using Jupyter or Colab:
!pip install requests
Let's try to retrieve all the depth data in the database through a REST API call. There are different types of HTTP methods. Because we are retrieving data, the GET method is used through requests.get:
import requests
api_url = 'http://127.0.0.1:8000/depth-data'
response = requests.get(api_url)
print(response.json())
This will print all 10 records of depth data stored in the database as a list of Python dictionaries:

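Since the response body is JSON, it is straightforward to load it into a Pandas DataFrame for further analysis, just like in Episode 1. A minimal sketch, which also checks the status code and raises an exception on a non-2xx response:
import pandas as pd
import requests

api_url = 'http://127.0.0.1:8000/depth-data'
response = requests.get(api_url)
response.raise_for_status()          # raise an exception if the request failed
df = pd.DataFrame(response.json())   # one row per depth record
print(df.head())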
In real scenarios, it is likely that hundreds of thousands of records are already stored in the database, and retrieving them all in a single API call is not a good idea. In this case one can use the query parameter limit to cap the total number of records returned by the API. Say we want to retrieve at most 3 records from the API:
# Equivalent to requests.get('http://127.0.0.1:8000/depth-data?limit=3')
response = requests.get(api_url, params={'limit': 3})
print(response.json())
We use the params argument to pass the query parameters to our HTTP request as a Python dictionary. This time, only the first 3 records from the database are retrieved:

If the total number of records in the database is less than limit, the API simply returns all the records. Now let's say we want to skip the first 2 records in the database and only retrieve the third, fourth and fifth records. In this case, we can combine the query parameters skip and limit to fulfill the requirement:
# Equivalent to requests.get('http://127.0.0.1:8000/depth-data?skip=2&limit=3')
response = requests.get(api_url, params={'skip': 2, 'limit': 3})
print(response.json())
The third, fourth and fifth records from the database were retrieved as expected:

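When the table is large, the same skip and limit parameters can be combined into a simple pagination loop that fetches the data in fixed-size pages. A sketch of that idea, where the page size of 3 is an arbitrary choice for illustration:
import requests

api_url = 'http://127.0.0.1:8000/depth-data'
page_size = 3  # arbitrary page size for illustration
all_records = []
skip = 0
while True:
    page = requests.get(api_url, params={'skip': skip, 'limit': page_size}).json()
    all_records.extend(page)
    if len(page) < page_size:  # last (possibly partial) page reached
        break
    skip += page_size
print(len(all_records))  # 10 records in our example database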
Data Retrieval in An Asynchronous Pattern
We addressed how the asynchronous programming pattern can boost the performance of applications in Episode 1 and Episode 2 of this series. To make asynchronous API calls, the AIOHTTP library is used in this tutorial. The library can be installed through:
pip install aiohttp
The following is a complete piece of code that uses AIOHTTP together with Python's asyncio to retrieve the data from the API in an asynchronous pattern:
import aiohttp
import asyncio

api_url = 'http://127.0.0.1:8000/depth-data'


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url=api_url, params={'skip': 2, 'limit': 3}) as response:
            resp_json = await response.json()
            print(resp_json)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
This asynchronous pattern can be integrated into any Python framework that supports asyncio to boost performance.
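The real benefit of the asynchronous pattern shows up when several requests are issued concurrently, for example fetching multiple pages at once with asyncio.gather. A minimal sketch reusing the same endpoint, where the page boundaries are arbitrary sample values:
import aiohttp
import asyncio

api_url = 'http://127.0.0.1:8000/depth-data'


async def fetch_page(session, skip, limit):
    # One GET request for a single page of records
    async with session.get(api_url, params={'skip': skip, 'limit': limit}) as response:
        return await response.json()


async def main():
    async with aiohttp.ClientSession() as session:
        # Fire off several page requests concurrently and wait for all of them
        pages = await asyncio.gather(*(fetch_page(session, skip, 3) for skip in (0, 3, 6, 9)))
    records = [record for page in pages for record in page]
    print(len(records))  # 10 records in our example database


if __name__ == '__main__':
    asyncio.run(main())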
Conclusions
In this article, we mainly went through how to retrieve data through REST APIs in both a synchronous and an asynchronous pattern. We also covered how to build a simple web server using a state-of-the-art tech stack to serve data through REST APIs.
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!