Compress Files in Memory and Upload to AWS S3
Written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into the state-of-the-art techniques that combine drilling engineering and data science.
In the first episode of this series, we introduced how to efficiently load data from AWS S3 buckets into Pandas DataFrames. You may wonder how to reverse this process: say we have some Pandas DataFrames or Matplotlib figures in memory, how can we efficiently upload them to AWS S3 buckets? Or, taking it one step further, compress them into a .zip file first and then upload? We will cover these topics in today's blog post.
Compress an In-Memory Pandas DataFrame and Upload to AWS S3
Say you have created a large Pandas DataFrame in memory and want to upload it to an AWS S3 bucket as a .csv file. Since the DataFrame is large, it is better to compress it to reduce its size before uploading it to S3. Let's see how to do this. First, let's install the necessary dependencies:
pip install boto3 pandas
Next, import necessary dependencies and define some useful variables:
import boto3
import pandas as pd
from io import BytesIO
import os
aws_access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')
region_name = os.getenv('AWS_REGION_NAME')
# Don't forget to replace with your own.
s3_bucket_name = 'your-own-bucket-name'
# key defines the folder hierarchy in your bucket, replace with your own
key = 'your-own-key'
Create an S3 client:
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name
)
Assume we have a large in-memory DataFrame denoted as df:
df = a-large-in-memory-DataFrame
Use the following code to convert the DataFrame into a .csv file, compress it into a .zip file and upload the .zip file to the S3 bucket. All these operations are performed in memory:
# Replace this with your own csv file name
archive_name = 'name-of-the-csv-file'

with BytesIO() as csv_buffer:
    df.to_csv(
        csv_buffer,
        compression={"method": "zip", "archive_name": archive_name},
        index=False
    )
    csv_buffer.seek(0)
    s3.upload_fileobj(
        csv_buffer,
        s3_bucket_name,
        key
    )
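If you want to make sure the upload worked, you can download the object and read it straight back into a DataFrame; Pandas can read a zipped .csv directly from an in-memory buffer. The following is a minimal sketch that reuses the s3 client, s3_bucket_name and key defined above:

# Round-trip check: download the zipped csv and load it back into a DataFrame
with BytesIO() as zip_buffer:
    s3.download_fileobj(s3_bucket_name, key, zip_buffer)
    zip_buffer.seek(0)
    df_check = pd.read_csv(zip_buffer, compression='zip')

print(df_check.head())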
Compress Multiple Matplotlib Figures and Upload to AWS S3
Now let's solve another problem. Say we want to generate two figures using Matplotlib, compress them into one .zip file and upload it to an S3 bucket. The figures don't need to be displayed, and all the operations should be performed in memory. As always, let's first import the necessary dependencies:
import numpy as np
import matplotlib.pyplot as plt
import copy
from zipfile import ZipFile
Define the data used to generate the figures:
X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)
The following code accomplishes what we just described:
key = 'your-own-key'
with BytesIO() as data:
    with ZipFile(data, 'w') as zf:
        for i, val in enumerate([C, S]):
            fig = plt.figure(num=(i + 1), figsize=(6, 4))
            plt.plot(X, val)
            # Render the figure to an in-memory PNG and add it to the archive
            with BytesIO() as buf:
                plt.savefig(buf)
                plt.close()
                img_name = f'{i}.png'
                zf.writestr(img_name, buf.getvalue())
    # The ZipFile context must be closed before uploading so the archive is finalized
    data.seek(0)
    data_copy = copy.deepcopy(data)
    s3.upload_fileobj(
        data_copy,
        s3_bucket_name,
        key
    )
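To confirm that both figures made it into the archive, you can pull the object back down and inspect it with ZipFile. Here is a minimal sketch, again reusing the s3 client, s3_bucket_name and key from above:

# Download the uploaded archive and list its contents
with BytesIO() as data:
    s3.download_fileobj(s3_bucket_name, key, data)
    data.seek(0)
    with ZipFile(data) as zf:
        print(zf.namelist())  # expected: ['0.png', '1.png']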
Multipart Upload to S3
Finally, let's talk about multipart upload to S3 using boto3, which is especially useful when uploading large files. Import the necessary dependencies:
from boto3.s3.transfer import TransferConfig
import threading
import sys
Define an S3 resource:
boto3_session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name
)
s3 = boto3_session.resource('s3')
Define a config object:
config = TransferConfig(multipart_threshold=1024 * 1024 * 25,  # 25MB
                        max_concurrency=10,
                        multipart_chunksize=1024 * 1024 * 25,  # 25MB
                        use_threads=True)
Define a class which is used to display the uploading progress in the terminal:
class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s %s / %s (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()
Finally, define a function to handle the uploads:
def multipart_upload_boto3(file_path, key):
    # file_path is the path to the local file to be uploaded
    # key is the same as defined in the previous examples
    s3.Object(s3_bucket_name, key).upload_file(
        file_path,
        Config=config,
        Callback=ProgressPercentage(file_path)
    )
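Calling the function is then straightforward. The local file path below is just a hypothetical placeholder; any file larger than the 25MB threshold will be uploaded in parallel parts, with the progress printed to the terminal:

# Replace with the path to your own local file
local_file_path = 'path/to/your/large-file.zip'
multipart_upload_boto3(local_file_path, key)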
Conclusions
In this article, we went through how to compress files and upload them efficiently to AWS S3 buckets, with all the operations performed in memory. We also covered multipart uploads to S3 using boto3. More AWS tricks will be covered in future episodes. Stay tuned!
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!