Compress Files in Memory and Upload to AWS S3
Written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into the state-of-the-art techniques that combine drilling engineering and data science.
In the first episode of this series, we introduced how to efficiently load data from AWS S3 buckets into Pandas DataFrames. You may wonder how to reverse this process: say we have some Pandas DataFrames or Matplotlib figures in memory, how can we efficiently upload them to AWS S3 buckets? Or, taking it one step further, compress them into a .zip file first and then upload? We will cover these topics in today's blog post.
Compress an In-Memory Pandas DataFrame and Upload to AWS S3
Say you have created a large Pandas DataFrame in memory and want to upload it to an AWS S3 bucket as a .csv file. Since the DataFrame is large, it is better to compress it to reduce its size before uploading it to S3. Let's see how to do this. First, let's install the necessary dependencies:
pip install boto3 pandas
Next, import necessary dependencies and define some useful variables:
import boto3
import pandas as pd
from io import BytesIO
import os
aws_access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')
region_name = os.getenv('AWS_REGION_NAME')
# Don't forget to replace with your own.
s3_bucket_name = 'your-own-bucket-name'
# key defines the folder hierarchy in your bucket, replace with your own
key = 'your-own-key'
Create an S3 client:
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name
)
Assume we have a large in-memory DataFrame denoted as df:
df = a-large-in-memory-DataFrame
Use the following code to convert the DataFrame into a .csv file, compress it into a .zip file and upload the .zip file to the S3 bucket. All these operations are performed in memory:
# Replace this with your own csv file name
archive_name = 'name-of-the-csv-file'

with BytesIO() as csv_buffer:
    df.to_csv(
        csv_buffer,
        compression={"method": "zip", "archive_name": archive_name},
        index=False
    )
    csv_buffer.seek(0)
    s3.upload_fileobj(
        csv_buffer,
        s3_bucket_name,
        key
    )
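If you want to make sure the upload worked, you can download the object and read it straight back into a DataFrame; Pandas can read a zipped .csv directly from an in-memory buffer. The following is a minimal sketch that reuses the s3 client, s3_bucket_name and key defined above:

# Round-trip check: download the zipped csv and load it back into a DataFrame
with BytesIO() as zip_buffer:
    s3.download_fileobj(s3_bucket_name, key, zip_buffer)
    zip_buffer.seek(0)
    df_check = pd.read_csv(zip_buffer, compression='zip')

print(df_check.head())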
Compress Multiple Matplotlib Figures and Upload to AWS S3
Now let's solve another problem. Say we want to generate two figures using Matplotlib, compress them into one .zip file and upload it to an S3 bucket. The figures don't need to be displayed, and all the operations should be performed in memory. As always, let's first import the necessary dependencies:
import numpy as np
import matplotlib.pyplot as plt
import copy
from zipfile import ZipFile
Define the data used to generate the figures:
X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)
The following code accomplishes what we just described:
key = 'your-own-key'
with BytesIO() as data:
    with ZipFile(data, 'w') as zf:
        for i, val in enumerate([C, S]):
            fig = plt.figure(num=(i + 1), figsize=(6, 4))
            plt.plot(X, val)
            # Render the figure to an in-memory PNG and add it to the archive
            with BytesIO() as buf:
                plt.savefig(buf)
                plt.close()
                img_name = f'{i}.png'
                zf.writestr(img_name, buf.getvalue())
    # The ZipFile context must be closed before uploading so the archive is finalized
    data.seek(0)
    data_copy = copy.deepcopy(data)
    s3.upload_fileobj(
        data_copy,
        s3_bucket_name,
        key
    )
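To confirm that both figures made it into the archive, you can pull the object back down and inspect it with ZipFile. Here is a minimal sketch, again reusing the s3 client, s3_bucket_name and key from above:

# Download the uploaded archive and list its contents
with BytesIO() as data:
    s3.download_fileobj(s3_bucket_name, key, data)
    data.seek(0)
    with ZipFile(data) as zf:
        print(zf.namelist())  # expected: ['0.png', '1.png']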
Multipart Upload to S3
Finally, let's talk about multipart upload to S3 using boto3, which is especially useful when uploading large files. Import the necessary dependencies:
from boto3.s3.transfer import TransferConfig
import threading
import sys
Define an S3 resource:
boto3_session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name
)
s3 = boto3_session.resource('s3')
Define a config object:
config = TransferConfig(multipart_threshold=1024 * 1024 * 25,  # 25MB
                        max_concurrency=10,
                        multipart_chunksize=1024 * 1024 * 25,  # 25MB
                        use_threads=True)
Define a class which is used to display the uploading progress in the terminal:
class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s %s / %s (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()
Finally, define a function to handle the uploads:
def multipart_upload_boto3(file_path, key):
    # file_path is the path to the local file to be uploaded
    # key is the same as defined in the previous examples
    s3.Object(s3_bucket_name, key).upload_file(
        file_path,
        Config=config,
        Callback=ProgressPercentage(file_path)
    )
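Calling the function is then straightforward. The local file path below is just a hypothetical placeholder; any file larger than the 25MB threshold will be uploaded in parallel parts, with the progress printed to the terminal:

# Replace with the path to your own local file
local_file_path = 'path/to/your/large-file.zip'
multipart_upload_boto3(local_file_path, key)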
Conclusions
In this article, we went through how to compress files and upload them efficiently to AWS S3 buckets, with all the operations performed in memory. We also covered multipart uploads to S3 using boto3. More AWS tricks will be covered in future episodes. Stay tuned!
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!