AutoML with AutoGluon - A Soft Introduction
written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI
Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques combining drilling engineering and data science.
This episode is a soft introduction to a popular open-source AutoML library - AutoGluon. Enjoy. :)

Enjoying great knowledge is just like enjoying delicious hot-stone steak.
Introduction
Automated Machine Learning (AutoML) has become increasingly popular. Major cloud providers like Amazon, Google, and Microsoft all offer their own AutoML services on their cloud platforms. Besides these cloud-based AutoML services, there are also some excellent open-source AutoML libraries we can use. Today, our main character is the open-source AutoML library from Amazon: AutoGluon.
What We'll Cover Today
What is AutoGluon.
A simple example of usage.
What is AutoGluon
AutoML is the practice of automating the whole machine learning cycle, including preprocessing, feature selection, model selection, hyperparameter tuning, etc. There are already some pretty nice open-source AutoML libraries on the market at the time of writing this blog post, including Auto-Sklearn, Auto-Keras, TPOT, MLBox, etc. The library we will focus on today is the one from Amazon, called AutoGluon.
AWS provides an AutoML service, SageMaker Autopilot, within its famous cloud-based machine learning service SageMaker. Separately, Amazon also provides the open-source AutoML library AutoGluon, which can be easily installed and integrated into your own projects. According to AutoGluon's official documentation:
AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning image, text, and tabular data. Intended for both ML beginners and experts, AutoGluon enables you to:
Quickly prototype deep learning and classical ML solutions for your raw data with a few lines of code.
Automatically utilize state-of-the-art techniques (where appropriate) without expert knowledge.
Leverage automatic hyperparameter tuning, model selection/ensembling, architecture search, and data processing.
Easily improve/tune your bespoke models and data pipelines, or customize AutoGluon for your use-case.
The tasks that can be handled by AutoGluon include, but are not limited to, tabular prediction, multimodal prediction, image prediction, object detection, text prediction, time series forecasting, feature selection, etc.
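For orientation, here is a minimal sketch (not from the original post) of the predictor classes behind these task families. The import paths assume a recent AutoGluon release where the multimodal and time series submodules are installed:
# Each task family has its own predictor class; the paths below assume
# the corresponding AutoGluon submodules are available in your version.
from autogluon.tabular import TabularPredictor        # tabular prediction
from autogluon.multimodal import MultiModalPredictor  # image/text/multimodal prediction
from autogluon.timeseries import TimeSeriesPredictor  # time series forecasting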
A Simple Example
Let's go through a simple example to see how AutoGluon actually works. We will use AutoGluon to enter Kaggle's Titanic prediction competition. I am using Google Colab as the IDE for this project. First, install AutoGluon using the following command:
!pip install autogluon
Import the necessary dependencies:
import pandas as pd
from autogluon.tabular import TabularPredictor
pd.options.display.max_columns = None
Next, let's load the data. Note that I have already mounted my Google Drive to the Colab notebook.
data_folder_path = '/content/drive/MyDrive/test_data/titanic'
train_df = pd.read_csv(data_folder_path + '/' + 'train.csv')
test_df = pd.read_csv(data_folder_path + '/' + 'test.csv')
print(f'Train data shape: {train_df.shape}')
print(f'Test data shape: {test_df.shape}')
The shapes of the two datasets are:
Train data shape: (891, 12)
Test data shape: (418, 11)
Take a look at the train dataset:
train_df.head()
The head of the train dataset looks as follows:

Take a look at the test dataset:
test_df.head()
The head of the test dataset looks as follows:

AutoGluon only needs to know which column is the "label column"; it can then automatically infer the type of the prediction problem from the data. In this case, the label column is the Survived column:
label = 'Survived'
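AutoGluon will infer a binary classification problem here, since Survived only takes two values. If the inference ever guesses wrong on your own data, the problem type can be pinned explicitly; a hedged sketch using the documented constructor arguments:
# Optional (not used below): set the problem type instead of relying on
# inference; accepted values include 'binary', 'multiclass' and 'regression'.
explicit_predictor = TabularPredictor(label=label, problem_type='binary', eval_metric='accuracy')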
Define the path (folder) to store the trained models:
save_path = 'saved_model'
Next, define the time limit for the training:
time_limit = 1800 # half an hour
Lastly, the presets option needs to be defined. The following is a table from AutoGluon's official documentation which summarizes all the available options for presets:

In our case, the "best_quality" option is selected:
presets = 'best_quality'
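Since the table above is an image in the original post, here is a rough summary of the commonly documented preset names, as comments; the exact names and trade-offs vary between AutoGluon versions, so treat this as a hedged sketch rather than the definitive list:
# 'best_quality'   - highest accuracy; slow training/inference, large disk usage
# 'high_quality'   - strong accuracy with faster inference
# 'good_quality'   - good accuracy with much faster inference
# 'medium_quality' - fast baseline for prototyping (the default)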
Define and train our predictor:
predictor = TabularPredictor(label=label, path=save_path).fit(train_df, time_limit=time_limit, presets=presets)
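Because path=save_path was passed to the constructor, the trained models are persisted to disk. Although this step is not in the original walkthrough, a fitted predictor can be reloaded later with the documented TabularPredictor.load method:
# Reload the persisted predictor, e.g. in a fresh Colab session.
predictor = TabularPredictor.load(save_path)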
After the training process has finished, the predictor can be used to generate predictions for the test dataset:
y_pred = predictor.predict(test_df)
print("Predictions: \n", y_pred)
The results are:
Predictions:
0 0
1 1
2 0
3 0
4 1
..
413 0
414 1
415 0
416 0
417 0
Name: Survived, Length: 418, dtype: int64
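If class probabilities are needed instead of hard labels, for example to tune a decision threshold, TabularPredictor also provides predict_proba:
# Returns one probability column per class (here: 0 and 1).
y_pred_proba = predictor.predict_proba(test_df)
print(y_pred_proba.head())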
We can also rank the different models AutoGluon tried by their performance on the train dataset:
predictor.leaderboard(train_df, silent=True)
The results are:

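One caveat: scoring the leaderboard on the training data can overstate performance, since the models have already seen those rows. A minimal sketch of a fairer comparison, assuming we carve out a holdout split before fitting (train_test_split is used purely for illustration):
from sklearn.model_selection import train_test_split

# Hold out 20% of the labeled data that the models never train on.
train_part, holdout_part = train_test_split(train_df, test_size=0.2, random_state=42)
holdout_predictor = TabularPredictor(label=label).fit(train_part, time_limit=time_limit, presets=presets)
holdout_predictor.leaderboard(holdout_part, silent=True)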
The final step is to submit our predictions to Kaggle. Save the submission as a .csv file and then submit it through Kaggle's official website:
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': y_pred
})
submission.to_csv('/content/drive/MyDrive/test_data/submission.csv', index=False)
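As an alternative to the website upload (not covered in the original post), the file can also be submitted from the notebook with the official Kaggle CLI, assuming the kaggle package is installed and an API token is configured:
!kaggle competitions submit -c titanic -f /content/drive/MyDrive/test_data/submission.csv -m "AutoGluon baseline"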
It turned out that we were able to get a pretty good rank in this competition by simply training with AutoGluon for half an hour. No preprocessing, model selection, or hyperparameter tuning was needed, since AutoGluon handles all of these for you. It is that simple, and the outcome is pretty good!
If one is interested in the feature importance of the train dataset, the following command can be used:
predictor.feature_importance(train_df)
The results are:

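For context (an addition to the original post): these scores are computed via permutation importance, shuffling one column at a time and measuring the drop in performance, so the call can be slow on large datasets. The documented subsample_size argument trades precision for speed:
# Estimate importance on a random subsample to cut computation time.
predictor.feature_importance(train_df, subsample_size=500)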
This wraps up today's basic example of AutoGluon usage. For more details, please refer to AutoGluon's official documentation.
Conclusions
In this article, we covered what AutoGluon is and its basic usage through a simple Kaggle example. In most cases, AutoGluon can be used as a baseline or for fast prototyping and still achieve pretty good performance. Overall, it's a really powerful tool, especially for those who are not data science experts! Hope you enjoyed this article! More interesting content will be covered in future episodes. Stay tuned!
Get in Touch
Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.
If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!