
Data Science in Drilling - Episode 20

Writer: Zeyu Yan

Sentiment Analysis Using Hugging Face Transformers


written by Zeyu Yan, Ph.D., Head of Data Science from Nvicta AI


Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.


In this episode, we will go through a quick example of how to use transformer models from Hugging Face to perform sentiment analysis.


Hugging Face and Transformers


Transformer models are state-of-the-art models used to solve all kinds of NLP tasks. Hugging Face is a community and data science platform that provides tools enabling users to build, train and deploy ML models based on open-source code and technologies, especially transformer-based advanced NLP models. Hugging Face's tutorials on its own transformer models are fantastic, and I highly recommend taking a look at them to learn the details.


In this tutorial, we are going to fine-tune a pretrained transformer model from Hugging Face to complete a sentiment analysis task. The IDE we are going to use is Google Colab.


Get the Data


We will use the "sentiment-analysis-on-movie-reviews" dataset from Kaggle. Make sure you have already created a Kaggle account and mounted your Google Drive to the current Colab notebook. I use the following commands to move the kaggle.json file to the right location:

!mkdir -p /root/.kaggle/
!cp /content/drive/MyDrive/kaggle.json /root/.kaggle/
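
The Kaggle API may also warn that the credentials file is readable by other users on the system. If you see that warning, tightening the file permissions silences it; this is an optional extra step, not part of the original setup:

!chmod 600 /root/.kaggle/kaggle.json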

Install the necessary dependencies:

!pip install transformers

Import the necessary dependencies:

from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile
import os
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, TextClassificationPipeline
import tensorflow as tf

Authenticate the Kaggle API:

api = KaggleApi()
api.authenticate()

Download and unzip the data files:

for file in ['train.tsv', 'test.tsv']:
    api.competition_download_file('sentiment-analysis-on-movie-reviews', f'{file}.zip', path='./')

    with zipfile.ZipFile(f'{file}.zip', 'r') as zip_ref:
        zip_ref.extractall('./')

    os.remove(f'{file}.zip')

Now we should have train.tsv and test.tsv files in the same folder as the current Colab notebook. Let's take a look at the data:

df = pd.read_csv('train.tsv', sep='\t')
df.head()

The training data looks as follows. It contains four columns: PhraseId, SentenceId, Phrase and Sentiment.


Take a look at its shape:

df.shape

The result is:

(156060, 4)

Plot a bar chart for the counts of different sentiments:

df['Sentiment'].value_counts().plot(kind='bar')

The plot looks as follows. The neutral class (Sentiment 2) is by far the most frequent:


Tokenizer


A tokenizer is a necessary part of a transformer model, and we need to use the tokenizer that corresponds to the model we plan to use. We are going to use the "bert-base-cased" model, therefore the same checkpoint also applies to the tokenizer:

checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The maximum sequence length after tokenization for a BERT model is 512, so we define our maximum length accordingly:

max_length = 512
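
Before tokenizing the whole dataset, it can help to look at what the tokenizer produces for a single phrase. The sentence below is made up purely for illustration:

# Illustrative only: inspect the wordpiece tokens and ids for one made-up phrase.
print(tokenizer.tokenize("A quietly moving film."))
print(tokenizer("A quietly moving film.")['input_ids'])  # ids including [CLS] and [SEP]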

Now let's test to see if our tokenizer actually works:

tokens = tokenizer(
    df['Phrase'].tolist(), 
    max_length=max_length, 
    truncation=True, 
    padding=True, 
    return_tensors='tf'
)

for key, value in tokens.items():
    print(key, value.shape)

We are going to use TensorFlow in this example, therefore we are returning TF tensors. The results are as follows:

input_ids (156060, 84) 
token_type_ids (156060, 84) 
attention_mask (156060, 84)

Here 84 corresponds to the token length of the longest phrase in the "Phrase" column (including the special tokens). It seems that our tokenizer is working.
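
If you want to verify that number, you can compute the tokenized length of every phrase directly. This check is optional and a bit slow, but it is a good sanity test:

# Optional sanity check: the longest tokenized phrase (with special tokens)
# should have length 84, matching the shape printed above.
lengths = [len(ids) for ids in tokenizer(df['Phrase'].tolist())['input_ids']]
print(max(lengths))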


Preprocess the Data


First let's one-hot encode the "Sentiment" column:

sentiment_values = df['Sentiment'].values
labels = np.zeros((df.shape[0], sentiment_values.max() + 1))
labels[np.arange(df.shape[0]), sentiment_values] = 1
labels

The labels variable looks as follows:

array([[0., 1., 0., 0., 0.],        
       [0., 0., 1., 0., 0.],        
       [0., 0., 1., 0., 0.],        
       ...,        
       [0., 0., 0., 1., 0.],        
       [0., 0., 1., 0., 0.],        
       [0., 0., 1., 0., 0.]])

Also check its shape:

labels.shape

The result is:

(156060, 5)
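
As a side note, the same one-hot encoding can be produced with a Keras helper. The snippet below is just an equivalent alternative, not what the rest of the article uses:

# Equivalent one-hot encoding using a Keras utility.
labels_alt = tf.keras.utils.to_categorical(sentiment_values, num_classes=5)
print(np.array_equal(labels, labels_alt))  # expected: True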

Everything looks good. The next step is to create a TensorFlow Dataset instance:

dataset = tf.data.Dataset.from_tensor_slices((tokens['input_ids'], tokens['token_type_ids'], tokens['attention_mask'], labels))

However, this dataset is not in the right format to be passed into a Hugging Face transformer model, so some extra processing is needed:

def map_func(input_ids, token_type_ids, attention_mask, labels):
    return {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': attention_mask}, labels

dataset = dataset.map(map_func)

Now the dataset is in the right format. Let's split it into mini-batches and perform a train-validation split:

batch_size = 16
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)

train_val_split = 0.9
train_size = int(df.shape[0] / batch_size * train_val_split)
train_ds = dataset.take(train_size)
val_ds = dataset.skip(train_size)
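
Since drop_remainder=True was used, the total number of batches is known, and we can sanity-check how many batches end up in each split:

# Optional: confirm how many batches go to training vs. validation.
print(train_size)                         # number of training batches
print(train_ds.cardinality().numpy())     # should equal train_size
print(val_ds.cardinality().numpy())       # remaining batches for validation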

Let's have a final check:

train_ds.take(1)

The result is:

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 84), dtype=tf.int32, name=None), 'token_type_ids': TensorSpec(shape=(16, 84), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(16, 84), dtype=tf.int32, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>

Model


Defining the model is pretty easy through Hugging Face:

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5) 

The next step is to define the optimizer and the loss to compile the model:

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
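
One caveat: the decay argument of Adam has been removed in recent Keras releases. If you hit an error on a newer TensorFlow version, a learning-rate schedule is one possible replacement; the snippet below is a sketch, not what was used for the results shown here:

# Alternative for newer Keras versions where Adam no longer accepts `decay`:
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5, decay_steps=10000, decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])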

Finally, we are able to train our model. I trained the model on a GPU for 12 epochs:

history = model.fit(train_ds, validation_data=val_ds, epochs=12)

The results are as follows:


We can see that after 12 epochs, the training accuracy is around 92% and the validation accuracy is around 79%.
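
It is also a good idea to persist the fine-tuned weights before the Colab runtime is recycled. save_pretrained is the standard Hugging Face way to do this; the directory name below is arbitrary:

# Save the fine-tuned model and its tokenizer locally (example path).
model.save_pretrained('./movie-sentiment-bert')
tokenizer.save_pretrained('./movie-sentiment-bert')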


Use the Trained Model to Make Predictions


The easiest way to use the trained model to make predictions is to wrap the model as a Hugging Face Pipeline object:

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=5)

Now let's try to make some predictions:

temp_df = df.drop_duplicates(subset=['SentenceId'], keep='first')

for i in range(10):
    print(pipe(temp_df.iloc[i]['Phrase'])[0][0])
    print(temp_df.iloc[i]['Sentiment'])
    print()

The results are:

{'label': 'LABEL_1', 'score': 0.899242103099823} 
1  

{'label': 'LABEL_4', 'score': 0.8477073311805725} 
4  

{'label': 'LABEL_1', 'score': 0.9607570171356201} 
1  

{'label': 'LABEL_3', 'score': 0.5494729280471802} 
3  

{'label': 'LABEL_0', 'score': 0.6285638809204102} 
1  

{'label': 'LABEL_4', 'score': 0.9250724911689758} 
4  

{'label': 'LABEL_1', 'score': 0.8566941618919373} 
1  

{'label': 'LABEL_2', 'score': 0.546370267868042} 
3  

{'label': 'LABEL_1', 'score': 0.9433814287185669} 
1  

{'label': 'LABEL_1', 'score': 0.9836373925209045} 
1

Eight out of the ten predictions are correct, which is not bad. You can also upload your trained model to the Hugging Face Hub to make online inferences; we will cover how to do this in future episodes.
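
The LABEL_0 through LABEL_4 names come from the model's default config. If you prefer human-readable outputs, you can attach the label meanings from the Kaggle competition description (0 = negative through 4 = positive) to the config. This is an optional tweak, not part of the original workflow:

# Optional: map numeric class ids to the sentiment names used by the Kaggle competition,
# so the pipeline reports e.g. 'somewhat negative' instead of 'LABEL_1'.
id2label = {0: 'negative', 1: 'somewhat negative', 2: 'neutral',
            3: 'somewhat positive', 4: 'positive'}
model.config.id2label = id2label
model.config.label2id = {v: k for k, v in id2label.items()}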


Conclusions


In this article, we went through a quick example of fine-tuning a pretrained transformer model from Hugging Face to complete a sentiment analysis task. More about NLP and Hugging Face will be covered in future episodes. Stay tuned!


Get in Touch


Thank you for reading! Please let us know if you like this series or if you have critiques. If this series has been helpful to you, please follow us and share it with your friends.


If you or your company needs any help on projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!



