
Data Science in Drilling - Episode 21

Writer: Zeyu Yan

Hugging Face Transformers' Trainer API


written by Zeyu Yan, Ph.D., Head of Data Science at Nvicta AI


Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.


In the previous episode, we introduced how to use Hugging Face's transformers together with TensorFlow to train a model and perform sentiment analysis. In this episode, we will perform the same task using Hugging Face transformers' Trainer API.


Preparations


The preparations are the same as in the previous episode. The only difference is that this time we only download the datasets, without uncompressing them. Refer to the previous episode for more details on this part.

!mkdir -p /root/.kaggle/
!cp /content/drive/MyDrive/kaggle.json /root/.kaggle/

!pip install transformers datasets

from kaggle.api.kaggle_api_extended import KaggleApi
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, AdamW, get_scheduler, TrainingArguments, Trainer
from torch.utils.data import DataLoader
import torch
from tqdm.auto import tqdm
import numpy as np

api = KaggleApi()
api.authenticate()

for file in ['train.tsv', 'test.tsv']:
    api.competition_download_file('sentiment-analysis-on-movie-reviews', f'{file}.zip', path='./')

Data Preprocessing


The load_dataset function from Hugging Face's datasets library can load data into memory directly from compressed files:

data_files = {
    'train': 'train.tsv.zip'
}

raw_datasets = load_dataset('csv', sep='\t', data_files=data_files)

Let's take a look at raw_datasets:

raw_datasets

The result is:

DatasetDict({     
    train: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'],         
        num_rows: 156060     
    }) 
})

The original dataset has 156,060 rows, which is quite large. Let's sample 10,000 rows from it:

raw_datasets['train'] = raw_datasets['train'].shuffle(seed=42).select(range(10000))
raw_datasets

The result is:

DatasetDict({     
    train: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'],         
        num_rows: 10000     
    }) 
})

Let's then perform a train-validation-test split:

train_test = raw_datasets['train'].train_test_split(train_size=0.8, seed=42)
valid_test = train_test['test'].train_test_split(train_size=0.5, seed=42)
raw_datasets['train'] = train_test['train']
raw_datasets['valid'] = valid_test['train']
raw_datasets['test'] = valid_test['test']
raw_datasets

The result is:

DatasetDict({     
    train: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'],         
        num_rows: 8000     
    })     
    valid: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'],         
        num_rows: 1000     
    })     
    test: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'],         
        num_rows: 1000     
    }) 
})

Let's take a look at an instance from the train set:

raw_datasets['train'][0]

The result is:

{'PhraseId': 120026,  
 'SentenceId': 6420,  
 'Phrase': "use the word `` new '' in its title",  
 'Sentiment': 2}

Now it's time to define our tokenizer. This time we will use the "distilbert-base-uncased" model and its corresponding tokenizer:

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
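
Before mapping the tokenizer over the whole dataset, it can help to tokenize a single phrase and inspect the output. This is just an optional sanity check; the exact token ids depend on the checkpoint:

sample_phrase = raw_datasets['train'][0]['Phrase']
encoded = tokenizer(sample_phrase, truncation=True)
print(encoded.keys())                                         # input_ids and attention_mask
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))  # subword tokens, incl. [CLS] and [SEP]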

The map method on a Hugging Face DatasetDict can apply a function to the whole dataset in batches, which is highly efficient. To use the map method, we first need to define a function:

def tokenize_function(example):
    return tokenizer(example['Phrase'], truncation=True)

Then feed this function to the map method:

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

The result is:

DatasetDict({     
    train: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment', 'input_ids', 'attention_mask'],         
        num_rows: 8000     
    })     
    valid: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment', 'input_ids', 'attention_mask'],         
        num_rows: 1000     
    })     
    test: Dataset({         
        features: ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment', 'input_ids', 'attention_mask'],         
        num_rows: 1000     
    }) 
})

Since the transformer model we use expects the features to be input_ids, attention_mask and labels, some further processing is needed:

tokenized_datasets = tokenized_datasets.remove_columns(['PhraseId', 'SentenceId', 'Phrase'])
tokenized_datasets = tokenized_datasets.rename_column('Sentiment', 'labels')
tokenized_datasets.set_format('torch')
tokenized_datasets

The result is:

DatasetDict({     
    train: Dataset({         
        features: ['labels', 'input_ids', 'attention_mask'],             
        num_rows: 8000     
    })     
    valid: Dataset({         
        features: ['labels', 'input_ids', 'attention_mask'],         
        num_rows: 1000     
    })     
    test: Dataset({         
        features: ['labels', 'input_ids', 'attention_mask'],         
        num_rows: 1000     
    }) 
})

We will apply dynamic padding to each batch of data passed to the model. To achieve dynamic padding, we need a data collator initialized with the tokenizer we use:

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
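
As an optional illustration (not part of the original pipeline), we can confirm that the collator pads a small batch to the length of its longest sequence:

# Grab a few tokenized samples of different lengths and let the collator pad them.
samples = [tokenized_datasets['train'][i] for i in range(8)]
print([len(s['input_ids']) for s in samples])   # varying lengths before padding
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})   # all tensors padded to the longest sequence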

The Model and Training


Now it's time to define our model and train it. As mentioned, we will use the "distilbert-base-uncased" model this time:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)
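
Note that the classification head on top of the pretrained DistilBERT body is newly initialized, which is why Hugging Face prints a warning about untrained weights. As an optional check of the configuration:

print(model.config.num_labels)                   # 5 sentiment classes
print(f'{model.num_parameters():,} parameters')  # total number of model parameters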

Before training, there are a couple of things that need to be defined. We first need to define the metrics for evaluation:

def compute_metrics(eval_pred):
    load_accuracy = load_metric('accuracy')
    load_f1 = load_metric('f1')

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1 = load_f1.compute(predictions=predictions, references=labels, average='weighted')['f1']
    return {'accuracy': accuracy, 'f1': f1}
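
As a quick sanity check (not part of the original pipeline), compute_metrics can be called directly on some dummy logits and labels:

# Toy batch: 3 samples, 5 classes; the third prediction is deliberately wrong.
dummy_logits = np.array([[0.1, 0.2, 3.0, 0.1, 0.1],
                         [2.5, 0.0, 0.1, 0.2, 0.3],
                         [0.0, 0.1, 0.2, 4.0, 0.3]])
dummy_labels = np.array([2, 0, 1])
print(compute_metrics((dummy_logits, dummy_labels)))  # accuracy and weighted f1 of about 0.67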

Hugging Face's Trainer API requires a TrainingArguments object as a parameter, so we need to define it as well:

training_args = TrainingArguments(
   output_dir='finetuning-sentiment-model-10000-samples',
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=12,
   weight_decay=0.01,
   save_strategy='epoch'
)

The output_dir is the directory where our model will be saved. In this case we choose to save the model at the end of every epoch.
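
Since checkpoints are written into output_dir under folders named checkpoint-<step>, a saved model can later be reloaded with from_pretrained. The folder name below is purely illustrative; the actual step number depends on the run:

# Hypothetical checkpoint path; replace the step number with one produced by your run.
reloaded_model = AutoModelForSequenceClassification.from_pretrained(
    'finetuning-sentiment-model-10000-samples/checkpoint-500'
)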


Finally we are ready to define our Trainer instance and train the model:

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_datasets['train'],
   eval_dataset=tokenized_datasets['train'],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

trainer.train()

The training outputs are as follows:


Let's evaluate our model on the train set:

trainer.evaluate()

The results are:

It can be seen that our model achieves over 99% accuracy on the train set. Keep in mind that this is accuracy on the same data the model was trained on, so accuracy on the held-out splits will naturally be lower. We covered how to use the trained model for inference in the last episode; please refer to it for more details.
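
For completeness, the same Trainer can also score the held-out splits created earlier; a minimal sketch:

# Evaluate on the validation and test splits that were held out during training.
print(trainer.evaluate(eval_dataset=tokenized_datasets['valid']))
print(trainer.evaluate(eval_dataset=tokenized_datasets['test']))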


Conclusions


In this article, we went through how to use Hugging Face's Trainer API to fine-tune a transformer model and perform a sentiment analysis task. More about NLP and Hugging Face will be covered in future episodes. Stay tuned!


Get in Touch


Thank you for reading! Please let us know if you like this series or if you have any critiques. If this series was helpful to you, please follow us and share it with your friends.


If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!


