Fine-tune SciBERT¶
We are planning to do a simple classification task on scientific text. For that, SciBERT is an ideal model to fine-tune, since it has been pretrained on academic publications.
This notebook was updated so that it can run in Google Colab.
First, we need to install the dependencies.
In [1]:
!pip install transformers datasets great-ai > /dev/null
Load the training data from S3. (We uploaded it to S3 in the data notebook.)
In [2]:
from great_ai.large_file import LargeFileS3
import json

LargeFileS3.configure_credentials_from_file("config.ini")

with LargeFileS3("summary-train-dataset-small", encoding="utf-8") as f:
    # splitting training and test data is done later by `datasets`
    X, y = json.load(f)
Latest version of summary-train-dataset-small is 0 (from versions: 0) File summary-train-dataset-small-0 found in cache
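Before fine-tuning, it's worth a quick look at what was just loaded. A minimal sanity check, assuming `X` is a list of text snippets and `y` is the parallel list of 0/1 labels produced in the data notebook:

from collections import Counter

# dataset size, label balance, and a peek at the first example
print(len(X), Counter(y))
print(X[0][:200], y[0])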
Fine-tune SciBERT. For more info about this step, check out the HuggingFace documentation. If you're only here for great-ai, feel free to skip the next cell.
In [22]:
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
from pathlib import Path
import numpy as np
from datasets import Dataset, load_metric
MODEL = "allenai/scibert_scivocab_uncased"
BATCH_SIZE = 32
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
def tokenize_function(v):
return tokenizer(v["text"])
dataset = (
Dataset.from_dict({"text": X, "label": y})
.map(lambda v: tokenizer(v["text"], truncation=True), batched=True)
.remove_columns("text")
.train_test_split(test_size=0.2, shuffle=True) # test is actually validation
)
f1_score = load_metric("f1")
def compute_metrics(p):
pred, labels = p
pred = np.argmax(pred, axis=1)
return f1_score.compute(predictions=pred, references=labels)
training_args = TrainingArguments(
output_dir=Path("models"),
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
save_total_limit=5,
num_train_epochs=50,
save_strategy="epoch",
evaluation_strategy="epoch",
logging_strategy="epoch",
weight_decay=0.01,
metric_for_best_model="f1",
load_best_model_at_end=True,
)
result = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
data_collator=data_collator,
compute_metrics=compute_metrics,
callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
).train()
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
from pathlib import Path
import numpy as np
from datasets import Dataset, load_metric
MODEL = "allenai/scibert_scivocab_uncased"
BATCH_SIZE = 32
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
def tokenize_function(v):
return tokenizer(v["text"])
dataset = (
Dataset.from_dict({"text": X, "label": y})
.map(lambda v: tokenizer(v["text"], truncation=True), batched=True)
.remove_columns("text")
.train_test_split(test_size=0.2, shuffle=True) # test is actually validation
)
f1_score = load_metric("f1")
def compute_metrics(p):
pred, labels = p
pred = np.argmax(pred, axis=1)
return f1_score.compute(predictions=pred, references=labels)
training_args = TrainingArguments(
output_dir=Path("models"),
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
save_total_limit=5,
num_train_epochs=50,
save_strategy="epoch",
evaluation_strategy="epoch",
logging_strategy="epoch",
weight_decay=0.01,
metric_for_best_model="f1",
load_best_model_at_end=True,
)
result = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
data_collator=data_collator,
compute_metrics=compute_metrics,
callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
).train()
[130/650 01:43 < 07:01, 1.23 it/s, Epoch 10/50]
Epoch | Training Loss | Validation Loss | F1
---|---|---|---
1 | 0.586800 | 0.512138 | 0.719101
2 | 0.411600 | 0.416675 | 0.849057
3 | 0.245600 | 0.417070 | 0.864000
4 | 0.147800 | 0.575878 | 0.852459
5 | 0.056800 | 0.474259 | 0.896552
6 | 0.022500 | 0.754236 | 0.843137
7 | 0.001000 | 0.857636 | 0.834783
8 | 0.000500 | 0.920232 | 0.869565
9 | 0.000300 | 0.970790 | 0.877193
10 | 0.000300 | 0.948689 | 0.862385
... Deleting older checkpoint [models/checkpoint-39] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 100
  Batch size = 32
Saving model checkpoint to models/checkpoint-117
Configuration saved in models/checkpoint-117/config.json
Model weights saved in models/checkpoint-117/pytorch_model.bin
Deleting older checkpoint [models/checkpoint-52] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 100
  Batch size = 32
Saving model checkpoint to models/checkpoint-130
Configuration saved in models/checkpoint-130/config.json
Model weights saved in models/checkpoint-130/pytorch_model.bin
Deleting older checkpoint [models/checkpoint-78] due to args.save_total_limit
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from models/checkpoint-65 (score: 0.896551724137931).
The best F1 score on the validation split is 0.89, which is (not surprisingly) substantially better than what the SVM achieved. We have a great model; it's time to deploy it. But first, we have to store it in a secure place.
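Before storing the model, a quick smoke test helps confirm the loaded best checkpoint behaves sensibly. This is only a sketch: it assumes the `model` and `tokenizer` objects from the training cell are still in memory, and the example sentence is made up.

import torch

# hypothetical input; any scientific-sounding sentence works
sentence = "We propose a transformer-based approach for screening scientific citations."

model.eval()
inputs = tokenizer(sentence, truncation=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

# class probabilities and the predicted label (0 or 1)
print(torch.softmax(logits, dim=-1).squeeze().tolist(), int(logits.argmax(dim=-1)))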
In [44]:
from great_ai import save_model
# save Torch model to local disk
model.save_pretrained("pretrained")
# upload model from local disk to S3
# (because the S3 credentials have been already set, `save_model` will use LargeFileS3)
save_model("pretrained", key="scibert-highlights")
Configuration saved in pretrained/config.json
Model weights saved in pretrained/pytorch_model.bin
  adding: pretrained/ (stored 0%)
  adding: pretrained/config.json (deflated 49%)
  adding: pretrained/pytorch_model.bin (deflated 7%)
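It can also be reassuring to check that the serialised weights load back correctly. The sketch below uses only transformers and assumes the `pretrained` directory written above is still on disk; in the next part, the same archive is fetched from S3 via its `scibert-highlights` key instead.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# the weights come from the local directory written by `save_pretrained` above;
# the tokenizer was not saved there, so it is loaded from the Hub again
restored_model = AutoModelForSequenceClassification.from_pretrained("pretrained")
restored_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

print(restored_model.config.num_labels)  # 2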
Next: Part 3