Fine-tune SciBERT¶
We are planning to do a simple classification task on scientific text. For that, SciBERT is an ideal model to fine-tune, since it has been pretrained on academic publications.
This notebook was updated so that it can run in Google Colab.
First, we need to install the dependencies.
In [1]:
!pip install transformers datasets great-ai > /dev/null
Load the training data from S3. (We uploaded it to S3 in the data notebook.)
In [2]:
from great_ai.large_file import LargeFileS3
import json

LargeFileS3.configure_credentials_from_file("config.ini")

with LargeFileS3("summary-train-dataset-small", encoding="utf-8") as f:
    # splitting training and test data is done later by `datasets`
    X, y = json.load(f)
Latest version of summary-train-dataset-small is 0 (from versions: 0) File summary-train-dataset-small-0 found in cache
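Before fine-tuning, it's worth a quick look at what was just loaded. A minimal sanity check, assuming `X` is a list of text snippets and `y` is the parallel list of 0/1 labels produced in the data notebook:

from collections import Counter

# dataset size, label balance, and a peek at the first example
print(len(X), Counter(y))
print(X[0][:200], y[0])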
Fine-tune SciBERT. For more info about this step, check out the HuggingFace documentation. If you're only here for great-ai, feel free to skip the next cell.
In [22]:
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
from pathlib import Path
import numpy as np
from datasets import Dataset, load_metric
MODEL = "allenai/scibert_scivocab_uncased"
BATCH_SIZE = 32
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
def tokenize_function(v):
return tokenizer(v["text"])
dataset = (
Dataset.from_dict({"text": X, "label": y})
.map(lambda v: tokenizer(v["text"], truncation=True), batched=True)
.remove_columns("text")
.train_test_split(test_size=0.2, shuffle=True) # test is actually validation
)
f1_score = load_metric("f1")
def compute_metrics(p):
pred, labels = p
pred = np.argmax(pred, axis=1)
return f1_score.compute(predictions=pred, references=labels)
training_args = TrainingArguments(
output_dir=Path("models"),
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
save_total_limit=5,
num_train_epochs=50,
save_strategy="epoch",
evaluation_strategy="epoch",
logging_strategy="epoch",
weight_decay=0.01,
metric_for_best_model="f1",
load_best_model_at_end=True,
)
result = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
data_collator=data_collator,
compute_metrics=compute_metrics,
callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
).train()
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
DataCollatorWithPadding,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
from pathlib import Path
import numpy as np
from datasets import Dataset, load_metric
MODEL = "allenai/scibert_scivocab_uncased"
BATCH_SIZE = 32
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
def tokenize_function(v):
return tokenizer(v["text"])
dataset = (
Dataset.from_dict({"text": X, "label": y})
.map(lambda v: tokenizer(v["text"], truncation=True), batched=True)
.remove_columns("text")
.train_test_split(test_size=0.2, shuffle=True) # test is actually validation
)
f1_score = load_metric("f1")
def compute_metrics(p):
pred, labels = p
pred = np.argmax(pred, axis=1)
return f1_score.compute(predictions=pred, references=labels)
training_args = TrainingArguments(
output_dir=Path("models"),
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
save_total_limit=5,
num_train_epochs=50,
save_strategy="epoch",
evaluation_strategy="epoch",
logging_strategy="epoch",
weight_decay=0.01,
metric_for_best_model="f1",
load_best_model_at_end=True,
)
result = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
data_collator=data_collator,
compute_metrics=compute_metrics,
callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
).train()
[130/650 01:43 < 07:01, 1.23 it/s, Epoch 10/50]
Epoch | Training Loss | Validation Loss | F1
---|---|---|---
1 | 0.586800 | 0.512138 | 0.719101
2 | 0.411600 | 0.416675 | 0.849057
3 | 0.245600 | 0.417070 | 0.864000
4 | 0.147800 | 0.575878 | 0.852459
5 | 0.056800 | 0.474259 | 0.896552
6 | 0.022500 | 0.754236 | 0.843137
7 | 0.001000 | 0.857636 | 0.834783
8 | 0.000500 | 0.920232 | 0.869565
9 | 0.000300 | 0.970790 | 0.877193
10 | 0.000300 | 0.948689 | 0.862385
... Deleting older checkpoint [models/checkpoint-39] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 100
  Batch size = 32
Saving model checkpoint to models/checkpoint-117
Configuration saved in models/checkpoint-117/config.json
Model weights saved in models/checkpoint-117/pytorch_model.bin
Deleting older checkpoint [models/checkpoint-52] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 100
  Batch size = 32
Saving model checkpoint to models/checkpoint-130
Configuration saved in models/checkpoint-130/config.json
Model weights saved in models/checkpoint-130/pytorch_model.bin
Deleting older checkpoint [models/checkpoint-78] due to args.save_total_limit
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from models/checkpoint-65 (score: 0.896551724137931).
The best F1 score on the validation split is 0.89, which is (not surprisingly) substantially better than what the SVM achieved. We have a great model; it's time to deploy it. But first, we have to store it in a secure place.
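Before storing the model, a quick smoke test helps confirm the loaded best checkpoint behaves sensibly. This is only a sketch: it assumes the `model` and `tokenizer` objects from the training cell are still in memory, and the example sentence is made up.

import torch

# hypothetical input; any scientific-sounding sentence works
sentence = "We propose a transformer-based approach for screening scientific citations."

model.eval()
inputs = tokenizer(sentence, truncation=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

# class probabilities and the predicted label (0 or 1)
print(torch.softmax(logits, dim=-1).squeeze().tolist(), int(logits.argmax(dim=-1)))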
In [44]:
from great_ai import save_model
# save Torch model to local disk
model.save_pretrained("pretrained")
# upload model from local disk to S3
# (because the S3 credentials have been already set, `save_model` will use LargeFileS3)
save_model("pretrained", key="scibert-highlights")
Configuration saved in pretrained/config.json
Model weights saved in pretrained/pytorch_model.bin
  adding: pretrained/ (stored 0%)
  adding: pretrained/config.json (deflated 49%)
  adding: pretrained/pytorch_model.bin (deflated 7%)
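It can also be reassuring to check that the serialised weights load back correctly. The sketch below uses only transformers and assumes the `pretrained` directory written above is still on disk; in the next part, the same archive is fetched from S3 via its `scibert-highlights` key instead.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# the weights come from the local directory written by `save_pretrained` above;
# the tokenizer was not saved there, so it is loaded from the Hub again
restored_model = AutoModelForSequenceClassification.from_pretrained("pretrained")
restored_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

print(restored_model.config.num_labels)  # 2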
Next: Part 3