CausalLM Part 2: Fine-Tuning a Model
3 ways to fine-tune a CausalLM model on chat data
In the last post, we talked about what CausalLM is and how Hugging Face expects data to be formatted. In this post, we'll walk through an abridged notebook showing three ways to format the data and fine-tune a model. The first is a straightforward approach that builds on the intuition from the previous post: simply copying input_ids into labels. The second uses masking so that the model learns from only select parts of the text. The third uses a separate library, TRL, so that we don't have to mask the data manually.
I’ll leave out some function definitions to keep it readable, so it’s best to reference the full notebook to get all the code.
Fine-tuning with labels copied from input ids
We’re going to be using Bloom-560m, a multilingual model which is small enough that we can fine-tune it on a standard laptop.
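For reference, here is roughly the set of imports the snippets below assume (the exact setup lives in the full notebook):
from typing import Dict

import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerBase,
    Trainer,
    TrainingArguments,
)
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer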
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True, padding_side="right"
)  # padding side should be right for CausalLM models
# overfit to 5 made up examples
str1 = '\n\n### Human: How do you say "dog" in Spanish?\n\n### Assistant: perro'
str2 = '\n\n### Human: How do you say "water" in Spanish?\n\n### Assistant: agua'
str3 = '\n\n### Human: How do you say "hello" in Spanish?\n\n### Assistant: hola'
str4 = '\n\n### Human: How do you say "tree" in Spanish?\n\n### Assistant: árbol'
str5 = '\n\n### Human: How do you say "mother" in Spanish?\n\n### Assistant: madre'
train_data = {
    "text": [str1, str2, str3, str4, str5],
}
dataset_text = Dataset.from_dict(train_data)
# to test if we learn how to generate an unknown word.
holdout_str = (
    '\n\n### Human: How do you say "day" in Spanish?\n\n### Assistant:<s>'  # día
)
device = "cuda" if torch.cuda.is_available() else "cpu"
holdout_input = tokenizer(holdout_str, return_tensors="pt").to(device)
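The helper functions load_model, sample_generate, and print_iterative_generate are among the definitions left to the full notebook. As a rough idea of what the first two might look like (a minimal sketch, not the notebook's exact code):
def load_model(model_name: str):
    # load a fresh copy of the pretrained model and move it to the device
    return AutoModelForCausalLM.from_pretrained(model_name).to(device)

def sample_generate(model, tokenizer, inputs, max_new_tokens: int = 5) -> str:
    # generate a few new tokens for the prompt and decode only the new ones
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])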
Let’s start by doing some preprocessing. We’re going to add some special tokens, namely “end of sequence” (eos) and “beginning of sequence” (bos). These special tokens can be helpful for the model to know when it’s supposed to start and stop generating text.
INSTRUCTION_TEMPLATE_BASE = "\n\n### Human:"
RESPONSE_TEMPLATE_BASE = "\n\n### Assistant:"

def add_special_tokens(
    example: Dict,
    tokenizer: PreTrainedTokenizerBase,
) -> Dict:
    # add eos_token before human text and bos_token before assistant text
    example["text"] = (
        example["text"]
        .replace(
            INSTRUCTION_TEMPLATE_BASE, tokenizer.eos_token + INSTRUCTION_TEMPLATE_BASE
        )
        .replace(RESPONSE_TEMPLATE_BASE, RESPONSE_TEMPLATE_BASE + tokenizer.bos_token)
    )
    if not example["text"].endswith(tokenizer.eos_token):
        example["text"] += tokenizer.eos_token
    # remove the leading eos token added before the first human turn
    while example["text"].startswith(tokenizer.eos_token):
        example["text"] = example["text"][len(tokenizer.eos_token):]
    return example
dataset_text = dataset_text.map(lambda x: add_special_tokens(x, tokenizer))
print(f"{dataset_text=}")
print(f"{dataset_text[0]=}")
>>> dataset_text=Dataset({
features: ['text'],
num_rows: 5
})
>>> dataset_text[0]={'text': '\n\n### Human: How do you say "dog" in Spanish?\n\n### Assistant:<s> perro</s>'}
Now, we’re going to do what we learned in the last post: create an input with a labels key copied from input_ids.
# tokenize the text
dataset = dataset_text.map(
    lambda example: tokenizer(example["text"]), batched=True, remove_columns=["text"]
)
# copy the input_ids to labels
dataset = dataset.map(lambda x: {"labels": x["input_ids"]}, batched=True)
print(f"{dataset=}")
print(f"{dataset[0]['input_ids']=}")
print(f"{dataset[0]['labels']=}")
>>> dataset=Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 5
})
>>> dataset[0]['input_ids']=[603, 105311, 22256, 29, 7535, 727, 1152, 5894, 20587, 744, 5, 361, 49063, 7076, 105311, 143005, 29, 1, 82208, 2]
>>> dataset[0]['labels']=[603, 105311, 22256, 29, 7535, 727, 1152, 5894, 20587, 744, 5, 361, 49063, 7076, 105311, 143005, 29, 1, 82208, 2]
To start, labels and input_ids are identical. Let’s see what happens when we train a model like that.
# training code inspired by
#https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html
model = load_model(model_name)
output_dir = "./results"
# How many times to iterate over the entire dataset
num_train_epochs = 15
# We're not aligning the sequence length (ie padding or truncating)
# so batch training won't work for our toy example.
per_device_train_batch_size = 1
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    seed=1,
)
trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=training_arguments,
)
training1 = trainer.train()
# Sample generate prediction on the holdout prompt:
# '\n\n### Human: How do you say "good" in Spanish?\n\n### Assistant:'
# the correct output is "bueno</s>"
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> '</s>'
After 15 epochs, the model is still kind of confused: it outputs '</s>', which is close, but we really want it to output "bueno</s>". Let’s train for another 15 epochs.
trainer.train()
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> bueno </s>
After 30 epochs we learned what we were supposed to!
Let’s simulate what happens in training by iteratively predicting the prompt one token at a time, based on the previous tokens.
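print_iterative_generate is another helper from the full notebook. Roughly, it takes the model's most likely token at every position of the prompt, given the tokens before it; something like this sketch (a hypothetical implementation):
def print_iterative_generate(model, tokenizer, inputs) -> None:
    # one forward pass gives, at each position, the model's prediction for
    # the next token given all the tokens before it -- what training "sees"
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = logits.argmax(dim=-1)[0]
    print(tokenizer.decode(predicted_ids))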
print_iterative_generate(model, tokenizer, holdout_input)
>>>
#
: How do you say "how morning in Spanish?
### Assistant: gu buenopu
That’s pretty close to the actual prompt, as we expected. But the task is translation, so we don’t really care about being able to predict the user prompt. Is there a way to learn just the response part?
Masked approach
Hugging Face allows you to learn to predict only certain tokens by “masking” the tokens you don’t care about in labels. This is different from the attention mask, which controls which input tokens the model can attend to when producing a new token (hiding padding, for example). Masking the labels hides the true token at a given position from the loss function. Note the wording: Hugging Face implements this such that, during training, we still generate a prediction for the masked position; however, because the true label it would be compared against is hidden, we don’t directly learn how to improve that prediction.
We create the “mask” by flipping those tokens to -100 in the labels key.
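Under the hood this works because PyTorch's cross-entropy loss skips any target equal to -100 (its default ignore_index), so masked positions contribute nothing to the loss. A toy illustration:
import torch.nn.functional as F

fake_logits = torch.randn(4, tokenizer.vocab_size)  # predictions for 4 positions
toy_labels = torch.tensor([-100, -100, 82208, 2])  # only the last two positions count
loss = F.cross_entropy(fake_logits, toy_labels)  # ignore_index defaults to -100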
def create_special_mask(example: Dict) -> Dict:
    """Mask human text and keep assistant text as it is.

    Args:
        example (Dict): Result of tokenizing some text

    Returns:
        Dict: The dict with the labels masked
    """
    # setting a token to -100 is how we "mask" a token
    # and tell the model to ignore it when calculating the loss
    mask_token_id = -100
    # assume we always start with human text
    human_text = True
    for idx, tok_id in enumerate(example["labels"]):
        if human_text:
            # mask all human text up until and including the bos token
            example["labels"][idx] = mask_token_id
            if tok_id == tokenizer.bos_token_id:
                human_text = False
        elif not human_text and tok_id == tokenizer.eos_token_id:
            # don't mask the eos token, but the next token will be human text to mask
            human_text = True
        elif not human_text:
            # leave example["labels"] as it is for assistant text
            continue
    return example
dataset_masked = dataset.map(create_special_mask)
# convert dataset from lists to torch tensors
dataset_masked.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print(f"{dataset_masked[0]["labels"]=}")
>>> dataset[0]["labels"]=tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 82208, 2])
model = load_model(model_name)
trainer = Trainer(
    model=model,
    train_dataset=dataset_masked,
    args=training_arguments,
)
training2 = trainer.train()
print(f"{training2.metrics['train_runtime']=}")
print(f"{training1.metrics['train_runtime'] =}")
print(
    f"{100 * round((training1.metrics['train_runtime'] - training2.metrics['train_runtime']) / training1.metrics['train_runtime'], 2)}%"
)
>>> training2.metrics['train_runtime']=61.7164
>>> training1.metrics['train_runtime'] =70.8013
>>> 13.0%
First off, we were faster this time by more than 10%. Presumably, the fact that we have fewer loss calculations makes things a bit quicker.
I wouldn’t bank on the speedup being this large; our toy example is pretty lopsided, with much more human text than generated text. But when training times run into hours, every percentage point helps.
The big question: did we learn the task?
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> bueno </s>
This time we only needed 15 epochs to learn the task. Let’s look again at what happens under the hood during training:
print_iterative_generate(model, tokenizer, holdout_input)
>>>#include
code
to I get "we" in English?
A: Spanish: How bueno
Iteratively predicting the prompt now produces nonsense compared with our first training approach. This checks out: we masked the prompt during training, so the model never learns to predict anything before our real target, the assistant response.
Using TRL’s supervised fine-tuning trainer
Hugging Face fairly recently rolled out TRL (Transformer Reinforcement Learning), a library that adds end-to-end support for the LLM training process. One of its features is supervised fine-tuning. Using the DataCollatorForCompletionOnlyLM and SFTTrainer classes, we can build the labels the same way create_special_mask did with just a few configuration options.
model = load_model(model_name)
# a Hugging Face collator that builds the masked labels for you:
# everything from each instruction template through the response template is
# masked, so the loss is only computed on the assistant completions
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=tokenizer.eos_token,
    response_template=tokenizer.bos_token,
    tokenizer=tokenizer,
)
trainersft = SFTTrainer(
    model,
    train_dataset=dataset_text,
    dataset_text_field="text",
    data_collator=collator,
    args=training_arguments,
    tokenizer=tokenizer,
)
sftrain = trainersft.train()
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> ' perro</s>'
Success! If you dig deeper, training actually took longer with SFTTrainer. This is likely because we now tokenize at training time rather than as a preprocessing step, as we did in the masked approach. However, this approach gives us batching essentially for free (with the masked approach, you’d need to tweak the tokenization and padding to batch properly), which should make things faster in the long run.
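For instance, because the collator pads every batch on the fly, batching with SFTTrainer should only require raising the batch size. A sketch, assuming the tokenizer has a pad token (Bloom’s does):
batched_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=5,  # all five toy examples in one padded batch
    seed=1,
)
trainer_batched = SFTTrainer(
    model,
    train_dataset=dataset_text,
    dataset_text_field="text",
    data_collator=collator,
    args=batched_args,
    tokenizer=tokenizer,
)
trainer_batched.train()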
The full notebook explores a few other things like training off multi-turn chats and using special_tokens to indicate human vs chat text.
Obviously, this example is a bit basic. However, hopefully you can start to see the power of using CausalLM: You can imagine taking interactions from a large, reliable model, and using the techniques above to fine-tune a smaller model on the large model’s outputs. This is called knowledge distillation.
If we’ve learned anything over the last couple years of LLMs, it’s that we can do some surprisingly intelligent things just by training on next token prediction. Causal language models are designed to do just that. Even if the Hugging Face class is a bit confusing at first, once you’re used to it, you have a very powerful interface to train your own generative models.