A guide to iterative fine-tuning and serialisation
So, you have recently discovered Hugging Face and its host of open source models like BERT, Llama and BART, along with generative language models from Mistral AI, Facebook, Salesforce and other companies. Now you want to experiment with fine-tuning some Large Language Models for your side projects. Things start off great, but then you discover how computationally greedy they are, and you do not have a GPU handy.
Google Colab generously offers you free computation so you can solve this problem. The downside is that you need to do it all inside a transitory, browser-based environment. To make matters worse, the whole thing is time limited, so it seems that no matter what you do, you are going to lose your precious fine-tuned model and all your results when the kernel is eventually shut down and the environment destroyed.
Never fear. There is a way around this: make use of Google Drive to save your intermediate results and model parameters. This will allow you to continue experimentation at a later stage, or take a trained model and use it for inference elsewhere.
To do this you will need a Google account with sufficient Google Drive space for both your training data and your model checkpoints. I will presume you have created a folder called data in Google Drive containing your dataset, and another empty folder called checkpoints.
Inside your Google Colab Notebook you then mount your Drive using the following command:
from google.colab import drive
drive.mount('/content/drive')
You now list the contents of your data and checkpoints directories with the following two commands in a new cell:
!ls /content/drive/MyDrive/data
!ls /content/drive/MyDrive/checkpoints
If these commands work, you now have access to these directories inside your notebook. If they do not, you might have missed the authorisation step. The drive.mount command above should have spawned a pop-up window that requires you to click through and authorise access. You may have missed the pop-up, or not selected all of the required access rights. Try re-running the cell and checking.
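If re-running the cell still does not sort things out, you can force a clean remount so the authorisation flow is triggered again. A minimal example, using the force_remount flag of the Colab drive API:
from google.colab import drive

# Unmount any stale session and show the authorisation pop-up again
drive.mount('/content/drive', force_remount=True)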
Once you have that access sorted, you can then write your scripts such that models and results are serialised into the Google Drive directories so they persist over sessions. In an ideal world, you would code your training job so that any script that takes too long to run can load partially trained models from the previous session and continue training from that point.
A simple way to achieve this is to create save and load functions that your training scripts use. The training process should always check whether a partially trained model exists before initialising a new one. Here is an example save function:
import os
import torch

def save_checkpoint(epoch, model, optimizer, scheduler, loss, model_name, overwrite=True):
    # Bundle the model state and training metadata into a single dictionary
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'loss': loss
    }
    direc = get_checkpoint_dir(model_name)
    if overwrite:
        # Keep a single rolling checkpoint file
        file_path = direc + '/checkpoint.pth'
    else:
        # Keep one checkpoint file per epoch
        file_path = direc + '/epoch_' + str(epoch) + '_checkpoint.pth'
    if not os.path.isdir(direc):
        try:
            os.mkdir(direc)
        except OSError:
            print("Error: directory does not exist and cannot be created")
            # Fall back to saving next to the intended directory
            file_path = direc + '_epoch_' + str(epoch) + '_checkpoint.pth'
    torch.save(checkpoint, file_path)
    print(f"Checkpoint saved at epoch {epoch}")
In this instance we are saving the model state along with some metadata (the epoch and loss) inside a dictionary structure. We include an option either to overwrite a single checkpoint file or to create a new file for every epoch. We are using the torch save function, but in principle you could use other serialisation methods. The key idea is that your program can open an existing checkpoint and determine how many epochs of training it represents, which allows it to decide whether to continue training or move on.
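The helper functions get_checkpoint_dir and get_path_with_max_epochs (used in the load function below) are not shown in this post. A minimal sketch, assuming checkpoints live under the checkpoints folder mounted from Drive earlier and follow the epoch_<n>_checkpoint.pth naming used in the save function, might look like this:
import os
import re

# Assumed base directory: the Drive folder mounted earlier in the notebook
CHECKPOINT_BASE = '/content/drive/MyDrive/checkpoints'

def get_checkpoint_dir(model_name):
    # One sub-directory per model / experiment
    return os.path.join(CHECKPOINT_BASE, model_name)

def get_path_with_max_epochs(direc):
    # Pick the per-epoch checkpoint with the highest epoch number,
    # falling back to the single rolling checkpoint.pth file
    best_epoch, best_path = -1, None
    for name in os.listdir(direc):
        match = re.match(r'epoch_(\d+)_checkpoint\.pth$', name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), os.path.join(direc, name)
    if best_path is None and os.path.isfile(os.path.join(direc, 'checkpoint.pth')):
        best_path = os.path.join(direc, 'checkpoint.pth')
    return best_path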
Similarly, in the load function we pass in a reference to a model we wish to use. If there is already a serialised model we load the parameters into our model and return the number of epochs it was trained for. This epoch value will determine how many additional epochs are required. If there is no model then we get the default value of zero epochs and we know the model still has the parameters it was initialised with.
def load_checkpoint(model_name, model, optimizer, scheduler):
    direc = get_checkpoint_dir(model_name)
    if os.path.exists(direc):
        # Find the most recent checkpoint and restore all training state
        file_path = get_path_with_max_epochs(direc)
        checkpoint = torch.load(file_path, map_location=torch.device('cpu'))
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        epoch = checkpoint['epoch']
        loss = checkpoint['loss']
        print(f"Checkpoint loaded from epoch {epoch}")
        return epoch, loss
    else:
        print("No checkpoint found, starting from epoch 1.")
        return 0, None
These two functions will need to be called inside your training loop, and you need to ensure that the returned epoch value is used to set the starting epoch of your training iterations. The result is a training process that can be restarted when a kernel dies and will pick up and continue from where it left off.
That core training loop might look something like the following:
EPOCHS = 10

for exp in experiments:
    model, optimizer, scheduler = initialise_model_components(exp)
    train_loader, val_loader = generate_data_loaders(exp)
    # Resume from the last saved checkpoint for this experiment, if one exists
    start_epoch, prev_loss = load_checkpoint(exp, model, optimizer, scheduler)
    for epoch in range(start_epoch, EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        # ALL YOUR TRAINING CODE HERE
        save_checkpoint(epoch + 1, model, optimizer, scheduler, train_loss, exp)
Note: In this example I am experimenting with training multiple different model setups (in a list called experiments), potentially using different training datasets. The supporting functions initialise_model_components and generate_data_loaders take care of ensuring that I get the correct model and data for each experiment.
The core training loop above allows us to reuse the overall code structure that trains and serialises these models, ensuring that each model is trained for the desired number of epochs. If we restart the process, it will iterate through the experiment list again, but it will skip any experiments that have already reached the maximum number of epochs.
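Neither initialise_model_components nor generate_data_loaders is shown in this post, and their details depend entirely on your models and data. A rough sketch, assuming each experiment is simply a Hugging Face model identifier and using random placeholder tensors where your real data loading from the Drive data folder would go, might look like this:
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification

def initialise_model_components(exp):
    # exp is assumed to be a Hugging Face model identifier, e.g. 'bert-base-uncased'
    model = AutoModelForSequenceClassification.from_pretrained(exp, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=2e-5)
    scheduler = StepLR(optimizer, step_size=1, gamma=0.9)
    return model, optimizer, scheduler

def generate_data_loaders(exp):
    # Placeholder tensors stand in for the tokenised dataset you would load
    # from /content/drive/MyDrive/data for this particular experiment
    train_ds = TensorDataset(torch.randint(0, 1000, (64, 32)), torch.randint(0, 2, (64,)))
    val_ds = TensorDataset(torch.randint(0, 1000, (16, 32)), torch.randint(0, 2, (16,)))
    return DataLoader(train_ds, batch_size=16, shuffle=True), DataLoader(val_ds, batch_size=16)
If your experiment identifiers contain characters such as '/', you may want to sanitise them before reusing them as checkpoint directory names.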
Hopefully you can use this boilerplate code to set up your own process for experimenting with training deep learning language models inside Google Colab. Please comment and let me know what you are building and how you use this code.
Massive thank you to Aditya Pramar for his initial scripts that prompted this piece of work.