Deliberately Exploring Design Decisions for Parameter Efficient Finetuning (PEFT) with LoRA
Good news: Using LoRA for Parameter Efficient Finetuning (PEFT) can be straightforward. With a simple strategy of adapting all linear modules and some light tuning of the learning rate, you can achieve good performance. You could stop reading here!
But what if you want more? What if you’re seeking a deeper understanding of which modules to tune, and how to optimize your model for performance, GPU memory utilization or training speed? If you’re looking for a more nuanced understanding and control over these aspects, then you’re in the right place.
Join me on this journey as we navigate the winding road to parameter efficiency. We’ll delve into the deliberate design decisions that can help you to get the most out of LoRA while offering you more control and a better understanding of your model’s performance. Let’s embark on this exciting exploration together.
You will get the most out of this article if you already have at least a basic understanding of LoRA, like what we covered in the previous article. Furthermore, we are optimizing a RoBERTa model [1], which uses the transformer architecture. A general understanding of its basic components helps, but is not absolutely necessary to follow the main subject.
In the previous article, we explored how to apply LoRA to train adapters that only require a fraction of the parameters needed for a full finetuning. We also saw what such an implementation might look like in code. However, our focus was primarily on the mechanical aspects. We did not address which modules to adapt, nor how to size the adapters for efficiency and performance.
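As a quick reminder of those mechanics, here is a minimal sketch of a LoRA adapter wrapping a frozen linear module. The class name, the default rank, and the scaling convention are illustrative choices of mine, not the exact code from the previous article.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: a frozen nn.Linear plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # original weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # B: r -> d_out
        nn.init.zeros_(self.lora_b.weight)            # B = 0, so the adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling
```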
Today, this is our focus.
We zoom out and recognize that there are a lot of algorithmic design decisions that we have to make, many of which influence each other. These are often expressed as hyperparameters by the original algorithm creators. To handle the sheer number of possible combinations of hyperparameters and their values, we’ll use a systematic approach to learn about the relative impact of these design decisions. Our aim is not only to eventually achieve good performance for our model at hand, but also to run experiments that gather empirical feedback and strengthen our intuitive understanding of the model and its design. This will not only serve us well for today’s model, task, and dataset, but much of what we learn will be transferable. It will give us greater confidence moving forward as we work on variations of the model, new tasks, and datasets in the future.
Execution of Experiments:
I will be using Amazon SageMaker Automatic Model Tuning (AMT) to run the experiments throughout this article. With AMT I will either deliberately explore and analyze the search space, or, automatically find a good combination of hyperparameter values.
As a side note, ‘tuning’ is a term that serves two purposes in this article. On one hand, we use ‘hyperparameter tuning’ to refer to the adjustment of hyperparameter values in model training, a process automated by SageMaker’s Automatic Model Tuning. On the other hand, we use ‘tuning’ to describe the process of starting with a pre-trained model and then finetuning its parameters (not the hyperparameters) for our specific downstream task.
To maintain focus, I will keep the implementation details in this article brief. However, you will find all the experiments with all their details in the linked notebooks.
I also encourage you to learn more background about using AMT, the differences between the search strategies Random Search and Bayesian Optimization, the concept of warm starting tuning jobs, and about visualizing and analyzing the results. All of which are discussed in this article:
Baselines: What to compare to?
We will focus on architectural decisions:
- Which modules should we adapt?
- On what layers? All of them? Some? Just the middle layers?
- How large should the module adapters be? What should r, the rank of the LoRA matrices, be?
However, before we start experimenting, how can we ensure that we are on the right track and that our changes have a positive impact? Let’s define some baselines to compare our progress to.
If finding baselines for comparison does not appeal to you, feel free to skip ahead to the next section “What to tune?”.
Over time, we hope to observe that our training runs are producing better results. But when are we done and can stop experimenting?
Seeing no further improvements after a while could indicate that we have achieved the optimum. However, it could also mean that we have run out of ideas to try, even though further improvement was possible.
Performance Expectations and Reproducibility
In order to interpret the results of our experiments, we need to establish clear performance expectations for our model. This includes an understanding of the ideal performance as an upper bound, as well as the minimum performance we expect to see.
Deep learning is inherently noisy, meaning that no two runs will produce the exact same result. This raises important questions about the results we observe. Is the performance we’re seeing reproducible using the hyperparameter values we tested with, or did we just get lucky with this particular run? To answer these questions, we need to validate a set of hyperparameter values that we’ve found to perform well. In this article I’ll do this by running the same hyperparameter values five times to calculate the mean performance and its variance.
Expected performance — Full Finetuning: In our case reasoning about the expected performance is easy. We are finetuning a sentiment analysis task on the sst-2 dataset using the RoBERTa base model, as was done in the RoBERTa paper [1].
Therefore, we can directly use the numbers reported by the authors as a sanity check. We will align our setup and the hyperparameters used with those in the paper.
We still run the training ourselves, so that we have a verified setup and training procedure before we apply LoRA to it. Consequently, we can perform a sanity check to ensure that the numbers we observe roughly match those from the paper. If we cannot match the numbers, we would need to check our setup.
The RoBERTa paper [1] reported an accuracy of 94.8 in Table 8. This serves as our benchmark for expected performance during full finetuning. After checking that we are in the ballpark of that number, we will use our own setup and results as the baseline for all the following experiments, which are derived from our setup.
Expected performance — LoRA Finetuning: This is easy as well. The promise of LoRA is to almost match the full finetuning performance, but with only a fraction of the parameters of a full finetuning.
Hence, we will compare to our results from the full finetuning performance as described in the preceding section.
Expected minimum performance: One possible baseline would be random performance. For our task with two classes that would be 0.5. But we are not building a model from scratch and from the papers we already know that the LoRA approach is working very well, so random performance would not be an informative baseline.
Instead, let’s use a baseline where we only train the classifier and keep the embeddings and transformer layers frozen, in the state they were in after pre-training. This should result in a much lower performance than a full finetuning, but still much better than random. Importantly, it should also serve as a comparison point to reason about non-functional aspects like parameter efficiency, memory usage, and training throughput.
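To make this baseline concrete, here is a minimal sketch of freezing everything except the classification head, assuming the Hugging Face transformers implementation of roberta-base, where the head’s parameters are prefixed with "classifier":

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze everything that was pre-trained; only the randomly initialized
# classification head remains trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```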
All scenarios above have been run five times, and the mean performance is shown in the diagram. You can also deduce that we are in the ballpark of the performance from the RoBERTa paper with the scenario “Full Finetuning”. As we hoped for, “LoRA Base” (adapting all linear modules) matches that performance, but uses fewer parameters. The scenario “Classifier Only” performs much worse, as expected, but is cheaper in terms of parameters and trains faster.
Moving forward, we will now take our numbers as baselines to compare future experiments to.
You can find more details in the accompanying notebook.
Execution of Experiments:
First, for each baseline, we search for an optimal learning rate parameter value. We use Bayesian Optimization to efficiently explore and then exploit the search space.
Second, the best hyperparameter values we found for a scenario may or may not reproduce good results. It could be that the hyperparameter values we identified are only the best relative to the other values we explored. Maybe the values we found were not relevant at all, e.g., the model was not sensitive in this value range. To estimate how well the findings hold up, for each scenario, we run the best combination of hyperparameter values again five times and report the observed standard deviation on the objective metric.
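For illustration, a learning-rate search with AMT’s Bayesian strategy could be set up roughly as sketched below. The estimator settings, container versions, metric regex, and search ranges are placeholders of mine, not the exact values from the linked notebooks.

```python
from sagemaker.huggingface import HuggingFace
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Placeholder training job definition; train.py is assumed to load sst2 itself.
estimator = HuggingFace(
    entry_point="train.py",
    role="<your-sagemaker-execution-role>",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"model_name": "roberta-base", "task": "sst2"},
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "eval_accuracy = ([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-6, 1e-3, scaling_type="Logarithmic"),
        "epochs": IntegerParameter(2, 10),
    },
    strategy="Bayesian",   # explore first, then exploit promising regions
    max_jobs=30,
    max_parallel_jobs=3,
)
tuner.fit()
```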
LoRA Base Scenario — First Result: It’s encouraging to see that the LoRA finetuning approach, scenario “LoRA Base”, is already performing on par with “Full Finetuning”, despite using just ~1% of the parameters. Furthermore, in this approach we are adapting all linear modules with the same adapter size (r=8). This is a simple starting point that apparently already produces good performance.
Secondary Hyperparameters: As a point of note, we primarily search for good values for the hyperparameter r and the modules we want to adapt. To keep things simple, we only tune very few additional hyperparameters. For the baselines it is just the learning rate and the number of epochs. We use Bayesian Optimization as the search strategy using Amazon SageMaker Automatic Model Tuning (AMT).
We follow guidance from the referenced papers on setting other hyperparameters, such as weight decay and dropout. We keep those hyperparameters fixed throughout the article, so that we can isolate the impact of the hyperparameters that define the LoRA architecture, making it easier to see how our main hyperparameters influence performance.
Do you, dear reader, plan to repeat the steps from this article? Are you aiming to find the best hyperparameters for your own model, task, and dataset that you intend to use in production? If so, it would make sense to also include the secondary hyperparameters. Ideally, you should do this towards the end of your exploration and tuning effort — when you have already significantly narrowed the search scope — and then aim to further improve performance, even if just slightly.
Hyperparameters: What to tune?
Let’s get started with our main activity.
The design decisions left for us in the model architecture are typically expressed as hyperparameters. For LoRA specifically, we can define which modules to adapt and how large r should be for each module’s adapter.
In the last article we only suggested selecting these modules based on our understanding of the task and the architecture.
Now, we’ll dive deeper. Where should we apply finetuning at all?
In the illustration above, you can see all the potential modules that we could finetune–including the classifier and the embeddings–on the left. On the right, I’ve made a sample selection for the illustration. But how do we arrive at an actual selection?
Let’s look at our options from a high level:
- Classifier
It is clear that we absolutely need to train the classifier. This is because it has not been trained during pre-training and, hence, for our finetuning, it is randomly initialized. Furthermore, its central position makes it highly impactful on the model performance, as all information must flow through it. It also has the most immediate impact on the loss calculation as it starts at the classifier. Lastly, it has few parameters, therefore, it is efficient to train.
In conclusion, we always finetune the classifier, but do not adapt it (with LoRA).
- Embeddings
The embeddings reside at the bottom–close to the inputs–and carry the semantic meaning of the tokens. This is important for our downstream task. However, they are not “empty”. Even without finetuning, we get everything that was learned during pre-training. At this point, we are considering whether finetuning the embeddings directly would give us additional abilities, and whether our downstream task would benefit from a refined understanding of the token meanings.
Let’s reflect. If this were the case, could this additional knowledge not also be learned in one of the layers above the embeddings, perhaps even more efficiently?
Finally, the embeddings typically have lots of parameters, so we would have to adapt them before finetuning.
Taking both aspects together, we decided to pass on this option and not make the embeddings trainable (and consequently not apply LoRA to them).
- Transformer Layers
Finetuning all parameters in the transformer layers would be inefficient. Therefore, we need to at least adapt them with LoRA to become parameter-efficient. This leads us to consider whether we should train all layers, and all components within each layer? Or should we train some layers, some components, or specific combinations of both?
There is no general answer here. We’ll adapt these layers and their modules and explore the details further in this article.
In the illustration above, on the right, you can see an exemplary selection of modules to finetune. This is just one combination, but many other combinations are possible. Keep in mind as well that the illustration only shows five layers, while your model likely has more. For instance, the RoBERTa base model–used in our example–has 12 layers, a number that is considered small by today’s standards. Each layer also has 6 components:
- Attention: Query, Key, Value, Output
- Feed Forward: Up, Down
Even if we disregard that we also want to tune r and — for now — just focus on the binary decision of which modules to include, this leaves us with 64 (2**6) combinations per layer. Given that this only covers the combinations within one layer, and that we have 12 layers that can be combined independently, we end up with more than a sextillion combinations:
In [1]: (2**6)**12.
Out[1]: 4.722366482869645e+21
It’s easy to see that we can’t exhaustively compute all combinations, let alone explore the space manually.
Typically in computer science, we turn to the dice when we want to explore a space that is too large to investigate fully. We could sample from that space, but how would we interpret the results? We would get back a number of arbitrary combinations of layers and components (at least 12*6=72, following the small example above). How would we generalize from these details to find higher-level rules that align with our natural understanding of the problem space? We need to align these details with our conceptual understanding on a more abstract level.
Hence, we need to consider groups of modules and look for structures or patterns that we can use in our experiments, rather than operating on a collection of individual components or layers. We need to develop an intuition about how things should work, and then formulate and test hypotheses.
Question: Does it help to experiment on defined groups of parameters in isolation? The answer is yes. These isolated groups of parameters can lead the way even though we may need to combine some of them later to achieve the best results. Testing in isolation allows us to see patterns of impact more clearly.
However, there is a risk. When these patterns are used in combination, their impact may change. That’s not perfect, but let’s not be so negative about it 🙂 We need to start somewhere, and then refine our approach if needed.
Ready? Let’s try this out.
Tuning Vertically / Layer-wise
I suspect that the upper layers, closer to the classification head, will be more impactful than the lower layers. Here is my thinking: Our task is sentiment analysis. It would make sense, wouldn’t it, that most of the specific decisions have to be made either in the classification head or close to it? Like recognizing certain phrases (“I needed that like a hole in my head”) or composed constructs (“The check-in experience negated the otherwise wonderful service”). This would suggest that it is crucial to finetune the parameters of our network that define how different tokens are used together–in context–to create a sentiment as opposed to changing the meaning of words (in the embeddings) compared to their meaning during the pre-training.
Even if that’s not always the case, adapting the upper layers still provides the opportunity to override or refine decisions from the lower layers and the embeddings. On the other hand, this suggests that finetuning the lower layers is less important.
That sounds like a solid hypothesis to try out (Oops. Message from future Mariano: Don’t stop reading here).
As an aside, we are not reflecting on the general necessity of the embeddings or any of the transformer layers. That decision has already been made: all of them were part of the pre-training and will be part of our finetuned model. What we’re considering at this point is how we can best help the model learn about our downstream task, which is sentiment analysis. The question we’re asking is: which weights should we finetune for impact and to achieve parameter efficiency?
Let’s put this to the test.
To clearly see the effect of our hypothesis, what do we test it against? Let’s design experiments that should exaggerate the effect:
- In our first experiment we finetune and adapt all components of the upper half of the model, namely layers 7–12 in our example. This is our hypothesis.
- In contrast, we run another experiment where we only finetune the layers in the lower half of the model. Specifically, we train layers 1–6 with all components. That’s the opposite of our hypothesis.
- Let’s consider another contrastive hypothesis as well: that a light touch to all layers is more beneficial than just tuning the top layers. So, let’s also include a third scenario where we finetune half of the layers but spread them out evenly.
- Let’s also include an experiment where we tune all layers (not depicted in the illustration above). This is not a fair performance comparison as we train twice as many parameters as in the first three experiments. However, for that reason, it highlights how much performance we potentially lose in the previous scenarios where we were tuning only half the number of parameters.
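Expressed with the peft library (used here purely for illustration; the linked notebooks may implement the adapters differently), the four layer selections could look roughly like the sketch below. The alpha value and the module-name suffixes are assumptions based on the Hugging Face roberta-base implementation.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

scenarios = {
    "lower": list(range(0, 6)),      # layers 1-6 (0-indexed 0-5)
    "upper": list(range(6, 12)),     # layers 7-12
    "even":  list(range(0, 12, 2)),  # every other layer
    "all":   list(range(0, 12)),
}

def build_model(scenario: str, r: int = 8):
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
    config = LoraConfig(
        r=r,
        lora_alpha=16,
        # Suffix matching: "output.dense" covers both the attention output and
        # the feed-forward down projection, so all six components are adapted.
        target_modules=["query", "key", "value", "output.dense", "intermediate.dense"],
        layers_to_transform=scenarios[scenario],  # restrict LoRA to these layer indices
        layers_pattern="layer",                   # matches roberta's encoder.layer.{i} naming
        modules_to_save=["classifier"],           # keep the head fully trainable
        task_type="SEQ_CLS",
    )
    return get_peft_model(model, config)

build_model("upper").print_trainable_parameters()
```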
In summary, we have 3+1 scenarios that we want to run as experiments. Here are the results:
Execution of Experiments:
We start by using the already tuned learning rate and number of epochs. Then, we run trials (training runs) with different values for the scenario settings, such as lower, upper, even, all. Within AMT, we run these experiments as a Grid Search.
Question: Grid Search is known to be simple, but inefficient in finding the best solution. So why are we using it?
Let’s take a step back. If we were to run a few trials with Bayesian Search, we’d quickly learn about hyperparameter values that are performing well. This would bias the subsequent trials to focus on these values, i.e., predominantly stay closer to known good values. While increasingly exploiting what we learn about the search space is a good strategy to find the best values, this bias makes it difficult to understand the explored space, as we under-sample in areas that showed low performance early on.
With Grid Search, we can precisely define which parameter values to explore, making the results easier to interpret.
In fact, if you were to look at the provided code, you’d see that AMT would reject sampling the same values more than once. But repetition is exactly what we want, hence we introduce a dummy variable with values from 0 to the number of trials we want to conduct. This allows us to repeat trials with the same hyperparameter values and estimate the standard deviation of each combination.
While we used 5 trials each for the already tuned baseline scenarios above to see how well we can reproduce a chosen combination of hyperparameter values, here we use 7 trials per combination to get a slightly more precise estimate of each combination’s variance and to resolve tiny differences.
The same principles are applied to the following two scenarios in this article and will not be mentioned again henceforth.
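As a sketch of that dummy-variable trick with AMT’s Grid Search (the parameter names layer_scenario and dummy are mine, and Grid Search in AMT only accepts categorical parameters):

```python
from sagemaker.tuner import HyperparameterTuner, CategoricalParameter

# The dummy parameter lets us repeat each scenario to estimate its variance.
hyperparameter_ranges = {
    "layer_scenario": CategoricalParameter(["lower", "upper", "even", "all"]),
    "dummy": CategoricalParameter([str(i) for i in range(7)]),  # 7 repetitions each
}

tuner = HyperparameterTuner(
    estimator=estimator,  # the placeholder estimator from the earlier sketch
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "eval_accuracy = ([0-9\\.]+)"}],
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Grid",      # exhaustively covers all 4 * 7 = 28 combinations
)
tuner.fit()
```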
Let’s get the easy thing out of the way first: As expected, tuning all layers and consequently using double the number of parameters, improves performance the most. This improvement is evident in the bottom figure.
Also, the peaks of all scenarios, as shown in the density plots on the right of the individual figures, are relatively close. When comparing these peaks, which represent the most frequently observed performance, we only see an improvement of ~0.08 in validation accuracy between the worst and best scenario. That’s not much. Therefore, we consider it a wash.
Regardless, let’s still examine our original hypothesis: We (me, really) expected that finetuning the upper six layers would yield better performance than finetuning the lower six layers. However, the data disagrees. For this task it makes no difference. Hence, I need to update my understanding.
We have two potential takeaways:
- Spreading the layers evenly is a little better than focusing on the top or bottom layers. That said, the improvement is so small that this insight may be brittle and might not generalize well, not even to new runs of the same model. Hence, we will discard our “discovery”.
- Tuning all layers, with double the cost, produces marginally better results. This outcome, however, surprises no one. Still, it is good to see it confirmed, as we otherwise would have found an opportunity to save trainable parameters, i.e., cost.
Overall, good to know all of that, but as we do not consider it actionable, we are moving on. If you are interested, you can find more details in this notebook.
Tuning Horizontally / Component-wise
Within each transformer layer, we have four learned projections used for attention that can be adapted during finetuning:
- Q — Query, 768 -> 768
- K — Key, 768 -> 768
- V — Value, 768 -> 768
- O — Output, 768 -> 768
In addition to these, we use two linear modules in each position-wise feedforward layer that live within the same transformer layer as the projections from above:
- Up — Up projection, 768 -> 3072
- Down — Down projection, 3072 -> 768
We can already see from the numbers above that the feedforward layers (ff) are four times as large as the QKVO projections we previously discussed. Hence the ff components will have a potentially larger impact and certainly higher cost.
Besides this, what other expectations could we have? It’s hard to say. We know from Multi-Query Attention [3] that the query projection is particularly important, but does this importance hold when finetuning with an adapter on our task (as opposed to, for example, pre-training)? Instead, let’s try out what the impact of the individual components is and proceed based on those results. We will be able to see which components are the strongest and maybe this will allow us to just pick those for tuning going forward.
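To adapt a single component in isolation, one option (again assuming peft and the Hugging Face roberta-base module names) is to target it with a regular expression. A plain suffix would work for most components, but the feed-forward down projection shares its output.dense suffix with the attention output projection, so regexes are the safer, uniform choice:

```python
from peft import LoraConfig

# Regexes matching exactly one component type per transformer layer in roberta-base.
component_to_regex = {
    "query": r".*\.attention\.self\.query",
    "key":   r".*\.attention\.self\.key",
    "value": r".*\.attention\.self\.value",
    "out":   r".*\.attention\.output\.dense",
    "up":    r".*\.intermediate\.dense",
    "down":  r".*\.layer\.\d+\.output\.dense",  # excludes the attention output
}

def config_for(component: str, r: int = 8) -> LoraConfig:
    return LoraConfig(
        r=r,
        lora_alpha=16,
        target_modules=component_to_regex[component],  # a string is treated as a regex
        modules_to_save=["classifier"],
        task_type="SEQ_CLS",
    )
```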
Let’s run these experiments and inspect the results:
As was to be expected, the ff layers use their four-times size advantage to outperform the attention projections. Still, we can see that there are differences within these two groups. These differences are relatively minor, and if you want to leverage them, it’s necessary to validate their applicability for your specific task.
An important observation is that by merely tuning one of the ff layers (~0.943), we could almost achieve the performance of tuning all modules from the “LoRA Base” scenario (~0.946). Consequently, if we’re looking to balance between overall performance and the parameter count, this could be a good strategy. We’ll keep this in mind for the final comparison.
Within the attention projections (middle figure) it turns out that the query projection did not prove as impactful as expected. Contrarily, the output and value projections proved more useful. However, on their own, they were not that impressive.
So far, we have looked at the individual contributions of the components. Let’s also check if their impact overlaps or if combining components can improve the results.
Let’s run some of the possible combinations and see if this is informative. Here are the results:
Looking at the numbers charted above, the first takeaway is that we see no performance regressions. Given that we added more parameters and combined components that already worked individually, that’s how it should be. Nevertheless, there is always the chance that design decisions perform worse in combination than they did individually. Not here though, good!
We should not over-interpret the results, but it is interesting to recognize that when we tested the components individually, the output projection performed slightly ahead of the value projection. Now, in combination with the position-wise feed-forward up projection, this relationship is reversed (o+up ~0.945, v+up ~0.948).
We also recognize from the previous experiment that the up projection was already performing almost at that level on its own. Therefore, we keep our enthusiasm in check, but include this scenario in our final comparison, if only because we get performance that is slightly better than when tuning and adapting all components in all layers (“LoRA Base”), but with far fewer parameters.
You can find more details in this notebook.
Tuning r
We know from the literature [2] that it is recommended to use a small r value, meaning that r is only a fraction of the minimum dimension of the original module, e.g. to use 8 instead of 768. However, let’s validate this for ourselves and get some empirical feedback. Could it be worth investigating using a larger value for r, despite the conventional wisdom?
For the previous trials, we used r=8 and invested more time in tuning the learning rate and the number of training epochs for this value. Trying different values for r will significantly alter the capacity of the adapted modules. Ideally, we would re-tune the learning rate for each value of r, but we aim to be frugal. Consequently, for now, we stick with the same learning rate. However, the farther we move away from our tuned value of r=8, the stronger the need becomes to re-tune the other hyperparameters mentioned above. This is a consideration we need to keep in mind when reviewing the results:
In the first figure, we see that the model performance is not particularly sensitive to additional capacity, with good performance at r=4 and r=8. r=16 was a tiny bit better, but is also more expensive in terms of parameter count. So let’s keep r=4 and r=8 in mind for our final comparison.
To see the effect of r on the parameter count, we will also include r=1 in the final comparison.
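As a rough, back-of-the-envelope check of how r drives the adapter parameter count (ignoring the classifier head and bias terms), each adapted linear module of shape d_in x d_out adds r * (d_in + d_out) parameters:

```python
# Adapter parameter count for roberta-base (12 layers, all six linear modules per layer).
layers = 12
shapes = [(768, 768)] * 4 + [(768, 3072), (3072, 768)]  # Q, K, V, O, Up, Down

for r in (1, 4, 8, 16, 32):
    lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)
    print(f"r={r:>2}: {lora_params:>9,} adapter parameters "
          f"(~{100 * lora_params / 125_000_000:.2f}% of the ~125M base parameters)")
```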
One odd thing to observe in the figures above is that the performance falls off sharply at r=32. Providing a model that uses residual connections with more capacity should yield the same or better performance than a lower capacity. This is clearly not the case here. But as we tuned the learning rate for r=8 and we now have many more learnable parameters with r=32 (see the upper right panel in the preceding figure), we should also reduce the learning rate, or ideally, re-tune the learning rate and number of epochs to adapt to the much larger capacity. Looking at the lower right panel in the previous figure, we should then also consider adding more regularization to deal with the more pronounced overfitting we see.
Despite the general potential for improvement when providing the model with more capacity, the other values of r we observed did not indicate that more capacity would improve performance without also markedly increasing the number of parameters. Therefore, we’ll skip chasing an even larger r.
More details in this notebook.
Final Comparison
Throughout this long article, we have gathered numerous analytical results. To consolidate these findings, let’s explore and compare several interesting combinations of hyperparameter values in one place. For our purposes, a result is considered interesting if it either improves the overall performance of the model or gives us additional insights about how the model works, ultimately strengthening our intuitive understanding.
All experiments finetune the sst2 task on RoBERTa base as seen in the RoBERTa paper [1].
Execution of Experiments:
As before, when I show the results of a scenario (reported as the “target_tuner_name” column in the table above, and as labels on the y-axis in the graph), it’s based on executing the same combination of hyperparameter values five times. This allows me to report the mean and standard deviation of the objective metric.
Now, let’s discuss some observations from the scenarios depicted in the graph above.
Classifier Only
This baseline—where we only train the classifier head—has the lowest cost. Refer to parameters_relative, which indicates the percentage of parameters needed, compared to a full finetuning. This is illustrated in the second panel, showing that ~0.5% is the lowest parameter count of all scenarios.
This has a beneficial impact on the “GPU Memory” panel (where lower is better) and markedly in the “Train Speed” panel (where higher is better). The latter indicates that this scenario is the fastest to train, because of the lower parameter count, and also because there are fewer modules to handle, as we do not add additional modules in this scenario.
This serves as an informative bare-bones baseline to see relative improvements in training speed and GPU memory use, but also highlights a tradeoff: the model performance (first panel) is the lowest by a wide margin.
Additionally, this scenario reveals that 0.48% of the full fine-tuning parameters represent the minimum parameter count. We allocate that fraction of the parameters exclusively for the classifier. Additionally, as all other scenarios tune the classifier, we consistently include that 0.48% in addition to whatever parameters are further tuned in those scenarios.
LoRA Base
This scenario serves as the foundation for all experiments beyond the baselines. We use r=8 and adapt and finetune all linear modules across all layers.
We can observe that the model performance matches the full finetuning performance. We might have been lucky in this case, but the literature suggests that we can expect to nearly match the full finetuning performance with just about 1% of the parameters. We can see evidence of this here.
Additionally, because of adapting all linear modules, we see that the train speed is the lowest of all experiments and the GPU memory utilization is amongst the highest, but in line with most of the other scenarios.
LoRA all, r={1,4,8}
Overall, these scenarios are variations of “LoRA Base” but with different values of r. There is only a small difference in the performance. However, as expected, there is a positive correlation between r and the parameter count and a slightly positive correlation between r and GPU memory utilization. Despite the latter, the value of r remains so low that this does not have a substantial impact on the bottom line, specifically the GPU memory usage. This confirms what we explored in the original experiments, component-wise, as discussed above.
When reviewing r=1, however, we see that this is a special case. With 0.61% for the relative parameter count, we are just a smidgen above the 0.48% of the “Classifier Only” scenario. But we see a validation accuracy of ~0.94 with r=1, compared to ~0.82 with “Classifier Only”. With just 0.13% of the total parameters, adapted solely in the transformer layers, we can elevate the model’s validation accuracy by ~0.12. Bam! This is impressive, and hence, if we are interested in a low parameter count, this could be our winner.
Regarding GPU memory utilization, we’ll review this a bit later. But briefly, besides allocating memory for each parameter in the model, the optimizer, and the gradients, we also need to keep the activations around to calculate the gradients during backpropagation.
Additionally, larger models will show a bigger impact of choosing a small value for r.
For what it’s worth, the scenario “LoRA all, r=8” used identical hyperparameter values to “LoRA Base”, but was executed independently. To make it easier to compare r=1, r=4 and r=8, this scenario was still evaluated.
LoRA ff_u
In this scenario we are tuning only the position-wise feed forward up projections, across all layers. This leads to a reduction in both the number of parameters and the number of modules to adapt. Consequently, the data shows an improvement in training speed and a reduction in GPU memory utilization.
But we also see a small performance hit. For “LoRA Base” we saw ~0.946, while in this scenario we only see ~0.942, a drop of ~0.004.
Details on the comparisons in this notebook.
Sidestep: GPU Memory / Gradient Checkpointing
When looking at the GPU memory panel above, two things become obvious:
One — LoRA, on its own, does not dramatically reduce the memory footprint
This is especially true when we adapt small models like RoBERTa base with its 125M parameters.
In the previous article’s section on intrinsic dimensionality, we learned that for current generation models (e.g., with 7B parameters), the absolute value of r can be even smaller than for smaller capacity models. Hence, the memory-saving effect will become more pronounced with larger models.
Additionally using LoRA makes using quantization easier and more efficient – a perfect match. With LoRA, only a small percentage of parameters need to be processed with high precision: This is because we update the parameters of the adapters, not the weights of the original modules. Hence, the majority of the model weights can be quantized and used at much lower precision.
Furthermore, we typically use AdamW as our optimizer. Unlike SGD, which tracks only a single global learning rate, AdamW tracks moving averages of both the gradients and the squares of the gradients for each parameter. This implies that for each trainable parameter, we need to keep track of two values, which could potentially be in FP32. This process can be quite costly. However, as described in the previous paragraph, when using LoRA, we only have a few parameters that are trainable. This can significantly reduce the cost, so that we can use the typically parameter-intensive AdamW, even with large r values.
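A back-of-the-envelope sketch of that effect, assuming AdamW keeps two FP32 moment estimates (8 bytes) per trainable parameter and ignoring the model weights, gradients, and activations:

```python
def adamw_state_mb(trainable_params: int) -> float:
    # two moments (exp_avg, exp_avg_sq) * 4 bytes each, converted to MB
    return trainable_params * 2 * 4 / 1024**2

print(f"full finetuning (~125M trainable): {adamw_state_mb(125_000_000):,.0f} MB")
print(f"LoRA, all modules, r=8 (~1.3M):    {adamw_state_mb(1_300_000):,.1f} MB")
```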
We may look into these aspects in part four of our article series, given enough interest from you, dear reader.
Two–GPU memory utilization is only indirectly correlated with parameter count
Wouldn’t it be great if there was a direct linear relationship between the parameter count and the needed GPU memory? Unfortunately there are several findings in the diagrams above that illustrate that it is not that easy. Let’s find out why.
First we need to allocate memory for the model itself, i.e., storing all parameters. Then, for the trainable parameters, we also need to store the optimizer state and gradients (for each trainable parameter individually). In addition we need to consider memory for the activations, which not only depends on the parameters and layers of the model, but also on the input sequence length. Plus, it’s crucial to remember that we need to maintain those activations from the forward pass in order to apply the chain rule during the backward pass to do backpropagation.
If, during backpropagation, we were to re-calculate the activations for each layer when calculating the gradients for that layer, we would not need to maintain the activations for as long and could save memory at the cost of increased computation.
This approach is known as gradient checkpointing. The amount of memory that can be saved depends on how much additional memory for activations needs to be retained. It’s important to remember that backpropagation involves repeatedly applying the chain rule, step by step, layer by layer:
Recap — Chain Rule during Back Propagation
During backpropagation, we calculate the error at the top of the network (in the classifier) and then propagate the error back to all trainable parameters that were involved. These parameters are adjusted based on their contributions to the error, to do better in the future. We calculate the parameters’ contributions by repeatedly applying the chain rule, starting at the top and traversing the computation graph towards the inputs. This is necessary because any change in a parameter in a lower layer can potentially impact the parameters in all the layers above.
To calculate the local gradients (for each step), we may need the values of the activations for all the steps between the respective trainable parameter and the top (the loss function which is applied at the classification head). Thus, if we have a parameter in one of the top layers (close to the head), we need to maintain fewer activations compared to when training a parameter in the lower layers. For those lower layer parameters, we need to traverse a much longer graph to reach the classification head and, hence, need to maintain more memory to keep the activations around.
In our specific model and task, you can see the effect illustrated below. We train an individual model for each layer, in which only that particular layer undergoes training. This way, we can isolate the effect of the layer’s relative position. We then plot the amount of GPU memory required for each model, and therefore for each layer, during training.
In the graph below (see left panel) you can see that if we are closer to the bottom of the model (i.e., low layer number) the GPU memory requirement is lower than if we are close to the top of the model (i.e., high layer number) where the loss originates.
With gradient checkpointing enabled (see right panel), we no longer recognize this effect. Instead of saving the activations until backpropagation, we re-calculate them when needed. Hence, the difference in memory usage between the left and right panels is the activations that we maintain for the backward pass.
Execution of Experiments:
As with previous experiments, I used AMT with Grid Search to provide unbiased results.
It is important to remember that recalculating the activations during backpropagation is slow, so we are trading off computational speed for memory usage.
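For reference, with the Hugging Face Transformers library, gradient checkpointing can be switched on either directly on the model or through the Trainer’s arguments; a minimal sketch:

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Option 1: enable it directly on the model.
model.gradient_checkpointing_enable()

# Option 2: let the Trainer handle it.
args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,  # re-compute activations during the backward pass
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
```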
More details on the testing can be found in this notebook.
As an aside, to the best of my understanding, using Gradient Checkpointing should only have non-functional impact. Unfortunately, this is not what I am seeing though (issue). I may be misunderstanding how to use Hugging Face’s Transformers library. If anyone has an idea why this may be the case, please let me know.
Consequently, take the graphs from above with a bit of caution.
We may revisit the topic of memory in part four of this article series, although it’s not strictly a LoRA topic. If you’re interested, please let me know in the comments below.
Conclusion
That was quite a lot to absorb. Thank you for sticking with me this far. I hope you found it worthwhile and were able, at a high level, to confirm for yourself that LoRA works: It matches the performance of a full finetuning while only using ~1% of the parameters of a full finetuning.
But now, let’s dive into the details: What specific design decisions should we consider when exploring the hyperparameter values that we want to use with our model and our task when applying LoRA?
Our approach
We formulated several hypotheses about how our model is likely to behave and then collected empirical feedback to validate or invalidate these hypotheses. We chose this approach because we wanted to use our prior knowledge to guide the scope of our experiments, rather than haphazardly testing random configurations.
This approach proved beneficial, given that the solution space was extensive and impossible to explore exhaustively. Even with the experiments scoped using our prior knowledge, interpreting the results was challenging. Had we just randomly sampled this vast space, it would likely have led to wasted computation and unstructured results. Such an approach would have prevented us from drawing generalizable conclusions to make intentional decisions for our model, which would have been frustrating.
We learned several things, like the relative impact of r, the nuances in its effect on parameter count, GPU memory and training speed. We also observed that the count of trainable parameters alone is not a predictor for GPU memory usage. Interestingly, the location of these parameters in the network architecture plays a crucial role. Moreover, we found that when using the same number of parameters, the training speed is slower with multiple LoRA modules compared to using just a single module.
Adapt all linear modules — A practical choice
Understanding more about how LoRA works was just one of two goals. We were also aiming for a good set of hyperparameter values for our training. Regarding this, we discovered that adapting all linear modules with a low value of r is an effective strategy. This approach is attractive as it results in good performance, moderate costs, and very low complexity; making it a practical choice.
Of course, attention should still be paid to learning-rate and batch-size, as with any other training of a neural network.
We are all examining different aspects of the topic, but considering the overlap at the core, the above guidance aligns closely with Sebastian Raschka’s findings from this and that excellent article on the topic, as well as Tim Dettmers’s findings from the QLoRA paper. These are valuable resources for learning about more facets of using LoRA.
- Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments – Lightning AI
- Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
Carefully select a subset of modules–Better performance at lower cost
On the other hand, if you do want to invest more time, you could achieve slightly better performance, as well as lower training time and memory usage. When it comes to selecting the modules to adapt, we found that it’s possible to match the performance of adapting all modules by actually adapting fewer modules.
Moreover, we discovered that spreading the LoRA modules evenly across all layers is apparently a good choice for model performance.
For our specific example we got the best performance and a relatively low cost from tuning the feed-forward up projections and the attention value projections across all layers:
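Expressed as a peft configuration (again a sketch of mine, assuming the Hugging Face roberta-base module names rather than the exact code from the notebooks), this selection might look like:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

config = LoraConfig(
    r=8,                                    # r=4 performed similarly well in our runs
    lora_alpha=16,                          # illustrative, not tuned here
    target_modules=["attention.self.value", "intermediate.dense"],  # V + FF-up, all layers
    modules_to_save=["classifier"],         # the head stays fully trainable
    task_type="SEQ_CLS",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```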
However, for a different task, I may want to re-evaluate this finding.
Also, when analyzing a future task, I will be on the lookout for whether just adapting the upper layers results in good performance. This did not work out for our task in this article, but we saw earlier that, when it does work, it can significantly reduce GPU memory utilization.
One thing to remember is that training neural networks is inherently a noisy process, and investing time into gaining more and more certainty about the best hyperparameters can compete with efforts to improve other potential areas. Maybe this extra time would be better invested into data curation or enhancing the overall feedback loop. I hope that this article has demonstrated a common-sense approach that strikes a balance between the cost of exploration and the potential reward.
Please also keep in mind not to overfit on the specific model and findings we discussed here. This is merely a toy example, not a use case requested by a business department. Nobody needs to train the sst-2 task on RoBERTa.
However, please do share your experience with your models; including where you felt led astray by this article.
One last thought to conclude the topic. Moving forward, I would generally start with a low value of r. Then consider how big the differences between the pre-training task and the finetuning task(s) are. The bigger the necessary adaptations during finetuning, the larger r should be.
Furthermore, if I can identify where the adaptations need to occur— specifically, which layers or components would be most impacted — I would use that knowledge to select the right modules to adapt and their relative r.
Now that we have our tuned model, let’s move on to deploying it. In the following article, we will explore how using adapters naturally leads to the ability to create multi-task endpoints with vastly improved non-functional properties compared to creating one dedicated endpoint for each task.
Thanks to Valerio Perrone, Ümit Yoldas, Andreas Gleixner, André Liebscher, Karsten Schroer and Vladimir Palkhouski for providing invaluable feedback during the writing of this article.
The header image was created using Clipdrop. All other images are by the author.
[1] Yinhan Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019
[2] Edward J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, 2021
[3] Noam Shazeer, Fast Transformer Decoding: One Write-Head is All You Need, 2019