LLM Fine-Tuning — FAQs
Answering the most common questions I received as an AI consultant
Last year, I posted an article on fine-tuning large language models (LLMs). To my surprise, it turned out to be one of my most-read blogs ever, and it led to dozens of conversations with clients about fine-tuning and their AI projects. Here, I summarize the questions that came up most often in those conversations, along with my responses.
What is Fine-tuning?
I like to define fine-tuning as taking an existing (pre-trained) model and training at least 1 model parameter to adapt it to a particular use case.
It’s important to note the “training at least 1 model parameter” part of the definition. Some define fine-tuning without this nuance (myself included, at times), but it is what distinguishes fine-tuning from approaches like prompt engineering or prefix-tuning, which adapt a model’s behavior without updating any of its pre-trained parameters.
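To make this concrete, here is a minimal sketch of the idea in code, using Hugging Face’s transformers library (the distilgpt2 model and the choice to unfreeze only the last block are purely illustrative): everything is frozen except a handful of weights, which is enough to count as fine-tuning under the definition above.

```python
# A minimal sketch of "training at least 1 model parameter": freeze everything,
# then unfreeze only the last transformer block before training.
# Model choice (distilgpt2) is illustrative, not a recommendation.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the final transformer block
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```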
When NOT to Fine-tune
People tend to think that LLM engineering techniques like prompt engineering, RAG, and fine-tuning all live on a spectrum, where fine-tuning is an objectively more powerful version of these other approaches. However, this is misleading.
The effectiveness of any approach depends on the details of the use case. For example, fine-tuning is less effective than retrieval-augmented generation (RAG) at providing LLMs with specialized knowledge [1].
This makes sense when we consider training data volume. For instance, Llama 3.1 8B was trained on ~15T tokens, on the order of 100 million novels’ worth of text [2]. A robust fine-tuning dataset might be 1M tokens (1,000 examples of ~1,000 tokens each), which is about 15 million times smaller! Thus, any knowledge in a fine-tuning dataset is negligible compared to what was learned during pre-training.
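The back-of-the-envelope math looks like this:

```python
# Rough comparison of pre-training vs. fine-tuning data volume
pretraining_tokens = 15e12           # ~15T tokens for Llama 3.1 8B [2]
finetuning_tokens = 1_000 * 1_000    # 1,000 examples x ~1,000 tokens each

print(f"{pretraining_tokens / finetuning_tokens:,.0f}x")  # ~15,000,000x
```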
When do I Fine-tune?
This is not to say that fine-tuning is useless. A central benefit of fine-tuning an AI assistant is lowering inference costs [3].
The standard way of adapting LLM outputs for a particular application is prompt engineering, where users craft prompts that elicit helpful responses from the model. While this provides a simple and flexible way to adjust model responses, effective prompts may require a detailed description of the desired task and several examples for the model to mimic, as in the prompt shown below.
--Example Prompt (before fine-tuning)--
You are an expert sentiment analysis tool. Given a text, classify its sentiment as "Positive," "Negative," or "Neutral." Here are some examples:

Text: "I absolutely love the new design of this product! It’s user-friendly and looks amazing." Sentiment: Positive
Text: "The service was terrible. I had to wait for over an hour, and the staff was rude." Sentiment: Negative
Text: "The movie was okay. It wasn't particularly exciting, but it wasn’t bad either." Sentiment: Neutral
Text: "I'm so happy with the customer support I received. They resolved my issue quickly." Sentiment: Positive
Text: "The meal was bland, and I wouldn’t recommend it to anyone." Sentiment: Negative
Text: "The weather is unpredictable today." Sentiment: Neutral

Now analyze the following text:
Text: "[Your input text here]" Sentiment:
Fine-tuning, on the other hand, can compress prompt sizes by directly training the model on examples. Shorter prompts mean fewer tokens at inference, leading to lower compute costs and faster model responses [3]. For instance, after fine-tuning, the above prompt could be compressed to the following.
--Example Prompt (after fine-tuning)--
You are an expert sentiment analysis tool. Given a text, classify its sentiment as "Positive," "Negative," or "Neutral."

Analyze the following text:
Text: "[Your input text here]" Sentiment:
RAG vs Fine-tuning?
We’ve already mentioned situations where RAG and fine-tuning perform well. However, since this is such a common question, it’s worth reemphasizing when each approach works best.
RAG is when we inject relevant context into an LLM’s input prompt so that it can generate more helpful responses. For example, if we have a domain-specific knowledge base (e.g., internal company documents and emails), we might identify the items most relevant to the user’s query so that an LLM can synthesize information in an accurate and digestible way.
Here’s high-level guidance on when to use each.
- RAG: necessary knowledge for the task is not commonly known or available on the web but can be stored in a database
- Fine-tuning: necessary knowledge for the task is already baked into the model, but you want to reduce the prompt size or refine response quality
- RAG + Fine-tuning: the task requires specialized knowledge, and we would like to reduce the prompt size or refine the response quality
Notice that these approaches are not mutually exclusive. In fact, the original RAG system proposed by Facebook researchers used fine-tuning to help the model make better use of retrieved information when generating responses [4].
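For reference, here is a bare-bones sketch of the RAG pattern described above, using the sentence-transformers library for embeddings (the model name and documents are placeholders):

```python
# Bare-bones RAG: embed documents, retrieve the most relevant one for a query,
# and inject it into the LLM prompt.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Internal policy: refunds are processed within 5 business days.",
    "Support hours are 9am-5pm ET, Monday through Friday.",
]
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

query = "How long do refunds take?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Retrieve the single most similar document (top-k in practice)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
context = docs[int(scores.argmax())]

prompt = f"Use the context to answer the question.\nContext: {context}\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the (optionally fine-tuned) LLM
```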
How Do I Pick a Model?
Most state-of-the-art language models are trained in similar ways on similar datasets, so the performance differences between comparable models are often negligible across use cases.
However, performance is only one of many considerations important for an AI project. Others include privacy, technical requirements, and cost.
Open-weights vs closed-weights?
An open-weight model’s parameters are accessible to the public, while the parameters of a closed-weight model are not.
- Open-weight models: Llama (Meta), Mistral, Gemma (Google), Phi (Microsoft)
- Closed-weight models: GPT 3+ (OpenAI), Claude (Anthropic), Gemini (Google)
The two main considerations when deciding open-weight vs closed-weight are privacy and flexibility. For instance, some projects may have strict controls on where user data can be sent, which may disqualify models accessible only via API.
On the other hand, some use cases require greater flexibility than those afforded by closed-weight model APIs, such as retraining specific parameters in a model or accessing intermediate model representations.
How big?
Another key question is how big of a model to use. This comes down to a trade-off between model performance and inference costs.
The approach I’d recommend is to start big to confirm the desired performance is obtainable, then gradually explore smaller and smaller models until performance drops below what’s required. From here, an evaluation can be made on which size choice provides the greatest ROI based on the use case.
Instruction-tuned or foundation model?
Today’s most popular large language models undergo instruction tuning. In this process, an initial foundation model is fine-tuned to respond to user queries.
Most business cases I’ve encountered involved an AI chatbot or assistant. For these types of applications, instruction-tuned models are the best choice.
However, there are situations where using a foundation model directly may work better, namely when the LLM is not being used for question-answering or conversational tasks.
How to Prepare Data for Fine-tuning?
While much of the attention goes to model leaderboards and fine-tuning approaches, these are secondary to the quality of the training dataset. In other words, the data you use to fine-tune your model will be the key driver of its performance.
In my experience, most clients are interested in making a “custom chatbot” or “ChatGPT for X.” In these cases, the best fine-tuning approach is so-called supervised fine-tuning, which consists of curating a set of example query-response pairs on which the model is trained.
For example, if I wanted to fine-tune an LLM to respond to viewer questions on YouTube, I would need to gather a set of comments with questions and my associated responses. For a concrete example of this, check out the code walk-through on YouTube.
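In practice, these query-response pairs are often stored in a chat-style JSONL file, one example per line. A minimal sketch (the comments and replies below are made up):

```python
# Write query-response pairs to a JSONL file for supervised fine-tuning
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Great video! How do you pick a learning rate?"},
        {"role": "assistant", "content": "Thanks! I usually start around 2e-5 and adjust based on the loss curve."},
    ]},
    {"messages": [
        {"role": "user", "content": "Does this approach work for non-English text?"},
        {"role": "assistant", "content": "It can, but you'll want training examples in your target language."},
    ]},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```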
Advanced Fine-tuning
Much of this article has focused on fine-tuning large language models to create AI assistants. Despite the popularity of such use cases, this is a relatively narrow application of model fine-tuning.
Classification
Another way we can fine-tune language models is for classification tasks, such as classifying support ticket tiers, detecting spam emails, or determining the sentiment of a customer review. A classic fine-tuning approach for this is called transfer learning, where we replace the head of a language model to perform a new classification task.
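A sketch of what that can look like with Hugging Face’s transformers library (the model name and label count are illustrative): the pre-trained base is loaded with a fresh classification head, and the base can optionally be frozen so only the new head is trained.

```python
# Transfer learning: swap the language-model head for a classification head
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,  # e.g., Positive / Negative / Neutral
)

# Optionally freeze the pre-trained base so only the new head is trained
for param in model.distilbert.parameters():
    param.requires_grad = False
```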
Text Embeddings
Fine-tuning can also be applied to text embeddings, which are most commonly used today to form vector databases in RAG systems. A major shortcoming of off-the-shelf text embedding models is that they may not perform well on domain-specific language or jargon. Fine-tuning a text embedding model for a particular domain can help overcome this.
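One way to do this is sketched below with the sentence-transformers library (this uses its legacy fit API, and the example pairs are made up): the model is trained on (query, relevant passage) pairs from the target domain with a contrastive loss.

```python
# Fine-tune a text embedding model on domain-specific (query, passage) pairs
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["What is our PTO policy?",
                        "Employees accrue 1.5 days of PTO per month."]),
    InputExample(texts=["How do I reset my VPN token?",
                        "VPN tokens can be reset from the IT self-service portal."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```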
Compression
Model compression aims to reduce the size of an LLM without sacrificing performance. While this does not necessarily fit under the fine-tuning definition above, it is in the same spirit of adapting a model for a particular use case. Three popular compression techniques include quantization, pruning, and knowledge distillation.
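As a small taste of the first technique, here is a sketch of loading a model with 4-bit post-training quantization via transformers and bitsandbytes (the model name is illustrative, and this requires a CUDA GPU):

```python
# Post-training quantization: load model weights in 4-bit precision
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```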
What’s Next?
Here, I summarized the most common fine-tuning questions I’ve received over the past 12 months. While fine-tuning is not a panacea for all LLM use cases, it has key benefits.
If you have questions that were not covered in the article, drop a comment, and I’ll share my thoughts 🙂
My website: https://www.shawhintalebi.com/
[1] Ovadia et al., "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs": https://arxiv.org/abs/2312.05934
[2] "The Llama 3 Herd of Models" (Llama 3 paper): https://arxiv.org/abs/2407.21783
[3] OpenAI API documentation: https://platform.openai.com/docs
[4] Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks": https://arxiv.org/abs/2005.11401