An MLLM fine-tuning tutorial using the newest pocket-sized Mini-InternVL model
The world of large language models (LLMs) is constantly evolving, with new advancements emerging rapidly. One exciting area is the development of multi-modal LLMs (MLLMs), capable of understanding and interacting with both text and images. This opens up a world of possibilities for tasks like document understanding, visual question answering, and more.
I recently wrote a general post about one such model that you can check out here:
6 Real-World Uses of Microsoft’s Newest Phi-3 Vision-Language Model
But in this one, we’ll explore a powerful combination: the InternVL model and the QLoRA fine-tuning technique. We will focus on how easily such models can be customized for a specific use case. We’ll use these tools to create a receipt understanding pipeline that extracts key information such as the company name, address, and total purchase amount with high accuracy.
Understanding the Task and Dataset
This project aims to develop a system that can accurately extract specific information from scanned receipts, using InternVL’s capabilities. The task presents a unique challenge, requiring not only robust natural language processing (NLP) but also the ability to interpret the visual layout of the input image. This will enable us to create a single, OCR-free, end-to-end pipeline that demonstrates strong generalization across complex documents.
To train and evaluate our model, we’ll use the SROIE dataset. SROIE provides 1,000 scanned receipt images, each annotated with key entities such as the following (a short sketch after the list shows how one annotation can be turned into a training target):
- Company: The name of the store or business.
- Date: The purchase date.
- Address: The store’s address.
- Total: The total amount paid.
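To make the supervision target concrete, here is a minimal sketch of how one annotation could become the instruction/target pair used for training. The annotation values are taken from the comparison example later in this post; the surrounding code is a hypothetical illustration, not the project’s exact data loader.

```python
import json

# Ground-truth entities for one SROIE receipt (values taken from the
# comparison example later in this post); the code around them is a
# hypothetical illustration of how a training target can be built.
annotation = {
    "company": "YONG TAT HARDWARE TRADING",
    "date": "13/03/2018",
    "address": "NO 4,JALAN PERJIRANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR.",
    "total": "72.00",
}

# The instruction used throughout this post, paired with the serialized
# annotation as the supervision target.
question = (
    "Extract the company, date, address and total in json format."
    "Respond with a valid JSON only."
)
target = json.dumps(annotation)
print(target)
```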
We will evaluate the performance of our model using a fuzzy similarity score, a metric that measures the similarity between predicted and ground-truth entities. This metric ranges from 0 (irrelevant results) to 100 (perfect predictions).
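For illustration, here is how a fuzzy similarity score can be computed for a single field using the `thefuzz` package; the exact scorer used in the project may differ.

```python
# Minimal sketch of a fuzzy similarity score between a prediction and the
# ground truth, using `thefuzz` (the project may use a different scorer).
from thefuzz import fuzz

ground_truth = "YONG TAT HARDWARE TRADING"
prediction = "YONG TAT HARDWARE TRADING SDN"  # hypothetical near-miss

# fuzz.ratio returns an integer between 0 (no overlap) and 100 (exact match)
score = fuzz.ratio(ground_truth.lower(), prediction.lower())
print(score)  # 93 for this near-miss
```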
InternVL: A Multi-modal Powerhouse
InternVL is a family of multi-modal LLMs from OpenGVLab, designed to excel at tasks involving images and text. Its architecture combines a vision model (like InternViT) with a language model (like InternLM2 or Phi-3). We’ll focus on the Mini-InternVL-Chat-2B-V1-5 variant, a smaller version that is well suited for running on consumer-grade GPUs.
InternVL’s key strengths:
- Efficiency: Its compact size allows for efficient training and inference.
- Accuracy: Despite being smaller, it achieves competitive performance in various benchmarks.
- Multi-modal Capabilities: It seamlessly combines image and text understanding.
Demo: You can explore a live demo of InternVL here.
Finetuning with QLoRA: A Memory-Efficient Approach
To further boost our model’s performance, we’ll use QLoRA, a fine-tuning technique that significantly reduces memory consumption while preserving performance. Here’s how it works (a short sketch of the adapter wrapping follows this list):
- Quantization: The pre-trained LLM is quantized to 4-bit precision, reducing its memory footprint.
- Low-Rank Adapters (LoRA): Instead of modifying all parameters of the pre-trained model, LoRA adds small, trainable adapters to the network. These adapters capture task-specific information without requiring changes to the main model.
- Efficient Training: The combination of quantization and LoRA enables efficient fine-tuning even on GPUs with limited memory.
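To make the adapter idea concrete, here is a hedged sketch of what a LoRA wrapper (such as the `wrap_lora` helper used in the training code below) might look like with Hugging Face’s `peft` library. The target module names are assumptions about the InternVL language backbone rather than values confirmed by the project.

```python
# Hedged sketch of a LoRA wrapper built on Hugging Face `peft`; the real
# `wrap_lora` helper used later in this post may differ in its details.
from peft import LoraConfig, get_peft_model


def wrap_lora_sketch(model, r=128, lora_alpha=256):
    """Attach trainable LoRA adapters to a (quantized) base model."""
    lora_config = LoraConfig(
        r=r,  # rank of the low-rank adapter matrices
        lora_alpha=lora_alpha,  # scaling applied to the adapter output
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        # Assumed projection-layer names; the real names depend on the
        # InternVL language backbone (here, InternLM2-style modules).
        target_modules=["wqkv", "wo", "w1", "w2", "w3"],
    )
    return get_peft_model(model, lora_config)
```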
Code Walk-through: Baseline Performance
Let’s dive into the code. First, we’ll assess the baseline performance of Mini-InternVL-Chat-2B-V1-5 without any fine-tuning:
```python
import torch
from transformers import BitsAndBytesConfig

# InternVLChatModel, InternLM2Tokenizer and the load_image helper come from
# the InternVL / doc-llm code base; `args` and `image_base_path` are defined
# by the surrounding script.

# 4-bit quantization config (NF4) so the model fits on a consumer GPU
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = InternVLChatModel.from_pretrained(
    args.path,
    device_map={"": 0},
    quantization_config=quant_config if args.quant else None,
    torch_dtype=torch.bfloat16,
)
tokenizer = InternLM2Tokenizer.from_pretrained(args.path)
model.eval()

# Load one sample receipt; `max_num` caps the number of image tiles
pixel_values = (
    load_image(image_base_path / "X51005255805.jpg", max_num=6)
    .to(torch.bfloat16)
    .cuda()
)
generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

# Single-round, single-image conversation
question = (
    "Extract the company, date, address and total in json format."
    "Respond with a valid JSON only."
)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```
The result:
```json
{
"company": "SAM SAM TRADING CO",
"date": "Fri, 29-12-2017",
"address": "67, JLN MENHAW 25/63 TNN SRI HUDA, 40400 SHAH ALAM",
"total": "RM 14.10"
}
```
This code:
- Loads the model from the Hugging Face hub.
- Loads a sample receipt image and converts it to a tensor.
- Formulates a question asking the model to extract relevant information from the image.
- Runs the model and outputs the extracted information in JSON format.
This zero-shot evaluation shows impressive results, achieving an average fuzzy similarity score of 74.24%. This demonstrates InternVL’s ability to understand receipts and extract information with no fine-tuning.
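As an illustration of how such an average score can be computed, here is a hedged sketch of an evaluation loop that averages per-field fuzzy scores over a test fold; the sample structure and helper usage are assumptions, not the project’s exact evaluation code.

```python
# Hedged sketch of the evaluation behind the fuzzy similarity score.
import json

from thefuzz import fuzz

FIELDS = ["company", "date", "address", "total"]


def score_sample(prediction_json: str, ground_truth: dict) -> float:
    """Average fuzzy similarity (0-100) over the four target fields."""
    try:
        prediction = json.loads(prediction_json)
    except json.JSONDecodeError:
        return 0.0  # unparsable output counts as an irrelevant result
    scores = [
        fuzz.ratio(str(prediction.get(field, "")).lower(), str(ground_truth[field]).lower())
        for field in FIELDS
    ]
    return sum(scores) / len(scores)


# Hypothetical usage over a test fold, where each sample pairs an image
# with its ground-truth entity dict and `predict` wraps `model.chat(...)`:
# scores = [score_sample(predict(s["image"]), s["entities"]) for s in test_samples]
# print(sum(scores) / len(scores))
```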
Fine-tuning: Enhancing Performance with QLoRA
To further boost accuracy, we’ll fine-tune the model using QLoRA. Here’s how we implement it:
```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# load_data, SFTDataset, CustomDataCollator, wrap_lora, IMG_CONTEXT_TOKEN,
# BASE_PATH, EPOCHS, path and refined_model are defined in the project code.

_data = load_data(args.data_path, fold="train")

# 4-bit quantization config (NF4), same as for the baseline evaluation
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
tokenizer = InternLM2Tokenizer.from_pretrained(path)

# Tell the model which token id marks the image context in the prompt
img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
model.img_context_token_id = img_context_token_id
model.config.llm_config.use_cache = False

# Add the trainable LoRA adapters on top of the quantized model
model = wrap_lora(model, r=128, lora_alpha=256)

training_data = SFTDataset(
    data=_data, template=model.config.template, tokenizer=tokenizer
)
collator = CustomDataCollator(pad_token=tokenizer.pad_token_id, ignore_index=-100)

train_params = TrainingArguments(
    output_dir=str(BASE_PATH / "results_modified"),
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim="paged_adamw_32bit",
    save_steps=len(training_data) // 10,
    logging_steps=len(training_data) // 50,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.001,
    max_steps=-1,
    group_by_length=False,
    max_grad_norm=1.0,
)

# Trainer
fine_tuning = SFTTrainer(
    model=model,
    train_dataset=training_data,
    dataset_text_field="###",
    tokenizer=tokenizer,
    args=train_params,
    data_collator=collator,
    max_seq_length=tokenizer.model_max_length,
)
fine_tuning.model.print_trainable_parameters()

# Training
fine_tuning.train()

# Save the fine-tuned adapters
fine_tuning.model.save_pretrained(refined_model)
```
This code:
- Loads the model with quantization enabled.
- Wraps the model with LoRA, adding trainable adapters.
- Creates a dataset from the SROIE dataset.
- Defines training arguments such as learning rate, batch size, and epochs.
- Initializes a trainer to handle the training process.
- Trains the model on the SROIE dataset.
- Saves the fine-tuned model (its LoRA adapters); a sketch after this list shows how they can be reloaded for inference.
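Once training finishes, the saved adapters can be reloaded on top of the quantized base model for inference. Below is a hedged sketch assuming the adapters were saved with `save_pretrained` as above and are re-attached with `peft`; the paths and Hub id are placeholders, not the project’s exact setup.

```python
import torch
from peft import PeftModel
from transformers import BitsAndBytesConfig

# InternVLChatModel comes from the InternVL code base, as in the snippets above.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = InternVLChatModel.from_pretrained(
    "OpenGVLab/Mini-InternVL-Chat-2B-V1-5",  # assumed Hub id of the base model
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Re-attach the LoRA adapters saved by `save_pretrained`; path is a placeholder
model = PeftModel.from_pretrained(base_model, "results_modified/refined_model")
model.eval()
# `model.chat(...)` can then be used exactly as in the baseline example.
```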
Here is a sample comparison between the base model and the QLoRA fine-tuned model:
Ground Truth:
```json
{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2018",
"address": "NO 4,JALAN PERJIRANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR.",
"total": "72.00"
}
```
Base model prediction (incorrect):
```json
{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2016",
"address": "JM092487-D",
"total": "67.92"
}
```
QLoRA fine-tuned model prediction (correct):
```json
{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2018",
"address": "NO 4, JALAN PERUBANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR",
"total": "72.00"
}
```
Results and Conclusion
After fine-tuning with QLoRA, our model achieves a remarkable 95.4% fuzzy similarity score, a significant improvement over the baseline performance (74.24%). This demonstrates the power of QLoRA in boosting model accuracy without requiring massive computing resources (about 15 minutes of training on 600 samples on an RTX 3080 GPU).
We’ve successfully built a robust receipt understanding pipeline using InternVL and QLoRA. This approach showcases the potential of multi-modal LLMs for real-world tasks like document analysis and information extraction. In this example use-case, we gained 30 points in prediction quality using a few hundred examples and a few minutes of compute time on a consumer GPU.
You can find the full code implementation for this project here.
The development of multi-modal LLMs is only just beginning, and the future holds exciting possibilities. The area of automated document processing has immense potential in the era of MLLMs. These models can revolutionize how we extract information from contracts, invoices, and other documents, requiring minimal training data. By integrating text and vision, they can analyze the layout of complex documents with unprecedented accuracy, paving the way for more efficient and intelligent information management.
The future of AI is multi-modal, and InternVL and QLoRA are powerful tools to help us unlock its potential on a small compute budget.
Links:
Code: https://github.com/CVxTz/doc-llm
Dataset Source: https://rrc.cvc.uab.es/?ch=13&com=introduction
Dataset License: Creative Commons Attribution 4.0 International (CC BY 4.0).