Empowering Phi-3.5-vision with Wikipedia knowledge for augmented Visual Question Answering.
Introduction
Retrieval Augmented Generation (RAG) is a powerful technique that can improve the accuracy and reliability of the answers generated by Large Language Models (LLMs). It also offers the possibility of checking the sources the model used during a particular generation, allowing easier fact-checking by human users. Furthermore, RAG makes it possible to keep the model's knowledge up to date and to incorporate topic-specific information without the need for fine-tuning. Overall, RAG provides many benefits and few drawbacks, and its workflow is straightforward to implement. Because of this, it has become the go-to solution for many LLM use cases that require up-to-date and/or specialized knowledge.
Some of the latest developments in the Generative AI field have focused on extending the popular transformer architecture to tackle multiple input and/or output modalities, trying to replicate the huge success of LLMs. There are already several models, both open and closed source, that have demonstrated a remarkable ability to handle multiple modalities. A popular multimodal setting, and one of the first to be tackled, is that of Vision Language Models (VLMs), which has seen interesting open-source contributions with the release of small yet powerful models like LLaVA, Idefics, and Phi-vision. If you want to get started with VLMs and learn more about building a Vision Language Chat Assistant using LLaVA, you can look at my previous post Create your Vision Chat Assistant with LLaVA.
Designing RAG systems for multimodal models is more challenging than in the text-only case. In fact, the design of RAG systems for LLMs is well established and there is some consensus about the general workflow, as many of the recent developments focus on improving accuracy, reliability, and scalability rather than fundamentally changing the RAG architecture. On the other hand, multimodality opens up multiple ways of retrieving relevant information and, consequently, there are several different architectural choices that can be made, each with its own advantages and drawbacks. For example, it is possible to use a multimodal embedding model to create a shared vector space for the different modalities or, instead, to ground the information in one modality only.
In this blog post, I will discuss a simple framework to extend RAG to Vision Language Models (VLMs), focusing on the Visual Question Answering task. The core idea of the method is to exploit the capabilities of the VLM to understand both text and images to generate a suitable search query that will be used to retrieve external information before answering the user’s prompt.
I will also provide a practical tutorial on how to implement the framework to empower Phi-3.5-vision with access to Wikipedia information, discussing the main points of the implementation and showing some examples. I will leave the details to the full code shared in the following GitHub repo.
RAG for Visual Question Answering
In this section, I will describe the general workflow of the framework mentioned in the introduction. For the sake of exposition, I will discuss the case where there is only one user prompt about one image. This is the case, for example, for simple Visual Question Answering (VQA) tasks. The method generalizes straightforwardly to multiple prompts and images, at the cost of a more complex pipeline. Furthermore, I will only consider the case in which the external data consists solely of textual documents. Using a multimodal embedding model for retrieval, or more generally a multimodal search engine, it is possible to include images in the external data as well.
As for the usual RAG workflow, the framework workflow can be divided into two parts: retrieval of the relevant external information and generation conditioned on the provided external data.
During the retrieval phase, the goal is to retrieve some passages from the external text documents that can provide useful information to answer the user’s prompt. In order to do so effectively, we must ensure that the retrieved passages are relevant to the provided image, the prompt, and, more importantly, the relationship between the two. In fact, even if the retrieved documents contain information about the image, they may not include the specific information needed to provide an answer to the user’s prompt. On the other hand, the prompt may only be correctly understood when paired with the image it refers to. To address these challenges, the framework discussed in this post exploits the multimodal model to generate an appropriate search query, tailored to capture the information needed to answer the user’s prompt in the context of the provided image. A search engine will use the produced query to retrieve the relevant information from the external data.
In more detail, the multimodal model receives as input both the user’s prompt and the image and it is tasked with creating a search query that is relevant to both of them as a whole. This process can be seen as a special case of a query transformation, designed to consider the multimodal nature of the problem. In fact, the model translates the user’s prompt into a search query while also considering the image it refers to.
The advantage of this approach over other methods that treat each input modality separately, such as using a multimodal embedding model for retrieval or using a generated image caption/description for semantic similarity, is that it can capture the relationships between the prompt and the image more effectively.
The flowchart for the retrieval phase is sketched below.
The generation phase is very similar to the standard text-only RAG workflow, the only difference being that the model receives the image in its context in addition to the prompt and the retrieved passages. This process is illustrated below.
Empowering Phi-3.5-vision with Wikipedia
In this section, I will provide a practical guide on how to apply the discussed framework to enhance a multimodal model by giving it access to Wikipedia. I chose the model Phi-3.5-vision, as it is a very powerful yet lightweight open-source Vision Language Model.
In this section, I will discuss only the general aspects of the implementation, leaving the details to the code provided in the GitHub repo.
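For reference, here is a minimal sketch of how the model and its processor can be loaded with Hugging Face transformers, following the model card; the exact arguments used in the repo may differ.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Load the multimodal model; trust_remote_code is needed because the model
# definition lives in the Hugging Face repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="eager",  # or "flash_attention_2" if available
    device_map="cuda",
)

# The processor handles both image preprocessing and tokenization.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4,  # recommended setting for single-image inputs
)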
Retrieval
The goal of the retrieval phase is to gather some passages from Wikipedia that can provide useful information to answer a user’s question about an image. In the code implementation, I used the Python package wikipedia to search and retrieve content from Wikipedia.
Here are the steps implemented to retrieve the relevant passages:
- Use the multimodal model to generate keywords capturing the meaning of the question about the image.
- Use the generated keywords to search relevant pages on Wikipedia.
- Split the content of each retrieved page into chunks.
- Select the top chunks by semantic textual similarity to the question and the keywords.
The first step exploits Phi-3.5-vision to generate an appropriate search query that will be used to retrieve relevant Wikipedia pages. In order to do so, I tasked Phi-3.5-vision with producing keywords relevant to the user's question and the image. I then used the built-in search function of the wikipedia package to retrieve some pages relevant to the generated keywords.
The general single-turn, single-image chat template for Phi-3.5-vision has the following structure:
<|user|>\n
<|image_1|>\n
{prompt}<|end|>\n
<|assistant|>\n
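In practice, the template does not need to be assembled by hand: it can be produced with the processor's chat template. Below is a minimal sketch reusing the model and processor loaded above (the image path and prompt are placeholders).

from PIL import Image

image = Image.open("example.jpg")  # placeholder image
prompt = "Describe the provided image."  # placeholder prompt

# The <|image_1|> placeholder tells the processor where to inject the image.
messages = [{"role": "user", "content": "<|image_1|>\n" + prompt}]

# apply_chat_template adds the <|user|>, <|end|> and <|assistant|> tags.
chat_prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The processor combines the text and the image into model inputs.
inputs = processor(chat_prompt, [image], return_tensors="pt").to(model.device)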
To generate the keywords I used the following prompt:
Your task is to write a few search keywords to find Wikipedia pages containing
the relevant information to answer the question about the provided image. The
keywords must be as specific as possible and must represent the information
that is needed to answer the question in relation to the provided image. Don't
write more than 3 search keywords.
Question: {question}
The tag {question} is substituted with the user question before inference.
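Putting the pieces together, keyword generation can be sketched as follows, reusing the model and processor loaded earlier; the generation parameters are illustrative and may differ from those used in the repo.

KEYWORD_PROMPT = (
    "Your task is to write a few search keywords to find Wikipedia pages "
    "containing the relevant information to answer the question about the "
    "provided image. The keywords must be as specific as possible and must "
    "represent the information that is needed to answer the question in "
    "relation to the provided image. Don't write more than 3 search keywords.\n"
    "Question: {question}"
)

def generate_keywords(question, image):
    """Ask Phi-3.5-vision for search keywords relevant to the question and image."""
    messages = [
        {"role": "user",
         "content": "<|image_1|>\n" + KEYWORD_PROMPT.format(question=question)},
    ]
    chat_prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(chat_prompt, [image], return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens and decode them.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()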
After the keywords have been generated, the built-in search function of the wikipedia package is used to retrieve some pages relevant to them. The selected pages are then split into passages, and the most relevant passages are selected using an embedding model together with the LangChain implementation of the FAISS vector store. I used the embedding model snowflake-arctic-embed-l to embed the concatenation of the question and the keywords, as well as the chunks of the retrieved pages.

In practice, the retrieval phase is a form of "hybrid search" consisting of two sequential steps: keyword search with the built-in search function of the wikipedia package, followed by embedding similarity retrieval with an embedding model. In this way, similarity retrieval operates on the smaller space of passages from the most relevant pages selected by keyword search, avoiding the need to build an enormous vector store with the embeddings of all of Wikipedia's content. In different settings, the retrieval phase could be remodeled to use similarity retrieval on the whole external corpus, or to combine retrieval methods in other ways.
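Here is a minimal sketch of this two-step retrieval, assuming the wikipedia package and the LangChain integrations for FAISS and Hugging Face embeddings; the chunk sizes and the number of pages and passages are illustrative.

import wikipedia
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)

def retrieve_passages(question, keywords, n_pages=3, k=7):
    """Keyword search on Wikipedia, then embedding similarity over the page chunks."""
    # Step 1: keyword search to select a handful of candidate pages.
    titles = wikipedia.search(keywords, results=n_pages)
    chunks, metadatas = [], []
    for title in titles:
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
            continue
        for chunk in splitter.split_text(page.content):
            chunks.append(chunk)
            metadatas.append({"page": page.title, "url": page.url})
    # Step 2: embed the chunks and keep those most similar to question + keywords.
    store = FAISS.from_texts(chunks, embeddings, metadatas=metadatas)
    return store.similarity_search(f"{question} {keywords}", k=k)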
Retrieving passages from multiple pages can help reduce the chance of selecting the wrong page and it can also be useful when information from multiple pages is needed to produce an answer.
Generation
In the generation phase, the user's question, the retrieved passages, and the original image are used as inputs for Phi-3.5-vision to generate an answer.
I used the following prompt in the general chat template for Phi-3.5-vision:
You are a helpful assistant tasked with answering questions about the provided
image.
Answer the following question: {question}
You can use the following passages retrieved from Wikipedia to provide your
answer:
{passages}
At generation time, the tag {question} is substituted with the user question as before, while the tag {passages} is substituted with the retrieved passages and the names of the corresponding pages, using the following format:
From Wikipedia page {page_name} : "{passage1}"\n\n
From Wikipedia page {page_name} : "{passage2}"\n\n
From Wikipedia page {page_name} : "{passage3}"\n\n
...
Providing the name of the page from which a passage is extracted can help resolve ambiguities when the passage's content is not enough to uniquely determine the subject or topic it refers to.
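The generation step can then be sketched as follows, reusing the model and processor from before and the documents returned by the retrieval sketch; the prompt wording is the one above, and the generation parameters are illustrative.

ANSWER_PROMPT = (
    "You are a helpful assistant tasked with answering questions about the "
    "provided image.\n"
    "Answer the following question: {question}\n"
    "You can use the following passages retrieved from Wikipedia to provide "
    "your answer:\n"
    "{passages}"
)

def answer_question(question, image, docs):
    """Generate an answer grounded in the retrieved passages."""
    # Format each retrieved chunk together with the name of its source page.
    passages = "\n\n".join(
        f'From Wikipedia page {d.metadata["page"]} : "{d.page_content}"' for d in docs
    )
    messages = [
        {"role": "user",
         "content": "<|image_1|>\n"
                    + ANSWER_PROMPT.format(question=question, passages=passages)},
    ]
    chat_prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(chat_prompt, [image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()

Chaining the three sketches gives the full pipeline: answer_question(question, image, retrieve_passages(question, generate_keywords(question, image))).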
Examples
In this section, I will show some examples of answers obtained with the implementation discussed in the previous section, comparing the outputs of the Vision Language Model empowered with RAG with the base version.
For each example below, I will show the image provided to the model, a block with the question and the answers of both the RAG augmented and base VLM, a block with the search query created by the model, and a block with the passages retrieved from Wikipedia.
Example 1
Question: How tall are the plants that produce this fruit?
Base VLM: Tomatoes are typically grown on plants that can reach heights of 2 to
4 feet, depending on the variety and growing conditions.
RAG VLM: The tomato plants that produce this fruit typically grow to 1-3 meters
(3-10 feet) in height.
The VLM generated the following search keywords:
Search keywords: tomato plant, tomato plant height
And retrieved the following passages:
From page Tomato (https://en.wikipedia.org/wiki/Tomato):
"throughout all seasons of the year. Tomato plants typically grow to 1–3
meters (3–10 ft) in height. They are vines that have a weak stem that sprawls
and typically needs support. Indeterminate tomato plants are perennials in
their native habitat, but are cultivated as annuals. (Determinate, or".
From page Tomato (https://en.wikipedia.org/wiki/Tomato):
"Numerous varieties of the tomato plant are widely grown in temperate climates
across the world, with greenhouses allowing for the production of tomatoes
throughout all seasons of the year. Tomato plants typically grow to 1–3 meters
(3–10 ft) in height. They are vines that have a weak stem that".
From page Early Girl (https://en.wikipedia.org/wiki/Early_Girl):
"== See also ==
List of tomato cultivars
== References ==".
From page Black Krim (https://en.wikipedia.org/wiki/Black_Krim):
"== References ==
== See also ==
List of tomato cultivars
Heirloom tomato".
From page Celebrity tomato (https://en.wikipedia.org/wiki/Celebrity_tomato):
"== See also ==
List of tomato cultivars
Tomato production in Florida
== References ==".
From page Fourth of July tomato
(https://en.wikipedia.org/wiki/Fourth_of_July_tomato):
"== See also ==
List of tomato cultivars
== References ==
== External links ==
Picture Gallery from Google Images".
From page Mortgage Lifter (https://en.wikipedia.org/wiki/Mortgage_Lifter):
"== See also ==
List of tomato cultivars
== References ==
== External links ==
Picture Gallery from Google Images".
The model augmented with RAG correctly reported the height range stated on the Wikipedia page ‘Tomato’, while the base model answered with a lower range.
Example 2
Query: How much does this species weight?
Base VLM: Polar bears can weigh between 900 to 1,600 pounds
(408 to 727 kilograms).
RAG VLM: The weight of this species, the polar bear, ranges from 300-800 kg
(660-1,760 lb) for males and 150-300 kg (330-660 lb) for females.
Search keywords: polar bear, weight
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear):
"Males are generally 200–250 cm (6.6–8.2 ft) long with a weight of 300–800 kg
(660–1,760 lb). Females are smaller at 180–200 cm (5.9–6.6 ft) with a weight
of 150–300 kg (330–660 lb). Sexual dimorphism in the species is particularly
high compared with most other mammals. Male polar bears also have".
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear):
"== Notes ==
== References ==
== Bibliography ==
== External links ==
Polar Bears International website
ARKive—images and movies of the polar bear (Ursus maritimus)".
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear):
"weight of 150–300 kg (330–660 lb). Sexual dimorphism in the species is
particularly high compared with most other mammals. Male polar bears also have
proportionally larger heads than females. The weight of polar bears fluctuates
during the year, as they can bulk up on fat and increase their mass by".
From page List of ursids (https://en.wikipedia.org/wiki/List_of_ursids):
"long, plus a 3–20 cm (1–8 in) tail, though the polar bear is 2.2–2.44 m
(7–8 ft) long, and some subspecies of brown bear can be up to 2.8 m (9 ft).
Weights range greatly from the sun bear, which can be as low as 35 kg (77 lb),
to the polar bear, which can be as high as 726 kg (1,600 lb). Population".
From page Knut (polar bear) (https://en.wikipedia.org/wiki/Knut_(polar_bear)):
"== See also ==
Binky (polar bear)
List of individual bears
== References ==
== External links ==".
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear):
"The polar bear is the largest living species of bear and land carnivore,
though some brown bear subspecies like the Kodiak bear can rival it in size.
Males are generally 200–250 cm (6.6–8.2 ft) long with a weight of 300–800 kg
(660–1,760 lb). Females are smaller at 180–200 cm (5.9–6.6 ft) with a".
From page Subspecies of brown bear
(https://en.wikipedia.org/wiki/Subspecies_of_brown_bear):
"males, potentially tripling their average weight within three years' time,
and can expect to average between 360 and 545 kg (794 and 1,202 lb).
The reported mean adult body masses for both sexes of the polar bear are very
similar to the peninsular giant and Kodiak bears. Due to their roughly".
Again, the RAG model answered correctly with the weights reported in the Wikipedia page ‘Polar bear’, while the base model answered with a slightly different range of weights.
Example 3
In my previous blog post Create your Vision Chat Assistant with LLaVA, I showed how the LLaVA-1.5 model can be induced to hallucinate when the user’s prompt asks a misleading question about the image. Let’s see what happens in that example using the Phi-3.5-vision model with and without the RAG enhancement. In particular, the model could now be more prone to hallucination if any misleading passages were retrieved.
Question: Tell me the species of the fish in this image
Base VLM: The image does not contain any fish or aquatic environments, so
I cannot provide a species of fish.
RAG VLM: The image does not contain any fish or aquatic environments. It is
a photograph of a desert landscape with sand dunes.
Therefore, I cannot provide the species of fish based on this image.
Neither the base nor the RAG model was induced to hallucinate by the misleading prompt, suggesting that Phi-3.5-vision may be more robust than LLaVA-1.5 in this respect. In particular, the RAG VLM did not retrieve any page from Wikipedia: when asked to generate keywords, the model answered
There are no fish visible in the image provided. The image shows a desert
landscape with sand dunes. Therefore, the search keywords related to the image
content would be more appropriate for a desert or dune-related topic.
This search query did not produce any results with the wikipedia package search function.
Note: in earlier experiments on this example conducted with Idefics 2, both the base VLM and the RAG VLM were affected by hallucinations, answering with the names of fish species not present in the image. In particular, the RAG model retrieved misleading passages related to various fish species. The phenomenon of hallucination is model-dependent, and the RAG framework may not be enough to eliminate it in all instances.
Limitations
The main limitation of the framework is that it requires the model to have at least a general knowledge of the image and the question provided (while the exact details will be retrieved with RAG) in order to produce a useful search query. In some domain-specific settings, it may be necessary to fine-tune the model for search query generation.
The framework discussed in this post also suffers from the limitations common to traditional text-only RAG systems. A common source of errors in the generation is the incorrect retrieval of external information. In some cases, the model is not able to correctly identify the failure of the retrieval phase and it produces an answer based on incorrect, irrelevant, or misleading information. In particular, the presence of irrelevant passages can sometimes induce hallucinations that the model wouldn’t have suffered otherwise.
Another common source of errors is ignoring the retrieved passages or quoting them incorrectly. This can sometimes lead to subtle differences between the retrieved passages and the generated answer that may be difficult to spot at first glance.
Example
In this example, I will show an instance in which the model incorrectly quoted the retrieved passages.
Query: What's the average weight of this animal?
Base VLM: The average male lion can weigh between 160 to 310 kilograms, while
females typically weigh between 120 to 180 kilograms.
RAG VLM: Based on the information provided in the passages, the average weight
of a male lion is 187.5 kg (386 lb) in Southern Africa and 174.9 kg (386 lb) in
East Africa.
Search keywords: lion, weight, average weight
From page American lion (https://en.wikipedia.org/wiki/American_lion):
"(518lbs-1153lbs) in males and 175kg to 365 kg (385lbs-805lbs) for females.
In 2008, the American lion was estimated to weigh up to 420 kg (930 lb). A
study in 2009 showed an average weight of 256 kg (564 lb) for males and 351 kg
(774 lb) for the largest specimen analyzed.".
From page Lion (https://en.wikipedia.org/wiki/Lion):
"=== Size ===
Among felids, the lion is second only to the tiger in size. The size and weight
of adult lions vary across its range and habitats. Accounts of a few
individuals that were larger than average exist from Africa and India.
=== Mane ===".
From page Panthera leo leo (https://en.wikipedia.org/wiki/Panthera_leo_leo):
"The lion's fur varies in colour from light buff to dark brown. It has rounded
ears and a black tail tuft. Average head-to-body length of male lions is
2.47–2.84 m (8 ft 1 in – 9 ft 4 in) with a weight of 148.2–190.9 kg
(327–421 lb). Females are smaller and less heavy. Zoological lion specimens".
From page Panthera leo melanochaita
(https://en.wikipedia.org/wiki/Panthera_leo_melanochaita):
"Average head-to-body length of male lions is 2.47–2.84 m (8 ft 1 in – 9 ft
4 in) with a weight ranging from 150–225 kg (331–496 lb) averaging 187.5 kg
(413 lb) in Southern Africa and 145.4–204.7 kg (321–451 lb) averaging 174.9 kg
(386 lb) in East Africa. Females average 83–165 kg (183–364 lb) in".
From page Asiatic lion (https://en.wikipedia.org/wiki/Asiatic_lion):
"An adult male Asiatic lion weighs 160.1 kg (353 lb) on average with the
limit being 190 kg (420 lb); a wild female weighs 100 to 130 kg (220 to 285 lb)
.[1]".
From page List of largest mammals
(https://en.wikipedia.org/wiki/List_of_largest_mammals):
"== See also ==
List of largest land carnivorans
Largest organisms
Largest prehistoric animals
List of largest birds
List of largest cats
List of largest fish
List of largest plants
List of largest reptiles
List of largest insects
List of heaviest land mammals
Smallest organisms
== Notes ==".
From page Ancient Mesopotamian units of measurement
(https://en.wikipedia.org/wiki/Ancient_Mesopotamian_units_of_measurement):
"== See also ==
Assyrian lion weights
Babylonian mathematics
Historical weights and measures
Weights and measures
== References ==
=== Citations ===".
While the answer stating the weights in kilograms is correct, the model gave a wrong conversion to pounds for the average weight of male lions in Southern Africa, reporting 386 lb instead of the 413 lb stated in the corresponding passage extracted from Wikipedia.
Conclusion
In this post, I illustrated a simple framework that can be used to enhance Visual Question Answering with Retrieval Augmented Generation capabilities. The core idea of the method is to exploit the Vision Language Model to generate queries that will then be used by a standard RAG pipeline to retrieve information from an external corpus. I also presented an implementation of the framework that grants Phi-3.5-vision access to Wikipedia. The full code for this implementation is available in the GitHub repo.
While the discussed method is simple and effective, it is not immune to the limitations common to all RAG systems, and to new challenges posed by the complexity of the multimodal setting. On one hand, retrieving the relevant information for some specific questions can be difficult. Since the search queries are created with the Vision Language Model, the retrieval accuracy is further limited by the ability of the VLM to recognize the image and to understand the details the question refers to. On the other hand, even after the correct information has been retrieved, there is no guarantee that the model won’t hallucinate while producing the answer. In the multimodal setting, this could be exacerbated by the fact that the model has to associate the correct meaning to both the text and the image and also understand the interactions between them.
The framework I discussed in this post is a straightforward extension of the vanilla RAG pipeline, adapted to the Visual Question Answering task. Standard advanced RAG techniques, such as query transformation, re-ranking of the retrieved passages, and Hypothetical Document Embeddings (HyDE), can easily be included to increase performance. Furthermore, using a multimodal embedding model (like CLIP) opens up new opportunities: the image embeddings can be used when searching by similarity for relevant text documents, and it is also possible to retrieve images similar and/or relevant to the original image and the question. The latter could be useful, for example, when a different point of view of the image is needed to answer the prompt.

Another direction for improvement is fine-tuning to obtain more specialized and effective models. Given the role of the multimodal model in both the retrieval and the generation process, two different fine-tuning processes can be performed: one to obtain a model specialized in writing search queries, and one to increase the model's performance on the grounded generation task.

Finally, the framework could be incorporated into a specialized agentic system to further boost its performance and robustness. An agentic system could, for example, iteratively refine the generated query by giving feedback on the retrieved passages, ask follow-up questions, or search for information about particular details of the image only when needed. It could also handle multi-hop question answering for more complicated questions, and decide when retrieval of further external information is needed to answer the user's query.
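As an illustration of the multimodal-embedding direction, here is a minimal sketch of computing CLIP embeddings for an image and a few candidate texts with transformers; the model choice and usage are assumptions for illustration, not part of the implementation above.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
texts = ["a tomato plant in a greenhouse", "a polar bear on sea ice"]

inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize and compare: the image embedding (possibly combined with the
# question embedding) could be used to query a shared image-text vector store.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = image_emb @ text_emb.T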
I’d be happy to discuss further improvements and/or different approaches to multimodal RAG in the comment section!