Category: Artificial Intelligence

  • How Can We Continually Adapt Vision-Language Models?

    Alicja Dobrzeniecka

    Exploring continual learning strategies for CLIP

    Image by the author created in Midjourney

    There is currently a growing interest in the study and application of Large Language Models. However, these models can only process textual data, which limits their usefulness for some applications. Humans are capable of processing information across multiple modalities, such as written and spoken language, and visual understanding of the reality around us. We would expect models to be capable of similar processing.

    Vision-Language models can address both textual and visual data, which has a wide range of use cases such as image analysis (e.g. medical images), object recognition and better scene understanding (e.g. for self-driving cars), generating captions for the images, responding to the visual questions, chatting about images, and more…

    Unfortunately, multi-modal models face the same challenges as unimodal ones. Once trained, they can become outdated over time as new data samples arrive or the data distribution changes.

    In my last article I introduced the Continual Learning (CL) approach to AI models in general. Continual Learning tries to find ways to continually train models, which may be a more sustainable solution for the future. In this article, I want to explore the possibilities of applying CL to Vision-Language models (VLMs) — specifically the Contrastive Language-Image Pretraining (CLIP) model.

    But what is CLIP?

    Contrastive Language-Image Pretraining (CLIP) was introduced by the OpenAI in 2021 in the Learning Transferable Visual Models From Natural Language Supervision paper [1].

    The goal of the CLIP model is to understand the relation between text and an image. If you input it a piece of text it should return the most relevant image in a given set of images for it. Likewise if you put in the model an image it should give you the most fitting text from a set of available texts.

    CLIP was trained on a large dataset of text-image pairs. Contrastive learning was used to bring matching text-image pairs closer together in the embedding space and to move non-matching pairs away from each other. This learned shared embedding space is then used during inference to understand the relationship between text and images. If you want to know more about CLIP, I recommend the following article, which describes it in detail.

    Why do we need Continual Learning for Vision-Language models?

    Large foundation models can become obsolete over time due to shifts in distribution or the arrival of new data samples. Re-training such models is expensive and time consuming. The authors of the TiC-CLIP paper [7] show that current evaluation practices often fail to capture the difference in performance when considering time-evolving data.

    In Figure 1 you can see that if we compare OpenAI models trained before 2020 and OpenCLIP models trained before 2022, although there is not much difference between their robustness on Imagenet (left image), there is a performance gap when compared on retrieval tasks from 2014–2016 and 2021–2022 (right image), indicating that OpenAI models have less zero-shot robustness with time-evolving data [7].

    Fig. 1. Image from the paper TiC-CLIP: Continual Training of Clip Models [7].

    In addition, Continual Learning may be a natural choice for some use cases such as Online Lifelong Learning (OLL) [8] where data comes from continuous and non-stationary data streams and evolves with time.

    Finally, as pointed out in [4], CLIP shows remarkable zero-shot capabilities, but for some domains it may struggle to achieve good performance due to insufficient data for some categories during pre-training.

    Challenges

    As some of the current state-of-the-art Vision-Language models require more and more computational time and resources, finding a way to continually adapt them without re-training seems to be crucial. However, there are some challenges in continually adapting such models:

    • catastrophic forgetting — learning new tasks can damage the performance on the old tasks.
    • losing zero-shot capability — pre-trained models can display zero-shot behaviour meaning that they can perform a task for which they have not received training data, i.e. classify a class of images without seeing them during training. This ability can be lost when training continually.
    • misalignment between text and image representations — as noted by the authors of [12], during Continual Learning for CLIP, there may be a deterioration in the alignment of the multimodal representation space, which can lead to performance degradation in the long run.

    Continual Learning Methods for CLIP

    There is an ongoing research on improving the continual aspect of multi-modal models. Below are some of the existing strategies and use cases:

    1. Mixture of Experts (MoE)
    • To continually train the CLIP, the authors of [2] propose MoE approach using task-specific adapters. They build a dynamic extension architecture on top of a frozen CLIP model.
    • The idea here is to add new adapters as new tasks are trained. At the same time, the Distribution Discriminative Auto-Selector is trained so that later, during the inference phase, the model can automatically choose whether the test data should go to the MoE adapters or to the pre-trained CLIP for zero-shot detection.

    2. CoLeCLIP

    • The authors of [4] focus on the problem of Continual Learning for Vision-Language models in open domains — where we may have datasets from diverse seen and unseen domains with novel classes.
    • Addressing open domain challenges is particularly important for use cases such as AI assistants, autonomous driving systems and robotics, as these models operate in complex and changing environments [4].
    • CoLeCLIP is based on CLIP but adjusted for open-domain problems.
    • In CoLeCLIP an external laernable Parameter-Efficient Fine-Tuning (PEFT) module per task is attached to the frozen text encoder of CLIP to learn the text embeddings of the classes [4].

    3. Continual Language Learning (CLL)

    • The authors of [3] noted that current pre-trained Vision-Language models often only support English. At the same time popular methods for creating multilingual models are expensive and require large amounts of data.
    • In their paper, they propose to extend language capability by using CLL, where linguistic knowledge is updated incrementally.
    • CLL-CLIP uses an expandable embedding layer to store linguistic differences. It trains only token embeddings and is optimised for learning alignment between images and multilingual text [3].
    • The authors also propose a novel approach to ensure that the distribution of all token embeddings is identical during initialisation and later regularised during training. You can see a visualisation of this process in Figure 2 from their paper.
    Fig. 2. Image from the paper Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning [3].

    4. Symmetric Image-Text tuning strategy (SIT)

    • In [8] the authors observe that there occurs asymetry between text and image during Parameter-Efficient Tuning (PET) for their Online Lifelong Learning scenario which may lead to catastrophic forgetting.
    • They propose to use the SIT strategy to mitigate this problem. This approach matches images and class labels within the current batch only during online learning.
    • The goal is to preserve the generalisation ability of CLIP while improving its performance on a specific downstream task or dataset, without introducing asymmetry between the encoders.

    Evaluation of the Continual Learning models

    The evaluation standards for CL appear to be still a work in progress. Many of the existing benchmarks for evaluating the effectiveness of CL models do not take the time factor into account when constructing data sets. As mentioned by [7], the performance gap may sometimes only become visible when we recreate the time-evolving setup for the test data.

    In addition, many of the existing benchmarks for Vision-Language models focus only on the single-image input, without measuring multi-image understanding, which may be critical in some applications. The authors of [5] develop a benchmark for multi-image evaluation that allows a more fine-grained assessment of the limitations and capabilities of current state-of-the-art models.

    Continual Learning does not solve all the problems…

    Visual-Language models like CLIP have their shortcomings. In [6], the authors explored the gap between CLIP’s visual embedding space and purely visual self-supervised learning. They investigated false matches in the embedding space, where images have similar encoding when they should not.

    From their results it can be concluded that if a pre-trained model has a weakness, it can be propagated when the model is adapted. Learning visual representations remains an open challenge, and vision models may become a bottleneck in multimodal systems, as scaling alone does not solve the built-in limitations of models such as CLIP. [6]

    Conclusion

    This article explores the opportunities and challenges of applying Continual Learning to Vision-Language models, focusing on the CLIP model. Hopefully this article has given you a first impression of what is possible, and that while Continual Learning seems to be a good direction for the future of AI models, there is still a lot of work to be done to make it fully usable.

    If you have any questions or comments, please feel free to share them in the comments section.

    Until next time!

    Image by the author generated in Midjourney.

    References

    [1] Radford, A., Kim, J., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (pp. 8748–8763). PMLR.

    [2] Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, & You He. (2024). Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters.

    [3] Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, & Yuexian Zou. (2024). Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning.

    [4] Yukun Li, Guansong Pang, Wei Suo, Chenchen Jing, Yuling Xi, Lingqiao Liu, Hao Chen, Guoqiang Liang, & Peng Wang. (2024). CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning.

    [5] Bingchen Zhao, Yongshuo Zong, Letian Zhang, & Timothy Hospedales. (2024). Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning.

    [6] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, & Saining Xie. (2024). Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs.

    [7] Saurabh Garg, Hadi Pour Ansari, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Oncel Tuzel, Vaishaal Shankar, & Fartash Faghri (2023). TiC-CLIP: Continual Training of CLIP Models. In NeurIPS Workshop.

    [8] Leyuan Wang, Liuyu Xiang, Yujie Wei, Yunlong Wang, & Zhaofeng He. (2024). CLIP model is an Efficient Online Lifelong Learner.

    [9] Vishal Thengane, Salman Khan, Munawar Hayat, & Fahad Khan. (2023). CLIP model is an Efficient Continual Learner.

    [10] Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, & Haoxuan Ding. (2022). Don’t Stop Learning: Towards Continual Learning for the CLIP Model.

    [11] Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, & Aman Chadha. (2024). Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions.

    [12] Ni, Z., Wei, L., Tang, S., Zhuang, Y., & Tian, Q. (2023). Continual vision-language representation learning with off-diagonal information. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org.


    How Can We Continually Adapt Vision-Language Models? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    How Can We Continually Adapt Vision-Language Models?

    Go Here to Read this Fast! How Can We Continually Adapt Vision-Language Models?

  • Secure RAG applications using prompt engineering on Amazon Bedrock

    Secure RAG applications using prompt engineering on Amazon Bedrock

    Andrei Ivanovic

    In this post, we discuss existing prompt-level threats and outline several security guardrails for mitigating prompt-level threats. For our example, we work with Anthropic Claude on Amazon Bedrock, implementing prompt templates that allow us to enforce guardrails against common security threats such as prompt injection. These templates are compatible with and can be modified for other LLMs.

    Originally appeared here:
    Secure RAG applications using prompt engineering on Amazon Bedrock

    Go Here to Read this Fast! Secure RAG applications using prompt engineering on Amazon Bedrock

  • Get the most from Amazon Titan Text Premier

    Get the most from Amazon Titan Text Premier

    Anupam Dewan

    In this post, we introduce the new Amazon Titan Text Premier model, specifically optimized for enterprise use cases, such as building Retrieval Augmented Generation (RAG) and agent-based applications. Such integrations enable advanced applications like building interactive AI assistants that use enterprise APIs and interact with your propriety documents.

    Originally appeared here:
    Get the most from Amazon Titan Text Premier

    Go Here to Read this Fast! Get the most from Amazon Titan Text Premier

  • Advanced Retrieval Techniques in a World of 2M Token Context Windows: Part 2 on Re-rankers

    Advanced Retrieval Techniques in a World of 2M Token Context Windows: Part 2 on Re-rankers

    Meghan Heintz

    Visualising AI project launched by Google DeepMind. From Unsplash image.

    Exploring RAG techniques to improve retrieval accuracy

    In Part 1, we dove into improving RAG (retrieval augmented generation) outcomes by re-writing queries before performing retrieval. This time we will learn about how re-ranking results from vector database retrievals helps performance.

    While I highly recommend experimenting with promising proprietary options like Cohere’s Re-Rank 3, we’ll focus mainly on understanding what researchers have shared on the topic.

    Re-Ranking, what’s the point?

    First of all, why rerank at all? Results from vector databases return “similarity” scores based on the embeddings of the query and document. These scores can already be used to sort the results and since this is already a semantic similarity scoring of the document and query, why would we need another step?

    There are a few reasons why we would take this approach:

    • Document embeddings are “lossy”. Documents are compressed in vector format before seeing the query, which means the document vector is not tailored to the query vector. Re-ranking allows us to better understand the document’s meaning specific to the query.
    • Two-stage systems have become standard in traditional search and recommender systems. They offer improvements in scalability, flexibility, and accuracy. Retrieval models are very fast, whereas ranking models are slow. By building a hybrid system, we can balance the speed and accuracy trade-offs between each stage.
    • Re-ranking allows us to reduce the number of documents we stuff into the context window which a) reduces costs and b) reduces the chances of relevant data being “lost in the haystack”.

    Traditional Methods of Re-Ranking

    Informational retrieval is not a new field. Before LLMs employed RAG to improve generation, search engines used re-ranking methods to improve search results. Two popular methodologies are TF-IDF (term frequency–inverse document frequency) and BM25 (Best Match 25).

    Karen Spärck Jones conceived of the concept of IDF (of TF-IDF), inverse document frequency, as a statistical interpretation of term-specificity in the 1970s. The general concept is that the specificity of a term can be quantified as an inverse function of the number of documents in which it occurs. A toy example is the frequency of terms in Shakespearean plays. Because the term “Romeo” only appears in one play, we believe it is more informative to the subject of the play than the word “sweet” because that term occurs in all plays.

    BM25 or Okapi BM25 was developed by both Karen Spärck Jones and Stephen Robertson as an improvement to TF-IDF. BM25 is a “bag-of-words” retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. This method expands on TF-IDF in a few important ways:

    • BM25 uses a saturation function where the importance of a term increases with frequency but with diminishing returns. (Side note: This was important for protecting accuracy when search engine optimization (SEO) became higher stakes. You can’t just spam your keyword with this improvement.)
    • BM25 includes document length normalization to ensure that longer documents are not unfairly advantaged. (Another improvement to thwart would-be SEO gamers.)

    Both of these methods can be used to re-rank results from vector databases before documents are used in the context of generation. This would be called “feature” based re-ranking.

    Neural Re-Ranking Models

    Something you should notice about the traditional methods is that they focus on exact term matches. These methods will struggle when documents use semantically similar but different terms. Neural re-ranking methods like SBERT (Sentence Transformers) seek to overcome this limitation.

    SBERT is a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model with a siamese / triplet network architecture which greatly improves the computation efficiency and latency for calculating sentence similarity. Transformers like SBERT (Sentence-BERT) use the context in which terms are used, allowing the model to handle synonyms and words with multiple meanings.

    SBERT architecture at inference, for example, to compute similarity scores. This architecture is also used with the regression objective function. From Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks by Nils Reimers and Iryna Gurevych

    SBERT tends to perform better for semantic similarity ranking due to its specialization. However, using SBERT comes with the downside that you will need to manage the models locally versus calling an API, such as with OpenAI’s embedding models. Pick your poison wisely!

    Cross-Encoder Re-Ranking

    The top K results from a vector database search are the most similar document vectors compared to the query vector. Another way of describing this ranking method is to say it is a “bi-encoder” ranking. Vectors are calculated up front and approximate nearest neighbors algorithms (ANNs) select the most similar documents making this a highly efficient ranking method. But that efficiency comes at the expense of some accuracy.

    In contrast, cross-encoders use a classification mechanism on data pairs to calculate similarity. This means you need a pair for each document and query. This approach can yield much more accurate results but it’s highly inefficient. That is why cross-encoders are best implemented through a hybrid approach where the number of documents is first pruned using a “bi-encoder” top K result before ranking with a cross-encoder. You can read more about using bi-encoders and cross-encoders together in the SBERT documentation.

    Information Retrieval / Question Answering Retrieval Diagram explaining how to use Bi-Encoders and Cross-Encoders together from the SBERT documentation, full citation below.
    @inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
    }

    Prompt-based Re-Ranking (PBR)

    Until now, we have focused on using vectors or other numeric methods to rerank our RAG results. But does that mean we are underleveraging the LLM? Feeding the document and the query back to the LLM for scoring can be an effective way to score the document; there is approximately no information loss when you take this approach. If the LLM is prompted to return only a single token (the score), the latency incurred is often acceptable (although this is one of the more expensive approaches to scale). This is considered to be “zero-shot” re-ranking and research is still limited on this topic but we know it must be sensitive to the quality of the prompt.

    Another version of (PBR) is the DSLR Framework (Document Refinement with Sentence-Level Re-ranking and Reconstruction). DSLR proposes an unsupervised method that decomposes retrieved documents into sentences, re-ranks them based on relevance, and reconstructs them into coherent passages before passing them to the LLM. This approach contrasts with traditional methods that rely on fixed-size passages, which may include redundant or irrelevant information. Pruning non-relevant sentences before generating a response can reduce hallucinations and improve overall accuracy. Below you can see an example of how DSLR refinement improves the LLMs response.

    Example DSLR refinement from DSLR: Document Refinement with Sentence-Level Re-ranking and Reconstruction to Enhance Retrieval-Augmented Generation by Taeho Hwang, Soyeong Jeong, Sukmin Cho, SeungYoon Han, Jong C. Park at the School of Computing Korea Advanced Institute of Science and Technology

    Graph-based Reranking

    Sometimes the answer is not going to fit cleanly inside a single document chunk. Books and papers are written with the expectation that they’ll be read linearly or at least the reader will be able to easily refer back to earlier passages. For example, you could be asked to refer back to an earlier chapter on BM25 when reading about SBERT. In a basic RAG application, this would be impossible because your retrieved document would have no connections to the previous chapters.

    G-RAG, an approach proposed by researchers at Google and UCLA, hopes to alleviate this issue. G-RAG is a re-ranker that leverages graph neural networks (GNNs) to consider connections between retrieved documents. Documents are represented as nodes and edges are shared concepts between documents. These graphs are generated as Abstract Meaning Representation (AMR) Graphs which can be created with tools like https://github.com/goodbai-nlp/AMRBART (MIT License).

    Experiments with Natural Question (NQ) and TriviaQA (TQA) datasets showed this approach made improvements to Mean Tied Reciprocal Ranking (MTRR) and Tied Mean Hits@10 (TMHits@10) over other state-of-the-art approaches.

    From UCLA and Google researchers: G-RAG uses two graphs for re-ranking documents: The Abstract Meaning Representation (AMR) graph is used as feature for the document-level graph. Document graph is then used for document reranking

    Conclusion

    I hope you’ve enjoyed this overview of techniques you can use to improve the performance of your RAG applications. I look forward to continued advancements in this field. I know there will be many considering the blistering pace of research at the moment.

    Let me know in the comments section if you have favorite re-ranking methods not covered in this article.


    Advanced Retrieval Techniques in a World of 2M Token Context Windows: Part 2 on Re-rankers was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Advanced Retrieval Techniques in a World of 2M Token Context Windows: Part 2 on Re-rankers

    Go Here to Read this Fast! Advanced Retrieval Techniques in a World of 2M Token Context Windows: Part 2 on Re-rankers

  • GenASL: Generative AI-powered American Sign Language avatars

    GenASL: Generative AI-powered American Sign Language avatars

    Alak Eswaradass

    In this post, we dive into the architecture and implementation details of GenASL, which uses AWS generative AI capabilities to create human-like ASL avatar videos. GenASL is a solution that translates speech or text into expressive ASL avatar animations, bridging the gap between spoken and written language and sign language.

    Originally appeared here:
    GenASL: Generative AI-powered American Sign Language avatars

    Go Here to Read this Fast! GenASL: Generative AI-powered American Sign Language avatars

  • AWS empowers sales teams using generative AI solution built on Amazon Bedrock

    AWS empowers sales teams using generative AI solution built on Amazon Bedrock

    Rupa Boddu

    Through this series of posts, we share our generative AI journey and use cases, detailing the architecture, AWS services used, lessons learned, and the impact of these solutions on our teams and customers. In this first post, we explore Account Summaries, one of our initial production use cases built on Amazon Bedrock. Account Summaries equips our teams to be better prepared for customer engagements. It combines information from various sources into comprehensive, on-demand summaries available in our CRM or proactively delivered based on upcoming meetings. From the period of September 2023 to March 2024, sellers leveraging GenAI Account Summaries saw a 4.9% increase in value of opportunities created.

    Originally appeared here:
    AWS empowers sales teams using generative AI solution built on Amazon Bedrock

    Go Here to Read this Fast! AWS empowers sales teams using generative AI solution built on Amazon Bedrock

  • How to Network as a Data Scientist

    How to Network as a Data Scientist

    Haden Pelletier

    Times are changing —  if you want to get into data science, you have to network like you mean it.

    Originally appeared here:
    How to Network as a Data Scientist

    Go Here to Read this Fast! How to Network as a Data Scientist