Category: Artificial Intelligence

  • MobileDiffusion: Rapid text-to-image generation on-device


    Google AI

    Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advancements in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.

    To that end, in “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, we introduce a novel approach with the potential for rapid text-to-image generation on-device. MobileDiffusion is an efficient latent diffusion model specifically designed for mobile devices. We also adopt DiffusionGAN to achieve one-step sampling during inference, fine-tuning a pre-trained diffusion model while leveraging a GAN to model the denoising step. We have tested MobileDiffusion on premium iOS and Android devices, where it can generate a 512×512 high-quality image in half a second. Its comparatively small size of just 520M parameters makes it uniquely suited for mobile deployment.

    Rapid text-to-image generation on-device.

    Background

    The relative inefficiency of text-to-image diffusion models arises from two primary challenges. First, the inherent design of diffusion models requires iterative denoising to generate images, necessitating multiple evaluations of the model. Second, the complexity of the network architecture in text-to-image diffusion models involves a substantial number of parameters, regularly reaching into the billions and resulting in computationally expensive evaluations. As a result, despite the potential benefits of deploying generative models on mobile devices, such as enhancing user experience and addressing emerging privacy concerns, such deployment remains relatively unexplored in the current literature.
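
    The first challenge can be made concrete with a short sketch: each sampling step costs one full network evaluation, so latency scales directly with the step count. The denoiser below is a stand-in stub, not a real diffusion UNet:

```python
# Toy sketch of diffusion sampling: the model is evaluated once per
# denoising step, so latency scales with the number of steps (NFEs).
# The "model" here is a stand-in stub, not a real network.

def sample(num_steps, model):
    x = 1.0  # stand-in for the initial noise tensor
    for step in reversed(range(num_steps)):
        x = model(x, step)  # one network evaluation per step
    return x

calls = 0

def stub_denoiser(x, step):
    global calls
    calls += 1
    return x * 0.9  # pretend to remove a bit of noise

sample(50, stub_denoiser)   # a typical many-step sampler
many_step_nfes = calls

calls = 0
sample(1, stub_denoiser)    # one-step sampling, as targeted by DiffusionGAN
one_step_nfes = calls

print(many_step_nfes, one_step_nfes)  # 50 vs. 1 network evaluations
```

    Cutting the step count from 50 to 1 removes 49 full network evaluations per image, which is why the NFE-reduction line of work described next matters so much on mobile hardware.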

    The optimization of inference efficiency in text-to-image diffusion models has been an active research area. Previous studies predominantly concentrate on addressing the first challenge, seeking to reduce the number of function evaluations (NFEs). Leveraging advanced numerical solvers (e.g., DPM) or distillation techniques (e.g., progressive distillation, consistency distillation), the number of necessary sampling steps has been reduced from several hundred to single digits. Some recent techniques, like DiffusionGAN and Adversarial Diffusion Distillation, reduce it to a single step.

    However, on mobile devices, even a small number of evaluation steps can be slow due to the complexity of the model architecture. Thus far, the architectural efficiency of text-to-image diffusion models has received comparatively less attention. A handful of earlier works briefly touch upon this matter, typically by removing redundant neural network blocks (e.g., SnapFusion). However, these efforts lack a comprehensive analysis of each component within the model architecture, thereby falling short of providing a holistic guide for designing highly efficient architectures.

    MobileDiffusion

    Effectively overcoming the challenges imposed by the limited computational power of mobile devices requires an in-depth and holistic exploration of the model’s architectural efficiency. In pursuit of this objective, our research undertakes a detailed examination of each constituent and computational operation within Stable Diffusion’s UNet architecture. We present a comprehensive guide for crafting highly efficient text-to-image diffusion models, culminating in MobileDiffusion.

    The design of MobileDiffusion follows that of latent diffusion models. It contains three components: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, we use CLIP ViT-L/14, a small model (125M parameters) well suited for mobile. We then turn our focus to the diffusion UNet and image decoder.

    Diffusion UNet

    As illustrated in the figure below, diffusion UNets commonly interleave transformer blocks and convolution blocks. We conduct a comprehensive investigation of these two fundamental building blocks. Throughout the study, we control the training pipeline (e.g., data, optimizer) to study the effects of different architectures.

    In classic text-to-image diffusion models, a transformer block consists of a self-attention layer (SA) for modeling long-range dependencies among visual features, a cross-attention layer (CA) for capturing interactions between text conditioning and visual features, and a feed-forward layer (FF) for post-processing the output of the attention layers. These transformer blocks play a pivotal role in text-to-image diffusion models, serving as the primary components responsible for text comprehension. However, they also pose a significant efficiency challenge, given the computational expense of the attention operation, which is quadratic in the sequence length. We follow the idea of the UViT architecture, which places more transformer blocks at the bottleneck of the UNet. This design choice is motivated by the fact that attention computation is less resource-intensive at the bottleneck due to its lower dimensionality.

    Our UNet architecture incorporates more transformers in the middle, and skips self-attention (SA) layers at higher resolutions.
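
    The quadratic cost of attention, and why it shrinks at the bottleneck, can be checked with simple FLOP arithmetic. The resolutions and channel widths below are illustrative stand-ins, not MobileDiffusion’s actual configuration:

```python
# Why attention is cheaper at the UNet bottleneck: self-attention cost
# grows quadratically with the number of spatial tokens. Channel widths
# here are illustrative, not the model's actual configuration.

def self_attention_flops(h, w, d):
    n = h * w                     # sequence length = spatial tokens
    qkv = 3 * n * d * d           # Q, K, V projections
    scores = n * n * d            # QK^T similarity scores
    weighted = n * n * d          # attention-weighted sum of V
    out = n * d * d               # output projection
    return 2 * (qkv + scores + weighted + out)  # 2 FLOPs per multiply-add

high_res = self_attention_flops(64, 64, 320)      # 64x64 feature map
bottleneck = self_attention_flops(16, 16, 1280)   # 16x16 bottleneck

print(f"high-res / bottleneck cost ratio: {high_res / bottleneck:.1f}x")
```

    Even though the bottleneck uses 4× wider channels here, its 16× fewer tokens make each attention layer several times cheaper than one placed at high resolution, which is what makes the UViT-style placement attractive.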

    Convolution blocks, in particular ResNet blocks, are deployed at each level of the UNet. While these blocks are instrumental for feature extraction and information flow, the associated computational costs, especially at high-resolution levels, can be substantial. One proven approach in this context is separable convolution. We observed that replacing regular convolution layers with lightweight separable convolution layers in the deeper segments of the UNet yields similar performance.
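
    The savings from this substitution are easy to quantify with a parameter count. The channel width below is illustrative, not taken from the actual architecture:

```python
# Parameter count of a regular 3x3 convolution vs. a depthwise-separable
# one (depthwise 3x3 followed by pointwise 1x1), ignoring biases.
# The channel count is an illustrative stand-in.

def regular_conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    depthwise = k * k * c_in        # one k x k filter per input channel
    pointwise = c_in * c_out        # 1x1 conv to mix channels
    return depthwise + pointwise

c_in = c_out = 640
regular = regular_conv_params(c_in, c_out)
separable = separable_conv_params(c_in, c_out)
print(f"regular: {regular:,}  separable: {separable:,} "
      f"({regular / separable:.1f}x fewer params)")
```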

    In the figure below, we compare the UNets of several diffusion models. Our MobileDiffusion exhibits superior efficiency in terms of FLOPs (floating-point operations) and number of parameters.

    Comparison of some diffusion UNets.

    Image decoder

    In addition to the UNet, we also optimized the image decoder. We trained a variational autoencoder (VAE) to encode an RGB image into an 8-channel latent variable with spatial size 8× smaller than the image; the decoder maps a latent variable back to an image at 8× the spatial resolution. To further enhance efficiency, we designed a lightweight decoder architecture by pruning the width and depth of the original. The resulting lightweight decoder yields a significant performance boost, with nearly 50% latency improvement and better quality. For more details, please refer to our paper.

    VAE reconstruction. Our VAE decoders have better visual quality than SD (Stable Diffusion).

    Decoder     #Params (M)   PSNR↑   SSIM↑   LPIPS↓
    SD               49.5      26.7    0.76    0.037
    Ours             39.3      30.0    0.83    0.032
    Ours-Lite         9.8      30.2    0.84    0.032

    Quality evaluation of VAE decoders. Our lite decoder is much smaller than SD, with better quality metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).
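
    The latent-space bookkeeping described above can be sketched with simple shape arithmetic, using the stated 512×512 image size, 8× downsampling factor, and 8 latent channels:

```python
# Shape bookkeeping for the latent space: a 512x512 RGB image is encoded
# to an 8-channel latent with 8x smaller spatial size, and the decoder
# maps it back to image resolution.

image_shape = (512, 512, 3)          # H, W, RGB channels
downsample, latent_channels = 8, 8

latent_shape = (image_shape[0] // downsample,
                image_shape[1] // downsample,
                latent_channels)     # spatial dims shrink 8x

image_elems = 512 * 512 * 3
latent_elems = latent_shape[0] * latent_shape[1] * latent_shape[2]
print(latent_shape, f"compression: {image_elems / latent_elems:.0f}x")
```

    The diffusion UNet therefore operates on a tensor with 24× fewer elements than the output image, which is a large part of why latent diffusion is tractable on-device.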

    One-step sampling

    In addition to optimizing the model architecture, we adopt a DiffusionGAN hybrid to achieve one-step sampling. Training DiffusionGAN hybrid models for text-to-image generation encounters several intricacies. Notably, the discriminator, a classifier distinguishing real data and generated data, must make judgments based on both texture and semantics. Moreover, the cost of training text-to-image models can be extremely high, particularly in the case of GAN-based models, where the discriminator introduces additional parameters. Purely GAN-based text-to-image models (e.g., StyleGAN-T, GigaGAN) confront similar complexities, resulting in highly intricate and expensive training.

    To overcome these challenges, we use a pre-trained diffusion UNet to initialize the generator and discriminator. This design enables seamless initialization with the pre-trained diffusion model. We postulate that the internal features within the diffusion model contain rich information of the intricate interplay between textual and visual data. This initialization strategy significantly streamlines the training.

    The figure below illustrates the training procedure. After initialization, a noisy image is sent to the generator for one-step diffusion. The result is evaluated against ground truth with a reconstruction loss, similar to diffusion model training. We then add noise to the output and send it to the discriminator, whose result is evaluated with a GAN loss, effectively adopting the GAN to model a denoising step. By using pre-trained weights to initialize the generator and the discriminator, the training becomes a fine-tuning process, which converges in less than 10K iterations.

    Illustration of DiffusionGAN fine-tuning.
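
    The training step described above can be sketched in miniature. The following toy uses scalars in place of image tensors and trivial functions in place of the UNet-initialized generator and discriminator; all values and loss weights are illustrative only:

```python
import math

# Toy sketch of one DiffusionGAN fine-tuning step. Scalars stand in for
# images, and linear/sigmoid functions stand in for the UNet generator
# and discriminator. Values are illustrative, not from the real model.

def generator(noisy):          # stand-in for the diffusion UNet
    return 0.9 * noisy         # "denoises" in a single step

def discriminator(x):          # stand-in for the UNet-initialized critic
    return 1.0 / (1.0 + math.exp(-x))   # probability the input is real

ground_truth = 1.0
noise = 0.3
noisy_input = ground_truth + noise

# 1) One-step generation, scored with a reconstruction loss
#    (as in ordinary diffusion-model training).
generated = generator(noisy_input)
recon_loss = (generated - ground_truth) ** 2

# 2) Re-noise the output and score it with the discriminator
#    (the GAN loss that models the denoising step).
renoised = generated + noise
gan_loss = -math.log(discriminator(renoised))

total_loss = recon_loss + gan_loss   # the fine-tuning objective (sketch)
print(recon_loss, gan_loss, total_loss)
```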

    Results

    Below we show example images generated by our MobileDiffusion with DiffusionGAN one-step sampling. With such a compact model (520M parameters in total), MobileDiffusion can generate diverse, high-quality images across various domains.

    Images generated by our MobileDiffusion

    We measured the performance of our MobileDiffusion on both iOS and Android devices, using different runtime optimizers. The latency numbers are reported below. We see that MobileDiffusion is very efficient and can run within half a second to generate a 512×512 image. This lightning speed potentially enables many interesting use cases on mobile devices.

    Latency measurements (s) on mobile devices.

    Conclusion

    With superior efficiency in terms of latency and size, MobileDiffusion is a very friendly option for mobile deployment, capable of delivering a rapid image generation experience while users type text prompts. We will ensure that any application of this technology is in line with Google’s responsible AI practices.

    Acknowledgments

    We would like to thank our collaborators and contributors who helped bring MobileDiffusion on-device: Zhisheng Xiao, Yanwu Xu, Jiuqiang Tang, Haolin Jia, Lutz Justen, Daniel Fenner, Ronald Wotzlaw, Jianing Wei, Raman Sarokin, Juhyun Lee, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann.

    Originally appeared here:
    MobileDiffusion: Rapid text-to-image generation on-device


  • Build a movie chatbot for TV/OTT platforms using Retrieval Augmented Generation in Amazon Bedrock


    Gaurav Rele

    In this post, we show you how to securely create a movie chatbot by implementing RAG with your own data using Knowledge Bases for Amazon Bedrock. We use the IMDb and Box Office Mojo dataset to simulate a catalog for media and entertainment customers and showcase how you can build your own RAG solution in just a couple of steps.


  • How Mendix is transforming customer experiences with generative AI and Amazon Bedrock

    Ricardo Perdigao

    This post was co-written with Ricardo Perdigao, Solution Architecture Manager at Mendix, a Siemens business. Mendix offers a low-code platform with the vision and execution designed for today’s complex software development challenges. Since 2005, we’ve helped thousands of organizations worldwide reimagine how they develop applications with our platform’s cutting-edge capabilities. Mendix allows […]


  • Train and host a computer vision model for tampering detection on Amazon SageMaker: Part 2


    Anup Ravindranath

    In the first part of this three-part series, we presented a solution that demonstrates how you can automate detecting document tampering and fraud at scale using AWS AI and machine learning (ML) services for a mortgage underwriting use case. In this post, we present an approach to develop a deep learning-based computer vision model to […]


  • What Is the Biggest Number?


    James W

    In the abstract world that we call home, how many digits do we need? A non-technical journey to the longest sum of all


  • Lingua Franca — Entity-Aware Machine Translation Approach for Question Answering over Knowledge…

    Aleksandr Perevalov

    Lingua Franca — Entity-Aware Machine Translation Approach for Question Answering over Knowledge Graphs

    Towards a lingua franca for knowledge graph question answering systems

    TLDR

    Machine Translation (MT) can enhance existing Question Answering (QA) systems with limited language capabilities by enabling them to support multiple languages. However, MT has one major drawback: it often fails to translate named entities that cannot be translated word-for-word. For example, the German title of the movie “The Pope Must Die” is “Ein Papst zum Küssen”, which literally translates to “A Pope to Kiss”. As the correctness of named entities is crucial for QA systems, this challenge has to be handled properly. In this article, we present our entity-aware MT approach, “Lingua Franca”. It leverages the information stored in knowledge graphs to ensure that named entities are translated correctly. And yes, it works!

    The Challenge

    Achieving high-quality translations depends significantly on accurately translating named entities (NEs) within sentences. Various methods have been proposed to enhance the translation of NEs, including approaches that integrate knowledge graphs (KGs) to improve entity translation, recognizing the pivotal role of entities in overall translation quality within the context of QA. It is important to note that the quality of NE translation is not an isolated objective; it has broader implications for systems involved in tasks such as information retrieval (IR) or knowledge graph-based question answering (KGQA). In this article, we will delve into a detailed discussion of machine translation (MT) and KGQA.

    The significance of KGQA systems lies in their ability to provide factual answers to users based on structured data (see figure below).


    KGQA systems are core components of modern search engines, enabling them to give direct answers to their users (Google Search, screenshot by author).

    Additionally, multilingual KGQA systems play a crucial role in addressing the “digital language divide” on the Web. For instance, Germany-related Wikipedia articles, especially those dedicated to cities or people, contain more information in German than in other languages. This information imbalance can be addressed by multilingual KGQA systems, which, as noted above, are core components of modern search engines.

    One of the options for enabling a KGQA system to answer questions in different languages is to use MT. However, off-the-shelf MT systems face notable challenges when translating NEs, as numerous entities are not readily translatable and demand background knowledge for accurate interpretation. For instance, consider the German title of the movie “The Pope Must Die,” which is “Ein Papst zum Küssen.” The literal translation, “A Pope to Kiss,” underscores the need for contextual understanding beyond a straightforward translation approach.

    Given the limitations of conventional MT methods in translating entities, the combination of KGQA systems with MT often results in distorted NEs, significantly reducing the likelihood of accurate question answering. Therefore, there is a need for an enhanced approach to incorporate background knowledge about NEs in multiple languages.

    Our Approach

    This article introduces and implements a novel approach for Named-Entity Aware Machine Translation (NEAMT) aimed at enhancing the multilingual capabilities of KGQA systems. The central concept of NEAMT involves augmenting the quality of MT by incorporating information from a knowledge graph (e.g. Wikidata and DBpedia). This is achieved through the utilization of the “entity-replacement” technique.

    For the evaluation data, we use the QALD-9-plus and QALD-10 datasets. We then use multiple components within our NEAMT framework, which are available in our repository. Finally, the approach is evaluated on two KGQA systems: QAnswer and Qanary. A detailed description of the approach is given in the figure below.

    General overview of the Lingua Franca approach in the KGQA process (figure by author)

    In essence, our approach, during the translation process, preserves known NEs using the entity-replacement technique. Subsequently, these entities are substituted with their corresponding labels from a knowledge graph in the target translation language. This meticulous process ensures the precise translation of questions before they are addressed by a KGQA system.

    Adhering to the insights from our previous article, we designate English as the common target translation language, leading to the nomenclature of our approach as “Lingua Franca” (inspired by the meaning of “bridge” or “link” language). It is essential to note that our framework is versatile and can seamlessly adapt to any other language as the target language. Importantly, Lingua Franca extends beyond the scope of KGQA and finds applicability in various entity-oriented search applications.

    The Lingua Franca approach comprises three main steps: (1) Named Entity Recognition (NER) and Named Entity Linking (NEL), (2) the application of the entity-replacement technique based on identified named entities, and (3) utilizing a machine translation tool to generate text in a target language while considering information from the preceding steps. Here, English is consistently used as the target language, aligning with related research that deems it the most optimal strategy for Question Answering (QA) quality. However, the approach is not limited to English, and other languages can be employed if necessary.
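
    The three steps can be sketched end to end. Everything below is a toy stand-in: the entity ID, the KG label table, and the word-by-word “MT” stub are all hypothetical, and a real pipeline would call actual NER/NEL components and an MT model:

```python
# Minimal sketch of the entity-replacement technique (numerical
# placeholders, as in Setting 2 below). The KG entries and the
# "translate" stub are hypothetical stand-ins.

# Hypothetical knowledge-graph labels: entity ID -> label per language
kg_labels = {
    "Q_EXAMPLE": {"de": "Ein Papst zum Küssen", "en": "The Pope Must Die"},
}

def translate_de_to_en(text):
    # Stub MT: only knows the words of this one example sentence.
    vocab = {"Wer": "Who", "hat": "directed", "gedreht?": "?"}
    return " ".join(vocab.get(tok, tok) for tok in text.split())

def entity_aware_translate(question, entities):
    # 1) NER/NEL output: replace each linked entity span with a
    #    numerical placeholder the MT tool is unlikely to alter.
    placeholders = {}
    for i, (span, entity_id) in enumerate(entities.items()):
        placeholders[str(100 + i)] = entity_id
        question = question.replace(span, str(100 + i))
    # 2) Translate the placeholder-protected question.
    translated = translate_de_to_en(question)
    # 3) Substitute placeholders with target-language KG labels.
    for ph, entity_id in placeholders.items():
        translated = translated.replace(ph, kg_labels[entity_id]["en"])
    return translated

question_de = "Wer hat Ein Papst zum Küssen gedreht?"
linked = {"Ein Papst zum Küssen": "Q_EXAMPLE"}
print(entity_aware_translate(question_de, linked))
```

    The movie title survives as “The Pope Must Die” rather than being mangled into a literal translation, which is exactly the failure mode the placeholder protects against.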

    The approach is implemented as an open-source framework, allowing users to build their Named-Entity Aware Machine Translation (NEAMT) pipelines by integrating custom NER, NEL, and MT components (see our GitHub). The details of the Lingua Franca approach for all settings are illustrated in the provided example, as shown in the figure below.

    A detailed representation of the Lingua Franca approach following multiple settings (figure by author)

    The experimental findings in this study strongly advocate for the superiority of Lingua Franca over standard MT tools when combined with KGQA systems.

    Experimental Results

    In evaluating each entity-replacement setting, the rate of corrupted placeholders or NE labels after processing through an MT tool was calculated. This rate serves as an indicator of the actual NE translation quality for the approach-related pipelines. The updated statistics are as follows:

    • Setting 1 (string-like placeholders): 6.63% of the placeholders were lost or corrupted.
    • Setting 2 (numerical placeholders): 2.89% of the placeholders were lost or corrupted.
    • Setting 3 (replacing the NEs with their English labels before translation): 6.16% of the labels were corrupted.

    As a result, with our approach, we can confidently assert that up to 97.11% (Setting 2) of the recognized NEs in a text were translated correctly.

    We analyzed the results regarding QA quality while taking into account the following experimental components: an approach pipeline or a standard MT tool, a source language, and a KGQA benchmark. The figure below illustrates the comparison between the approach and standard MT — these results can be interpreted as an ablation study.

    Grouped bar plot of Macro F1 scores for our experiments (by author)

    The grouped bar plot illustrates the Macro F1 score (obtained using Gerbil-QA) concerning each language and split. In the context of the ablation study, each group consists of two bars: the first one pertains to the best approach proposed by us, while the second bar reflects the performance of a standard MT tool (baseline).

    We observed that in the majority of the experimental cases (19 out of 24), the KGQA systems using our approach outperformed those using standard MT tools. To verify this, we conducted the Wilcoxon signed-rank test on the same data. Based on the test results (p-value = 0.0008, with α = 0.01), we rejected the null hypothesis that there is no difference in QA quality between combining KGQA with standard MT and combining KGQA with our approach. Therefore, we conclude that our approach, built on the NEAMT framework, significantly improves QA quality on multilingual questions compared to standard MT tools.
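
    The logic of this significance check can be illustrated with a simpler, related test: a one-sided sign test on the reported win/loss counts (19 wins out of 24 cases). Note this is not the test used in the study; the Wilcoxon signed-rank test additionally weighs the magnitudes of the paired differences:

```python
from math import comb

# One-sided sign test on the reported win/loss counts (19 of 24 cases
# favored the approach). A simpler analogue of the Wilcoxon signed-rank
# test, which also accounts for the sizes of the paired differences.

def sign_test_p_value(wins, n):
    # P(X >= wins) under H0: each case is a fair coin flip (p = 0.5).
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p = sign_test_p_value(19, 24)
print(f"one-sided sign-test p-value: {p:.4f}")  # ~0.0033, below alpha = 0.01
```

    Even this cruder test rejects the no-difference hypothesis at α = 0.01, consistent with the Wilcoxon result reported above.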

    The reproducibility of the experiments was verified by repeating them and calculating Pearson’s correlation coefficient between all the QA quality metrics. The resulting coefficient of 0.794 lies at the borderline between strong and very strong correlation. Therefore, we conclude that our experiments are reproducible.
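
    For reference, Pearson’s correlation coefficient can be computed from scratch as follows; the two score lists are made-up stand-ins for the repeated QA quality metrics, not the study’s actual numbers:

```python
import math

# Pearson's correlation coefficient, computed from scratch. The score
# lists are hypothetical stand-ins for repeated QA quality metrics.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

run1 = [0.42, 0.35, 0.51, 0.28, 0.47]   # hypothetical Macro F1 scores
run2 = [0.40, 0.37, 0.49, 0.30, 0.45]   # same experiments, repeated

print(f"r = {pearson_r(run1, run2):.3f}")
```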

    Conclusion

    This article introduces our NEAMT approach, Lingua Franca. Designed to enhance multilingual capabilities and improve QA quality compared to standard MT tools, Lingua Franca is tailored for use with KGQA systems in order to broaden their potential user base. The implementation and evaluation of Lingua Franca use a modular NEAMT framework developed by the authors, with detailed information provided in the section on experimental results. The key contributions of this work include: (1) being the first, to the best of our knowledge, to combine a NEAMT approach (i.e., Lingua Franca) with KGQA; (2) presenting an open-source modular framework for NEAMT that allows the research community to build their own MT pipelines; and (3) conducting a comprehensive evaluation and ablation study demonstrating the effectiveness of the Lingua Franca approach.

    For future work, we aim to expand our experimental setup to encompass a broader range of languages, benchmarks, and KGQA systems. To address damaged placeholders in the entity-replacement process, we plan to fine-tune the MT models using this data. Additionally, a more detailed error analysis, focusing on error propagation, will be conducted.

    Don’t forget to check out our full research paper and the GitHub repository.

    Acknowledgments

    This research has been funded by the Federal Ministry of Education and Research, Germany (BMBF) under Grant Numbers 01IS17046 and 01QE2056C, as well as by the Ministry of Culture and Science of North Rhine-Westphalia, Germany (MKW NRW) under Grant Number NW21–059D. This research was also funded within the research project QA4CB — Entwicklung von Question-Answering-Komponenten zur Erweiterung des Chatbot-Frameworks (development of question answering components to extend the chatbot framework).


    Lingua Franca — Entity-Aware Machine Translation Approach for Question Answering over Knowledge Graphs was originally published in Towards Data Science on Medium.
