Category: AI

  • Neuromorphic Computing — an Edgier, Greener AI

    Neuromorphic Computing — an Edgier, Greener AI

    Jonathan R. Williford, PhD

    Why computer hardware and AI algorithms are being reinvented using inspiration from the brain

    Server towers connected to floating brains with windmills and solar panels in the background.
    Neuromorphic Computing might not just help bring AI to the edge, but also reduce carbon emissions at data centers. Generated by author with ImageGen 3.

    There are periodic proclamations of the coming neuromorphic computing revolution, which uses inspiration from the brain to rethink neural networks and the hardware they run on. While challenges remain in the field, there have been solid successes and there continues to be steady progress in spiking neural network algorithms and neuromorphic hardware. This progress is paving the way for disruption in at least some sectors of artificial intelligence: it will reduce the energy consumption per computation at inference and allow artificial intelligence to be pushed further out to the edge. In this article, I will cover some neuromorphic computing and engineering basics, training, the advantages of neuromorphic systems, and the remaining challenges.

    The classical use case for neuromorphic systems is edge devices that need to perform computation locally and are energy-limited, for example, battery-powered devices. However, one recent interest in neuromorphic systems is reducing energy usage at data centers, such as the energy needed by large language models (LLMs). For example, OpenAI signed a letter of intent to purchase $51 million of neuromorphic chips from Rain AI in December 2023. This makes sense since OpenAI spends a lot on inference, with one estimate putting its 2024 inference costs at around $4 billion. It also appears that both Intel’s Loihi 2 and IBM’s NorthPole (successor to TrueNorth) neuromorphic systems are designed for use in servers.

    The promises of neuromorphic computing can broadly be divided into 1) pragmatic, near-term approaches that have already found success and 2) more aspirational, wacky neuroscientist fever-dream ideas of how spiking dynamics might endow neural networks with something closer to real intelligence. Of course, it’s group 2 that really excites me, but I’m going to focus on group 1 for this post. And there is no more exciting way to start than to dive into terminology.

    Terminology

    Neuromorphic computation is often defined as computation that is brain-inspired, but that definition leaves a lot to the imagination. Neural networks are more neuromorphic than classical computation, but these days neuromorphic computation is specifically interested in using event-based spiking neural networks (SNNs) for their energy efficiency. Even though SNNs are a type of artificial neural network, the term “artificial neural networks” (ANNs) is reserved for the more standard non-spiking artificial neural networks in the neuromorphic literature. Schuman and colleagues (2022) define neuromorphic computers as non-von Neumann computers where both processing and memory are collocated in artificial neurons and synapses, as opposed to von Neumann computers that separate processing and memory.

    Diagram comparing von Neumann computers with Neuromorphic computers
    von Neumann Computers operate on digital information, have separate processors and memory, and are synchronized by clocks, while neuromorphic computers operate on event-driven spikes, combine compute and memory, and are asynchronous. Created by the author with inspiration from Schuman et al. 2022.

    Neuromorphic engineering refers to designing the hardware, while “neuromorphic computation” is focused on what is being simulated rather than what it is simulated on. The two are tightly intertwined, since the computation depends on the properties of the hardware, and what is implemented in hardware depends on what is empirically found to work best.

    Another related term is NeuroAI, the goal of which is to use AI to gain a mechanistic understanding of the brain; it is more interested in biological realism. Neuromorphic computation is interested in neuroscience as a means to an end: it views the brain as a source of ideas that can be used to achieve objectives such as energy efficiency and low latency in neural architectures. A decent amount of NeuroAI research relies on spike averages rather than spiking neural networks, which allows closer comparison with the majority of modern ANNs that are applied to discrete tasks.

    Event-Driven Systems

    Generated by the author using ImageGen 3.

    Neuromorphic systems are event-based, which is a paradigm shift from how modern ANN systems work. Even real-time ANN systems typically process one frame at a time, with activity synchronously propagated from one layer to the next. This means that in ANNs, neurons that carry no information require the same processing as neurons that carry critical information. Event-driven processing is a different paradigm that often starts at the sensor and applies the most work where information needs to be processed. ANNs rely on matrix operations that take the same amount of time and energy regardless of the values in the matrices. Neuromorphic systems use SNNs, where the amount of work depends on the number of spikes.

    A traditional deployed ANN would often be connected to a camera that synchronously records a frame in a single exposure. The ANN then processes the frame. The results of the frame might then be fed into a tracking algorithm and further processed.

    Event-driven systems may start at the sensor with an event camera. Each pixel sends updates asynchronously whenever a change crosses a threshold. So when there is movement in a scene that is otherwise stationary, the pixels that correspond to the movement send events or spikes immediately without waiting for a synchronization signal. The event signals can be sent within tens of microseconds, while a traditional camera might collect at 24 Hz and could introduce a latency that’s in the range of tens of milliseconds. In addition to receiving the information sooner, the information in the event-based system would be sparser and would focus on the movement. The traditional system would have to process the entire scene through each network layer successively.
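
    To make the contrast concrete, below is a minimal sketch (my own illustration, not tied to any particular chip or library) of a leaky integrate-and-fire neuron driven by a sparse event stream. The update loop only runs when an event arrives, so a quiet input costs essentially nothing, whereas a frame-based pipeline pays the full cost of every frame regardless of content. The constants (tau, threshold) are illustrative assumptions.

    import math

    def lif_neuron(events, tau=20.0, threshold=1.0):
        """events: list of (time_ms, weight) pairs, sorted by time."""
        v, t_prev, spike_times = 0.0, 0.0, []
        for t, w in events:
            v *= math.exp(-(t - t_prev) / tau)  # membrane leaks between events
            v += w                              # integrate the incoming event
            if v >= threshold:                  # fire and reset
                spike_times.append(t)
                v = 0.0
            t_prev = t
        return spike_times

    print(lif_neuron([(1.0, 0.4), (2.0, 0.4), (3.0, 0.4)]))  # -> [3.0]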

    Learning in Spiking Neural Networks

    An image of a teacher teaching a class of neurons the difference between cats and dogs.
    One way to train a spiking neural network is to use an ANN as a teacher. Generated by the author with ImageGen 3.

    One of the major challenges of SNNs is training them. Backpropagation algorithms and stochastic gradient descent are the go-to solutions for training ANNs; however, these methods run into difficulty with SNNs. The best way to train SNNs is not yet established, and the following are some of the more common approaches:

    1. ANN to SNN conversion
    2. Backpropagation-like
    3. Synaptic plasticity
    4. Evolutionary

    ANN to SNN conversion

    One method of creating SNNs is to bypass training the SNNs directly and instead train ANNs and convert them. This approach limits the types of SNNs and hardware that can be used. For example, Sengupta et al. (2019) converted VGG and ResNet ANNs to SNNs using an integrate-and-fire (IF) neuron that has no leak or refractory period. They introduced a novel weight-normalization technique to perform the conversion, which involves setting the firing threshold of each neuron based on its pre-synaptic weights. Dr. Priyadarshini Panda goes into more detail in her ESWEEK 2021 SNN Talk.
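
    As a rough illustration of the conversion idea, here is a simplified sketch of my own: it reuses trained (ReLU) ANN weights, picks each layer’s firing threshold from calibration activations rather than the exact weight-based normalization described above, and then runs the copied weights as rate-coded IF neurons. The helper names and constants are hypothetical.

    import numpy as np

    def calibrate_thresholds(weights, calib_inputs):
        """weights: list of layer weight matrices from a trained ReLU ANN.
        Returns one threshold per layer: the largest pre-activation observed."""
        thresholds, x = [], calib_inputs
        for W in weights:
            pre = x @ W.T
            thresholds.append(pre.max())
            x = np.maximum(pre, 0.0)           # ReLU activations feed the next layer
        return thresholds

    def run_converted_snn(weights, thresholds, x, num_steps=32):
        """Rate-coded IF simulation for a single input vector x, applied as a
        constant input current at every time step."""
        potentials = [np.zeros(W.shape[0]) for W in weights]
        out_spikes = np.zeros(weights[-1].shape[0])
        for _ in range(num_steps):
            s = x
            for v, W, th in zip(potentials, weights, thresholds):
                v += s @ W.T                   # integrate weighted input spikes
                s = (v >= th).astype(float)    # fire where the threshold is crossed
                v -= th * s                    # "soft" reset by subtraction
            out_spikes += s
        return out_spikes / num_steps          # output firing rates ~ ANN activations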

    Advantages:

    1. Enables deep SNNs.
    2. Allows reuse of deep ANN knowledge, such as training, architecture, etc.

    Disadvantages:

    1. Limits architectures to those suited to ANNs and the conversion procedures.
    2. Network doesn’t learn to take advantage of SNN properties, which can lead to lower accuracy and longer latency.

    Backpropagation-like approaches and surrogate gradient descent

    The most common methods currently used to train SNNs are backpropagation-like approaches. Standard backpropagation does not work for training SNNs because 1) the spiking threshold function’s gradient is zero everywhere except at the threshold, where it is undefined, and 2) the credit assignment problem needs to be solved in the temporal dimension in addition to the spatial ones.

    In ANNs, the most common activation function is the ReLU. For SNNs, the neuron fires if the membrane potential is above some threshold; otherwise, it does not fire. This is a Heaviside step function. You could use a sigmoid function instead, but then it would not be a spiking neural network. The solution of surrogate gradients is to use the standard threshold function in the forward pass, but then use the derivative of a “smoothed” version of the Heaviside function, such as the sigmoid function, in the backward pass (Neftci et al. 2019, Bohte 2011).
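
    Below is a minimal PyTorch sketch of the surrogate-gradient trick (my own illustration; the slope constant is an arbitrary choice). The forward pass applies the hard Heaviside step, while the backward pass substitutes the derivative of a sigmoid that approximates it.

    import torch

    class SurrogateSpike(torch.autograd.Function):
        @staticmethod
        def forward(ctx, v, threshold=1.0, slope=10.0):
            ctx.save_for_backward(v)
            ctx.threshold, ctx.slope = threshold, slope
            return (v >= threshold).float()            # Heaviside step: spike or no spike

        @staticmethod
        def backward(ctx, grad_output):
            (v,) = ctx.saved_tensors
            sig = torch.sigmoid(ctx.slope * (v - ctx.threshold))
            surrogate = ctx.slope * sig * (1.0 - sig)  # derivative of the smoothed step
            return grad_output * surrogate, None, None

    v = torch.randn(5, requires_grad=True)             # membrane potentials
    spikes = SurrogateSpike.apply(v)
    spikes.sum().backward()
    print(v.grad)                                      # nonzero gradients flow through the spike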

    Advantages:

    1. Connects to well-known methods.
    2. Compared to conversion, can result in a more energy efficient network (Li et al. 2022)

    Disadvantages:

    1. Can be computationally intensive to solve both spatially and through time

    Synaptic Plasticity

    Spike-timing-dependent plasticity (STDP) is the most well-known form of synaptic plasticity. In most cases, STDP increases the strength of a synapse when a presynaptic (input) spike comes immediately before the postsynaptic spike. Early models have shown promise with STDP on simple unsupervised tasks, although getting it to work well for more complex models and tasks has proven more difficult.
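
    The sketch below shows the basic pair-based STDP rule (my own toy example; the constants are made up): the weight change decays exponentially with the time difference between a presynaptic and a postsynaptic spike, and its sign depends on which came first.

    import math

    def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
        """Weight change for a single pre/post spike pair (times in ms)."""
        dt = t_post - t_pre
        if dt > 0:                              # pre before post: potentiation
            return a_plus * math.exp(-dt / tau)
        if dt < 0:                              # post before pre: depression
            return -a_minus * math.exp(dt / tau)
        return 0.0

    w = 0.5
    w += stdp_dw(t_pre=10.0, t_post=13.0)       # pre leads post -> w increases
    w += stdp_dw(t_pre=25.0, t_post=21.0)       # post leads pre -> w decreases
    print(round(w, 4))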

    Other biological learning mechanisms include the pruning and creation of both neurons and synapses, homeostatic plasticity, neuromodulators, astrocytes, and evolution. There is even some recent evidence that some primitive types of knowledge can be passed down by epigenetics.

    Advantages:

    1. Unsupervised
    2. Can take advantage of temporal properties
    3. Biologically inspired

    Disadvantages:

    1. Synaptic plasticity is not well understood, especially at different timescales
    2. Difficult to get to work with non-trivial networks

    Evolutionary Optimization

    Evolutionary optimization is another approach that has some cool applications and works well with small networks. Dr. Catherine Schuman is a leading expert, and she gave a fascinating talk on neuromorphic computing to the ICS lab that is available on YouTube.

    Advantages:

    1. Applicable to many tasks, architectures, and devices.
    2. Can learn topology and parameters (requiring less knowledge of the problem).
    3. Learns small networks which results in lower latency.

    Disadvantages:

    1. Not effective for problems that require deep or large architectures.

    Advantages of Neuromorphic Systems

    Energy Efficiency

    Neuromorphic systems have two main advantages: 1) energy efficiency and 2) low latency. There are a lot of reasons to be excited about the energy efficiency. For example, Intel claimed that their Loihi 2 Neural Processing Unit (NPU) can use 100 times less energy while being as much as 50 times faster than conventional ANNs. Chris Eliasmith compared the energy efficiency of an SNN on neuromorphic hardware with an ANN with the same architecture on standard hardware in a presentation available on YouTube. He found that the SNN is 100 times more energy efficient on Loihi compared to the ANN on a standard NVIDIA GPU and 20 times more efficient than the ANN on an NVIDIA Jetson GPU. It is 5–7 times more energy efficient than the Intel Neural Compute Stick (NCS) and NCS 2. At the same time the SNN achieves a 93.8% accuracy compared to the 92.7% accuracy of the ANN.

    Barplot comparing ANNs running on traditional GPUs and CPUs with an SNN running on an Intel Loihi.
    Figure recreated by author from Chris Eliasmith’s slides at https://www.youtube.com/watch?v=PeW-TN3P1hk&t=1308s which shows the neuromorphic processor being 5–100x more efficient while achieving a similar accuracy.

    Neuromorphic chips are more energy efficient and allow complex deep learning models to be deployed on low-energy edge devices. In October 2024, BrainChip introduced the Akida Pico NPU, which uses less than 1 mW of power, while the Intel Loihi 2 NPU uses 1 W. That is far less power than the NVIDIA Jetson modules often used for embedded ANNs, which draw between 10 and 50 watts, and than server GPUs, which can use around 100 watts.

    Comparing the energy efficiency of ANNs and SNNs is difficult because: 1) energy efficiency depends on the hardware, 2) SNNs and ANNs can use different architectures, and 3) they are suited to different problems. Additionally, the energy used by SNNs scales with the number of spikes and the number of time steps, so the number of spikes and time steps needs to be minimized to achieve the best energy efficiency.

    Theoretical analysis is often used to estimate the energy needed by SNNs and ANNs; however, this doesn’t take into account all of the differences between the CPUs and GPUs used for ANNs and the neuromorphic chips used for SNNs.
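
    As an illustration of what such a back-of-envelope estimate looks like, the sketch below compares a dense ANN cost, fixed by its multiply-accumulate count, with an SNN cost that scales with spikes and time steps. Every constant here is an assumption of mine for illustration, not a measured value for any particular chip.

    def ann_energy(macs, energy_per_mac=4.6e-12):
        # Dense ANN: every multiply-accumulate runs on every inference.
        return macs * energy_per_mac

    def snn_energy(synapses, time_steps, spike_rate, energy_per_syn_op=0.9e-12):
        # SNN: synaptic operations scale with activity (spike rate) and time steps.
        return synapses * time_steps * spike_rate * energy_per_syn_op

    connections = 300e6  # hypothetical network size
    print(f"ANN: {ann_energy(connections):.2e} J per inference")
    print(f"SNN: {snn_energy(connections, time_steps=8, spike_rate=0.02):.2e} J per inference")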

    Looking into nature can give us an idea of what might be possible in the future and Mike Davies provided a great anecdote in an Intel Architecture All Access YouTube video:

    Consider the capabilities of a tiny cockatiel parrot brain, a two-gram brain running on about 50 mW of power. This brain enables the cockatiel to fly at speeds up to 20 mph, to navigate unknown environments while foraging for food, and even to learn to manipulate objects as tools and utter human words.

    In current neural networks, there is a lot of wasted computation. For example, an image encoder takes the same amount of time encoding a blank page as a cluttered page in a “Where’s Waldo?” book. In spiking neural networks, very few units would activate on a blank page and very little computation would be used, while a page containing a lot of features would fire a lot more units and use a lot more computation. In real life, there are often regions in the visual field that contain more features and require more processing than other regions that contain fewer features, like a clear sky. In either case, SNNs only perform work when work needs to be performed, whereas ANNs depend on matrix multiplications that are difficult to use sparsely.

    This in itself is exciting. A lot of deep learning currently involves uploading massive amounts of audio or video to the cloud, where the data is processed in massive data centers, with a lot of energy spent on the computation and on cooling the computational devices, before the results are returned. With edge computing, you can have more secure and more responsive voice or video recognition that stays on your local device, with orders of magnitude less energy consumption.

    Low Latency

    When a pixel receptor of an event camera changes by some threshold, it can send an event or spike within microseconds. It doesn’t need to wait for a shutter or synchronization signal to be sent. This benefit is seen throughout the event-based architecture of SNNs. Units can send events immediately, rather than waiting for a synchronization signal. This makes neuromorphic computers much faster, in terms of latency, than ANNs. Hence, neuromorphic processing is better than ANNs for real-time applications that can benefit from low latency. This benefit is reduced if the problem allows for batching and you are measuring speed by throughput since ANNs can take advantage of batching more easily. However, in real-time processing, such as robotics or user interfacing, latency is more important.

    Disadvantages and Challenges

    Everything Everywhere All at Once

    One of the challenges is that neuromorphic computing and engineering are progressing at multiple levels at the same time. The details of the models depend on the hardware implementation, and empirical results with actualized models guide the development of the hardware. Intel discovered this with their Loihi 1 chips and built more flexibility into their Loihi 2 chips; however, there will always be tradeoffs, and there are still many advances to be made on both the hardware and software side.

    Limited Availability of Commercial Hardware

    Hopefully, this will change soon, but commercial hardware isn’t widely available. BrainChip’s Akida was the first neuromorphic chip to be commercially available, although apparently it does not even support the standard leaky integrate-and-fire (LIF) neuron. SpiNNaker boards, which were part of the EU Human Brain Project, used to be for sale but are no longer available. Intel makes Loihi 2 chips available to some academic researchers via the Intel Neuromorphic Research Community (INRC) program.

    Datasets

    There are far fewer neuromorphic datasets than traditional datasets, and they can be much larger. Some of the common smaller computer vision datasets, such as MNIST (N-MNIST, Orchard et al. 2015) and CIFAR-10 (CIFAR10-DVS, Li et al. 2017), have been converted to event streams by displaying the images and recording them with event-based cameras. The images are moved during recording (with “saccades”) to increase the number of spikes for processing. With larger datasets, such as ES-ImageNet (Lin et al. 2021), simulation of event cameras has been used.

    Datasets derived from static images might be useful for comparing SNNs with conventional ANNs, or as part of a training or evaluation pipeline; however, SNNs are naturally temporal, and using them on static inputs does not make much sense if you want to take advantage of their temporal properties. Some of the datasets that take advantage of these properties of SNNs include:

    • DvsGesture (Amir et al. 2017) — a dataset of people performing a set of 11 hand and arm gestures
    • Bullying10K (Dong et al. 2023) — a privacy-preserving dataset for bullying recognition

    Synthetic data can be generated from standard visible-camera data without the use of expensive event-camera data collections; however, it won’t exhibit the high dynamic range and temporal resolution that event cameras would capture.

    Tonic is an example Python library that makes it easy to access at least some of these event-based datasets. The datasets themselves can take up a lot more space than traditional datasets. For example, the MNIST training images take around 10 MB, while the N-MNIST training set is almost 1 GB.
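
    For example, loading N-MNIST and binning its events into dense frames with Tonic might look roughly like the sketch below. The class and transform names follow Tonic’s documented API at the time of writing, so treat them as assumptions and check them against the version you install.

    import tonic
    import tonic.transforms as transforms

    dataset = tonic.datasets.NMNIST(save_to="./data", train=True)
    events, label = dataset[0]            # structured array of events (x, y, t, p fields)

    to_frames = transforms.ToFrame(
        sensor_size=tonic.datasets.NMNIST.sensor_size,
        time_window=10_000,               # microseconds of events per frame
    )
    frames = to_frames(events)            # shape: (frames, polarities, height, width)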

    Another thing to take into account is that visualizing the datasets can be difficult. Even the datasets derived from static images can be difficult to match with the original input images. Also, the benefit of using real data is typically to avoid a gap between training and inference, so the value of these datasets would seem to depend on how similar the recording cameras are to the cameras used during deployment or testing.

    Conclusion

    Created by author with ImageGen 3 and GIMP.

    We are in an exciting time for neuromorphic computation, with both investment in the hardware and advancements in spiking neural networks. There are still challenges for adoption, but there are proven cases where neuromorphic systems are more energy efficient, especially compared to standard server GPUs, while having lower latency and accuracy similar to traditional ANNs. A lot of companies, including Intel, IBM, Qualcomm, Analog Devices, Rain AI, and BrainChip, have been investing in neuromorphic systems. BrainChip is the first company to make their neuromorphic chips commercially available, while both Intel and IBM are on the second generations of their research chips (Loihi 2 and NorthPole, respectively). There also seems to have been a particular spike of successful spiking transformers and other deep spiking neural networks in the last couple of years, following the Spikformer paper (Zhou et al. 2022) and the SEW-ResNet paper (Fang et al. 2021).

    References

    • Amir, A., Taba, B., Berg, D., Melano, T., McKinstry, J., Di Nolfo, C., Nayak, T., Andreopoulos, A., Garreau, G., Mendoza, M., Kusnitz, J., Debole, M., Esser, S., Delbruck, T., Flickner, M., & Modha, D. (2017). A Low Power, Fully Event-Based Gesture Recognition System. 7243–7252. https://openaccess.thecvf.com/content_cvpr_2017/html/Amir_A_Low_Power_CVPR_2017_paper.html
    • Bohte, S. M. (2011). Error-Backpropagation in Networks of Fractionally Predictive Spiking Neurons. In Artificial Neural Networks and Machine Learning https://doi.org/10.1007/978-3-642-21735-7_8
    • Dong, Y., Li, Y., Zhao, D., Shen, G., & Zeng, Y. (2023). Bullying10K: A Large-Scale Neuromorphic Dataset towards Privacy-Preserving Bullying Recognition. Advances in Neural Information Processing Systems, 36, 1923–1937.
    • Fang, W., Yu, Z., Chen, Y., Huang, T., Masquelier, T., & Tian, Y. (2021). Deep Residual Learning in Spiking Neural Networks. Advances in Neural Information Processing Systems, 34, 21056–21069. https://proceedings.neurips.cc/paper/2021/hash/afe434653a898da20044041262b3ac74-Abstract.html
    • Li, C., Ma, L., & Furber, S. (2022). Quantization Framework for Fast Spiking Neural Networks. Frontiers in Neuroscience, 16. https://doi.org/10.3389/fnins.2022.918793
    • Li, H., Liu, H., Ji, X., Li, G., & Shi, L. (2017). CIFAR10-DVS: An Event-Stream Dataset for Object Classification. Frontiers in Neuroscience, 11. https://doi.org/10.3389/fnins.2017.00309
    • Lin, Y., Ding, W., Qiang, S., Deng, L., & Li, G. (2021). ES-ImageNet: A Million Event-Stream Classification Dataset for Spiking Neural Networks. Frontiers in Neuroscience, 15. https://doi.org/10.3389/fnins.2021.726582
    • Neftci, E. O., Mostafa, H., & Zenke, F. (2019). Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks. IEEE Signal Processing Magazine. https://doi.org/10.1109/MSP.2019.2931595
    • Orchard, G., Jayawant, A., Cohen, G. K., & Thakor, N. (2015). Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades. Frontiers in Neuroscience, 9. https://doi.org/10.3389/fnins.2015.00437
    • Schuman, C. D., Kulkarni, S. R., Parsa, M., Mitchell, J. P., Date, P., & Kay, B. (2022). Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science, 2(1), 10–19. https://doi.org/10.1038/s43588-021-00184-y
    • Sengupta, A., Ye, Y., Wang, R., Liu, C., & Roy, K. (2019). Going Deeper in Spiking Neural Networks: VGG and Residual Architectures. Frontiers in Neuroscience, 13. https://doi.org/10.3389/fnins.2019.00095
    • Zhou, Z., Zhu, Y., He, C., Wang, Y., Yan, S., Tian, Y., & Yuan, L. (2022, September 29). Spikformer: When Spiking Neural Network Meets Transformer. The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=frE4fUwz_h

    Resources

    Talks on Youtube

    Originally published at https://neural.vision on November 22, 2024.



  • Unleash your Salesforce data using the Amazon Q Salesforce Online connector

    Unleash your Salesforce data using the Amazon Q Salesforce Online connector

    Mehdy Haghy

    In this post, we walk you through configuring and setting up the Amazon Q Salesforce Online connector. Thousands of companies worldwide use Salesforce to manage their sales, marketing, customer service, and other business operations. The Salesforce cloud-based platform centralizes customer information and interactions across the organization, providing sales reps, marketers, and support agents with a unified 360-degree view of each customer. With Salesforce at the heart of their business, companies accumulate vast amounts of customer data within the platform over time. This data is incredibly valuable for gaining insights into customers, improving operations, and guiding strategic decisions. However, accessing and analyzing the blend of structured data and unstructured data can be challenging. With the Amazon Q Salesforce Online connector, companies can unleash the value of their Salesforce data.


  • Reducing hallucinations in large language models with custom intervention using Amazon Bedrock Agents

    Reducing hallucinations in large language models with custom intervention using Amazon Bedrock Agents

    Shayan Ray

    This post demonstrates how to use Amazon Bedrock Agents, Amazon Knowledge Bases, and the RAGAS evaluation metrics to build a custom hallucination detector and remediate it by using human-in-the-loop. The agentic workflow can be extended to custom use cases through different hallucination remediation techniques and offers the flexibility to detect and mitigate hallucinations using custom actions.


  • Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

    Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

    Maurits de Groot

    In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS. This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs.


  • Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

    Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

    Omri Shiv

    The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful publicly available foundation models, tools for training, fine tuning and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high performance […]


  • Addressing Missing Data

    Gizem Kaya

    Understand missing data patterns (MCAR, MNAR, MAR) for better model performance with Missingno


  • Optimizing Transformer Models for Variable-Length Input Sequences

    Chaim Rand

    How PyTorch NestedTensors, FlashAttention2, and xFormers can Boost Performance and Reduce AI Costs

    Photo by Tanja Zöllner on Unsplash

    As generative AI (genAI) models grow in both popularity and scale, so do the computational demands and costs associated with their training and deployment. Optimizing these models is crucial for enhancing their runtime performance and reducing their operational expenses. At the heart of modern genAI systems is the Transformer architecture and its attention mechanism, which is notably compute-intensive.

    In a previous post, we demonstrated how using optimized attention kernels can significantly accelerate the performance of Transformer models. In this post, we continue our exploration by addressing the challenge of variable-length input sequences — an inherent property of real-world data, including documents, code, time-series, and more.

    The Challenge of Batching Variable-Length Input

    In a typical deep learning workload, individual samples are grouped into batches before being copied to the GPU and fed to the AI model. Batching improves computational efficiency and often aids model convergence during training. Usually, batching involves stacking all of the sample tensors along a new dimension — the batch dimension. However, torch.stack requires that all tensors have the same shape, which is not the case with variable-length sequences.
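
    The toy snippet below (separate from the benchmark code in this post) shows the failure mode: torch.stack rejects tensors of different lengths, while torch.cat happily joins them into one long sequence.

    import torch

    a = torch.randint(0, 1024, (5,))   # sequence of length 5
    b = torch.randint(0, 1024, (9,))   # sequence of length 9

    try:
        torch.stack([a, b])            # requires identical shapes
    except RuntimeError as err:
        print("stack failed:", err)

    print(torch.cat([a, b]).shape)     # torch.Size([14]): one concatenated sequence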

    Padding and its Inefficiencies

    The traditional way to address this challenge is to pad the input sequences to a fixed length and then perform stacking. This solution requires appropriate masking within the model so that the output is not affected by the irrelevant tensor elements. In the case of attention layers, a padding mask indicates which tokens are padding and should not be attended to (e.g., see PyTorch MultiheadAttention). However, padding can waste considerable GPU resources, increasing costs and slowing development. This is especially true for large-scale AI models.

    Don’t Pad, Concatenate

    One way to avoid padding is to concatenate sequences along an existing dimension instead of stacking them along a new dimension. Contrary to torch.stack, torch.cat allows inputs of different shapes. The output of concatenation is a single sequence whose length equals the sum of the lengths of the individual sequences. For this solution to work, our single sequence would need to be supplemented by an attention mask that ensures that each token only attends to other tokens in the same original sequence, in a process sometimes referred to as document masking. Denoting the sum of the lengths of all of the individual sequences by N and adopting ”big O” notation, the size of this mask would need to be O(N²), as would the compute complexity of a standard attention layer, making this solution highly inefficient.
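
    As a toy sketch of what document masking means (illustrative only; the optimized kernels discussed below never materialize this mask), the O(N²) block-diagonal mask can be built directly from the sequence lengths:

    import torch

    seq_lens = [3, 2, 4]                   # lengths of the individual sequences, N = 9
    doc_mask = torch.block_diag(
        *[torch.ones(n, n, dtype=torch.bool) for n in seq_lens]
    )
    print(doc_mask.shape)                  # torch.Size([9, 9]); True only within a sequence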

    Attention Layer Optimization

    The solution to this problem comes in the form of specialized attention layers. Contrary to the standard attention layer that computes the full set of O(N²) attention scores only to mask out the irrelevant ones, these optimized attention kernels are designed to calculate only the scores that matter. In this post we will explore several solutions, each with its own distinct characteristics. These include:

    • PyTorch’s scaled_dot_product_attention (SDPA) with NestedTensors
    • FlashAttention2’s flash_attn_varlen_func
    • xFormers’ memory_efficient_attention with a BlockDiagonalMask

    Integration into Existing HuggingFace Models

    For teams working with pre-trained models, transitioning to these optimizations might seem challenging. We will demonstrate how HuggingFace’s APIs simplify this process, enabling developers to integrate these techniques with minimal code changes and effort.

    Disclaimers

    • Please do not interpret our use of any platforms, libraries, or optimization techniques as an endorsement for their use. The best options for you will depend greatly on the specifics of your own use-case.
    • Some of the APIs discussed here are in prototype or beta stages and may change in the future.
    • The code examples provided are for demonstrative purposes only. We make no claims regarding their accuracy, optimality, or robustness.

    Special thanks to Yitzhak Levi and Peleg Nahaliel for their contributions to this post.

    Toy LLM Model

    To facilitate our discussion we will define a simple generative model (partially inspired by the GPT model defined here). For a more comprehensive guide on building language models, please see one of the many excellent tutorials available online (e.g., here).

    Transformer Block

    We begin by constructing a basic Transformer block, specifically designed to facilitate experimentation with different attention mechanisms and optimizations. While our block performs the same computation as standard Transformer blocks, we make slight modifications to the usual choice of operators in order to support the possibility of PyTorch NestedTensor inputs (as described here).

    # general imports
    import time, functools

    # torch imports
    import torch
    from torch.utils.data import Dataset, DataLoader
    import torch.nn as nn

    # Define Transformer settings
    BATCH_SIZE = 32
    NUM_HEADS = 16
    HEAD_DIM = 64
    DIM = NUM_HEADS * HEAD_DIM
    DEPTH = 24
    NUM_TOKENS = 1024
    MAX_SEQ_LEN = 1024
    PAD_ID = 0
    DEVICE = 'cuda'


    class MyAttentionBlock(nn.Module):
        def __init__(
                self,
                attn_fn,
                dim,
                num_heads,
                format=None,
                **kwargs
        ):
            super().__init__()
            self.attn_fn = attn_fn
            self.num_heads = num_heads
            self.dim = dim
            self.head_dim = dim // num_heads
            self.norm1 = nn.LayerNorm(dim, bias=False)
            self.norm2 = nn.LayerNorm(dim, bias=False)
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

            # mlp layers
            self.fc1 = nn.Linear(dim, dim * 4)
            self.act = nn.GELU()
            self.fc2 = nn.Linear(dim * 4, dim)

            self.permute = functools.partial(torch.transpose, dim0=1, dim1=2)
            if format == 'bshd':
                self.permute = nn.Identity()

        def mlp(self, x):
            x = self.fc1(x)
            x = self.act(x)
            x = self.fc2(x)
            return x

        def reshape_and_permute(self, x, batch_size):
            x = x.view(batch_size, -1, self.num_heads, self.head_dim)
            return self.permute(x)

        def forward(self, x_in, attn_mask=None):
            batch_size = x_in.size(0)
            x = self.norm1(x_in)
            qkv = self.qkv(x)

            # rather than first reformatting and then splitting the input
            # state, we first split and then reformat q, k, v in order to
            # support PyTorch Nested Tensors
            q, k, v = qkv.chunk(3, -1)
            q = self.reshape_and_permute(q, batch_size)
            k = self.reshape_and_permute(k, batch_size)
            v = self.reshape_and_permute(v, batch_size)

            # call the attn_fn with the input attn_mask
            x = self.attn_fn(q, k, v, attn_mask=attn_mask)

            # reformat output
            x = self.permute(x).reshape(batch_size, -1, self.dim)
            x = self.proj(x)
            x = x + x_in
            x = x + self.mlp(self.norm2(x))
            return x

    Transformer Decoder Model

    Building on our programmable Transformer block, we construct a typical Transformer decoder model.

    class MyDecoder(nn.Module):
        def __init__(
                self,
                block_fn,
                num_tokens,
                dim,
                num_heads,
                num_layers,
                max_seq_len,
                pad_idx=None
        ):
            super().__init__()
            self.num_heads = num_heads
            self.pad_idx = pad_idx
            self.embedding = nn.Embedding(num_tokens, dim, padding_idx=pad_idx)
            self.positional_embedding = nn.Embedding(max_seq_len, dim)
            self.blocks = nn.ModuleList([
                block_fn(
                    dim=dim,
                    num_heads=num_heads
                )
                for _ in range(num_layers)])
            self.output = nn.Linear(dim, num_tokens)

        def embed_tokens(self, input_ids, position_ids=None):
            x = self.embedding(input_ids)
            if position_ids is None:
                position_ids = torch.arange(input_ids.shape[1],
                                            device=x.device)
            x = x + self.positional_embedding(position_ids)
            return x

        def forward(self, input_ids, position_ids=None, attn_mask=None):
            # Embed tokens and add positional encoding
            x = self.embed_tokens(input_ids, position_ids)
            if self.pad_idx is not None:
                assert attn_mask is None
                # create a padding mask - we assume boolean masking
                attn_mask = (input_ids != self.pad_idx)
                attn_mask = attn_mask.view(BATCH_SIZE, 1, 1, -1) \
                    .expand(-1, self.num_heads, -1, -1)

            for b in self.blocks:
                x = b(x, attn_mask)

            logits = self.output(x)
            return logits

    Variable Length Sequence Input

    Next, we create a dataset containing sequences of variable lengths, where each sequence is made up of randomly generated tokens. For simplicity, we (arbitrarily) select a fixed distribution for the sequence lengths. In real-world scenarios, the distribution of sequence lengths typically reflects the nature of the data, such as the length of documents or audio segments. Note that the distribution of lengths directly affects the computational inefficiencies caused by padding.

    # Use random data
    class FakeDataset(Dataset):
        def __len__(self):
            return 1000000

        def __getitem__(self, index):
            length = torch.randint(1, MAX_SEQ_LEN, (1,))
            sequence = torch.randint(1, NUM_TOKENS, (length + 1,))
            input = sequence[:-1]
            target = sequence[1:]
            return input, target


    def pad_sequence(sequence, length, pad_val):
        return torch.nn.functional.pad(
            sequence,
            (0, length - sequence.shape[0]),
            value=pad_val
        )


    def collate_with_padding(batch):
        padded_inputs = []
        padded_targets = []
        for b in batch:
            padded_inputs.append(pad_sequence(b[0], MAX_SEQ_LEN, PAD_ID))
            padded_targets.append(pad_sequence(b[1], MAX_SEQ_LEN, PAD_ID))
        padded_inputs = torch.stack(padded_inputs, dim=0)
        padded_targets = torch.stack(padded_targets, dim=0)
        return {
            'inputs': padded_inputs,
            'targets': padded_targets
        }


    def data_to_device(data, device):
        if isinstance(data, dict):
            return {
                key: data_to_device(val, device)
                for key, val in data.items()
            }
        elif isinstance(data, (list, tuple)):
            return type(data)(
                data_to_device(val, device) for val in data
            )
        elif isinstance(data, torch.Tensor):
            return data.to(device=device, non_blocking=True)
        else:
            return data.to(device=device)

    Training/Evaluation Loop

    Lastly, we implement a main function that performs training/evaluation on input sequences of varying length.

    def main(
        block_fn,
        data_collate_fn=collate_with_padding,
        pad_idx=None,
        train=True,
        compile=False
    ):
        torch.random.manual_seed(0)
        device = torch.device(DEVICE)
        torch.set_float32_matmul_precision("high")

        # Create dataset and dataloader
        data_set = FakeDataset()
        data_loader = DataLoader(
            data_set,
            batch_size=BATCH_SIZE,
            collate_fn=data_collate_fn,
            num_workers=12,
            pin_memory=True,
            drop_last=True
        )

        model = MyDecoder(
            block_fn=block_fn,
            num_tokens=NUM_TOKENS,
            dim=DIM,
            num_heads=NUM_HEADS,
            num_layers=DEPTH,
            max_seq_len=MAX_SEQ_LEN,
            pad_idx=pad_idx
        ).to(device)

        if compile:
            model = torch.compile(model)

        # Define loss and optimizer
        criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)
        optimizer = torch.optim.SGD(model.parameters())

        def train_step(model, inputs, targets,
                       position_ids=None, attn_mask=None):
            with torch.amp.autocast(DEVICE, dtype=torch.bfloat16):
                outputs = model(inputs, position_ids, attn_mask)
                outputs = outputs.view(-1, NUM_TOKENS)
                targets = targets.flatten()
                loss = criterion(outputs, targets)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

        @torch.no_grad()
        def eval_step(model, inputs, targets,
                      position_ids=None, attn_mask=None):
            with torch.amp.autocast(DEVICE, dtype=torch.bfloat16):
                outputs = model(inputs, position_ids, attn_mask)
                if outputs.is_nested:
                    outputs = outputs.data._values
                    targets = targets.data._values
                else:
                    outputs = outputs.view(-1, NUM_TOKENS)
                    targets = targets.flatten()
                loss = criterion(outputs, targets)
            return loss

        if train:
            model.train()
            step_fn = train_step
        else:
            model.eval()
            step_fn = eval_step

        t0 = time.perf_counter()
        summ = 0
        count = 0

        for step, data in enumerate(data_loader):
            # Copy data to GPU
            data = data_to_device(data, device=device)
            step_fn(model, data['inputs'], data['targets'],
                    position_ids=data.get('indices'),
                    attn_mask=data.get('attn_mask'))

            # Capture step time
            batch_time = time.perf_counter() - t0
            if step > 20:  # Skip first steps
                summ += batch_time
                count += 1
            t0 = time.perf_counter()
            if step >= 100:
                break
        print(f'average step time: {summ / count}')

    PyTorch SDPA with Padding

    For our baseline experiments, we configure our Transformer block to utilize PyTorch’s SDPA mechanism. In our experiments, we run both training and evaluation, with and without torch.compile. These were run on an NVIDIA H100 with CUDA 12.4 and PyTorch 2.5.1.

    from torch.nn.functional import scaled_dot_product_attention as sdpa
    block_fn = functools.partial(MyAttentionBlock, attn_fn=sdpa)
    causal_block_fn = functools.partial(
        MyAttentionBlock,
        attn_fn=functools.partial(sdpa, is_causal=True)
    )

    for mode in ['eval', 'train']:
        for compile in [False, True]:
            block_func = causal_block_fn if mode == 'train' else block_fn
            print(f'{mode} with padding, '
                  f'{"compiled" if compile else "uncompiled"}')
            main(block_fn=block_func,
                 pad_idx=PAD_ID,
                 train=mode == 'train',
                 compile=compile)

    Performance Results:

    • Evaluation: 132 milliseconds (ms) without torch.compile, 130 ms with torch.compile
    • Training: 342 ms without torch.compile, 299 ms with torch.compile

    Optimizing for Variable Length Input

    In this section, we will explore several optimization techniques for handling variable-length input sequences in Transformer models.

    Padding Optimization

    Our first optimization relates not to the attention kernel but to our padding mechanism. Rather than padding the sequences in each batch to a constant length, we pad to the length of the longest sequence in the batch. The following block of code consists of our revised collation function and updated experiments.

    def collate_pad_to_longest(batch):
        padded_inputs = []
        padded_targets = []
        max_length = max([b[0].shape[0] for b in batch])
        for b in batch:
            padded_inputs.append(pad_sequence(b[0], max_length, PAD_ID))
            padded_targets.append(pad_sequence(b[1], max_length, PAD_ID))
        padded_inputs = torch.stack(padded_inputs, dim=0)
        padded_targets = torch.stack(padded_targets, dim=0)
        return {
            'inputs': padded_inputs,
            'targets': padded_targets
        }

    for mode in ['eval', 'train']:
        for compile in [False, True]:
            block_func = causal_block_fn if mode == 'train' else block_fn
            print(f'{mode} with pad-to-longest, '
                  f'{"compiled" if compile else "uncompiled"}')
            main(block_fn=block_func,
                 data_collate_fn=collate_pad_to_longest,
                 pad_idx=PAD_ID,
                 train=mode == 'train',
                 compile=compile)

    Padding to the longest sequence in each batch results in a slight performance acceleration:

    • Evaluation: 129 ms without torch.compile, 116 ms with torch.compile
    • Training: 337 ms without torch.compile, 294 ms with torch.compile

    SDPA with PyTorch NestedTensors

    Next, we take advantage of the built-in support for PyTorch NestedTensors in SDPA in evaluation mode. Currently a prototype feature, PyTorch NestedTensors allows for grouping together tensors of varying length. These are sometimes referred to as jagged or ragged tensors. In the code block below, we define a collation function for grouping our sequences into NestedTensors. We also define an indices entry so that we can properly calculate the positional embeddings.

    PyTorch NestedTensors are supported by a limited number of PyTorch ops. Working around these limitations can require some creativity. For example, addition between NestedTensors is only supported when they share precisely the same “jagged” shape. In the code below we use a workaround to ensure that the indices entry shares the same shape as the model inputs.

    def nested_tensor_collate(batch):
        inputs = torch.nested.as_nested_tensor([b[0] for b in batch],
                                               layout=torch.jagged)
        targets = torch.nested.as_nested_tensor([b[1] for b in batch],
                                                layout=torch.jagged)
        indices = torch.concat([torch.arange(b[0].shape[0]) for b in batch])

        # workaround for creating a NestedTensor with identical "jagged" shape
        xx = torch.empty_like(inputs)
        xx.data._values[:] = indices

        return {
            'inputs': inputs,
            'targets': targets,
            'indices': xx
        }


    for compile in [False, True]:
        print(f'eval with nested tensors, '
              f'{"compiled" if compile else "uncompiled"}')
        main(
            block_fn=block_fn,
            data_collate_fn=nested_tensor_collate,
            train=False,
            compile=compile
        )

    Although without torch.compile the NestedTensor optimization results in a step time of 131 ms, similar to our baseline result, in compiled mode the step time drops to 42 ms, an impressive ~3x improvement.

    FlashAttention2

    In our previous post we demonstrated the use of FlashAttention and its impact on the performance of a transformer model. In this post we demonstrate the use of flash_attn_varlen_func from flash-attn (2.7.0), an API designed for use with variable-sized inputs. To use this function, we concatenate all of the sequences in the batch into a single sequence. We also create a cu_seqlens tensor that points to the indices within the concatenated tensor where each of the individual sequences starts. The code block below includes our collation function followed by evaluation and training experiments. Note that flash_attn_varlen_func does not support torch.compile (at the time of this writing).

    def collate_concat(batch):
        inputs = torch.concat([b[0] for b in batch]).unsqueeze(0)
        targets = torch.concat([b[1] for b in batch]).unsqueeze(0)
        indices = torch.concat([torch.arange(b[0].shape[0]) for b in batch])
        seqlens = torch.tensor([b[0].shape[0] for b in batch])
        seqlens = torch.cumsum(seqlens, dim=0, dtype=torch.int32)
        cu_seqlens = torch.nn.functional.pad(seqlens, (1, 0))

        return {
            'inputs': inputs,
            'targets': targets,
            'indices': indices,
            'attn_mask': cu_seqlens
        }


    from flash_attn import flash_attn_varlen_func

    fa_varlen = lambda q, k, v, attn_mask: flash_attn_varlen_func(
        q.squeeze(0),
        k.squeeze(0),
        v.squeeze(0),
        cu_seqlens_q=attn_mask,
        cu_seqlens_k=attn_mask,
        max_seqlen_q=MAX_SEQ_LEN,
        max_seqlen_k=MAX_SEQ_LEN
    ).unsqueeze(0)

    fa_varlen_causal = lambda q, k, v, attn_mask: flash_attn_varlen_func(
        q.squeeze(0),
        k.squeeze(0),
        v.squeeze(0),
        cu_seqlens_q=attn_mask,
        cu_seqlens_k=attn_mask,
        max_seqlen_q=MAX_SEQ_LEN,
        max_seqlen_k=MAX_SEQ_LEN,
        causal=True
    ).unsqueeze(0)

    block_fn = functools.partial(MyAttentionBlock,
                                 attn_fn=fa_varlen,
                                 format='bshd')

    causal_block_fn = functools.partial(MyAttentionBlock,
                                        attn_fn=fa_varlen_causal,
                                        format='bshd')

    print('flash-attn eval')
    main(
        block_fn=block_fn,
        data_collate_fn=collate_concat,
        train=False
    )

    print('flash-attn train')
    main(
        block_fn=causal_block_fn,
        data_collate_fn=collate_concat,
        train=True,
    )

    The impact of this optimization is dramatic: 51 ms for evaluation and 160 ms for training, amounting to 2.6x and 2.1x performance boosts compared to our baseline experiment.

    XFormers Memory Efficient Attention

    In our previous post we demonstrated the use of the memory_efficient_attention operator from xFormers (0.0.28). Here we demonstrate the use of BlockDiagonalMask, specifically designed for input sequences of arbitrary length. The required collation function appears in the code block below, followed by the evaluation and training experiments. Note that torch.compile failed in training mode.

    from xformers.ops import fmha
    from xformers.ops import memory_efficient_attention as mea


    def collate_xformer(batch):
        inputs = torch.concat([b[0] for b in batch]).unsqueeze(0)
        targets = torch.concat([b[1] for b in batch]).unsqueeze(0)
        indices = torch.concat([torch.arange(b[0].shape[0]) for b in batch])
        seqlens = [b[0].shape[0] for b in batch]
        batch_sizes = [1 for b in batch]
        block_diag = fmha.BlockDiagonalMask.from_seqlens(seqlens, device='cpu')
        block_diag._batch_sizes = batch_sizes

        return {
            'inputs': inputs,
            'targets': targets,
            'indices': indices,
            'attn_mask': block_diag
        }


    mea_eval = lambda q, k, v, attn_mask: mea(
        q, k, v, attn_bias=attn_mask)

    mea_train = lambda q, k, v, attn_mask: mea(
        q, k, v, attn_bias=attn_mask.make_causal())

    block_fn = functools.partial(MyAttentionBlock,
                                 attn_fn=mea_eval,
                                 format='bshd')

    causal_block_fn = functools.partial(MyAttentionBlock,
                                        attn_fn=mea_train,
                                        format='bshd')

    print(f'xFormer Attention ')
    for compile in [False, True]:
        print(f'eval with xFormer Attention, '
              f'{"compiled" if compile else "uncompiled"}')
        main(block_fn=block_fn,
             train=False,
             data_collate_fn=collate_xformer,
             compile=compile)

    print(f'train with xFormer Attention')
    main(block_fn=causal_block_fn,
         train=True,
         data_collate_fn=collate_xformer)

    The resultant step times were 50 ms and 159 ms for evaluation and training without torch.compile. Evaluation with torch.compile resulted in a step time of 42 ms.

    Results

    The table below summarizes the results of our optimization methods.

    Step time results for different optimization methods (lower is better) — by Author

    The best performer for our toy model is xFormers’ memory_efficient_attention, which delivered a ~3x performance boost for evaluation and a ~2x boost for training. We caution against deriving any conclusions from these results, as the performance impact of different attention functions can vary significantly depending on the specific model and use case.

    Optimizing a HuggingFace Model for Variable-Length Input

    The tools and techniques described above are easy to implement when creating a model from scratch. However, these days it is not uncommon for ML developers to adopt existing (pretrained) models and finetune them for their use case. While the optimizations we have described can be integrated without changing the set of model weights and without altering the model behavior, it is not entirely clear what the best way to do this is. In an ideal world, our ML framework would allow us to program the use of an attention mechanism that is optimized for variable-length inputs. In this section we demonstrate how to optimize HuggingFace models for variable-length inputs.

    A Toy HuggingFace Model – GPT2LMHeadModel

    To facilitate the discussion, we create a toy example in which we train a HuggingFace GPT2LMHead model on variable-length sequences. This requires adapting our random dataset and data-padding collation function according to HuggingFace’s input specifications.

    from transformers import GPT2Config, GPT2LMHeadModel


    # Use random data
    class HuggingFaceFakeDataset(Dataset):
        def __len__(self):
            return 1000000

        def __getitem__(self, index):
            length = torch.randint(1, MAX_SEQ_LEN, (1,))
            input_ids = torch.randint(1, NUM_TOKENS, (length,))
            labels = input_ids.clone()
            labels[0] = PAD_ID  # ignore first token
            return {
                'input_ids': input_ids,
                'labels': labels
            }


    def hf_collate_with_padding(batch):
        padded_inputs = []
        padded_labels = []
        for b in batch:
            input_ids = b['input_ids']
            labels = b['labels']
            padded_inputs.append(pad_sequence(input_ids, MAX_SEQ_LEN, PAD_ID))
            padded_labels.append(pad_sequence(labels, MAX_SEQ_LEN, PAD_ID))
        padded_inputs = torch.stack(padded_inputs, dim=0)
        padded_labels = torch.stack(padded_labels, dim=0)
        return {
            'input_ids': padded_inputs,
            'labels': padded_labels,
            'attention_mask': (padded_inputs != PAD_ID)
        }

    Training Function

    Our training function instantiates a GPT2LMHeadModel based on the requested GPT2Config and proceeds to train it on our variable-length sequences.

    def hf_main(
        config,
        collate_fn=hf_collate_with_padding,
        compile=False
    ):
        torch.random.manual_seed(0)
        device = torch.device(DEVICE)
        torch.set_float32_matmul_precision("high")

        # Create dataset and dataloader
        data_set = HuggingFaceFakeDataset()
        data_loader = DataLoader(
            data_set,
            batch_size=BATCH_SIZE,
            collate_fn=collate_fn,
            num_workers=12 if DEVICE == 'cuda' else 0,
            pin_memory=True,
            drop_last=True
        )

        model = GPT2LMHeadModel(config).to(device)

        if compile:
            model = torch.compile(model)

        # Define loss and optimizer
        criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)
        optimizer = torch.optim.SGD(model.parameters())

        model.train()

        t0 = time.perf_counter()
        summ = 0
        count = 0

        for step, data in enumerate(data_loader):
            # Copy data to GPU
            data = data_to_device(data, device=device)
            input_ids = data['input_ids']
            labels = data['labels']
            position_ids = data.get('position_ids')
            attn_mask = data.get('attention_mask')
            with torch.amp.autocast(DEVICE, dtype=torch.bfloat16):
                outputs = model(input_ids=input_ids,
                                position_ids=position_ids,
                                attention_mask=attn_mask)
                logits = outputs.logits[..., :-1, :].contiguous()
                labels = labels[..., 1:].contiguous()
                loss = criterion(logits.view(-1, NUM_TOKENS), labels.flatten())

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

            # Capture step time
            batch_time = time.perf_counter() - t0
            if step > 20:  # Skip first steps
                summ += batch_time
                count += 1
            t0 = time.perf_counter()
            if step >= 100:
                break
        print(f'average step time: {summ / count}')

    SDPA with Padding

    In the code block below, we call our training function with the default sequence-padding collator.

    config = GPT2Config(
        n_layer=DEPTH,
        n_embd=DIM,
        n_head=NUM_HEADS,
        vocab_size=NUM_TOKENS,
    )

    for compile in [False, True]:
        print(f"HF GPT2 train with SDPA, compile={compile}")
        hf_main(config=config, compile=compile)

    The resultant step times are 815 ms without torch.compile and 440 ms with torch.compile.

    FlashAttention2

    We now take advantage of HuggingFace’s built-in support for FlashAttention2 by setting the attn_implementation parameter to “flash_attention_2”. Behind the scenes, HuggingFace will unpad the padded input data and then pass it to the optimized flash_attn_varlen_func function we saw above:

    flash_config = GPT2Config(
        n_layer=DEPTH,
        n_embd=DIM,
        n_head=NUM_HEADS,
        vocab_size=NUM_TOKENS,
        attn_implementation='flash_attention_2'
    )

    print(f"HF GPT2 train with flash")
    hf_main(config=flash_config)

    The resultant step time is 620 ms, amounting to a 30% boost (in uncompiled mode) with just a simple flick of a switch.

    FlashAttention2 with Unpadded Input

    Of course, padding the sequences in the collation function only to have them unpadded hardly seems sensible. In a recent update to HuggingFace, support was added for passing in concatenated (unpadded) sequences to a select number of models. Unfortunately (as of the time of this writing), our GPT2 model did not make the cut. However, adding support requires just five small one-line additions to modeling_gpt2.py in order to propagate the sequence position_ids to the flash-attention kernel. The full patch appears in the block below:

    @@ -370,0 +371 @@
    + position_ids = None
    @@ -444,0 +446 @@
    + position_ids=position_ids
    @@ -611,0 +614 @@
    + position_ids=None
    @@ -621,0 +625 @@
    + position_ids=position_ids
    @@ -1140,0 +1145 @@
    + position_ids=position_ids

    We define a collate function that concatenates our sequences and train our HuggingFace model on the unpadded sequences. (Also see the built-in DataCollatorWithFlattening utility.)

    def collate_flatten(batch):
        input_ids = torch.concat([b['input_ids'] for b in batch]).unsqueeze(0)
        labels = torch.concat([b['labels'] for b in batch]).unsqueeze(0)
        position_ids = [torch.arange(b['input_ids'].shape[0]) for b in batch]
        position_ids = torch.concat(position_ids)

        return {
            'input_ids': input_ids,
            'labels': labels,
            'position_ids': position_ids
        }


    print(f"HF GPT2 train with flash, no padding")
    hf_main(config=flash_config, collate_fn=collate_flatten)

    The resulting step time is 323 ms, 90% faster than running flash-attention on the padded input.

    Results

    The results of our HuggingFace experiments are summarized below.

    Step time results for different optimization methods (lower is better) — by Author

    With little effort, we were able to boost our runtime performance by 2.5x when compared to the uncompiled baseline experiment, and by 36% when compared to the compiled version.

    In this section, we demonstrated how the HuggingFace APIs allow us to leverage the optimized kernels in FlashAttention2, significantly boosting the training performance of existing models on sequences of varying length.

    Summary

    As AI models continue to grow in both popularity and complexity, optimizing their performance has become essential for reducing runtime and costs. This is especially true for compute-intensive components like attention layers. In this post, we have continued our exploration of attention layer optimization, and demonstrated new tools and techniques for enhancing Transformer model performance. For more insights on AI model optimization, be sure to check out the first post in this series as well as our many other posts on this topic.

