We can significantly accelerate next-token generation in LLMs by merging consecutive pairs of tokens using SLERP, reducing the compute needed to perform the full prediction.
This article presents a novel approach to accelerating Large Language Model (LLM) inference by merging tokens using Spherical Linear Interpolation (SLERP). By reducing the sequence length while maintaining quality, this technique offers significant speed-ups in LLM inference, addressing the computational challenges posed by longer sequences. The method is still raw, but it highlights a dual view of LLMs: one setup for training and another dedicated to prediction.
Background:
LLMs have revolutionized natural language processing tasks by exhibiting remarkable generative abilities. However, their effectiveness comes at a cost: computational resources. As LLMs process longer sequences, the quadratic scaling of transformer computations becomes increasingly prohibitive. Traditional methods to mitigate this, such as caching and quantization, have limitations. Therefore, there is a need for innovative approaches to speed up LLM inference without compromising quality too much.
The current method for generating a token during inference is a brute-force approach, essentially a transposition of the training methodology. While this methodology has proven effective for training, it may not be the most efficient for inference. Thus, there is an opportunity to develop a methodology dedicated specifically to token generation, which could optimize the process and further enhance the efficiency of LLMs. This highlights the importance of exploring alternative techniques to address the computational challenges faced by LLM inference.
Recently, the mergekit library proposed merging networks' weights using the SLERP method, which tends to yield better results. Inspired by this work, I decided to see if I could merge the tokens inside a sequence to produce a smaller sequence to process while predicting the next token.
Figure: vanilla generation vs. merged generation
Merging Sequence with SLERP:
The proposed method involves modifying the forward pass of LLMs to merge tokens using Spherical Linear Interpolation (SLERP), a technique borrowed from computer graphics and animation. Unlike simple averaging techniques, SLERP preserves the spherical aspects of token dimensions, offering a more nuanced interpolation. The merging procedure entails several steps to efficiently condense the input sequence:
Sequence Length Adjustment:
Initially, the input sequence undergoes adjustments based on its length:
Sequences with a length less than 3 remain unchanged.
For odd-length sequences, two null tokens are added, one at the beginning and one at the end.
Even-length sequences receive an additional null token, positioned at the penultimate position.
By doing so, we ensure that the first and last tokens in the context are preserved.
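Here is a minimal sketch of this adjustment step in plain PyTorch. The representation of the null token as a zero vector, the function name and the tensor layout are illustrative assumptions, not the exact implementation.

```python
import torch

def adjust_sequence(hidden: torch.Tensor) -> torch.Tensor:
    # hidden: (batch, seq_len, dim) hidden states at the chosen layer
    batch, seq_len, dim = hidden.shape
    if seq_len < 3:
        return hidden  # short sequences are left untouched
    null = torch.zeros(batch, 1, dim, dtype=hidden.dtype, device=hidden.device)
    if seq_len % 2 == 1:
        # odd length: one null token at the beginning and one at the end
        return torch.cat([null, hidden, null], dim=1)
    # even length: one null token at the penultimate position
    return torch.cat([hidden[:, :-1], null, hidden[:, -1:]], dim=1)
```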
Pair Formation:
The adjusted sequence is then formatted into pairs of consecutive tokens. This process prepares the data for aggregation.
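Concretely, pair formation is little more than a reshape. Continuing the sketch above (the leftover handling is an assumption of mine for the case where the adjusted length is odd):

```python
import torch

def form_pairs(hidden: torch.Tensor):
    # hidden: (batch, seq_len, dim) -> pairs: (batch, seq_len // 2, 2, dim)
    batch, seq_len, dim = hidden.shape
    paired_len = (seq_len // 2) * 2
    pairs = hidden[:, :paired_len].reshape(batch, paired_len // 2, 2, dim)
    leftover = hidden[:, paired_len:]  # possibly empty; carried over unchanged
    return pairs, leftover
```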
Aggregation with SLERP:
Each pair of tokens undergoes aggregation using SLERP, effectively reducing the sequence length by half (not exactly as we add and preserve some extra tokens). SLERP interpolates between the two vectors representing consecutive tokens. This creates a new vector.
To do so efficiently, I recreated all the SLERP functions in native PyTorch. However, the code might be under-optimized.
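Below is a plain PyTorch version of the SLERP step as used in this sketch; the epsilon values and the fallback to linear interpolation for near-parallel vectors are implementation choices of mine, not necessarily those of the original code.

```python
import torch

def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    # Spherical linear interpolation between two batches of vectors of shape (..., dim).
    v0_unit = v0 / (v0.norm(dim=-1, keepdim=True) + eps)
    v1_unit = v1 / (v1.norm(dim=-1, keepdim=True) + eps)
    dot = (v0_unit * v1_unit).sum(dim=-1, keepdim=True).clamp(-1.0 + eps, 1.0 - eps)
    omega = torch.acos(dot)                       # angle between the two vectors
    sin_omega = torch.sin(omega)
    slerped = (torch.sin((1.0 - t) * omega) * v0 + torch.sin(t * omega) * v1) / sin_omega
    lerped = (1.0 - t) * v0 + t * v1              # fallback when vectors are nearly parallel
    return torch.where(dot.abs() > 1.0 - 1e-3, lerped, slerped)

def merge_pairs(pairs: torch.Tensor) -> torch.Tensor:
    # pairs: (batch, k, 2, dim) -> merged: (batch, k, dim), interpolating at t = 0.5
    return slerp(pairs[..., 0, :], pairs[..., 1, :], t=0.5)
```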
Layer Cutoff and Prompt Preservation:
The merging process can occur at different levels of the model architecture, referred to as “layer cutoff.” Additionally, to preserve the integrity of prompts, a portion of the sequence at the beginning and/or end can be designated to remain unchanged. This is particularly useful for Instruct-based Models where the starting part of the prompt should always be remembered.
This approach offers a nuanced solution to the computational challenges associated with LLM inference, promising significant speed-ups with little loss in quality or accuracy.
Figure: simple illustrative example of how a sequence is merged
What it means:
Concretely, in an LLM, the forward call takes as input a sequence of tokens of shape (batch size, sequence length). The embedding layer turns it into a sequence of shape (batch size, sequence length, dimension). Each attention module takes this sequence as input. At a given attention layer, you can merge the tokens, creating a sequence of shape (batch size, k, dimension), where k is the compressed sequence length. The layer at which this is applied is the "layer cutoff".
The subsequent attention modules no longer need to compute a (sequence length, sequence length) attention matrix but a smaller one, since k is strictly smaller than the original sequence length.
Hence, the merging can occur at different levels of the model architecture; this parameter is referred to as the "layer cutoff". Also, to ensure that a prompt is not completely merged, you can define a part of the sequence at the beginning and/or at the end to be kept unchanged. This is especially useful for instruct-based models, where the starting part of the prompt should always be retained. A schematic version of this merged forward pass is sketched below.
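The sketch below shows where the merging sits in a simplified decoder forward pass, reusing the helpers defined earlier. Everything here is schematic: `model.layers`, the `keep_start`/`keep_end` arguments and the assumption that each layer can be called on the hidden states alone are simplifications of mine, not the actual Mistral forward signature.

```python
import torch

def forward_with_merging(model, hidden, layer_cutoff: int, keep_start: int = 0, keep_end: int = 0):
    # hidden: (batch, seq_len, dim) output of the embedding layer
    for i, layer in enumerate(model.layers):
        if i == layer_cutoff:
            seq_len = hidden.size(1)
            head = hidden[:, :keep_start]               # prompt tokens kept unchanged
            tail = hidden[:, seq_len - keep_end:]       # trailing tokens kept unchanged (empty if keep_end == 0)
            middle = hidden[:, keep_start:seq_len - keep_end]
            pairs, leftover = form_pairs(adjust_sequence(middle))
            merged = merge_pairs(pairs)
            hidden = torch.cat([head, merged, leftover, tail], dim=1)
            # attention mask and position ids must be rebuilt here for the new, shorter length
        hidden = layer(hidden)
    return hidden
```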
One downside of this methodology is that it strongly relies on the underlying forward pass of the chosen model, requiring you to carefully rewrite the "merged" forward pass for each model you want to support. Another downside is the need to recompute attention masks, and possibly positional embeddings, at each step.
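For instance, a naive way to rebuild these tensors for the compressed length, simply re-indexing positions from zero (which is precisely the part that would benefit from a proper rethink of the positional encodings), could look like this:

```python
import torch

def rebuild_mask_and_positions(batch: int, new_len: int, device: torch.device):
    # Naive reconstruction: every merged position is visible and positions are re-indexed from zero.
    attention_mask = torch.ones(batch, new_len, dtype=torch.long, device=device)
    position_ids = torch.arange(new_len, device=device).unsqueeze(0).expand(batch, -1)
    return attention_mask, position_ids
```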
Results:
Experiments conducted on a Mistral 7B Instruct V0.2 model demonstrate promising outcomes. By comparing predictions between the base model and various merged models at different layer cutoffs, it was observed that merging tokens did not significantly impact prediction quality. Moreover, the merged models exhibited notable speed-ups in inference time, particularly at shallower layers. The technique also showcased its effectiveness in handling longer sequences, making it applicable across a wide range of use cases.
Figure: accuracy between the merged-inference model's predicted token and the base model's, for different layer cuts and sequence lengths
Figure: speed-up of the merged-inference model over the base model, for different sequence lengths and layer cuts
One downside is that I did not manage to fully optimize the forward call. Hence, there are probably many optimizations left to find by rethinking the process.
I also tested a merged version of Mistral 7B Instruct v0.2 on the AlpacaEval dataset, applying the merging at the 20th attention module. The results are really encouraging, as the model outperforms Falcon 7B, Gemma 7B and nous-hermes-13b. They also show that merging without rethinking the positional encodings yields a model that is more talkative, generating on average 600 more tokens. I tried to reimplement the positional encoding procedure but have not succeeded so far.
Figure: extract from the AlpacaEval leaderboard
In summary, merging tokens with SLERP is a strong candidate for addressing the computational challenges associated with LLM inference. It strikes a balance between speed and quality, and amounts to little more than rewriting the forward loop.
Using a new class where the forward call is adapted, you can easily pass the LLM to a generation pipeline and use it on your dataset, as sketched below. So far my experiments are limited to a Mistral 7B model, but I would like to extend the approach to other architectures to see whether the performance holds.
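For illustration, usage could look like the snippet below, where `MergedMistralForCausalLM` is a hypothetical wrapper class (the model with the adapted forward call) and `layer_cutoff` a hypothetical argument; only the tokenizer and pipeline calls are standard transformers API.

```python
from transformers import AutoTokenizer, pipeline

# MergedMistralForCausalLM is hypothetical: a Mistral subclass whose forward pass merges tokens.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = MergedMistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    layer_cutoff=20,   # hypothetical argument: attention module at which tokens are merged
)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Explain SLERP in one sentence.", max_new_tokens=64)[0]["generated_text"])
```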
All of the resources are available, and you can reach out to me if you would like to test it on another LLM.
Conclusion:
Merging tokens with SLERP is a technique worth exploring for accelerating LLM inference. With further optimization and exploration, it holds the potential to improve the efficiency and scalability of natural language processing tasks.
If you work in the AI field and want to bring this to the next level: reach out to me!