Tag: technews

  • Linearizing Attention


    Shitanshu Bhushan

    Breaking the quadratic barrier: modern alternatives to softmax attention

Large Language Models are great, but they have one notable drawback: they use softmax attention, which can be computationally intensive. In this article we will explore whether we can replace the softmax somehow to achieve linear time complexity.

    Image by Author (Created using Miro Board)

    Attention Basics

I am going to assume you already know about models like ChatGPT and Claude, and how transformers work in them. Attention is the backbone of such models. In a standard RNN, we encode all past states into a single hidden state and then use that hidden state, along with the new input, to produce our output. The clear drawback is that you can’t store everything in just a small hidden state. This is where attention helps: imagine that for each new query you could find the most relevant past data and use it to make your prediction. That is essentially what attention does.

The attention mechanism in transformers (the architecture behind most current language models) involves query, key, and value embeddings: queries are matched against keys to retrieve relevant values. For each query (Q), the model computes similarity scores with all available keys (K), then uses these scores to create a weighted combination of the corresponding values (V). This attention calculation can be expressed as:

    Source: Image by Author
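For reference, the standard scaled dot-product attention formula (written here with key dimension d_k; the 1/√d_k scaling is the usual transformer convention) is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V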

This mechanism enables the model to selectively retrieve and utilize information from its entire context when making predictions. We use softmax here because it converts raw similarity scores into normalized probabilities, acting much like a k-nearest-neighbor mechanism in which higher attention weights are assigned to more relevant keys.

Okay, now let’s look at the computational cost of a single attention layer:

    Source: Image by Author
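Breaking the cost down term by term, with sequence length N and embedding dimension d (a standard accounting, stated here for reference):

\underbrace{O(N^2 d)}_{\text{computing } QK^\top} \;+\; \underbrace{O(N^2)}_{\text{row-wise softmax}} \;+\; \underbrace{O(N^2 d)}_{\text{weighting } V} \;=\; O(N^2 d)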

    Softmax Drawback

From the above, we can see that we need to compute the softmax of an N×N matrix, so our computational cost becomes quadratic in the sequence length. This is fine for shorter sequences, but it becomes extremely computationally inefficient for long sequences (N = 100k+).

    This gives us our motivation: can we reduce this computational cost? This is where linear attention comes in.

    Linear Attention

    Introduced by Katharopoulos et al., linear attention uses a clever trick where we write the softmax exponential as a kernel function, expressed as dot products of feature maps φ(x). Using the associative property of matrix multiplication, we can then rewrite the attention computation to be linear. The image below illustrates this transformation:

    Source: Image by Author

Katharopoulos et al. used elu(x) + 1 as φ(x), but any kernel feature map that effectively approximates the exponential similarity can be used. The computational cost of the above can be written as:

    Source: Image by Author

This eliminates the need to compute the full N×N attention matrix and reduces the complexity to O(Nd²), where d is the embedding dimension. This is effectively linear in the sequence length when N ≫ d, which is usually the case for Large Language Models.
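As a minimal sketch of this reordering (single head, non-causal, no batching; the function names are mine, and a real implementation would also handle masking), the key point is that the d × d matrix KᵀV is formed once instead of the N × N attention matrix:

import torch
import torch.nn.functional as F

def feature_map(x):
    # phi(x) = elu(x) + 1, the feature map used by Katharopoulos et al.
    return F.elu(x) + 1.0

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (N, d), V: (N, d_v)
    Qp, Kp = feature_map(Q), feature_map(K)
    KV = Kp.transpose(0, 1) @ V          # (d, d_v): costs O(N * d * d_v)
    Z = Kp.sum(dim=0)                    # (d,): normalizer accumulator
    numer = Qp @ KV                      # (N, d_v): costs O(N * d * d_v)
    denom = (Qp @ Z).clamp(min=eps)      # (N,)
    return numer / denom[:, None]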

Okay, let’s look at the recurrent view of linear attention:

    Source: Image by Author

Okay, why can we do this for linear attention and not for softmax? The softmax is not separable, so we can’t write it as a product of separate terms. A nice thing to note here is that during decoding we only need to keep track of S_(n-1), giving us O(d²) complexity per generated token, since S is a d × d matrix.
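A minimal sketch of one decoding step in this recurrent view (same elu(x) + 1 feature map, batching omitted; the function names are placeholders):

import torch
import torch.nn.functional as F

def feature_map(x):
    return F.elu(x) + 1.0

def decode_step(S, z, q, k, v, eps=1e-6):
    # S: (d, d_v) running state S_(n-1); z: (d,) running normalizer
    phi_k, phi_q = feature_map(k), feature_map(q)
    S = S + torch.outer(phi_k, v)                  # S_n = S_(n-1) + phi(k_n) v_n^T
    z = z + phi_k
    y = (phi_q @ S) / (phi_q @ z).clamp(min=eps)   # O(d * d_v) work per token
    return S, z, y

Each generated token touches only the fixed-size state (S, z), which is exactly where the O(d²) per-token cost comes from.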

However, this efficiency comes with an important drawback. Since S_(n-1) is a d × d matrix, it can only store d² worth of information, which is a fundamental limitation. For instance, if your original context requires storing 20d² worth of information, you will essentially lose 19d² of it in the compression. This illustrates the core memory-efficiency tradeoff of linear attention: we gain computational efficiency by maintaining only a fixed-size state matrix, but that same fixed size limits how much context information we can preserve. This is what motivates gating.

    Gated Linear Attention

Okay, so we’ve established that we will inevitably forget information when optimizing for efficiency with a fixed-size state matrix. This raises an important question: can we be smart about what we remember? This is where gating comes in. Researchers use it as a mechanism to selectively retain important information, trying to minimize the impact of memory loss by being strategic about what to keep in our limited state. Gating isn’t a new concept and has been widely used in architectures like the LSTM.

The basic change here is in the way we formulate S_n:

    Source: Image by author

There are many choices for G, each of which leads to a different model:

Source: Yang, Songlin, et al. “Gated linear attention transformers with hardware-efficient training.” arXiv preprint arXiv:2312.06635 (2023).

    A key advantage of this architecture is that the gating function depends only on the current token x and learnable parameters, rather than on the entire sequence history. Since each token’s gating computation is independent, this allows for efficient parallel processing during training — all gating computations across the sequence can be performed simultaneously.
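As an illustrative sketch of such a data-dependent update (the gate here is a per-row forget vector computed from the current token only; the exact parameterization differs across the models in the table above):

import torch

class GatedStateUpdate(torch.nn.Module):
    # Illustrative: S_n = G_n * S_(n-1) + phi(k_n) v_n^T, with the gate computed from x_n alone
    def __init__(self, d_model, d_key):
        super().__init__()
        self.gate_proj = torch.nn.Linear(d_model, d_key)

    def forward(self, S_prev, x, phi_k, v):
        # S_prev: (d_key, d_value); x: (d_model,); phi_k: (d_key,); v: (d_value,)
        g = torch.sigmoid(self.gate_proj(x))               # forget gate in (0, 1), one value per key dimension
        return g[:, None] * S_prev + torch.outer(phi_k, v)

Because the gate depends only on x and the learnable projection, all gates in a sequence can be computed in parallel during training, as noted above.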

    State Space Models

    When we think about processing sequences like text or time series, our minds usually jump to attention mechanisms or RNNs. But what if we took a completely different approach? Instead of treating sequences as, well, sequences, what if we processed them more like how CNNs handle images using convolutions?

    State Space Models (SSMs) formalize this approach through a discrete linear time-invariant system:

    Source: Image by Author
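Written out (with hidden state x_t, input u_t, and output y_t; this matches the parameters A, B, and c referenced below), the discrete linear time-invariant system is:

x_t = A\, x_{t-1} + B\, u_t, \qquad y_t = c^\top x_t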

Okay, now let’s see how this relates to convolution:

    Source: Image by Author

    where F is our learned filter derived from parameters (A, B, c), and * denotes convolution.
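To see why, unroll the recurrence with x_0 = 0: the output becomes a convolution of the input with a filter built from powers of A (a standard derivation, reproduced here for completeness):

y_t = \sum_{k=0}^{t-1} c^\top A^{k} B \, u_{t-k}, \qquad F = \left(c^\top B,\; c^\top A B,\; c^\top A^{2} B,\; \dots\right)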

    H3 implements this state space formulation through a novel structured architecture consisting of two complementary SSM layers.

    Source: Fu, Daniel Y., et al. “Hungry hungry hippos: Towards language modeling with state space models.” arXiv preprint arXiv:2212.14052 (2022).

Here we take the input and project it into three channels that imitate K, Q, and V. We then use two SSMs and two gating operations to loosely imitate linear attention, and it turns out that this kind of architecture works quite well in practice.
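A rough sketch of that data flow (heavily simplified; in the actual H3 block one SSM uses a shift structure and the other a diagonal structure, and the projections and SSM modules below are placeholders, not the paper’s exact components):

def h3_block_sketch(x, proj_q, proj_k, proj_v, ssm_1, ssm_2):
    # Project the input into three channels imitating Q, K, V, then interleave
    # two SSMs with two multiplicative (gating) interactions.
    q, k, v = proj_q(x), proj_k(x), proj_v(x)   # each (N, d)
    gated = ssm_1(k) * v                        # first SSM + first elementwise gate
    return q * ssm_2(gated)                     # second SSM + second elementwise gate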

    Selective State Space Models

    Earlier, we saw how gated linear attention improved upon standard linear attention by making the information retention process data-dependent. A similar limitation exists in State Space Models — the parameters A, B, and c that govern state transitions and outputs are fixed and data-independent. This means every input is processed through the same static system, regardless of its importance or context.

We can extend SSMs by making them data-dependent through time-varying dynamical systems:

    Source: Image by Author
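Concretely, the parameters become functions of the current input, turning the time-invariant system above into:

x_t = A_t\, x_{t-1} + B_t\, u_t, \qquad y_t = c_t^\top x_t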

    The key question becomes how to parametrize c_t, b_t, and A_t to be functions of the input. Different parameterizations can lead to architectures that approximate either linear or gated attention mechanisms.

    Mamba implements this time-varying state space formulation through selective SSM blocks.

    Source: Gu, Albert, and Tri Dao. “Mamba: Linear-time sequence modeling with selective state spaces.” arXiv preprint arXiv:2312.00752 (2023).

Mamba uses a selective SSM in place of a standard SSM, and adds output gating and an additional convolution to improve performance. This is a very high-level view of how Mamba combines these components into an efficient architecture for sequence modeling.

    Conclusion

    In this article, we explored the evolution of efficient sequence modeling architectures. Starting with traditional softmax attention, we identified its quadratic complexity limitation, which led to the development of linear attention. By rewriting attention using kernel functions, linear attention achieved O(Nd²) complexity but faced memory limitations due to its fixed-size state matrix.

    This limitation motivated gated linear attention, which introduced selective information retention through gating mechanisms. We then explored an alternative perspective through State Space Models, showing how they process sequences using convolution-like operations. The progression from basic SSMs to time-varying systems and finally to selective SSMs parallels our journey from linear to gated attention — in both cases, making the models more adaptive to input data proved crucial for performance.

    Through these developments, we see a common theme: the fundamental trade-off between computational efficiency and memory capacity. Softmax attention excels at in-context learning by maintaining full attention over the entire sequence, but at the cost of quadratic complexity. Linear variants (including SSMs) achieve efficient computation through fixed-size state representations, but this same optimization limits their ability to maintain detailed memory of past context. This trade-off continues to be a central challenge in sequence modeling, driving the search for architectures that can better balance these competing demands.

To read more on these topics, I would suggest the following papers:

Linear Attention: Katharopoulos, Angelos, et al. “Transformers are RNNs: Fast autoregressive transformers with linear attention.” International Conference on Machine Learning. PMLR, 2020.

GLA: Yang, Songlin, et al. “Gated linear attention transformers with hardware-efficient training.” arXiv preprint arXiv:2312.06635 (2023).

    H3: Fu, Daniel Y., et al. “Hungry hungry hippos: Towards language modeling with state space models.” arXiv preprint arXiv:2212.14052 (2022).

    Mamba: Gu, Albert, and Tri Dao. “Mamba: Linear-time sequence modeling with selective state spaces.” arXiv preprint arXiv:2312.00752 (2023).

    Waleffe, Roger, et al. “An Empirical Study of Mamba-based Language Models.” arXiv preprint arXiv:2406.07887 (2024).

    Acknowledgement

    This blog post was inspired by coursework from my graduate studies during Fall 2024 at University of Michigan. While the courses provided the foundational knowledge and motivation to explore these topics, any errors or misinterpretations in this article are entirely my own. This represents my personal understanding and exploration of the material.



  • Understanding the Mathematics of PPO in Reinforcement Learning

    Manelle Nouar

    Deep dive into RL with PPO for beginners

    Photo by ThisisEngineering on Unsplash

    Introduction

    Reinforcement Learning (RL) is a branch of Artificial Intelligence that enables agents to learn how to interact with their environment. These agents, which range from robots to software features or autonomous systems, learn through trial and error. They receive rewards or penalties based on the actions they take, which guide their future decisions.

    Among the most well-known RL algorithms, Proximal Policy Optimization (PPO) is often favored for its stability and efficiency. PPO addresses several challenges in RL, particularly in controlling how the policy (the agent’s decision-making strategy) evolves. Unlike other algorithms, PPO ensures that policy updates are not too large, preventing destabilization during training. This stabilization is crucial, as drastic updates can cause the agent to diverge from an optimal solution, making the learning process erratic. PPO thus maintains a balance between exploration (trying new actions) and exploitation (focusing on actions that yield the highest rewards).

    Additionally, PPO is highly efficient in terms of both computational resources and learning speed. By optimizing the agent’s policy effectively while avoiding overly complex calculations, PPO has become a practical solution in various domains, such as robotics, gaming, and autonomous systems. Its simplicity makes it easy to implement, which has led to its widespread adoption in both research and industry.

    This article explores the mathematical foundations of RL and the key concepts introduced by PPO, providing a deeper understanding of why PPO has become a go-to algorithm in modern reinforcement learning research.

    1. The Basics of RL: Markov Decision Process (MDP)

    Reinforcement learning problems are often modeled using a Markov Decision Process (MDP), a mathematical framework that helps formalize decision-making in environments where outcomes are uncertain.

    A Markov chain models a system that transitions between states, where the probability of moving to a new state depends solely on the current state and not on previous states. This principle is known as the Markov property. In the context of MDPs, this simplification is key for modeling decisions, as it allows an agent to focus only on the current state when making decisions without needing to account for the entire history of the system.

    An MDP is defined by the following elements:
    – S: Set of possible states.
    – A: Set of possible actions.
    – P(s’|s, a): Transition probability of reaching state s’ after taking action a in state s.
    – R(s, a): Reward received after taking action a in state s.
    – γ: Discount factor (a value between 0 and 1) that reflects the importance of future rewards.

    The discount factor γ is crucial for modeling the importance of future rewards in decision-making problems. When an agent makes a decision, it must evaluate not only the immediate reward but also the potential future rewards. The discount γ reduces the impact of rewards that occur later in time due to the uncertainty of reaching those rewards. Thus, a value of γ close to 1 indicates that future rewards are almost as important as immediate rewards, while a value close to 0 gives more importance to immediate rewards.

    The time discount reflects the agent’s preference for quick gains over future ones, often due to uncertainty or the possibility of changes in the environment. For example, an agent will likely prefer an immediate reward rather than one in the distant future unless that future reward is sufficiently significant. This discount factor thus models optimization behaviors where the agent considers both short-term and long-term benefits.

    The goal is to find an action policy π(a|s) that maximizes the expected sum of rewards over time, often referred to as the value function:
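In standard MDP notation, this value function can be written as:

V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right]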

    This function represents the expected total reward an agent can accumulate starting from state s and following policy π.

    2. Policy Optimization: Policy Gradient

    Policy gradient methods focus on directly optimizing the parameters θ of a policy πθ by maximizing an objective function that represents the expected reward obtained by following that policy in a given environment.

    The objective function is defined as:
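One standard way of writing this objective, consistent with the terms described just below, is:

J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \pi_{\theta}(a \mid s)\, R(s, a)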

    Where R(s, a) is the reward received for taking action a in state s, and the goal is to maximize this expected reward over time. The term dπ(s) represents the stationary distribution of states under policy π, indicating how frequently the agent visits each state when following policy π.

    The policy gradient theorem gives the gradient of the objective function, providing a way to update the policy parameters:
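In its usual form, the theorem states:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\; Q^{\pi_{\theta}}(s, a)\right]

where Q^{π_θ}(s, a) is the expected return obtained after taking action a in state s and following π_θ thereafter.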

    This equation shows how to adjust the policy parameters based on past experiences, which helps the agent learn more efficient behaviors over time.

    3. Mathematical Enhancements of PPO

PPO (Proximal Policy Optimization) introduces several important features to improve the stability and efficiency of reinforcement learning, particularly in large and complex environments. PPO was introduced by John Schulman et al. in 2017 as an improvement over earlier policy optimization algorithms like Trust Region Policy Optimization (TRPO). The main motivation behind PPO was to strike a balance between sample efficiency, ease of implementation, and stability while avoiding the complexities of TRPO’s second-order optimization methods.

While TRPO ensures stable policy updates by enforcing a strict constraint on the policy change, it relies on computationally expensive second-order derivatives and conjugate gradient methods, making it challenging to implement and scale. Moreover, the strict constraints in TRPO can sometimes overly limit policy updates, leading to slower convergence. PPO addresses these issues by using a simple clipped objective function that allows the policy to update in a stable and controlled manner, avoiding forgetting previous policies with each update, thus improving training efficiency and reducing the risk of policy collapse. This makes PPO a popular choice for a wide range of reinforcement learning tasks.

    a. Probability Ratio

    One of the key components of PPO is the probability ratio, which compares the probability of taking an action in the current policy πθ to that of the old policy πθold:
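Formally, the ratio at timestep t is:

r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}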

    This ratio provides a measure of how much the policy has changed between updates. By monitoring this ratio, PPO ensures that updates are not too drastic, which helps prevent instability in the learning process.

    b. Clipping Function

    Clipping is preferred over adjusting the learning rate in Proximal Policy Optimization (PPO) because it directly limits the magnitude of policy updates, preventing excessive changes that could destabilize the learning process. While the learning rate uniformly scales the size of updates, clipping ensures that updates stay close to the previous policy, thereby enhancing stability and reducing erratic behavior.

    The main advantage of clipping is that it allows for better control over updates, ensuring more stable progress. However, a potential drawback is that it may slow down learning by limiting the exploration of significantly different strategies. Nonetheless, clipping is favored in PPO and other algorithms when stability is essential.

    To avoid excessive changes to the policy, PPO uses a clipping function that modifies the objective function to restrict the size of policy updates. This is crucial because large updates in reinforcement learning can lead to erratic behavior. The modified objective with clipping is:
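Using the probability ratio r_t(θ) and the advantage estimate A_t, the clipped surrogate objective introduced in the PPO paper is:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]

where ε is a small hyperparameter (typically around 0.1 to 0.2) controlling how far the ratio may move from 1.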

    The clipping function constrains the probability ratio within a specific range, preventing updates that would deviate too far from the previous policy. This helps avoid sudden, large changes that could destabilize the learning process.

    c. Advantage Estimation with GAE

    In RL, estimating the advantage is important because it helps the agent determine which actions are better than others in each state. However, there is a trade-off: using only immediate rewards (or very short horizons) can introduce high variance in advantage estimates, while using longer horizons can introduce bias.

    Generalized Advantage Estimation (GAE) strikes a balance between these two by using a weighted average of n-step returns and value estimates, making it less sensitive to noise and improving learning stability.

Why use GAE?
– Stability: GAE helps reduce variance by considering multiple steps so the agent does not react to noise in the rewards or temporary fluctuations in the environment.
– Efficiency: GAE strikes a good balance between bias and variance, making learning more efficient by not requiring overly long sequences of rewards while still maintaining reliable estimates.
– Better Action Comparison: By considering not just the immediate reward but a broader horizon of rewards, the agent can better compare actions over time and make more informed decisions.

    The advantage function At is used to assess how good an action was relative to the expected behavior under the current policy. To reduce variance and ensure more reliable estimates, PPO uses Generalized Advantage Estimation (GAE). This method smooths out the advantages over time while controlling for bias:
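In its usual form, GAE is an exponentially weighted sum of temporal-difference errors, with λ ∈ [0, 1] trading off bias against variance:

\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)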

    This technique provides a more stable and accurate measure of the advantage, which improves the agent’s ability to make better decisions.
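As a minimal sketch of how this is computed in practice for a single trajectory segment (episode-termination masking, which a real implementation would include, is omitted here):

import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: [r_0, ..., r_(T-1)]; values: [V(s_0), ..., V(s_T)] (one bootstrap value at the end)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                          # (gamma * lambda)-discounted sum of deltas
        advantages[t] = gae
    return advantages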

    d. Entropy to Encourage Exploration

    PPO incorporates an entropy term in the objective function to encourage the agent to explore more of the environment rather than prematurely converging to a suboptimal solution. The entropy term increases the uncertainty in the agent’s decision-making, which prevents overfitting to a specific strategy:
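One common form of the combined objective simply adds the entropy bonus, weighted by a coefficient β, to the clipped surrogate (implementations vary in the exact set of terms they include):

L(\theta) = \mathbb{E}_t\left[L^{\mathrm{CLIP}}_t(\theta) + \beta\, H\!\left(\pi_{\theta}(\cdot \mid s_t)\right)\right]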

    Where H(πθ) represents the entropy of the policy. By adding this term, PPO ensures that the agent does not converge too quickly and is encouraged to continue exploring different actions and strategies, improving overall learning efficiency.

    Conclusion

    The mathematical underpinnings of PPO demonstrate how this algorithm achieves stable and efficient learning. With concepts like the probability ratio, clipping, advantage estimation, and entropy, PPO offers a powerful balance between exploration and exploitation. These features make it a robust choice for both researchers and practitioners working in complex environments. The simplicity of PPO, combined with its efficiency and effectiveness, makes it a popular and valuable algorithm in reinforcement learning.

    Reference

    This article was partially translated from French using DeepL.



  • Optimizing costs of generative AI applications on AWS


    Vinnie Saini

    Optimizing costs of generative AI applications on AWS is critical for realizing the full potential of this transformative technology. The post outlines key cost optimization pillars, including model selection and customization, token usage, inference pricing plans, and vector database considerations.


  • How Apple’s ‘Move to iOS’ app has vaulted into Google Play’s top 40 app list


    One of Apple’s apps for Android has popped up on the Google Play’s store top downloaded apps over Christmas. Here’s why the “Move to iOS” app has launched itself into the top 40, and what it does.

    Two smartphones display instructions for transferring data from Android to iOS using the 'Move to iOS' app, featuring a blue 'Continue' button.
    Image Credit: Apple

    Now that the holidays are wrapping up, first-time and returning iPhone users are unboxing their new iPhones and making the jump from Android to iOS.

    Every year, iPhone sales see a significant boost as customers rush to buy Apple’s flagship smartphone in time to give as a gift. And, every year, Apple sees a slew of new and returning users migrate from Android to iPhone. Enough that the company created a dedicated Android app in 2015 to help streamline the process.



  • Is Incogni legit? Find out how to protect your privacy online and avoid scams


    It’s natural to be suspicious of a company offering to erase your data online, especially if most other companies exist to profit from your data. Learn why Incogni is trustworthy.

    Text 'Erase Your Data with Incogni' on a dark background with colorful dripping lines.
    Is Incogni legit? Learn why you can trust it with fighting data brokers. Image source: Incogni

    Chances are you’ve realized that all of your data is available and for sale online. Data brokers buy and sell user data as a commodity — and whether it’s used for ad targeting, stalking, or identity theft isn’t a consideration.

    That’s why it’s important to reduce your digital footprint, thus reducing the data available to gather and sell. However, that’s only part of the solution for increased data privacy.



  • 7 best movie scenes of 2024, ranked

    Dan Girolamo

    What were the top movie moments of 2024? Here are Digital Trends’ rankings for the seven best movie scenes of 2024.


  • How to get a free Santa Snoop Dogg skin in Fortnite

    Rishabh Sabarwal

    Winterfest has arrived and it brings a free Santa Snoop Dogg skin for players to claim. Here’s how to get it in Fortnite Chapter 6.


  • Want more gifts? Snag Ghostrunner II for free today only

    Patrick Hearn

    If you’re itching for a wicked-fast game set in the neon-soaked rain of a cyberpunk city, Ghostrunner II is free today on the Epic Games Store.
