Category: AI

  • Reinforcement Learning 101: Q-Learning

    Cristian Leo

    Decoding the Math behind Q-Learning, Action-Value Functions, and Bellman Equations, and building them from scratch in Python.

Image generated by DALL·E

    In the previous article, we dipped our toes into the world of reinforcement learning (RL), covering the basics like how agents learn from their surroundings, focusing on a simple setup called GridWorld. We went over the essentials — actions, states, rewards, and how to get around in this environment. If you’re new to this or need a quick recap, it might be a good idea to check out that piece again to get a firm grip on the basics before diving in deeper.

    Reinforcement Learning 101: Building a RL Agent

Today, we’re ready to take things up a notch. We will explore more complex aspects of RL, moving from simple setups to dynamic, ever-changing environments and more sophisticated ways for our agents to navigate through them. We’ll dive into the concept of the Markov Decision Process, which is very important for understanding how RL works at a deeper level. Plus, we’ll take a closer look at Q-learning, a key algorithm in RL that shows how agents can learn to make smart decisions in places like GridWorld, even when things are constantly changing.

    Index
    · 1: Beyond the Basics
    1.1: Dynamic Environments

    · 2: Markov Decision Process
    2.1: Understanding MDP
    2.2: The Math Behind MDP
    2.3: The Math Behind Bellman Equations

    · 3: Deep Dive into Q-Learning
    3.1: Fundamentals of Q-Learning
    3.2: The Math Behind Q-Learning

    · 4: Q-Learning From Scratch
    4.1: The GridWorld Environment
    4.2: The Q-Learning Class

    · 5: Next Steps and Future Directions
    5.1: Current Problems and Limitations
    5.2: Next Steps

    · 6: Conclusion

    1: Beyond the Basics

    1.1: Dynamic Environments

    When we first started exploring reinforcement learning (RL), we looked at simple, unchanging worlds. But as we move to dynamic environments, things get a lot more interesting. Unlike static setups where everything stays the same, dynamic environments are all about change. Obstacles move, goals shift, and rewards vary, making these settings much closer to the real world’s unpredictability.

    What Makes Dynamic Environments Special?
    Dynamic environments are key for teaching agents to adapt because they mimic the constant changes we face daily. Here, agents need to do more than just find the quickest route to a goal; they have to adjust their strategies as obstacles move, goals relocate, and rewards increase or decrease. This continuous learning and adapting are what could lead to true artificial intelligence.

Let’s move back to the environment we created in the last article: GridWorld, a 5×5 board with obstacles inside it. In this article, we’ll add some complexity to it by making the obstacles shuffle randomly.

    The Impact of Dynamic Environments on RL Agents
    Dynamic environments train RL agents to be more robust and intelligent. Agents learn to adjust their strategies on the fly, a skill critical for navigating the real world where change is the only constant.

    Facing a constantly evolving set of challenges, agents must make more nuanced decisions, balancing the pursuit of immediate rewards against the potential for future gains. Moreover, agents trained in dynamic environments are better equipped to generalize their learning to new, unseen situations, a key indicator of intelligent behavior.

    2: Markov Decision Process

    2.1: Understanding MDP

Before we dive into Q-Learning, let’s introduce the Markov Decision Process, or MDP for short. Think of MDP as the ABC of reinforcement learning. It offers a neat framework for understanding how an agent decides and learns from its surroundings. Picture an MDP like a board game: each square is a possible situation (state) the agent could find itself in, the moves it can make are the actions, and the points it racks up after each move are the rewards. The main aim is to collect as many points as possible.

    Differing from the classic RL framework we introduced in the previous article, which focused on the concepts of states, actions, and rewards in a broad sense, MDP adds structure to these concepts by introducing transition probabilities and the optimization of policies. While the classic framework sets the stage for understanding reinforcement learning, MDP dives deeper, offering a mathematical foundation that accounts for the probabilities of moving from one state to another and optimizing the decision-making process over time. This detailed approach helps bridge the gap between theoretical learning and practical application, especially in environments where outcomes are partly uncertain and partly under the agent’s control.

    Transition Probabilities
    Ideally, we’d know exactly what happens next after an action. But life, much like MDP, is full of uncertainties. Transition probabilities are the rules that predict what comes next. If our game character jumps, will they land safely or fall? If the thermostat is cranked up, will the room get to the desired temperature?

    Now imagine a maze game, where the agent aims to find the exit. Here, states are its spots in the maze, actions are which way it moves, and rewards come from exiting the maze with fewer moves.

    MDP frames this scenario in a way that helps an RL agent figure out the best moves in different states to max out rewards. By playing this “game” repeatedly, the agent learns which actions work best in each state to score the highest, despite the uncertainties.

    2.2: The Math Behind MDP

    To get what the Markov Decision Process is about in reinforcement learning, it’s key to dive into its math. MDP gives us a solid setup for figuring out how to make decisions when things aren’t totally predictable and there’s some room for choice. Let’s break down the main math bits and pieces that paint the full picture of MDP.

    Core Components of MDP
    MDP is characterized by a tuple (S, A, P, R, γ), where:

    • S is a set of states,
    • A is a set of actions,
    • P is the state transition probability matrix,
    • R is the reward function, and
    • γ is the discount factor.

    While we covered the math behind states, actions, and the discount factor in the previous article, now we’ll introduce the math behind the state transition probability, and the reward function.

    State Transition Probabilities
    The state transition probability P(s′ ∣ s, a) defines the probability of transitioning from state s to state s′ after taking action a. This is a core element of the MDP that captures the dynamics of the environment. Mathematically, it’s expressed as:

P(s′ ∣ s, a) = Pr(St+1 = s′ ∣ St = s, At = a)

    Here:

    • s: The current state of the agent before taking the action.
    • a: The action taken by the agent in state s.
    • s′: The subsequent state the agent finds itself in after action a is taken.
    • P(s′ ∣ s, a): The probability that action a in state s will lead to state s′.
    • Pr denotes probability; St represents the state at time t.
    • St+1 is the state at time t+1, after the action At is taken at time t.

    This formula captures the essence of the stochastic nature of the environment. It acknowledges that the same action taken in the same state might not always lead to the same outcome due to the inherent uncertainties in the environment.

    Consider a simple grid world where an agent can move up, down, left, or right. If the agent tries to move right, there might be a 90% chance it successfully moves right (s′=right), a 5% chance it slips and moves up instead (s′=up), and a 5% chance it slips and moves down (s′=down). There’s no probability of moving left since it’s the opposite direction of the intended action. Hence, for the action a=right from state s, the state transition probabilities might look like this:

    • P(right ∣ s, right) = 0.9
    • P(up ∣ s, right) = 0.05
    • P(down ∣ s, right) = 0.05
    • P(left ∣ s, right) = 0
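
    To make this concrete, here is a tiny Python sketch (illustrative only, not part of the article’s GridWorld code) that stores these probabilities in a dictionary and samples the next state according to P(s′ ∣ s, a):

    import numpy as np

    # Hypothetical transition probabilities for taking action "right" in some state s
    transition_probs = {
        "right": 0.90,  # intended move succeeds
        "up":    0.05,  # slips upward
        "down":  0.05,  # slips downward
        "left":  0.00,  # never moves opposite to the intended direction
    }

    def sample_next_state(probs, rng=None):
        """Sample a next state s' according to P(s' | s, a)."""
        rng = rng if rng is not None else np.random.default_rng()
        states = list(probs)
        return rng.choice(states, p=[probs[s] for s in states])

    print(sample_next_state(transition_probs))  # "right" roughly 90% of the time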

    Understanding and calculating these probabilities are fundamental for the agent to make informed decisions. By anticipating the likelihood of each possible outcome, the agent can evaluate the potential rewards and risks associated with different actions, guiding it towards decisions that maximize expected returns over time.

    In practice, while exact state transition probabilities might not always be known or directly computable, various RL algorithms strive to estimate or learn these dynamics to achieve optimal decision-making. This learning process lies at the core of an agent’s ability to navigate and interact with complex environments effectively.

    Reward Function
    The reward function R(s, a, s′) specifies the immediate reward received after transitioning from state s to state s′ as a result of taking action a. It can be defined in various ways, but a common form is:

R(s, a, s′) = E[Rt+1 ∣ St = s, At = a, St+1 = s′]

    Here:

    • Rt+1: The reward received at the next time step after taking the action, which can vary depending on the stochastic elements of the environment.
    • St = s: The current state at time t.
    • At = a: The action taken by the agent in state s at time t.
    • St+1 = s′: The state at the next time step t+1, after action a has been taken.
    • E[Rt+1 ∣ St = s, At = a, St+1 = s′]: The expected reward after taking action a in state s and ending up in state s′. The expectation E is taken over all possible outcomes that could result from the action, considering the probabilistic nature of the environment.

    In essence, this function calculates the average or expected reward that the agent anticipates receiving for making a particular move. It takes into account the uncertain nature of the environment, as the same action in the same state may not always lead to the same next state or reward because of the probabilistic state transitions.

    For example, if an agent is in a state representing its position in a grid, and it takes an action to move to another position, the reward function will calculate the expected reward of that move. If moving to that new position means reaching a goal, the reward might be high. If it means hitting an obstacle, the reward might be low or even negative. The reward function encapsulates the goals and rules of the environment, incentivizing the agent to take actions that will maximize its cumulative reward over time.
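
    As a minimal sketch (with made-up coordinates, but the same reward values the GridWorld code below uses), a simple deterministic reward function for a grid could look like this:

    def reward(state, action, next_state, goal, obstacles):
        """A simple R(s, a, s') for a grid: +1 for reaching the goal,
        -1 for stepping onto an obstacle, and a small step penalty otherwise."""
        if next_state == goal:
            return 1.0
        if next_state in obstacles:
            return -1.0
        return -0.01  # small penalty that encourages shorter paths

    print(reward((3, 4), "right", (4, 4), goal=(4, 4), obstacles={(2, 2)}))  # 1.0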

    Policies
A policy π is a strategy that the agent follows, where π(a ∣ s) defines the probability of taking action a in state s. A policy can be deterministic, where the action is explicitly defined for each state, or stochastic, where actions are chosen according to a probability distribution:

π(a ∣ s) = Pr(At = a ∣ St = s)

    • π(a ∣ s): The probability that the agent takes action a given that it is in state s.
    • Pr(At = a ∣ St = s): The conditional probability that action a is taken at time t, given that the state at time t is s.

    Let’s consider a simple example of an autonomous taxi navigating in a city. Here the states are the different intersections within a city grid, and the actions are the possible maneuvers at each intersection, like ‘turn left’, ‘go straight’, ‘turn right’, or ‘pick up a passenger’.

    The policy π might dictate that at a certain intersection (state), the taxi has the following probabilities for each action:

    • π(’turn left’∣intersection) = 0.1
    • π(’go straight’∣intersection) = 0.7
    • π(’turn right’∣intersection) = 0.1
    • π(’pick up passenger’∣intersection) = 0.1

    In this example, the policy is stochastic because there are probabilities associated with each action rather than a single certain outcome. The taxi is most likely to go straight but has a small chance of taking other actions, which may be due to traffic conditions, passenger requests, or other variables.

    The policy function guides the agent in selecting actions that it believes will maximize the expected return or reward over time, based on its current knowledge or strategy. Over time, as the agent learns, the policy may be updated to reflect new strategies that yield better results, making the agent’s behavior more sophisticated and better at achieving its goals.
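
    A stochastic policy like the taxi example above can be written and sampled in a few lines of Python (a sketch; the action labels and probabilities are the illustrative ones from the list):

    import numpy as np

    # pi(a | s) for one particular intersection (state)
    policy = {
        "turn left": 0.1,
        "go straight": 0.7,
        "turn right": 0.1,
        "pick up passenger": 0.1,
    }

    rng = np.random.default_rng()
    actions = list(policy)
    action = rng.choice(actions, p=[policy[a] for a in actions])  # sample from pi(a | s)
    print(action)  # "go straight" most of the time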

    Value Functions
Once we have our states, actions, and policies defined, we can ask ourselves the following question:

    What rewards can I expect in the long run if I start here and follow my game plan?

The answer is in the value function Vπ(s), which gives the expected return when starting in state s and following policy π thereafter:

Vπ(s) = Eπ[Gt ∣ St = s] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 ∣ St = s ]

    Where:

    • Vπ(s): The value of state s under policy π.
    • Gt: The total discounted return from time t onwards.
    • Eπ[Gt ∣ St = s]: The expected return starting from state s and following policy π.
    • γ: The discount factor between 0 and 1, which determines the present value of future rewards, reflecting that immediate rewards are more certain than distant rewards.
    • Rt+k+1: The reward received at time t+k+1, that is, k steps in the future.
    • Σ_{k=0}^{∞} γ^k Rt+k+1: The sum of the discounted rewards from time t onward.

    Imagine a game where you have a grid with different squares, and each square is a state that has different points (rewards). You have a policy π that tells you the probability of moving to other squares from your current square. Your goal is to collect as many points as possible.

For a particular square (state s), the value function Vπ(s) would be the expected total points you could collect from that square, discounted by how far in the future you receive them, following your policy π for moving around the grid. If your policy is to always move to the square with the highest immediate points, then Vπ(s) would reflect the sum of points you expect to collect, starting from s and moving to other squares according to π, with the understanding that points available further in the future are worth slightly less than points available right now (due to the discount factor γ).

    In this way, the value function helps to quantify the long-term desirability of states given a particular policy, and it plays a key role in the agent’s learning process to improve its policy.
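
    The quantity inside the expectation, the discounted return Gt, is easy to compute once you have a sequence of rewards. The sketch below (with made-up rewards) sums γ^k Rt+k+1; averaging such returns over many episodes that start in s gives a simple Monte Carlo estimate of Vπ(s):

    def discounted_return(rewards, gamma=0.95):
        """Compute G_t = sum_k gamma^k * R_{t+k+1} for a list of future rewards."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # Rewards collected after leaving state s while following policy pi (made-up numbers)
    print(discounted_return([-0.01, -0.01, -0.01, 1.0]))  # ~0.83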

    Action-Value Function
    This function goes a step further, estimating the expected return of taking a specific action in a specific state and then following the policy. It’s like saying:

    If I make this move now and stick to my strategy, what rewards am I likely to see?

While the value function V(s) is concerned with the value of states under a policy without specifying an initial action, the action-value function Q(s, a) extends this concept to evaluate the value of taking a particular action in a state before continuing with the policy.

The action-value function Qπ(s, a) represents the expected return of taking action a in state s and following policy π thereafter:

Qπ(s, a) = Eπ[Gt ∣ St = s, At = a] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 ∣ St = s, At = a ]

    • Qπ(s, a): The value of taking action a in state s under policy π.
    • Gt: The total discounted return from time t onward.
    • Eπ[Gt ∣ St = s, At = a]: The expected return after taking action a in state s and then following policy π.
    • γ: The discount factor, which determines the present value of future rewards.
    • Rt+k+1: The reward received k time steps in the future, after action a is taken at time t.
    • Σ_{k=0}^{∞} γ^k Rt+k+1: The sum of the discounted rewards from time t onward.

    The action-value function tells us what the expected return is if we start in state s, take action a, and then follow policy π after that. It takes into account not only the immediate reward received for taking action a but also all the future rewards that follow from that point on, discounted back to the present time.

    Let’s say we have a robot vacuum cleaner with a simple task: clean a room and return to its charging dock. The states in this scenario could represent the vacuum’s location within the room, and the actions might include ‘move forward’, ‘turn left’, ‘turn right’, or ‘return to dock’.

The action-value function Qπ(s, a) helps the vacuum determine the value of each action in each part of the room. For instance:

    • Qπ(middle of the room, ’move forward’) would represent the expected total reward the vacuum would get if it moves forward from the middle of the room and continues cleaning following its policy π.
    • Qπ(near the dock, ’return to dock’) would represent the expected total reward for heading back to the charging dock to recharge.

    The action-value function will guide the vacuum to make decisions that maximize its total expected rewards, such as cleaning as much as possible before needing to recharge.

    In reinforcement learning, the action-value function is central to many algorithms, as it helps to evaluate the potential of different actions and informs the agent on how to update its policy to improve its performance over time.
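
    The two functions are tied together: the value of a state is the policy-weighted average of its action values, Vπ(s) = Σ_a π(a ∣ s) Qπ(s, a). A one-liner with hypothetical numbers for the vacuum’s four actions makes the relationship concrete:

    # Hypothetical pi(a | s) and Q(s, a) for the four vacuum actions in one state
    pi = {"forward": 0.7, "left": 0.1, "right": 0.1, "dock": 0.1}
    q = {"forward": 2.0, "left": 0.5, "right": 0.4, "dock": 1.0}

    v = sum(pi[a] * q[a] for a in pi)  # V(s) = sum_a pi(a | s) * Q(s, a)
    print(v)  # 1.59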

    2.3: The Math Behind Bellman Equations

    In the world of Markov Decision Processes, the Bellman equations are fundamental. They act like a map, helping us navigate through the complex territory of decision-making to find the best strategies or policies. The beauty of these equations is how they simplify big challenges — like figuring out the best move in a game — into more manageable pieces.

    They lay down the groundwork for what an optimal policy looks like — the strategy that maximizes rewards over time. They’re especially crucial in algorithms like Q-learning, where the agent learns the best actions through trial and error, adapting even when faced with unexpected situations.

Bellman Equation for Vπ(s)
    This equation computes the expected return (total future rewards) of being in state s under a policy π. It sums up all the rewards an agent can expect to receive, starting from state s, and taking into account the likelihood of each subsequent state-action pair under the policy π. Essentially, it answers, “If I follow this policy, how good is it to be in this state?”

Vπ(s) = Σ_a π(a ∣ s) Σ_{s′} P(s′ ∣ s, a) [ R(s, a, s′) + γ Vπ(s′) ]

    • π(a ∣ s) is the probability of taking action a in state s under policy π.
    • P(s′ ∣ s, a) is the probability of transitioning to state s′ from state s after taking action a.
    • R(s, a, s′) is the reward received after transitioning from s to s′ due to action a.
    • γ is the discount factor, which values future rewards less than immediate rewards (0 ≤ γ < 1).
    • Vπ(s′) is the value of the subsequent state s′.

    This equation calculates the expected value of a state s by considering all possible actions a, the likelihood of transitioning to a new state s′, the immediate reward R(s, a, s′), plus the discounted value of the subsequent state s′. It encapsulates the essence of planning under uncertainty, emphasizing the trade-offs between immediate rewards and future gains.
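
    To see the equation at work, here is a compact sketch of iterative policy evaluation on a tiny, made-up two-state MDP (the transition table, rewards, and policy are illustrative, not the GridWorld used later): each sweep replaces V(s) with the right-hand side of the Bellman equation until the values settle.

    # States 0 and 1 (state 1 is terminal). transitions[s][a] = [(probability, next_state, reward), ...]
    transitions = {
        0: {"a": [(0.8, 1, 1.0), (0.2, 0, 0.0)],
            "b": [(1.0, 0, -0.1)]},
        1: {"a": [(1.0, 1, 0.0)],
            "b": [(1.0, 1, 0.0)]},  # terminal state: no further reward
    }
    policy = {0: {"a": 0.5, "b": 0.5}, 1: {"a": 1.0, "b": 0.0}}
    gamma = 0.9

    V = {s: 0.0 for s in transitions}
    for _ in range(100):  # enough sweeps for this toy MDP to converge
        for s in transitions:
            V[s] = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in transitions[s].items()
            )
    print(V)  # state 1 stays at 0.0; state 0 converges to roughly 0.76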

Bellman Equation for Qπ(s, a)
    This equation goes a step further by evaluating the expected return of taking a specific action a in state s, and then following policy π afterward. It provides a detailed look at the outcomes of specific actions, giving insights like, “If I take this action in this state and then stick to my policy, what rewards can I expect?”

Qπ(s, a) = Σ_{s′} P(s′ ∣ s, a) [ R(s, a, s′) + γ Σ_{a′} π(a′ ∣ s′) Qπ(s′, a′) ]

    • P(s′ ∣ s, a) and R(s, a, s′) are as defined above.
    • γ is the discount factor.
    • π(a′ ∣ s′) is the probability of taking action a′ in the next state s′ under policy π.
    • Qπ(s′, a′) is the value of taking action a′ in the subsequent state s′.

    This equation extends the concept of the state-value function by evaluating the expected utility of taking a specific action a in a specific state s. It accounts for the immediate reward and the discounted future rewards obtained by following policy π from the next state s′ onwards.
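
    The action-value version backs up in the same way. A self-contained sketch of one Bellman backup for a single state-action pair, with hypothetical transitions and an all-zero Q table, looks like this:

    # One Bellman backup for Q(s1, a): expected immediate reward plus the
    # discounted, policy-weighted value of the next state's actions.
    gamma = 0.9
    P = [(0.8, "s2", 1.0), (0.2, "s1", 0.0)]  # (probability, next state, reward) for (s1, a)
    pi_next = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 1.0, "b": 0.0}}
    Q = {(s, a): 0.0 for s in ("s1", "s2") for a in ("a", "b")}

    q_s1_a = sum(
        p * (r + gamma * sum(pi_next[s2][a2] * Q[(s2, a2)] for a2 in pi_next[s2]))
        for p, s2, r in P
    )
    print(q_s1_a)  # 0.8 when all Q values start at zero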

    Both equations highlight the relationship between the value of a state (or a state-action pair) and the values of subsequent states, providing a way to evaluate and improve policies.

    While value functions V(s) and action-value functions Q(s, a) represent the core objectives of learning in reinforcement learning — estimating the value of states and actions — the Bellman equations provide the recursive framework necessary for computing these values and enabling the agent to improve its decision-making over time.

    3: Deep Dive into Q-Learning

    Now that we’ve established all the foundational knowledge necessary for Q-Learning, let’s dive into action!

    3.1: Fundamentals of Q-Learning

Image generated by DALL·E

    Q-learning works through trial and error. In particular, the agent checks out its surroundings, sometimes randomly picking paths to discover new ways to go. After it makes a move, the agent sees what happens and what kind of reward it gets. A good move, like getting closer to the goal, earns a positive reward. A not-so-good move, like smacking into a wall, means a negative reward. Based on what it learns, the agent updates its guide, bumping up the scores for good moves and lowering them for the bad ones. As the agent keeps exploring and updating its guide, it gets sharper at picking the best moves.

    Let’s use the prior robot vacuum example. A Q-learning powered robot vacuum may firstly move around randomly. But as it keeps at it, it learns from the outcomes of its moves.

    For instance, if moving forward means it cleans up a lot of dust (earning a high reward), the robot notes that going forward in that spot is a great move. If turning right causes it to bump into a chair (getting a negative reward), it learns that turning right there isn’t the best option.

    The “cheat sheet” the robot builds is what Q-learning is all about. It’s a bunch of values (known as Q-values) that help guide the robot’s decisions. The higher the Q-value for a particular action in a specific situation, the better that action is. Over many cleaning rounds, the robot keeps refining its Q-values with every move it makes, constantly improving its cheat sheet until it nails down the best way to clean the room and zip back to its charger.

    3.2: The Math Behind Q-Learning

    Q-learning is a model-free reinforcement learning algorithm that seeks to find the best action to take given the current state. It’s about learning a function that will give us the best action to maximize the total future reward.

    The Q-learning Update Rule: A Mathematical Formula
    The mathematical heart of Q-learning lies in its update rule, which iteratively improves the Q-values that estimate the returns of taking certain actions from particular states. Here is the Q-learning update rule expressed in mathematical terms:

Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]

    Let’s break down the components of this formula:

    • Q(s, a): The current Q-value for a given state s and action a.
    • α: The learning rate, a factor that determines how much new information overrides old information. It is a number between 0 and 1.
    • R(s, a): The immediate reward received after taking action a in state s.
    • γ: The discount factor, also a number between 0 and 1, which discounts the value of future rewards compared to immediate rewards.
    • max_{a′} Q(s′, a′): The maximum predicted reward for the next state s′, achieved over all possible actions a′. This is the agent’s best guess at how valuable the next state will be.
    • Q(s, a): The old Q-value before the update.

    The essence of this rule is to adjust the Q-value for the state-action pair towards the sum of the immediate reward and the discounted maximum reward for the next state. The agent does this after every action it takes, slowly honing its Q-values towards the true values that reflect the best possible decisions.

    The Q-values are initialized arbitrarily, and then the agent interacts with its environment, making observations, and updating its Q-values according to the rule above. Over time, with enough exploration of the state-action space, the Q-values converge to the optimal values, which reflect the maximum expected return one can achieve from each state-action pair.

    This convergence means that the Q-values eventually provide the agent with a strategy for choosing actions that maximize the total expected reward for any given state. The Q-values essentially become a guide for the agent to follow, informing it of the value or quality of taking each action when in each state, hence the name “Q-learning”.
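
    Written out as code, the update rule is only a couple of lines. Here is a standalone sketch on a dictionary-based Q-table (the full NumPy implementation follows in the next section):

    def q_update(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.95):
        """One Q-learning update:
        Q(s, a) <- Q(s, a) + alpha * (R + gamma * max_a' Q(s', a') - Q(s, a))."""
        best_next = max(Q.get((next_state, a), 0.0) for a in actions)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        return Q

    # Example: one update after moving right (action 1) from (0, 0) to (0, 1) with a -0.01 step penalty
    Q = {}
    q_update(Q, state=(0, 0), action=1, reward=-0.01, next_state=(0, 1), actions=[0, 1, 2, 3])
    print(Q[((0, 0), 1)])  # -0.005 with alpha=0.5 and an all-zero table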

    Difference with Bellman Equation
Comparing the Bellman Equation for Qπ(s, a) with the Q-learning update rule, we see that Q-learning essentially applies the Bellman equation in a practical, iterative manner. The key differences are:

    • Learning from Experience: Q-learning uses the observed immediate reward R(s, a) and the estimated value of the next state max_{a′} Q(s′, a′) directly from experience, rather than relying on a complete model of the environment (i.e., the transition probabilities P(s′ ∣ s, a)).
    • Temporal Difference Learning: Q-learning’s update rule reflects a temporal difference learning approach, where the Q-values are updated based on the difference (error) between the estimated future rewards and the current Q-value.

    4: Q-Learning From Scratch

    To better understand every step of Q-Learning beyond its math, let’s build it from scratch. Take a look first at the whole code we will be using to create a reinforcement learning setup using a grid world environment and a Q-learning agent. The agent learns to navigate through the grid, avoiding obstacles and aiming for a goal.

    Don’t worry if the code doesn’t seem clear, as we will break it down and go through it in detail later.

    The code below is also accessible through this GitHub repo:

    Reinforcement-Learning/Turtorial 2 – Q-Learning/main.py at main · cristianleoo/Reinforcement-Learning

    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.animation as animation
    import pickle
    import os

    # GridWorld Environment
    class GridWorld:
        """GridWorld environment with obstacles and a goal.
        The agent starts at the top-left corner and has to reach the bottom-right corner.
        The agent receives a reward of -0.01 at each step, a reward of -1 when it steps on an obstacle, and a reward of 1 at the goal.

        Args:
            size (int): The size of the grid.
            num_obstacles (int): The number of obstacles in the grid.

        Attributes:
            size (int): The size of the grid.
            num_obstacles (int): The number of obstacles in the grid.
            obstacles (list): The list of obstacles in the grid.
            state_space (numpy.ndarray): The state space of the grid.
            state (tuple): The current state of the agent.
            goal (tuple): The goal state of the agent.

        Methods:
            generate_obstacles: Generate the obstacles in the grid.
            step: Take a step in the environment.
            reset: Reset the environment.
        """
        def __init__(self, size=5, num_obstacles=5):
            self.size = size
            self.num_obstacles = num_obstacles
            self.obstacles = []
            self.generate_obstacles()
            self.state_space = np.zeros((self.size, self.size))
            self.state = (0, 0)
            self.goal = (self.size-1, self.size-1)

        def generate_obstacles(self):
            """
            Generate the obstacles in the grid.
            The obstacles are generated randomly in the grid, except in the top-left and bottom-right corners.

            Args:
                None

            Returns:
                None
            """
            for _ in range(self.num_obstacles):
                while True:
                    obstacle = (np.random.randint(self.size), np.random.randint(self.size))
                    if obstacle not in self.obstacles and obstacle != (0, 0) and obstacle != (self.size-1, self.size-1):
                        self.obstacles.append(obstacle)
                        break

        def step(self, action):
            """
            Take a step in the environment.
            The agent takes a step in the environment based on the action it chooses.

            Args:
                action (int): The action the agent takes.
                    0: up
                    1: right
                    2: down
                    3: left

            Returns:
                state (tuple): The new state of the agent.
                reward (float): The reward the agent receives.
                done (bool): Whether the episode is done or not.
            """
            x, y = self.state
            if action == 0:  # up
                x = max(0, x-1)
            elif action == 1:  # right
                y = min(self.size-1, y+1)
            elif action == 2:  # down
                x = min(self.size-1, x+1)
            elif action == 3:  # left
                y = max(0, y-1)
            self.state = (x, y)
            if self.state in self.obstacles:
                return self.state, -1, True
            if self.state == self.goal:
                return self.state, 1, True
            return self.state, -0.01, False

        def reset(self):
            """
            Reset the environment.
            The agent is placed back at the top-left corner of the grid.

            Args:
                None

            Returns:
                state (tuple): The new state of the agent.
            """
            self.state = (0, 0)
            return self.state

    # Q-Learning
    class QLearning:
        """
        Q-Learning agent for the GridWorld environment.

        Args:
            env (GridWorld): The GridWorld environment.
            alpha (float): The learning rate.
            gamma (float): The discount factor.
            epsilon (float): The exploration rate.
            episodes (int): The number of episodes to train the agent.

        Attributes:
            env (GridWorld): The GridWorld environment.
            alpha (float): The learning rate.
            gamma (float): The discount factor.
            epsilon (float): The exploration rate.
            episodes (int): The number of episodes to train the agent.
            q_table (numpy.ndarray): The Q-table for the agent.

        Methods:
            choose_action: Choose an action for the agent to take.
            update_q_table: Update the Q-table based on the agent's experience.
            train: Train the agent in the environment.
            save_q_table: Save the Q-table to a file.
            load_q_table: Load the Q-table from a file.
        """
        def __init__(self, env, alpha=0.5, gamma=0.95, epsilon=0.1, episodes=10):
            self.env = env
            self.alpha = alpha
            self.gamma = gamma
            self.epsilon = epsilon
            self.episodes = episodes
            self.q_table = np.zeros((self.env.size, self.env.size, 4))

        def choose_action(self, state):
            """
            Choose an action for the agent to take.
            The agent chooses an action based on the epsilon-greedy policy.

            Args:
                state (tuple): The current state of the agent.

            Returns:
                action (int): The action the agent takes.
                    0: up
                    1: right
                    2: down
                    3: left
            """
            if np.random.uniform(0, 1) < self.epsilon:
                return np.random.choice([0, 1, 2, 3])  # exploration
            else:
                return np.argmax(self.q_table[state])  # exploitation

        def update_q_table(self, state, action, reward, new_state):
            """
            Update the Q-table based on the agent's experience.
            The Q-table is updated based on the Q-learning update rule.

            Args:
                state (tuple): The current state of the agent.
                action (int): The action the agent takes.
                reward (float): The reward the agent receives.
                new_state (tuple): The new state of the agent.

            Returns:
                None
            """
            self.q_table[state][action] = (1 - self.alpha) * self.q_table[state][action] + \
                self.alpha * (reward + self.gamma * np.max(self.q_table[new_state]))

        def train(self):
            """
            Train the agent in the environment.
            The agent is trained in the environment for a number of episodes.
            The agent's experience is stored and returned.

            Args:
                None

            Returns:
                rewards (list): The rewards the agent receives at each step.
                states (list): The states the agent visits at each step.
                starts (list): The start of each new episode.
                steps_per_episode (list): The number of steps the agent takes in each episode.
            """
            rewards = []
            states = []  # Store states at each step
            starts = []  # Store the start of each new episode
            steps_per_episode = []  # Store the number of steps per episode
            steps = 0  # Initialize the step counter outside the episode loop
            episode = 0
            while episode < self.episodes:
                state = self.env.reset()
                total_reward = 0
                done = False
                while not done:
                    action = self.choose_action(state)
                    new_state, reward, done = self.env.step(action)
                    self.update_q_table(state, action, reward, new_state)
                    state = new_state
                    total_reward += reward
                    states.append(state)  # Store state
                    steps += 1  # Increment the step counter
                    if done and state == self.env.goal:  # Check if the agent has reached the goal
                        starts.append(len(states))  # Store the start of the new episode
                        rewards.append(total_reward)
                        steps_per_episode.append(steps)  # Store the number of steps for this episode
                        steps = 0  # Reset the step counter
                        episode += 1
            return rewards, states, starts, steps_per_episode

        def save_q_table(self, filename):
            """
            Save the Q-table to a file.

            Args:
                filename (str): The name of the file to save the Q-table to.

            Returns:
                None
            """
            filename = os.path.join(os.path.dirname(__file__), filename)
            with open(filename, 'wb') as f:
                pickle.dump(self.q_table, f)

        def load_q_table(self, filename):
            """
            Load the Q-table from a file.

            Args:
                filename (str): The name of the file to load the Q-table from.

            Returns:
                None
            """
            filename = os.path.join(os.path.dirname(__file__), filename)
            with open(filename, 'rb') as f:
                self.q_table = pickle.load(f)


    # Initialize environment and agent
    for i in range(10):
        env = GridWorld(size=5, num_obstacles=5)
        agent = QLearning(env)

        # Load the Q-table if it exists
        if os.path.exists(os.path.join(os.path.dirname(__file__), 'q_table.pkl')):
            agent.load_q_table('q_table.pkl')

        # Train the agent and get rewards
        rewards, states, starts, steps_per_episode = agent.train()  # Get starts and steps_per_episode as well

        # Save the Q-table
        agent.save_q_table('q_table.pkl')

        # Visualize the agent moving in the grid
        fig, ax = plt.subplots()

        def update(i):
            """
            Update the grid with the agent's movement.

            Args:
                i (int): The current step.

            Returns:
                None
            """
            ax.clear()
            # Calculate the cumulative reward up to the current step
            cumulative_reward = sum(rewards[:i+1])
            # Find the current episode
            current_episode = next((j for j, start in enumerate(starts) if start > i), len(starts)) - 1
            # Calculate the number of steps since the start of the current episode
            if current_episode < 0:
                steps = i + 1
            else:
                steps = i - starts[current_episode] + 1
            ax.set_title(f"Iteration: {current_episode+1}, Total Reward: {cumulative_reward:.2f}, Steps: {steps}")
            grid = np.zeros((env.size, env.size))
            for obstacle in env.obstacles:
                grid[obstacle] = -1
            grid[env.goal] = 1
            grid[states[i]] = 0.5  # Use states[i] instead of env.state
            ax.imshow(grid, cmap='cool')

        ani = animation.FuncAnimation(fig, update, frames=range(len(states)), repeat=False)

        # After the animation
        print(f"Environment number {i+1}")
        for i, steps in enumerate(steps_per_episode, 1):
            print(f"Iteration {i}: {steps} steps")
        print(f"Total reward: {sum(rewards):.2f}")
        print()

        plt.show()

    That was a lot of code! Let’s break down this code into smaller, more understandable steps. Here’s what each part does:

    4.1: The GridWorld Environment

    This class represents a grid environment where an agent can move around, avoid obstacles, and reach a goal.

    Initialization (__init__ method)

    def __init__(self, size=5, num_obstacles=5):
        self.size = size
        self.num_obstacles = num_obstacles
        self.obstacles = []
        self.generate_obstacles()
        self.state_space = np.zeros((self.size, self.size))
        self.state = (0, 0)
        self.goal = (self.size-1, self.size-1)

When you create a new GridWorld, you specify the size of the grid and the number of obstacles. The grid is square, so size=5 means a 5×5 grid. The agent starts at the top-left corner (0, 0) and aims to reach the bottom-right corner (size-1, size-1). The obstacle locations are held in self.obstacles, which starts as an empty list and is filled by the generate_obstacles() method, called here to randomly place obstacles in the grid.

    Therefore, we could expect an environment like the following:

    Environment — Image by Author

In the environment above, the top-left block is the starting state, the bottom-right block is the goal, and the pink blocks in the middle are the obstacles. Note that the obstacles will vary every time you create an environment, as they are generated randomly.

    Generating Obstacles (generate_obstacles method)

    def generate_obstacles(self):
        for _ in range(self.num_obstacles):
            while True:
                obstacle = (np.random.randint(self.size), np.random.randint(self.size))
                if obstacle not in self.obstacles and obstacle != (0, 0) and obstacle != (self.size-1, self.size-1):
                    self.obstacles.append(obstacle)
                    break

    This method places num_obstacles randomly within the grid. It ensures that obstacles don’t overlap with the starting point or the goal.

It does this by looping until the specified number of obstacles (self.num_obstacles) have been placed. On each iteration, it randomly selects a position in the grid; if that position is not already an obstacle and is neither the start nor the goal, it’s added to the list of obstacles.

    Taking a Step (step method)

    def step(self, action):
        x, y = self.state
        if action == 0:  # up
            x = max(0, x-1)
        elif action == 1:  # right
            y = min(self.size-1, y+1)
        elif action == 2:  # down
            x = min(self.size-1, x+1)
        elif action == 3:  # left
            y = max(0, y-1)
        self.state = (x, y)
        if self.state in self.obstacles:
            return self.state, -1, True
        if self.state == self.goal:
            return self.state, 1, True
        return self.state, -0.01, False

    The step method moves the agent according to the action (0 for up, 1 for right, 2 for down, 3 for left) and updates its state. It also checks the new position to see if it’s an obstacle or a goal.

    It does that by taking the current state (x, y), which is the current location of the agent. Then, it changes x or y based on the action (0 for up, 1 for right, 2 for down, 3 for left), ensuring the agent doesn’t move outside the grid boundaries. It updates self.state to this new position. Then it checks if the new state is an obstacle or the goal and returns the corresponding reward and whether the episode is finished (done).

    Resetting the Environment (reset method)

    def reset(self):
        self.state = (0, 0)
        return self.state

    This function puts the agent back at the starting point. It’s used at the beginning of a new learning episode.

    It simply sets self.state back to (0, 0) and returns this as the new state.

    4.2: The Q-Learning Class

    This is a Python class that represents a Q-learning agent, which will learn how to navigate the GridWorld.

    Initialization (__init__ method)

    def __init__(self, env, alpha=0.5, gamma=0.95, epsilon=0.1, episodes=10):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.episodes = episodes
        self.q_table = np.zeros((self.env.size, self.env.size, 4))

When you create a QLearning agent, you provide it with the environment to learn from (stored in self.env), which is the GridWorld environment in our case; a learning rate alpha, which controls how much new information overrides the existing Q-values; a discount factor gamma, which determines the importance of future rewards; and an exploration rate epsilon, which controls the trade-off between exploration and exploitation.

We also set the number of episodes for training. The Q-table, which stores the agent’s knowledge, is a 3D NumPy array of zeros with dimensions (env.size, env.size, 4), representing the Q-values for each state-action pair; 4 is the number of possible actions the agent can take in every state.

    Choosing an Action (choose_action method)

    def choose_action(self, state):
        if np.random.uniform(0, 1) < self.epsilon:
            return np.random.choice([0, 1, 2, 3])  # exploration
        else:
            return np.argmax(self.q_table[state])  # exploitation

    The agent picks an action based on the epsilon-greedy policy. Most of the time, it chooses the best-known action (exploitation), but sometimes it randomly explores other actions.

    Here, epsilon is the probability a random action is chosen. Otherwise, the action with the highest Q-value for the current state is chosen (argmax over the Q-values).

In our example, we set epsilon to 0.1, which means the agent takes a random action 10% of the time: whenever np.random.uniform(0, 1) generates a number lower than 0.1, a random action is chosen. This prevents the agent from getting stuck on a suboptimal strategy and encourages it to keep exploring before settling on one.

    Updating the Q-Table (update_q_table method)

    def update_q_table(self, state, action, reward, new_state):
        self.q_table[state][action] = (1 - self.alpha) * self.q_table[state][action] + \
            self.alpha * (reward + self.gamma * np.max(self.q_table[new_state]))

    After the agent takes an action, it updates its Q-table with the new knowledge. It adjusts the value of the action based on the immediate reward and the discounted future rewards from the new state.

    It updates the Q-table using the Q-learning update rule. It modifies the value for the state-action pair in the Q-table (self.q_table[state][action]) based on the received reward and the estimated future rewards (using np.max(self.q_table[new_state]) for the future state).

    Training the Agent (train method)

    def train(self):
        rewards = []
        states = []  # Store states at each step
        starts = []  # Store the start of each new episode
        steps_per_episode = []  # Store the number of steps per episode
        steps = 0  # Initialize the step counter outside the episode loop
        episode = 0
        while episode < self.episodes:
            state = self.env.reset()
            total_reward = 0
            done = False
            while not done:
                action = self.choose_action(state)
                new_state, reward, done = self.env.step(action)
                self.update_q_table(state, action, reward, new_state)
                state = new_state
                total_reward += reward
                states.append(state)  # Store state
                steps += 1  # Increment the step counter
                if done and state == self.env.goal:  # Check if the agent has reached the goal
                    starts.append(len(states))  # Store the start of the new episode
                    rewards.append(total_reward)
                    steps_per_episode.append(steps)  # Store the number of steps for this episode
                    steps = 0  # Reset the step counter
                    episode += 1
        return rewards, states, starts, steps_per_episode

This function is pretty straightforward: it runs the agent through many episodes using a while loop. In each episode, it first resets the environment, placing the agent in the starting state (0, 0). Then it chooses actions, updates the Q-table, and keeps track of the total rewards and the number of steps taken.

    Saving and Loading the Q-Table (save_q_table and load_q_table methods)

    def save_q_table(self, filename):
        filename = os.path.join(os.path.dirname(__file__), filename)
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        filename = os.path.join(os.path.dirname(__file__), filename)
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

    These methods are used to save the learned Q-table to a file and load it back. They use the pickle module to serialize (pickle.dump) and deserialize (pickle.load) the Q-table, allowing the agent to resume learning without starting from scratch.

    Running the Simulation

    Finally, the script initializes the environment and the agent, optionally loads an existing Q-table, and then starts the training process. After training, it saves the updated Q-table. There’s also a visualization section that shows the agent moving through the grid, which helps you see what the agent has learned.

    Initialization

    Firstly, the environment and agent are initialized:

    env = GridWorld(size=5, num_obstacles=5)
    agent = QLearning(env)

    Here, a GridWorld of size 5×5 with 5 obstacles is created. Then, a QLearning agent is initialized using this environment.

    Loading and Saving the Q-table
    If there’s a Q-table file already saved (‘q_table.pkl’), it’s loaded, which allows the agent to continue learning from where it left off:

    if os.path.exists(os.path.join(os.path.dirname(__file__), 'q_table.pkl')):
        agent.load_q_table('q_table.pkl')

    After the agent is trained for the specified number of episodes, the updated Q-table is saved:

    agent.save_q_table('q_table.pkl')

    This ensures that the agent’s learning is not lost and can be used in future training sessions or actual navigation tasks.

    Training the Agent
    The agent is trained by calling the train method, which runs through the specified number of episodes, allowing the agent to explore the environment, update its Q-table, and track its progress:

    rewards, states, starts, steps_per_episode = agent.train()

    During training, the agent chooses actions, updates the Q-table, observes rewards, and keeps track of states visited. All of this information is used to adjust the agent’s policy (i.e., the Q-table) to improve its decision-making over time.

    Visualization

    After training, the code uses matplotlib to create an animation showing the agent’s journey through the grid. It visualizes how the agent moves, where the obstacles are, and the path to the goal:

    fig, ax = plt.subplots()

    def update(i):
        # Update the grid visualization based on the agent's current state
        ax.clear()
        # Calculate the cumulative reward up to the current step
        cumulative_reward = sum(rewards[:i+1])
        # Find the current episode
        current_episode = next((j for j, start in enumerate(starts) if start > i), len(starts)) - 1
        # Calculate the number of steps since the start of the current episode
        if current_episode < 0:
            steps = i + 1
        else:
            steps = i - starts[current_episode] + 1
        ax.set_title(f"Iteration: {current_episode+1}, Total Reward: {cumulative_reward:.2f}, Steps: {steps}")
        grid = np.zeros((env.size, env.size))
        for obstacle in env.obstacles:
            grid[obstacle] = -1
        grid[env.goal] = 1
        grid[states[i]] = 0.5  # Use states[i] instead of env.state
        ax.imshow(grid, cmap='cool')

    ani = animation.FuncAnimation(fig, update, frames=range(len(states)), repeat=False)
    plt.show()

    This visualization is not only a nice way to see what the agent has learned, but it also provides insight into the agent’s behavior and decision-making process.

    By running this simulation multiple times (as indicated by the loop for i in range(10):), the agent can have multiple learning sessions, which can potentially lead to improved performance as the Q-table gets refined with each iteration.

Now try this code out and check how many steps the agent needs to reach the goal in each iteration. Then try increasing the size of the environment and see how this affects performance, as in the sketch below.
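
    If you want to run that experiment systematically, a quick sketch like the following (assuming the GridWorld and QLearning classes above are in scope) trains a fresh agent for a few grid sizes and prints the average number of steps per successful episode; expect larger grids to take noticeably longer, since only goal-reaching episodes are counted:

    # Compare how long the agent takes to reach the goal as the grid grows
    for size in (5, 7, 10):
        env = GridWorld(size=size, num_obstacles=size)  # scale obstacles with the grid
        agent = QLearning(env, episodes=20)  # a fresh, untrained agent each time
        rewards, states, starts, steps_per_episode = agent.train()
        avg_steps = sum(steps_per_episode) / len(steps_per_episode)
        print(f"size={size}: average steps per successful episode = {avg_steps:.1f}")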

    5: Next Steps and Future Directions

    As we take a step back to evaluate our journey with Q-learning and the GridWorld setup, it’s important to appreciate our progress but also to note where we hit snags. Sure, we’ve got our agents moving around a basic environment, but there are a bunch of hurdles we still need to jump over to kick their skills up a notch.

    5.1: Current Problems and Limitations

    Limited Complexity
    Right now, GridWorld is pretty basic and doesn’t quite match up to the messy reality of the world around us, which is full of unpredictable twists and turns.

    Scalability Issues
When we try to make the environment bigger or more complex, our Q-table (our cheat sheet of sorts) grows with the number of states times the number of actions (a 100×100 grid with four actions already needs 40,000 entries), making Q-learning slow and memory-hungry.

    One-Size-Fits-All Rewards
We’re using a simple reward system: losing points for hitting obstacles and gaining points for reaching the goal. But we’re missing out on the nuances, like varying rewards for different actions that could steer the agent more subtly.

    Discrete Actions and States
    Our current Q-learning vibe works with clear-cut states and actions. But life’s not like that; it’s full of shades of grey, requiring more flexible approaches.

    Lack of Generalization
    Our agent learns specific moves for specific situations without getting the knack for winging it in scenarios it hasn’t seen before or applying what it knows to different but similar tasks.

    5.2: Next Steps

    Policy Gradient Methods
    Policy gradient methods represent a class of algorithms in reinforcement learning that optimize the policy directly. They are particularly well-suited for problems with:

    • High-dimensional or continuous action spaces.
    • The need for fine-grained control over the actions.
    • Complex environments where the agent must learn more abstract concepts.

    The next article will cover everything necessary to understand and implement policy gradient methods.

    We’ll start with the conceptual underpinnings of policy gradient methods, explaining how they differ from value-based approaches and their advantages.

    We’ll dive into algorithms like REINFORCE and Actor-Critic methods, exploring how they work and when to use them. We’ll discuss the exploration strategies used in policy gradient methods, which are crucial for effective learning in complex environments.

    A key challenge with policy gradients is high variance in the updates. We will look into techniques like baselines and advantage functions to tackle this issue.

    A More Complex Environment
To truly harness the power of policy gradient methods, we will introduce a more complex environment. It will have a continuous state and action space, presenting a more realistic and challenging learning scenario; multiple paths to success, requiring the agent to develop nuanced strategies; and the possibility of more dynamic elements, such as moving obstacles or changing goals.

Stay tuned as we prepare to embark on this exciting journey into the world of policy gradient methods, where we’ll empower our agents to tackle challenges of increasing complexity that are closer to real-world applications.

    6: Conclusion

    As we conclude this article, it’s clear that the journey through the fundamentals of reinforcement learning has set a robust stage for our next foray into the field. We’ve seen our agent start from scratch, learning to navigate the straightforward corridors of the GridWorld, and now it stands on the brink of stepping into a world that’s richer and more reflective of the complexities it must master.

It was a lot, but you made it to the end. Congrats! I hope you enjoyed this article. If so, consider leaving a clap or two and following me, as I will regularly post similar articles. Let me know what you think of the article and what you would like to see more of.


    Reinforcement Learning 101: Q-Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

    Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

    Sanjeeb Panda

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge to query databases. This generative AI task is called text-to-SQL, which generates SQL queries from natural language processing (NLP) and converts text into semantically correct SQL. The solution in this post aims to […]

    Originally appeared here:
    Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources


  • CentralBankRoBERTa: an LLM for Macroeconomics

    CentralBankRoBERTa: an LLM for Macroeconomics

    Moritz Pfeifer

    Harnessing AI to Classify Macroeconomic Sentiment and Addressed Agents

    Image by Author (with some help from DALL-E 3)

    This post was co-authored with Vincent P. Marohl and is based on the paper Pfeifer, M. and Marohl, V.P. (2023) “CentralBankRoBERTa: A Fine-Tuned Large Language Model for Central Bank Communications”, Journal of Finance and Data Science https://doi.org/10.1016/j.jfds.2023.100114.

    How do communications about economic policies affect economic outcomes? What is the central bank saying about small business, the housing sector or government finances? In this article we explore CentralBankRoBERTa, a state-of-the-art economic agent classifier that distinguishes five basic macroeconomic agents and binary sentiment classifier that identifies the emotional content of sentences in macroeconomic communications. We train our model on over 12,000 manually labeled sentences from communications of the U.S. Federal Reserve System, the European Central Bank and global members of the Bank of International Settlements.

    LLMs for Economics

    Advances in LLMs have made it much easier to fine-tune for specific applications. All that is needed to obtain state-of-the-art classification performance is extensive training data for the specific application domain. Thus far, no LLM could generate sentiment labels for macroeconomic topics. After all, what constitutes a ‘positive’ macroeconomic sentence?

    We have developed CentralBankRoBERTa. The model is based on the RoBERTa architecture and classifies sentences for economic sentiment. It also classifies who is most concerned. The model has initially been conceptualized for central bank communications, a subfield of economics that aims to quantify the economic impact of words.

    The advantage of central bank communications is that one has to think about what constitutes a positive economic signal to whom. For example, the sentence “wages are rising beyond expectations” may be labeled as positive for households, who receive wages, and negative for firms, who pay wages.

    CentralBankRoBERTa classifies sentences based on what is good for whom. We distinguish five different macroeconomic agents: households, firms, the financial sector, government and the central bank itself. The agent-signal dynamic allows the model to classify whether a sentence emits a positive or negative signal without further numeric context.

    Paying attention to context and audiences is key in text analysis, especially for complex subjects like economic policy. This is because the way a message is received can differ greatly based on the audience and situation. CentralBankRoBERTa highlights the importance of this by accurately identifying economic sentiments according to the specific audience and context.

    Applications

The wide-ranging responsibilities of the central bank make the model generally applicable. After all, it does not matter whether a central banker or a CEO expresses good or bad news for firms or other economic agents. This is also true for finance ministers, hedge fund managers, journalists and other economic players whose views on the economy contribute to shaping it.

    Next, we show how CentralBankRoBERTa can help analyze the effect of narratives on the economy by studying business and monetary policy messages. Any relevant text data can be used for this. Here, we use a dataset of U.S. public companies’ earnings call transcripts and SEC filings. We then clean this data with regex and label each sentence using CentralBankRoBERTa to obtain a sentiment score as described in more detail in the next section.
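
    The post doesn’t show the cleaning step itself; as a rough illustration of the kind of regex-based sentence splitting described here (a hypothetical pattern, not the authors’ code), it could look something like this:

    import re

    def split_sentences(text):
        """Naively split a filing or transcript into sentences: collapse whitespace,
        then split after ., ! or ? when followed by a capital letter (a crude heuristic)."""
        text = re.sub(r"\s+", " ", text).strip()
        return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

    sample = "Revenue grew 10% this quarter. We expect margins to compress next year."
    print(split_sentences(sample))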

The final dataset contains about 2,000 U.S.-based public firms, each with about 20 years of quarterly text data. To see how they track, we also label a text dataset of speeches from the Fed. We select only those Fed sentences that speak about firms, so that we do not pick up unrelated information.

    We find that state-level average firm sentiment closely tracks the business cycle. Regional Fed communications, as shown below in the case of Texas, also closely track the business cycle.

    Figure 1: The sentiment from Texas public firms (dark blue dashed line) closely tracks the business cycle (red line). The Dallas Fed’s firm-specific communications (turquoise dashed line) also show high co-movement.
    Figure 2: USA-wide firm sentiment (dark blue dashed line) shares a large part of its co-movement with the business cycle (red line). The FOMC communications (turquoise dashed line), restricted to sentences that speak about firms, also track closely.

    This descriptive analysis of firm sentiments using CentralBankRoBERTa provides a glimpse into the relationship between economic narratives and the market dynamics of businesses. Downturns in particular, such as the Great Recession of 2008 and the COVID-19 pandemic, are accurately captured by both business and FOMC sentiment.

    Our small example underscores the potential of text data to enrich economic models. Sentiment trends expressed in text can influence economic dynamics, yet they are notoriously difficult to capture. Tools like CentralBankRoBERTa may assist researchers and policymakers in filling the “gap between the study of narratives and their effects on economic events,” as Robert Shiller, the recipient of the 2013 Nobel Memorial Prize in Economics, wrote in his book Narrative Economics (2019).

    Shiller emphasizes how stories, or narratives, spread like viruses through society, directly influencing spending, saving, and investing decisions. Understanding the power of narratives adds a new dimension to economic analysis: beyond traditional economic indicators, attention to the prevailing stories and their emotional resonance can offer predictive insights into market movements and economic shifts. Integrating narrative analysis into economic models could therefore enhance our ability to anticipate and respond to future economic challenges, making it a vital tool for economists, policymakers, and investors alike.

    How to use

    CentralBankRoBERTa is easy to use. To interface with the Hugging Face pipeline for both classification models, first import the pipeline from the transformers package, then load the model using its name on Hugging Face, create an input sentence, and pass it to the classifier. If you want to classify an entire dataset, we have a sample script with additional code on GitHub. CentralBankRoBERTa works best at the sentence level, so we recommend parsing large texts into individual sentences. For example, in the minutes of the last Federal Open Market Committee (FOMC) meeting, we can find the following view:

    The staff provided an update on its assessment of the stability of the U.S. financial system and, on balance, characterized the system’s financial vulnerabilities as notable.

    Given this sentence, our agent classifier is 96.6% confident that it pertains to the “Financial Sector.” Similarly, the sentiment classifier assigns an 80.9% probability to the sentence being “negative.”

    Employing the sentiment classifier:

    from transformers import pipeline

    # Load the SentimentClassifier model
    sentiment_classifier = pipeline("text-classification", model="Moritz-Pfeifer/CentralBankRoBERTa-sentiment-classifier")

    # Choose your input
    input_sentence = "The early effects of our policy tightening are also becoming visible, especially in sectors like manufacturing and construction that are more sensitive to interest rate changes."

    # Perform sentiment analysis
    sentiment_result = sentiment_classifier(input_sentence)
    print("Sentiment:", sentiment_result[0]['label'])

    Employing the agent classifier:

    from transformers import pipeline

    # Load the AgentClassifier model
    agent_classifier = pipeline("text-classification", model="Moritz-Pfeifer/CentralBankRoBERTa-agent-classifier")

    # Choose your input
    input_sentence = "We used our liquidity tools to make funding available to banks that might need it."

    # Perform agent classification
    agent_result = agent_classifier(input_sentence)
    print("Agent Classification:", agent_result[0]['label'])

    Future Directions

    CentralBankRoBERTa is an LLM that makes it possible to label text for macroeconomic sentiment at an unprecedented level of granularity. It also represents the first economic agent classifier. The model’s broad training data allows for general macroeconomic applications, and it can be used for economic, financial, and policy research. We hope you feel inspired by the possibilities opened up by this model, and we want to leave you with some possible future directions of research enabled by this LLM:

    FOMC Press Conferences: Can we anticipate financial market movements caused by Fed communications using CentralBankRoBERTa? How about firms’ earnings calls?

    Newspaper Text: What is the press thinking about the economy? Is the news biased towards one economic group?

    Online Forums: Using CentralBankRoBERTa, can we forecast economic trends from online discussion forums?

    Audience classifier: Which politicians are most friendly towards which economic group?

    Further Resources

    • The publication of our model in the Journal of Finance and Data Science:

    Pfeifer, M. and Marohl, V.P. (2023) “CentralBankRoBERTa: A Fine-Tuned Large Language Model for Central Bank Communications”, Journal of Finance and Data Science https://doi.org/10.1016/j.jfds.2023.100114

    • A seminar in which we explain the details of our model:

    • The model pipelines on Hugging Face:



  • 50 First Dates with MemGPT


    Ethan Knox

    Sometimes a good story (and some slapstick comedy) goes a long way towards helping us understand complex problems.

    OpenAI. (2024). ChatGPT [Large language model]. /g/g-2fkFE8rbu-dall-e

    Note: originally published on pirate.baby, my little corner of the internet. Republishing here for reach.

    preamble

    The drive home from the movie theater was not going well. My then-girlfriend sat arms crossed in the passenger seat, the red glow from the traffic light matching her mood. “You can’t just let it be romantic. Why do you have to ruin everything!?!” We had just seen 50 First Dates, a classic Sandler/Barrymore romance about a woman whose brain injury prevents her from forming long-term memories. In the movie, Lucy (Barrymore) constructs her own “external memory” via her diary; a clever plot device, but one that required suspension of disbelief that I just could not abide. I had done shorthand math while the credits were rolling: If Lucy’s average diary entry took 20 minutes to write (that is less than most people, and most people aren’t trying to compensate for brain damage), each entry would take roughly half that time — so about 10 minutes — to read. Reading a week’s entries would take more than an hour. By the 6 month mark, the daily catch-up on her past would require more hours of reading than are in a day. Romantic? Yes. Realistic? No. And so we argued. I believe the conversation went something like this:

    Her: “But what if every Sunday she wrote a cheat sheet for the past week, and then she only read those? That would take less time.”
    Me: “Even a weekly summary would become unreadable in less than a year.”
    Her: “OK, then what if she summarized those cheat sheets?? She could keep making the history smaller and smaller.”
    Me: “Yeah but eventually she’d lose too much detail and the summaries would be useless.”
    Her: “But she’d still have her daily journals for when she needs those details!”
    Me: “How would she ever search that? We’re back where we started.”

    Twenty years later, the “Lucy problem” is a perfect lens to help us understand one of the most important challenges in designing a Large Language Model Agent Framework. The solution proposed by researchers at UC Berkeley is remarkably innovative and offers exciting potential — and it is a solution that bears significant resemblance to the “Lucy solution” I was arguing against during that car ride home. It looks like I owe someone a long-overdue apology.

    Lucy the language model: a near-perfect analogy

    Large Language Models are, in reality, just functions. You input at least one argument (text) and they output in kind. This output is the product of the model’s business logic, combined parameters, and internal arguments — one of those arguments being the training data used to develop the inference model. This training data serves as the model’s “memories”; without it the LLM would output very little value, similar to attempting a deep conversation with a newborn. The training data “memories” in a large language model are fixed at inference time, exactly like Lucy’s memories in the movie. She has developed experiences and gathered information up to a very specific point (in her case, the day of her accident); from that day forward, she interprets stimuli based on the exact state of her mind, and her memories, at that time. This is precisely how inference with a large language model operates — fixed to the moment the training was complete, and the resulting function was pickled.

    Each execution of the LLM function (here we will refer to this combined execution and response as a turn, borrowing from chat nomenclature) is exactly like one single day in the life of Lucy. With the model temperature turned down to 0 (deterministic), each turn with the same input will look exactly like Lucy’s early routine — repeating the same day over and over (and baking a lot of identical birthday cakes). As a pure inference endpoint, an LLM cannot form new “memories” any more than Lucy can.

    To compensate for this with an LLM, the natural next step is to prepend those new “memories” to the text passed to the LLM function, effectively augmenting the training data of the language model for the duration of the turn(1). However, language model context windows — the combined amount of text that can be input and output in a single turn — are limited in size. Again, this is exactly how Barrymore’s character experiences the world; her context window is one single day. Just as I argued so many years earlier that Lucy’s memories would eventually take longer to consume than there are hours in the day to retain them, the new knowledge that must be included in a turn for the language model to produce a useful output quickly outgrows the available context window.
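
    To make the constraint concrete, here is a toy sketch that prepends “memories” to a prompt under a fixed context budget, dropping the oldest ones when they no longer fit. Word counts stand in for real token counts, and build_prompt() is an invented helper, not part of any particular framework.

    # Toy sketch: prepend "memories" to a prompt under a fixed context budget.
    # Word counts approximate token counts; a real system would use the model's tokenizer.

    def build_prompt(memories, user_input, budget=20):
        kept = []
        used = len(user_input.split())
        # Walk from newest to oldest and stop once the budget is exhausted
        for memory in reversed(memories):
            cost = len(memory.split())
            if used + cost > budget:
                break  # older memories simply fall out of the window
            kept.append(memory)
            used += cost
        return "\n".join(reversed(kept)) + "\n\n" + user_input

    memories = [
        "Day 1: met Henry at the Hukilau Cafe.",
        "Day 2: built a waffle house, argued about syrup.",
        "Day 3: watched the walrus paint.",
    ]
    # With a 20-"token" budget, Day 1 no longer fits and is silently dropped
    print(build_prompt(memories, "What did we do yesterday?"))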

    The limits of prompt engineering

    The lion’s share of LLM engineering coverage has been devoted to prompt engineering: crafting the content we submit in a turn so that it produces the most desirable outcome. An entire ecosystem has rapidly developed around prompt design, from prompt engineering classes to prompt exchange marketplaces — all built on the idea that with the “perfect prompt” you can coax out the “perfect output.”

    Henry, Sandler’s character in 50 First Dates, may have been one of the earliest prompt engineers. Early in the film Henry falls in love with Lucy and agrees not to tell her about her injury, instead wooing her anew each day. His daily “prompts” to re-win her heart begin abysmally, with most ending in rejection. Over time his technique evolves until Lucy consistently falls for him every day. We see this same example in countless language model demos, where a meticulously crafted prompt is used to visualize analytics for a dataset or generate a spot-on cover letter.

    The examples are impressive, but how useful is this prompting really? In the movie, Henry finally addresses the extreme limitations in a life of infinite first dates and tells Lucy about her condition. With a language model, a “perfect prompt” executed in isolation is just as limited in value. Complex tasks require many complex steps, each building on a modified state — and this cannot be accomplished in a single turn. While prompt engineering is certainly an important piece of the puzzle, it isn’t remotely a holistic solution to our problem.

    RAG, a newspaper, and a videotape

    For both Lucy and the language model, things get interesting once we start externalizing memories. Retrieval Augmented Generation (RAG) is probably a close second to prompt engineering in the sheer volume of attention paid in LLM-related content. RAG can be more simply stated as “store text somewhere, then on each turn search that text and add bits to the prompt.” The most common RAG implementations today are blind semantic searches, where every user input is searched against the RAG store by semantic similarity, and then the top few search results are combined with the user input as the prompt. They look something like this:

    # prompt with just user input
    Question: What is the last thing Todd said before he quit yesterday?

    vs

    # prompt with vector similarity search results for "What is the last thing Todd said before he quit yesterday?" via embeddings, prepended to prompt
    Context:
    "Margo: Todd is quitting today!"
    "Todd: I am quitting today. I've had enough."
    "Clark: I can't believe Todd finally quit, Margo is going to freak."

    Question: What is the last thing Todd said before he quit yesterday?

    The context injected by RAG might be very helpful, or it might be virtually irrelevant. What’s more, the question may not require context at all, and the RAG may just be noise.
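
    As a rough illustration of that “blind semantic search” pattern, here is a sketch that uses the sentence-transformers package for embeddings and plain cosine similarity for retrieval. The embedding model name, the in-memory store, and the blind_rag_prompt() helper are assumptions for the example, not a reference to any particular RAG library.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Assumed embedding model; any sentence-embedding model would work similarly
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # The "RAG store": a flat list of past messages and their embeddings
    store = [
        "Margo: Todd is quitting today!",
        "Todd: I am quitting today. I've had enough.",
        "Clark: I can't believe Todd finally quit, Margo is going to freak.",
        "Margo: the quarterly report is due Friday.",
    ]
    store_vecs = embedder.encode(store, normalize_embeddings=True)

    def blind_rag_prompt(question, top_k=3):
        # Embed the question and keep the top_k most similar stored lines
        q_vec = embedder.encode([question], normalize_embeddings=True)[0]
        scores = store_vecs @ q_vec  # cosine similarity, since vectors are normalized
        best = np.argsort(scores)[::-1][:top_k]
        context = "\n".join(f'"{store[i]}"' for i in best)
        return f"Context:\n{context}\n\nQuestion: {question}"

    print(blind_rag_prompt("What is the last thing Todd said before he quit yesterday?"))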

    Again 50 First Dates does not disappoint with real-world analogs. In the film, Lucy’s condition is kept hidden from her with the help of falsified context clues; her father swaps out her newspaper with a reprinted one, passes off a recorded football game as live TV, and paints over a wall every evening so she can re-paint it the next day, none the wiser. This context adds to the prompt and allows Lucy to live a full day (albeit the same one over and over). It does a significantly better job of reaching the desired outcome (Lucy enjoys her day and is able to function within it) than relying completely on the day’s organic events. Later, Henry introduces the first attempt to be honest with Lucy in the form of a VHS recording. To the plot of the film this is a pivotal moment, as it is Lucy’s first step towards regaining agency. With the language model, it is functionally the same as the newspaper and the paint; each turn is potentially better and more informed when it includes RAG content, but it is still very much an isolated turn without true external state management.

    Regardless of which Lucy consumes — the fake newspaper or Henry’s real VHS tape — improvement in Lucy’s life is limited to the outcome of that day. Lucy still has no agency to live a full life, just as our language model can take no meaningful steps toward completing a complex task.

    Just like prompt engineering, RAG is a piece of the puzzle, but it is not the answer in and of itself.

    A mind with a diary

    Let’s review that theory from the car. What if Lucy kept a diary, and then managed this “external memory” by summarizing, consolidating, and making herself cheat sheets? Unlike her father’s newspapers or Henry’s VHS tapes, this memory would be completely under her control. She decides what information is critically important, what memories can be forgotten, and what knowledge should live on in “cold storage” to be dredged up only when required. The film touches on this idea (though it took my romance-ruining logic to really see it played out in detail). With an external memory like this, Lucy is now unbound from her context window. She can pursue her interests for as long as they take, participate actively in the mundane but important events of life, have a family, and live. She can make a menu for the week on Monday, go shopping for groceries on Wednesday, and cook them on Friday — all the elements of agency returned to her by a few notebooks and to-do lists.

    This is remarkably similar to the premise behind the MemGPT project(2). The context window, chat history, and RAG knowledge store are treated as tiered “memories”; each memory can be written and purged by the LLM using tool calls. Instead of approaching each turn as an isolated prompt + RAG + input = response, the “core” memory (the context window content) is processed continually as a daemon, with a new turn on each “heartbeat.” Just as each day Lucy is presented with her notes and to-do list from the previous day, with each turn the language model is presented with the core memory assembled by the previous turn. During that turn, the language model can edit this core memory, move information out of core and into archival memory, or search/recall from archival memory and add the results back to core, and these changes carry over to the core memory presented in the next turn.

    From the MemGPT research paper. This is brilliant work; I highly recommend you check it out first-hand.

    Here’s a walk-through of how an agent could work (with a bit of pseudo-code) to answer a user’s question about a company; a minimal Python sketch of the loop follows the numbered steps below. The agent has a core memory describing itself and the user, a history section containing the last five messages, and a list of tools the agent can call. This core memory is passed as the prompt in a turn every 10 seconds, or sooner if externally triggered.

    1. On the first pass, the agent calls the archive_lines() tool to move the 2nd paragraph of the user information to cold storage. Why? Because the system prompt informs the agent of the current and max memory sizes and instructs the agent to self-adjust when needed.
    2. On the next pass, the tool call request and result are the newest entries in the core memory history (the 2 oldest from the last turn have been moved to recall memory and are not visible). In this turn the agent calls pause_heartbeats() to wait for external stimuli, as it detects nothing else to update in its own core memory.
    3. The next turn is triggered by a user message, which is the newest history entry in this turn’s core memory. The user message says “How many people are in the Seattle office?” and the agent’s first action is to call the get_employee_list() tool, along with the internal monologue message “I must get the list of employees and count only the ones in the Seattle office. I must then send the message back to the user.”
    4. On the next turn, the agent tool call, internal monologue, and tool call result are in the core memory history, but the original user message has been pushed out. On this turn the agent’s internal monologue is “65 employees are in the Seattle office. I must retrieve the original question so I can answer it correctly” and the tool call is retrieve_message(user, -1) to get the last message from the user.
    5. On the next turn, the agent calls send_message(“there are 65 employees in the Seattle office”), and responds to the user.
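
    To make the loop above more tangible, here is a toy Python sketch of the heartbeat cycle with tiered memories and tool dispatch. As noted in footnote 2, this mirrors the concepts rather than the actual MemGPT implementation; llm(), run_tool(), and the tool names are placeholders.

    import time

    # Tiered memories: core is what the model sees each turn; recall and archival live outside it
    core_memory = {"persona": "I am a helpful company assistant.", "user": "...", "history": []}
    recall_memory = []    # evicted chat history
    archival_memory = []  # long-term "cold storage"
    MAX_HISTORY = 5

    def llm(memory):
        """Placeholder for a real model call; returns a tool request for this turn."""
        return {"tool": "pause_heartbeats", "args": {}}

    def run_tool(request):
        # Hypothetical tools matching the walk-through above
        if request["tool"] == "archive_lines":
            archival_memory.append(request["args"]["text"])  # core -> cold storage
            return "archived"
        if request["tool"] == "send_message":
            print(request["args"]["text"])  # reply to the user
            return "sent"
        return "paused"

    while True:
        request = llm(core_memory)  # one "turn" per heartbeat
        result = run_tool(request)
        core_memory["history"].append((request, result))
        # The oldest history entries fall out of core memory and into recall memory
        while len(core_memory["history"]) > MAX_HISTORY:
            recall_memory.append(core_memory["history"].pop(0))
        if request["tool"] == "pause_heartbeats":
            break  # wait for external stimuli
        time.sleep(10)  # heartbeat interval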

    And the beat goes on.

    This “language model operating system” is a starkly contrasting view of how inference services could function to complete complex, multifaceted tasks. It also changes the paradigm of “evolution” as an agent, allowing the application to adjust and self-correct. Prompts and RAG become correctable elements within a generative “cycle”; if a RAG search is ineffective or a prompt misses the mark, it can be re-tried or compensated for on the next turn. Most importantly, and in distinct contrast to single-turn agent design, the results of this self-managed memory are cumulative. That is an absolute necessity for true agency.

    I am very excited about what a framework built on this concept could mean; adding stimuli to a well-appointed agent (or cluster of agents) yields an execution layer that evolves beyond text generation, with an ROI that grows exponentially with the complexity of its charge. A language model operating in this fashion is still a language model — a function, not a sentient being — but it crosses a threshold of appearance that is the stuff of Sci-Fi. More importantly, it adds a critical element to the generative equation that I just don’t see autonomous agency succeeding without: repetition. Humans don’t immediately think of every required thought and blurt out the perfect response in one breath; we take steps, ask questions that uncover new questions, pause to consider internally, and arrive at an answer. By bestowing that same capacity on an application, this language model operating system could be a new paradigm in computing.

    For those of us building user-facing agents, this is a concept worth focus and cycles. Single-turn prompt libraries and slice-and-dice embedding pipelines built to RAG away bad responses were the best ideas we had in the dark AI days of 6 months ago, but I don’t think they will get us where we want to go. In 50 First Dates Lucy married Henry, became a mom, and sailed to Alaska, all because she got back the agency to own her mind and manage her memories. Language model agents need the same in an “operating system” if we want to unlock them.

    Join me next time, when we explore the parallels between Happy Gilmore and K8S (hint: there are none).

    Footnotes:

    1. Adding context to a prompt and fine-tuning or retraining a model are not really the same thing, but I was willing to take a few liberties with technical accuracy for the sake of clearly demonstrating the subject concepts.

    2. Note that throughout this writing I am referring to the concepts introduced by the research behind MemGPT, not the implementation itself. The nomenclature, pseudo-code, and description of events here are not intended to reflect the software project.

    MemGPT Citation:

    Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560.

