Category: Artificial Intelligence

  • Key Insights for Teaching AI Agents to Remember

    Key Insights for Teaching AI Agents to Remember

    Sandi Besen

    Recommendations on building robust memory capabilities based on experimentation with Autogen’s “Teachable Agents”

    Memory is undoubtedly becoming a crucial aspect of Agentic AI. As the use cases for AI Agents grow in complexity, so does the need for these agents to learn from past experiences, utilize stored business-specific knowledge, and adapt to evolving scenarios based on accumulated information.

    In my previous article, “Memory in AI: Key Benefits and Investment Considerations,” I explored why memory is pivotal for AI, discussing its role in recall, reasoning, and continuous learning. This piece, however, will dive directly into the implementation of memory by examining its impact through the “teachability” functionality in the popular agent framework, Autogen.

    Note: While this article is technical in nature, it offers value for both technical professionals and business leaders looking to evaluate the role of memory in Agentic AI systems. I’ve structured it so that readers can skip over the code sections and still grasp the way memory can augment the responses of your AI systems. If you don’t wish to follow the code, you may read the descriptions of each step to get a sense of the process… or just the key findings and recommendations section.

    Source: DALL·E 3, Prompt Author: Sandi Besen

    Key Findings and Recommendations

    My exploration of Autogen’s Teachable Agents revealed both their potential and limitations in handling both simple and complex memory tasks.

    Out of the box, Autogen’s TeachableAgent performs less brilliantly than expected. The agent’s reasoning conflates memories in an unproductive way, and the included retrieval mechanism is not set up for the multi-step searches needed to answer complex questions. This limitation suggests that if you would like to use Autogen’s Teachable Agents, substantial customization is needed both to supplement reasoning capabilities and to achieve more sophisticated memory retrieval.

    To build more robust memory capabilities, it’s crucial to implement multi-step search functionality. A single memory search often falls short of providing the comprehensive information needed for complex tasks. Implementing a series of interconnected searches could significantly enhance the agent’s ability to gather and synthesize relevant information.
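
    To make this concrete, below is a minimal sketch of what an iterative, multi-step retrieval loop could look like. The store, retrieve, and llm names are hypothetical placeholders (not part of Autogen’s API), and the loop is illustrative only:

    # Minimal sketch of an iterative, multi-step memory search loop.
    # `store`, `store.retrieve()`, and `llm` are hypothetical placeholders for
    # your own memory store and language model call -- not part of Autogen's API.
    def multi_step_retrieve(store, question, llm, max_hops=3):
        """Iteratively query memory, letting the LLM decide what is still missing."""
        gathered = []
        query = question
        for _ in range(max_hops):
            gathered.extend(store.retrieve(query, top_k=5))  # one search per hop
            # Ask the model whether the collected memories cover every part of the
            # question; if not, it proposes the next search query to run.
            verdict = llm(
                f"Question: {question}\nMemories: {gathered}\n"
                "Reply DONE if these memories are sufficient, otherwise reply "
                "with the next search query to run."
            )
            if verdict.strip().upper() == "DONE":
                break
            query = verdict
        return gathered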

    The “teachability” feature, while powerful, should be approached with caution. Continuous activation without oversight risks data poisoning and the compromise of trusted information sources. Business leaders and solution architects should consider implementing a human-in-the-loop approach, allowing users to approve what the system learns rather than treating every inference as ground truth the system should learn from. The lack of such oversight in Autogen’s current Teachable Agent design poses significant risks associated with unchecked learning.
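
    As an illustration, a simple approval gate in front of memory writes could look like the sketch below; store_memo is a hypothetical stand-in for whatever persistence call your system uses:

    # Sketch of a human-in-the-loop gate in front of memory writes.
    # store_memo() is a hypothetical stand-in for your persistence call.
    def approve_and_store(candidate_memo, store_memo):
        """Only persist a memo after an explicit human approval."""
        print(f"Proposed memory:\n{candidate_memo}")
        answer = input("Store this memory? [y/N]: ").strip().lower()
        if answer == "y":
            store_memo(candidate_memo)
            return True
        return False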

    Lastly, the method of memory retrieval from a knowledge store plays a large role in the system’s effectiveness. Moving beyond simple nearest neighbor searches, which is the TeachableAgent’s default, to more advanced techniques such as hybrid search (combining keyword and vector approaches), semantic search, or knowledge graph utilization could dramatically improve the relevance and accuracy of retrieved information.
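
    For illustration, a hybrid retrieval score can blend a keyword score with a vector similarity score, as in the sketch below; keyword_score and vector_score are hypothetical placeholders for, say, a BM25 index and an embedding similarity lookup:

    # Sketch of hybrid retrieval: rank documents by a weighted mix of keyword and
    # vector relevance. keyword_score() and vector_score() are hypothetical
    # placeholders for e.g. a BM25 index and an embedding similarity lookup.
    def hybrid_search(query, documents, keyword_score, vector_score, alpha=0.5, top_k=5):
        scored = [
            (alpha * keyword_score(query, doc) + (1 - alpha) * vector_score(query, doc), doc)
            for doc in documents
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # highest blended score first
        return [doc for _, doc in scored[:top_k]]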

    Descriptive Code Implementation

    To appropriately demonstrate how external memory can be valuable, I created a fictitious scenario for a car parts manufacturing plant. Follow the code below to implement a Teachable Agent yourself.

    Scenario: A car parts manufacturing facility needs to put a plan in place in case there are energy constraints. The plan needs to be flexible and adapt based on how much power consumption the facility can use and which parts and models are in demand.

    Step 1:

    Pre-setup requires you to pip install Autogen, if it isn’t already installed in your active environment, and to create a config JSON file.

    Example of a compatible config file which uses Azure OpenAI’s service model GPT-4o:

    [{
        "model": "gpt-4o",
        "api_key": "<YOUR API KEY>",
        "azure_endpoint": "<YOUR ENDPOINT>",
        "api_type": "azure",
        "api_version": "2024-06-01"
    }]

    Install Autogen for python:

    pip install pyautogen

    Step 2:

    Import the necessary libraries to your notebook or file and load the config file.

    import autogen
    from autogen.agentchat.contrib.capabilities.teachability import Teachability
    from autogen import ConversableAgent, UserProxyAgent

    config_list = autogen.config_list_from_json(
        env_or_file="autogenconfig.json",  # the JSON file name that stores the config
        file_location=".",  # the file is in the same directory
        filter_dict={
            "model": ["gpt-4o"],  # select a subset of the models in your config
        },
    )

    Step 3:

    Create the Agents. We will need two agents because of the way that Autogen’s framework works. We use a UserProxyAgent to execute tasks and interact with or replace human involvement (depending on the desired amount of human in the loop). We also create a Conversable Agent as the “Teachable Agent” which is meant to interact with other agents (not the user). You can read more about the UserProxyAgents and ConversableAgents here.

    teachable_agent = ConversableAgent(
        name="teachable_agent",  # the name can't contain spaces
        llm_config={"config_list": config_list, "timeout": 120, "cache_seed": None},  # caching is disabled here; when enabled, API requests are cached and reused when the same request is made again
    )

    user = UserProxyAgent(
        name="user",
        human_input_mode="ALWAYS",  # I want full control over the code executed, so human_input_mode is set to ALWAYS. Other options are NEVER and TERMINATE.
        is_termination_msg=lambda x: True if "TERMINATE" in x.get("content") else False,  # setting a termination message is VERY important because it tells the agent when to finish
        max_consecutive_auto_reply=0,  # we don't need this agent to reply multiple times in a row
        code_execution_config={
            "use_docker": False
        },  # if you plan on running code, look into setting use_docker=True; it is safer than running code directly, but skipped here to avoid walking through the Docker setup
    )

    Step 4:

    To have a baseline, let’s ask the same question to the agents before and after we “teach” it. I purposely decided to demonstrate different levels of question difficulty to show performance on tasks that require a single step vs multistep memory retrieval. To successfully address the intricate multi-step question, the retrieval mechanism must provide information from two separate memories as context to the language model responsible for generating a final response.

    Simple Question: “The facility is experiencing a power shortage of 40%. What models need to be prioritized?”

    Complex Multi Step Question: “The facility is experiencing a power shortage of 40%. Provide me a detailed breakdown of what machines should be deactivated and which machines should remain active.”

    As expected, the responses are generic and not particularly applicable to an individual business.

    Response prior to enabling long term memory:

    SIMPLE QUESTION RESPONSE
    ===========================
    >>>>>>>> USING AUTO REPLY...
    teachable_agent (to user):

    To address a power shortage of 30% effectively, it's crucial to prioritize models that are essential to critical operations, safety, and those that have the highest impact on productivity. Here are steps to determine which models to prioritize:

    1. **Identify Essential Systems:** Prioritize systems that are vital for safety, security, and essential operations. These might include:
    - Emergency lighting
    - Security systems (alarms, surveillance)
    - Critical environmental controls (e.g., refrigeration for hazardous materials)
    - Communication systems

    2. **Evaluate Impact on Core Operations:** Determine which models are crucial to the facility’s core functions. This can include machinery, servers, or any processes that are indispensable for maintaining productivity and operations.

    3. **Energy Efficiency:** Prioritize models and equipment that are more energy-efficient and can operate within the reduced power capacity.

    4. **Downtime Cost:** Consider the cost of downtime for each model. Prioritize those where downtime would result in significant financial losses or production delays.

    Based on these principles, here’s a generic prioritization framework:

    1. **Safety and Security Systems:**
    - Emergency Lighting
    - Fire Alarms and Sprinkler Systems
    - Security Cameras and Sensors

    2. **Critical Operational Systems:**
    - Essential Production Machinery (that cannot be shut down without major issues)
    - Servers and IT Infrastructure (especially those critical for operations and data integrity)
    - HVAC Systems (if necessary for maintaining safe operating conditions)

    3. **Support Systems:**
    - Backup Power Systems (uninterruptible power supplies, generators)
    - Communication Systems (both internal and external communication)

    Ensure that non-essential and high energy-consuming equipment is powered down or scheduled to operate during off-peak hours if possible.

    Always re-evaluate periodically to ensure that the power allocation aligns with the dynamic needs of the facility. Communication with key stakeholders regarding priorities and decisions is also essential for smooth operations during a power shortage.


    COMPLEX MULTISTEP QUESTION RESPONSE
    ====================================
    >>>>>>>> USING AUTO REPLY...
    teachable_agent (to user):

    To effectively manage a facility experiencing a 30% power shortage, we need to prioritize essential operations and consider the criticality and energy consumption of each machine. Here is a detailed breakdown:

    ### Step 1: Identify Essential Operations

    1. **Critical Infrastructure**: Safety systems, security systems, and essential IT infrastructure.
    2. **Production Lines**: Machines involved in the primary production process.
    3. **Support Systems**: HVAC systems, lighting, and other ancillary equipment necessary to maintain a safe working environment.

    ### Step 2: Categorize Machines by Importance and Energy Consumption

    #### A. Essential and High-Energy Machines

    1. **Safety and Security Systems**
    - Fire alarms and sprinklers
    - Surveillance cameras
    - Security access controls

    2. **Primary Production Machines**
    - Core production lines
    - Central processing units or mainframe servers (for IT-heavy operations)

    3. **Critical Support Systems**
    - Emergency lighting
    - Essential HVAC systems (for maintaining operation-critical temperature and humidity)

    #### B. Non-Essential and High-Energy Machines

    1. **Secondary Production Lines**
    - Non-primary or non-critical production lines
    - Equipment used for non-essential or ancillary processes

    2. **Supplementary IT Infrastructure**
    - Servers used for non-essential operations
    - Backup systems that are not in immediate use

    3. **Non-Essential HVAC and Lighting**
    - HVAC systems for non-critical areas
    - Lighting in non-essential zones

    #### C. Non-Essential and Low-Energy Machines

    1. **Office Equipment**
    - Computers, printers, and copiers in non-essential offices
    2. **Non-essential Lab or Research Equipment**
    - Devices not directly linked to current essential tasks

    ### Step 3: Prioritize Deactivation

    #### Immediate Deactivation

    1. **Non-Essential and High-Energy Machines**
    - Secondary production lines
    - Non-essential backup servers

    2. **Non-Essential HVAC and Lighting**
    - HVAC in non-critical areas
    - Non-essential lighting

    3. **Non-Essential Office Equipment**
    - Printers, non-essential computers, and copiers

    #### Limited Usage

    1. **Essential and High-Energy Machines**
    - Operate critical production lines at reduced capacity if possible
    - Implement rotational operation schedules for essential processes

    2. **Critical Support Systems**
    - Optimize HVAC settings (e.g., raising or lowering set points within acceptable ranges)
    - Reduce lighting to minimum safety standards

    ### Step 4: Implement Energy Conservation Measures

    1. **Optimize Operations**
    - Utilize energy-efficient settings
    - Implement energy-saving protocols and encourage staff to conserve energy

    2. **Scheduled Downtime**
    - Plan for downtime during off-peak hours for essential machinery to balance the load

    3. **Regular Monitoring**
    - Monitor energy consumption in real-time to make quick adjustments as needed

    ### Summary

    #### Remain Active:
    1. Safety and security systems
    2. Main production lines
    3. Critical IT infrastructure
    4. Essential HVAC and lighting for operational areas

    #### Deactivate or Limit Use:
    1. Non-essential production lines
    2. Redundant IT systems
    3. Non-critical HVAC and lighting
    4. Office equipment in non-essential areas

    **Note**: These recommendations should be adjusted based on the specific operational requirements of your facility. Additionally, it could be beneficial to communicate these changes to all staff to ensure compliance and understanding.

    Code to ask baseline questions:


    #simple question
    user.initiate_chat(teachable_agent, message="The facility is experiencing a power shortage of 40%. What models need to be prioritized?", clear_history=True)
    #multistep complex question
    user.initiate_chat(teachable_agent, message="The facility is experiencing a power shortage of 30%. Provide me a detailed breakdown of what machines should be deactivated and which machines should remain active.", clear_history=True)

    Step 5:

    Create the “teachability” capability that you then add to the agent. The Teachability class inherits from the AgentCapability class, which essentially allows you to add customizable capabilities to the Agents.

    The Teachability class has many optional parameters that can be further explored here.

    The out of the box Teachability class is a quick and convenient way of adding long term memory to the agents, but it will likely need to be customized for use in a production setting, as outlined in the key findings section. It involves sending messages to an Analyzer Agent that evaluates the user messages for potential storage and retrieval. The Analyzer Agent looks for advice that could be applicable to similar tasks in the future and then summarizes and stores task-advice pairs in a vector database serving as the agent’s “memory”.

    teachability = Teachability(
        verbosity=0,  # 0 for basic info, 1 to add memory operations, 2 for analyzer messages, 3 for memo lists
        reset_db=True,  # we reset the db because we are creating a new agent and don't want any existing memories; to reuse an existing memory store, set this to False
        path_to_db_dir="./tmp/notebook/teachability_db",  # this is the default path; you can use any path you'd like
        recall_threshold=1.5,  # higher numbers allow more (but less relevant) memos to be recalled
        max_num_retrievals=10,  # 10 is the default, but you can set the max number of memos to be retrieved lower or higher
    )

    teachability.add_to_agent(teachable_agent)

    Step 6:

    Now that the teachable_agent is configured, we need to provide it the information that we want the agent to “learn” (store in the database and retrieve from).

    In line with our scenario, I wanted the agent to have basic understanding of the facility which consisted of:

    • the types of components the manufacturing plant produces
    • the types of car models the components need to be made for
    • which machines are used to make each component

    Additionally, I wanted to provide some operational guidance on the priorities of the facility depending on how power constrained it is. This includes:

    • Guidance in case of energy capacity constraint of more than 50%
    • Guidance in case of energy capacity constraint between 25–50%
    • Guidance in case of energy capacity constraint between 0–25%
    business_info = """
    # This manufacturing plant manufactures the following vehicle parts:
    - Body panels (doors, hoods, fenders, etc.)
    - Engine components (pistons, crankshafts, camshafts)
    - Transmission parts
    - Suspension components (springs, shock absorbers)
    - Brake system parts (rotors, calipers, pads)

    # This manufacturing plant produces parts for the following models:
    - Ford F-150
    - Ford Focus
    - Ford Explorer
    - Ford Mustang
    - Ford Escape
    - Ford Edge
    - Ford Ranger

    # Equipment for Specific Automotive Parts and Their Uses

    ## 1. Body Panels (doors, hoods, fenders, etc.)
    - Stamping presses: Form sheet metal into body panel shapes
    - Die sets: Used with stamping presses to create specific panel shapes
    - Hydraulic presses: Shape and form metal panels with high pressure
    - Robotic welding systems: Automate welding of body panels and structures
    - Laser cutting machines: Precisely cut sheet metal for panels
    - Sheet metal forming machines: Shape flat sheets into curved or complex forms
    - Hemming machines: Fold and crimp edges of panels for strength and safety
    - Metal finishing equipment (grinders, sanders): Smooth surfaces and remove imperfections
    - Paint booths and spraying systems: Apply paint and protective coatings
    - Drying ovens: Cure paint and coatings
    - Quality control inspection systems: Check for defects and ensure dimensional accuracy

    ## 2. Engine Components (pistons, crankshafts, camshafts)
    - CNC machining centers: Mill and drill complex engine parts
    - CNC lathes: Create cylindrical parts like pistons and camshafts
    - Boring machines: Enlarge and finish cylindrical holes in engine blocks
    - Honing machines: Create a fine surface finish on cylinder bores
    - Grinding machines: Achieve precise dimensions and smooth surfaces
    - EDM equipment: Create complex shapes in hardened materials
    - Forging presses: Shape metal for crankshafts and connecting rods
    - Die casting machines: Produce engine blocks and cylinder heads
    - Heat treatment furnaces: Alter material properties for strength and durability
    - Quenching systems: Rapidly cool parts after heat treatment
    - Balancing machines: Ensure rotating parts are perfectly balanced
    - Coordinate Measuring Machines (CMMs): Verify dimensional accuracy

    ## 3. Transmission Parts
    - Gear cutting machines: Create precise gear teeth on transmission components
    - CNC machining centers: Mill and drill complex transmission housings and parts
    - CNC lathes: Produce shafts and other cylindrical components
    - Broaching machines: Create internal splines and keyways
    - Heat treatment equipment: Harden gears and other components
    - Precision grinding machines: Achieve extremely tight tolerances on gear surfaces
    - Honing machines: Finish internal bores in transmission housings
    - Gear measurement systems: Verify gear geometry and quality
    - Assembly lines with robotic systems: Put together transmission components
    - Test benches: Evaluate completed transmissions for performance and quality

    ## 4. Suspension Components (springs, shock absorbers)
    - Coil spring winding machines: Produce coil springs to exact specifications
    - Leaf spring forming equipment: Shape and form leaf springs
    - Heat treatment furnaces: Strengthen springs and other components
    - Shot peening equipment: Increase fatigue strength of springs
    - CNC machining centers: Create precision parts for shock absorbers
    - Hydraulic cylinder assembly equipment: Assemble shock absorber components
    - Gas charging stations: Fill shock absorbers with pressurized gas
    - Spring testing machines: Verify spring rates and performance
    - Durability test rigs: Simulate real-world conditions to test longevity

    ## 5. Brake System Parts (rotors, calipers, pads)
    - High-precision CNC lathes: Machine brake rotors to exact specifications
    - Grinding machines: Finish rotor surfaces for smoothness
    - Die casting machines: Produce caliper bodies
    - CNC machining centers: Mill and drill calipers for precise fit
    - Precision boring machines: Create accurate cylinder bores in calipers
    - Hydraulic press: Compress and form brake pad materials
    - Powder coating systems: Apply protective finishes to calipers
    - Assembly lines with robotic systems: Put together brake components
    - Brake dynamometers: Test brake system performance and durability

    """

    business_rules_over50 = """
    - The engine components are critical and machinery should be kept online that corresponds to producing these components when capacity constraint is more or equal to 50%: engine components
    - Components for the following models should be prioritized when capacity constraint is more or equal to 50%: 1.Ford F-150
    """

    business_rules_25to50 = """
    - The following components are critical and machinery should be kept online that corresponds to producing these components when capacity constraint is between 25-50%: engine components and transmission parts
    - Components for the following models should be prioritized when capacity constraint is between 25-50%: 1.Ford F-150 2.Ford Explorer
    """

    business_rules_0to25 = """
    - The following components are critical and machinery should be kept online that corresponds to producing these components when capacity constraint is between 0-25%: engine components,transmission parts, Brake System Parts
    - Components for the following models should be prioritized when capacity constraint is between 0-25%: 1.Ford F-150 2.Ford Explorer 3.Ford Mustang 4.Ford Focus
    """
    user.initiate_chat(teachable_agent, message=business_info, clear_history=True)
    user.initiate_chat(teachable_agent, message=business_rules_over50, clear_history=True)
    user.initiate_chat(teachable_agent, message=business_rules_25to50, clear_history=True)
    user.initiate_chat(teachable_agent, message=business_rules_0to25, clear_history=True)

    Step 7:

    Now that we have added important information to long term memory, let’s ask the Agent the same question and see the difference in its responses.

    Simple Question: “The facility is experiencing a power shortage of 40%. What models need to be prioritized?”

    Complex Multi Step Question: “The facility is experiencing a power shortage of 40%. Provide me a detailed breakdown of what machines should be deactivated and which machines should remain active.”

    Results after implementing long term memory:

    The Teachable Agent was able to partially answer the simple question. It was correct on the car models that need to be prioritized, but it provided additional incorrect information about which components needed to be included.

    The complex multi-step question was not adequately addressed. It erroneously merged two distinct business guidelines for 0–25% and 25–50% capacity constraints into a single, generalized guideline, resulting in an inaccurate response. Additionally, it only provided a partial list of machines that should remain active or be deactivated for each component’s production, omitting crucial details from the complete operational protocol. This demonstrates the challenges in accurately retrieving and synthesizing information from multiple, separate memory sources for complex queries.

    SIMPLE QUESTION RESPONSE
    ===========================
    >>>>>>>> USING AUTO REPLY...
    teachable_agent (to user):
    Given the facility is experiencing a power shortage of 40%, this falls within the 25-50% capacity constraint range. According to your previously established priorities for such a scenario, the following models and components need to be prioritized:

    ### Components to Prioritize
    - Engine components
    - Transmission parts
    - Brake System Parts

    ### Models to Prioritize
    1. Ford F-150
    2. Ford Explorer

    You should ensure that the machinery necessary to produce these critical components for the Ford F-150 and Ford Explorer is kept online.

    COMPLEX MULTISTEP QUESTION RESPONSE
    ====================================
    >>>>>>>> USING AUTO REPLY...
    teachable_agent (to user):

    In response to the 40% power shortage, let's create a detailed plan to ensure the most critical machinery remains operational while deactivating less essential ones.

    ### Critical Prioritization (0-50% capacity constraint):
    - **Critical Components**:
    - Engine components (pistons, crankshafts, camshafts)
    - Transmission parts
    - Brake System Parts (rotors, calipers, pads)

    - **Prioritized Models**:
    - Ford F-150
    - Ford Explorer

    ### Machines to Remain Active:
    1. **Engine Component Machines**:
    - Engine Block Milling Machine
    - Crankshaft Grinding Machine
    - Camshaft Milling Machine

    2. **Transmission Part Machines**:
    - Transmission Housing Caster
    - Gear Cutting Machine
    - Transmission Assembly Line

    3. **Brake System Part Machines**:
    - Brake Rotor Casting Machine
    - Brake Caliper Milling Machine
    - Brake Pad Press

    ### Machines to Deactivate:
    1. **Body Panel Machines**:
    - Door Stamping Machine
    - Hood Assembly Machine
    - Fender Bending Machine

    2. **Suspension Component Machines**:
    - Coil Spring Forming Machine
    - Shock Absorber Assembly Line

    ### Detailed Breakdown:

    #### Active Machines:
    - **Engine Block Milling Machine**:
    - Produces critical engine blocks, essential for Ford F-150 and Ford Explorer.
    - **Crankshaft Grinding Machine**:
    - Ensures high precision crankshafts; crucial for the performance of Ford F-150 and Ford Explorer.
    - **Camshaft Milling Machine**:
    - Produces camshafts, necessary for Ford F-150 and Ford Explorer engines.

    - **Transmission Housing Caster**:
    - Important for casting the housing of transmissions used in Ford F-150 and Ford Explorer.
    - **Gear Cutting Machine**:
    - Precision gears production, vital for Ford F-150 and Ford Explorer transmissions.
    - **Transmission Assembly Line**:
    - Assembly of complete transmission units for Ford F-150 and Ford Explorer.

    - **Brake Rotor Casting Machine**:
    - Produces brake rotors, crucial for the safe braking systems in Ford F-150 and Ford Explorer.
    - **Brake Caliper Milling Machine**:
    - Machining of brake calipers, essential for Ford F-150 and Ford Explorer.
    - **Brake Pad Press**:
    - Manufacturing of brake pads, necessary for ensuring braking efficiency in Ford F-150 and Ford Explorer.

    #### Deactivated Machines:
    - **Door Stamping Machine**:
    - While important, body panels like doors are less critical than engine and transmission components under current constraints.
    - **Hood Assembly Machine**:
    - Deactivating hood assembly to free up power for critical components manufacturing.
    - **Fender Bending Machine**:
    - Non-critical compared to engine and brake components during a significant power shortage.
    - **Coil Spring Forming Machine**:
    - Suspension springs are lower priorities under current circumstances.
    - **Shock Absorber Assembly Line**:
    - Suspension components aren’t critical enough to keep operational during this level of power shortage.

    This plan ensures that production focuses on maintaining the critical components needed for primary models, the Ford F-150 and Ford Explorer, thereby maximizing the utility of the remaining power while managing production constraints effectively.

    Code:

    #simple question
    user.initiate_chat(teachable_agent, message="The facility is experiencing a power shortage of 40%. What models need to be prioritized?", clear_history=True)
    #multistep complex question
    user.initiate_chat(teachable_agent, message="The facility is experiencing a power shortage of 30%. Provide me a detailed breakdown of what machines should be deactivated and which machines should remain active.", clear_history=True)

    Conclusion

    While Autogen provides a straightforward introduction to AI systems with memory, it falls short in handling complex tasks effectively.

    When developing your own AI Agent System with memory capabilities, consider focusing on these key capabilities:

    • Implement multi-step searches to ensure comprehensive and relevant results. This allows the agent to assess the usefulness of search outcomes and address all aspects of a query using the retrieved information. Additionally, consider using more advanced retrieval approaches such as semantic search, hybrid search, or knowledge graphs for the best results.
    • To limit the potential for data poisoning, develop a thoughtful approach to who should be able to “teach” the agent and when the agent should “learn”. Based on guidelines set by the business or developer, agent reasoning can also be used to determine whether something should be added to memory and by whom.
    • Reduce the likelihood of retrieving out-of-date information by adding a memory-decay mechanism that determines when a memory is no longer relevant or should be replaced by a newer one (see the sketch after this list).
    • For multi-agent systems involving group chats or inter-agent information sharing, explore various communication patterns. Determine the most effective methods for transferring supplemental knowledge and establish limits to prevent information overload.
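
    As referenced above, here is a minimal sketch of a time-based decay rule in which newer memories on the same topic supersede older ones; the memory fields (topic, created_at) are hypothetical and illustrative only:

    from datetime import datetime, timedelta

    # Sketch of a simple memory-decay rule: drop memories past a time-to-live and
    # keep only the newest memory per topic. The memo dict fields are hypothetical.
    def prune_memories(memories, ttl_days=180):
        cutoff = datetime.now() - timedelta(days=ttl_days)
        newest_per_topic = {}
        for memo in memories:
            if memo["created_at"] < cutoff:
                continue  # expired memory is dropped
            topic = memo["topic"]
            if topic not in newest_per_topic or memo["created_at"] > newest_per_topic[topic]["created_at"]:
                newest_per_topic[topic] = memo  # newer memory replaces the older one
        return list(newest_per_topic.values())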

    Note: The opinions expressed in this article are solely those of the author and do not necessarily reflect the views or policies of their employer.

    Still have questions, or think that something needs to be further clarified? Drop me a DM on LinkedIn! I’m always eager to engage in food for thought and iterate on my work.



  • Is Multi-Collinearity Destroying Your Causal Inferences In Marketing Mix Modelling?

    Ryan O’Sullivan

    Causal AI, exploring the integration of causal reasoning into machine learning

    Photo by NOAA on Unsplash

    What is this series about?

    Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.

    In the last article we covered powering experiments with CUPED and double machine learning. Today, we shift our focus to understanding how multi-collinearity can damage the causal inferences you make, particularly in marketing mix modelling.

    If you missed the last article on powering experiments with CUPED and double machine learning, check it out here:

    Powering Experiments with CUPED and Double Machine Learning

    Introduction

    In this article, we will explore how damaging multi-collinearity can be and evaluate some methods we can use to address it. The following aspects will be covered:

    • What is multi-collinearity?
    • Why is it a problem in causal inference?
    • Why is it so common in marketing mix modelling?
    • How can we detect it?
    • How can we address it?
    • An introduction to Bayesian priors.
    • A Python case study exploring how Bayesian priors and random budget adjustments can help alleviate multi-collinearity.

    The full notebook can be found here:

    causal_ai/notebooks/is multi-collinearity destroying your mmm.ipynb at main · raz1470/causal_ai

    What is multi-collinearity?

    Multi-collinearity occurs when two or more independent variables in a regression model are highly correlated with each other. This high correlation means they provide overlapping information, making it difficult for the model to distinguish the individual effect of each variable.

    Let’s take an example from marketing. You sell a product where demand is highly seasonal — therefore, it makes sense to spend more on marketing during peak periods when demand is high. However, if both TV and social media spend follow the same seasonal pattern, it becomes difficult for the model to accurately determine the individual contribution of each channel.

    User generated image

    Why is it a problem in causal inference?

    Multi-collinearity can lead to the coefficients of the correlated variables becoming unstable and biased. When multi-collinearity is present, the standard errors of the regression coefficients tend to inflate. This means that the uncertainty around the estimates increases, making it harder to tell if a variable is truly significant.

    Let’s go back to our marketing example, even if TV advertising and social media both drive sales, the model might struggle to separate their impacts because the inflated standard errors make the coefficient estimates unreliable.

    We can simulate some examples in python to get a better understanding:

    Example 1 — Marketing spend on each channel is equal, resulting in biased coefficients:

    # Example 1 - marketing spend on each channel is equal: biased coefficients
    import numpy as np
    from sklearn.linear_model import LinearRegression

    np.random.seed(150)

    tv_spend = np.random.normal(0, 50, 1000)
    social_spend = tv_spend
    sales = 0.10 * tv_spend + 0.20 * social_spend
    X = np.column_stack((tv_spend, social_spend))
    clf = LinearRegression()
    clf.fit(X, sales)

    print(f'Coefficients: {clf.coef_}')
    User generated image

    Example 2 — Marketing spend on each channel follows the same trend, this time resulting in a coefficient sign flip:

    # Example 2 - marketing spend on each channel follows the same trend: biased coefficients and sign flip
    np.random.seed(150)

    tv_spend = np.random.normal(0, 50, 1000)
    social_spend = tv_spend * 0.50
    sales = 0.10 * tv_spend + 0.20 * social_spend
    X = np.column_stack((tv_spend, social_spend))
    clf = LinearRegression()
    clf.fit(X, sales)

    print(f'Coefficients: {clf.coef_}')
    User generated image

    Example 3 — The addition of random noise allows the model to estimate the correct coefficients:

    # Example 3 - random noise added to marketing spend: correct coefficients
    np.random.seed(150)

    tv_spend = np.random.normal(0, 50, 1000)
    social_spend = tv_spend * 0.50 + np.random.normal(0, 1, 1000)
    sales = 0.10 * tv_spend + 0.20 * social_spend
    X = np.column_stack((tv_spend, social_spend))
    clf = LinearRegression()
    clf.fit(X, sales)

    print(f'Coefficients: {clf.coef_}')
    User generated image

    Additionally, multi-collinearity can cause a phenomenon known as sign flipping, where the direction of the effect (positive or negative) of a variable can reverse unexpectedly. For instance, even though you know social media advertising should positively impact sales, the model might show a negative coefficient simply because of its high correlation with TV spend. We can see this in example 2.

    Why is it so common in marketing mix modelling?

    We’ve already touched upon one key issue: marketing teams often have a strong understanding of demand patterns and use this knowledge to set budgets. Typically, they increase spending across multiple channels during peak demand periods. While this makes sense from a strategic perspective, it can inadvertently create a multi-collinearity problem.

    Even for products where demand is fairly constant, if the marketing team upweight or downweight each channel by the same percentage each week/month, then this will also leave us with a multi-collinearity problem.

    The other reason I’ve seen for multi-collinearity in MMM is poorly specified causal graphs (DAGs). If we just throw everything into a flat regression, it’s likely we will have a multi-collinearity problem. Take the example below — If paid search impressions can be explained using TV and Social spend, then including it alongside TV and Social in a flat linear regression model is likely going to lead to multi-collinearity.

    User generated image

    How can we detect it?

    Detecting multi-collinearity is crucial to prevent it from skewing causal inferences. Here are some common methods to identify it:

    Correlation

    A simple and effective way to detect multi-collinearity is by examining the correlation matrix. This matrix shows pairwise correlations between all variables in the dataset. If two predictors have a correlation coefficient close to +1 or -1, they are highly correlated, which could indicate multi-collinearity.
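
    As a quick sketch (not part of the original notebook), the correlation matrix can be computed with pandas on spend data simulated in the same way as the examples above:

    import numpy as np
    import pandas as pd

    # Two channels that follow the same trend plus a little noise, as in Example 3.
    np.random.seed(150)
    tv_spend = np.random.normal(0, 50, 1000)
    social_spend = tv_spend * 0.50 + np.random.normal(0, 1, 1000)
    sales = 0.10 * tv_spend + 0.20 * social_spend

    df_corr = pd.DataFrame({"tv_spend": tv_spend, "social_spend": social_spend, "sales": sales})
    # Pairwise correlations close to +/-1 between predictors flag potential multi-collinearity.
    print(df_corr.corr())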

    Variance inflation factor (VIF)

    Quantifies how much the variance of a regression coefficient is inflated due to multi-collinearity:

    VIF_i = 1 / (1 - R_i^2)

    The R-squared is obtained by regressing the chosen variable on all of the other independent variables. If the R-squared is high, this means the chosen variable can be predicted using the other independent variables (which results in a high VIF for the chosen variable).

    There are some rule-of-thumb cut-offs for VIF in terms of detecting multi-collinearity – however, I’ve not found any convincing resources backing them up, so I will not quote them here.
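
    As a sketch (again, not from the original notebook), VIFs can be computed with statsmodels’ variance_inflation_factor on the same style of simulated data:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Same simulated channels as above, highly correlated by construction.
    np.random.seed(150)
    tv_spend = np.random.normal(0, 50, 1000)
    social_spend = tv_spend * 0.50 + np.random.normal(0, 1, 1000)

    X_vif = sm.add_constant(pd.DataFrame({"tv_spend": tv_spend, "social_spend": social_spend}))
    for i, col in enumerate(X_vif.columns):
        if col == "const":
            continue  # the intercept's VIF is not informative
        print(col, variance_inflation_factor(X_vif.values, i))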

    Standard errors

    The standard error of a regression coefficient tells you how precisely that coefficient is estimated. It is calculated as the square root of the variance of the coefficient estimate. High standard errors may indicate multi-collinearity.

    Simulations

    Although knowing the 3 approaches highlighted above is useful, it can still be hard to quantify whether you have a serious problem with multi-collinearity. Another approach you could take is running a simulation with known coefficients and then seeing how well you can estimate them with your model. Let’s illustrate using an MMM example:

    • Extract channel spend and sales data as normal.
    -- example SQL code to extract data
    select
        observation_date,
        sum(tv_spend) as tv_spend,
        sum(social_spend) as social_spend,
        sum(sales) as sales
    from mmm_data_mart
    group by
        observation_date;
    • Create data generating process, setting a coefficient for each channel.
    # set coefficients for each channel using actual spend data
    marketing_contribution = tv_spend * 0.10 + social_spend * 0.20

    # calculate the remaining contribution
    other_contribution = sales - marketing_contribution

    # create arrays for regression
    X = np.column_stack((tv_spend, social_spend, other_contribution))
    y = sales
    • Train model and compare estimated coefficients to those set in the last step.
    # train regression model
    clf = LinearRegression()
    clf.fit(X, y)

    # recover coefficients
    print(f'Recovered coefficients: {clf.coef_}')

    Now we know how we can identify multi-collinearity, let’s move on and explore how we can address it!

    How can we address it?

    There are several strategies to address multi-collinearity:

    1. Removing one of the correlated variables
      This is a straightforward way to reduce redundancy. However, removing a variable blindly can be risky — especially if the removed variable is a confounder. A helpful step is determining the causal graph (DAG). Understanding the causal relationships allows you to assess whether dropping a correlated variable still enables valid inferences.
    2. Combining variables
      When two or more variables provide similar information, you can combine them. This method reduces the dimensionality of the model, mitigating multi-collinearity risk while preserving as much information as possible. As with the previous approach, understanding the causal structure of the data is crucial.
    3. Regularization techniques
      Regularization methods such as Ridge or Lasso regression are powerful tools to counteract multi-collinearity. These techniques add a penalty to the model’s complexity, shrinking the coefficients of correlated predictors. Ridge focuses on reducing the magnitude of all coefficients, while Lasso can drive some coefficients to zero, effectively selecting a subset of predictors (see the short Ridge sketch after this list).
    4. Bayesian priors
      Using Bayesian regression techniques, you can introduce prior distributions for the parameters based on existing knowledge. This allows the model to “regularize” based on these priors, reducing the impact of multi-collinearity. By informing the model about reasonable ranges for parameter values, it prevents overfitting to highly correlated variables. We’ll delve into this method in the case study to illustrate its effectiveness.
    5. Random budget adjustments
      Another strategy, particularly useful in marketing mix modeling (MMM), is introducing random adjustments to your marketing budgets at a channel level. By systematically altering the budgets you can start to observe the isolated effects of each. There are two main challenges with this method (1) Buy-in from the marketing team and (2) Once up and running it could take months or even years to collect enough data for your model. We will also cover this one off in the case study with some simulations.
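
    As referenced in the regularization point above, here is a minimal sketch of Ridge applied to the correlated-spend example from earlier. It is illustrative only; the alpha value is an arbitrary assumption and would need tuning on real data:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Ridge on the "same trend" example (Example 2): the L2 penalty gives a stable,
    # unique solution even under near-perfect collinearity, although the individual
    # coefficients remain biased.
    np.random.seed(150)
    tv_spend = np.random.normal(0, 50, 1000)
    social_spend = tv_spend * 0.50
    sales = 0.10 * tv_spend + 0.20 * social_spend

    X = np.column_stack((tv_spend, social_spend))
    ridge = Ridge(alpha=10.0)  # higher alpha = stronger shrinkage
    ridge.fit(X, sales)
    print(f'Ridge coefficients: {ridge.coef_}')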

    We will test some of these strategies out in the case study next.

    An introduction to Bayesian priors

    A deep dive into Bayesian priors is beyond the scope of this article, but let’s cover some of the intuition behind them to ensure we can follow the case study.

    Bayesian priors represent our initial beliefs about the values of parameters before we observe any data. In a Bayesian approach, we combine these priors with actual data (via a likelihood function) to update our understanding and calculate the posterior distribution, which reflects both the prior information and the data.

    To simplify: when building an MMM, we need to feed the model some prior beliefs about the coefficients of each variable. Instead of supplying a fixed upper and lower bound, we provide a distribution. The model then searches within this distribution and, using the data, calculates the posterior distribution. Typically, we use the mean of this posterior distribution to get our coefficient estimates.

    Of course, there’s more to Bayesian priors than this, but the explanation above serves as a solid starting point!
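
    To make the prior-to-posterior idea concrete, here is a minimal, hypothetical sketch in plain PyMC (separate from the pymc-marketing workflow used in the case study below): two simulated spend channels, HalfNormal priors on the coefficients, and the posterior mean used as the estimate. The data and prior settings are illustrative assumptions only:

    import numpy as np
    import pymc as pm

    # Simulate two spend channels with known, positive coefficients.
    rng = np.random.default_rng(40)
    tv = rng.normal(100, 20, 500)
    social = rng.normal(50, 10, 500)
    sales = 0.05 * tv + 0.15 * social + rng.normal(0, 1, 500)
    X = np.column_stack((tv, social))

    with pm.Model():
        beta = pm.HalfNormal("beta", sigma=2, shape=2)  # prior belief: coefficients are positive and modest
        noise = pm.HalfNormal("noise", sigma=1)
        pm.Normal("obs", mu=pm.math.dot(X, beta), sigma=noise, observed=sales)
        idata = pm.sample(1000, tune=1000, chains=2, random_seed=40)

    # The posterior means serve as the coefficient estimates.
    print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)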

    Case study

    You’ve recently joined a start-up who have been running their marketing strategy for a couple of years now. They want to start measuring it using MMM, but their early attempts gave unintuitive results (TV had a negative contribution!). It seems their problem stems from the fact that each marketing channel owner is setting their budget based on the demand forecast, leading to a problem with multi-collinearity. You are tasked with assessing the situation and recommending next steps.

    Data-generating-process

    Let’s start by creating a data-generating function in python with the following properties:

    • Demand is made up of 3 components: trend, seasonality and noise.
    • The demand forecast model comes from the data science team and can accurately predict within +/- 5% accuracy.
    • This demand forecast is used by the marketing team to set the budget for social and TV spend — We can add some random variation to these budgets using the spend_rand_change parameter.
    • The marketing team spend twice as much on TV compared to social media.
    • Sales are driven by a linear combination of demand, social media spend and TV spend.
    • The coefficients for social media and TV spend can be set using the true_coef parameter.
    import numpy as np
    import pandas as pd

    def data_generator(spend_rand_change, true_coef):
        '''
        Generate simulated marketing data with demand, forecasted demand, social and TV spend, and sales.

        Args:
            spend_rand_change (float): Random variation parameter for marketing spend.
            true_coef (list): True coefficients for demand, social media spend, and TV spend effects on sales.

        Returns:
            pd.DataFrame: DataFrame containing the simulated data.
        '''

        # Parameters for data generation
        start_date = "2018-01-01"
        periods = 365 * 3  # Daily data for three years
        trend_slope = 0.01  # Linear trend component
        seasonal_amplitude = 5  # Amplitude of the seasonal component
        seasonal_period = 30.44  # Monthly periodicity
        noise_level = 5  # Level of random noise in demand

        # Generate time variables
        time = np.arange(periods)
        date_range = pd.date_range(start=start_date, periods=periods)

        # Create demand components
        trend_component = trend_slope * time
        seasonal_component = seasonal_amplitude * np.sin(2 * np.pi * time / seasonal_period)
        noise_component = noise_level * np.random.randn(periods)

        # Combine to form demand series
        demand = 100 + trend_component + seasonal_component + noise_component

        # Initialize DataFrame
        df = pd.DataFrame({'date': date_range, 'demand': demand})

        # Add forecasted demand with slight random variation
        df['demand_forecast'] = df['demand'] * np.random.uniform(0.95, 1.05, len(df))

        # Simulate social media and TV spend with random variation
        df['social_spend'] = df['demand_forecast'] * 10 * np.random.uniform(1 - spend_rand_change, 1 + spend_rand_change, len(df))
        df['tv_spend'] = df['demand_forecast'] * 20 * np.random.uniform(1 - spend_rand_change, 1 + spend_rand_change, len(df))
        df['total_spend'] = df['social_spend'] + df['tv_spend']

        # Calculate sales based on demand, social, and TV spend, with some added noise
        df['sales'] = (
            df['demand'] * true_coef[0] +
            df['social_spend'] * true_coef[1] +
            df['tv_spend'] * true_coef[2]
        )
        sales_noise = 0.01 * df['sales'] * np.random.randn(len(df))
        df['sales'] += sales_noise

        return df

    Initial assessment

    Now let’s simulate some data with no random variation applied to how the marketing team set the budget — We will try and estimate the true coefficients. The function below is used to train the regression model:

    def run_reg(df, features, target):
        '''
        Runs a linear regression on the specified features to predict the target variable.

        Args:
            df (pd.DataFrame): The input data containing features and target.
            features (list): List of column names to be used as features in the regression.
            target (str): The name of the target column to be predicted.

        Returns:
            np.ndarray: Array of recovered coefficients from the linear regression model.
        '''

        # Extract features and target values
        X = df[features].values
        y = df[target].values

        # Initialize and fit linear regression model
        model = LinearRegression()
        model.fit(X, y)

        # Output recovered coefficients
        coefficients = model.coef_
        print(f'Recovered coefficients: {coefficients}')

        return coefficients
    np.random.seed(40)

    true_coef = [0.35, 0.15, 0.05]

    features = [
        "demand",
        "social_spend",
        "tv_spend",
    ]

    target = "sales"

    sim_1 = data_generator(0.00, true_coef)
    reg_1 = run_reg(sim_1, features, target)

    print(f"True coefficients: {true_coef}")
    User generated image

    We can see that the coefficient for social spend is underestimated whilst the coefficient for tv spend is overestimated. Good job you didn’t give the marketing team this model to optimise their budgets — It would have ended in disaster!

    In the short-term, could using Bayesian priors give less biased coefficients?

    In the long-term, would random budget adjustments create a dataset which doesn’t suffer from multi-collinearity?

    Let’s try and find out!

    Bayesian priors

    Let’s start with exploring Bayesian priors…

    We will be using my favourite MMM implementation pymc marketing:

    Guide – pymc-marketing 0.8.0 documentation

    We will use the same data we generated in the initial assessment:

    date_col = "date"

    y_col = "sales"

    channel_cols = ["social_spend", "tv_spend"]

    control_cols = ["demand"]

    X = sim_1[[date_col] + channel_cols + control_cols]
    y = sim_1[y_col]

    Before we get into the modelling, let’s have a look at the contribution of each variable:

    # calculate contribution
    true_contributions = [
        round(np.sum(X["demand"] * true_coef[0]) / np.sum(y), 2),
        round(np.sum(X["social_spend"] * true_coef[1]) / np.sum(y), 2),
        round(np.sum(X["tv_spend"] * true_coef[2]) / np.sum(y), 2),
    ]
    true_contributions
    User generated image

    Bayesian (default) priors

    Let’s see what result we get if we use the default priors. Below you can see that there are a lot of priors! This is because we have to supply priors for the intercept, ad stock and saturation transformation amongst other things. It’s the saturation beta we are interested in – This is the equivalent of the variable coefficients we are trying to estimate.

    mmm_default = MMM(
        adstock="geometric",
        saturation="logistic",
        date_column=date_col,
        channel_columns=channel_cols,
        control_columns=control_cols,
        adstock_max_lag=4,
        yearly_seasonality=2,
    )

    mmm_default.default_model_config
    User generated image

    We have to supply a distribution. The HalfNormal is a sensible choice for channel coefficients as we know they can’t be negative. Below we visualise what the distribution looks like to bring it to life:

    import matplotlib.pyplot as plt
    from scipy.stats import halfnorm

    sigma = 2

    x1 = np.linspace(0, 10, 1000)
    y1 = halfnorm.pdf(x1, scale=sigma)

    plt.figure(figsize=(8, 6))
    plt.plot(x1, y1, 'b-')
    plt.fill_between(x1, y1, alpha=0.2, color='blue')
    plt.title('Saturation beta: HalfNormal Distribution (sigma=2)')
    plt.xlabel('Saturation beta')
    plt.ylabel('Probability Density')
    plt.grid(True)
    plt.show()
    User generated image

    Now we are ready to train the model and extract the contributions of each channel. As before our coefficients are biased (we know this as the contributions for each channel aren’t correct — social media should be 50% and TV should be 35%). However, interestingly they are much closer to the true contribution compared to when we ran linear regression before. This would actually be a reasonable starting point for the marketing team!

    mmm_default.fit(X, y)
    mmm_default.plot_waterfall_components_decomposition();
    User generated image

    Bayesian (custom) priors

    Before we move on, let’s take the opportunity to think about custom priors. One (very bold) assumption we can make is that each channel has a similar return on investment (or in our case where we don’t have revenue, cost per sale). We can therefore use the spend distribution across channel to set some custom priors.

    As the MMM class does feature scaling in both the target and features, priors also need to be supplied in the scaled space. This actually makes it quite easy for us to do as you can see in the below code:

    total_spend_per_channel = X[channel_cols].sum(axis=0)
    spend_share = total_spend_per_channel / total_spend_per_channel.sum()

    n_channels = X[channel_cols].shape[1]
    prior_sigma = n_channels * spend_share.to_numpy()

    spend_share
    User generated image

    We then need to feed the custom priors into the model.

    my_model_config = {'saturation_beta': {'dist': 'HalfNormal', 'kwargs': {'sigma': prior_sigma}}}

    mmm_priors = MMM(
        model_config=my_model_config,
        adstock="geometric",
        saturation="logistic",
        date_column=date_col,
        channel_columns=channel_cols,
        control_columns=control_cols,
        adstock_max_lag=4,
        yearly_seasonality=2,
    )

    mmm_priors.default_model_config
    User generated image

    When we train the model and extract the coefficients, we see that the priors have come into play, with TV now having the highest contribution (because we spent more on TV than on social). However, this is very wrong and illustrates why we have to be so careful when setting priors! The marketing team should really think about running some experiments to help them set priors.

    mmm_priors.fit(X, y)
    mmm_priors.plot_waterfall_components_decomposition();

    Random budget adjustments

    Now we have our short-term plan in place, let’s think about the longer term plan. If we could persuade the marketing team to apply small random adjustments to their marketing channel budgets each month, would this create a dataset without multi-collinearity?

    The code below uses the data generator function and simulates a range of random spend adjustments:

    np.random.seed(40)

    # Define list to store results
    results = []

    # Loop through a range of random adjustments to spend
    for spend_rand_change in np.arange(0.00, 0.05, 0.001):
        # Generate simulated data with the current spend_rand_change
        sim_data = data_generator(spend_rand_change, true_coef)

        # Run the regression
        coefficients = run_reg(sim_data, features=['demand', 'social_spend', 'tv_spend'], target='sales')

        # Store the spend_rand_change and coefficients for later plotting
        results.append({
            'spend_rand_change': spend_rand_change,
            'coef_demand': coefficients[0],
            'coef_social_spend': coefficients[1],
            'coef_tv_spend': coefficients[2]
        })

    # Convert results to DataFrame for easy plotting
    results_df = pd.DataFrame(results)

    # Plot the coefficients as a function of spend_rand_change
    plt.figure(figsize=(10, 6))
    plt.plot(results_df['spend_rand_change'], results_df['coef_demand'], label='Demand Coef', color='r', marker='o')
    plt.plot(results_df['spend_rand_change'], results_df['coef_social_spend'], label='Social Spend Coef', color='g', marker='o')
    plt.plot(results_df['spend_rand_change'], results_df['coef_tv_spend'], label='TV Spend Coef', color='b', marker='o')

    # Add lines for the true coefficients
    plt.axhline(y=true_coef[0], color='r', linestyle='--', label='True Demand Coef')
    plt.axhline(y=true_coef[1], color='g', linestyle='--', label='True Social Spend Coef')
    plt.axhline(y=true_coef[2], color='b', linestyle='--', label='True TV Spend Coef')

    plt.title('Regression Coefficients vs Spend Random Change')
    plt.xlabel('Spend Random Change')
    plt.ylabel('Coefficient Value')
    plt.legend()
    plt.grid(True)
    plt.show()

    We can see from the results that just a small random adjustment to the budget for each channel can break free of the multi-collinearity curse!

    User generated image

    It’s worth noting that if I change the random seed (almost like resampling), the starting point for the coefficients varies — However, whatever seed I used the coefficients stabilised after a 1% random change in spend. I’m sure this will vary depending on your data-generating process so make sure you test it out using your own data!

    Final thoughts

    • Although the focus of this article was multi-collinearity, the big take away is the importance of simulating data and then trying to estimate the known coefficients (remember you set them yourself so you know them) — It’s an essential step if you want to have confidence in your results!
    • When it comes to MMM, it can be useful to use your actual spend and sales data as the base for your simulation — This will help you understand if you have a multi-collinearity problem.
    • If you use actual spend and sales data you can also carry out a random budget adjustment simulation to help come up with a suitable randomisation strategy for the marketing team. Keep in mind my simulation was simplistic to illustrate a point — We could design a much more effective strategy e.g. testing different areas of the response curve for each channel.
    • Bayesian methods can have a steep learning curve — The other approach we could take is using a constrained regression in which you set upper and lower bounds for each channel coefficient based on prior knowledge.
    • If you are setting Bayesian priors, it’s super important to be transparent about how they work and how they were selected. If you go down the route of using the channel spend distribution, the assumption that each channel has a similar ROI needs signing off from the relevant stakeholders.
    • Bayesian priors are not magic! Ideally you would use results from experiments to set your priors — It’s worth checking out how the pymc-marketing team have approached this:

    Lift Test Calibration – pymc-marketing 0.8.0 documentation
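
    As a rough illustration of the constrained-regression alternative mentioned above (not part of the original analysis), a bound-constrained fit can be run with scipy's lsq_linear on the simulated data from earlier; the bound values below are purely illustrative and would normally come from prior knowledge or experiments:

    # Minimal sketch of a bound-constrained regression on sim_data from above.
    import numpy as np
    from scipy.optimize import lsq_linear

    X = sim_data[['demand', 'social_spend', 'tv_spend']].values
    y = sim_data['sales'].values

    lower = np.array([0.0, 0.0, 0.0])      # illustrative: no negative effects allowed
    upper = np.array([np.inf, 1.0, 1.0])   # illustrative: cap the spend coefficients

    res = lsq_linear(X, y, bounds=(lower, upper))
    print(res.x)  # bounded coefficient estimates for demand, social_spend, tv_spend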

    That is it, hope you enjoyed this instalment! Follow me if you want to continue this journey into Causal AI – In the next article we will immerse ourselves in the topic of bad controls!


    Is Multi-Collinearity Destroying Your Causal Inferences In Marketing Mix Modelling? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Amazon EC2 P5e instances are generally available

    Amazon EC2 P5e instances are generally available

    Avi Kulkarni

    In this post, we discuss the core capabilities of Amazon Elastic Compute Cloud (Amazon EC2) P5e instances and the use cases they’re well-suited for. We walk you through an example of how to get started with these instances and carry out inference deployment of Meta Llama 3.1 70B and 405B models on them.


  • Exploring data using AI chat at Domo with Amazon Bedrock

    Exploring data using AI chat at Domo with Amazon Bedrock

    Joe Clark

    In this post, we share how Domo, a cloud-centered data experiences innovator is using Amazon Bedrock to provide a flexible and powerful AI solution.


  • Data Science at Home: Solving the Nanny Schedule Puzzle with Monte Carlo and Genetic Algorithms

    Courtney Perigo

    Bringing order to chaos while simplifying our search for the perfect nanny for our childcare

    As a data science leader, I’m used to having a team that can turn chaos into clarity. But when the chaos is your own family’s nanny schedule, even the best-laid plans can go awry. The thought of work meetings, nap times, and unpredictable shifts had our minds running in circles, until I realized I could use the same algorithms that solve business problems to solve a very personal one. Armed with Monte Carlo simulation, genetic algorithms, and a dash of parental ingenuity, I embarked on a journey to tame our wild schedules, one algorithmic tweak at a time. The results? Well, let’s just say our nanny’s new schedule looks like a perfect fit.

    Photo by Markus Spiske on Unsplash

    Setting the Stage: The Great Schedule Puzzle

    Our household schedule looks like the aftermath of a bull in a china shop. Parent 1, with a predictable 9-to-5, was the easy piece of the puzzle. But then came Parent 2, whose shifts in a bustling emergency department at a Chicago hospital were anything but predictable. Some days started with the crack of dawn, while others stretched late into the night, with no rhyme or reason to the pattern. Suddenly, what used to be a straightforward schedule turned into a Rubik’s Cube with no solution in sight.

    Photo by Nick Fewings on Unsplash

    We imagined ourselves as parents in this chaos. Mornings becoming a mad dash, afternoons always being a guessing game, and evenings — who knows? Our family was headed for a future of playing “who’s on nanny duty?” We needed a decision analytics solution that could adapt as quickly as the ER could throw us a curveball.

    That’s when it hit me: what if I could use the same tools I rely on at work to solve this ever-changing puzzle? What if, instead of fighting against the chaos, we could harness it — predict it even? Armed with this idea, it was time to put our nanny’s schedule under the algorithmic microscope.

    The Data Science Toolbox: When in Doubt, Simulate

    With our household schedule resembling the aftermath of a bull in a china shop, it was clear that we needed more than just a calendar and a prayer. That’s when I turned to Monte Carlo simulation — the data scientist’s version of a crystal ball. The idea was simple: if we can’t predict exactly when chaos will strike, why not simulate all the possible ways it could go wrong?

    Monte Carlo simulation is a technique that uses random sampling to model a system’s behavior. In this case, we’re going to use it to randomly generate possible work schedules for Parent 2, allowing us to simulate the unpredictable nature of their shifts over many iterations.

    Imagine running thousands of “what-if” scenarios: What if Parent 2 gets called in for an early shift? What if an emergency keeps them late at the hospital? What if, heaven forbid, both parents’ schedules overlap at the worst possible time? The beauty of Monte Carlo is that it doesn’t just give you one answer — it gives you thousands, each one a different glimpse into the future.

    This wasn’t just about predicting when Parent 2 might get pulled into a code blue; it was about making sure our nanny was ready for every curveball the ER could throw at us. Whether it was an early morning shift or a late-night emergency, the simulation helped us see all the possibilities, so we could plan for the most likely — and the most disastrous — scenarios. Think of it as chaos insurance, with the added bonus of a little peace of mind.

    In the following code block, the simulation generates a work schedule for Parent 2 over a five-day workweek (Monday-Friday). Each day, there’s a probability that Parent 2 is called into work, and if so, a random shift is chosen from a set of predefined shifts based on those probabilities. We’ve also added a feature that accounts for a standing meeting on Wednesdays at 1pm and adjusts Parent 2’s schedule accordingly.

    import numpy as np

    # Example parameters (illustrative values; the original article defines these
    # elsewhere): the chance Parent 2 works on a given day, and the shift options
    # with their probabilities and start/end hours.
    parent_2_work_prob = 0.5
    parent_2_shift_probabilities = {
        'early':   {'probability': 0.4, 'start_hour': 6,  'end_hour': 14},
        'day':     {'probability': 0.4, 'start_hour': 10, 'end_hour': 18},
        'evening': {'probability': 0.2, 'start_hour': 14, 'end_hour': 22},
    }

    def simulate_parent_2_schedule(num_days=5):
        parent_2_daily_schedule = []  # Initialize empty schedule for Parent 2

        for day in range(num_days):
            if np.random.rand() < parent_2_work_prob:  # Randomly determine if Parent 2 works
                shift = np.random.choice(
                    list(parent_2_shift_probabilities.keys()),
                    p=[parent_2_shift_probabilities[s]['probability'] for s in parent_2_shift_probabilities]
                )
                start_hour = parent_2_shift_probabilities[shift]['start_hour']  # Get start time
                end_hour = parent_2_shift_probabilities[shift]['end_hour']  # Get end time

                # Check if it's Wednesday and adjust schedule to account for a meeting
                if day == 2:
                    meeting_start = 13
                    meeting_end = 16
                    # Adjust schedule if necessary to accommodate the meeting
                    if end_hour <= meeting_start:
                        end_hour = meeting_end
                    elif start_hour >= meeting_end:
                        parent_2_daily_schedule.append({'start_hour': meeting_start, 'end_hour': end_hour})
                        continue
                    else:
                        if start_hour > meeting_start:
                            start_hour = meeting_start
                        if end_hour < meeting_end:
                            end_hour = meeting_end

                parent_2_daily_schedule.append({'start_hour': start_hour, 'end_hour': end_hour})
            else:
                # If Parent 2 isn't working that day, leave the schedule empty or just the meeting
                if day == 2:
                    parent_2_daily_schedule.append({'start_hour': 13, 'end_hour': 16})  # the 1pm standing meeting
                else:
                    parent_2_daily_schedule.append({'start_hour': None, 'end_hour': None})

        return parent_2_daily_schedule

    We can use the simulate_parent_2_schedule function to simulate Parent 2’s schedule over a workweek and combine it with Parent 1’s more predictable 9–5 schedule. By repeating this process for 52 weeks, we can simulate a typical year and identify the gaps in parental coverage. This allows us to plan for when the nanny is needed the most. The image below summarizes the parental unavailability across a simulated 52-week period, helping us visualize where additional childcare support is required.
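
    A minimal sketch of that yearly roll-up, assuming the simulate_parent_2_schedule function above; Parent 1’s fixed 9–5 and the hour-by-hour gap counting are simplified for illustration:

    # Count, per weekday and hour, how often neither parent is home over a year.
    import numpy as np

    def weekly_coverage_gaps(num_weeks=52):
        gap_hours = np.zeros((5, 24))  # weekday x hour counts of "no parent home"
        for _ in range(num_weeks):
            parent_2_week = simulate_parent_2_schedule(num_days=5)
            for day, shift in enumerate(parent_2_week):
                for hour in range(24):
                    parent_1_home = not (9 <= hour < 17)  # Parent 1 works a fixed 9-5
                    parent_2_home = (
                        shift['start_hour'] is None
                        or not (shift['start_hour'] <= hour < shift['end_hour'])
                    )
                    if not parent_1_home and not parent_2_home:
                        gap_hours[day, hour] += 1
        return gap_hours

    gaps = weekly_coverage_gaps()
    print(gaps.sum(axis=1))  # uncovered parent-hours per weekday across the simulated year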

    Image Special from Author

    Evolving the Perfect Nanny: The Power of Genetic Algorithms

    Armed with simulation of all the possible ways our schedule can throw curveballs at us, I knew it was time to bring in some heavy-hitting optimization techniques. Enter genetic algorithms — a natural selection-inspired optimization method that finds the best solution by iteratively evolving a population of candidate solutions.

    Photo by Sangharsh Lohakare on Unsplash

    In this case, each “candidate” was a potential set of nanny characteristics, such as their availability and flexibility. The algorithm evaluates different nanny characteristics, and iteratively improves those characteristics to find the one that fits our family’s needs. The result? A highly optimized nanny with scheduling preferences that balance our parental coverage gaps with the nanny’s availability.

    At the heart of this approach is what I like to call the “nanny chromosome.” In genetic algorithm terms, a chromosome is simply a way to represent potential solutions — in our case, different nanny characteristics. Each “nanny chromosome” had a set of features that defined their schedule: the number of days per week the nanny could work, the maximum hours she could cover in a day, and their flexibility to adjust to varying start times. These features were the building blocks of every potential nanny schedule the algorithm would consider.

    Defining the Nanny Chromosome

    In genetic algorithms, a “chromosome” represents a possible solution, and in this case, it’s a set of features defining a nanny’s schedule. Here’s how we define a nanny’s characteristics:

    # Function to generate nanny characteristics
    def generate_nanny_characteristics():
        return {
            'flexible': np.random.choice([True, False]),  # Nanny's flexibility
            'days_per_week': np.random.choice([3, 4, 5]),  # Days available per week
            'hours_per_day': np.random.choice([6, 7, 8, 9, 10, 11, 12])  # Hours available per day
        }

    Each nanny’s schedule is defined by their flexibility (whether they can adjust start times), the number of days they are available per week, and the maximum hours they can work per day. This gives the algorithm the flexibility to evaluate a wide variety of potential schedules.

    Building the Schedule for Each Nanny

    Once the nanny’s characteristics are defined, we need to generate a weekly schedule that fits those constraints:

    # Function to calculate a weekly schedule based on nanny's characteristics
    def calculate_nanny_schedule(characteristics, num_days=5):
        shifts = []
        for _ in range(num_days):
            start_hour = np.random.randint(6, 12) if characteristics['flexible'] else 9  # Flexible nannies have varying start times (6 to 11, upper bound exclusive)
            end_hour = start_hour + characteristics['hours_per_day']  # Calculate end hour based on hours per day
            shifts.append((start_hour, end_hour))
        return shifts  # Return the generated weekly schedule

    This function builds a nanny’s schedule based on their defined flexibility and working hours. Flexible nannies can start between 6 AM and 12 PM, while others have fixed schedules that start and end at set times. This allows the algorithm to evaluate a range of possible weekly schedules.

    Selecting the Best Candidates

    Once we’ve generated an initial population of nanny schedules, we use a fitness function to evaluate which ones best meet our childcare needs. The most fit schedules are selected to move on to the next generation:

    # Function for selection in genetic algorithm
    def selection(population, fitness_scores, num_parents):
        # Normalize fitness scores and select parents based on probability
        min_fitness = np.min(fitness_scores)
        if min_fitness < 0:
            fitness_scores = fitness_scores - min_fitness

        fitness_scores_sum = np.sum(fitness_scores)
        probabilities = fitness_scores / fitness_scores_sum if fitness_scores_sum != 0 else np.ones(len(fitness_scores)) / len(fitness_scores)

        # Select parents based on their fitness scores
        selected_parents = np.random.choice(population, size=num_parents, p=probabilities)
        return selected_parents

    In the selection step, the algorithm evaluates the population of nanny schedules using a fitness function that measures how well the nanny’s availability aligns with the family’s needs. The most fit schedules, those that best cover the required hours, are selected to become “parents” for the next generation.
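
    The fitness function itself (fitness_function_yearly, used again further below) isn’t shown in this article, so the following is an illustrative stand-in rather than the author’s implementation. It assumes all_childcare_weeks is a list of weeks, each holding five daily (start_hour, end_hour) coverage needs (or (None, None) when no coverage is required), and it rewards covered hours while penalizing uncovered and paid-but-idle ones:

    # Illustrative fitness function (assumed data layout; weights are arbitrary).
    def fitness_function_yearly(characteristics, all_childcare_weeks):
        yearly_hours_worked = 0
        covered, uncovered, idle = 0, 0, 0
        for week in all_childcare_weeks:
            nanny_week = calculate_nanny_schedule(characteristics)  # from the block above
            for (need_start, need_end), (n_start, n_end) in zip(week, nanny_week):
                yearly_hours_worked += n_end - n_start
                if need_start is None:
                    idle += n_end - n_start  # nanny scheduled, but no coverage needed
                    continue
                overlap = max(0, min(need_end, n_end) - max(need_start, n_start))
                covered += overlap
                uncovered += (need_end - need_start) - overlap
                idle += (n_end - n_start) - overlap
        # Reward covered hours, penalize unmet needs and idle paid hours
        # (days_per_week handling omitted for brevity).
        fitness_score = covered - 2 * uncovered - 0.5 * idle
        return fitness_score, yearly_hours_worked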

    Adding Mutation to Keep Things Interesting

    To avoid getting stuck in suboptimal solutions, we add a bit of randomness through mutation. This allows the algorithm to explore new possibilities by occasionally tweaking the nanny’s schedule:

    # Function to mutate nanny characteristics
    def mutate_characteristics(characteristics, mutation_rate=0.1):
        if np.random.rand() < mutation_rate:
            characteristics['flexible'] = not characteristics['flexible']
        if np.random.rand() < mutation_rate:
            characteristics['days_per_week'] = np.random.choice([3, 4, 5])
        if np.random.rand() < mutation_rate:
            characteristics['hours_per_day'] = np.random.choice([6, 7, 8, 9, 10, 11, 12])
        return characteristics

    By introducing small mutations, the algorithm is able to explore new schedules that might not have been considered otherwise. This diversity is important for avoiding local optima and improving the solution over multiple generations.

    Evolving Toward the Perfect Schedule

    The final step was evolution. With selection and mutation in place, the genetic algorithm iterates over several generations, evolving better nanny schedules with each round. Here’s how we implement the evolution process:

    # Function to evolve nanny characteristics over multiple generations
    def evolve_nanny_characteristics(all_childcare_weeks, population_size=1000, num_generations=10):
        population = [generate_nanny_characteristics() for _ in range(population_size)]  # Initialize the population

        for generation in range(num_generations):
            print(f"\n--- Generation {generation + 1} ---")

            fitness_scores = []
            hours_worked_collection = []

            for characteristics in population:
                # Coverage-based fitness function (an illustrative sketch is shown earlier)
                fitness_score, yearly_hours_worked = fitness_function_yearly(characteristics, all_childcare_weeks)
                fitness_scores.append(fitness_score)
                hours_worked_collection.append(yearly_hours_worked)

            fitness_scores = np.array(fitness_scores)

            # Find and store the best individual of this generation
            max_fitness_idx = np.argmax(fitness_scores)
            best_nanny = population[max_fitness_idx]
            best_nanny['actual_hours_worked'] = hours_worked_collection[max_fitness_idx]

            # Select parents and generate a new population
            parents = selection(population, fitness_scores, num_parents=population_size // 2)
            new_population = []
            for i in range(0, len(parents), 2):
                parent_1, parent_2 = parents[i], parents[i + 1]
                child = {
                    'flexible': np.random.choice([parent_1['flexible'], parent_2['flexible']]),
                    'days_per_week': np.random.choice([parent_1['days_per_week'], parent_2['days_per_week']]),
                    'hours_per_day': np.random.choice([parent_1['hours_per_day'], parent_2['hours_per_day']])
                }
                child = mutate_characteristics(child)
                new_population.append(child)

            population = new_population  # Replace the population with the new generation

        return best_nanny  # Return the best nanny after all generations

    Here, the algorithm evolves over multiple generations, selecting the best nanny schedules based on their fitness scores and allowing new solutions to emerge through mutation. After several generations, the algorithm converges on the best possible nanny schedule, optimizing coverage for our family.

    Final Thoughts

    With this approach, we applied genetic algorithms to iteratively improve nanny schedules, ensuring that the selected schedule could handle the chaos of Parent 2’s unpredictable work shifts while balancing our family’s needs. Genetic algorithms may have been overkill for the task, but they allowed us to explore various possibilities and optimize the solution over time.

    The images below describe the evolution of nanny fitness scores over time. The algorithm was able to quickly converge on the best nanny chromosome after just a few generations.

    Image Special from Author
    Image Special from Author

    From Chaos to Clarity: Visualizing the Solution

    After the algorithm had done its work and optimized the nanny characteristics we were looking for, the next step was making sense of the results. This is where visualization came into play, and I have to say, it was a game-changer. Before we had charts and graphs, our schedule felt like a tangled web of conflicting commitments, unpredictable shifts, and last-minute changes. But once we turned the data into something visual, everything started to fall into place.

    The Heatmap: Coverage at a Glance

    The heatmap provided a beautiful splash of color that turned the abstract into something tangible. The darker the color, the more nanny coverage there was; the lighter the color, the less. This made it easy to spot any potential issues at a glance. Need more coverage on Friday? Check the heatmap. Will the nanny be working too many hours on Wednesday? (Yes, that’s very likely.) The heatmap will let you know. It gave us instant clarity, helping us tweak the schedule where needed and giving us peace of mind when everything lined up perfectly.

    Image Special from Author

    By visualizing the results, we didn’t just solve the scheduling puzzle — we made it easy to understand and follow. Instead of scrambling to figure out what kind of nanny we needed, we could just look at the visuals and see what they needed to cover. From chaos to clarity, these visual tools turned data into insight and helped us shop for nannies with ease.

    The Impact: A Household in Harmony

    Before I applied my data science toolkit to our family’s scheduling problem, it felt a little overwhelming. We started interviewing nannies without really understanding what we were looking for, or needed, to keep our house in order.

    But after optimizing the nanny schedule with Monte Carlo simulations and genetic algorithms, the difference was night and day. Where there was once chaos, now there’s understanding. Suddenly, we had a clear plan, a map of who was where and when, and most importantly, a roadmap for the kind of nanny to find.

    The biggest change wasn’t just in the schedule itself, though — it was in how we felt. There’s a certain peace of mind that comes with knowing you have a plan that works, one that can flex and adapt when the unexpected happens. And for me personally, this project was more than just another application of data science. It was a chance to take the skills I use every day in my professional life and apply them to something that directly impacts my family.

    The Power of Data Science at Home

    We tend to think of data science as something reserved for the workplace, something that helps businesses optimize processes or make smarter decisions. But as I learned with our nanny scheduling project, the power of data science doesn’t have to stop at the office door. It’s a toolkit that can solve everyday challenges, streamline chaotic situations, and, yes, even bring a little more calm to family life.

    Photo by Kenny Eliason on Unsplash

    Maybe your “nanny puzzle” isn’t about childcare. Maybe it’s finding the most efficient grocery list, managing home finances, or planning your family’s vacation itinerary. Whatever the case may be, the tools we use at work — Monte Carlo simulations, genetic algorithms, and data-driven optimization — can work wonders at home too. You don’t need a complex problem to start, just a curiosity to see how data can help untangle even the most mundane challenges.

    So here’s my challenge to you: Take a look around your life and find one area where data could make a difference. Maybe you’ll stumble upon a way to save time, money, or even just a little peace of mind. It might start with something as simple as a spreadsheet, but who knows where it could lead? Maybe you’ll end up building your own “Nanny Olympics” or solving a scheduling nightmare of your own.

    And as we move forward, I think we’ll see data science becoming a more integral part of our personal lives — not just as something we use for work, but as a tool to manage our day-to-day challenges. In the end, it’s all about using the power of data to make our lives a little easier.

    The code and data for the Nanny Scheduling problem can be found on Github: https://github.com/agentdanger/nanny-simulation

    Professional information about me can be found on my website: https://courtneyperigo.com


    Data Science at Home: Solving the Nanny Schedule Puzzle with Monte Carlo and Genetic Algorithms was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • GenAI with Python: Coding Agents

    Mauro Di Pietro

    Build a Data Scientist AI that can query db with SQL, analyze data with Python, write reports with HTML, and do Machine Learning (No GPU…


  • Introducing Semantic Tag Filtering: Enhancing Retrieval with Tag Similarity

    Introducing Semantic Tag Filtering: Enhancing Retrieval with Tag Similarity

    Michelangiolo Mazzeschi

    Semantic Tag Filtering

    How to use Semantic Similarity to improve tag filtering

    ***To understand this article, knowledge of both Jaccard similarity and vector search is required. The implementation of this algorithm has been released on GitHub and is fully open-source.

    Over the years, we have discovered how to retrieve information from different modalities, such as numbers, raw text, images, and also tags.
    With the growing popularity of customized UIs, tag search systems have become a convenient way of easily filtering information with a good degree of accuracy. Some cases where tag search is commonly employed are the retrieval of social media posts, articles, games, movies, and even resumes.

    However, traditional tag search lacks flexibility. If we filter for samples that contain exactly the given tags, especially in databases containing only a few thousand samples, there may be no matching samples (or only a few) for our query.

    difference of the two searches in front of a scarcity of results, Image by author

    ***In the following article I introduce several new algorithms that, to the best of my knowledge, have not been published elsewhere. I am open to criticism and welcome any feedback.

    How does traditional tag search work?

    Traditional systems employ Jaccard similarity (commonly computed through the MinHash algorithm), which measures the similarity between two sets of elements (in our case, those elements are tags). As previously clarified, the search is not flexible at all (sets either contain or do not contain the queried tags).
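
    For reference, Jaccard similarity between two tag sets is simply the size of their intersection over the size of their union; a minimal sketch with made-up tags:

    # Jaccard similarity between two tag sets (illustrative tags).
    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard({'Indie', '2D', 'Fantasy'}, {'Indie', 'RPG', 'Fantasy'}))  # 0.5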

    example of a simple AND bitwise operation (this is not Jaccard similarity, but can give you an approximate idea of the filtering method), Image by author

    Can we do better?

    What if, instead, rather than just filtering a sample from matching tags, we could take into account all the other labels in the sample that are not identical, but are similar to our chosen tags? We could be making the algorithm more flexible, expanding the results to non-perfect matches, but still good matches. We would be applying semantic similarity directly to tags, rather than text.

    Introducing Semantic Tag Search

    As briefly explained, this new approach attempts to combine the capabilities of semantic search with tag filtering systems. For this algorithm to be built, we need only one thing:

    • A database of tagged samples

    The reference data I will be using is the open-source collection of the Steam game library (downloadable from Kaggle — MIT License) — approx. 40,000 samples, which is a good amount of samples to test our algorithm. As we can see from the displayed dataframe, each game has several assigned tags, with over 400 unique tags in our database.

    Screenshot of the Steam dataframe available in the example notebook, Image by author

    Now that we have our starting data, we can proceed: the algorithm will be articulated in the following steps:

    1. Extracting tag relationships
    2. Encoding queries and samples
    3. Performing the semantic tag search using vector retrieval
    4. Validation

    In this article, I will only explore the math behind this new approach (for an in-depth explanation of the code with a working demo, please refer to the following notebook: instructions on how to use simtag are available in the README.md file on root).

    1. Extracting tag relationships

    The first question that comes to mind is how can we find the relationships between our tags. Note that there are several algorithms used to obtain the same result:

    • Using statistical methods
      The simplest employable method we can use to extract tag relationships is called co-occurrence matrix, which is the format that (for both its effectiveness and simplicity) I will employ in this article.
    • Using Deep Learning
      The most advanced ones are all based on Embeddings neural networks (such as Word2Vec in the past, now it is common to use transformers, such as LLMs) that can extract the semantic relationships between samples. Creating a neural network to extract tag relationships (in the form of an autoencoder) is a possibility, and it is usually advisable when facing certain circumstances.
    • Using a pre-trained model
      Because tags are defined using human language, it is possible to employ existing pre-trained models to compute already existing similarities. This will likely be much faster and less troubling. However, each dataset has its uniqueness. Using a pre-trained model will ignore the customer behavior.
      Ex. We will later see how 2D has a strong relationship with Fantasy: such a pair will never be discovered using pre-trained models.

    The choice of algorithm may depend on many factors, especially when we have to work with a huge data pool or have scalability concerns (e.g., the number of tags will equal our vector length: if we have too many tags, we need to use machine learning to keep this manageable).

    a. Build co-occurence matrix using Michelangiolo similarity

    As mentioned, I will be using the co-occurrence matrix as a means to extract these relationships. My goal is to find the relationship between every pair of tags, and I will be doing so by applying the following count across the entire collection of samples using IoU (Intersection over Union) over the set of all samples (S):

    formula to compute the similarity between a pair of tags, Image by author

    This algorithm is very similar to Jaccard similarity. While Jaccard similarity operates on samples, the one I introduce operates on elements; since (to the best of my knowledge) this specific application has not been codified yet, we can name it Michelangiolo similarity. (To be fair, the use of this algorithm has previously been mentioned in a StackOverflow question, yet never codified.)
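
    The reference implementation lives in the simtag repository; the following is only a minimal sketch of the IoU count described above, treating each tag as the set of samples that contain it (function and variable names are mine, not the library’s):

    # Build a tag-to-tag relationship matrix via IoU over sample sets.
    import numpy as np

    def tag_relationship_matrix(sample_tags, unique_tags):
        # sample_tags: one set of tags per sample; unique_tags: list of all tags
        tag_to_samples = {t: set() for t in unique_tags}
        for idx, tags in enumerate(sample_tags):
            for t in tags:
                if t in tag_to_samples:
                    tag_to_samples[t].add(idx)
        n = len(unique_tags)
        M = np.zeros((n, n))
        for i, ti in enumerate(unique_tags):
            for j, tj in enumerate(unique_tags):
                union = tag_to_samples[ti] | tag_to_samples[tj]
                M[i, j] = len(tag_to_samples[ti] & tag_to_samples[tj]) / len(union) if union else 0.0
        return M  # diagonal is 1; off-diagonal entries capture tag co-occurrence strength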

    difference between Jaccard similarity and Michelangiolo similarity, Image by author

    For 40,000 samples, it takes about an hour to extract the similarity matrix; this is the result:

    co-occurrence matrix of all unique tags in our sample list S, Image by author

    Let us make a manual check of the top 10 samples for some very common tags, to see if the result makes sense:

    sample relationships extracted from the co-occurrence matrix, Image by author

    The result looks very promising! We started from plain categorical data (only convertible to 0 and 1), but we have extracted the semantic relationship between tags (without even using a neural network).

    b. Use a pre-trained neural network

    Equally, we can extract existing relationships between our samples using a pre-trained encoder. This solution, however, ignores the relationships that can only be extracted from our data, only focusing on existing semantic relationships of the human language. This may not be a well-suited solution to work on top of retail-based data.

    On the other hand, by using a neural network we would not need to build a relationship matrix: hence, this is a proper solution for scalability. For example, if we had to analyze a large batch of Twitter data, we might reach 53,300 tags. Computing a co-occurrence matrix from this number of tags would result in a sparse matrix with 2,500,000,000 entries (quite an impractical feat). Instead, by using a standard encoder that outputs a vector of length 384, the resulting matrix would have a total size of 19,200,200.

    snapshot of an encoded set of tags using a pre-trained encoder

    2. Encoding queries and samples

    Our goal is to build a search engine capable of supporting the semantic tag search: with the format we have been building, the only technology capable of supporting such an enterprise is vector search. Hence, we need to find a proper encoding algorithm to convert both our samples and queries into vectors.

    In most encoding algorithms, we encode both queries and samples using the same algorithm. However, each sample contains more than one tag, each represented by a different set of relationships that we need to capture in a single vector.

    Covariate Encoding, Image by author

    In addition, we need to address the aforementioned problem of scalability, and we will do so by using a PCA module (when we use a co-occurrence matrix, instead, we can skip the PCA because there is no need to compress our vectors).

    When the number of tags becomes too large, we need to abandon the possibility of computing a co-occurrence matrix, because it scales at a squared rate. Therefore, we can extract the vector of each existing tag using a pre-trained neural network (the first step in the PCA module). For example, all-MiniLM-L6-v2 converts each tag into a vector of length 384.

    We can then transpose the obtained matrix, and compress it: we will initially encode our queries/samples using 1 and 0 for the available tag indexes, resulting in an initial vector of the same length as our initial matrix (53,300). At this point, we can use our pre-computed PCA instance to compress the same sparse vector in 384 dims.

    Encoding samples

    In the case of our samples, the process ends just right after the PCA compression (when activated).

    Encoding queries: Covariate Encoding

    Our query, however, needs to be encoded differently: we need to take into account the relationships associated with each existing tag. This process is executed by first summing our compressed vector to the compressed matrix (the total of all existing relationships). Now that we have obtained a matrix (384×384), we will need to average it, obtaining our query vector.

    Because we will make use of Euclidean search, it will first prioritize the search for features with the highest score (ideally, the one we activated using the number 1), but it will also consider the additional minor scores.

    Weighted search

    Because we are averaging vectors together, we can even apply a weight to this calculation, and the vectors will be impacted differently from the query tags.
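
    For the co-occurrence-matrix case (no PCA), one simple way to realize this idea is sketched below; it is a simplification of the approach described above, not the exact simtag implementation. Samples become binary tag indicators, and the query becomes an (optionally weighted) average of the relationship-matrix rows of its tags:

    # Simplified encoding sketch: M is the tag relationship matrix from earlier.
    import numpy as np

    def encode_sample(tags, unique_tags):
        # Binary indicator vector: 1 where the sample has the tag, 0 elsewhere.
        return np.array([1.0 if t in tags else 0.0 for t in unique_tags])

    def encode_query(query_tags, unique_tags, M, weights=None):
        # Average (optionally weighted) of the relationship-matrix rows of the
        # query tags: ~1 at the chosen tags, smaller scores at related tags.
        idx = [unique_tags.index(t) for t in query_tags]
        w = np.ones(len(idx)) if weights is None else np.asarray(weights, dtype=float)
        return np.average(M[idx], axis=0, weights=w)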

    3. Perform the semantic tag search using vector retrieval

    The question you might be asking is: why did we undergo this complex encoding process, rather than just inputting the pair of tags into a function and obtaining a score — f(query, sample)?

    If you are familiar with vector-based search engines, you will already know the answer. If we compute scores pair by pair, even with just 40,000 samples the computing power required is huge (a single query can take up to 10 seconds): it is not a scalable practice. However, if we choose to perform a vector retrieval over the 40,000 samples, the search will finish in 0.1 seconds: it is a highly scalable practice, which in our case is perfect.
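
    As an illustration of the retrieval step (array names and sizes below are placeholders), any vector index would do; scikit-learn’s NearestNeighbors with a Euclidean metric is enough for a sketch:

    # Euclidean nearest-neighbour retrieval over the encoded samples.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    sample_vectors = np.random.rand(40_000, 384)  # placeholder for encoded samples
    query_vector = np.random.rand(1, 384)         # placeholder for the encoded query

    index = NearestNeighbors(n_neighbors=5, metric='euclidean').fit(sample_vectors)
    distances, indices = index.kneighbors(query_vector)
    print(indices[0])  # positions of the top-5 matching samples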

    4. Validate

    For an algorithm to be effective, it needs to be validated. For now, we lack a proper mathematical validation (at first sight, averaging similarity scores from M already shows very promising results, but further research is needed for an objective metric backed up by proof).

    However, existing results are quite intuitive when visualized using a comparative example. The following is the top search result (what you are seeing are the tags assigned to this game) of both search methods.

    comparison between traditional tag search and semantic tag search, Image by author
    • Traditional tag search
      We can see how traditional search (without additional rules, samples are filtered based on the presence of all queried tags and are not sorted) might return a sample with a higher number of tags, but many of them may not be relevant.
    • Semantic tag search
      Semantic tag search sorts all samples based on the relevance of all tags; in simple terms, it disqualifies samples containing irrelevant tags.

    The real advantage of this new system is that when traditional search does not return enough samples, we can select as many as we want using semantic tag search.

    difference of the two searches in front of a scarcity of results, Image by author

    In the example above, using traditional tag filtering does not return any game from the Steam library. However, by using semantic tag filtering we still get results that are not perfect, but the best ones matching our query. The ones you are seeing are the tags of the top 5 games matching our search.

    Conclusion

    Until now, it was not possible to filter tags while also taking their semantic relationships into account without resorting to complex methods such as clustering, deep learning, or multiple knn searches.

    The degree of flexibility offered by this algorithm should allow us to move away from traditional manual labeling methods, which force the user to choose from a pre-defined set of tags, and open up the possibility of using LLMs or VLMs to freely assign tags to a text or an image without being confined to a pre-existing structure, creating new options for scalable and improved search methods.

    It is with my best wishes that I open this algorithm to the world, and I hope it will be utilized to its full potential.


    Introducing Semantic Tag Filtering: Enhancing Retrieval with Tag Similarity was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • 5 Pillars for a Hyper-Optimized AI Workflow

    5 Pillars for a Hyper-Optimized AI Workflow

    Gilad Rubin

    An introduction to a methodology for creating production-ready, extensible & highly optimized AI workflows

    Credit: Google Gemini, prompt by the Author

    Intro

    In the last decade, I carried with me a deep question in the back of my mind in every project I’ve worked on:

    How (the hell) am I supposed to structure and develop my AI & ML projects?

    I wanted to know — is there an elegant way to build production-ready code in an iterative way? A codebase that is extensible, optimized, maintainable & reproducible?

    And if so — where does this secret lie? Who owns the knowledge to this dark art?

    I searched intensively for an answer over the course of many years — reading articles, watching tutorials and trying out different methodologies and frameworks. But I couldn’t find a satisfying answer. Every time I thought I was getting close to a solution, something was still missing.

    After about 10 years of trial and error, with a focused effort in the last two years, I think I’ve finally found a satisfying answer to my long-standing quest. This post is the beginning of my journey of sharing what I’ve found.

    My research has led me to identify 5 key pillars that form the foundation of what I call a hyper-optimized AI workflow. In this post I will briefly introduce each of them, giving you an overview of what’s to come.

    I want to emphasize that each of the pillars that I will present is grounded in practical methods and tools, which I’ll elaborate on in future posts. If you’re already curious to see them in action, feel free to check out this video from Hamilton’s meetup where I present them live:

    Note: Throughout this post and series, I’ll use the terms Artificial Intelligence (AI), Machine Learning (ML), and Data Science (DS) interchangeably. The concepts we’ll discuss apply equally to all these fields.

    Now, let’s explore each pillar.

    1 — Metric-Based Optimization

    In every AI project there is a certain goal we want to achieve, and ideally — a set of metrics we want to optimize.

    These metrics can include:

    • Predictive quality metrics: Accuracy, F1-Score, Recall, Precision, etc…
    • Cost metrics: Actual $ amount, FLOPS, Size in MB, etc…
    • Performance metrics: Training speed, inference speed, etc…

    We can choose one metric as our “north star” or create an aggregate metric. For example:

    • 0.7 × F1-Score + 0.3 × (1 / Inference Time in ms)
    • 0.6 × AUC-ROC + 0.2 × (1 / Training Time in hours) + 0.2 × (1 / Cloud Compute Cost in $)

    There’s a wonderful short video by Andrew Ng where he explains the topic of a Single Number Evaluation Metric.

    Once we have an agreed-upon metric to optimize and a set of constraints to meet, our goal is to build a workflow that maximizes this metric while satisfying our constraints.
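
    To make the first aggregate metric above concrete, here is a tiny sketch (the weights come from the formula above; the example values are made up):

    # Turn the first aggregate formula into a single objective to maximize.
    def aggregate_score(f1: float, inference_time_ms: float) -> float:
        # 0.7 x F1-Score + 0.3 x (1 / Inference Time in ms)
        return 0.7 * f1 + 0.3 * (1.0 / inference_time_ms)

    print(aggregate_score(f1=0.91, inference_time_ms=25.0))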

    2 — Interactive Developer Experience

    In the world of Data Science and AI development — interactivity is key.

    As AI Engineers (or whatever title we Data Scientists go by these days), we need to build code that works bug-free across different scenarios.

    Unlike traditional software engineering, our role extends beyond writing code that “just” works. A significant aspect of our work involves examining the data and inspecting our models’ outputs and the results of various processing steps.

    The most common environment for this kind of interactive exploration is Jupyter Notebooks.

    Working within a notebook allows us to test different implementations, experiment with new APIs and inspect the intermediate results of our workflows and make decisions based on our observations. This is the core of the second pillar.

    However, as much as we enjoy these benefits in our day-to-day work, notebooks can sometimes contain notoriously bad code that can only be executed in a non-trivial order.

    In addition, some exploratory parts of the notebook might not be relevant for production settings, making it unclear how these can effectively be shipped to production.

    3 — Production-Ready Code

    “Production-Ready” can mean different things in different contexts. For one organization, it might mean serving results within a specified time frame. For another, it could refer to the service’s uptime (SLA). And yet for another, it might mean the code, model, or workflow has undergone sufficient testing to ensure reliability.

    These are all important aspects of shipping reliable products, and the specific requirements may vary from place to place. Since my exploration is focused on the “meta” aspect of building AI workflows, I want to discuss a common denominator across these definitions: wrapping our workflow as a serviceable API and deploying it to an environment where it can be queried by external applications or users.

    This means we need to have a way to abstract the complexity of our codebase into a clearly defined interface that can be used across various use-cases. Let’s consider an example:

    Imagine a complex RAG (Retrieval-Augmented Generation) system over PDF files that we’ve developed. It may contain 10 different parts, each consisting of hundreds of lines of code.

    However, we can still wrap them into a simple API with just two main functions:

    1. upload_document(file: PDF) -> document_id: str
    2. query_document(document_id: str, query: str, output_format: str) -> response: str

    This abstraction allows users to:

    1. Upload a PDF document and receive a unique identifier.
    2. Ask questions about the document using natural language.
    3. Specify the desired format for the response (e.g., markdown, JSON, Pandas Dataframe).

    By providing this clean interface, we’ve effectively hidden the complexities and implementation details of our workflow.
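
    A minimal sketch of such an interface (the signatures mirror the two functions above; the internals are placeholders, not an actual RAG implementation):

    # Wrapping a complex workflow behind a two-function interface.
    import uuid

    class PDFRAGService:
        def __init__(self):
            self._documents = {}  # document_id -> parsed/indexed document

        def upload_document(self, file_path: str) -> str:
            document_id = str(uuid.uuid4())
            # ... parse, chunk, embed and index the PDF here ...
            self._documents[document_id] = file_path
            return document_id

        def query_document(self, document_id: str, query: str, output_format: str = "markdown") -> str:
            # ... retrieve relevant chunks and call an LLM here ...
            return f"[{output_format}] answer to: {query}"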

    Having a systematic way to convert arbitrarily complex workflows into deployable APIs is our third pillar.

    In addition, we would ideally want to establish a methodology that ensures that our iterative, daily work stays in sync with our production code.

    This means if we make a change to our workflow — fixing a bug, adding a new implementation, or even tweaking a configuration — we should be able to deploy these changes to our production environment with just a click of a button.

    4 — Modular & Extensible Code

    Another crucial aspect of our methodology is maintaining a Modular & Extensible codebase.

    This means that we can add new implementations and test them against existing ones that occupy the same logical step without modifying our existing code or overwriting other configurations.

    This approach aligns with the open-closed principle, where our code is open for extension but closed for modification. It allows us to:

    1. Introduce new implementations alongside existing ones
    2. Easily compare the performance of different approaches
    3. Maintain the integrity of our current working solutions
    4. Extend our workflow’s capabilities without risking the stability of the whole system

    Let’s look at a toy example:

    Image by the Author
    Image by the Author

    In this example, we can see a (pseudo) code that is modular and configurable. In this way, we can easily add new configurations and test their performance:

    Image by the Author

    Once our code consists of multiple competing implementations & configurations, we enter a state that I like to call a “superposition of workflows”. In this state we can instantiate and execute a workflow using a specific set of configurations.
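
    Since the original figures show pseudo-code, here is a minimal sketch of the idea with hypothetical names: new implementations are registered alongside existing ones instead of modifying them.

    # Open-closed style configuration: register implementations, pick one via config.
    RETRIEVERS = {}

    def register_retriever(name):
        def decorator(cls):
            RETRIEVERS[name] = cls
            return cls
        return decorator

    @register_retriever("bm25")
    class BM25Retriever:
        def retrieve(self, query: str) -> list:
            return ["..."]  # placeholder results

    @register_retriever("dense")
    class DenseRetriever:  # added later, without touching BM25Retriever
        def retrieve(self, query: str) -> list:
            return ["..."]  # placeholder results

    def build_workflow(config: dict):
        # Instantiate one concrete configuration out of the "superposition of workflows".
        return RETRIEVERS[config["retriever"]]()

    workflow = build_workflow({"retriever": "dense"})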

    5 — Hierarchical & Visual Structures

    What if we take modularity and extensibility a step further? What if we apply this approach to entire sections of our workflow?

    So now, instead of configuring this LLM or that retriever, we can configure our whole preprocessing, training, or evaluation steps.

    Let’s look at an example:

    Image by the Author

    Here we see our entire ML workflow. Now, let’s add a new Data Prep implementation and zoom into it:

    Image by the Author

    When we work in this hierarchical and visual way, we can select a section of our workflow to improve and add a new implementation with the same input/output interface as the existing one.

    We can then “zoom in” to that specific section, focusing solely on it without worrying about the rest of the project. Once we’re satisfied with our implementation — we can start testing it out alongside other various configurations in our workflow.

    This approach unlocks several benefits:

    1. Reduced mental overload: Focus on one section at a time, providing clarity and reducing complexity in decision-making.
    2. Easier collaboration: A modular structure simplifies task delegation to teammates or AI assistants, with clear interfaces for each component.
    3. Reusability: These encapsulated implementations can be utilized in different projects, potentially without modification to their source code.
    4. Self-documentation: Visualizing entire workflows and their components makes it easier to understand the project’s structure and logic without diving into unnecessary details.

    Summary

    These are the 5 pillars that I’ve found to hold the foundation to a “hyper-optimized AI workflow”:

    1. Metric-Based Optimization: Define and optimize clear, project-specific metrics to guide decision-making and workflow improvements.
    2. Interactive Developer Experience: Utilize tools for iterative coding & data inspection like Jupyter Notebooks.
    3. Production-Ready Code: Wrap complete workflows into deployable APIs and sync development and production code.
    4. Modular & Extensible Code: Structure code to easily add, swap, and test different implementations.
    5. Hierarchical & Visual Structures: Organize projects into visual, hierarchical components that can be independently developed and easily understood at various levels of abstraction.

    In the upcoming blog posts, I’ll dive deeper into each of these pillars, providing more detailed insights, practical examples, and tools to help you implement these concepts in your own AI projects.

    Specifically, I intend to introduce the methodology and tools I’ve built on top of DAGWorks Inc.’s* Hamilton framework and my own packages: Hypster and HyperNodes (still in their early days).

    Stay tuned for more!

    *I am not affiliated with or employed by DAGWorks Inc.


    5 Pillars for a Hyper-Optimized AI Workflow was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Does Semi-Supervised Learning Help to Train Better Models?

    Does Semi-Supervised Learning Help to Train Better Models?

    Reinhard Sellmair

    Evaluating how semi-supervised learning can leverage unlabeled data

    Image by the author — created with Image Creator in Bing

    One of the most common challenges Data Scientists face is the lack of enough labelled data to train a reliable and accurate model. Labelled data is essential for supervised learning tasks, such as classification or regression. However, obtaining labelled data can be costly, time-consuming, or impractical in many domains. On the other hand, unlabeled data is usually easy to collect, but it does not provide any direct input to train a model.

    How can we make use of unlabeled data to improve our supervised learning models? This is where semi-supervised learning comes into play. Semi-supervised learning is a branch of machine learning that combines labelled and unlabeled data to train a model that can perform better than using labelled data alone. The intuition behind semi-supervised learning is that unlabeled data can provide useful information about the underlying structure, distribution, and diversity of the data, which can help the model generalize better to new and unseen examples.

    In this post, I present three semi-supervised learning methods that can be applied to different types of data and tasks. I will also evaluate their performance on a real-world dataset and compare them with the baseline of using only labelled data.

    What is semi-supervised learning?

    Semi-supervised learning is a type of machine learning that uses both labelled and unlabeled data to train a model. Labelled data are examples that have a known output or target variable, such as the class label in a classification task or the numerical value in a regression task. Unlabeled data are examples that do not have a known output or target variable. Semi-supervised learning can leverage the large amount of unlabeled data that is often available in real-world problems, while also making use of the smaller amount of labelled data that is usually more expensive or time-consuming to obtain.

    The underlying idea of using unlabeled data to train a supervised learning method is to label this data via supervised or unsupervised learning methods. Although these labels are most likely not as accurate as actual labels, having a significant amount of this data can improve the performance of a supervised learning method compared to training it on labelled data only.

    The scikit-learn package provides three semi-supervised learning methods (a brief usage sketch follows the list below):

    • Self-training: a classifier is first trained on the labelled data only to predict labels for the unlabeled data. In the next iteration, another classifier is trained on the labelled data plus the high-confidence predictions from the unlabeled data. This procedure is repeated until no new labels with high confidence are predicted or a maximum number of iterations is reached.
    • Label-propagation: a graph is created where nodes represent data points and edges represent similarities between them. Labels are iteratively propagated through the graph, allowing the algorithm to assign labels to unlabeled data points based on their connections to labelled data.
    • Label-spreading: uses the same concept as label-propagation. The difference is that label spreading uses a soft assignment, where the labels are updated iteratively based on the similarity between data points. This method may also “overwrite” labels of the labelled dataset.
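
    A minimal usage sketch of the three methods above, using XGBoost as the base classifier as in this article (the toy dataset and parameter values are illustrative, not the article’s setup):

    # scikit-learn semi-supervised methods; unlabeled samples carry the label -1.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import SelfTrainingClassifier, LabelPropagation, LabelSpreading
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=2000, n_features=8, random_state=42)
    y_semi = y.copy()
    y_semi[200:] = -1  # keep only the first 200 labels

    models = {
        'self-training': SelfTrainingClassifier(XGBClassifier(), threshold=0.9),
        'label propagation': LabelPropagation(kernel='knn', n_neighbors=7),
        'label spreading': LabelSpreading(kernel='rbf', gamma=20, alpha=0.2),
    }
    for name, model in models.items():
        model.fit(X, y_semi)
        print(name, round(model.score(X[200:], y[200:]), 3))  # accuracy on the unlabeled part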

    To evaluate these methods I used a diabetes prediction dataset which contains features of patient data like age and BMI together with a label describing if the patient has diabetes. This dataset contains 100,000 records which I randomly divided into 80,000 training, 10,000 validation and 10,000 test data. To analyze how effective the learning methods are with respect to the amount of labelled data, I split the training data into a labelled and an unlabeled set, where the label size describes how many samples are labelled.

    Partition of dataset (image by the author)

    I used the validation data to assess different parameter settings and used the test data to evaluate the performance of each method after parameter tuning.

    I used XGBoost for prediction and the F1 score to evaluate prediction performance.

    Baseline

    The baseline was used to compare the semi-supervised algorithms against the case of not using any unlabeled data. Therefore, I trained XGBoost on labelled data sets of different sizes and calculated the F1 score on the validation data set:

    Baseline score (image by the author)

    The results showed that the F1 score is quite low for training sets of less than 100 samples, then steadily improves to a score of 79% until a sample size of 1,000 is reached. Higher sample sizes hardly improved the F1 score.

    Self-learning

    Self-training uses multiple iterations to predict labels for unlabeled data, which are then used in the next iteration to train another model. Two methods can be used to select predictions to be used as labelled data in the next iteration:

    1. Threshold (default): all predictions with a confidence above a threshold are selected
    2. K best: the predictions of the k highest confidence are selected
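
    As a quick illustration (parameter values are arbitrary), the two criteria map onto scikit-learn’s SelfTrainingClassifier as follows:

    from sklearn.semi_supervised import SelfTrainingClassifier
    from xgboost import XGBClassifier

    # 'threshold' keeps predictions whose confidence exceeds the threshold,
    # 'k_best' keeps the k most confident predictions per iteration.
    st_threshold = SelfTrainingClassifier(XGBClassifier(), criterion='threshold', threshold=0.95)
    st_k_best = SelfTrainingClassifier(XGBClassifier(), criterion='k_best', k_best=50)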

    I evaluated the default parameters (ST Default) and tuned the threshold (ST Thres Tuned) and the k best (ST KB Tuned) parameters based on the validation dataset. The prediction results of these models were evaluated on the test dataset:

    Self-learning score (image by the author)

    For small sample sizes (<100) the default parameters (red line) performed worse than the baseline (blue line). For higher sample sizes slightly better F1 scores than the baseline were achieved. Tuning the threshold (green line) brought a significant improvement, for example at a label size of 200 the baseline F1 score was 57% while the algorithm with tuned thresholds achieved 70%. With one exception at a label size of 30, tuning the K best value (purple line) resulted in almost the same performance as the baseline.

    Label Propagation

    Label propagation has two built-in kernel methods: RBF and KNN. The RBF kernel produces a fully connected graph using a dense matrix, which is memory intensive and time consuming for large datasets. To consider memory constraints, I only used a maximum training size of 3,000 for the RBF kernel. The KNN kernel uses a more memory friendly sparse matrix, which allowed me to fit on the whole training data of up to 80,000 samples. The results of these two kernel methods are compared in the following graph:

    Label propagation score (image by the author)

    The graph shows the F1 score on the test dataset of different label propagation methods as a function of the label size. The blue line represents the baseline, which is the same as for self-training. The red line represents the label propagation with default parameters, which clearly underperforms the baseline for all label sizes. The green line represents the label propagation with RBF kernel and tuned parameter gamma. Gamma defines how far the influence of a single training example reaches. The tuned RBF kernel performed better than the baseline for small label sizes (<=100) but worse for larger label sizes. The purple line represents the label propagation with KNN kernel and tuned parameter k, which determines the number of nearest neighbors to use. The KNN kernel had a similar performance as the RBF kernel.

    Label Spreading

    Label spreading is a similar approach to label propagation, but with an additional parameter alpha that controls how much an instance should adopt the information of its neighbors. Alpha can range from 0 to 1, where 0 means that the instance keeps its original label and 1 means that it completely adopts the labels of its neighbors. I also tuned the RBF and KNN kernel methods for label spreading. The results of label spreading are shown in the next graph:

    Label spreading score (image by the author)

    The results of label spreading were very similar to those of label propagation, with one notable exception. The RBF kernel method for label spreading has a lower test score than the baseline for all label sizes, not only for small ones. This suggests that the “overwriting” of labels by the neighbors’ labels has a rather negative effect for this dataset, which might have only few outliers or noisy labels. On the other hand, the KNN kernel method is not affected by the alpha parameter. It seems that this parameter is only relevant for the RBF kernel method.

    Comparison of all methods

    Next, I compared all methods with their best parameters against each other.

    Comparison of best scores (image by the author)

    The graph shows the test score of different semi-supervised learning methods as a function of the label size. Self-training outperforms the baseline, as it leverages the unlabeled data well. Label propagation and label spreading only beat the baseline for small label sizes and perform worse for larger label sizes.

    Conclusion

    The results may significantly vary for different datasets, classifier methods, and metrics. The performance of semi-supervised learning depends on many factors, such as the quality and quantity of the unlabeled data, the choice of the base learner, and the evaluation criterion. Therefore, one should not generalize these findings to other settings without proper testing and validation.

    If you are interested in exploring more about semi-supervised learning, you are welcome to check out my git repo and experiment on your own. You can find the code and data for this project here.

    One thing that I learned from this project is that parameter tuning was important to significantly improve the performance of these methods. With optimized parameters, self-training performed better than the baseline for any label size and improved the F1 score by up to 13 percentage points! Label propagation and label spreading only turned out to improve performance for very small sample sizes, but the user must be very careful not to get worse results compared to not using any semi-supervised learning method.


    Does Semi-Supervised Learning Help to Train Better Models? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
