Leverage multi-agentic workflows for code testing and debugging
It’s April 2024, and it’s been about 17 months since we started using LLMs like ChatGPT to aid us in code generation and debugging tasks. While they have added a great deal of productivity, there are times when the generated code is full of bugs and sends us down the good ole StackOverflow route.
In this article, I’ll give a quick demonstration of how we can address this lack of “verification” using the Conversable Agents offered by AutoGen.
What is AutoGen?
“AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.”
Presenting LeetCode Problem Solver:
Start by quietly installing autogen:
!pip install pyautogen -q --progress-bar off
I’m using Google Colab, so I entered my OPENAI_API_KEY in the Secrets tab and securely loaded it along with other modules:
import os
import csv
import autogen
from autogen import Cache
from google.colab import userdata
userdata.get('OPENAI_API_KEY')
I’m using gpt-3.5-turbo only because it’s cheaper than GPT-4. If you can afford more expensive experimentation and/or you’re doing things more “seriously”, you should obviously use a stronger model.
llm_config = {
"config_list": [{"model": "gpt-3.5-turbo", "api_key": userdata.get('OPENAI_API_KEY')}],
"cache_seed": 0, # seed for reproducibility
"temperature": 0, # temperature to control randomness
}
Now, I’ll copy the problem statement from my favourite LeetCode problem, Two Sum. It’s one of the most commonly asked questions in LeetCode-style interviews and covers basic concepts like caching with hashmaps and simple equation manipulation.
LEETCODE_QUESTION = """
Title: Two Sum
Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order.
Example 1:
Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
Example 2:
Input: nums = [3,2,4], target = 6
Output: [1,2]
Example 3:
Input: nums = [3,3], target = 6
Output: [0,1]
Constraints:
2 <= nums.length <= 10^4
-10^9 <= nums[i] <= 10^9
-10^9 <= target <= 10^9
Only one valid answer exists.
Follow-up: Can you come up with an algorithm that is less than O(n^2) time complexity?
"""
We can now define both of our agents. One acts as the “assistant” agent that suggests the solution; the other serves as a proxy to us, the user, and is also responsible for executing the suggested Python code.
# create an AssistantAgent named "assistant"
SYSTEM_MESSAGE = """You are a helpful AI assistant.
Solve tasks using your coding and language skills.
In the following cases, suggest python code (in a python coding block) or shell script (in a sh coding block) for the user to execute.
1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, get the current date/time, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Solve the task step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
When using code, you must indicate the script type in the code block. The user cannot provide any other feedback or perform any other action beyond executing the code you suggest. The user can't modify your code. So do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user.
If you want the user to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line. Don't include multiple code blocks in one response. Do not ask users to copy and paste the result. Instead, use 'print' function for the output when relevant. Check the execution result returned by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.
When you find an answer, verify the answer carefully. Include verifiable evidence in your response if possible.
Additional requirements:
1. Within the code, add functionality to measure the total run-time of the algorithm in python function using "time" library.
2. Only when the user proxy agent confirms that the Python script ran successfully and the total run-time (printed on stdout console) is less than 50 ms, only then return a concluding message with the word "TERMINATE". Otherwise, repeat the above process with a more optimal solution if it exists.
"""
assistant = autogen.AssistantAgent(
name="assistant",
llm_config=llm_config,
system_message=SYSTEM_MESSAGE
)
# create a UserProxyAgent instance named "user_proxy"
user_proxy = autogen.UserProxyAgent(
name="user_proxy",
human_input_mode="NEVER",
max_consecutive_auto_reply=4,
is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
code_execution_config={
"work_dir": "coding",
"use_docker": False,
},
)
I set human_input_mode to “NEVER” because I’m not planning to give any inputs myself, and max_consecutive_auto_reply to 4 to limit the back-and-forth turns in the conversation. The Assistant agent has been instructed to respond with the word “TERMINATE”, which tells the UserProxyAgent when to conclude the conversation.
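The termination check is just a predicate over the last message. Testing the same lambda standalone (with made-up message dicts, outside any agent) shows exactly when the conversation stops:

```python
# The same predicate passed as is_termination_msg above.
is_term = lambda x: x.get("content", "").rstrip().endswith("TERMINATE")

# Hypothetical messages mimicking the dict shape AutoGen hands to the predicate.
print(is_term({"content": "All checks passed. TERMINATE"}))   # True
print(is_term({"content": "Runtime exceeded, retrying..."}))  # False
print(is_term({"content": "TERMINATE\n"}))                    # True: trailing whitespace is stripped
```

Note the rstrip() call: without it, a trailing newline after “TERMINATE” would silently defeat the check and the chat would run until max_consecutive_auto_reply is exhausted.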
Now, the fun time! We’ll initiate the conversation by sending a message from our UserProxyAgent to our Assistant.
One added benefit of using AutoGen (even for non-agentic workflows) is that it provides explicit caching capability to help you save API costs during development. Here, I’m caching responses on disk, but you can also integrate Redis for this purpose.
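Conceptually, the cache keys each response on the seed plus the full request payload, so a repeated identical call is served from disk instead of hitting the API. Here’s a minimal stdlib sketch of that idea — not AutoGen’s actual implementation; `cached_call` and its arguments are made up for illustration:

```python
import hashlib
import json
import shelve

def cached_call(cache_path, seed, request, compute):
    # Key on the seed plus the serialized request, mirroring how a
    # response cache decides whether two calls are "the same".
    key = hashlib.sha256(
        json.dumps([seed, request], sort_keys=True).encode()
    ).hexdigest()
    with shelve.open(cache_path) as db:
        if key in db:
            return db[key], True        # cache hit: no API call made
        result = compute(request)       # cache miss: do the expensive call
        db[key] = result
        return result, False
```

This also explains why changing cache_seed (as in Cache.disk(cache_seed=7) below) forces fresh responses: a different seed produces a different key, so nothing matches the old entries.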
# Use DiskCache as cache
with Cache.disk(cache_seed=7) as cache:
# the assistant receives a message from the user_proxy, which contains the task description
chat_res = user_proxy.initiate_chat(
assistant,
message="""Solve the following leetcode problem and also comment on its time and space complexity:\n\n""" + LEETCODE_QUESTION
)
Output:
user_proxy (to assistant):
Solve the following leetcode problem and also comment on its time and space complexity:
Title: Two Sum
...
Follow-up: Can you come up with an algorithm that is less than O(n^2) time complexity?
--------------------------------------------------------------------------------
assistant (to user_proxy):
To solve this problem efficiently, we can use a dictionary to store the elements we have seen so far along with their indices. While iterating through the array, we can check if the complement of the current element (target - current element) is already in the dictionary. If it is, we have found the pair that sums up to the target.
Here's the Python code to solve the Two Sum problem with a time complexity less than O(n^2):
```python
# filename: two_sum.py
def two_sum(nums, target):
num_dict = {}
for i, num in enumerate(nums):
complement = target - num
if complement in num_dict:
return [num_dict[complement], i]
num_dict[num] = i
# Test the function with the given examples
nums1 = [2, 7, 11, 15]
target1 = 9
print(two_sum(nums1, target1)) # Output: [0, 1]
nums2 = [3, 2, 4]
target2 = 6
print(two_sum(nums2, target2)) # Output: [1, 2]
nums3 = [3, 3]
target3 = 6
print(two_sum(nums3, target3)) # Output: [0, 1]
```
The time complexity of this solution is O(n) because we iterate through the list of numbers only once. The space complexity is also O(n) because we store elements in a dictionary.
You can run this code and check if it provides the correct output. Let me know the total run-time, and if it's less than 50 ms, you can reply with "TERMINATE". If it's not, I can suggest a more optimal solution.
--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is python)...
user_proxy (to assistant):
exitcode: 0 (execution succeeded)
Code output:
[0, 1]
[1, 2]
[0, 1]
--------------------------------------------------------------------------------
assistant (to user_proxy):
The Python script ran successfully and provided the correct output for the given examples. Since the total run-time is not provided, I will suggest measuring the total run-time of the algorithm using the "time" library to determine if it's less than 50 ms.
You can add the functionality to measure the total run-time by importing the "time" library and using it to calculate the elapsed time before and after calling the `two_sum` function. Here's how you can modify the code:
```python
# filename: two_sum.py
import time
def two_sum(nums, target):
num_dict = {}
for i, num in enumerate(nums):
complement = target - num
if complement in num_dict:
return [num_dict[complement], i]
num_dict[num] = i
# Test the function with the given examples and measure the total run-time
nums1 = [2, 7, 11, 15]
target1 = 9
start_time = time.time()
print(two_sum(nums1, target1)) # Output: [0, 1]
end_time = time.time()
print("Total run-time:", (end_time - start_time) * 1000, "ms")
nums2 = [3, 2, 4]
target2 = 6
start_time = time.time()
print(two_sum(nums2, target2)) # Output: [1, 2]
end_time = time.time()
print("Total run-time:", (end_time - start_time) * 1000, "ms")
nums3 = [3, 3]
target3 = 6
start_time = time.time()
print(two_sum(nums3, target3)) # Output: [0, 1]
end_time = time.time()
print("Total run-time:", (end_time - start_time) * 1000, "ms")
```
Please run this modified code and provide the total run-time for each example. If the total run-time is less than 50 ms for all examples, you can reply with "TERMINATE".
--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is python)...
user_proxy (to assistant):
exitcode: 0 (execution succeeded)
Code output:
[0, 1]
Total run-time: 0.01239776611328125 ms
[1, 2]
Total run-time: 0.00286102294921875 ms
[0, 1]
Total run-time: 0.0016689300537109375 ms
--------------------------------------------------------------------------------
assistant (to user_proxy):
The Python script ran successfully, and the total run-time for each example is significantly less than 50 ms. Therefore, you can reply with "TERMINATE". The algorithm provided an efficient solution with a time complexity of O(n) and a space complexity of O(n).
--------------------------------------------------------------------------------
This is what’s happening here:
- The UserProxyAgent asks the Assistant to solve the problem based on the task description.
- The Assistant suggests a solution in a Python code block.
- The UserProxyAgent executes the Python code.
- The Assistant reads the console output and responds with a modified solution that adds the time-measurement functionality. (Honestly, I would have expected this modified solution right away, but the behavior can be tuned through prompt engineering or by employing a stronger LLM.)
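The time-measurement step the assistant bolted on can be factored into a small reusable helper. Here’s a sketch using time.perf_counter, a higher-resolution clock than the time.time the assistant chose (the `timed` helper is my own name, not part of AutoGen):

```python
import time

def timed(fn, *args, **kwargs):
    # Run fn once and report its wall-clock duration in milliseconds.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def two_sum(nums, target):
    # Same hashmap solution the assistant produced above.
    num_dict = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in num_dict:
            return [num_dict[complement], i]
        num_dict[num] = i

result, ms = timed(two_sum, [2, 7, 11, 15], 9)
print(result, f"{ms:.4f} ms")  # [0, 1] plus a sub-millisecond run-time
```

Baking this helper into the prompt (or a tool the agent can call) would spare the extra conversation turn the timing request cost us above.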
With AutoGen, you can also display the cost of the agentic workflow.
chat_res.cost
({'total_cost': 0,
'gpt-3.5-turbo-0125': {'cost': 0,
'prompt_tokens': 14578,
'completion_tokens': 3460,
'total_tokens': 18038}})
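The csv module imported at the top can be put to use here, e.g. to track token spend across experiments. A sketch that appends the per-model usage to a CSV file — the `log_cost` name and column layout are my own choices, and the dict shape is assumed to match the chat_res.cost output shown above:

```python
import csv

def log_cost(path, usage_summary):
    # Append one row per model: prompt, completion, and total tokens.
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for model, stats in usage_summary.items():
            if model == "total_cost":
                continue  # skip the aggregate entry
            writer.writerow([model, stats["prompt_tokens"],
                             stats["completion_tokens"], stats["total_tokens"]])

# Log the usage numbers reported above (file name is arbitrary).
log_cost("agent_costs.csv", {
    "total_cost": 0,
    "gpt-3.5-turbo-0125": {"cost": 0, "prompt_tokens": 14578,
                           "completion_tokens": 3460, "total_tokens": 18038},
})
```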
Concluding Remarks:
Thus, by using AutoGen’s conversable agents:
- We automatically verified that the Python code suggested by the LLM actually works.
- And we created a loop in which the LLM can respond to syntax or logical errors by reading the console output.
Thanks for reading! Please follow me and subscribe to be the first when I post a new article! 🙂
Check out my other articles:
- A Deep Dive into Evaluation in Azure Prompt Flow
- Develop a UI for Azure Prompt Flow with Streamlit
- Build a custom Chatbot using Hugging Face Chat UI and Cosmos DB on Azure Kubernetes Service
- Deploy Hugging Face Text Generation Inference on Azure Container Instance
Generate “Verified” Python Code Using AutoGen Conversable Agents was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.