Tag: artificial intelligence

  • Reasoning as the Engine Driving Legal Arguments


    Vern R Walker

    Statements of reasoning indicate types of argument

    Image of a layered mountain side, with graphics depicting the mining of reasoning sentences from legal decisions.
    Image by Vern R. Walker, CC BY 4.0.

    In a legal case at the trial level, the task of the trier of fact (whether judge or jury or administrative tribunal) is to assess the probative value of the evidence and to arrive at a conclusion about the facts. But what are a tribunal’s methods for performing that task? How many methods does a tribunal employ? There are at least three stages typical for any type of fact-finding institution.

    First, a trier of fact must determine which items of available evidence are relevant for deciding which issues of fact. An item of evidence is relevant to proving a factual proposition if it tends to make that proposition more or less probable than it would be without that evidence.

    Second, for each issue and set of relevant evidence, a trier of fact must evaluate the trustworthiness of each item of evidence. A person might use various criteria to evaluate the credibility of a witness’s testimony, or the trustworthiness of a document’s contents, or the probative value of a piece of physical evidence. It would be useful to determine which factors a tribunal tends to use in evaluating the credibility or trustworthiness of a particular piece of evidence. Also, can we determine priorities among those factors?

    Third, a trier of fact needs to weigh competing evidence. A person needs to balance inconsistent but credible evidence, and then determine the net probative value of all the relevant evidence. There might be different approaches to resolving conflicts between the testimonies of two different witnesses, or of the same witness over time. Or there may be different methods for deciding between statements within different documents, or between testimony and written statements. Can we determine patterns or “soft rules” for doing such comparisons?

    A particular type of sentence found in legal decisions provides important clues about the answers to such questions. A well-written legal decision expressly states at least some of the decision maker’s chains of intermediate inferences. Of particular importance are the sentences that state its evidential reasoning — which I will call “reasoning sentences.”

    In this article, I discuss the distinguishing characteristics and usefulness of such reasoning sentences. I also discuss the linguistic features that make it possible for machine-learning (ML) models to automatically label reasoning sentences within legal decision documents. I discuss why the adequacy of the performance of those models depends upon the use case, and why even basic ML models can be suitable for the task. I conclude by positioning reasoning sentences within the broader task of using generative AI and large language models to address the challenges of argument mining.

    The Characteristics and Usefulness of Reasoning Sentences

    In a fact-finding legal decision, a statement of the evidential reasoning explains how the evidence and legal rules support the findings of fact. A reasoning sentence, therefore, is a statement by the tribunal describing some part of the reasoning behind those findings of fact. An example is the following sentence from a fact-finding decision by the Board of Veterans’ Appeals (BVA) in a claim for benefits for a service-related disability:

    Also, the clinician’s etiological opinions are credible based on their internal consistency and her duty to provide truthful opinions.

    In other articles, I have discussed evidence sentences, legal-rule sentences, and finding sentences. In inferential terms, the evidence and legal rules function as premises, and the finding of fact functions as a conclusion. You can consider reasoning sentences as premises also, in the sense that they explain the probative value of the evidence.

    For attorneys and parties involved in a case, reasoning sentences provide an official explanation of why a party’s argument based on the evidence was successful or not. The parties are entitled to hold the tribunal to its stated reasons. Attorneys for the parties can use those stated reasons to help develop arguments against the logic used by the tribunal, or to develop additional support for that logic. Such arguments can be made at the trial level or on appeal.

    For attorneys not involved in the case, reasoning sentences can identify the methods of evidence assessment that a tribunal employed in a past case, even if those methods are not binding precedent for the tribunal. If attorneys can gather past cases that exhibit similar issues and evidence, then the reasoning used in those cases can provide possible lines of argument in new similar cases.

    For those of us mining types of legal argument generally, we can classify cases by the types of reasoning or argument used. Moreover, if ML algorithms can learn to identify the sentences that state the reasoning of the tribunal, we may be able to automatically find similar cases within very large datasets.

    For regulators or legislators, if a standard pattern of reasoning emerges from past cases, they may be able to codify it as a presumption in a regulation or statute, to make future fact-finding more efficient and uniform.

    Legal researchers and commentators can at least recommend such patterns as “soft rules” to guide legal reasoning.

    For all these reasons, an important focal point in argument mining from legal decisions is to identify, and learn how to use, sentences that state the decision’s reasoning.

    Linguistic Features of Reasoning Sentences

    In determining which sentences state the reasoning of the tribunal, lawyers consider a number of features.

    First, a sentence is more likely to be a reasoning sentence if it does one or more of the following (a rough keyword-based sketch follows this list):

    • Explicitly states what evidence is relevant to what issue of fact, or narrows the scope of evidence considered relevant to the issue;
    • Contains an explicit statement about the credibility of a witness, or about the trustworthiness of an item of evidence;
    • Contains a statement that two items of evidence are in conflict or are inconsistent;
    • Compares the probative value of two items of evidence, or emphasizes which evidence is more important than other evidence; or
    • States that evidence is lacking, insufficient, or non-existent.
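
    To make these cues concrete, here is a rough, keyword-based sketch (not a trained model); the cue list and the example sentence are illustrative assumptions, not an exhaustive or validated feature set.

    import re

    # Hypothetical cue patterns suggested by the list above.
    REASONING_CUES = [
        r"credib(le|ility)", r"trustworth", r"probative",
        r"outweigh", r"inconsistent", r"conflict",
        r"no (evidence|indication)", r"insufficient",
        r"the Board (notes|considers|finds) that",
    ]

    def looks_like_reasoning(sentence: str) -> bool:
        # Flag a sentence if it contains any of the cue patterns.
        return any(re.search(p, sentence, re.IGNORECASE) for p in REASONING_CUES)

    print(looks_like_reasoning(
        "The clinician's opinions are credible based on their internal consistency."
    ))  # True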

    Second, a reasoning sentence must state the reasoning of the trier of fact, not of someone else. That is, we must have a good basis to attribute the reasoning to the tribunal, as distinguished from it being merely the reasoning given by a witness, or an argument made by an attorney or party.

    Many different linguistic features can warrant an attribution of the stated reasoning to the decision maker. Sometimes those features are within the contents of the sentence itself. For example, phrases that warrant attribution to the decision maker might be: the Board considers that, or the Board has taken into account that.

    At other times, the location of the sentence within a paragraph or section of the decision is sufficient to warrant attribution to the trier of fact. For example, depending on the writing format of the tribunal, the decision might contain a section entitled “Reasons and Bases for the Decision,” or just “Discussion” or “Analysis.” Unqualified reasoning sentences within such sections are probably attributable to the tribunal, unless the sentence itself attributes the reasoning to a witness or party.

    Machine-Learning Results

    In our experiments, ML algorithms have the hardest time classifying reasoning sentences, compared to other sentence types. Nevertheless, trained models can still provide useful predictions about sentence type. We trained a Logistic Regression model on a dataset of 50 BVA decisions created by Hofstra Law’s Law, Logic & Technology Research Laboratory (LLT Lab). That dataset contains 5,797 manually labeled sentences after preprocessing, 710 of which are reasoning sentences. In a multi-class scenario, the model classified reasoning sentences with precision = 0.66 and recall = 0.52. We later trained a neural network (“NN”) model on the same BVA dataset and tested it on 1,846 sentences, with comparable results: the precision for reasoning sentences was 0.66, and the recall was 0.51.
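
    The following is a minimal sketch of such a multi-class sentence classifier using scikit-learn; the TF-IDF features, the toy sentences, and the label names are illustrative assumptions, not the LLT Lab's actual feature set or data.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Stand-ins for the manually labeled BVA sentences (not the real dataset).
    sentences = [
        "The Board finds the examiner's opinion to be credible and persuasive.",
        "The veteran served on active duty from 1968 to 1970.",
    ]
    labels = ["ReasoningSentence", "EvidenceSentence"]

    # TF-IDF features feeding a multi-class logistic regression classifier.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
    clf.fit(sentences, labels)

    print(clf.predict(["The examiner's opinion outweighs the lay statements."]))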

    It is tempting to dismiss such ML performance as too low to be useful. Before doing so, it is important to investigate the nature of the errors made, and the practical cost of an error given a use case.

    Practical Error Analysis

    Of the 175 sentences that the neural net model predicted to be reasoning sentences, 59 were misclassifications (precision = 0.66). Here the confusion was with several other types of sentences. Of the 59 sentences misclassified as reasoning sentences, 24 were actually evidence sentences, 15 were finding sentences, and 11 were legal-rule sentences.

    Such confusion is understandable if the wording of a reasoning sentence closely tracks the evidence being evaluated, the finding being supported, or the legal rule being applied. An evidence sentence might also use words or phrases that signify inference, but the inference reported in such a sentence is part of the content of the evidence, not the reasoning of the trier of fact.

    As an example of a false positive (or precision error), the trained NN model mistakenly predicted the following to be a reasoning sentence, when it is actually an evidence sentence (the model originally assigned a background color of green, which the expert reviewer manually changed to blue) (the screenshot is taken from the software application LA-MPS, developed by Apprentice Systems):

    Example of an evidence sentence, text highlighted with blue background color, misclassified by the NN model as a reasoning sentence.
    Image by Vern R. Walker, CC BY 4.0.

    While this is an evidence sentence that primarily recites the findings reflected in the reports of an examiner from the Department of Veterans Affairs (VA), the NN model classified the sentence as stating the reasoning of the tribunal itself, probably due in part to the occurrence of the words ‘The Board notes that.’ The prediction scores of the model, however, indicate that the confusion was a reasonably close call (see below the sentence text): reasoning sentence (53.88%) vs. evidence sentence (44.92%).

    As an example of a false negative (or recall error), the NN model misclassified the following sentence as an evidence sentence, when clearly it is a reasoning sentence (the model originally assigned a background color of blue, which the expert reviewer manually changed to green):

    Example of a reasoning sentence, text highlighted with green background color, misclassified by the NN model as an evidence sentence.
    Image by Vern R. Walker, CC BY 4.0.

    This sentence refers to the evidence, but it does so in order to explain the tribunal’s reasoning that the probative value of the evidence from the VA outweighed that of the private treatment evidence. The prediction scores for the possible sentence roles (shown below the sentence text) show that the NN model erroneously predicted this to be an evidence sentence (score = 45.01%), although reasoning sentence also received a relatively high score (33.01%).

    In fact, the wording of sentences can make their true classification highly ambiguous, even for lawyers. An example is whether to classify the following sentence as a legal-rule sentence or as a reasoning sentence:

    No further development or corroborative evidence is required, provided that the claimed stressor is “consistent with the circumstances, conditions, or hardships of the veteran’s service.”

    Given the immediate context within the decision, we manually labeled this sentence as stating a legal rule about when further development or corroborative evidence is required. But the sentence also contains wording consistent with a trier of fact’s reasoning within the specifics of a case. Based only on the sentence wording, however, even lawyers might reasonably classify this sentence in either category.

    The cost of a classification error depends upon the use case and the type of error. For the purpose of extracting and presenting examples of legal reasoning, the precision and recall noted above might be acceptable to a user. A precision of 0.66 means that about 2 of every 3 sentences predicted to be reasoning sentences are correctly predicted, and a recall of 0.51 means that about half of the actual reasoning sentences are correctly detected. If high recall is not essential, and the goal is helpful illustration of past reasoning, such performance might be acceptable.

    An error might be especially low-cost if it consists of confusing a reasoning sentence with an evidence sentence or legal-rule sentence that still contains insight about the reasoning at work in the case. If the user is interested in viewing different examples of possible arguments, then a sentence classified either as reasoning or evidence or legal rule might still be part of an illustrative argument pattern.

    Such low precision and recall would be unacceptable, however, if the goal is to compile accurate statistics on the occurrence of arguments involving a particular kind of reasoning. Our confidence would be very low for descriptive or inferential statistics based on a sample drawn from a set of decisions in which the reasoning sentences were automatically labeled using such a model.

    Summary

    In sum, reasoning sentences can contain extremely valuable information about the types of arguments and reasoning at work in a decision.

    First, they signal patterns of reasoning recognized by triers of fact in past cases, and they can suggest possible patterns of argument in future cases. We can gather illustrative sets of similar cases, examine the use of evidence and legal rules in combination, and illustrate their success or failure as arguments.

    Second, if we extract a set of reasoning sentences from a large dataset, we can survey them to develop a list of factors for evaluating individual items of evidence, and to develop soft rules for comparing conflicting items of evidence.

    It is worth noting also that if our goal is automated argument mining at scale, then identifying and extracting whole arguments rests on more classifiers than merely those for reasoning sentences. I have suggested in other articles that automatic classifiers are adequate for certain use cases in labeling evidence sentences, legal-rule sentences, and finding sentences. Perhaps auto-labeling such sentence types in past decisions can help large language models to address the challenges in argument mining — that is, help them to summarize reasoning in past cases and to recommend arguments in new cases.



  • Building a Multilingual Multi-Agent Chat Application Using LangGraph — Part I


    Roshan Santhosh


    In this 3-part series, learn how to build a RAG-based, multilingual, agentic chat application with an integrated AI assistant to streamline tasks at workplaces.

    Background

    Despite the advancements in technology, language barriers still exist in today’s world. Whether at work or outside it, there are always situations where differences in language create awkward moments. This is especially true for large enterprises with teams spread across different geographies, speaking different languages. As part of the recently concluded Aya Expedition organized by the Cohere for AI research community, I was able to work on a project that aimed to address this language barrier, along with other workplace inefficiencies, by developing a multilingual agentic chat application for workplaces.

    Instead of talking more about the product, I think the best way to introduce the product and what we will be building through this series is to actually watch it in action.

    The following tutorial series covers the development of this application, which includes:

    1. An agentic workflow for translating to the user’s preferred language
    2. Building features for the AI assistant: RAG-based question answering, Documentation-on-the-Go, and Smart Summarize features
    3. Deploying the agentic workflow through FastAPI and developing a web UI to interface with it

    High-level framework

    Given the popularity of LangChain and its graph-based counterpart, LangGraph, I don’t want this to be another tutorial that explains the basics of these packages and their methods. Instead, I want to focus on the design choices and challenges faced while implementing our solution with these packages, as I feel that would be more useful in the long run.

    LangChain vs LangGraph

    The first design choice we faced was selecting between LangChain and LangGraph.

    In a simple scenario (as pictured below), where every message provided by a user is sent to all other users and translated into their preferred language, LangChain would have been a sufficient choice. This is a unidirectional flow that starts with the user sending the message and ends with the users receiving the messages:

    Unidirectional information flow without Aya

    However, the primary constraint in our scenario was the inclusion of an AI assistant, which we shall be referring to as Aya (named after the Expedition). Aya was planned to be a significant component of this chat application and added a new layer of complexity to our system. With Aya, each message from the sending user needed to be analyzed, and, depending on its nature (if it was a command addressed to Aya), the system needed to send back a response, which in turn needed to be delivered to the receiving users.

    Information flow with Aya

    Defining a Run: Another design choice that’s relevant here is the definition of one ‘run’ or one ‘iteration’ of the messaging cycle.

    In the definition we chose, we considered each run to be initiated when any user sends a message and terminated when all messages related to that initial message reach the receiving users.

    So if it’s a message that doesn’t address Aya and is just a direct message to other users, the run is considered terminated when the initial translated message is received by all users. And if it’s a message that addresses Aya, then the run is considered terminated when both the initial message and the response from Aya reach all users.

    So with this design choice/definition of a run, we wanted a flow where we wait for the responses from Aya to be generated and pushed to the users before terminating the run. For implementing such a flow, we used LangGraph, as it was built specifically for such cases.
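
    To make this flow concrete, below is a minimal, hypothetical LangGraph sketch of the routing just described; the node functions, the state definition, and the "@aya" routing rule are placeholder assumptions, not the actual implementation (which is covered later in this series).

    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class SimpleState(TypedDict):
        messages: list

    def sender_node(state: SimpleState) -> dict:
        # The sending user's message enters the graph here.
        return {"messages": state["messages"]}

    def aya_node(state: SimpleState) -> dict:
        # Placeholder for the Aya agents handling a command addressed to Aya.
        return {"messages": state["messages"] + ["<Aya's response>"]}

    def users_node(state: SimpleState) -> dict:
        # Placeholder for the user agents translating and receiving messages.
        return {"messages": state["messages"]}

    def route(state: SimpleState) -> str:
        # Route through Aya only when the latest message addresses her.
        return "aya" if str(state["messages"][-1]).lower().startswith("@aya") else "users"

    builder = StateGraph(SimpleState)
    builder.add_node("sender", sender_node)
    builder.add_node("aya", aya_node)
    builder.add_node("users", users_node)
    builder.set_entry_point("sender")
    builder.add_conditional_edges("sender", route, {"aya": "aya", "users": "users"})
    builder.add_edge("aya", "users")  # Aya's response is also pushed to the users
    builder.add_edge("users", END)

    graph = builder.compile()
    print(graph.invoke({"messages": ["@aya summarize the last 10 messages"]}))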

    Building the Agents

    The backbone of this application is the agents and their interactions. Overall, we had two different types of agents:

    1. User Agents: Agents attached to each user. Primarily tasked with translating incoming messages into the user’s preferred language
    2. Aya Agents: Various agents associated with Aya, each with its own specific role/job

    User Agents

    The UserAgent class is used to define an agent that will be associated with every user who is part of the chat room. Some of the functions implemented by the UserAgent class are:

    1. Translate incoming messages into the user’s preferred language

    2. Activate/Invoke graph when a user sends a message

    3. Maintain a chat history to help provide context to the translation task to allow for ‘context-aware’ translation

    class UserAgent(object):

        def __init__(self, llm, userid, user_language):
            self.llm = llm
            self.userid = userid
            self.user_language = user_language
            self.chat_history = []

            prompt = ChatPromptTemplate.from_template(USER_SYSTEM_PROMPT2)

            self.chain = prompt | llm

        def set_graph(self, graph):
            self.graph = graph

        def send_text(self, text: str, debug=False):

            message = ChatMessage(message=HumanMessage(content=text), sender=self.userid)
            inputs = {"messages": [message]}
            output = self.graph.invoke(inputs, debug=debug)
            return output

        def display_chat_history(self, content_only=False):

            for i in self.chat_history:
                if content_only:
                    print(f"{i.sender} : {i.content}")
                else:
                    print(i)

        def invoke(self, message: BaseMessage) -> AIMessage:

            output = self.chain.invoke({'message': message.content, 'user_language': self.user_language})

            return output

    For the most part, the implementation of UserAgent is pretty standard LangChain/LangGraph code:

    • Define a LangChain chain (a prompt template + LLM) that is responsible for doing the actual translation.
    • Define a send_text function that’s used to invoke the graph whenever a user wants to send a new message

    The performance of this agent depends largely on the translation quality of the LLM, as translation is its primary objective. And LLM translation performance can vary significantly, especially depending on the languages involved. Certain low-resource languages are not well represented in the training data of some models, and this affects the translation quality for those languages.

    Aya Agents

    For Aya, we actually have a system of separate agents that all contribute toward the overall assistant. Specifically, we have:

    1. AyaSupervisor: Control agent that supervises the operation of the other Aya agents.
    2. AyaQuery: Agent for running RAG-based question answering
    3. AyaSummarizer: Agent for generating chat summaries and doing task identification
    4. AyaTranslator: Agent for translating messages to English

    class AyaTranslator(object):

        def __init__(self, llm) -> None:
            self.llm = llm
            prompt = ChatPromptTemplate.from_template(AYA_TRANSLATE_PROMPT)
            self.chain = prompt | llm

        def invoke(self, message: str) -> AIMessage:
            output = self.chain.invoke({'message': message})
            return output

    class AyaQuery(object):

        def __init__(self, llm, store, retriever) -> None:
            self.llm = llm
            self.retriever = retriever
            self.store = store
            qa_prompt = ChatPromptTemplate.from_template(AYA_AGENT_PROMPT)
            self.chain = qa_prompt | llm

        def invoke(self, question: str) -> AIMessage:

            context = format_docs(self.retriever.invoke(question))
            rag_output = self.chain.invoke({'question': question, 'context': context})
            return rag_output

    class AyaSupervisor(object):

        def __init__(self, llm):

            prompt = ChatPromptTemplate.from_template(AYA_SUPERVISOR_PROMPT)
            self.chain = prompt | llm

        def invoke(self, message: str) -> str:
            output = self.chain.invoke(message)
            return output.content

    class AyaSummarizer(object):

        def __init__(self, llm):

            message_length_prompt = ChatPromptTemplate.from_template(AYA_SUMMARIZE_LENGTH_PROMPT)
            self.length_chain = message_length_prompt | llm

            prompt = ChatPromptTemplate.from_template(AYA_SUMMARIZER_PROMPT)
            self.chain = prompt | llm

        def invoke(self, message: str, agent: UserAgent) -> str:

            length = self.length_chain.invoke(message)

            try:
                length = int(length.content.strip())
            except ValueError:
                length = 0

            chat_history = agent.chat_history

            if length == 0:
                messages_to_summarize = [chat_history[i].content for i in range(len(chat_history))]
            else:
                messages_to_summarize = [chat_history[i].content for i in range(min(len(chat_history), length))]

            print(length)
            print(messages_to_summarize)

            messages_to_summarize = "\n ".join(messages_to_summarize)

            output = self.chain.invoke(messages_to_summarize)
            output_content = output.content

            print(output_content)

            return output_content

    Most of these agents have a similar structure, consisting primarily of a LangChain chain (a custom prompt and an LLM). The exceptions are the AyaQuery agent, which uses an additional vector database retriever to implement RAG, and the AyaSummarizer, which chains multiple LLM calls within it.

    Design considerations

    Role of AyaSupervisor Agent: In the design of the graph, we had a fixed edge going from the Supervisor node to the user nodes, which meant that all messages that reached the Supervisor node were pushed to the user nodes themselves. Therefore, in cases where Aya was being addressed, we had to ensure that only a single final output from Aya was pushed to the users; we didn’t want intermediate messages, if any, to reach them. That is why we had the AyaSupervisor agent act as the single point of contact for the Aya group of agents. This agent was primarily responsible for interpreting the intent of the incoming message, directing it to the appropriate task-specific agent, and then outputting the final message to be shared with the users.

    Design of AyaSummarizer: The AyaSummarizer agent is slightly more complex than the other Aya agents, as it carries out a two-step process. In the first step, the agent determines the number of messages that need to be summarized, which is an LLM call with its own prompt. In the second step, once we know the number of messages to summarize, we collate the required messages and pass them to the LLM to generate the actual summary. In addition to the summary, in this same step, the LLM also identifies any action items that were present in the messages and lists them separately.

    So broadly there were three tasks: determining the number of messages to be summarized, summarizing the messages, and identifying action items. However, given that the first task was proving a bit difficult for the LLM without any explicit examples, I chose to make it a separate LLM call and to combine the last two tasks into their own LLM call.

    It may be possible to eliminate the additional LLM call and combine all three tasks in one call. Potential options include:

    1. Providing very detailed examples that cover all three tasks in one step
    2. Generating a lot of examples to fine-tune an LLM to perform well on this task

    Role of AyaTranslator: One of the goals for Aya was to make it a multilingual AI assistant that can communicate in the user’s preferred language. However, it would be difficult to handle different languages internally within the Aya agents. Specifically, if an Aya agent’s prompt is in English and the user message is in a different language, it could potentially create issues. To avoid such situations, as a filtering step, we translated any incoming user messages addressed to Aya into English. As a result, all of the internal work within the Aya group of agents, including the output, was done in English. We didn’t have to translate the Aya output back to the original language because, when the message reaches the users, the user agents take care of translating it into their respective assigned languages.

    Prompt Design

    With respect to prompt design, the majority of the work was focused on getting the LLM to output responses in a particular format in a consistent manner. For most cases, I was able to achieve this by providing explicit instructions. In some cases, instructions alone were not enough, and I had to provide examples for the agent to behave consistently.

    For the most part, the prompt template had the following structure:

    [High level task definition] You are an AI assistant that answers user's questions... 

    [List of specific constraints related to the response]
    Obey the following rules :
    1. ....

    [Providing context/user input]
    Message :

    To take a specific example, we take a look at the prompt used by the User Agent:

    You are a {user_language} translator, translating a conversation between work colleagues. Translate the message provided by the user into {user_language}. 

    Obey the following rules :
    1. Only translate the text thats written after 'Message:' and nothing else
    2. If the text is already in {user_language} then return the message as it is.
    3. Return only the translated text
    4. Ensure that your translation uses formal language

    Message:
    {message}

    With regards to this agent, an important constraint was to ensure that the model only outputted the translated text and no supporting text like “Here’s the translated text” or “Sure, the following is a translation for the provided text”. In this case, adding a specific rule to obey (rule #3) was enough to ensure that the models were only outputting the translated text and nothing else.

    One instance that required examples in the prompt was the summarizer agent, specifically the step responsible for identifying the number of messages to summarize over. I found it difficult to get the agent to consistently extract the number of messages mentioned, if any, and output it in a specific format, so it became necessary to provide examples to better explain what I was expecting as a response.
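
    For illustration, a hypothetical version of such a prompt (not the exact prompt used in the project) might look like the following:

    You are an assistant that determines how many messages should be summarized, based on the user's command.

    Obey the following rules :
    1. Return only a single integer and nothing else
    2. If the command does not mention a number of messages, return 0

    Examples:
    Command: Aya, summarize the last 20 messages -> 20
    Command: Aya, can you summarize our conversation? -> 0

    Command:
    {message}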

    Other implementation details

    ChatMessage

    Those familiar with LangChain should already be aware of the AIMessage and HumanMessage classes, which are used to hold AI and human messages. For our use case, we needed to store the ID of the sender for downstream tasks. To address this, we created a new wrapper class called ChatMessage that stores a message along with the sender’s ID.

    class ChatMessage(object):

        def __init__(self, message: BaseMessage, sender: str = None):
            self.message = message
            self.sender = sender
            self.content = message.content

        def __repr__(self) -> str:
            return f"{self.sender} | {self.content}"

    Graph State

    In LangGraph, one of the key elements of the graph is the graph state. The state variable/object is crucial for proper communication between the agents, as well as for keeping track of progress through the graph workflow.

    def reducer(a: list, b: list | str) -> list:

        if isinstance(b, list):
            return a + b
        else:
            return a


    class AgentState(TypedDict):
        messages: Annotated[Sequence[ChatMessage], reducer]

    In most LangGraph examples, the state variable is a list of strings that keeps getting appended to after passing through every agent. In our use case, I wanted to exclude the outputs from certain nodes from affecting the state of the graph, despite the workflow having passed through that node. To accommodate such cases, I differentiated between the two types of state changes by having one as a list and the other as a string. In cases where the state update is in the form of a list, it gets appended to the overall state object. In cases where the state update is a string, we ignore that update and propagate the existing state. This is achieved using the custom reducer function defined above.
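
    As a small sketch of that behavior (assuming the reducer defined above is in scope; the message values here are plain strings used as placeholders rather than ChatMessage objects):

    state = []
    state = reducer(state, ["hello from user 1"])   # list update -> appended to the state
    state = reducer(state, "intermediate output")   # string update -> ignored, existing state propagated
    print(state)  # ['hello from user 1']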

    Conclusion

    At this stage, we have covered the design choices of one of the key components of the agentic workflow : the agents. In the next tutorial, we will cover more details about the actual LangGraph graph and its implementation, along with some more details about the features associated with Aya.

    Resources

    For the code, you can refer to the repo here: Multilingual Chatbot.

    Unless specified otherwise, all images are created by the author.

    In addition to Medium, I share my thoughts, ideas and other updates on LinkedIn.



  • Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock


    Shreyas Subramanian

    In this post, we show you how to use LlamaIndex with Amazon Bedrock to build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks.


  • Evaluating prompts at scale with Prompt Management and Prompt Flows for Amazon Bedrock


    Antonio Rodriguez

    In this post, we demonstrate how to implement an automated prompt evaluation system using Amazon Bedrock so you can streamline your prompt development process and improve the overall quality of your AI-generated content.


  • An Introduction to Bayesian A/B Testing


    Laurin Brechter

    Gain better insights from your data

    A/B testing, also known as split testing, allows businesses to experiment with different versions of a webpage or marketing asset to determine which one performs better in terms of user engagement, click-through rates, and, most importantly, conversion rates.

    Conversion rates — the percentage of visitors who complete a desired action, such as making a purchase or signing up for a newsletter — are often the key metrics that determine the success of online campaigns. By carefully testing variations of a webpage, businesses can make data-driven decisions that significantly improve these rates. Whether it’s tweaking the color of a call-to-action button, changing the headline, or rearranging the layout, A/B testing provides actionable insights that can transform the effectiveness of your online presence.

    In this post, I will show how to do Bayesian A/B testing for conversion rates. We will then work through a more complicated example that examines changes in customer behavior after an intervention. Finally, we will compare this approach to a frequentist approach and discuss its possible advantages and disadvantages.

    Comparing Conversion Rates

    Let’s say we want to improve upon our e-commerce website. We do so by exposing two groups of customers to two versions of our website where we, for example, change a button. We then stop this experiment after having exposed a certain number of visitors to both versions. After that, we get a binary array for each group, with a 1 indicating a conversion and a 0 indicating no conversion.

    Observed Data after A/B Test
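
    For illustration, such arrays could be simulated as follows; the "true" conversion rates used here are assumptions chosen to roughly match the observed counts discussed below.

    import numpy as np

    rng = np.random.default_rng(42)
    obsA = rng.binomial(1, 0.05, size=100)  # 1 = converted, 0 = no conversion
    obsB = rng.binomial(1, 0.03, size=100)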

    We can summarize the data in a contingency table that shows us the (relative) frequencies.

    contingency = np.array([[obsA.sum(), (1-obsA).sum()], [obsB.sum(), (1-obsB).sum()]])
    Contingency Table

    In our case, we showed each variation to 100 customers. In the first variation, 5 (or 5%) converted, and in the second variation 3 converted.

    Frequentist Setting

    We will do a statistical test to assess whether this result is significant or due to chance. In this case, we will use a Chi2 test, which compares the observed frequencies to the ones that would be expected if there were no true difference between the two versions (the null hypothesis). For more information, one can look at this blog post that goes into more detail.

    In this case, the p-value does not fall under the threshold for significance (e.g. 5%), and therefore we cannot reject the null hypothesis that there is no difference between the two variants in their effect on the conversion rate.
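
    As a minimal sketch, the test can be run with SciPy on the contingency table built above:

    from scipy.stats import chi2_contingency

    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")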

    Now, there are some pitfalls when using the Chi2 test that can make the insights gained from it erroneous. First, it is very sensitive to the sample size: with a large sample size, even tiny differences become significant, whereas with a small sample size, the test may fail to detect differences. This is especially the case if the expected frequencies for any of the cells are smaller than five, in which case one has to use another test. Additionally, the test does not provide information on the magnitude or practical significance of the difference. When conducting multiple A/B tests simultaneously, the probability of finding at least one significant result due to chance increases. The Chi2 test does not account for this multiple comparisons problem, which can lead to false positives if not properly controlled (e.g., through Bonferroni correction).

    Another common pitfall occurs when interpreting the results of the Chi2 test (or any statistical test for that matter). The p-value gives us the probability of observing data at least as extreme as ours, given that the null hypothesis is true. It does not make a statement about the distribution of conversion rates or their difference. And this is a major problem. We cannot make statements such as “the probability that the conversion rate of variant B is 2% is X%” because for that we would need the probability distribution of the conversion rate (conditioned on the observed data).

    These pitfalls highlight the importance of understanding the limitations of the Chi2 test and using it appropriately within its constraints. When applying this test, it is crucial to complement it with other statistical methods and contextual analysis to ensure accurate and meaningful conclusions.

    Bayesian Setting

    After looking at the frequentist way of dealing with A/B testing, let’s look at the Bayesian version. Here, we are modeling the data-generating process (and therefore the conversion rate) directly. That is, we are specifying a likelihood and a prior that could lead to the observed outcome. Think of this as specifying a ‘story’ for how the data could have been created.

    Bayes Formula

    In this case, I am using the Python package PyMC for modeling since it has a clear and concise syntax. Inside the ‘with’ statement, we specify distributions that we can combine and that give rise to a data-generating process.

    with pm.Model() as ConversionModel:
        # priors
        pA = pm.Uniform('pA', 0, 1)
        pB = pm.Uniform('pB', 0, 1)

        delta = pm.Deterministic('delta', pA - pB)

        obsA = pm.Bernoulli('obsA', pA, observed=obsA)
        obsB = pm.Bernoulli('obsB', pB, observed=obsB)

        trace = pm.sample(2000)

    We have pA and pB which are the probabilities of conversion in groups A and B respectively. With pm.Uniform we specify our prior belief about these parameters. This is where we could encode prior knowledge. In our case, we are being neutral and allowing for any conversion rate between 0 and 1 to be equally likely.

    PyMC then allows us to draw samples from the posterior distribution which is our updated belief about the parameters after seeing the data. We now obtain a full probability distribution for the conversion probabilities.

    Posterior Distributions for Conversion Rates

    From these distributions, we can directly read quantities of interest such as credible intervals. This allows us to answer questions such as “What is the likelihood of a conversion rate between X% and Y%?”.
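
    For example, here is a minimal sketch (assuming the trace from the model above and the ArviZ library) of reading off a credible interval and the posterior probability that variant A converts better than variant B:

    import arviz as az

    # 94% highest-density interval for the difference in conversion rates
    print(az.hdi(trace, var_names=["delta"]))

    # Posterior probability that variant A has the higher conversion rate
    prob_A_better = (trace.posterior["delta"] > 0).mean().item()
    print(f"P(pA > pB | data) = {prob_A_better:.2f}")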

    The Bayesian approach allows for much more flexibility as we will see later. Interpreting the results is also more straightforward and intuitive than in the frequentist setting.

    Model Arbitrary Data-generating Processes

    We will now look at a more complicated example of A/B testing. Let’s say we expose subjects to some intervention at the beginning of the observation period. This would be the A/B part where one group gets intervention A and the other intervention B. We then look at the interaction of the 2 groups with our platform in the next 100 days (maybe something like the number of logins). What we might see is the following.

    We now want to know if these two groups show a meaningful difference in their response to the intervention. How would we solve this with a statistical test? Frankly, I don’t know. Someone would have to come up with a statistical test for exactly this scenario. The alternative is to come back to a Bayesian setting, where we will first come up with a data-generating process. We will assume that each individual is independent and that their interactions with the platform are normally distributed. Each individual has a switch point where they change their behavior. This switch point occurs only once but can happen at any given point in time. Before the switch point, we assume a mean interaction intensity of mu1, and after it an intensity of mu2. The syntax might look a bit complicated, especially if you have never used PyMC before. In that case, I would recommend checking out their learning material.

    with pm.Model(coords={
        'ind_id': ind_id,
    }) as SwitchPointModel:

        sigma = pm.HalfCauchy("sigma", beta=2, dims="ind_id")

        # draw a switchpoint from a uniform distribution for each individual
        switchpoint = pm.DiscreteUniform("switchpoint", lower=0, upper=100, dims="ind_id")

        # priors for the two groups
        mu1 = pm.HalfNormal("mu1", sigma=10, dims="ind_id")
        mu2 = pm.HalfNormal("mu2", sigma=10, dims="ind_id")

        diff = pm.Deterministic("diff", mu1 - mu2)

        # create a deterministic variable for the mean: mu1 before each individual's
        # switchpoint and mu2 after it
        intercept = pm.math.switch(switchpoint < X.T, mu1, mu2)

        obsA = pm.Normal("y", mu=intercept, sigma=sigma, observed=obs)

        trace = pm.sample()

    The model can then show us the distribution for the switch point location as well as the distribution of differences before and after the switch point.

    We can take a closer look at those differences with a forest plot.
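
    One way to produce such a plot, assuming ArviZ and the trace from the model above:

    import arviz as az

    az.plot_forest(trace, var_names=["diff"], combined=True)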

    We can clearly see that the differences between Group A (id 0 through 9) and Group B (id 10 through 19) are quite distinct, with Group B showing a much greater response to the intervention.

    Conclusion

    Bayesian inference offers a lot of flexibility when it comes to modeling situations in which we do not have a lot of data and in which we care about modeling uncertainty. Additionally, it forces us to make our assumptions explicit and think about them. In simpler scenarios, frequentist statistical tests are often easier to use, but one has to be aware of the assumptions that come with them.

    All code used in this article can be found on my GitHub. Unless otherwise stated, all images are created by the author.



  • The Latest on LLMs: Decision-Making, Knowledge Graphs, Reasoning Skills, and More

    TDS Editors

    Feeling inspired to write your first TDS post? We’re always open to contributions from new authors.

    With the pace at which large language models continue to evolve, staying up-to-date with the field is a major challenge. We see new models, cutting-edge research, and LLM-based apps proliferate on a daily basis, and as a result, many practitioners are understandably concerned about falling behind or not using the latest and shiniest tools.

    First, let’s all take a deep breath: when an entire ecosystem is moving rapidly in dozens of different directions, nobody can expect (or be expected) to know everything. We should also not forget that most of our peers are in a very similar situation, zooming in on the developments that are most essential to their work, while avoiding too much FOMO—or at least trying to.

    If you’re still interested in learning about some of the biggest questions currently dominating conversations around LLMs, or are curious about the emerging themes machine learning professionals are exploring, we’re here to help. In this week’s Variable, we’re highlighting standout articles that dig deep into the current state of LLMs, both in terms of their underlying capabilities and practical real-world applications. Let’s dive in!

    Photo by Mick Haupt on Unsplash

    The world of data science and machine learning is vast, and goes far beyond contemporary LLMs—which is why we encourage you to explore some of our other reading recommendations on other topics.

    Thank you for supporting the work of our authors! As we mentioned above, we love publishing articles from new authors, so if you’ve recently written an interesting project walkthrough, tutorial, or theoretical reflection on any of our core topics, don’t hesitate to share it with us.

    Until the next Variable,

    TDS Team

