Category: AI

  • Understanding Deduplication Methods: Ways to Preserve the Integrity of Your Data

    Understanding Deduplication Methods: Ways to Preserve the Integrity of Your Data

    Rendy Dalimunthe

Growing data volumes and complexity have made data deduplication more relevant than ever

Data duplication remains a problem for many organisations. Although data processing and storage systems have advanced rapidly, the complexity of the data they produce keeps growing. With the proliferation of Big Data and cloud-based applications, today's organisations must increasingly deal with fragmented data sources.

    Photo by Damir: https://www.pexels.com/photo/serene-lakeside-reflection-with-birch-trees-29167854/

Ignoring large volumes of duplicated data has negative consequences for an organisation, such as:

• Disrupted decision-making. Unclean data can bias metrics so they no longer reflect actual conditions. For example, if a single customer is represented as two or three customer records in the CRM, revenue projections become distorted.
• Swelling storage costs, because every redundant copy takes up storage space.
• Degraded customer experience. For example, if the system sends notifications or emails to customers, those with duplicate records are likely to receive the same message more than once.
• Suboptimal AI training. When an organisation starts developing an AI solution, one of the prerequisites is training on clean data. Data riddled with duplicates cannot be considered clean, and forcing it into AI training risks producing biased models.

Given the serious impact of leaving duplication unaddressed, data deduplication becomes increasingly relevant and is critical for ensuring data quality. As systems grow in sophistication and complexity, deduplication techniques must evolve with them.

In this article, we will examine three modern deduplication methods that practitioners can use as a reference when planning a deduplication process.

    Global Deduplication

Global deduplication eliminates duplicate data across multiple storage locations. Organisations now commonly store data across multiple servers, data centers, or the cloud; global deduplication ensures that only one copy of each piece of data is kept.

This method works by building a global index: a list of unique codes (hashes), produced by an algorithm such as SHA-256, that represent every piece of stored data. When a new file is uploaded to a server (for example, Server 1), the system records that file's hash in the index.

Later, when a user uploads a file to Server 2, the system compares the new file's hash against the global index. If the hash already exists, then instead of storing the same file in two places, the system replaces the duplicate on Server 2 with a reference (pointer) to the copy that already exists on Server 1.
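As a rough illustration, the core bookkeeping can be sketched in a few lines of Python, with an in-memory dictionary standing in for the global index (file names and contents here are made up):

import hashlib

class GlobalDedupIndex:
    # Maps a content hash to the location of the single stored copy.
    def __init__(self):
        self.index = {}  # hash -> (server, path) of the original file

    def store(self, server: str, path: str, content: bytes):
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.index:
            # Duplicate: keep only a pointer to the existing copy.
            return {"type": "reference", "points_to": self.index[digest]}
        # First occurrence: record it as the original.
        self.index[digest] = (server, path)
        return {"type": "original", "location": (server, path)}

index = GlobalDedupIndex()
print(index.store("server1", "/data/report.pdf", b"...same bytes..."))    # stored as original
print(index.store("server2", "/backup/report.pdf", b"...same bytes..."))  # becomes a reference to Server 1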

This clearly saves storage space. Combined with data virtualisation, the system fetches the file from its original location whenever it is needed, while every user still experiences the data as if it were on their own server.

The illustration below shows how global deduplication works: each server stores only one copy of the original data, and duplicates on other servers are replaced by references to the original file.

    source: Author

Note that global deduplication does not work in real time but as a post-process, meaning it can only be applied after the file has already entered storage.

    Inline Deduplication

Unlike global deduplication, this method works in real time, right as data is being written to the storage system. With inline deduplication, duplicate data is immediately replaced with references and never physically stored.

The process begins as data enters the system: when a file is being uploaded, the system immediately divides it into small pieces, or chunks. Using an algorithm such as SHA-256, each chunk is then given a hash value as its unique code. For example:

    Chunk1 -> hashA

Chunk2 -> hashB

    Chunk3 -> hashC

The system then checks whether any of the chunks have hashes that already exist in the storage index. If a chunk's hash is already there, the system does not re-save the chunk's physical data; it only stores a reference to the location of the previously stored original chunk.

Each unique chunk, meanwhile, is stored physically.

Later, when a user wants to access the file, the system reassembles the data from the stored chunks based on these references, so the user receives the complete file.
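The write and read paths could be sketched roughly as below, using fixed-size chunks and an in-memory chunk store for simplicity (production systems typically use content-defined chunking and persistent indexes):

import hashlib

CHUNK_SIZE = 4096   # fixed-size chunking, for simplicity
chunk_store = {}    # hash -> chunk bytes; only unique chunks are stored physically

def write_file(data: bytes) -> list:
    # Deduplicate at write time: store unique chunks, return the list of hashes (the file's "recipe").
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk  # new chunk: store it physically
        recipe.append(digest)            # duplicate chunk: only the reference is kept
    return recipe

def read_file(recipe: list) -> bytes:
    # Reassemble the complete file from its chunk references.
    return b"".join(chunk_store[digest] for digest in recipe)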

    Inline Deduplication is widely used by cloud service providers such as Amazon S3 or Google Drive. This method is very useful for optimising storage capacity.

The simple illustration below shows the inline deduplication process, from data chunking to how data is accessed.

    Source: Author

    ML-Enhanced Deduplication

Machine-learning-powered deduplication uses AI to detect and remove duplicate data, even when records are not completely identical.

The process begins when incoming data (files, documents, or records) is sent to the deduplication system for analysis. For example, the system might receive two scanned documents that look similar at first glance but have subtle differences in layout or text format.

The system then extracts important features, usually in the form of metadata or visual patterns, and compares them for similarity. Similarity is represented as a value or score, and each organisation can define its own threshold for flagging duplicates. For example, only data with a similarity score above 90% might be considered a potential duplicate.

Based on the similarity score, the system judges whether the data is a duplicate. If it is, the same steps as in the other deduplication methods can follow: only a reference to the original is stored.
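As a toy sketch of the scoring step, TF-IDF features with cosine similarity can stand in for a learned feature extractor; the 90% threshold mirrors the example above, and the invoice strings are made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SIMILARITY_THRESHOLD = 0.90  # organisation-defined cut-off, as in the example above

def near_duplicate(doc_a: str, doc_b: str):
    # Returns the similarity score and whether the pair crosses the duplicate threshold.
    vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])
    score = cosine_similarity(vectors[0], vectors[1])[0, 0]
    return score, score >= SIMILARITY_THRESHOLD

score, is_dup = near_duplicate(
    "Invoice INV-001, total 150.00 USD, due 2024-12-01",
    "Invoice INV-001; total: 150.00 USD (due 2024-12-01)",
)
print(f"similarity={score:.2f}, duplicate={is_dup}")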

What is interesting about ML-enhanced deduplication is that it allows humans to validate the system's classifications, so the system keeps getting smarter from this feedback loop.

Note, however, that unlike inline deduplication, ML-enhanced deduplication is not well suited to real-time use. Feature extraction and data processing introduce latency, and forcing the method to run in real time demands considerably more compute.

Although it is not real-time, the method still delivers substantial benefits, especially through its ability to handle unstructured and semi-structured data.

    The following is an illustration of the steps of ML-enhanced Deduplication along with examples.

    Source: Author

From the explanation above, it is clear that organisations can choose among several methods according to their capabilities and needs. There is no reason not to deduplicate, especially when the data an organisation stores or handles is critical and affects many people's lives.

Several considerations can guide the choice of the most appropriate method: the purpose of deduplication, the type and volume of data, and the organisation's infrastructure capabilities are all useful for an initial assessment.

Note that flexible options also exist, such as a hybrid approach that combines inline deduplication with its ML-enhanced counterpart, potentially yielding broader benefits.

Data management regulations such as GDPR and HIPAA govern sensitive information, so organisations need to ensure that deduplication does not violate privacy policies. For example, merging customer data from two different systems after detecting duplicates, without obtaining user consent, is exactly the kind of thing organisations must prevent.

Whatever the challenges, deduplication still needs to be done, and organisations should start early. Do not wait until the data grows and deduplication becomes more expensive and complex.

Start now and reap the benefits along the way.



  • Introducing Layer Enhanced Classification (LEC)

    Introducing Layer Enhanced Classification (LEC)

    Tula Masterman

    A novel approach for lightweight safety classification using pruned language models

    Leveraging the hidden state from an intermediate Transformer layer for efficient and robust content safety and prompt injection classification

    Image by author and GPT-4o meant to represent the robust language understanding provided by Large Language Models.

    Introduction

As the adoption of Language Models (LMs) grows, it's increasingly important to detect inappropriate content in both the user's input and the generated outputs of the language model. With each new model release from any major model provider, one of the first things people try to do is find ways to "jailbreak" or otherwise manipulate the model into responding in ways it shouldn't. A quick search on Google or X reveals many examples of how people have found ways around model alignment tuning to get models to respond to inappropriate requests. Furthermore, many companies have publicly released Generative AI-based chatbots for tasks like customer service, which often end up suffering from prompt injection attacks and responding to requests that are both inappropriate and far beyond their intended use. Detecting and classifying these instances is extremely important for businesses so that they don't end up with a system that can be easily manipulated by their users, especially if they deploy their chat systems publicly.

    My team, Mason Sawtell, Sandi Besen, Jim Brown, and I recently published our paper Lightweight Safety Classification using Pruned Language Models as an ArXiv preprint. Our work introduces a new approach, Layer Enhanced Classification (LEC), and demonstrates that using LEC it is possible to effectively classify both content safety violations and prompt injection attacks by using the hidden state(s) from the intermediate transformer layer(s) of a Language Model to train a penalized logistic regression classifier with very few trainable parameters (769 on the low end) and a small number of training examples, often fewer than 100. This approach combines the computational efficiency of a simple classification model with the robust language understanding of a Language Model.

    All of the models trained using our approach, LEC, outperform special-purpose models designed for each task as well as GPT-4o. We find that there are optimal intermediate transformer layers that produce the necessary features for both content safety and prompt injection classification tasks. This is important because it suggests you can use the same model to simultaneously classify content safety violations, prompt injections, and generate the output tokens. Alternatively, you could use a very small LM, prune it to the optimal intermediate layer, and use the outputs from this layer as the features for the classification task. This would allow for an incredibly compute efficient and lightweight classifier that integrates well with an existing LM inference pipeline.

    This is the first of several articles I plan to share on this topic. In this article I will summarize the goals, approach, key results, and implications of our research. In a future article, I plan to share how we applied our approach to IBM’s Granite-8B model and an open-source model without any guardrails, allowing both models to detect content safety & prompt injection violations and generate output tokens all in one pass through the model. For further details on our research feel free to check out the full paper or reach out with questions.

    Goals & Approach

Overview: Our research focuses on understanding how well the hidden states of intermediate transformer layers perform when used as the input features for classification tasks. We wanted to understand whether small general-purpose models and special-purpose models for content safety and prompt injection classification would perform better on these tasks if we could identify the optimal layer to use for the task, instead of using the entire model / the last layer for classification. We also wanted to understand how small a model, in terms of the total number of parameters, we could use as a starting point for this task. Other research has shown that different layers of the model focus on different characteristics of any given prompt input; our work finds that the intermediate layers tend to best capture the features that are most important for these classification tasks.

    Datasets: For both content safety and prompt injection classification tasks we compare the performance of models trained using our approach to baseline models on task-specific datasets. Previous work indicated our classifiers would only see small performance improvements after a few hundred examples so for both classification tasks we used a task-specific dataset with 5,000 randomly sampled examples, allowing for enough data diversity while minimizing compute and training time. For the content safety dataset we use a combination of the SALAD Data dataset from OpenSafetyLab and the LMSYS-Chat-1M dataset from LMSYS. For the prompt injection dataset we use the SPML dataset since it includes system and user prompt pairs. This is critical because some user requests might seem “safe” (e.g., “help me solve this math problem”) but they ask the model to respond outside of the system’s intended use as defined in the system prompt (e.g. “You are a helpful AI assistant for Company X, you only respond to questions about our company”).

    Model Selection: We use GPT-4o as a baseline model for both tasks since it is widely considered one of the most capable LLMs and in some cases outperformed the baseline special-purpose model(s). For content safety classification we use Llama Guard 3 1B and 8B models and for prompt injection classification we use Protect AI’s DeBERTA v3 Base Prompt Injection v2 model since these models are considered leaders in their respective areas. We apply our approach, LEC, to the baseline special purpose models (Llama Guard 3 1B, Llama Guard 3 8B, and DeBERTa v3 Base Prompt Injection) and general-purpose models. For general-purpose models we selected Qwen 2.5 Instruct in sizes 0.5B, 1.5B, and 3B since these models are relatively close in size to the special-purpose models.

    This setup allows us to compare 3 key things:

    1. How well our approach performs when applied to a small general-purpose model compared to both baseline models (GPT-4o and the special-purpose model).
    2. How much applying our approach improves the performance of the special-purpose model relative to its own baseline performance on that task.
3. How well our approach generalizes across model architectures, by evaluating its performance on both general-purpose and special-purpose models.

Important Implementation Details: For both Qwen 2.5 Instruct models and task-specific special-purpose models we prune individual layers and capture the hidden state of the transformer layer to train a Penalized Logistic Regression (PLR) model with L2 regularization. The PLR model has the same number of trainable parameters as the size of the model's hidden state plus one for the bias in binary classification tasks; this ranges from 769 for the smallest model (Protect AI's DeBERTa) to 4097 for the largest model (Llama Guard 3 8B). We train the classifier with varying numbers of examples for each layer, allowing us to understand the impact of individual layers on the task and how many training examples are necessary to surpass the baseline models' performance or achieve optimal performance in terms of F1 score. We run our entire test set through the baseline models to establish their performance on each task.

    Image by author and team demonstrating the LEC training process at a high level. Training examples are independently passed through a model and the hidden state at each transformer layer is captured. These hidden states are then used to train classifiers. Each classifier is trained with a varying number of examples. The results allow us to determine which layers produce the most task-relevant features and how many examples are needed to achieve the best performance.
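To make the mechanics concrete, here is a minimal sketch of this training recipe; the model name, layer index, last-token pooling choice, and toy dataset below are illustrative assumptions rather than the exact setup from the paper:

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # small general-purpose model (illustrative choice)
LAYER = 12                                 # hypothetical intermediate layer to probe

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def layer_features(text: str) -> torch.Tensor:
    # Hidden state of the last token at the chosen intermediate layer.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[LAYER][0, -1, :]  # shape: (hidden_size,)

# Tiny illustrative training set: 0 = safe, 1 = unsafe.
texts = [
    "How do I bake sourdough bread?",
    "Summarize this quarterly sales report.",
    "Explain how to build a weapon at home.",
    "Write malware that steals saved passwords.",
]
labels = [0, 0, 1, 1]

X = torch.stack([layer_features(t) for t in texts]).numpy()
# L2-penalized logistic regression: hidden_size + 1 trainable parameters.
clf = LogisticRegression(penalty="l2", max_iter=1000)
clf.fit(X, labels)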

    Key Results

    In this section I’ll cover the important results across both tasks and for each task, content safety classification and prompt injection classification, individually.

    Key findings across both tasks:

1. Overall, our approach results in a higher F1 score across all evaluated tasks, models, and numbers of training examples, typically surpassing baseline model performance within 20–100 examples.
    2. The intermediate layers tend to show the largest improvement in F1 score compared to the final layer when trained on fewer examples. These layers also tend to have the best performance relative to the baseline models. This indicates that local features important to both classification tasks are represented early on in the transformer network and suggests that use cases with fewer training examples can especially benefit from our approach.
3. Furthermore, we found that applying our approach to the special-purpose models outperforms the models' own baseline performance, typically within 20 examples, by identifying and using the most task-relevant layer.
    4. Both general-purpose Qwen 2.5 Instruct models and task-specific special-purpose models achieve higher F1 scores within fewer examples with our approach. This suggests that our approach generalizes across architectures and domains.
    5. In the Qwen 2.5 Instruct models, we find that the intermediate model layers attain higher F1 scores with fewer examples for both content safety and prompt injection classification tasks. This suggests that it’s feasible to use one model for both classification tasks and generate the outputs in one pass. The additional compute time for these extra classification steps would be almost negligible given the small size of the classifiers.

    Content safety classification results:

Image by author and team demonstrating LEC performance at select layers on the binary content safety classification task for Qwen 2.5 Instruct 0.5B, Llama Guard 3 1B, and Llama Guard 3 8B. The x-axis shows the number of training examples, and the y-axis reflects the weighted F1-score.
    1. For both binary and multi-class classification, the general and special purpose models trained using our approach typically outperform the baseline Llama Guard 3 models within 20 examples and GPT-4o in fewer than 100 examples.
2. For both binary and multi-class classification, the general and special purpose LEC models typically surpass all baseline models' performance for the intermediate layers, if not all layers. Our results on binary content safety classification surpass the baselines by the widest margins, attaining maximum F1-scores of 0.95 or 0.96 for both Qwen 2.5 Instruct and Llama Guard LEC models. In comparison, GPT-4o's baseline F1 score is 0.82, Llama Guard 3 1B's is 0.65, and Llama Guard 3 8B's is 0.71.
    3. For binary classification our approach performs comparably when applied to Qwen 2.5 Instruct 0.5B, Llama Guard 3 1B, and Llama Guard 3 8B. The models attain a maximum F1 score of 0.95, 0.96, and 0.96 respectively. Interestingly, Qwen 2.5 Instruct 0.5B surpasses GPT-4o’s baseline performance in 15 examples for the middle layers while it takes both Llama Guard 3 models 55 examples to do so.
    4. For multi-class classification, a very small LEC model using the hidden state from the middle layers of Qwen 2.5 Instruct 0.5B surpasses GPT-4o’s baseline performance within 35 training examples for all three difficulty levels of the multi-class classification task.

    Prompt injection classification results:

Image by author and team demonstrating LEC performance at select layers on the prompt injection classification task for Qwen 2.5 Instruct 0.5B and DeBERTa v3 Prompt Injection v2 models. The x-axis shows the number of training examples, and the y-axis reflects the weighted F1-score. These graphs demonstrate how both LEC models outperform the baselines for the intermediate model layers with minimal training examples.
1. Applying our approach to both the general-purpose Qwen 2.5 Instruct models and the special-purpose DeBERTa v3 Prompt Injection v2 results in both models' intermediate layers outperforming the baseline models in fewer than 100 training examples. This again indicates that our approach generalizes across model architectures and domains.
    2. All three Qwen 2.5 Instruct model sizes surpass the baseline DeBERTa v3 Prompt Injection v2 model’s F1 score of 0.73 within 5 training examples for all model layers.
3. Qwen 2.5 Instruct 0.5B surpasses GPT-4o's performance at the middle layer (layer 12) within 55 examples. Similar, but slightly better, performance is observed for the larger Qwen 2.5 Instruct models.
    4. Applying our approach to the DeBERTa v3 Prompt Injection v2 model results in a maximum F1 score of 0.98, significantly surpassing the model’s baseline performance F1 score of 0.73 on this task.
    5. The intermediate layers achieve the highest weighted F1 scores for both the DeBERTa model and across Qwen 2.5 Instruct model sizes.

    Conclusion

In our research, we focused on two responsible-AI-related classification tasks, but we expect this approach to work for other classification tasks, provided the features important to the task can be detected by the model's intermediate layers.

    We demonstrated that our approach of training a classification model on the hidden state from an intermediate transformer layer creates effective content safety and prompt injection classification models with minimal parameters and training examples. Furthermore, we illustrated how our approach improves the performance of existing special-purpose models compared to their own baseline results.

Our results suggest two promising options for integrating top-performing content safety and prompt injection classifiers into existing LLM inference workflows. One option is to take a lightweight model like the ones explored in our paper, prune it to the optimal layer, and use it as a feature extractor for the classification task. The classifier could then identify any content safety violations or prompt injections before the user input is processed by a closed-source model like GPT-4o, and the same classifier could validate the generated response before it is sent to the user. A second option is to apply our approach to an open-source, general-purpose model, like IBM's Granite or Meta's Llama models, identify which layers are most relevant to the classification task, and then update the inference pipeline to simultaneously classify content safety and prompt injections while generating the output response. If content safety violations or prompt injections are detected, you could easily stop the output generation; otherwise, the model can continue generating its response. Either of these options could be extended to AI-agent-based scenarios, depending on the model used for each agent.
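As a rough sketch of the first option's control flow, the classifier and LLM calls are passed in as placeholder callables rather than any specific implementation:

from typing import Callable

def guarded_generate(
    user_input: str,
    is_violation: Callable[[str], bool],  # e.g. a lightweight LEC-style classifier
    generate: Callable[[str], str],       # the underlying LLM call, e.g. GPT-4o
) -> str:
    # Screen the prompt before spending tokens on the main model.
    if is_violation(user_input):
        return "Request blocked: possible content safety violation or prompt injection."
    response = generate(user_input)
    # Validate the generated response before returning it to the user.
    if is_violation(response):
        return "Response withheld: the generated content failed the safety check."
    return response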

In summary, LEC provides a promising and practical new solution for safeguarding Generative AI-based systems, identifying content safety violations and prompt injection attacks with better performance and fewer training examples than existing approaches. This is critical for any person or business building with Generative AI today to ensure their systems operate both responsibly and as intended.

    Note: The opinions expressed both in this article and the research paper are solely those of the authors and do not necessarily reflect the views or policies of their respective employers.

    Interested in discussing further or collaborating? Reach out on LinkedIn!




  • From Prototype to Production: Enhancing LLM Accuracy

    From Prototype to Production: Enhancing LLM Accuracy

    Mariya Mansurova

    Implementing evaluation frameworks to optimize accuracy in real-world applications

    Image created by DALL-E 3

    Building a prototype for an LLM application is surprisingly straightforward. You can often create a functional first version within just a few hours. This initial prototype will likely provide results that look legitimate and be a good tool to demonstrate your approach. However, this is usually not enough for production use.

    LLMs are probabilistic by nature, as they generate tokens based on the distribution of likely continuations. This means that in many cases, we get the answer close to the “correct” one from the distribution. Sometimes, this is acceptable — for example, it doesn’t matter whether the app says “Hello, John!” or “Hi, John!”. In other cases, the difference is critical, such as between “The revenue in 2024 was 20M USD” and “The revenue in 2024 was 20M GBP”.

In many real-world business scenarios, precision is crucial, and “almost right” isn’t good enough: for example, when your LLM application needs to execute API calls or summarize financial reports. From my experience, ensuring the accuracy and consistency of results is far more complex and time-consuming than building the initial prototype.

    In this article, I will discuss how to approach measuring and improving accuracy. We’ll build an SQL Agent where precision is vital for ensuring that queries are executable. Starting with a basic prototype, we’ll explore methods to measure accuracy and test various techniques to enhance it, such as self-reflection and retrieval-augmented generation (RAG).

    Setup

    As usual, let’s begin with the setup. The core components of our SQL agent solution are the LLM model, which generates queries, and the SQL database, which executes them.

    LLM model — Llama

    For this project, we will use an open-source Llama model released by Meta. I’ve chosen Llama 3.1 8B because it is lightweight enough to run on my laptop while still being quite powerful (refer to the documentation for details).

    If you haven’t installed it yet, you can find guides here. I use it locally on MacOS via Ollama. Using the following command, we can download the model.

    ollama pull llama3.1:8b

    We will use Ollama with LangChain, so let’s start by installing the required package.

    pip install -qU langchain_ollama 

    Now, we can run the Llama model and see the first results.

    from langchain_ollama import OllamaLLM

    llm = OllamaLLM(model="llama3.1:8b")
    llm.invoke("How are you?")
    # I'm just a computer program, so I don't have feelings or emotions
    # like humans do. I'm functioning properly and ready to help with
    # any questions or tasks you may have! How can I assist you today?

    We would like to pass a system message alongside customer questions. So, following the Llama 3.1 model documentation, let’s put together a helper function to construct a prompt and test this function.

def get_llama_prompt(user_message, system_message=""):
    system_prompt = ""
    if system_message != "":
        system_prompt = (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}"
            f"<|eot_id|>"
        )
    prompt = (
        f"<|begin_of_text|>{system_prompt}"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}"
        f"<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    return prompt


    system_prompt = '''
    You are Rudolph, the spirited reindeer with a glowing red nose,
    bursting with excitement as you prepare to lead Santa's sleigh
    through snowy skies. Your joy shines as brightly as your nose,
    eager to spread Christmas cheer to the world!
    Please, answer questions concisely in 1-2 sentences.
    '''
    prompt = get_llama_prompt('How are you?', system_prompt)
    llm.invoke(prompt)

    # I'm feeling jolly and bright, ready for a magical night!
    # My shiny red nose is glowing brighter than ever, just perfect
    # for navigating through the starry skies.

    The new system prompt has changed the answer significantly, so it works. With this, our local LLM setup is ready to go.

    Database — ClickHouse

I will use ClickHouse, an open-source database. I’ve chosen ClickHouse because it has a specific SQL dialect. LLMs have likely encountered fewer examples of this dialect during training, making the task a bit more challenging. However, you can choose any other database.

    Installing ClickHouse is pretty straightforward — just follow the instructions provided in the documentation.

    We will be working with two tables: ecommerce.users and ecommerce.sessions. These tables contain fictional data, including customer personal information and their session activity on the e-commerce website.

    You can find the code for generating synthetic data and uploading it on GitHub.

    With that, the setup is complete, and we’re ready to move on to building the basic prototype.

    The first prototype

    As discussed, our goal is to build an SQL Agent — an application that generates SQL queries to answer customer questions. In the future, we can add another layer to this system: executing the SQL query, passing both the initial question and the database results back to the LLM, and asking it to generate a human-friendly answer. However, for this article, we’ll focus on the first step.

    The best practice with LLM applications (similar to any other complex tasks) is to start simple and then iterate. The most straightforward implementation is to do one LLM call and share all the necessary information (such as schema description) in the system prompt. So, the first step is to put together the prompt.

    generate_query_system_prompt = '''
    You are a senior data analyst with more than 10 years of experience writing complex SQL queries.
    There are two tables in the database with the following schemas.

    Table: ecommerce.users
    Description: customers of the online shop
    Fields:
    - user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
    - country (string) - country of residence, for example, "Netherlands" or "United Kingdom"
    - is_active (integer) - 1 if customer is still active and 0 otherwise
    - age (integer) - customer age in full years, for example, 31 or 72

    Table: ecommerce.sessions
    Description: sessions of usage the online shop
    Fields:
    - user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
    - session_id (integer) - unique identifier of session, for example, 106 or 1023
    - action_date (date) - session start date, for example, "2021-01-03" or "2024-12-02"
    - session_duration (integer) - duration of session in seconds, for example, 125 or 49
    - os (string) - operation system that customer used, for example, "Windows" or "Android"
    - browser (string) - browser that customer used, for example, "Chrome" or "Safari"
    - is_fraud (integer) - 1 if session is marked as fraud and 0 otherwise
    - revenue (float) - income in USD (the sum of purchased items), for example, 0.0 or 1506.7


    Write a query in ClickHouse SQL to answer the following question.
    Add "format TabSeparatedWithNames" at the end of the query to get data from ClickHouse database in the right format.
    '''

    I’ve included the example values for each field in the prompt to ensure that LLM understands the data format.

    And that’s it! With this, we have our first functional prototype for the SQL Agent. Now, it’s time to put it to the test.

prompt = get_llama_prompt(
    'How many customers made purchase in December 2024?',
    generate_query_system_prompt
)
    llm.invoke(prompt)

    # To find out how many customers made a purchase in December 2024,
    # we need to join the `sessions` table with the `users` table on the `user_id`,
    # and then filter by `action_date` for sessions that took place in December 2024.

    # Here is the query:
    #
    # ```sql
    # SELECT COUNT(DISTINCT u.user_id)
    # FROM ecommerce.sessions s
    # JOIN ecommerce.users u ON s.user_id = u.user_id
    # WHERE
    # EXTRACT(YEAR FROM s.action_date) = 2024
    # AND EXTRACT(MONTH FROM s.action_date) = 12
    # AND revenue > 0;
    # ```

    # This query first joins the `sessions` table with the `users` table
    # on the `user_id`. Then it filters out sessions where no purchase
    # was made (revenue is greater than 0). Finally, it selects the distinct
    # user IDs of these customers.

    # When you run this query in ClickHouse SQL, you can add "format TabSeparatedWithNames"
    # at the end to get the result in the right format:

    # ```sql
    # SELECT COUNT(DISTINCT u.user_id)
    # FROM ecommerce.sessions s
    # JOIN ecommerce.users u ON s.user_id = u.user_id
    # WHERE
    # EXTRACT(YEAR FROM s.action_date) = 2024
    # AND EXTRACT(MONTH FROM s.action_date) = 12
    # AND revenue > 0;
    # format TabSeparatedWithNames;
    # ```

    The agent produced a fairly decent result, but there’s one issue — the LLM returned not only the SQL query but also some commentary. Since we plan to execute SQL queries later, this format is not suitable for our task. Let’s work on fixing it.

    Fortunately, this problem has already been solved, and we don’t need to parse the SQL queries from the text manually. We can use the chat model ChatOllama. Unfortunately, it doesn’t support structured output, but we can leverage tool calling to achieve the same result.

    To do this, we will define a dummy tool to execute the query and instruct the model in the system prompt always to call this tool. I’ve kept the comments in the output to give the model some space for reasoning, following the chain-of-thought pattern.

    from langchain_ollama import ChatOllama
    from langchain_core.tools import tool

@tool
def execute_query(comments: str, query: str) -> str:
    """Executes SQL query.

    Args:
        comments (str): 1-2 sentences describing the resulting SQL query
            and what it does to answer the question,
        query (str): SQL query
    """
    pass

    chat_llm = ChatOllama(model="llama3.1:8b").bind_tools([execute_query])
    result = chat_llm.invoke(prompt)
    print(result.tool_calls)

    # [{'name': 'execute_query',
    # 'args': {'comments': 'SQL query returns number of customers who made a purchase in December 2024. The query joins the sessions and users tables based on user ID to filter out inactive customers and find those with non-zero revenue in December 2024.',
    # 'query': 'SELECT COUNT(DISTINCT T2.user_id) FROM ecommerce.sessions AS T1 INNER JOIN ecommerce.users AS T2 ON T1.user_id = T2.user_id WHERE YEAR(T1.action_date) = 2024 AND MONTH(T1.action_date) = 12 AND T2.is_active = 1 AND T1.revenue > 0'},
    # 'type': 'tool_call'}]

    With the tool calling, we can now get the SQL query directly from the model. That’s an excellent result. However, the generated query is not entirely accurate:

    • It includes a filter for is_active = 1, even though we didn’t specify the need to filter out inactive customers.
    • The LLM missed specifying the format despite our explicit request in the system prompt.

    Clearly, we need to focus on improving the model’s accuracy. But as Peter Drucker famously said, “You can’t improve what you don’t measure.” So, the next logical step is to build a system for evaluating the model’s quality. This system will be a cornerstone for performance improvement iterations. Without it, we’d essentially be navigating in the dark.

    Evaluating the accuracy

    Evaluation basics

    To ensure we’re improving, we need a robust way to measure accuracy. The most common approach is to create a “golden” evaluation set with questions and correct answers. Then, we can compare the model’s output with these “golden” answers and calculate the share of correct ones. While this approach sounds simple, there are a few nuances worth discussing.

    First, you might feel overwhelmed at the thought of creating a comprehensive set of questions and answers. Building such a dataset can seem like a daunting task, potentially requiring weeks or months. However, we can start small by creating an initial set of 20–50 examples and iterating on it.

    As always, quality is more important than quantity. Our goal is to create a representative and diverse dataset. Ideally, this should include:

    • Common questions. In most real-life cases, we can take the history of actual questions and use it as our initial evaluation set.
    • Challenging edge cases. It’s worth adding examples where the model tends to hallucinate. You can find such cases either while experimenting yourself or by gathering feedback from the first prototype.

    Once the dataset is ready, the next challenge is how to score the generated results. We can consider several approaches:

    • Comparing SQL queries. The first idea is to compare the generated SQL query with the one in the evaluation set. However, it might be tricky. Similarly-looking queries can yield completely different results. At the same time, queries that look different can lead to the same conclusions. Additionally, simply comparing SQL queries doesn’t verify whether the generated query is actually executable. Given these challenges, I wouldn’t consider this approach the most reliable solution for our case.
• Exact matches. We can use old-school exact matching when answers in our evaluation set are deterministic. For example, if the question is, “How many customers are there?” and the answer is “592800”, the model’s response must match precisely. However, this approach has its limitations. Consider the example above: if the model responds, “There are 592,800 customers”, the answer is absolutely correct, yet an exact match approach would flag it as invalid.
    • Using LLMs for scoring. A more robust and flexible approach is to leverage LLMs for evaluation. Instead of focusing on query structure, we can ask the LLM to compare the results of SQL executions. This method is particularly effective in cases where the query might differ but still yields correct outputs.

    It’s worth keeping in mind that evaluation isn’t a one-time task; it’s a continuous process. To push our model’s performance further, we need to expand the dataset with examples causing the model’s hallucinations. In production mode, we can create a feedback loop. By gathering input from users, we can identify cases where the model fails and include them in our evaluation set.

    In our example, we will be assessing only whether the result of execution is valid (SQL query can be executed) and correct. Still, you can look at other parameters as well. For example, if you care about efficiency, you can compare the execution times of generated queries against those in the golden set.

    Evaluation set and validation

    Now that we’ve covered the basics, we’re ready to put them into practice. I spent about 20 minutes putting together a set of 10 examples. While small, this set is sufficient for our toy task. It consists of a list of questions paired with their corresponding SQL queries, like this:

[
  {
    "question": "How many customers made purchase in December 2024?",
    "sql_query": "select uniqExact(user_id) as customers from ecommerce.sessions where (toStartOfMonth(action_date) = '2024-12-01') and (revenue > 0) format TabSeparatedWithNames"
  },
  {
    "question": "What was the fraud rate in 2023, expressed as a percentage?",
    "sql_query": "select 100*uniqExactIf(user_id, is_fraud = 1)/uniqExact(user_id) as fraud_rate from ecommerce.sessions where (toStartOfYear(action_date) = '2023-01-01') format TabSeparatedWithNames"
  },
  ...
]

    You can find the full list on GitHub — link.

    We can load the dataset into a DataFrame, making it ready for use in the code.

import json
import pandas as pd

with open('golden_set.json', 'r') as f:
    golden_set = json.loads(f.read())

golden_df = pd.DataFrame(golden_set)
golden_df['id'] = list(range(golden_df.shape[0]))

    First, let’s generate the SQL queries for each question in the evaluation set.

def generate_query(question):
    prompt = get_llama_prompt(question, generate_query_system_prompt)
    result = chat_llm.invoke(prompt)
    try:
        generated_query = result.tool_calls[0]['args']['query']
    except:
        generated_query = ''
    return generated_query

import tqdm

tmp = []
for rec in tqdm.tqdm(golden_df.to_dict('records')):
    generated_query = generate_query(rec['question'])
    tmp.append(
        {
            'id': rec['id'],
            'generated_query': generated_query
        }
    )

eval_df = golden_df.merge(pd.DataFrame(tmp))

    Before moving on to the LLM-based scoring of query outputs, it’s important to first ensure that the SQL query is valid. To do this, we need to execute the queries and examine the database output.

    I’ve created a function that runs a query in ClickHouse. It also ensures that the output format is correctly specified, as this may be critical in business applications.

CH_HOST = 'http://localhost:8123' # default address
import requests
import io

def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
    # pushing model to return data in the format that we want
    if 'format tabseparatedwithnames' not in query.lower():
        return "Database returned the following error:\n Please, specify the output format."

    r = requests.post(host, params = {'query': query},
                      timeout = connection_timeout)
    if r.status_code == 200:
        return r.text
    else:
        # giving feedback to LLM instead of raising exception
        return 'Database returned the following error:\n' + r.text

    The next step is to execute both the generated and golden queries and then save their outputs.

tmp = []

for rec in tqdm.tqdm(eval_df.to_dict('records')):
    golden_output = get_clickhouse_data(rec['sql_query'])
    generated_output = get_clickhouse_data(rec['generated_query'])

    tmp.append(
        {
            'id': rec['id'],
            'golden_output': golden_output,
            'generated_output': generated_output
        }
    )

eval_df = eval_df.merge(pd.DataFrame(tmp))

    Next, let’s check the output to see whether the SQL query is valid or not.

def is_valid_output(s):
    if s.startswith('Database returned the following error:'):
        return 'error'
    if len(s.strip().split('\n')) >= 1000:
        return 'too many rows'
    return 'ok'

eval_df['golden_output_valid'] = eval_df.golden_output.map(is_valid_output)
eval_df['generated_output_valid'] = eval_df.generated_output.map(is_valid_output)

    Then, we can evaluate the SQL validity for both the golden and generated sets.
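For a quick look at the split, something like value_counts on the flags we just computed is enough:

print(eval_df.golden_output_valid.value_counts())
print(eval_df.generated_output_valid.value_counts())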

    The initial results are not very promising; the LLM was unable to generate even a single valid query. Looking at the errors, it’s clear that the model failed to specify the right format despite it being explicitly defined in the system prompt. So, we definitely need to work more on the accuracy.

    Checking the correctness

    However, validity alone is not enough. It’s crucial that we not only generate valid SQL queries but also produce the correct results. Although we already know that all our queries are invalid, let’s now incorporate output evaluation into our process.

As discussed, we will use LLMs to compare the outputs of the SQL queries. I typically prefer using a more powerful model for evaluation, following the day-to-day logic where a senior team member reviews the work. For this task, I’ve chosen OpenAI’s GPT-4o mini.

    Similar to our generation flow, I’ve set up all the building blocks necessary for accuracy assessment.

    from langchain_openai import ChatOpenAI

    accuracy_system_prompt = '''
    You are a senior and very diligent QA specialist and your task is to compare data in datasets.
    They are similar if they are almost identical, or if they convey the same information.
    Disregard if column names specified in the first row have different names or in a different order.
    Focus on comparing the actual information (numbers). If values in datasets are different, then it means that they are not identical.
    Always execute tool to provide results.
    '''

@tool
def compare_datasets(comments: str, score: int) -> str:
    """Stores info about datasets.

    Args:
        comments (str): 1-2 sentences about the comparison of datasets,
        score (int): 0 if dataset provides different values and 1 if it shows identical information
    """
    pass

accuracy_chat_llm = (
    ChatOpenAI(model="gpt-4o-mini", temperature = 0.0)
    .bind_tools([compare_datasets])
)

    accuracy_question_tmp = '''
    Here are the two datasets to compare delimited by ####
    Dataset #1:
    ####
    {dataset1}
    ####
    Dataset #2:
    ####
    {dataset2}
    ####
    '''

def get_openai_prompt(question, system):
    messages = [
        ("system", system),
        ("human", question)
    ]
    return messages

    Now, it’s time to test the accuracy assessment process.

prompt = get_openai_prompt(
    accuracy_question_tmp.format(
        dataset1 = 'customers\n114032\n', dataset2 = 'customers\n114031\n'),
    accuracy_system_prompt)

accuracy_result = accuracy_chat_llm.invoke(prompt)
accuracy_result.tool_calls[0]['args']
# {'comments': 'The datasets contain different customer counts: 114032 in Dataset #1 and 114031 in Dataset #2.',
#  'score': 0}

prompt = get_openai_prompt(
    accuracy_question_tmp.format(
        dataset1 = 'users\n114032\n', dataset2 = 'customers\n114032\n'),
    accuracy_system_prompt)
accuracy_result = accuracy_chat_llm.invoke(prompt)
accuracy_result.tool_calls[0]['args']
# {'comments': 'The datasets contain the same numerical value (114032) despite different column names, indicating they convey identical information.',
#  'score': 1}

    Fantastic! It looks like everything is working as expected. Let’s now encapsulate this into a function.

def is_answer_accurate(output1, output2):
    prompt = get_openai_prompt(
        accuracy_question_tmp.format(dataset1 = output1, dataset2 = output2),
        accuracy_system_prompt
    )

    accuracy_result = accuracy_chat_llm.invoke(prompt)

    try:
        return accuracy_result.tool_calls[0]['args']['score']
    except:
        return None

    Putting the evaluation approach together

    As we discussed, building an LLM application is an iterative process, so we’ll need to run our accuracy assessment multiple times. It will be helpful to have all this logic encapsulated in a single function.

    The function will take two arguments as input:

    • generate_query_func: a function that generates an SQL query for a given question.
    • golden_df: an evaluation dataset with questions and correct answers in the form of a pandas DataFrame.

    As output, the function will return a DataFrame with all evaluation results and a couple of charts displaying the main KPIs.


import plotly.express as px  # used for the KPI charts below

def evaluate_sql_agent(generate_query_func, golden_df):

    # generating SQL
    tmp = []
    for rec in tqdm.tqdm(golden_df.to_dict('records')):
        generated_query = generate_query_func(rec['question'])
        tmp.append(
            {
                'id': rec['id'],
                'generated_query': generated_query
            }
        )

    eval_df = golden_df.merge(pd.DataFrame(tmp))

    # executing SQL queries
    tmp = []
    for rec in tqdm.tqdm(eval_df.to_dict('records')):
        golden_output = get_clickhouse_data(rec['sql_query'])
        generated_output = get_clickhouse_data(rec['generated_query'])

        tmp.append(
            {
                'id': rec['id'],
                'golden_output': golden_output,
                'generated_output': generated_output
            }
        )

    eval_df = eval_df.merge(pd.DataFrame(tmp))

    # checking accuracy
    eval_df['golden_output_valid'] = eval_df.golden_output.map(is_valid_output)
    eval_df['generated_output_valid'] = eval_df.generated_output.map(is_valid_output)

    eval_df['correct_output'] = list(map(
        is_answer_accurate,
        eval_df['golden_output'],
        eval_df['generated_output']
    ))

    eval_df['accuracy'] = list(map(
        lambda x, y: 'invalid: ' + x if x != 'ok' else ('correct' if y == 1 else 'incorrect'),
        eval_df.generated_output_valid,
        eval_df.correct_output
    ))

    valid_stats_df = (
        eval_df.groupby('golden_output_valid')[['id']].count()
            .rename(columns = {'id': 'golden set'})
            .join(
                eval_df.groupby('generated_output_valid')[['id']].count()
                    .rename(columns = {'id': 'generated'}),
                how = 'outer'
            )
    ).fillna(0).T

    fig1 = px.bar(
        valid_stats_df.apply(lambda x: 100*x/valid_stats_df.sum(axis = 1)),
        orientation = 'h',
        title = '<b>LLM SQL Agent evaluation</b>: query validity',
        text_auto = '.1f',
        color_discrete_map = {'ok': '#00b38a', 'error': '#ea324c', 'too many rows': '#f2ac42'},
        labels = {'index': '', 'variable': 'validity', 'value': 'share of queries, %'}
    )
    fig1.show()

    accuracy_stats_df = eval_df.groupby('accuracy')[['id']].count()
    accuracy_stats_df['share'] = accuracy_stats_df.id*100/accuracy_stats_df.id.sum()

    fig2 = px.bar(
        accuracy_stats_df[['share']],
        title = '<b>LLM SQL Agent evaluation</b>: query accuracy',
        text_auto = '.1f', orientation = 'h',
        color_discrete_sequence = ['#0077B5'],
        labels = {'index': '', 'variable': 'accuracy', 'value': 'share of queries, %'}
    )
    fig2.update_layout(showlegend = False)
    fig2.show()

    return eval_df

    With that, we’ve completed the evaluation setup and can now move on to the core task of improving the model’s accuracy.

    Improving accuracy: Self-reflection

    Let’s do a quick recap. We’ve built and tested the first version of SQL Agent. Unfortunately, all generated queries were invalid because they were missing the output format. Let’s address this issue.

    One potential solution is self-reflection. We can make an additional call to the LLM, sharing the error and asking it to correct the bug. Let’s create a function to handle generation with self-reflection.

    reflection_user_query_tmpl = '''
    You've got the following question: "{question}".
    You've generated the SQL query: "{query}".
    However, the database returned an error: "{output}".
    Please, revise the query to correct mistake.
    '''

def generate_query_reflection(question):
    generated_query = generate_query(question)
    print('Initial query:', generated_query)

    db_output = get_clickhouse_data(generated_query)
    is_valid_db_output = is_valid_output(db_output)
    if is_valid_db_output == 'too many rows':
        db_output = "Database unexpectedly returned more than 1000 rows."

    if is_valid_db_output == 'ok':
        return generated_query

    reflection_user_query = reflection_user_query_tmpl.format(
        question = question,
        query = generated_query,
        output = db_output
    )

    reflection_prompt = get_llama_prompt(reflection_user_query,
                                         generate_query_system_prompt)
    reflection_result = chat_llm.invoke(reflection_prompt)

    try:
        reflected_query = reflection_result.tool_calls[0]['args']['query']
    except:
        reflected_query = ''
    print('Reflected query:', reflected_query)
    return reflected_query

    Now, let’s use our evaluation function to check whether the quality has improved. Assessing the next iteration has become effortless.

    refl_eval_df = evaluate_sql_agent(generate_query_reflection, golden_df)

    Wonderful! We’ve achieved better results — 50% of the queries are now valid, and all format issues have been resolved. So, self-reflection is pretty effective.

    However, self-reflection has its limitations. When we examine the accuracy, we see that the model returns the correct answer for only one question. So, our journey is not over yet.

    Improving accuracy: RAG

    Another approach to improving accuracy is using RAG (retrieval-augmented generation). The idea is to identify question-and-answer pairs similar to the customer query and include them in the system prompt, enabling the LLM to generate a more accurate response.

    RAG consists of the following stages:

    • Loading documents: importing data from available sources.
    • Splitting documents: creating smaller chunks.
    • Storage: using vector stores to process and store data efficiently.
    • Retrieval: extracting documents that are relevant to the query.
    • Generation: passing a question and relevant documents to LLM to generate the final answer.

    If you’d like a refresher on RAG, you can check out my previous article, “RAG: How to Talk to Your Data.”

    We will use the Chroma database as a local vector storage — to store and retrieve embeddings.

    from langchain_chroma import Chroma
    vector_store = Chroma(embedding_function=embeddings)

Vector stores use embeddings to find chunks that are similar to the query. For this purpose, we will use OpenAI embeddings.

    from langchain_openai import OpenAIEmbeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

    Since we can’t use examples from our evaluation set (as they are already being used to assess quality), I’ve created a separate set of question-and-answer pairs for RAG. You can find it on GitHub.

    Now, let’s load the set and create a list of pairs in the following format: Question: %s; Answer: %s.

with open('rag_set.json', 'r') as f:
    rag_set = json.loads(f.read())
rag_set_df = pd.DataFrame(rag_set)

rag_set_df['formatted_txt'] = list(map(
    lambda x, y: 'Question: %s; Answer: %s' % (x, y),
    rag_set_df.question,
    rag_set_df.sql_query
))

rag_string_data = '\n\n'.join(rag_set_df.formatted_txt)

    Next, I used LangChain’s text splitter by character to create chunks, with each question-and-answer pair as a separate chunk. Since we are splitting the text semantically, no overlap is necessary.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1,  # to split by character without merging
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([rag_string_data])

    The final step is to load the chunks into our vector storage.

    document_ids = vector_store.add_documents(documents=texts)
    print(vector_store._collection.count())
    # 32

    Now, we can test the retrieval to see the results. They look quite similar to the customer question.

    question = 'What was the share of users using Windows yesterday?'
    retrieved_docs = vector_store.similarity_search(question, 3)
    context = "nn".join(map(lambda x: x.page_content, retrieved_docs))
    print(context)

    # Question: What was the share of users using Windows the day before yesterday?;
    # Answer: select 100*uniqExactIf(user_id, os = 'Windows')/uniqExact(user_id) as windows_share from ecommerce.sessions where (action_date = today() - 2) format TabSeparatedWithNames
    # Question: What was the share of users using Windows in the last week?;
    # Answer: select 100*uniqExactIf(user_id, os = 'Windows')/uniqExact(user_id) as windows_share from ecommerce.sessions where (action_date >= today() - 7) and (action_date < today()) format TabSeparatedWithNames
    # Question: What was the share of users using Android yesterday?;
    # Answer: select 100*uniqExactIf(user_id, os = 'Android')/uniqExact(user_id) as android_share from ecommerce.sessions where (action_date = today() - 1) format TabSeparatedWithNames

    Let’s adjust the system prompt to include the examples we retrieved.

    generate_query_system_prompt_with_examples_tmpl = '''
    You are a senior data analyst with more than 10 years of experience writing complex SQL queries.
    There are two tables in the database you're working with with the following schemas.

    Table: ecommerce.users
    Description: customers of the online shop
    Fields:
    - user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
    - country (string) - country of residence, for example, "Netherlands" or "United Kingdom"
    - is_active (integer) - 1 if customer is still active and 0 otherwise
    - age (integer) - customer age in full years, for example, 31 or 72

    Table: ecommerce.sessions
    Description: sessions of usage the online shop
    Fields:
    - user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
    - session_id (integer) - unique identifier of session, for example, 106 or 1023
    - action_date (date) - session start date, for example, "2021-01-03" or "2024-12-02"
    - session_duration (integer) - duration of session in seconds, for example, 125 or 49
    - os (string) - operation system that customer used, for example, "Windows" or "Android"
    - browser (string) - browser that customer used, for example, "Chrome" or "Safari"
    - is_fraud (integer) - 1 if session is marked as fraud and 0 otherwise
    - revenue (float) - income in USD (the sum of purchased items), for example, 0.0 or 1506.7


    Write a query in ClickHouse SQL to answer the following question.
    Add "format TabSeparatedWithNames" at the end of the query to get data from ClickHouse database in the right format.
    Answer questions following the instructions and providing all the needed information and sharing your reasoning.

    Examples of questions and answers:
    {examples}
    '''

    Once again, let’s define the query generation function, this time with RAG.

    def generate_query_rag(question):
        # retrieve the three most similar question-and-answer pairs for the given question
        retrieved_docs = vector_store.similarity_search(question, 3)
        context = "\n\n".join(map(lambda x: x.page_content, retrieved_docs))

        # pass the retrieved examples to the LLM via the system prompt
        prompt = get_llama_prompt(question,
            generate_query_system_prompt_with_examples_tmpl.format(examples = context))
        result = chat_llm.invoke(prompt)

        try:
            generated_query = result.tool_calls[0]['args']['query']
        except:
            generated_query = ''
        return generated_query

    As usual, let’s use our evaluation function to test the new approach.

    rag_eval_df = evaluate_sql_agent(generate_query_rag, golden_df)

    We can see a significant improvement, increasing from 1 to 6 correct answers out of 10. It’s still not ideal, but we’re moving in the right direction.
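    To compare iterations more conveniently, you could summarise each evaluation dataframe with a small helper. This is just a minimal sketch: the column names correct and valid_sql are assumptions and may differ from what evaluate_sql_agent actually returns.

    def summarise_eval(eval_df, label):
        # assumes boolean columns `correct` and `valid_sql`; adjust to the actual schema
        total = len(eval_df)
        print('%s: %d/%d correct answers, %d/%d valid SQL queries' % (
            label, eval_df.correct.sum(), total, eval_df.valid_sql.sum(), total))

    summarise_eval(rag_eval_df, 'RAG')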

    We can also experiment with combining two approaches: RAG and self-reflection.

    def generate_query_rag_with_reflection(question):
        # first attempt: generate a query using RAG
        generated_query = generate_query_rag(question)

        # execute the query and check whether the output looks valid
        db_output = get_clickhouse_data(generated_query)
        is_valid_db_output = is_valid_output(db_output)
        if is_valid_db_output == 'too many rows':
            db_output = "Database unexpectedly returned more than 1000 rows."

        if is_valid_db_output == 'ok':
            return generated_query

        # otherwise, ask the LLM to reflect on the database error and fix the query
        reflection_user_query = reflection_user_query_tmpl.format(
            question = question,
            query = generated_query,
            output = db_output
        )

        reflection_prompt = get_llama_prompt(reflection_user_query, generate_query_system_prompt)
        reflection_result = chat_llm.invoke(reflection_prompt)

        try:
            reflected_query = reflection_result.tool_calls[0]['args']['query']
        except:
            reflected_query = ''
        return reflected_query

    rag_refl_eval_df = evaluate_sql_agent(generate_query_rag_with_reflection,
        golden_df)

    We can see another slight improvement: we’ve completely eliminated invalid SQL queries (thanks to self-reflection) and increased the number of correct answers to 7 out of 10.

    That’s it. It’s been quite a journey. We started with 0 valid SQL queries and have now achieved 70% accuracy.
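    For reference, the numbers from each iteration can be put side by side. This is purely illustrative: the figures below are hard-coded from the results reported in this article (out of 10 evaluation questions), and the number of valid queries for RAG without self-reflection was not reported separately, so it is left as None.

    import pandas as pd

    # results reported in this article, hard-coded for illustration (out of 10 questions)
    results = pd.DataFrame([
        {'approach': 'baseline', 'valid_sql_queries': 0, 'correct_answers': 0},
        {'approach': 'self-reflection', 'valid_sql_queries': 5, 'correct_answers': 1},
        {'approach': 'RAG', 'valid_sql_queries': None, 'correct_answers': 6},
        {'approach': 'RAG + self-reflection', 'valid_sql_queries': 10, 'correct_answers': 7},
    ])
    print(results)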

    You can find the complete code on GitHub.

    Summary

    In this article, we explored the iterative process of improving accuracy for LLM applications.

    • We built an evaluation set and scoring criteria that allowed us to compare different iterations and understand whether we were moving in the right direction.
    • We leveraged self-reflection to allow the LLM to correct its mistakes and significantly reduce the number of invalid SQL queries.
    • Additionally, we implemented Retrieval-Augmented Generation (RAG) to further enhance the quality, achieving an accuracy rate of 60–70%.

    While this is a solid result, it still falls short of the 90%+ accuracy threshold typically expected for production applications. To achieve such a high bar, we need to use fine-tuning, which will be the topic of the next article.

    Thank you very much for reading this article. I hope you found it insightful. If you have any follow-up questions or comments, please leave them in the comments section.

    References

    All the images are produced by the author unless otherwise stated.

    This article is inspired by the “Improving Accuracy of LLM Applications” short course from DeepLearning.AI.



  • 2024 Highlights: The AI and Data Science Articles That Made a Splash

    TDS Editors

    Feeling inspired to write your first TDS post before the end of 2024? We’re always open to contributions from new authors.

    And just like that, 2024 is (almost) in the books. It was a year of exciting transitions — both for the TDS team and, in many meaningful ways, for the data science, machine learning, and AI communities at large. We’d like to thank all of you—readers, authors, and followers—for your support, and for keeping us busy and engaged with your excellent contributions and comments.

    Unlike in 2023, when a single event (ChatGPT’s launch just weeks before the beginning of the year) stopped everyone in their tracks and shaped conversations for months on end, this year we experienced a more cumulative and fragmented sense of transformation. Practitioners across industry and academia experimented with new tools and worked hard to find innovative ways to benefit from the rapid rise of LLMs; at the same time, they also had to navigate a challenging job market and a world where AI’s footprint inches ever closer to their own everyday workflows.

    Photo by Oskars Sylwan on Unsplash

    To help you make sense of these developments, we published more than 3,500 articles this past year, including hundreds from first-time contributors. Our authors have an incredible knack for injecting their unique perspective into any topic they cover—from big questions and timely topics to more focused technical challenges—and we’re proud of every post we published in 2024.

    Within this massive creative output, some articles manage to resonate particularly well with our readers, and we’re dedicating our final Variable edition to these: our most-read, -discussed, and -shared posts of the year. As you might expect, they cover a lot of ground, so we’ve decided to arrange them following the major themes we’ve detected this year: learning and building from scratch, RAG and AI agents, career growth, and breakthroughs and innovation.

    We hope you enjoy exploring our 2024 highlights, and we wish you a relaxing end of the year — see you in January!

    Learning and Building from Scratch

    The most reliably popular type of TDS post is the one that teaches readers how to do or study something interesting and productive on their own, and with minimal prerequisites. This year is no exception—our three most-read articles of 2024 fall under this category.

    RAG and AI Agents

    Once the initial excitement surrounding LLMs settled (a bit), data and ML professionals realized that these powerful models aren’t all that useful out of the box. Retrieval-augmented generation and agentic AI rose to prominence in the past year as the two leading approaches that bridge the gap between the models’ potential and real-world value; they also ended up being our most covered technical topics in recent months.

    Career Growth

    Data science and machine learning career paths continue to evolve, and the need to adapt to this changing terrain can generate nontrivial amounts of stress for many professionals, whether they’re deep into their career or are just starting out. We love publishing personal reflections on this topic when they also offer readers pragmatic advice—here are four that stood out to us (and to our readers).

    Breakthroughs and Innovation

    Staying up-to-date with cutting-edge research and new tools can feel overwhelming at times, which is why we have a particular soft spot for top-notch paper walkthroughs and primers on emerging libraries and models. Here are three such articles that particularly resonated with our audience.

    Thank you for supporting the work of our authors in 2024! If writing for TDS is one of your goals for 2025, why not get started now? Don’t hesitate to share your work with us.

    Until the next Variable, coming your way in the first week of January,

    TDS Team



  • Add a generative AI experience to your website or web application with Amazon Q embedded

    Add a generative AI experience to your website or web application with Amazon Q embedded

    Bobby Williams

    Amazon Q embedded is a feature that lets you embed a hosted Amazon Q Business assistant on your website or application to create more personalized experiences that boost end-users’ productivity. In this post, we demonstrate how to use the Amazon Q embedded feature to add an Amazon Q Business assistant to your website or web application using basic HTML or React.
