Tag: artificial intelligence

  • The Important Role of Memory in Agentic AI

    Sandi Besen

    Memory in AI: Key Benefits and Investment Considerations

    Memory will be a critical component that dramatically improves the performance of AI systems — both in accuracy and efficiency

    Just as humans depend on memory to make informed decisions and draw logical conclusions, AI relies on its ability to retrieve relevant information, understand contexts, and learn from past experiences. This article delves into why memory is pivotal for AI, exploring its role in recall, reasoning, and continuous learning.

Image: colorful brain with a microchip, representing memory. Source: DALL·E 3

    Memory’s Role in Recall

Some believe that enlarging the context window will enhance model performance, as it allows the model to ingest more information. While this is true to an extent, our current understanding of how language models prioritize context is still developing. In fact, studies have shown that “model performance is highest when relevant information occurs at the beginning or end of its input context.”[1] The larger the context window, the more likely we are to encounter the infamous “lost in the middle” problem, where the model fails to recall specific facts or text because the important information is buried in the middle [2].

To understand how memory impacts recall, consider how humans process information. When we travel, we passively listen to many announcements, including airline advertisements, credit card offers, safety briefings, luggage collection details, and so on. We may not realize how much information we absorb until it is time to recall the relevant pieces. For instance, if a language model relies on retrieved information rather than its inherent knowledge to answer the question “What should I do in case of an emergency landing?”, it might not be able to recall the pertinent details because too much other information is retrieved alongside them. With long-term memory, however, the model can store and recall the most critical information, enabling more effective reasoning with the proper context.

    Memory’s Role in Reasoning and Continuous Learning

Memory doesn’t just provide essential context; it also equips models with the ability to recall the methods previously used to solve problems, recognize successful strategies, and pinpoint areas needing improvement. This, in turn, aids the model’s ability to reason effectively about complex multi-step tasks. Without adequate reasoning, language models struggle to understand tasks, think logically about objectives, solve multistep problems, or utilize appropriate tools. You can read more about the importance of reasoning and advanced reasoning techniques in my previous article here.

Consider the example of manually finding relevant data in a company’s data warehouse. There are thousands of tables, but because you understand what data is needed, you can focus on a subset. After hours of searching, relevant data is found across five different tables. Three months later, when the data needs updating, the search process must be repeated because you can’t remember the five source tables you used to create the report, so the manual search starts all over again. Without long-term memory, a language model might approach the problem the same way — with brute force — until it finds the relevant data to complete the task. However, a language model equipped with long-term memory could store its initial search plan, a description of each table, and a revised plan based on its findings from each table. When the data needs refreshing, it can start from a previously successful approach, improving efficiency and performance.

    This method allows systems to learn over time, continually revising the best approach to tasks and accumulating knowledge to produce more efficient, higher-performing autonomous systems.
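To make this concrete, here is a minimal sketch of what such a long-term memory store could look like. The TaskMemory class, file layout, and table names are purely illustrative assumptions, not a reference to any particular framework's API.

import json
from pathlib import Path

class TaskMemory:
    """Illustrative long-term memory: persists the last successful plan per task."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def recall(self, task: str):
        # Return the previously successful plan for this task, if one was stored.
        return self.store.get(task)

    def remember(self, task: str, plan: dict):
        # Persist the revised plan so the next run can start from it.
        self.store[task] = plan
        self.path.write_text(json.dumps(self.store, indent=2))

memory = TaskMemory()
task = "refresh quarterly sales report"
plan = memory.recall(task)
if plan is None:
    # First run: the agent searches by brute force, then records what worked.
    plan = {"source_tables": ["orders", "customers", "payments", "regions", "products"],
            "notes": "joined on customer_id; filtered to the last quarter"}
    memory.remember(task, plan)
# On later runs, the agent starts directly from plan["source_tables"].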

    Evaluating the Investment to Include Long-Term Memory in Your AI Solutions

Incorporating long-term memory into AI systems can significantly enhance their capabilities, but determining whether this capability is worth the necessary development investment requires careful consideration.

    1. Understand the Nature of the Tasks

• Complexity and Duration: If your tasks involve complex, multi-step processes or require information retention over long periods, long-term memory can improve efficiency and accuracy. For example, project management applications, where tasks span months, can benefit from AI’s ability to remember and adapt based on previous context and iterations.
• Context Sensitivity: Tasks that heavily depend on contextual understanding, such as customer service interactions, personalization in marketing, or medical diagnostics, can leverage long-term memory to provide more personalized responses. For instance, an IT help desk assistant would benefit from remembering whether a customer has already encountered a problem and how it was resolved during previous interactions.

    2. Assess the Volume and Variability of Data

    • High Data Volume: If your application deals with large amounts of data that need to be referenced regularly, long-term memory can prevent the need for repeatedly processing the same information — saving time and computational resources.
    • Data Variability: In environments where the data changes frequently, long-term memory helps in keeping the AI updated with the latest information, ensuring more accurate outputs without having to re-train.

    3. Evaluate the Cost-Benefit Ratio

• Balance Cost with Performance Benefit: Implementing long-term memory can be resource-intensive and will continue to scale over time as more memories accumulate. It is important to weigh the financial investment of data storage against potential performance improvements. For small businesses or applications with limited resources, the efficiency of Small Language Models (SLMs) with long-term memory might offer a more balanced solution.
    • Competitive Advantage: By improving the efficiency and effectiveness of AI applications, long-term memory can provide a significant competitive edge, enabling businesses to offer superior services compared to those using traditional models without memory capabilities.

    4. Address Security and Compliance Concerns

• Data Privacy: Long-term memory involves storing more data, which can raise privacy concerns. Ensure that your system complies with data protection regulations and that sensitive information is handled according to security best practices.

    In Essence…

Incorporating long-term memory into AI systems presents a significant opportunity to enhance their capabilities by providing improvements in accuracy, efficiency, and contextual understanding. However, deciding whether to invest in this capability requires careful consideration and a cost-benefit analysis. If implemented strategically, the inclusion of long-term memory can deliver tangible benefits to your AI solutions.

Have questions or think that something needs further clarification? Drop me a DM on LinkedIn! I’m always eager to exchange food for thought and iterate on my work. My work does not represent the opinion of my employer.



  • Data Privacy in AI Development: Data Localization

    Stephanie Kirmer

    Why should you care where your data lives?

    Photo by Luke Stackpoole on Unsplash

    In the process of writing my talk for the AI Quality Conference coming up on June 25 in San Francisco (tickets still available!) I have come across many topics that deserve more time than the brief mentions I will be able to give in my talk. In order to give everyone more detail and explain the topics better, I’m starting a small series of columns about things related to developing machine learning and AI while still being careful about data privacy and security. Today I’m going to start with data localization.

    Before I begin, we should clarify what is covered by data privacy and security regulation. In short, this is applicable to “personal data”. But what counts as personal data? This depends on the jurisdiction, but it usually includes PII (name, phone number, etc.) PLUS data that could be combined together to make someone identifiable (zip code, birthday, gender, race, political affiliation, religion, and so on). This includes photos, video or audio recordings of someone, details about their computer or browser, search history, biometrics, and much more. GDPR’s rules about this are explained here.

    With that covered, let’s dig in to data localization and what it has to do with us as machine learning developers.

    What’s Data Localization?

    Glad you asked! Data localization is essentially the question of what geographic place your data is stored in — if you localize your data, you are keeping it where it was created. (This is also sometimes known as “data residency”, and the opposite is “data portability”.) If your dataset is on AWS S3 in us-east-1, your data is actually living, physically (inasmuch as data lives anywhere), in the United States, somewhere in Northern Virginia. To get more precise, AWS has several specific data centers in Northern Virginia, and you can get their exact addresses online. But for most of us, knowing the general area at this grain is sufficient.

    Why should I care where the datacenter is? Isn’t the cloud just ‘everywhere’?

    There are good reasons to know where your data lives. For one thing, there can be real physical speed implications for loading/writing data to the cloud depending on how far you and your computer are from the region where the datacenter is located. But this is likely not a huge deal unless you’re doing crazy high-speed computations.

    A more important reason to care (and the reason this is a part of data privacy), is that data privacy laws around the world (as well as your contracts with clients and consent forms filled out by your customers) have rules about data localization. Regulation on data localization involves requiring personal data about citizens or residents of a place to be stored on servers in that same place.

    General caveats:

    • It doesn’t always apply to all kinds of data (financial data is more often covered)
    • It doesn’t always apply to all kinds of businesses (tech companies are more often covered)
    • It may be triggered by a government request, or it might be automatic (see Vietnam)
    • Sometimes there are ways you can get consent to move the data, sometimes not
    • Sometimes you just need to initially store the data in country, and then you can move it around later (see Russia)
    • Sometimes you can store it outside the country of origin but there are limits on where else it can go (see EU)

    In addition, private companies sometimes impose data localization requirements in contracts, potentially to comply with these laws, or to reduce the risk of data breach or surveillance by other governments on the data.

    This means, literally, that you may be legally limited on the locations of data centers where you can store certain data, primarily based on who the data is about, or who the original owner of the data was.

    Example

    It may be easier to understand this with a concrete (simplified) example.

    • You run a website where people can make purchases. You collect data during these purchases, such as credit card details, address, name, IP address, and some other things. Your consent banner/fine print doesn’t say anything about data localization.
    • You get customers from Russia, India, and the United Arab Emirates.
• Unless you got explicit consent, all of the personal data from these visitors is subject to different data localization rules.

    What does this mean for you? All this data needs to be processed differently.

    • The data from Russian customers needs to be initially stored in a Russian-based server, and then may be transferred depending on the applicable rules.
    • The data from EU customers can be stored in countries that have sufficient data security laws (notably, not Russia).
    • The UAE customer data needs to be stored in the UAE because you didn’t get consent from these customers to store it elsewhere.

    This creates obvious problems for data engineering, since you need separate pipelines for all of the data. It’s also a challenge for modeling and training — how do you construct a dataset to actually use?

    Get consent

    If you had gotten consent from the UAE customers to move data, you’d probably be ok. Data engineering would still have to pipe the data from Russian customers through a special path, but you could combine the data for training. However, because you didn’t, you’re stuck! Make sure that you know what permissions and authorizations your consent tool includes, so you don’t get in this mess.

    On the fly combination

    Assuming it’s too late to do that, another solution is to have a compute platform that loads from different databases at time of training, combines the dataset in the moment, and trains the model without ever writing any of the data to disk in a single place. The general consensus (NOT LEGAL ADVICE) is that models are not themselves personal data, and thus not subject to the legal rules. But this takes work and infrastructure, so get your dev-ops hat on.
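As a rough sketch of that pattern, the snippet below pulls each region's slice from its own in-region database at training time and combines the frames only in memory. The connection strings, table, and column names are hypothetical placeholders, not a specific recommended setup.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical per-region connection strings; each points at a database hosted
# in the region where that data must reside.
REGION_DBS = {
    "ru": "postgresql://user:pass@db-ru.internal/sales",
    "eu": "postgresql://user:pass@db-eu.internal/sales",
    "us": "postgresql://user:pass@db-us.internal/sales",
}

def load_training_frame(query: str) -> pd.DataFrame:
    # Load each region's slice at training time and combine in memory only;
    # nothing gets written back to disk in a single location.
    frames = [pd.read_sql(query, create_engine(url)) for url in REGION_DBS.values()]
    return pd.concat(frames, ignore_index=True)

df = load_training_frame("SELECT age_band, basket_value, churned FROM training_features")
# ...fit the model on df here; only the trained model artifact is persisted...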

If you have extremely large data volumes, this can become computationally expensive very fast. If you generate features from this data but individuals are still identifiable from them, you can't save everything together in one place; you'll need to either store de-identified/aggregated features separately, write them back to the original region, or recalculate them on the fly every time. All of these are tough challenges.

    De-identify and/or aggregate

    Fortunately, there’s another option. Once you have aggregated, summarized, or thoroughly (irreversibly) de-identified the data, it loses the personal data protections and you can then work with it more easily. This is a strong incentive for you not to be storing personal data that is identifiable! (Plus, this reduces your risk of data breaches and being hacked.) Once the data is no longer legally protected because it’s no longer high risk, you can do what you want and carry on with your work, saving the data where you like. Extract non-identifiable features and dispense with identifiable data if you possibly can.

However, deciding when the data is sufficiently aggregated or de-identified so that localization laws no longer apply can sometimes be a tough call, because as I described above, many kinds of demographic data are personal: in combination with other datapoints, they can make someone identifiable. We are often accustomed to thinking that once PII (full names, SSNs, etc.) is removed, the data is fine to use as we like. This is not how the law sees it in many jurisdictions! Consult your legal department and be conscientious about what constitutes risk. Ideally, the safest situation is when the data is no longer personal data at all, e.g. not including names, demographics, addresses, phone numbers, and so on at the individual level or in unhashed, human-readable plaintext. THIS IS NOT LEGAL ADVICE. TALK TO YOUR LEGAL DEPARTMENT.
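As a purely technical illustration of the kind of transformation involved (again, not legal advice, and whether hashing alone counts as irreversible depends on your jurisdiction and your legal team), the sketch below replaces a direct identifier with a salted one-way hash and drops the identifying columns. The column names are made up.

import hashlib
import pandas as pd

SALT = "load-this-from-a-secrets-manager"  # placeholder; never hard-code a real salt

def pseudonymize(value: str) -> str:
    # One-way, salted hash: lets you join the same customer across tables
    # without keeping the raw identifier around.
    return hashlib.sha256((SALT + value).encode()).hexdigest()

df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "zip_code": ["10115", "75001"],
    "basket_value": [42.0, 17.5],
})

df["customer_key"] = df["email"].map(pseudonymize)
deidentified = df.drop(columns=["email", "zip_code"])  # drop direct and quasi-identifiers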

We are very used to being able to take our data anywhere, manipulate it, run calculations, and then store it — on laptops, S3, GCS, or wherever we want. But as we collect more personal data about people, and as more data privacy laws take effect around the world, we need to be more careful about what we do.

    FAQ

    What if you don’t know where the data originated?

    This is a tough situation. If you have some personal data about people and no idea where it came from or where those people were located (and probably also no idea what consent forms they filled out), I think the safe solution is to treat it like this data is sensitive, de-identify the heck out of it, aggregate it if that works for your use case, and make sure it wouldn’t be considered personal or sensitive data under data privacy laws. But if that’s not an option because of how you need to use the data, then it’s time to talk to lawyers.

    What if your company can’t afford datacenters all over the world?

    More or less, this is the same answer. Ideally, you’d get your consent solution in order, but barring that, I’d recommend finding ways to de-identify data immediately upon receiving it from a customer or user. When data comes in from a user, hash that stuff so that it is not reversible, and use that. Be extra cautious about demographics or other sensitive personal data, but definitely deidentify the PII right off. If you never store data that is sensitive or potentially could be reverse engineered to identify someone, then you don’t need to worry about localization. THIS IS NOT LEGAL ADVICE. TALK TO YOUR LEGAL DEPARTMENT.

    Why do countries make these laws?

    There are a few reasons, some better than others. First, if the data is actually stored in country, then you have something of a business presence there (or your data storage provider does) so it’s a lot easier for them to have jurisdiction to penalize you if you misuse their citizens’ data. Second, this supports economic development of the tech sector in whatever country, because someone needs to provide the power, cooling, staffing, construction, and so on to the data centers. Third, unfortunately some countries have surveillance regimes on their own citizens, and having data centers in country makes it easier for totalitarian governments to access this data.

    What can I do to make this hurt less as a data scientist?

    Plan ahead! Work with your company’s relevant parties to make sure the initial data processing is compliant while still getting you the data you need. And make sure you’re in the loop about the consent that customers are giving, and what permissions it enables. If you still find yourself in possession of data with localization rules, then you need to either find a way to manage this data so that it is never saved to a disk that is in the wrong location, or deidentify and/or aggregate the data in a way so that it is no longer sensitive, so the data privacy regulations no longer apply.

    What are some of the major data localization laws I need to know about?

    Here are some highlights, but this is not comprehensive because there are many such laws and new ones coming along all the time. (Again, none of this is legal advice):

    There are other potential considerations, such as the size of your company (some places have less restrictive rules for small companies, some don’t), so none of this should be taken as the conclusive answer for your business.

    Conclusion

    If you made it this far, thanks! I know this can get dry, but I’ll reward you with a story. I once worked at a company where we had data localization provisions in contracts (not the law, but another business setting these rules), so any data generated in the EU needed to be in the EU, but we had already set up data storage for North America in the US.

For a variety of reasons, this meant that a new replica database containing just the EU data was created, based in the EU, and we kept these two versions of the entire Snowflake database in parallel. As you may expect, this was a nightmare, because if you created a new table, changed fields, or basically did anything in the database, you had to remember to duplicate the work on the other one. Naturally, most folks did not remember to do this, so the two databases diverged drastically, to the point where the schemas were significantly different. We ended up with endless conditional code for queries and extraction jobs so we’d have the right column names, types, table names, etc., depending on which database you were pulling from, allowing us to do “on the fly” combination without saving data to the wrong place. (Don’t even get me started on the duplicate dashboards for BI purposes.) I don’t recommend it!

    These regulations pose a real challenge for data scientists in many sectors, but it’s important to keep up on your legal obligations and protect your work and your company from liabilities. Have you encountered localization challenges? Comment on this article if you’ve found solutions that I didn’t mention.

    Further Reading

    https://www.techpolicy.press/the-human-rights-costs-of-data-localization-around-the-world/

    What is considered personal data under the EU GDPR? – GDPR.eu

    https://carnegieendowment.org/research/2023/10/understanding-indias-new-data-protection-law?lang=en

https://irglobal.com/article/all-about-data-localisation-in-india-2/

    https://m.rbi.org.in/Scripts/FAQView.aspx?Id=130

    Decree 53 Provides Long-Awaited Guidance on Implementation of Vietnam’s Cybersecurity Law



  • Benchmarking LLM Inference Backends

    Sean Sheng

    Comparing Llama 3 serving performance on vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI

    Choosing the right inference backend for serving large language models (LLMs) is crucial. It not only ensures an optimal user experience with fast generation speed but also improves cost efficiency through a high token generation rate and resource utilization. Today, developers have a variety of choices for inference backends created by reputable research and industry teams. However, selecting the best backend for a specific use case can be challenging.

    To help developers make informed decisions, the BentoML engineering team conducted a comprehensive benchmark study on the Llama 3 serving performance with vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI on BentoCloud. These inference backends were evaluated using two key metrics:

    • Time to First Token (TTFT): Measures the time from when a request is sent to when the first token is generated, recorded in milliseconds. TTFT is important for applications requiring immediate feedback, such as interactive chatbots. Lower latency improves perceived performance and user satisfaction.
    • Token Generation Rate: Assesses how many tokens the model generates per second during decoding, measured in tokens per second. The token generation rate is an indicator of the model’s capacity to handle high loads. A higher rate suggests that the model can efficiently manage multiple requests and generate responses quickly, making it suitable for high-concurrency environments.
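Conceptually, both metrics can be computed on the client side from a streaming response. Below is a minimal sketch of that measurement; stream_tokens is a stand-in for whatever streaming client your backend exposes and is not part of the actual benchmark code.

import time

def measure_request(stream_tokens, prompt: str):
    # stream_tokens is assumed to be a generator yielding tokens as the backend
    # produces them (for example, wrapping a streaming HTTP response).
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()

    if first_token_at is None:
        raise RuntimeError("backend returned no tokens")

    ttft_ms = (first_token_at - start) * 1000           # Time to First Token
    decode_seconds = end - first_token_at
    tokens_per_second = n_tokens / decode_seconds if decode_seconds > 0 else float("nan")
    return ttft_ms, tokens_per_second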

    Key benchmark findings

    We conducted the benchmark study with the Llama 3 8B and 70B 4-bit quantization models on an A100 80GB GPU instance (gpu.a100.1×80) on BentoCloud across three levels of inference loads (10, 50, and 100 concurrent users). Here are some of our key findings:

    Llama 3 8B

    Llama 3 8B: Time to First Token (TTFT) of Different Backends
    Llama 3 8B: Token Generation Rate of Different Backends
    • LMDeploy: Delivered the best decoding performance in terms of token generation rate, with up to 4000 tokens per second for 100 users. Achieved best-in-class TTFT with 10 users. Although TTFT gradually increases with more users, it remains low and consistently ranks among the best.
• MLC-LLM: Delivered similar decoding performance to LMDeploy with 10 users. Achieved best-in-class TTFT with 10 and 50 users. However, it struggles to maintain that efficiency under very high loads. When concurrency increases to 100 users, the decoding speed and TTFT do not keep up with LMDeploy.
• vLLM: Achieved best-in-class TTFT across all levels of concurrent users. But decoding performance is less optimal compared to LMDeploy and MLC-LLM, at 2,300–2,500 tokens per second, similar to TGI and TensorRT-LLM.

    Llama 3 70B with 4-bit quantization

    Llama 3 70B Q4: Time to First Token (TTFT) of Different Backends
Llama 3 70B Q4: Token Generation Rate of Different Backends
• LMDeploy: Delivered the best token generation rate, with up to 700 tokens per second when serving 100 users, while keeping the lowest TTFT across all levels of concurrent users.
    • TensorRT-LLM: Exhibited similar performance to LMDeploy in terms of token generation rate and maintained low TTFT at a low concurrent user count. However, TTFT increased significantly to over 6 seconds when concurrent users reach 100.
    • vLLM: Demonstrated consistently low TTFT across all levels of concurrent users, similar to what we observed with the 8B model. Exhibited a lower token generation rate compared to LMDeploy and TensorRT-LLM, likely due to a lack of inference optimization for quantized models.

    We discovered that the token generation rate is strongly correlated with the GPU utilization achieved by an inference backend. Backends capable of maintaining a high token generation rate also exhibited GPU utilization rates approaching 100%. Conversely, backends with lower GPU utilization rates appeared to be bottlenecked by the Python process.

    Beyond performance

    When choosing an inference backend for serving LLMs, considerations beyond just performance also play an important role in the decision. The following list highlights key dimensions that we believe are important to consider when selecting the ideal inference backend.

    Quantization

    Quantization trades off precision for performance by representing weights with lower-bit integers. This technique, combined with optimizations from inference backends, enables faster inference and a smaller memory footprint. As a result, we were able to load the weights of the 70B parameter Llama 3 model on a single A100 80GB GPU, whereas multiple GPUs would otherwise be necessary.

    • LMDeploy: Supports 4-bit AWQ, 8-bit quantization, and 4-bit KV quantization.
    • vLLM: Not fully supported as of now. Users need to quantize the model through AutoAWQ or find pre-quantized models on Hugging Face. Performance is under-optimized.
    • TensorRT-LLM: Supports quantization via modelopt, and note that quantized data types are not implemented for all the models.
• TGI: Supports AWQ, GPTQ and bits-and-bytes quantization.
    • MLC-LLM: Supports 3-bit and 4-bit group quantization. AWQ quantization support is still experimental.

    Model architectures

    Being able to leverage the same inference backend for different model architectures offers agility for engineering teams. It allows them to switch between various large language models as new improvements emerge, without needing to migrate the underlying inference infrastructure.

    Hardware limitations

    Having the ability to run on different hardware provides cost savings and the flexibility to select the appropriate hardware based on inference requirements. It also offers alternatives during the current GPU shortage, helping to navigate supply constraints effectively.

    • LMDeploy: Only optimized for Nvidia CUDA
    • vLLM: Nvidia CUDA, AMD ROCm, AWS Neuron, CPU
    • TensorRT-LLM: Only supports Nvidia CUDA
    • TGI: Nvidia CUDA, AMD ROCm, Intel Gaudi, AWS Inferentia
• MLC-LLM: Nvidia CUDA, AMD ROCm, Metal, Android, iOS, WebGPU

    Developer experience

    An inference backend designed for production environments should provide stable releases and facilitate simple workflows for continuous deployment. Additionally, a developer-friendly backend should feature well-defined interfaces that support rapid development and high code maintainability, essential for building AI applications powered by LLMs.

    • Stable releases: LMDeploy, TensorRT-LLM, vLLM, and TGI all offer stable releases. MLC-LLM does not currently have stable tagged releases, with only nightly builds; one possible solution is to build from source.
    • Model compilation: TensorRT-LLM and MLC-LLM require an explicit model compilation step before the inference backend is ready. This step could potentially introduce additional cold-start delays during deployment.
    • Documentation: LMDeploy, vLLM, and TGI were all easy to learn with their comprehensive documentation and examples. MLC-LLM presented a moderate learning curve, primarily due to the necessity of understanding the model compilation steps. TensorRT-LLM was the most challenging to set up in our benchmark test. Without enough quality examples, we had to read through the documentation of TensorRT-LLM, tensorrtllm_backend and Triton Inference Server, convert the checkpoints, build the TRT engine, and write a lot of configurations.

    Concepts

    Llama 3

    Llama 3 is the latest iteration in the Llama LLM series, available in various configurations. We used the following model sizes in our benchmark tests.

    • 8B: This model has 8 billion parameters, making it powerful yet manageable in terms of computational resources. Using FP16, it requires about 16GB of RAM (excluding KV cache and other overheads), fitting on a single A100–80G GPU instance.
    • 70B 4-bit Quantization: This 70 billion parameter model, when quantized to 4 bits, significantly reduces its memory footprint. Quantization compresses the model by reducing the bits per parameter, providing faster inference and lowering memory usage with minimal performance loss. With 4-bit AWQ quantization, it requires approximately 37GB of RAM for loading model weights, fitting on a single A100–80G instance. Serving quantized weights on a single GPU device typically achieves the best throughput of a model compared to serving on multiple devices.
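The back-of-the-envelope arithmetic behind those memory figures is straightforward:

params_8b, params_70b = 8e9, 70e9

print(params_8b * 2 / 1e9)    # FP16 = 2 bytes/param, so roughly 16 GB for the 8B model
print(params_70b * 2 / 1e9)   # roughly 140 GB for 70B in FP16, which needs multiple GPUs
print(params_70b * 0.5 / 1e9) # 4-bit = 0.5 bytes/param, so roughly 35 GB; quantization
                              # scales/zero-points and other overhead bring the AWQ
                              # checkpoint close to the ~37 GB observed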

    Inference platform

    We ensured that the inference backends served with BentoML added only minimal performance overhead compared to serving natively in Python. The overhead is due to the provision of functionality for scaling, observability, and IO serialization. Using BentoML and BentoCloud gave us a consistent RESTful API for the different inference backends, simplifying benchmark setup and operations.

    Inference backends

    Different backends provide various ways to serve LLMs, each with unique features and optimization techniques. All of the inference backends we tested are under Apache 2.0 License.

    • LMDeploy: An inference backend focusing on delivering high decoding speed and efficient handling of concurrent requests. It supports various quantization techniques, making it suitable for deploying large models with reduced memory requirements.
    • vLLM: A high-performance inference engine optimized for serving LLMs. It is known for its efficient use of GPU resources and fast decoding capabilities.
    • TensorRT-LLM: An inference backend that leverages NVIDIA’s TensorRT, a high-performance deep learning inference library. It is optimized for running large models on NVIDIA GPUs, providing fast inference and support for advanced optimizations like quantization.
    • Hugging Face Text Generation Inference (TGI): A toolkit for deploying and serving LLMs. It is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.
    • MLC-LLM: An ML compiler and high-performance deployment engine for LLMs. It is built on top of Apache TVM and requires compilation and weight conversion before serving models.

    Integrating BentoML with various inference backends to self-host LLMs is straightforward. The BentoML community provides the following example projects on GitHub to guide you through the process.

    Benchmark setup

    Models

We tested both the Meta-Llama-3-8B-Instruct and Meta-Llama-3-70B-Instruct 4-bit quantization models. For the 70B model, we performed 4-bit quantization so that it could run on a single A100-80G GPU. If the inference backend supports native quantization, we used the inference backend-provided quantization method. For example, for MLC-LLM, we used the q4f16_1 quantization scheme. Otherwise, we used the AWQ-quantized casperhansen/llama-3-70b-instruct-awq model from Hugging Face.
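For reference, here is a minimal sketch of loading that AWQ checkpoint with vLLM's offline API. The benchmark itself served every backend behind BentoML on BentoCloud, so treat this purely as an illustration of the quantized-model setup rather than the benchmark configuration.

from vllm import LLM, SamplingParams

# Illustrative offline load of the pre-quantized AWQ checkpoint.
llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(
    ["Explain what 4-bit AWQ quantization does."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)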

    Note that other than enabling common inference optimization techniques, such as continuous batching, flash attention, and prefix caching, we did not fine-tune the inference configurations (GPU memory utilization, max number of sequences, paged KV cache block size, etc.) for each individual backend. This is because this approach is not scalable as the number of LLMs we serve gets larger. Providing an optimal set of inference parameters is an implicit measure of performance and ease-of-use of the backends.

    Benchmark client

    To accurately assess the performance of different LLM backends, we created a custom benchmark script. This script simulates real-world scenarios by varying user loads and sending generation requests under different levels of concurrency.

    Our benchmark client can spawn up to the target number of users within 20 seconds, after which it stress tests the LLM backend by sending concurrent generation requests with randomly selected prompts. We tested with 10, 50, and 100 concurrent users to evaluate the system under varying loads.

    Each stress test ran for 5 minutes, during which time we collected inference metrics every 5 seconds. This duration was sufficient to observe potential performance degradation, resource utilization bottlenecks, or other issues that might not be evident in shorter tests.
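The shape of that load pattern can be sketched with a simple asyncio client like the one below. Here, send_request stands in for whatever async call hits your backend; the actual benchmark client (referenced below) collects richer metrics than this sketch.

import asyncio
import random
import time

USERS = 100        # target concurrency level (we tested 10, 50 and 100)
RAMP_UP_S = 20     # spawn all users within 20 seconds
DURATION_S = 300   # each stress test runs for 5 minutes

async def user_loop(prompts, send_request, latencies):
    # Stagger user start times across the ramp-up window.
    await asyncio.sleep(random.uniform(0, RAMP_UP_S))
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        prompt = random.choice(prompts)
        t0 = time.monotonic()
        await send_request(prompt)          # assumed async call to the LLM backend
        latencies.append(time.monotonic() - t0)

async def run_load_test(prompts, send_request):
    latencies = []
    await asyncio.gather(*(user_loop(prompts, send_request, latencies)
                           for _ in range(USERS)))
    return latencies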

    For more information, see the source code of our benchmark client.

    Prompt dataset

    The prompts for our tests were derived from the databricks-dolly-15k dataset. For each test session, we randomly selected prompts from this dataset. We also tested text generation with and without system prompts. Some backends might have additional optimizations regarding common system prompt scenarios by enabling prefix caching.
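A sketch of how prompts can be drawn from that dataset with the Hugging Face datasets library is shown below; the system prompt here is just a placeholder, not the one used in our tests.

import random
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def sample_messages(with_system_prompt: bool = False):
    # Pick a random instruction from the dataset and wrap it as a chat request.
    row = dolly[random.randrange(len(dolly))]
    messages = []
    if with_system_prompt:
        messages.append({"role": "system", "content": "You are a helpful assistant."})
    messages.append({"role": "user", "content": row["instruction"]})
    return messages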

    Library versions

    • BentoML: 1.2.16
    • vLLM: 0.4.2
    • MLC-LLM: mlc-llm-nightly-cu121 0.1.dev1251 (No stable release yet)
    • LMDeploy: 0.4.0
    • TensorRT-LLM: 0.9.0 (with Triton v24.04)
    • TGI: 2.0.4

    Recommendations

    The field of LLM inference optimization is rapidly evolving and heavily researched. The best inference backend available today might quickly be surpassed by newcomers. Based on our benchmarks and usability studies conducted at the time of writing, we have the following recommendations for selecting the most suitable backend for Llama 3 models under various scenarios.

    Llama 3 8B

    For the Llama 3 8B model, LMDeploy consistently delivers low TTFT and the highest decoding speed across all user loads. Its ease of use is another significant advantage, as it can convert the model into TurboMind engine format on the fly, simplifying the deployment process. At the time of writing, LMDeploy offers limited support for models that utilize sliding window attention mechanisms, such as Mistral and Qwen 1.5.

    vLLM consistently maintains a low TTFT, even as user loads increase, making it suitable for scenarios where maintaining low latency is crucial. vLLM offers easy integration, extensive model support, and broad hardware compatibility, all backed by a robust open-source community.

    MLC-LLM offers the lowest TTFT and maintains high decoding speeds at lower concurrent users. However, under very high user loads, MLC-LLM struggles to maintain top-tier decoding performance. Despite these challenges, MLC-LLM shows significant potential with its machine learning compilation technology. Addressing these performance issues and implementing a stable release could greatly enhance its effectiveness.

    Llama 3 70B 4-bit quantization

    For the Llama 3 70B Q4 model, LMDeploy demonstrates impressive performance with the lowest TTFT across all user loads. It also maintains a high decoding speed, making it ideal for applications where both low latency and high throughput are essential. LMDeploy also stands out for its ease of use, as it can quickly convert models without the need for extensive setup or compilation, making it ideal for rapid deployment scenarios.

    TensorRT-LLM matches LMDeploy in throughput, yet it exhibits less optimal latency for TTFT under high user load scenarios. Backed by Nvidia, we anticipate these gaps will be quickly addressed. However, its inherent requirement for model compilation and reliance on Nvidia CUDA GPUs are intentional design choices that may pose limitations during deployment.

    vLLM manages to maintain a low TTFT even as user loads increase, and its ease of use can be a significant advantage for many users. However, at the time of writing, the backend’s lack of optimization for AWQ quantization leads to less than optimal decoding performance for quantized models.

    Acknowledgements

The article and accompanying benchmarks were created collaboratively with my esteemed colleagues, Rick Zhou, Larme Zhao, and Bo Jiang. All images presented in this article were created by the authors.



  • How Twilio used Amazon SageMaker MLOps pipelines with PrestoDB to enable frequent model retraining and optimized batch transform

    Madhur Prashant

    This post is co-written with Shamik Ray, Srivyshnav K S, Jagmohan Dhiman and Soumya Kundu from Twilio. Today’s leading companies trust Twilio’s Customer Engagement Platform (CEP) to build direct, personalized relationships with their customers everywhere in the world. Twilio enables companies to use communications and data to add intelligence and security to every step of […]


  • Accelerate deep learning training and simplify orchestration with AWS Trainium and AWS Batch

    Scott Perry

    In large language model (LLM) training, effective orchestration and compute resource management poses a significant challenge. Automation of resource provisioning, scaling, and workflow management is vital for optimizing resource usage and streamlining complex workflows, thereby achieving efficient deep learning training processes. Simplified orchestration enables researchers and practitioners to focus more on model experimentation, hyperparameter tuning, […]


  • Multi AI Agent Systems 101

    Mariya Mansurova

    Automating Routine Tasks in Data Source Management with CrewAI

    Image by DALL-E 3

    Initially, when ChatGPT just appeared, we used simple prompts to get answers to our questions. Then, we encountered issues with hallucinations and began using RAG (Retrieval Augmented Generation) to provide more context to LLMs. After that, we started experimenting with AI agents, where LLMs act as a reasoning engine and can decide what to do next, which tools to use, and when to return the final answer.

    The next evolutionary step is to create teams of such agents that can collaborate with each other. This approach is logical as it mirrors human interactions. We work in teams where each member has a specific role:

    • The product manager proposes the next project to work on.
    • The designer creates its look and feel.
    • The software engineer develops the solution.
    • The analyst examines the data to ensure it performs as expected and identifies ways to improve the product for customers.

    Similarly, we can create a team of AI agents, each focusing on one domain. They can collaborate and reach a final conclusion together. Just as specialization enhances performance in real life, it could also benefit the performance of AI agents.

    Another advantage of this approach is increased flexibility. Each agent can operate with its own prompt, set of tools and even LLM. For instance, we can use different models for different parts of our system. You can use GPT-4 for the agent that needs more reasoning and GPT-3.5 for the one that does only simple extraction. We can even fine-tune the model for small specific tasks and use it in our crew of agents.

    The potential drawbacks of this approach are time and cost. Multiple interactions and knowledge sharing between agents require more calls to LLM and consume additional tokens. This could result in longer wait times and increased expenses.

    There are several frameworks available for multi-agent systems today.
    Here are some of the most popular ones:

• AutoGen: Developed by Microsoft, AutoGen uses a conversational approach and was one of the earliest frameworks for multi-agent systems.
    • LangGraph: While not strictly a multi-agent framework, LangGraph allows for defining complex interactions between actors using a graph structure. So, it can also be adapted to create multi-agent systems.
    • CrewAI: Positioned as a high-level framework, CrewAI facilitates the creation of “crews” consisting of role-playing agents capable of collaborating in various ways.

I’ve decided to start my experiments with multi-agent frameworks from CrewAI, since it’s widely popular and user-friendly. So, it looks like a good option to begin with.

    In this article, I will walk you through how to use CrewAI. As analysts, we’re the domain experts responsible for documenting various data sources and addressing related questions. We’ll explore how to automate these tasks using multi-agent frameworks.

    Setting up the environment

    Let’s start with setting up the environment. First, we need to install the CrewAI main package and an extension to work with tools.

    pip install crewai
    pip install 'crewai[tools]'

CrewAI was developed to work primarily with the OpenAI API, but I would also like to try it with a local model. According to the ChatBot Arena Leaderboard, the best model you can run on your laptop is Llama 3 (8B parameters). It will be the most feasible option for our use case.

    We can access Llama models using Ollama. Installation is pretty straightforward. You need to download Ollama from the website and then go through the installation process. That’s it.

    Now, you can test the model in CLI by running the following command.

    ollama run llama3

For example, you can ask it a simple question to make sure everything works.

    Let’s create a custom Ollama model to use later in CrewAI.

We will start with a ModelFile (documentation). I only specified the base model (llama3), temperature, and stop sequence. However, you might add more features. For example, you can define the system message using the SYSTEM keyword.

    FROM llama3

    # set parameters
    PARAMETER temperature 0.5
    PARAMETER stop Result

I’ve saved it into a file named Llama3ModelFile.

    Let’s create a bash script to load the base model for Ollama and create the custom model we defined in ModelFile.

    #!/bin/zsh

    # define variables
    model_name="llama3"
    custom_model_name="crewai-llama3"

    # load the base model
    ollama pull $model_name

    # create the model file
    ollama create $custom_model_name -f ./Llama3ModelFile

    Let’s execute this file.

    chmod +x ./llama3_setup.sh
    ./llama3_setup.sh

    You can find both files on GitHub: Llama3ModelFile and llama3_setup.sh

We need to initialise the following environment variables to use the local Llama model with CrewAI.

    os.environ["OPENAI_API_BASE"]='http://localhost:11434/v1'

    os.environ["OPENAI_MODEL_NAME"]='crewai-llama3'
    # custom_model_name from the bash script

    os.environ["OPENAI_API_KEY"] = "NA"

    We’ve finished the setup and are ready to continue our journey.

    Use cases: working with documentation

    As analysts, we often play the role of subject matter experts for data and some data-related tools. In my previous team, we used to have a channel with almost 1K participants, where we were answering lots of questions about our data and the ClickHouse database we used as storage. It took us quite a lot of time to manage this channel. It would be interesting to see whether such tasks can be automated with LLMs.

For this example, I will use the ClickHouse database. If you’re interested, you can learn more about ClickHouse and how to set it up locally in my previous article. However, we won’t utilise any ClickHouse-specific features, so feel free to stick to the database you know.

    I’ve created a pretty simple data model to work with. There are just two tables in our DWH (Data Warehouse): ecommerce_db.users and ecommerce_db.sessions. As you might guess, the first table contains information about the users of our service.

    The ecommerce_db.sessions table stores information about user sessions.

    Regarding data source management, analysts typically handle tasks like writing and updating documentation and answering questions about this data. So, we will use LLM to write documentation for the table in the database and teach it to answer questions about data or ClickHouse.

    But before moving on to the implementation, let’s learn more about the CrewAI framework and its core concepts.

    CrewAI basic concepts

The cornerstone of a multi-agent framework is the agent concept. In CrewAI, agents are powered by role-playing. Role-playing is a tactic where you ask an agent to adopt a persona and behave like a top-notch backend engineer or a helpful customer support agent. So, when creating a CrewAI agent, you need to specify each agent’s role, goal, and backstory so that the LLM knows enough to play this role.

    The agents’ capabilities are limited without tools (functions that agents can execute and get results). With CrewAI, you can use one of the predefined tools (for example, to search the Internet, parse a website, or do RAG on a document), create a custom tool yourself or use LangChain tools. So, it’s pretty easy to create a powerful agent.

    Let’s move on from agents to the work they are doing. Agents are working on tasks (specific assignments). For each task, we need to define a description, expected output (definition of done), set of available tools and assigned agent. I really like that these frameworks follow the managerial best practices like a clear definition of done for the tasks.

    The next question is how to define the execution order for tasks: which one to work on first, which ones can run in parallel, etc. CrewAI implemented processes to orchestrate the tasks. It provides a couple of options:

• Sequential — the most straightforward approach, where tasks are called one after another.
    • Hierarchical — when there’s a manager (specified as LLM model) that creates and delegates tasks to the agents.

    Also, CrewAI is working on a consensual process. In such a process, agents will be able to make decisions collaboratively with a democratic approach.

    There are other levers you can use to tweak the process of tasks’ execution:

    • You can mark tasks as “asynchronous”, then they will be executed in parallel, so you will be able to get an answer faster.
    • You can use the “human input” flag on a task, and then the agent will ask for human approval before finalising the output of this task. It can allow you to add an oversight to the process.

We’ve defined all the primary building blocks and can discuss the holy grail of CrewAI — the crew concept. The crew represents the team of agents and the set of tasks they will be working on. The approach for collaboration (the processes we discussed above) can also be defined at the crew level.

    Also, we can set up the memory for a crew. Memory is crucial for efficient collaboration between the agents. CrewAI supports three levels of memory:

    • Short-term memory stores information related to the current execution. It helps agents to work together on the current task.
    • Long-term memory is data about the previous executions stored in the local database. This type of memory allows agents to learn from earlier iterations and improve over time.
    • Entity memory captures and structures information about entities (like personas, cities, etc.)

    Right now, you can only switch on all types of memory for a crew without any further customisation. However, it doesn’t work with the Llama models.
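To give a feel for how these levers are wired together, here is a sketch of a crew definition with an explicit process and memory switched on. The agent and task variables are placeholders for objects defined the way we will do below, and, as noted, memory = True would not work with the local Llama setup used in this article.

from crewai import Crew, Process

crew = Crew(
    agents = [first_agent, second_agent],   # placeholders for your Agent objects
    tasks = [first_task, second_task],      # placeholders for your Task objects
    process = Process.sequential,           # or Process.hierarchical with a manager LLM
    memory = True,                          # switches on short-term, long-term and entity memory
    verbose = 2
)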

    We’ve learned enough about the CrewAI framework, so it’s time to start using this knowledge in practice.

    Use case: writing documentation

    Let’s start with a simple task: putting together the documentation for our DWH. As we discussed before, there are two tables in our DWH, and I would like to create a detailed description for them using LLMs.

    First approach

    In the beginning, we need to think about the team structure. Think of this as a typical managerial task. Who would you hire for such a job?

    I would break this task into two parts: retrieving data from a database and writing documentation. So, we need a database specialist and a technical writer. The database specialist needs access to a database, while the writer won’t need any special tools.

    Now, we have a high-level plan. Let’s create the agents.

    For each agent, I’ve specified the role, goal and backstory. I’ve tried my best to provide agents with all the needed context.

database_specialist_agent = Agent(
    role = "Database specialist",
    goal = "Provide data to answer business questions using SQL",
    backstory = '''You are an expert in SQL, so you can help the team
    to gather needed data to power their decisions.
    You are very accurate and take into account all the nuances in data.''',
    allow_delegation = False,
    verbose = True
)

tech_writer_agent = Agent(
    role = "Technical writer",
    goal = '''Write engaging and factually accurate technical documentation
    for data sources or tools''',
    backstory = '''You are an expert in both technology and communications,
    so you can easily explain even sophisticated concepts.
    You base your work on the factual information provided by your colleagues.
    Your texts are concise and can be easily understood by a wide audience.
    You use a professional but rather informal style in your communication.''',
    allow_delegation = False,
    verbose = True
)

    We will use a simple sequential process, so there’s no need for agents to delegate tasks to each other. That’s why I specified allow_delegation = False.

    The next step is setting the tasks for agents. But before moving to them, we need to create a custom tool to connect to the database.

    First, I put together a function to execute ClickHouse queries using HTTP API.

import requests

CH_HOST = 'http://localhost:8123' # default address

def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
    # Send the query to ClickHouse over the HTTP interface.
    r = requests.post(host, params = {'query': query},
                      timeout = connection_timeout)
    if r.status_code == 200:
        return r.text
    else:
        return 'Database returned the following error:\n' + r.text

    When working with LLM agents, it’s important to make tools fault-tolerant. For example, if the database returns an error (status_code != 200), my code won’t throw an exception. Instead, it will return the error description to the LLM so it can attempt to resolve the issue.

    To create a CrewAI custom tool, we need to derive our class from crewai_tools.BaseTool, implement the _run method and then create an instance of this class.

from crewai_tools import BaseTool

class DatabaseQuery(BaseTool):
    name: str = "Database Query"
    description: str = "Returns the result of SQL query execution"

    def _run(self, sql_query: str) -> str:
        # Execute the query and return either the data or the error description
        return get_clickhouse_data(sql_query)

database_query_tool = DatabaseQuery()

    Now, we can set the tasks for the agents. Again, providing clear instructions and all the context to LLM is crucial.

table_description_task = Task(
    description = '''Provide the comprehensive overview for the data
    in table {table}, so that it's easy to understand the structure
    of the data. This task is crucial to put together the documentation
    for our database''',
    expected_output = '''The comprehensive overview of {table} in the md format.
    Include 2 sections: columns (list of columns with their types)
    and examples (the first 30 rows from table).''',
    tools = [database_query_tool],
    agent = database_specialist_agent
)

table_documentation_task = Task(
    description = '''Using provided information about the table,
    put together the detailed documentation for this table so that
    people can use it in practice''',
    expected_output = '''Well-written detailed documentation describing
    the data scheme for the table {table} in markdown format,
    that gives the table overview in 1-2 sentences and then
    describes each column. Structure the columns description
    as a markdown table with column name, type and description.''',
    tools = [],
    output_file = "table_documentation.md",
    agent = tech_writer_agent
)

    You might have noticed that I’ve used {table} placeholder in the tasks’ descriptions. We will use table as an input variable when executing the crew, and this value will be inserted into all placeholders.

    Also, I’ve specified the output file for the table documentation task to save the final result locally.

    We have all we need. Now, it’s time to create a crew and execute the process, specifying the table we are interested in. Let’s try it with the users table.

crew = Crew(
    agents = [database_specialist_agent, tech_writer_agent],
    tasks = [table_description_task, table_documentation_task],
    verbose = 2
)

result = crew.kickoff({'table': 'ecommerce_db.users'})

    It’s an exciting moment, and I’m really looking forward to seeing the result. Don’t worry if execution takes some time. Agents make multiple LLM calls, so it’s perfectly normal for it to take a few minutes. It took 2.5 minutes on my laptop.

    We asked LLM to return the documentation in markdown format. We can use the following code to see the formatted result in Jupyter Notebook.

    from IPython.display import Markdown
    Markdown(result)

    At first glance, it looks great. We’ve got the valid markdown file describing the users’ table.

    But wait, it’s incorrect. Let’s see what data we have in our table.

    The columns listed in the documentation are completely different from what we have in the database. It’s a case of LLM hallucinations.

    We’ve set verbose = 2 to get the detailed logs from CrewAI. Let’s read through the execution logs to identify the root cause of the problem.

    First, the database specialist couldn’t query the database due to complications with quotes.

    The specialist didn’t manage to resolve this problem. Finally, this chain has been terminated by CrewAI with the following output: Agent stopped due to iteration limit or time limit.

    This means the technical writer didn’t receive any factual information about the data. However, the agent continued and produced completely fake results. That’s how we ended up with incorrect documentation.

    Fixing the issues

    Even though our first iteration wasn’t successful, we’ve learned a lot. We have (at least) two areas for improvement:

    • Our database tool is too difficult for the model, and the agent struggles to use it. We can make the tool more tolerant by removing quotes from the beginning and end of the queries. This solution is not ideal since valid SQL can end with a quote, but let’s try it.
    • Our technical writer isn’t basing its output on the input from the database specialist. We need to tweak the prompt to highlight the importance of providing only factual information.

    So, let’s try to fix these problems. First, we will fix the tool — we can leverage strip to eliminate quotes.

    CH_HOST = 'http://localhost:8123' # default address 

    def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
        # Strip quotes that the agent sometimes wraps around the whole query
        r = requests.post(host, params = {'query': query.strip('"').strip("'")},
            timeout = connection_timeout)
        if r.status_code == 200:
            return r.text
        else:
            return 'Database returned the following error:\n' + r.text

    Then, it’s time to update the prompt. I’ve included statements emphasizing the importance of sticking to the facts in both the agent and task definitions.


    tech_writer_agent = Agent(
    role = "Technical writer",
    goal = '''Write engaging and factually accurate technical documentation
    for data sources or tools''',
    backstory = '''
    You are an expert in both technology and communications, so you
    can easily explain even sophisticated concepts.
    Your texts are concise and can be easily understood by a wide audience.
    You use a professional but rather informal style in your communication.
    You base your work on the factual information provided by your colleagues.
    You stick to the facts in the documentation and use ONLY
    the information provided by your colleagues, without adding anything.''',
    allow_delegation = False,
    verbose = True
    )

    table_documentation_task = Task(
    description = '''Using provided information about the table,
    put together the detailed documentation for this table so that
    people can use it in practice''',
    expected_output = '''Well-written detailed documentation describing
    the data scheme for the table {table} in markdown format,
    that gives the table overview in 1-2 sentences and then
    describes each column. Structure the columns description
    as a markdown table with column name, type and description.
    The documentation is based ONLY on the information provided
    by the database specialist without any additions.''',
    tools = [],
    output_file = "table_documentation.md",
    agent = tech_writer_agent
    )

    Let’s execute our crew once again and see the results.

    We’ve achieved a somewhat better result. Our database specialist was able to execute queries and view the data, which is a significant win for us. Additionally, we can see all the relevant fields in the result table, though there are lots of other fields as well. So, it’s still not entirely correct.

    I once again looked through the CrewAI execution log to figure out what went wrong. The issue lies in getting the list of columns: the query has no filter by database, so it also returns columns from identically named tables in other databases, and they end up in the result.

    SELECT column_name 
    FROM information_schema.columns
    WHERE table_name = 'users'
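    For reference, adding a database filter would avoid pulling in columns from other databases. Here’s a minimal sketch, run through the get_clickhouse_data helper we defined earlier and assuming ClickHouse’s information_schema.columns exposes the database name via the standard table_schema column:

    # Restrict the lookup to the ecommerce_db database, so that columns from
    # identically named tables in other databases don't leak into the result.
    print(get_clickhouse_data('''
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema = 'ecommerce_db' AND table_name = 'users'
    '''))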

    Also, after looking at multiple attempts, I noticed that the database specialist, from time to time, executes a select * from <table> query. This might cause issues in production, as it can generate lots of data and send it to the LLM.

    More specialised tools

    We can provide our agent with more specialised tools to improve our solution. Currently, the agent has a tool to execute any SQL query, which is flexible and powerful but prone to errors. We can create more focused tools, such as getting table structure and top-N rows from the table. Hopefully, it will reduce the number of mistakes.

    class TableStructure(BaseTool):
        name: str = "Table structure"
        description: str = "Returns the list of columns and their types"

        def _run(self, table: str) -> str:
            table = table.strip('"').strip("'")
            return get_clickhouse_data(
                'describe {table} format TabSeparatedWithNames'
                .format(table = table)
            )

    class TableExamples(BaseTool):
        name: str = "Table examples"
        description: str = "Returns the first N rows from the table"

        def _run(self, table: str, n: int = 30) -> str:
            table = table.strip('"').strip("'")
            return get_clickhouse_data(
                'select * from {table} limit {n} format TabSeparatedWithNames'
                .format(table = table, n = n)
            )

    table_structure_tool = TableStructure()
    table_examples_tool = TableExamples()

    Now, we need to specify these tools in the task and re-run our script; the updated task definition is sketched below.
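    The change is small: assuming we keep the rest of the task definition exactly as before, we only swap the tools list on the table description task.

    table_description_task = Task(
        description = '''Provide the comprehensive overview for the data
        in table {table}, so that it's easy to understand the structure
        of the data. This task is crucial to put together the documentation
        for our database''',
        expected_output = '''The comprehensive overview of {table} in the md format.
        Include 2 sections: columns (list of columns with their types)
        and examples (the first 30 rows from table).''',
        # the generic SQL tool is replaced with the two focused tools
        tools = [table_structure_tool, table_examples_tool],
        agent = database_specialist_agent
    )

    After the first attempt with the new tools, I got the following output from the Technical Writer.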

    Task output: This final answer provides a detailed and factual description 
    of the ecommerce_db.users table structure, including column names, types,
    and descriptions. The documentation adheres to the provided information
    from the database specialist without any additions or modifications.

    More focused tools helped the database specialist retrieve the correct table information. However, even though the writer had all the necessary information, we didn’t get the expected result: instead of the documentation itself, the agent returned only a short note about it.

    As we know, LLMs are probabilistic, so I gave it another try. And hooray, this time, the result was pretty good.

    It’s not perfect since it still includes some irrelevant comments and lacks the overall description of the table. However, providing more specialised tools has definitely paid off. It also helped to prevent issues when the agent tried to load all the data from the table.

    Quality assurance specialist

    We’ve achieved pretty good results, but let’s see if we can improve them further. A common practice in multi-agent setups is quality assurance: adding a final review stage before the results are finalised.

    Let’s create a new agent — a Quality Assurance Specialist, who will be in charge of review.

    qa_specialist_agent = Agent(
    role = "Quality Assurance specialist",
    goal = """Ensure the highest quality of the documentation we provide
    (that it's correct and easy to understand)""",
    backstory = '''
    You work as a Quality Assurance specialist, checking the work
    from the technical writer and ensuring that it's in line
    with our highest standards.
    You need to check that the technical writer provides full and complete
    answers and makes no assumptions.
    Also, you need to make sure that the documentation addresses
    all the questions and is easy to understand.
    ''',
    allow_delegation = False,
    verbose = True
    )

    Now, it’s time to describe the review task. I’ve used the context parameter to specify that this task requires outputs from both table_description_task and table_documentation_task.

    qa_review_task = Task(
    description = '''
    Review the draft documentation provided by the technical writer.
    Ensure that the documentation fully answers all the questions:
    the purpose of the table and its structure in the form of a table.
    Make sure that the documentation is consistent with the information
    provided by the database specialist.
    Double check that there are no irrelevant comments in the final version
    of documentation.
    ''',
    expected_output = '''
    The final version of the documentation in markdown format
    that can be published.
    The documentation should fully address all the questions, be consistent
    and follow our professional but informal tone of voice.
    ''',
    tools = [],
    context = [table_description_task, table_documentation_task],
    output_file="checked_table_documentation.md",
    agent = qa_specialist_agent
    )

    Let’s update our crew and run it.

    full_crew = Crew(
    agents=[database_specialist_agent, tech_writer_agent, qa_specialist_agent],
    tasks=[table_description_task, table_documentation_task, qa_review_task],
    verbose = 2,
    memory = False # doesn't work with Llama
    )

    full_result = full_crew.kickoff({'table': 'ecommerce_db.users'})

    We now have more structured and detailed documentation thanks to the addition of the QA stage.

    Delegation

    With the addition of the QA specialist, it would be interesting to test the delegation mechanism. The QA specialist agent might have questions or requests that it could delegate to other agents.

    I tried using the delegation with Llama 3, but it didn’t go well. Llama 3 struggled to call the co-worker tool correctly. It couldn’t specify the correct co-worker’s name.

    We achieved pretty good results with a local model that can run on any laptop, but now it’s time to switch gears and use a way more powerful model — GPT-4o.

    To do it, we just need to update the following environment variables.

    os.environ["OPENAI_MODEL_NAME"] = 'gpt-4o'  
    os.environ["OPENAI_API_KEY"] = config['OPENAI_API_KEY'] # your OpenAI key

    To switch on the delegation, we should specify allow_delegation = True for the QA specialist agent.
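    It’s a one-line change; here’s a sketch, assuming we simply re-create the QA specialist agent from above with the flag switched on:

    qa_specialist_agent = Agent(
        role = "Quality Assurance specialist",
        goal = """Ensure the highest quality of the documentation we provide
        (that it's correct and easy to understand)""",
        backstory = '''
        You work as a Quality Assurance specialist, checking the work
        from the technical writer and ensuring that it's in line
        with our highest standards.
        You need to check that the technical writer provides full and complete
        answers and makes no assumptions.
        Also, you need to make sure that the documentation addresses
        all the questions and is easy to understand.
        ''',
        allow_delegation = True,  # let the QA specialist ask questions to co-workers
        verbose = True
    )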

    Also, we can use the handy memory functionality available for OpenAI models, as mentioned above. Memory will allow our agents to share their knowledge with each other during execution and to leverage long-term memory to retrieve information from previous executions.

    full_crew = Crew(
    agents = [database_specialist_agent, tech_writer_agent, qa_specialist_agent],
    tasks = [table_description_task, table_documentation_task, qa_review_task],
    verbose = 2,
    memory = True
    )

    full_result = full_crew.kickoff({'table': 'ecommerce_db.users'})

    Let’s see the CrewAI execution logs to understand how delegation works. Here are all the logs for the QA specialist. We can see that it reached out to the database specialist to double-check the information.

    Thought: I need to review the provided draft documentation to ensure it 
    fully answers all the questions and is complete, correct, and easy
    to understand. Specifically, I need to ensure that it explains the purpose
    of the table, its structure, and includes relevant examples without
    any irrelevant comments.

    First, I will compare the two drafts to identify any discrepancies and
    missing information. Then, I will verify the accuracy and completeness of
    the information provided.

    **Comparison:**
    1. **Introduction:**
    - First draft: Includes an introduction and examples.
    - Second draft: Adds a more detailed introduction about the purpose and importance of the table.

    2. **Columns:**
    - Both drafts have the same column structure and descriptions.

    3. **Additional Information:**
    - Both drafts include similar additional information, but the second draft is more concise.

    4. **Examples:**
    - The first draft includes examples, which are missing in the second draft.

    5. **Final Checks:**
    - Ensure there are no irrelevant comments.
    - Verify consistency with information provided by the database specialist.

    Action: Ask question to co-worker
    Action Input: {
    "question": "Can you confirm the accuracy of the provided column descriptions
    and examples for the `ecommerce_db.users` table?",
    "context": "We are finalizing the documentation for the `ecommerce_db.users`
    table. The table includes columns `user_id`, `country`, `is_active`, and
    `age`. Here are the examples provided:

    | user_id | country | is_active | age |
    |---------|----------------|-----------|-----|
    | 1000001 | United Kingdom | 0 | 70 |
    | 1000002 | France | 1 | 87 |
    | 1000003 | France | 1 | 88 |
    | 1000004 | Germany | 1 | 25 |
    | 1000005 | Germany | 1 | 48 |
    | 1000006 | France | 1 | 78 |
    | 1000007 | Netherlands | 1 | 65 |
    | 1000008 | Germany | 1 | 31 |
    | 1000009 | Germany | 1 | 66 |
    | 1000010 | Germany | 1 | 73 |",
    "coworker": "database specialist"}

    <-- omitted logs from the database specialist -->

    Based on the provided context and examples for the `ecommerce_db.users` table,
    the column descriptions and examples appear accurate. Here are the details:
    - `user_id`: This column contains a unique identifier for each user.
    The provided examples (e.g., 1000001, 1000002) align with this description
    as they are unique numeric identifiers.

    - `country`: This column specifies the country of the user. The provided
    examples (e.g., United Kingdom, France, Germany, Netherlands) are
    all valid country names.

    - `is_active`: This column indicates whether the user is active (1) or not (0). The provided examples (e.g., 0, 1) correctly represent this binary status.

    - `age`: This column shows the age of the user. The provided examples (e.g., 70,
    87, 88, 25, 48, 78, 65, 31, 66, 73) are all valid ages.

    Therefore, the column descriptions and examples provided for the
    `ecommerce_db.users` table are accurate and meet the expected criteria.

    Thought: Based on the observation from the database specialist,
    the column descriptions and examples provided for the `ecommerce_db.users`
    table are accurate. Now, I will consolidate the best elements from both drafts,
    ensuring the documentation is complete, clear, and free of irrelevant comments.

    Action: Finalize the documentation in markdown format, incorporating
    the detailed introduction, examples, and ensuring the tone is professional
    but informal.

    When I tried the delegation for the first time, I didn’t enable memory, which led to incorrect results. The data specialist and the technical writer initially returned the correct information. However, when the QA specialist returned with the follow-up questions, they started to hallucinate. So, it looks like delegation works better when memory is enabled.

    Here’s the final output from GPT-4o. The result looks pretty nice now. We definitely can use LLMs to automate documentation.

    So, the first task has been solved!

    I used the same script to generate documentation for the ecommerce_db.sessions table as well. It will be handy for our next task. So, let’s not waste any time and move on.

    Use case: answering questions

    Our next task is answering questions based on the documentation, since this is a common need for many data analysts (and other specialists).

    We will start simple and create just two agents:

    • The documentation support specialist will answer questions based on the docs,
    • The support QA agent will review the answer before sharing it with the customer.

    We will need to empower the documentation specialist with a couple of tools that allow it to list all the files stored in a directory and read them. It’s pretty straightforward since CrewAI already provides such tools.

    from crewai_tools import DirectoryReadTool, FileReadTool

    documentation_directory_tool = DirectoryReadTool(
    directory = '~/crewai_project/ecommerce_documentation')

    base_file_read_tool = FileReadTool()

    However, since Llama 3 keeps struggling with quotes when calling tools, I had to create a custom tool on top of FileReadTool to overcome this issue.

    from crewai_tools import BaseTool

    class FileReadToolUPD(BaseTool):
        name: str = "Read a file's content"
        description: str = "A tool that can be used to read a file's content."

        def _run(self, file_path: str) -> str:
            # Strip the quotes Llama 3 tends to add around the file path,
            # then delegate to the standard FileReadTool
            return base_file_read_tool._run(file_path = file_path.strip('"').strip("'"))

    file_read_tool = FileReadToolUPD()

    Next, as we did before, we need to create agents, tasks and crew.

    data_support_agent = Agent(
    role = "Senior Data Support Agent",
    goal = "Be the most helpful support for you colleagues",
    backstory = '''You work as a support for data-related questions
    in the company.
    Even though you're a big expert in our data warehouse, you double check
    all the facts in documentation.
    Our documentation is absolutely up-to-date, so you can fully rely on it
    when answering questions (you don't need to check the actual data
    in database).
    Your work is very important for the team success. However, remember
    that examples of table rows don't show all the possible values.
    You need to ensure that you provide the best possible support: answering
    all the questions, making no assumptions and sharing only the factual data.
    Be creative and try your best to solve the customer's problem.
    ''',
    allow_delegation = False,
    verbose = True
    )

    qa_support_agent = Agent(
    role = "Support Quality Assurance Agent",
    goal = """Ensure the highest quality of the answers we provide
    to the customers""",
    backstory = '''You work as a Quality Assurance specialist, checking the work
    from support agents and ensuring that it's in line with our highest standards.
    You need to check that the agent provides full and complete answers
    and makes no assumptions.
    Also, you need to make sure that the documentation addresses all
    the questions and is easy to understand.
    ''',
    allow_delegation = False,
    verbose = True
    )

    draft_data_answer = Task(
    description = '''Very important customer {customer} reached out to you
    with the following question:
    ```
    {question}
    ```

    Your task is to provide the best answer to all the points in the question
    using all available information and not making any assumptions.
    If you don't have enough information to answer the question, just say
    that you don't know.''',
    expected_output = '''The detailed informative answer to the customer's
    question that addresses all the points mentioned.
    Make sure that the answer is complete and sticks to the facts
    (without any additional information not based on the factual data)''',
    tools = [documentation_directory_tool, file_read_tool],
    agent = data_support_agent
    )

    answer_review = Task(
    description = '''
    Review the draft answer provided by the support agent.
    Ensure that it fully answers all the questions mentioned
    in the initial inquiry.
    Make sure that the answer is consistent and doesn't include any assumptions.
    ''',
    expected_output = '''
    The final version of the answer in markdown format that can be shared
    with the customer.
    The answer should fully address all the questions, be consistent
    and follow our professional but informal tone of voice.
    We are very chill and friendly company, so don't forget to include
    all the polite phrases.
    ''',
    tools = [],
    agent = qa_support_agent
    )

    qna_crew = Crew(
    agents = [data_support_agent, qa_support_agent],
    tasks = [draft_data_answer, answer_review],
    verbose = 2,
    memory = False # doesn't work with Llama
    )

    Let’s see how it works in practice.

    result = qna_crew.kickoff(
    {'customer': "Max",
    'question': """Hey team, I hope you're doing well. I need to find
    the numbers before our CEO presentation tomorrow, so I will really
    appreciate your help.
    I need to calculate the number of sessions from our Windows users in 2023. I've tried to find the table with such data in our data warehouse, but wasn't able to.
    Do you have any ideas whether we store the needed data somewhere,
    so that I can query it? """
    }
    )

    We’ve got a polite, practical and helpful answer in return. That’s really great.

    **Hello Max,**

    Thank you for reaching out with your question! I'm happy to help you
    find the number of sessions from Windows users in 2023.
    After reviewing our documentation, I found that we do store data
    related to sessions and users in our ecommerce database, specifically in
    the `ecommerce_db.sessions` table.

    To answer your question, I can provide you with a step-by-step guide
    on how to query this table using SQL. First, you can use the `session_id`
    column along with the `os` column filtering for "Windows" and
    the `action_date` column filtering for dates in 2023.
    Then, you can group the results by `os` using the `GROUP BY` clause
    to count the number of sessions that meet these conditions.

    Here's a sample SQL query that should give you the desired output:

    ```sql
    SELECT COUNT(*)
    FROM ecommerce_db.sessions
    WHERE os = 'Windows'
    AND action_date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY os;
    ```

    This query will return the total number of sessions from Windows
    users in 2023. I hope this helps! If you have any further questions or
    need more assistance, please don't hesitate to ask.

    Let’s complicate the task a bit. Suppose we can get not only questions about our data but also about our tool (ClickHouse). So, we will have another agent in the crew — ClickHouse Guru. To give our CH agent some knowledge, I will share a documentation website with it.

    from crewai_tools import ScrapeWebsiteTool, WebsiteSearchTool
    ch_documentation_tool = ScrapeWebsiteTool(
    'https://clickhouse.com/docs/en/guides/creating-tables')

    If you need to work with a lengthy document, you might try using RAG (Retrieval Augmented Generation) via WebsiteSearchTool. It will calculate embeddings and store them locally in ChromaDB. In our case, we will stick to a simple website scraper tool.
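    The ClickHouse expert agent and its drafting task follow the same pattern as the data support agent and task above. Their exact definitions aren’t shown here, so treat the following as a rough sketch: the names ch_support_agent and draft_ch_answer match the crew definition below, but the role, backstory and task wording are my assumptions.

    ch_support_agent = Agent(
        role = "Senior ClickHouse Support Agent",
        goal = "Be the most helpful support for your colleagues on ClickHouse questions",
        backstory = '''You work as a support specialist for ClickHouse-related
        questions in the company.
        You base your answers on the official ClickHouse documentation available
        to you and you stick to the facts, making no assumptions.''',
        allow_delegation = False,
        verbose = True
    )

    draft_ch_answer = Task(
        description = '''Very important customer {customer} reached out to you
        with the following question:
        ```
        {question}
        ```
        Your task is to provide the best answer to the ClickHouse-related points
        in the question using the documentation and not making any assumptions.''',
        expected_output = '''The detailed informative answer to the customer's
        question based on the ClickHouse documentation.''',
        tools = [ch_documentation_tool],
        agent = ch_support_agent
    )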

    Now that we have two subject matter experts, we need to decide who will be working on the questions. So, it’s time to use a hierarchical process and add a manager to orchestrate all the tasks.

    CrewAI provides the manager implementation, so we only need to specify the LLM model. I’ve picked GPT-4o.

    from langchain_openai import ChatOpenAI
    from crewai import Process

    complex_qna_crew = Crew(
    agents = [ch_support_agent, data_support_agent, qa_support_agent],
    tasks = [draft_ch_answer, draft_data_answer, answer_review],
    verbose = 2,
    manager_llm = ChatOpenAI(model='gpt-4o', temperature=0),
    process = Process.hierarchical,
    memory = False
    )

    At this point, I had to switch from Llama 3 to OpenAI models again to run a hierarchical process since it hasn’t worked for me with Llama (similar to this issue).

    Now, we can try our new crew with different types of questions (either related to our data or ClickHouse database).

    ch_result = complex_qna_crew.kickoff(
    {'customer': "Maria",
    'question': """Good morning, team. I'm using ClickHouse to calculate
    the number of customers.
    Could you please remind whether there's an option to add totals
    in ClickHouse?"""
    }
    )

    doc_result = complex_qna_crew.kickoff(
    {'customer': "Max",
    'question': """Hey team, I hope you're doing well. I need to find
    the numbers before our CEO presentation tomorrow, so I will really
    appreciate your help.
    I need to calculate the number of sessions from our Windows users
    in 2023. I've tried to find the table with such data
    in our data warehouse, but wasn't able to.
    Do you have any ideas whether we store the needed data somewhere,
    so that I can query it. """
    }
    )

    If we look at the final answers and logs (I’ve omitted them here since they are quite lengthy, but you can find the full logs on GitHub), we will see that the manager was able to orchestrate correctly and delegate tasks to the co-workers with the relevant knowledge to address the customer’s question. For the first (ClickHouse-related) question, we got a detailed answer with examples and the possible implications of using the WITH TOTALS functionality. For the data-related question, the models returned roughly the same information as we’ve seen above.

    So, we’ve built a crew that can answer various types of questions based on the documentation, whether from a local file or a website. I think it’s an excellent result.

    You can find all the code on GitHub.

    Summary

    In this article, we’ve explored using the CrewAI multi-agent framework to create a solution for writing documentation based on tables and answering related questions.

    Given the extensive functionality we’ve utilised, it’s time to summarise the strengths and weaknesses of this framework.

    Overall, I find CrewAI to be an incredibly useful framework for multi-agent systems:

    • It’s straightforward, and you can build your first prototype quickly.
    • Its flexibility allows you to solve quite sophisticated business problems.
    • It encourages good practices like role-playing.
    • It provides many handy tools out of the box, such as RAG and a website parser.
    • The support of different types of memory enhances the agents’ collaboration.
    • Built-in guardrails help prevent agents from getting stuck in repetitive loops.

    However, there are areas that could be improved:

    • While the framework is simple and easy to use, it’s not very customisable. For instance, you currently can’t create your own LLM manager to orchestrate the processes.
    • Sometimes, it’s quite challenging to get the full detailed information from the documentation. For example, it’s clear that CrewAI implemented some guardrails to prevent repetitive function calls, but the documentation doesn’t fully explain how it works.
    • Another improvement area is transparency. I like to understand how frameworks work under the hood. For example, in Langchain, you can use langchain.debug = True to see all the LLM calls. However, I haven’t figured out how to get the same level of detail with CrewAI.
    • Full support for local models would be a great addition, as the current implementation either lacks some features or is difficult to get working properly.

    The domain and tools for LLMs are evolving rapidly, so I’m hopeful that we’ll see a lot of progress in the near future.

    Thank you for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.

    Reference

    This article is inspired by the “Multi AI Agent Systems with CrewAI” short course from DeepLearning.AI.

