Tag: AI

  • Coverage vs. Accuracy: Striking a Balance in Data Science

    Coverage vs. Accuracy: Striking a Balance in Data Science

    Nadav Har-Tuv

    The art of getting quick gains with agile model production

    Agile model production in data science
Cover image by ChatGPT

    This post was written together with and inspired by Yuval Cohen

    Introduction

Every day, numerous data science projects are discarded due to insufficient prediction accuracy. It’s a regrettable outcome, considering that these models are often exceptionally well suited to some subsets of the dataset.

    Data Scientists often try to improve their models by using more complex models and by throwing more and more data at the problem. But many times there is a much simpler and more productive approach: Instead of trying to make all of our predictions better all at once, we could start by making good predictions for the easy parts of the data, and only then work on the harder parts.

    This approach can greatly affect our ability to solve real-world problems. We start with the quick gain on the easy problems and only then focus our effort on the harder problems.

    Thinking Agile

Agile production means focusing on the easy data first, and only after it has been properly modelled, moving on to the more complicated tasks. This allows a workflow that is iterative, value-driven, and collaborative.

    It allows for quicker results, adaptability to changing circumstances, and continuous improvement, which are core ideas of agile production.

    1. Iterative and incremental approach: work in short, iterative cycles. Start by achieving high accuracy for the easy problems and then move on to the harder parts.
    2. Focus on delivering value: work on the problem with the highest marginal value for your time.
    3. Flexibility and adaptability: Allow yourself to adapt to changing circumstances. For example, a client might need you to focus on a certain subset of the data — once you’ve solved that small problem, the circumstances have changed and you might need to work on something completely different. Breaking the problem into small parts allows you to adapt to the changing circumstances.
    4. Feedback and continuous improvement: By breaking up a problem you allow yourself to be in constant and continuous improvement, rather than waiting for big improvements in large chunks.
    5. Collaboration: Breaking the problem into small pieces promotes parallelization of the work and collaboration between team members, rather than putting all of the work on one person.

    Breaking down the complexity

    In real-world datasets, complexity is the rule rather than the exception. Consider a medical diagnosis task, where subtle variations in symptoms can make the difference between life-threatening conditions and minor ailments. Achieving high accuracy in such scenarios can be challenging, if not impossible, due to the inherent noise and nuances in the data.

    This is where the idea of coverage comes into play. Coverage refers to the portion of the data that a model successfully predicts or classifies with high confidence or high precision. Instead of striving for high accuracy across the entire dataset, researchers can choose to focus on a subset of the data where prediction is relatively straightforward. By doing so, they can achieve high accuracy on this subset while acknowledging the existence of a more challenging, uncovered portion.

    For instance, consider a trained model with a 50% accuracy rate on a test dataset. In this scenario, it’s possible that if we could identify and select only the predictions we are very sure about (although we should decide what “very sure” means), we could end up with a model that covers fewer cases, let’s say around 60%, but with significantly improved accuracy, perhaps reaching 85%.

I don’t know any product manager who would say no in such a situation, especially if there is no model in production and this is the first one.

    The two-step model

    We want to divide our data into two distinct subsets: the covered and the uncovered. The covered data is the part of the data where the initial model achieves high accuracy and confidence. The uncovered data is the part of the data where our model does not give confident predictions and does not achieve high accuracy.

    In the first step, a model is trained on the data. Once we identify a subset of data where the model achieves high accuracy, we deploy that model and let it run on that subset — the covered data.

    In the second step, we move our focus to the uncovered data. We try to develop a better model for this data by collecting more data, using more advanced algorithms, feature engineering, and incorporating domain-specific knowledge to find patterns in the data.

At this step, the first thing you should do is look at the errors by eye. You will often spot clear patterns this way, before reaching for any fancy tricks.
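As a rough sketch of the covered/uncovered split described above (assuming an sklearn-style classifier that exposes predict_proba; the 0.9 threshold is only a placeholder):

def split_by_confidence(model, X, threshold=0.9):
    """Route rows to the covered set (confident predictions we can deploy now)
    or to the uncovered set (rows that still need a better model)."""
    probs = model.predict_proba(X)             # shape: (n_samples, n_classes)
    confident = probs.max(axis=1) >= threshold
    return X[confident], X[~confident]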

    An example

This example shows how an agile workflow can create great value. It is deliberately simple and meant to visualize the concept; real-life cases will be a lot less obvious, but the idea is just as relevant.

    Let’s look at this two-dimensional data that I simulated from three equally sized classes.

import numpy as np

num_samples_A = 500
    num_samples_B = 500
    num_samples_C = 500


    # Class A
    mean_A = [3, 2]
    cov_A = [[0.1, 0], [0, 0.1]] # Low variance
    class_A = np.random.multivariate_normal(mean_A, cov_A, num_samples_A)

    # Class B
    mean_B = [0, 0]
    cov_B = [[1, 0.5], [0.5, 1]] # Larger variance with some overlap with class C
    class_B = np.random.multivariate_normal(mean_B, cov_B, num_samples_B)

    # Class C
    mean_C = [0, 1]
    cov_C = [[2, 0.5], [0.5, 2]] # Larger variance with some overlap with class B
    class_C = np.random.multivariate_normal(mean_C, cov_C, num_samples_C)
Two-dimensional data from three classes (plot of the simulated data)

Now we try to fit a machine learning classifier to this data. It looks like an SVM classifier with a Gaussian (‘rbf’) kernel might do the trick:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

    # Creating DataFrame
    data = np.concatenate([class_A, class_B, class_C])
    labels = np.concatenate([np.zeros(num_samples_A), np.ones(num_samples_B), np.ones(num_samples_C) * 2])
    df = pd.DataFrame(data, columns=['x', 'y'])
    df['label'] = labels.astype(int)

    # Splitting data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(df[['x', 'y']], df['label'], test_size=0.2, random_state=42)

    # Training SVM model with RBF kernel
svm_rbf = SVC(kernel='rbf', probability=True)
    svm_rbf.fit(X_train, y_train)

    # Predict probabilities for each class
    svm_rbf_probs = svm_rbf.predict_proba(X_test)

    # Get predicted classes and corresponding confidences
svm_rbf_predictions = [
    (X_test.iloc[i]['x'], X_test.iloc[i]['y'], true_class, np.argmax(probs), np.max(probs))
    for i, (true_class, probs) in enumerate(zip(y_test, svm_rbf_probs))
]

svm_predictions_df = pd.DataFrame(svm_rbf_predictions).rename(
    columns={0: 'x', 1: 'y', 2: 'true_class', 3: 'predicted_class', 4: 'confidence'})

    How does this model perform on our data?

    accuracy = (svm_predictions_df['true_class'] == svm_predictions_df['predicted_class']).mean()*100
    print(f'Accuracy = {round(accuracy,2)}%')

    Accuracy = 75.33%

75% accuracy is disappointing, but does this mean that this model is useless?

    Now we want to look at the most confident predictions and see how the model performs on them. How do we define the most confident predictions? We can try out different confidence (predict_proba) thresholds and see what coverage and accuracy we get for each threshold and then decide which threshold meets our business needs.

thresholds = [.5, .55, .6, .65, .7, .75, .8, .85, .9]
results = []

for threshold in thresholds:
    svm_df_covered = svm_predictions_df.loc[svm_predictions_df['confidence'] > threshold]
    coverage = len(svm_df_covered) / len(svm_predictions_df) * 100
    accuracy_covered = (svm_df_covered['true_class'] == svm_df_covered['predicted_class']).mean() * 100
    results.append({'Threshold': threshold,
                    'Coverage (%)': round(coverage, 2),
                    'Accuracy on covered data (%)': round(accuracy_covered, 2)})

results_df = pd.DataFrame(results)
print(results_df)

    And we get

Coverage and accuracy by threshold (table produced by the code above)

    Or if we want a more detailed look we can create a plot of the coverage and accuracy by threshold:

Accuracy and coverage as a function of the confidence threshold (line plot)
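A minimal matplotlib sketch that would draw such a plot from the results_df computed above (styling choices are assumptions):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(results_df['Threshold'], results_df['Coverage (%)'], marker='o', label='Coverage (%)')
ax.plot(results_df['Threshold'], results_df['Accuracy on covered data (%)'], marker='o',
        label='Accuracy on covered data (%)')
ax.set_xlabel('Confidence threshold')
ax.set_ylabel('Percent')
ax.set_title('Coverage and accuracy as a function of threshold')
ax.legend()
plt.show()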

We can now select the threshold that fits our business logic. For example, if our company’s policy is to guarantee at least 90% accuracy, then we can choose a threshold of 0.75 and get an accuracy of 90% for 62% of the data. This is a huge improvement over throwing out the model, especially if we don’t have any model in production!

    Now that our model is happily working in production on 60% of the data, we can shift our focus to the rest of the data. We can collect more data, do more feature engineering, try more complex models, or get help from a domain expert.

    Balancing act

    The two-step model allows us to aim for accuracy while acknowledging that it is perfectly fine to start with a high accuracy for only a subset of the data. It is counterproductive to insist that a model will have high accuracy on all the data before deploying it to production.

The agile approach presented in this post aims for efficient resource allocation. Instead of spending computational resources on getting high accuracy across the entire dataset, focus them where the marginal gain is highest.

    Conclusion

    In data science, we try to achieve high accuracy. However, in the reality of messy data, we need to find a clever approach to utilize our resources in the best way. Agile model production teaches us to focus on the parts of the data where our model works best, deploy the model for those subsets, and only then start working on a new model for the more complicated part. This strategy will help you make the best use of your resources in the face of real data science problems.

    Think production, Think Agile.


    Coverage vs. Accuracy: Striking a Balance in Data Science was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Coverage vs. Accuracy: Striking a Balance in Data Science

    Go Here to Read this Fast! Coverage vs. Accuracy: Striking a Balance in Data Science

  • Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries

    Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries

    Xinle Sheila Liu

    In this post, we explore the performance benefits of Amazon SageMaker (including SMP and SMDDP), and how you can use the library to train large models efficiently on SageMaker. We demonstrate the performance of SageMaker with benchmarks on ml.p4d.24xlarge clusters up to 128 instances, and FSDP mixed precision with bfloat16 for the Llama 2 model.

    Originally appeared here:
    Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries

    Go Here to Read this Fast! Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries

  • Manage your Amazon Lex bot via AWS CloudFormation templates

    Manage your Amazon Lex bot via AWS CloudFormation templates

    Thomas Rindfuss

    Amazon Lex is a fully managed artificial intelligence (AI) service with advanced natural language models to design, build, test, and deploy conversational interfaces in applications. It employs advanced deep learning technologies to understand user input, enabling developers to create chatbots, virtual assistants, and other applications that can interact with users in natural language. Managing your […]

    Originally appeared here:
    Manage your Amazon Lex bot via AWS CloudFormation templates

    Go Here to Read this Fast! Manage your Amazon Lex bot via AWS CloudFormation templates

  • A secure approach to generative AI with AWS

    A secure approach to generative AI with AWS

    Anthony Liguori

    Generative artificial intelligence (AI) is transforming the customer experience in industries across the globe. Customers are building generative AI applications using large language models (LLMs) and other foundation models (FMs), which enhance customer experiences, transform operations, improve employee productivity, and create new revenue channels. The biggest concern we hear from customers as they explore the advantages of generative AI is how to protect their highly sensitive data and investments. At AWS, our top priority is safeguarding the security and confidentiality of our customers’ workloads. We think about security across the three layers of our generative AI stack …

    Originally appeared here:
    A secure approach to generative AI with AWS

    Go Here to Read this Fast! A secure approach to generative AI with AWS

  • The Limitations and Advantages of Retrieval Augmented Generation (RAG)

    The Limitations and Advantages of Retrieval Augmented Generation (RAG)

    Sandi Besen

    The Practical Limitations and Advantages of Retrieval Augmented Generation (RAG)

    The Value of RAG

Imagine RAG as a highly intelligent librarian who can sift through a digital library in seconds to answer your questions. Sometimes the librarian finds relevant and useful information, but other times they miss the mark.

    Source: Dalle3

Let’s explore situations in which RAG excels and those in which it falls short. In a future work, I will explore a series of approaches that can be used individually or in combination to improve RAG’s capabilities, which will support better responses when used with a language model.

    Where RAG Falls Short

Even the most intelligent librarian has their challenges, some of which include reasoning iteratively, retrieving the most useful documents, and ensuring that the information they source from is relevant and unbiased.

    Piecing Together the Puzzle with Iterative Reasoning: One of the key limitations of current RAG is its lack of iterative reasoning capabilities. RAG is unable to fully understand whether the data that is being retrieved is the most relevant information the language model needs to effectively solve the problem.

    For example, if you were to pose a question such as “What does the impact of new environmental regulations passed in 2024 have on my latest white paper?” a RAG-enabled system would attempt to retrieve the data most semantically similar to the query. It might return the top X documents that have information on new policies, but are they the relevant policies for the specific paper the user is referencing?

As humans, we would approach this problem with reasoning skills. We would first read the white paper to understand its content and then determine what type of environmental policies best apply. Then, based on that knowledge, we would search for those specific policies. This iterative reasoning process — understanding the problem, formulating a more targeted search strategy, and then retrieving the most useful information — is a capability that current RAG implementations lack.

Organization Matters: The performance and effectiveness of RAG is heavily dependent on the organization and structure of the underlying data it is accessing. The ability of the retrieval algorithm to identify and surface the most useful documents is greatly influenced by how that information is cataloged and stored, as well as how semantically similar the query is to the retrieved data.

    In our library analogy, imagine a scenario where 500 books on various subjects are simply placed haphazardly on a single shelf, without any categorization or tagging. Trying to find the most relevant resources to answer a specific query would be a feat. You may stumble across some potentially useful books, but have no reliable way to assess which ones contain the most pertinent information. If those same 500 books were organized by genre, with clear metadata and subject tags, the retrieval process becomes significantly more efficient and effective. Rather than blindly scanning the entire shelf, the RAG implementation could quickly zero in on the most relevant section(s).

The same principles apply to how data is stored and indexed for RAG implementations in real-world applications. If the underlying datasets lack coherent organization, categorization, and metadata, the retrieval algorithms will struggle to identify the most valuable information. Ensuring data is properly structured, cataloged, and accessible is critical.

The Good, the Bad, and the Biased: The quality of the data retrieved by a RAG implementation is only as good as the data it has access to. If the information in the underlying source systems, be it databases, online file storage, or other data repositories, contains outdated, incomplete, or biased content, the RAG implementation will have no way to discern this. It will simply retrieve and pass along this flawed information to the language model responsible for generating the final output.

    Where RAG Models Shine

    Accessing Domain Specific and Confidential Information: One of the key advantages of RAG is the ability to leverage domain-specific and even confidential information that may not be included in a language model’s standard training data. This can be particularly beneficial for organizations working on proprietary, cutting-edge research and projects. For example, if a company is conducting groundbreaking research in quantum computing that has not yet been publicly released, a RAG implementation could be granted access to these internal data sources. This would allow the language model to access specialized knowledge to engage in discussions about the company’s latest developments, without needing to be trained on that confidential information.

However, exposing sensitive, internal data to externally hosted language models (such as GPT, LLAMA, etc.) is not risk-free. Organizations must exercise due diligence to ensure proper data security measures are in place to protect their intellectual property and confidential information.

Bringing the Latest News to Your Conversation: One of the key advantages of RAG is its ability to provide language models with access to the most up-to-date information, going beyond the fixed cutoff date of the language model’s original training data. If a language model were to rely solely on its inherent knowledge, its information would be limited to what was available at the time it was trained.

    RAG implementations, on the other hand, can be integrated with live data sources such as the internet, constantly updating databases, news feeds, etc. This allows the language model to utilize current information when generating responses.

    Conclusion

Retrieval Augmented Generation (RAG) is a powerful technique that can enhance language models by providing access to a wealth of information beyond their initial training. However, it is important to be aware of the limitations of RAG, such as the need for iterative reasoning, the importance of well-organized data, and the potential for biased or outdated information. In a future work, I will explore a series of approaches to improve the capabilities of RAG — enhancing the quality of responses generated by a language model.


    The Limitations and Advantages of Retrieval Augmented Generation (RAG) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    The Limitations and Advantages of Retrieval Augmented Generation (RAG)

    Go Here to Read this Fast! The Limitations and Advantages of Retrieval Augmented Generation (RAG)

• Deploying Large Language Models: vLLM and Quantization (Step by Step Guide on How to Accelerate…)

Deploying Large Language Models: vLLM and Quantization (Step by Step Guide on How to Accelerate…)

    Ayoola Olafenwa

    Deploying Large Language Models: vLLM and Quantization

    Step-by-step guide on how to accelerate large language models

    source

    Deployment of Large Language Models (LLMs)

We live in an amazing time of Large Language Models like ChatGPT, GPT-4, and Claude that can perform a remarkable range of tasks. In practically every field, from education and healthcare to arts and business, Large Language Models are being used to deliver services more efficiently. Over the past year, many brilliant open-source Large Language Models, such as Llama, Mistral, Falcon, and Gemma, have been released. These open-source LLMs are available for everyone to use, but deploying them can be very challenging, as they can be very slow and require a lot of GPU compute power for real-time deployment. Different tools and approaches have been created to simplify their deployment.

    Many deployment tools have been created for serving LLMs with faster inference, such as vLLM, c2translate, TensorRT-LLM, and llama.cpp. Quantization techniques are also used to optimize GPUs for loading very large Language Models. In this article, I will explain how to deploy Large Language Models with vLLM and quantization.

    Latency and Throughput

    Some of the major factors that affect the speed performance of a Large Language Model are GPU hardware requirements and model size. The larger the size of the model, the more GPU compute power is required to run it. Common benchmark metrics used in measuring the speed performance of a Large Language Model are Latency and Throughput.

    Latency: This is the time required for a Large Language Model to generate a response. It is usually measured in seconds or milliseconds.

    Throughput: This is the number of tokens generated per second or millisecond from a Large Language Model.

    Install Required Packages

    Below are the two required packages for running a Large Language Model: Hugging Face transformers and accelerate.

    pip3 install transformers
    pip3 install accelerate

    What is Phi-2?

    Phi-2 is a state-of-the-art foundation model from Microsoft with 2.7 billion parameters. It was pre-trained with a variety of data sources, ranging from code to textbooks. Learn more about Phi-2 from here.

    Benchmarking LLM Latency and Throughput with Hugging Face Transformers
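The benchmark script itself is embedded as a gist in the original post, and the numbered breakdown below refers to that gist. As a rough sketch of the same measurement (the checkpoint name and generation settings here are assumptions):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Generate a python code that accepts a list of numbers and returns the sum."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time the full generation to estimate latency
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
latency = time.time() - start

# Throughput = newly generated tokens divided by the time taken
generated_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
throughput = generated_tokens / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))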

    Generated Output

    Latency: 2.739394464492798 seconds
    Throughput: 32.36171766303386 tokens/second
    Generate a python code that accepts a list of numbers and returns the sum. [1, 2, 3, 4, 5]
    A: def sum_list(numbers):
    total = 0
    for num in numbers:
    total += num
    return total

    print(sum_list([1, 2, 3, 4, 5]))

    Step By Step Code Breakdown

Line 6–10: Loaded the Phi-2 model and tokenized the prompt “Generate a python code that accepts a list of numbers and returns the sum.”

Line 12–18: Generated a response from the model and obtained the latency by calculating the time required to generate the response.

Line 21–23: Obtained the total number of tokens in the generated response, divided it by the latency, and calculated the throughput.

    This model was run on an A1000 (16GB GPU), and it achieves a latency of 2.7 seconds and a throughput of 32 tokens/second.

    Deployment of A Large Language Model with vLLM

    vLLM is an open source LLM library for serving Large Language Models at low latency and high throughput.

    How vLLM works

    The transformer is the building block of Large Language Models. The transformer network uses a mechanism called the attention mechanism, which is used by the network to study and understand the context of words. The attention mechanism is made up of a bunch of mathematical calculations of matrices known as attention keys and values. The memory used by the interaction of these attention keys and values affects the speed of the model. vLLM introduced a new attention mechanism called PagedAttention that efficiently manages the allocation of memory for the transformer’s attention keys and values during the generation of tokens. The memory efficiency of vLLM has proven very useful in running Large Language Models at low latency and high throughput.

    This is a high-level explanation of how vLLM works. To learn more in-depth technical details, visit the vLLM documentation.

    vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

    Install vLLM

    pip3 install vllm==0.3.3

    Run Phi-2 with vLLM
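As above, the runnable script is embedded in the original post; a minimal sketch of timing generation with vLLM (sampling settings are assumptions) might look like this:

import time
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/phi-2")          # assumed checkpoint
sampling_params = SamplingParams(max_tokens=100)

prompt = "Generate a python code that accepts a list of numbers and returns the sum."

# Time the full generation to estimate latency
start = time.time()
outputs = llm.generate([prompt], sampling_params)
latency = time.time() - start

# Throughput = generated tokens divided by the time taken
generated_tokens = len(outputs[0].outputs[0].token_ids)
throughput = generated_tokens / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(outputs[0].outputs[0].text)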

    Generated Output

Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
    [1, 2, 3, 4, 5]
    A: def sum_list(numbers):
    total = 0
    for num in numbers:
    total += num
    return total

    numbers = [1, 2, 3, 4, 5]
    print(sum_list(numbers))

    Step By Step Code Breakdown

    Line 1–3: Imported required packages from vLLM for running Phi-2.

    Line 5–8: Loaded Phi-2 with vLLM, defined the prompt and set important parameters for running the model.

    Line 10–16: Generated the model’s response using llm.generate and computed the latency.

    Line 19–21: Obtained the length of total tokens generated from the response, divided the length of tokens by the latency to get the throughput.

    Line 23–24: Obtained the generated text.

    I ran Phi-2 with vLLM on the same prompt, “Generate a python code that accepts a list of numbers and returns the sum.” On the same GPU, an A1000 (16GB GPU), vLLM produces a latency of 1.2 seconds and a throughput of 63 tokens/second, compared to Hugging Face transformers’ latency of 2.85 seconds and a throughput of 32 tokens/second. Running a Large Language Model with vLLM produces the same accurate result as using Hugging Face, with much lower latency and higher throughput.

    Note: The metrics (latency and throughput) I obtained for vLLM are estimated benchmarks for vLLM performance. The model generation speed depends on many factors, such as the length of the input prompt and the size of the GPU. According to the official vLLM report, running an LLM model on a powerful GPU like the A100 in a production setting with vLLM achieves 24x higher throughput than Hugging Face Transformers.

    Benchmarking Latency and Throughput in Real Time

    The way I calculated the latency and throughput for running Phi-2 is experimental, and I did this to explain how vLLM accelerates a Large Language Model’s performance. In the real-world use case of LLMs, such as a chat-based system where the model outputs a token as it is generated, measuring the latency and throughput is more complex.

    A chat-based system is based on streaming output tokens. Some of the major factors that affect the LLM metrics are Time to First Token (the time required for a model to generate the first token), Time Per Output Token (the time spent per output token generated), the input sequence length, the expected output, the total expected output tokens, and the model size. In a chat-based system, the latency is usually a combination of Time to First Token and Time Per Output Token multiplied by the total expected output tokens.
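Expressed as a rough back-of-the-envelope calculation (all numbers below are purely illustrative):

# latency ≈ time_to_first_token + time_per_output_token * expected_output_tokens
time_to_first_token = 0.5       # seconds (illustrative)
time_per_output_token = 0.02    # seconds per generated token (illustrative)
expected_output_tokens = 200

latency = time_to_first_token + time_per_output_token * expected_output_tokens
print(f"Estimated latency: {latency:.2f} seconds")  # 4.50 seconds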

    The longer the input sequence length passed into a model, the slower the response. Some of the approaches used in running LLMs in real-time involve batching users’ input requests or prompts to perform inference on the requests concurrently, which helps in improving the throughput. Generally, using a powerful GPU and serving LLMs with efficient tools like vLLM improves both the latency and throughput in real-time.

    Run the vLLM deployment on Google Colab

    Google Colaboratory

    Quantization of Large Language Models

    Quantization is the conversion of a machine learning model from a higher precision to a lower precision by shrinking the model’s weights into smaller bits, usually 8-bit or 4-bit. Deployment tools like vLLM are very useful for inference serving of Large Language Models at very low latency and high throughput. We are able to run Phi-2 with Hugging Face and vLLM conveniently on the T4 GPU on Google Colab because it is a smaller LLM with 2.7 billion parameters. For example, a 7-billion-parameter model like Mistral 7B cannot be run on Colab with either Hugging Face or vLLM. Quantization is best for managing GPU hardware requirements for Large Language Models. When GPU availability is limited and we need to run a very large Language Model, quantization is the best approach to load LLMs on constrained devices.

    BitsandBytes

It is a Python library built with custom quantization functions for shrinking a model’s weights into lower bits (8-bit and 4-bit).

    Install BitsandBytes

    pip3 install bitsandbytes

    Quantization of Mistral 7B Model

    Mistral 7B, a 7-billion-parameter model from MistralAI, is one of the best state-of-the-art open-source Large Language Models. I will go through a step-by-step process of running Mistral 7B with different quantization techniques that can be run on the T4 GPU on Google Colab.

    Quantization with 8bit Precision: This is the conversion of a machine learning model’s weight into 8-bit precision. BitsandBytes has been integrated with Hugging Face transformers to load a language model using the same Hugging Face code, but with minor modifications for quantization.

Line 1: Imported the packages needed for running the model, including the BitsAndBytesConfig class.

Line 3–4: Defined the quantization config and set the parameter load_in_8bit to True for loading the model’s weights in 8-bit precision.

Line 7–9: Passed the quantization config into the function for loading the model and set the parameter device_map so that bitsandbytes automatically allocates appropriate GPU memory for the model. Finally, loaded the tokenizer.
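The loading code itself is embedded as a gist in the original post; a minimal sketch of the 8-bit loading described above (the exact Mistral 7B checkpoint name is an assumption) could be:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

# Quantization config: load the weights in 8-bit precision
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate/bitsandbytes place the weights on the GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_name)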

    Quantization with 4bit Precision: This is the conversion of a machine learning model’s weight into 4-bit precision.

    The code for loading Mistral 7B in 4-bit precision is similar to that of 8-bit precision except for a few changes:

• Changed load_in_8bit to load_in_4bit.
• A new parameter bnb_4bit_compute_dtype is introduced into the BitsAndBytesConfig to perform the model’s computation in bfloat16. bfloat16 is a computation data type used for loading the model’s weights for faster inference. It can be used with both 4-bit and 8-bit precisions. If it is in 8-bit you just need to change the parameter from bnb_4bit_compute_dtype to bnb_8bit_compute_dtype.

    NF4(4-bit Normal Float) and Double Quantization

    NF4 (4-bit Normal Float) from QLoRA is an optimal quantization approach that yields better results than the standard 4-bit quantization. It is integrated with double quantization, where quantization occurs twice; quantized weights from the first stage of quantization are passed into the next stage of quantization, yielding optimal float range values for the model’s weights. According to the report from the QLoRA paper, NF4 with double quantization does not suffer from a drop in accuracy performance. Read more in-depth technical details about NF4 and Double Quantization from the QLoRA paper:

    QLoRA: Efficient Finetuning of Quantized LLMs

Line 4–9: Extra parameters were set in the BitsAndBytesConfig:

• load_in_4bit: loading the model in 4-bit precision is set to True.
• bnb_4bit_quant_type: The type of quantization is set to nf4.
• bnb_4bit_use_double_quant: Double quantization is set to True.
• bnb_4bit_compute_dtype: The bfloat16 computation data type is used for faster inference.

    Line 11–13: Loaded the model’s weights and tokenizer.

    Full Code for Model Quantization
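The full code is embedded in the original post and in the Colab notebook linked below; a condensed sketch combining the NF4 double-quantization settings described above with a generation call (the checkpoint name, prompt formatting, and max_new_tokens are assumptions) might look like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit precision
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16 for faster inference
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "[INST] What is Natural Language Processing? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))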

    Generated Output

    <s> [INST] What is Natural Language Processing? [/INST] Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
    computer science that deals with the interaction between computers and human language. Its main objective is to read, decipher,
    understand, and make sense of the human language in a valuable way. It can be used for various tasks such as speech recognition,
    text-to-speech synthesis, sentiment analysis, machine translation, part-of-speech tagging, name entity recognition,
    summarization, and question-answering systems. NLP technology allows machines to recognize, understand,
    and respond to human language in a more natural and intuitive way, making interactions more accessible and efficient.</s>

    Quantization is a very good approach for optimizing the running of very Large Language Models on smaller GPUs and can be applied to any model, such as Llama 70B, Falcon 40B, and mpt-30b. According to reports from the LLM.int8 paper, very Large Language Models suffer less from accuracy drops when quantized compared to smaller ones. Quantization is best applied to very Large Language Models and does not work well for smaller models because of the loss in accuracy performance.

Run Mistral 7B Quantization on Google Colab

    Google Colaboratory

    Conclusion

    In this article, I provided a step-by-step approach to measuring the speed performance of a Large Language Model, explained how vLLM works, and how it can be used to improve the latency and throughput of a Large Language Model. Finally, I explained quantization and how it is used to load Large Language Models on small-scale GPUs.

Reach me via:

    Email: [email protected]

    Linkedin: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

    References


    Deploying Large Language Models: vLLM and QuantizationStep by Step Guide on How to Accelerate… was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Deploying Large Language Models: vLLM and QuantizationStep by Step Guide on How to Accelerate…

    Go Here to Read this Fast! Deploying Large Language Models: vLLM and QuantizationStep by Step Guide on How to Accelerate…

  • Towards Reliable Synthetic Control

    Hang Yu

    Making the estimated treatment effect close to the truth

    Photo by Jørgen Håland on Unsplash

    Introduction

In recent years, the Synthetic Control (SC) approach has gained increasing adoption in industry for measuring the Average Treatment Effect (ATE) of interventions when Randomized Control Trials (RCTs) are not available. One such example is measuring the financial impact of outdoor advertisements on billboards, where we cannot conduct random treatment assignment in practice.

    The basic idea of SC is to estimate ATE by comparing the treatment group against the predicted counterfactual. However, applying SC in practice is usually challenged by the limited knowledge of its validity due to the absence of the true counterfactual in the real world. To mitigate the concern, in this article, I would like to discuss the actionable best practices that help to maximise the reliability of the SC estimation.

    The insights and conclusions are obtained through experiments based on diverse synthetic data. The code for data generation, causal inference modeling, and analysis is available in the Jupyter notebook hosted on Github.

    Synthetic Control in a Nutshell

The key to measuring the ATE of such interventions is to identify the counterfactual of the treatment group, which is the treatment group in the absence of the treatment, and to quantify the post-treatment difference between the two. This is simple for RCTs, as the randomised control statistically approximates the counterfactual. However, it’s challenging otherwise due to the unequal pre-experiment statistics between the treatment and control.

As a causal inference technique, SC represents the counterfactual by a synthetic control group created from untreated control units. This synthetic control group statistically equals the treatment group pre-treatment and is expected to approximate the untreated behaviour of the treatment group post-treatment. As presented mathematically below, it is created using a function f whose parameters are obtained by minimising the pre-treatment difference between the treatment group and the synthesised control [1]:

    In the experiment, there are J groups whereby group 1 is the treatment group and others are controls. Each group has its observed outcome at time t denoted by Yjt. f is the model and Y1t^N refers to the counterfactual. Image by author.
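A reconstruction of that equation in standard synthetic-control notation (following [1]; here T_0 denotes the treatment start):

\hat{f} = \arg\min_{f} \sum_{t < T_0} \left( Y_{1t} - f(Y_{2t}, \dots, Y_{Jt}) \right)^2,
\qquad
\hat{Y}_{1t}^{N} = \hat{f}(Y_{2t}, \dots, Y_{Jt}),
\qquad
\widehat{\mathrm{ATE}}_t = Y_{1t} - \hat{Y}_{1t}^{N} \ \ \text{for } t \ge T_0.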

    In practice, the popular options for the function f include but are not limited to the weighted sum [1], Bayesian Structural Time Series (BSTS) [2], etc.

    Actions towards Reliable Synthetic Control

    Despite the solid theoretical foundation, applying SC in practice usually faces the challenge that we don’t know how accurate the estimated ATE is because there exists no post-treatment counterfactual in reality to validate the synthesised one. However, there are some actions we can take to optimise the modeling process and maximise the reliability. Next, I will describe these actions and demonstrate how they influence the estimated ATE via a range of experiments based on the synthetic time-series data with diverse temporal characteristics.

    Experiment Setup

    All the experiments presented in this article are based on synthetic time-series data. These data are generated using the timeseries-generator package that produces time series capturing the real-world factors including GDP, holidays, weekends, and so on.

    GitHub – Nike-Inc/timeseries-generator: A library to generate synthetic time series data by easy-to-use factors and generator

The data generation aims to simulate the campaign performance of stores in New Zealand from 01/01/2019 to 31/12/2019. To make the potential conclusions statistically significant, 500 time series are generated to represent the stores. Each time series has a statistically randomised linear trend, white noise, a store factor, a holiday factor, a weekday factor, and seasonality. A random sample of 10 stores is presented below.

    Randomly sampled synthetic time series for 10 stores in New Zealand. Image by author.

    Store1 is selected to be the treatment group whereas others play the role of control groups. Next, the outcome of store1 is uplifted by 20% from 2019-09-01 onwards to simulate the treated behaviour whereas its original outcome serves as the real counterfactual. This 20% uplift establishes the actual ATE to validate the actions later on.

    cutoff_date_sc = '2019-09-01'
    df_sc.loc[cutoff_date_sc:] = df_sc.loc[cutoff_date_sc:]*1.2

    The figure below visualises the simulated treatment effect and the true counterfactual of the treatment group.

    The simulated ATE of +20% and the true counterfactual of store1. Image by author.

Given the synthetic data, the BSTS model in CausalImpact is adopted to estimate the ATE. The estimate is then compared against the actual ATE using the Mean Absolute Percentage Error (MAPE) to evaluate the corresponding action.

    GitHub – jamalsenouci/causalimpact: Python port of CausalImpact R library
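The exact metric computation lives in the linked notebook; a minimal sketch of one plausible definition, comparing the estimated per-day effect against the true effect created by the 20% uplift, is shown below:

import numpy as np

def ate_mape(y_treated_post, y_pred_counterfactual, y_true_counterfactual):
    """Mean absolute percentage error of the estimated treatment effect."""
    est_effect = y_treated_post - y_pred_counterfactual    # effect implied by the model
    true_effect = y_treated_post - y_true_counterfactual   # effect implied by the +20% uplift
    return np.mean(np.abs((est_effect - true_effect) / true_effect))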

    Next, let’s go through the actions along with the related experiments to see how to produce reliable ATE estimation.

    Treatment-control Correlation

    The first action to achieve reliable ATE estimation is selecting the control groups that exhibit high pre-treatment correlations with the treatment group. The rationale is that a highly correlated control is likely to consistently resemble the untreated treatment group over time.

    To validate this hypothesis, let’s evaluate the ATE estimation produced using every single control with its full data since 01/01/2019 to understand the impact of correlation. Firstly, the correlation coefficients between the treatment group (store1) and the control groups (store2 to 499) are calculated [3].

import numpy as np

def correlation(x, y):
    shortest = min(x.shape[0], y.shape[0])
    return np.corrcoef(x.iloc[:shortest].values, y.iloc[:shortest].values)[0, 1]

As shown in the figure below, the distribution of the correlations ranges from -0.1 to 0.9, which provides a comprehensive understanding of the impact across various scenarios.

    Distribution of the pre-treatment correlation. Image by author.

Then, every individual control is used to predict the counterfactual, estimate the ATE, and report the MAPE. In the figure below, the averaged MAPE of the ATE with its 95% confidence interval is plotted against the corresponding pre-treatment correlation. Here, the correlation coefficients are rounded to one decimal place to facilitate aggregation and improve the statistical significance of the analysis. Looking at the results, it is obvious that the estimation becomes more reliable as the control becomes more correlated with the treatment group.

    The MAPE of ATE for different correlation levels. Image by author.

    Now let’s see some examples that demonstrate the impact of pre-treatment correlation: store88 with a correlation of 0.88 delivers a MAPE of 0.12 that is superior to 0.62 given by store3 with a correlation of 0.43. Besides the promising accuracy, the probabilistic intervals are correspondingly narrow, which implies high prediction certainty.

    Example to demonstrate the impact of correlation. Image by author.

    Model Fitting Window

    Next, the fitting window, which is the length of the pre-treatment interval used for fitting the model, needs to be properly configured. This is because too much context could result in a loss of recency while insufficient context might lead to overfitting.

To understand how the fitting window impacts the accuracy of the ATE estimation, a wide range of values from 1 month to 8 months before the treatment date is experimented with. For each fitting window, every single unit of the 499 control groups is evaluated individually and then aggregated to calculate the averaged MAPE with the 95% confidence interval. As depicted in the figure below, there exists a sweet spot around 2 to 3 months that optimises the reliability. Identifying the optimal point is outside the scope of this discussion, but it’s worth noting that the training window needs to be carefully selected.

    The MAPE of ATE for different training windows. Image by author.

    The figure shows two examples: the MAPE of control group 199 is reduced from 0.89 to 0.68 when its fitting window is increased from 1 month to 3 months because the short window contains insufficient knowledge to produce the counterfactual.

    Example to demonstrate the impact of training window. Image by author.

    Number of Control Units

    Lastly, the number of the selected control groups matters.

This hypothesis is validated by investigating the estimation accuracy for different numbers of controls ranging from 1 to 10. In detail, for each control count, the averaged MAPE is calculated based on the estimations produced by 50 random control sets, each containing the corresponding number of control groups. This operation avoids unnecessarily enumerating every possible combination of controls while statistically controlling for correlation. In addition, the fitting window is set to 3 months for every estimation.

Looking at the results below, increasing the number of controls overall leads to a more reliable ATE estimation.

    The MAPE of ATE for different number of controls. Image by author.

    The examples below demonstrate the effect. The first estimation is generated using store311 whereas the second one further adds store301 and store312.

    Example to demonstrate the impact of number of controls. Image by author.

    Conclusions

    In this article, I discussed the possible actions that make the SC estimation more reliable. Based on the experiments with diverse synthetic data, the pre-treatment correlation, fitting window, and number of control units are identified as compelling directions to optimise the estimation. Finding the optimal value for each action is out of the scope of this discussion. However, if you feel interested, parameter search using an isolated blank period for validation [4] is one possible solution.

    All the images are produced by the author unless otherwise noted. The discussions are inspired by the great work “Synthetic controls in action” [1].

    References

    [1] Abadie, Alberto, and Jaume Vives-i-Bastida. “Synthetic controls in action.” arXiv preprint arXiv:2203.06279 (2022).

[2] Brodersen, Kay H., et al. “Inferring causal impact using Bayesian structural time-series models.” (2015): 247–274.

[3] https://medium.com/@dreamferus/how-to-synchronize-time-series-using-cross-correlation-in-python-4c1fd5668c7a

    [4]Abadie, Alberto, and Jinglong Zhao. “Synthetic controls for experimental design.” arXiv preprint arXiv:2108.02196 (2021).


    Towards Reliable Synthetic Control was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Towards Reliable Synthetic Control

    Go Here to Read this Fast! Towards Reliable Synthetic Control