Improving LLM Inference Speeds on CPUs with Model Quantization
Discover how to significantly improve inference latency on CPUs using quantization techniques for mixed, int8, and int4 precisions
One of the most significant challenges the AI space faces is the computing resources required to host large-scale, production-grade LLM-based applications. At scale, LLM applications require redundancy, scalability, and reliability, which have historically only been possible on general computing platforms like CPUs. Still, the prevailing narrative today is that CPUs cannot handle LLM inference at latencies comparable to those of high-end GPUs.
One open-source tool in the ecosystem that can help address inference latency challenges on CPUs is the Intel Extension for PyTorch (IPEX), which provides up-to-date feature optimizations for an extra performance boost on Intel hardware. IPEX delivers a variety of easy-to-implement optimizations that make use of hardware-level instructions. This tutorial will dive into the theory of model compression and the out-of-the-box model compression techniques IPEX provides. These compression techniques directly impact LLM inference performance on general computing platforms, like Intel 4th and 5th-generation CPUs.
Inference Latency in Application Development
Second only to application safety and security, inference latency is one of the most critical parameters of an AI application in production. For LLM-based applications, latency (or throughput) is often measured in tokens/second. As illustrated in the simplified inference processing sequence below, tokens are processed by the language model and then de-tokenized into natural language.
GIF 1. Illustration of the inference processing sequence — Image by Author
Interpreting inference this way can sometimes lead us astray because we analyze this component of AI applications in isolation from the traditional production software paradigm. Yes, AI apps have their nuances, but at the end of the day, we are still talking about transactions per unit of time. If we start to think about inference as a transaction like any other, the problem becomes less complex from an application design point of view. For example, let’s say we have a chat application with the following requirements:
Average of 300 user sessions per hour
Average of 5 transactions (LLM inference requests) per user per session
Average 100 tokens generated per transaction
Each session has an average of 10,000ms (10s) overhead for user authentication, guardrailing, network latency, and pre/post-processing.
Users take an average of 30,000ms (30s) to respond when actively engaged with the chatbot.
The average total active session time goal is 3 minutes or less.
Below, you can see that with some simple napkin math, we can arrive at approximate figures for the required latency of our LLM inference engine.
Figure 1. A simple equation to calculate the required transaction and token latency based on various application requirements. — Image by Author
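To make Figure 1 concrete, here is a rough sketch of that napkin math in Python. It assumes the session budget, minus overhead and user think time, is entirely available for token generation (the variable names are illustrative):

# Napkin math for the example requirements above (assumed budget split)
session_budget_s = 180           # 3-minute target session length
session_overhead_s = 10          # auth, guardrails, network, pre/post-processing
transactions_per_session = 5
user_think_time_s = 30           # per transaction
tokens_per_transaction = 100

# Time left for generation across all transactions in a session
generation_budget_s = (session_budget_s - session_overhead_s
                       - transactions_per_session * user_think_time_s)        # 20 s

latency_per_transaction_s = generation_budget_s / transactions_per_session    # 4 s
required_throughput = tokens_per_transaction / latency_per_transaction_s      # 25 tok/s

print(f"Per-transaction latency budget: {latency_per_transaction_s:.0f}s")
print(f"Required generation throughput: {required_throughput:.0f} tokens/second")

Under these assumptions, the inference engine needs to sustain roughly 25 tokens/second to keep the average session at three minutes or less.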
Achieving required latency thresholds in production is a challenge, especially if you need to do it without incurring additional compute infrastructure costs. In the remainder of this article, we will explore one way that we can significantly improve inference latency through model compression.
Model Compression
Model compression is a loaded term because it covers a variety of techniques, such as model quantization, distillation, pruning, and more. At their core, the chief aim of these techniques is to reduce the computational complexity of neural networks.
GIF 2. Illustration of inference processing sequence — Image by Author
The method we will focus on today is model quantization, which involves reducing the numerical precision of the weights and, at times, the activations. This reduces the computational load of matrix operations and the memory burden of moving around larger, higher-precision values. The figure below illustrates the process of quantizing fp32 weights to int8.
Fig 2. Visual representation of model quantization going from full precision at FP32 down to quarter precision at INT8, theoretically reducing the model complexity by a factor of 4. — Image by Author
It is worth mentioning that the reduction of complexity by a factor of 4 that results from quantizing from fp32 (full precision) to int8 (quarter precision) does not result in a 4x latency reduction during inference because inference latency involves more factors beyond just model-centric properties.
Like with many things, there is no one-size-fits-all approach, and in this article, we will explore three of my favorite techniques for quantizing models using IPEX:
Mixed-Precision (bf16/fp32)
This technique quantizes some, but not all, of the weights in the neural network, resulting in a partial compression of the model. It is ideal for smaller models, such as LLMs with fewer than 1B parameters.
Fig 3. Simple illustration of mixed precision, showing FP32 weights in orange and half-precision quantized bf16 weights in green. — Image by Author
The implementation is quite straightforward: using Hugging Face Transformers, a model can be loaded into memory and optimized with the IPEX LLM-specific optimization function ipex.llm.optimize(model, dtype=dtype). By setting dtype = torch.bfloat16, we activate mixed-precision inference, which improves inference latency over both full-precision (fp32) and stock PyTorch.
import time
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# PART 1: Model and tokenizer loading using transformers
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")

# PART 2: Use IPEX to optimize the model
# dtype = torch.float   # use for full-precision FP32
dtype = torch.bfloat16  # use for mixed-precision inference
model = ipex.llm.optimize(model, dtype=dtype)

# PART 3: Create a Hugging Face inference pipeline and generate results
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
st = time.time()
results = pipe("A fisherman at sea...", max_length=250)
end = time.time()
generation_latency = end - st
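To tie the measurement back to the tokens/second framing from earlier, the wall-clock latency can be converted into approximate throughput. Here is a minimal sketch, assuming the standard text-generation pipeline output format; the token count includes the prompt, so treat the result as a rough estimate:

# Approximate throughput from the measured latency (prompt tokens included)
generated_text = results[0]["generated_text"]
num_tokens = len(tokenizer(generated_text).input_ids)
print(f"~{num_tokens / generation_latency:.1f} tokens/second")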
Of the three compression techniques we will explore, this is the easiest to implement (measured by unique lines of code) and offers the smallest net improvement over a non-quantized baseline.
SmoothQuant (int8)
This technique addresses the core challenges of quantizing LLMs, which include handling large-magnitude outliers in activation channels across all layers and tokens, a common issue that traditional quantization techniques struggle to manage effectively. This technique employs a joint mathematical transformation on both weights and activations within the model. The transformation strategically reduces the disparity between outlier and non-outlier values for activations, albeit at the cost of increasing this ratio for weights. This adjustment renders the Transformer layers “quantization-friendly,” enabling the successful application of int8 quantization without degrading model quality.
Fig 4. Simple illustration of SmoothQuant showing weights as circles and activations as triangles. The diagram depicts the two main steps: (1) the application of a scaling factor for smoothing and (2) the quantization to int8 — Image by Author
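To make the smoothing step more concrete, here is a toy sketch of the underlying idea (an illustration only, not the IPEX implementation): a per-channel scale migrates activation outliers into the weights while leaving the layer’s output mathematically unchanged. The alpha value and tensor shapes are illustrative.

# Toy illustration of SmoothQuant-style smoothing (not the IPEX implementation)
import torch

alpha = 0.5                              # smoothing strength
X = torch.randn(16, 8)                   # activations
X[:, 3] *= 50.0                          # inject one outlier channel
W = torch.randn(8, 8)                    # weights of the following linear layer

# Per-channel scale: divide the activations by s, multiply the weights by s
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)
X_smooth = X / s                         # outliers are dampened -> easier to quantize
W_smooth = W * s[:, None]                # weights absorb the scale

# The layer output is preserved up to floating-point error
print((X @ W - X_smooth @ W_smooth).abs().max())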
Below, you’ll find a simple SmoothQuant implementation — omitting the code for creating the DataLoader, which is a common and well-documented PyTorch pattern (a minimal sketch follows the example). SmoothQuant is an accuracy-aware post-training quantization recipe, meaning that providing a calibration dataset and a calibration model establishes a baseline and limits language-modeling degradation. The calibration step generates a quantization configuration summary, which is then passed to ipex.llm.optimize() along with the SmoothQuant mapping. Upon execution, SmoothQuant is applied, and the model can be tested using the .generate() method.
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare
from transformers import AutoTokenizer, AutoModelForCausalLM

# PART 1: Load model and tokenizer from Hugging Face + load SmoothQuant config mapping
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()

# PART 2: Configure calibration
# prepare your calibration dataset samples
calib_dataset = DataLoader({your dataloader parameters})
example_inputs =  # provide a sample input from your calib_dataset
qconfig_summary_file_path = "qconfig_summary.json"  # arbitrary path for the calibration summary

calibration_model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
)
prepared_model = prepare(
    calibration_model.eval(), qconfig, example_inputs=example_inputs
)
with torch.no_grad():
    for calib_samples in calib_dataset:
        prepared_model(calib_samples)
prepared_model.save_qconf_summary(qconf_summary=qconfig_summary_file_path)

# PART 3: Model quantization using SmoothQuant
model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
    qconfig_summary_file=qconfig_summary_file_path,
)

# generation inference loop
with torch.inference_mode():
    model.generate({your generate parameters})
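For reference, the calibration DataLoader omitted above could be built from a handful of representative prompts. The sketch below is a minimal example; the prompts, max_length, and sample format are illustrative, and the exact inputs the prepared model expects depend on the model and IPEX version:

# Minimal calibration DataLoader sketch (prompts and parameters are illustrative)
from torch.utils.data import DataLoader, Dataset

calib_prompts = [
    "A fisherman at sea...",
    "Explain the benefits of model quantization.",
    # add a few dozen prompts that resemble production traffic
]

class CalibDataset(Dataset):
    def __init__(self, prompts, tokenizer, max_length=512):
        self.samples = [
            tokenizer(p, truncation=True, max_length=max_length,
                      return_tensors="pt")["input_ids"].squeeze(0)
            for p in prompts
        ]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

calib_dataset = DataLoader(CalibDataset(calib_prompts, tokenizer), batch_size=1)
example_inputs = next(iter(calib_dataset))  # one batch to trace the prepared model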
SmoothQuant is a powerful model compression technique and helps significantly improve inference latency over full-precision models. Still, it requires a little upfront work to prepare a calibration dataset and model.
Weight-Only Quantization (int8 and int4)
Compared to traditional int8 quantization applied to both activations and weights, weight-only quantization (WOQ) offers a better balance between performance and accuracy. It is worth noting that int4 WOQ requires dequantizing to bf16/fp16 before computation (Figure 5), which introduces a compute overhead. A basic WOQ technique, tensor-wise asymmetric Round To Nearest (RTN) quantization, presents challenges and often leads to reduced accuracy (source). However, the literature (Zhewei Yao, 2022) suggests that group-wise quantizing of the model’s weights helps maintain accuracy. Since the weights are only dequantized for computation, a significant memory advantage remains despite this extra step.
Fig 5. Simple illustration of weight-only quantization, with pre-quantized weights in orange and the quantized weights in green. Note that this depicts the initial quantization to int4/int8 and dequantization to fp16/bf16 for the computation step. — Image by Author
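As a toy illustration of why group-wise quantization helps, the sketch below compares the mean reconstruction error of tensor-wise and group-wise asymmetric RTN on a random weight matrix. This is an illustration of the idea only, not the IPEX kernel, and the group size of 128 is an illustrative choice:

# Toy comparison of tensor-wise vs. group-wise asymmetric int4 RTN (illustrative)
import torch

def rtn_quant_dequant(w, n_bits=4, group_size=None):
    # Asymmetric round-to-nearest quantization followed by dequantization
    orig_shape = w.shape
    w = w.reshape(-1, group_size) if group_size else w.reshape(1, -1)
    qmax = 2 ** n_bits - 1
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = (-w_min / scale).round()
    q = (w / scale + zero_point).round().clamp(0, qmax)
    return ((q - zero_point) * scale).reshape(orig_shape)

w = torch.randn(4096, 4096)
err_tensor = (w - rtn_quant_dequant(w)).abs().mean()
err_group = (w - rtn_quant_dequant(w, group_size=128)).abs().mean()
print(f"tensor-wise error: {err_tensor:.4f} | group-wise (128) error: {err_group:.4f}")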
The WOQ implementation below showcases the few lines of code required to quantize a model from Hugging Face with this technique. As with the previous implementations, we start by loading a model and tokenizer from Hugging Face. We can use the get_weight_only_quant_qconfig_mapping() method to configure the WOQ recipe. The recipe is then passed to the ipex.llm.optimize() function along with the model for optimization and quantization. The quantized model can then be used for inference with the .generate() method.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM

# PART 1: Model and tokenizer loading
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")

# PART 2: Preparation of quantization config
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.qint8,                      # or torch.quint4x2 for int4
    lowp_mode=ipex.quantization.WoqLowpMode.NONE,  # or FP16, BF16, INT8
)
checkpoint = None  # optionally load an int4 or int8 checkpoint

# PART 3: Model optimization and quantization
model = ipex.llm.optimize(
    model, quantization_config=qconfig, low_precision_checkpoint=checkpoint
)

# PART 4: Generation inference loop
with torch.inference_mode():
    model.generate({your generate parameters})
As you can see, WOQ provides a powerful way to compress models down to a fraction of their original size with limited impact on language modeling capabilities.
Conclusion and Discussion
As an engineer at Intel, I’ve worked closely with the IPEX engineering team. This has afforded me a unique insight into its advantages and development roadmap, which makes IPEX my preferred tool. However, for developers seeking simplicity without the need to manage an extra dependency, PyTorch offers three quantization recipes: Eager Mode, FX Graph Mode (under maintenance), and PyTorch 2 Export Quantization, providing strong, less specialized alternatives.
No matter which technique you choose, model compression will result in some loss of language modeling performance, albeit less than 1% in many cases. For this reason, it’s essential to evaluate the application’s fault tolerance and establish a baseline for model performance at full (FP32) and/or half precision (BF16/FP16) before pursuing quantization.
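One lightweight way to establish such a baseline is to compare perplexity on a held-out text sample before and after quantization. The sketch below assumes baseline_model and quantized_model have already been loaded; the names and the placeholder sample are illustrative, and a proper evaluation should use a representative held-out dataset:

# Compare perplexity before and after quantization (illustrative baseline check)
import torch

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.inference_mode():
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

sample_text = "..."  # a representative held-out passage from your domain
print("baseline perplexity: ", perplexity(baseline_model, tokenizer, sample_text))
print("quantized perplexity:", perplexity(quantized_model, tokenizer, sample_text))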
In applications that leverage some degree of in-context learning, like Retrieval Augmented Generation (RAG), model compression might be an excellent choice. In these cases, the mission-critical knowledge is spoon-fed to the model at inference time, so the risk is heavily reduced even for applications with low fault tolerance.
Quantization is an excellent way to address LLM inference latency concerns without upgrading or expanding compute infrastructure. It is worth exploring regardless of your use case, and IPEX provides a good starting point with just a few lines of code. A couple of things to try:
Take an existing model that you’re running on an accelerator at full precision and test it out on a CPU at int4/int8.
Explore all three techniques and determine which works best for your use case. Make sure to compare the loss of language modeling performance, not just latency.