Dimensionality reduction through Principal Component Analysis (PCA).
Originally appeared here:
Principal Component Analysis — Hands-On Tutorial
Markov Decision Processes are foundational to sequential decision-making problems and serve as the building block for reinforcement learning. They model the dynamic interaction between an agent that must take a sequence of actions and its environment. Due to their wide applicability in fields such as robotics, finance, operations research and AI, MDPs have been extensively studied in both theoretical and practical contexts.
Yet, much of the existing MDP literature focuses on idealized scenarios where model parameters, such as transition probabilities and reward functions, are assumed to be known with certainty. In practice, applying popular methods such as Policy Iteration and Value Iteration requires precise estimates of these parameters, often obtained from real-world data. This reliance on data introduces significant challenges: the estimation process is inherently noisy and sensitive to limitations such as data scarcity, measurement errors and variability in the observed environment. Consequently, the performance of standard MDP methods can degrade substantially when applied to problems with uncertain or incomplete data.
In this article, we build on the Robust Optimization (RO) literature to propose a generic framework to address these issues. We provide a Robust Linear Programming (RLP) formulation of MDPs that is capable of handling various sources of uncertainty and adversarial perturbations.
Let’s start by giving a formal definition of MDPs:
A Markov Decision Process is a 5-tuple (S, A, R, P, γ) where S is the state space, A is the action space, R(s, a) is the reward function, P(s'|s, a) gives the transition probabilities, and γ ∈ [0, 1) is the discount factor.
We then define the policy, i.e. what dictates the agent’s behavior in an MDP:
A policy π is a probability measure over the action space defined as: π(a|s) is the probability of taking action a when the agent is in state s.
We finally introduce the value function, i.e. the agent’s objective in an MDP:
The value function of a policy π is the expected discounted reward under this policy, when starting at a given state s:

V^π(s) = E_π [ Σ_{t≥0} γ^t R(s_t, a_t) | s_0 = s ]
In particular, the value function of the optimal policy π* satisfies the Bellman optimality equation:

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
Which yields the deterministic optimal policy:

π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
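Before moving to the linear programming view, here is a minimal numpy sketch of the Bellman optimality operator and value iteration; the small transition and reward arrays are hypothetical placeholders used only for illustration:

import numpy as np

def bellman_operator(V, P, R, gamma):
    # (H*V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    return np.max(R + gamma * np.einsum("asn,n->sa", P, V), axis=1)

def value_iteration(P, R, gamma, tol=1e-8):
    V = np.zeros(P.shape[1])
    while True:
        V_new = bellman_operator(V, P, R, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new  # approximate fixed point V* of H*
        V = V_new

# Tiny 2-state, 2-action MDP: P[a, s, s'] and R[s, a] are made-up numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
V_star = value_iteration(P, R, gamma=0.95)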
Given the above definitions, we can start by noticing that any value function V that satisfies

V(s) ≥ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s')   for all (s, a)

is an upper bound on the optimal value function. To see it, we can start by noticing that such a value function also satisfies:

V(s) ≥ max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') ]   for all s

We recognize the value iteration operator H* applied to V:

(H*V)(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') ]

i.e.

V ≥ H*V

Also noticing that the H* operator is increasing (monotone), we can apply it iteratively to get:

V ≥ H*V ≥ (H*)²V ≥ … ≥ lim_{n→∞} (H*)ⁿV = V*

where we used the property of V* being the fixed point of H*.
Therefore, finding V* comes down to finding the tightest upper bound V that obeys the above inequality, which yields the following formulation:

min_V Σ_s μ(s) V(s)
s.t. V(s) − γ Σ_{s'} P(s'|s, a) V(s') ≥ R(s, a)   for all (s, a)

Here we added a weight term μ(s) corresponding to the probability of starting in state s. We can see that the above problem is linear in V and can be rewritten in standard form as follows:

min_V μᵀV
s.t. c_{(s,a)}ᵀ V ≥ r_{(s,a)}   for all (s, a)

where c_{(s,a)} = e_s − γ P(·|s, a) is the constraint vector and r_{(s,a)} = R(s, a) the right-hand side associated with the pair (s, a).
Further details can be found in [1] and [2].
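To make this concrete, here is a minimal sketch of the nominal (non-robust) LP using scipy.optimize.linprog. The tabular representation (P indexed as P[a, s, s'], R as R[s, a]) is an assumption made for illustration:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma, mu):
    """Nominal LP: min mu^T V  s.t.  V(s) - gamma * sum_s' P[a, s, s'] V(s') >= R[s, a]."""
    n_actions, n_states, _ = P.shape
    A_ub, b_ub = [], []
    for a in range(n_actions):
        for s in range(n_states):
            # linprog expects A_ub @ x <= b_ub, so the ">=" constraint is negated
            A_ub.append(-(np.eye(n_states)[s] - gamma * P[a, s]))
            b_ub.append(-R[s, a])
    res = linprog(c=mu, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    return res.x  # optimal value function V*

Called on the tiny MDP from the value iteration sketch above, with any strictly positive start distribution (e.g. mu = np.array([0.5, 0.5])), the LP returns the same V* as value iteration, up to solver tolerance.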
Given the above linear program in standard form, the RO framework assumes adversarial noise in the inputs (i.e. the cost vector and the constraints). To model this uncertainty, we define an uncertainty set U: the set of values the uncertain constraint vector c is allowed to take.
In short, we look for the best solution that remains feasible for every realization of c in the uncertainty set. Naturally this yields a completely intractable model (potentially an infinite number of constraints) since we did not make any assumption on the form of U.
Before addressing these issues, we make the following assumption, without loss of generality: the uncertain constraint vector can be written as

c = bar{c} + Qz,   z ∈ Z

where bar{c} is known as the nominal constraint vector (e.g. obtained from some estimation), z is the uncertain factor and Q a fixed matrix intuitively corresponding to how the noise is applied to each coefficient of the constraint vector. Q can be used, for instance, to model correlation between the noise on different components of c. See [3] for more details and proofs.
Note: we made a slight abuse of notation and dropped the (s, a) subscripts for readability — yet c, bar{c}, Q and z are all for a given state and action couple.
All this leads to the following formulation of the problem:

min_V μᵀV
s.t. (bar{c} + Qz)ᵀ V ≥ r   for all z ∈ Z, for all (s, a)
At this stage, we can make some assumptions on the form of the uncertainty set in order to further simplify the problem. In particular, we assume box uncertainty:

Z = { z : ||z||_∞ ≤ d }
The optimization problem becomes:

min_V μᵀV
s.t. bar{c}ᵀV + min_{||z||_∞ ≤ d} zᵀQᵀV ≥ r   for all (s, a)

which is equivalent to:

min_V μᵀV
s.t. bar{c}ᵀV − max_{||z||_∞ ≤ d} (−zᵀQᵀV) ≥ r   for all (s, a)

Finally, looking closer at the maximization problem in the constraint, we see that it has a closed form:

max_{||z||_∞ ≤ d} (−zᵀQᵀV) = d ||QᵀV||₁

Therefore the final problem can be written as follows (the robust counterpart of a linear program with box uncertainty):

min_V μᵀV
s.t. bar{c}ᵀV − d ||QᵀV||₁ ≥ r   for all (s, a)
A few comments on the above formulation:
For more details, the interested reader can refer to [3].
Starting from the above formulation, we expand the ℓ1 norm as a sum of absolute values:

bar{c}ᵀV − d Σ_i |(QᵀV)_i| ≥ r   for all (s, a)

And finally, linearizing the absolute value in the constraints gives:

min_{V, y} μᵀV
s.t. bar{c}ᵀV − d Σ_i y_i ≥ r
     −y ≤ QᵀV ≤ y
for all (s, a), with one auxiliary vector y per (s, a) constraint.
We notice that robustness translates into an additional safety term in the constraints — given the uncertainty on c (which mainly translates into uncertainty in the MDP’s transition probabilities).
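As an illustration, here is a minimal sketch of this robust counterpart using cvxpy (an assumption; any LP solver works once the constraints are linearized as above). The choice Q = I and a single box size d are illustrative, not part of the original derivation:

import numpy as np
import cvxpy as cp

def solve_robust_mdp_lp(P_hat, R, gamma, mu, d):
    """Robust LP: for each (s, a), bar{c}^T V - d * ||Q^T V||_1 >= R[s, a], with Q = I."""
    n_actions, n_states, _ = P_hat.shape
    V = cp.Variable(n_states)
    constraints = []
    for a in range(n_actions):
        for s in range(n_states):
            c_bar = np.eye(n_states)[s] - gamma * P_hat[a, s]  # nominal constraint vector
            # cvxpy linearizes the l1 term internally, exactly as done by hand above
            constraints.append(c_bar @ V - d * cp.norm1(V) >= R[s, a])
    problem = cp.Problem(cp.Minimize(mu @ V), constraints)
    problem.solve()
    return V.value

Run on the tiny MDP from the earlier sketches with a small box size (e.g. d = 0.01, so that the robust problem stays feasible), the solution satisfies each nominal constraint with an extra margin of d·||V||₁, at the price of a larger objective value than the nominal LP.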
As stated previously, considering uncertainty on the rewards as well can easily be done with a similar derivation. Coming back to the linear program in standard form, we add another noise term on the right-hand side of the constraints:

r = bar{r} + δ,   |δ| ≤ d_r

After a similar reasoning as previously done, we obtain the all-in linear program:

min_{V, y} μᵀV
s.t. bar{c}ᵀV − d Σ_i y_i − d_r ≥ bar{r}
     −y ≤ QᵀV ≤ y
for all (s, a)
Again, similarly to before, additional robustness with regard to the reward function translates into another safety term in the constraints, which ultimately can yield a less optimal value function and policy but satisfies the constraints with some margin. This tradeoff is controlled both by Q and the size of the uncertainty box d.
While this completes the derivation of the Robust MDP as a linear program, other robust MDP approaches have been developed in the literature, see for instance [4]. These approaches usually take a different route, for instance directly deriving a robust Policy Evaluation operator, which has the advantage of a better computational complexity than the LP approach. This proves particularly important when state and action spaces are large. Why, then, would we use such a formulation?
[1] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming (1996), Wiley
[2] P. Poupart, Sequential Decision Making and Reinforcement Learning (2013), University of Waterloo
[3] D. Bertsimas, D. Den Hertog, Robust and Adaptive Optimization (2022), Dynamic Ideas
[4] W. Wiesemann, D. Kuhn, B. Rustem, Robust Markov Decision Processes (2013), INFORMS
[5] K. Vu, P.-L. Poirion, L. Liberti, Random Projections for Linear Programming (2018), INFORMS
Originally appeared here:
Uncertainty in Markov Decisions Processes: a Robust Linear Programming approach
Receive clear and useful feedback. Ditch generic questions. More than 60 example questions for you to use.
Originally appeared here:
Asking for Feedback as a Data Scientist Individual Contributor
Transform your data into actionable insights with customer segmentation for improved engagement and profitability
Originally appeared here:
Unlocking Business Potential Through Effective Customer Segmentation
Learn about Maximum Likelihood Estimates via their application for next-word prediction
Originally appeared here:
Introduction to Maximum Likelihood Estimates
Have you ever wished your deep-learning model could run faster?
The GPU is expensive, the dataset is enormous, the training sessions seem endless, and you have a million experiments to run and a deadline to hit. All of these are good reasons to look for some form of training acceleration.
But which one to choose?
There are already good references on performance tuning for model training from PyTorch, HuggingFace, and Nvidia, including asynchronous data loading, buffer checkpointing, distributed data parallelization, and automatic mixed precision.
In this article, I’ll introduce the automatic mixed precision technique. I’ll start with a brief introduction to Nvidia’s Tensor core design, then cover the groundbreaking “Mixed Precision Training” paper published at ICLR 2018, and lastly walk through a simple example of training a ResNet50 on FashionMNIST, showing how to speed up training by 2X while doubling the batch size, with only three extra lines of code.
First, let’s recap some fundamentals of GPU design. One of Nvidia’s most popular commercial GPU families is the Volta family, e.g., the V100 GPU, which is based on the GV100 design. So we’ll base our discussion on the GV100 architecture below.
For GV100, the Streaming Multiprocessor (SM) is the core unit of computation. Each GPU contains 6 GPU Processing Clusters (GPCs) and 84 SMs (80 enabled SMs for V100). The overall design looks like the one below.
Each SM contains two types of cores: CUDA cores and Tensor cores. CUDA cores are Nvidia’s original design, introduced in 2006 as an essential part of the CUDA platform, and come in three types: FP64 cores/units, FP32 cores/units, and INT32 cores/units. Each GV100 SM contains 32 FP64 cores, 64 FP32 cores, and 64 INT32 cores. Tensor cores were introduced with the Volta (2017) and later Turing generations, setting them apart from the earlier Pascal (2016) generation. Each SM on a GV100 contains 8 Tensor cores. A full list of details is given here for V100 GPUs. A detailed look at the SM design is below.
Why Tensor cores? Nvidia Tensor cores are dedicated to performing general matrix multiplication (GEMM) and half-precision matrix multiplication and accumulation (HMMA) operations. In short, GEMM performs matrix operations in the format of A*B + C, and HMMA converts the operation into the half-precision format. A detailed discussion can be found here. Since deep learning involves MMA heavily, the Tensor cores are essential in today’s model training and speed-up.
Of course, when switching to mixed precision training, always check the specification of the GPU you’re using. Only the more recent GPU generations (Volta and later) include Tensor cores, so mixed precision training only delivers its full speed-up on these machines.
Now, let’s take a closer look at the FP32 and FP16 formats. FP32 and FP16 are IEEE formats that represent floating-point numbers using 32 bits and 16 bits of binary storage, respectively. Both formats comprise three parts: a) a sign bit, b) exponent bits, and c) mantissa bits. FP32 and FP16 differ in the number of bits allocated to the exponent and mantissa, which results in different value ranges and precisions.
How do you convert FP16 and FP32 bit patterns to real values? According to the IEEE-754 standard, the decimal value for FP32 = (−1)^sign × 2^(decimal exponent − 127) × (implicit leading 1 + decimal mantissa), where 127 is the exponent bias. For FP16, the formula becomes (−1)^sign × 2^(decimal exponent − 15) × (implicit leading 1 + decimal mantissa), where 15 is the corresponding exponent bias. See further details of the biased exponent value here.
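As a quick check of this formula, the snippet below decodes a hand-picked FP16 bit pattern (the pattern itself is just an illustrative example) and compares the result with numpy’s interpretation of the same bits:

import numpy as np

# sign = 0, exponent bits = 10000 (16), mantissa bits = 1000000000 (512)
bits = 0b0_10000_1000000000
sign     = (bits >> 15) & 0x1
exponent = (bits >> 10) & 0x1F
mantissa =  bits        & 0x3FF

manual = (-1) ** sign * 2.0 ** (exponent - 15) * (1 + mantissa / 1024)
via_numpy = np.array([bits], dtype=np.uint16).view(np.float16)[0]
print(manual, via_numpy)  # both print 3.0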
In this sense, the largest normal value for FP32 is 2^127 × (2 − 2^(−23)) ≈ 3.4*1e38, and the largest normal value for FP16 is 2^15 × (2 − 2^(−10)) = 65504, so the representable ranges are approximately [-3.4*1e38, 3.4*1e38] for FP32 and [-65504, 65504] for FP16. Note that the stored exponent for FP32 lies between 0 and 255, and the largest value 0xFF is excluded since it encodes Inf/NaN; that’s why the largest usable unbiased exponent is 254 − 127 = 127. A similar rule applies to FP16.
For the precision, note that both the exponent and the mantissa contribute to the precision limit through subnormal (denormalized) numbers (see the detailed discussion here). FP32 can therefore represent values as small as 2^(-23) × 2^(-126) = 2^(-149), and FP16 values as small as 2^(-10) × 2^(-14) = 2^(-24).
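These limits can be checked directly in PyTorch with torch.finfo; the values in the comments are what PyTorch reports:

import torch

for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    # info.tiny is the smallest *normal* value; subnormals extend further down,
    # to 2**-149 for FP32 and 2**-24 for FP16
    print(dtype, info.max, info.tiny, info.eps)
# torch.float32 3.4028234663852886e+38 1.1754943508222875e-38 1.1920928955078125e-07
# torch.float16 65504.0 6.103515625e-05 0.0009765625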
The difference between FP32 and FP16 representations brings the key concerns of mixed precision training, as different layers/operations of deep learning models are either insensitive or sensitive to value ranges and precision and need to be addressed separately.
Now that we have learnt the hardware foundation for MMA, the concept of Tensor cores, and the key difference between FP32 and FP16, we can further discuss the details for mixed precision training.
The idea of mixed precision training was first proposed in the 2018 ICLR paper “Mixed Precision Training”, which converts deep learning models to half-precision floating point during training without losing model accuracy or modifying hyper-parameters. As mentioned above, since the key difference between FP32 and FP16 is the value range and precision, the paper discusses in detail why FP16 causes gradients to vanish (underflow to zero) and how to fix the issue with loss scaling. The paper also proposes tricks like keeping an FP32 master copy of the weights and using FP32 for specific operations like reductions and vector dot-product accumulations.
Loss scaling. The paper gives an example of training a Multibox SSD detector network using FP32 precision, as shown below. Without any scaling, FP16 can only represent gradient values down to roughly 2^(-24); everything below underflows to zero, which discards information that FP32 would keep. However, experiments show that simply scaling the gradients by 2³ = 8 brings half-precision training accuracy back in line with FP32. In this sense, the authors argue that the extra few percent of gradient values between 2^(-27) and 2^(-24) are still important in the training process, while values below 2^(-27) are not.
The way to address this range mismatch is to apply loss scaling. By the chain rule, scaling the loss scales all the gradients by the same amount. The gradients then need to be unscaled before the final weight update.
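Conceptually, manual loss scaling looks like the following sketch. The toy FP16 model, the static scale factor of 1024, and the assumption of a CUDA device are all illustrative; in practice the GradScaler shown later picks and adjusts the scale automatically and keeps FP32 master weights:

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda().half()        # toy FP16 model, illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
inputs = torch.randn(8, 10, device="cuda", dtype=torch.float16)
targets = torch.randn(8, 1, device="cuda", dtype=torch.float16)

scale = 1024.0                                # arbitrary static scale factor
loss = nn.functional.mse_loss(model(inputs), targets)
(loss * scale).backward()                     # chain rule: every gradient is scaled by `scale`
with torch.no_grad():
    for p in model.parameters():
        p.grad.div_(scale)                    # unscale before the weight update
optimizer.step()
optimizer.zero_grad()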
Nvidia first released automatic mixed precision training as a PyTorch extension called APEX; the approach was then adopted natively by mainstream frameworks like PyTorch, TensorFlow, and MXNet. See the Nvidia docs here. For simplicity, we’ll only cover PyTorch’s automatic mixed precision library: https://pytorch.org/docs/stable/amp.html.
The amp library can automatically handle most of the mixed precision training techniques, like the FP32 master weight copy. The users are mainly exposed to ops autocast and gradient/loss scaling.
Ops autocast. Although we mentioned that tensor cores could largely improve the performance of GEMM operations, certain operations are unsuitable for half-precision representations.
The amp library provides a list of CUDA ops eligible for half precision. Most matrix multiplications, convolutions, and linear layers are fully covered by amp.autocast; however, reductions/sums, softmax, and loss calculations are still performed in FP32, as they are more sensitive to data range and precision.
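A small snippet makes this behavior visible. On a recent PyTorch version with a CUDA device, the matmul output comes back in FP16 while the softmax output stays in FP32 (the dtypes in the comment are what the autocast op lists suggest):

import torch

x = torch.randn(8, 8, device="cuda")
w = torch.randn(8, 8, device="cuda")
with torch.autocast(device_type="cuda"):
    y = x @ w                      # matmul autocasts to FP16
    s = torch.softmax(y, dim=-1)   # softmax autocasts to FP32
print(y.dtype, s.dtype)            # torch.float16 torch.float32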
Gradient/loss scaling. The amp library provides automatic gradient scaling techniques so the user doesn’t have to adjust the scaling during training manually. A more detailed algorithm for the scaling factor can be found here.
Once the gradients are scaled, they need to be unscaled before gradient clipping and regularization, as shown below. More details can be found here.
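Here is a minimal sketch of that pattern, following the recipe in the PyTorch amp documentation; the toy model and data are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(8, 10, device="cuda")
y = torch.randn(8, 1, device="cuda")

with torch.autocast(device_type="cuda"):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                                 # gradients back to their true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # clip on unscaled gradients
scaler.step(optimizer)                                     # skips the step if inf/nan gradients appear
scaler.update()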
The torch.amp library is relatively easy to use and only requires three lines of code to boost your training speed by 2X.
We start with a very simple task: training a ResNet50 model on the FashionMNIST dataset (MIT licence) using FP32. The training time is 333 seconds for ten epochs:
Now we use the amp library, which only requires three extra lines of code for mixed precision training. The training finished within 141 seconds, a 2.36X speedup over the FP32 training, while achieving the same precision, recall and F1-score.
import torch

scaler = torch.cuda.amp.GradScaler()

# inside your training loop (model, optimizer, criterion, inputs, labels as usual):
optimizer.zero_grad()
with torch.autocast(device_type="cuda"):
    # forward pass and loss computation run under autocast
    outputs = model(inputs)
    loss = criterion(outputs, labels)
# scale the loss before backprop, then step the optimizer and update the scale factor
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
The github link for the code above is here.
Mixed precision training is a valuable technique for accelerating deep learning model training. It not only speeds up floating-point operations but also saves GPU memory, since the training batch and activations can be kept in FP16, roughly halving their memory footprint. With PyTorch’s amp library, the extra code can be reduced to three additional lines, as the FP32 master weight copy, loss scaling, and operation type casts are all handled by the library internally.
However, mixed precision training doesn’t really resolve the GPU memory issue if the model weights are much larger than the data batches. For one thing, only certain layers are cast to FP16 while the rest are still computed in FP32; second, the weight updates need FP32 master copies, which still take significant GPU memory; third, the states of optimizers like Adam take significant GPU memory during training, and mixed precision training leaves them unchanged. In that sense, more advanced techniques like DeepSpeed’s ZeRO algorithm are needed.
Originally appeared here:
The Mystery Behind the PyTorch Automatic Mixed Precision Library
Reflecting on lessons learned using a DB proxy managed by AWS
Originally appeared here:
Setting Up and Monitoring RDS Proxy
Here are my favorite techniques — one is faster, the other is more accurate.
Originally appeared here:
Building RAGs Without A Retrieval Model Is a Terrible Mistake
Originally appeared here:
Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock