Originally appeared here:
Streamline financial workflows with generative AI for email automation
It has become apparent that AI projects are hard; some estimate that 80% of them fail. Still, generative AI is here to stay, and companies are searching for ways to apply it to their operations. AI projects fail because they do not deliver value, and the root cause is applying AI to the wrong use cases. The way to find the right use cases is with three measures: (1) measure the magnitude of the problem, (2) measure the solution accuracy retrospectively on historical data, and (3) measure the solution accuracy in real time after deployment.
These steps should be investigated sequentially. If the problem magnitude is not big enough to bring the needed value, do not build. If the solution accuracy on historical data is not high enough, do not deploy. If the real-time accuracy is not high enough, adjust the solution.
I will be discussing generative AI rather than AI in general. Generative AI is a small subfield of AI. In general AI projects, the goal is to find a model that approximates how the data is generated, which requires high proficiency in machine learning algorithms and data processing. In generative AI projects, the model is already given, an LLM such as ChatGPT, and the goal is to use that existing model to solve a business problem. The latter requires fewer technical skills and more problem-solving knowledge. Generative AI use cases are much easier to implement and validate, because the step of creating the algorithm is left out and the data (text) is relatively standardized.
Every person can identify problems in their daily work. The challenge lies in determining which issues are significant enough to be solved and where AI could and should be applied.
Instead of going through all subjective problems and finding data to validate their existence, we can focus on processes that generate textual data. This approach narrows the scope to measurable problems, where AI and automation can add demonstrable value.
Concretely, instead of asking the customer support specialist “What problems are there in your work?”, we should measure where the employee spends the most time. Let’s go through this with an example.
Paperclips & Friends (P&F) is a company that makes paper clips. They have a support channel #P&FSupport, where customers discuss issues around paper clips. P&F responds to customer questions on time, but the channel keeps getting busier. The customer support specialists hear about ChatGPT and want it to help with customer questions.
Before the data science team of P&F starts solving the issue, they measure the number of incoming questions to understand the magnitude of the problem. They notice hundreds of inquiries arriving daily.
The data science team develops a RAG chatbot using ChatGPT and P&F internal documentation. They release the chatbot to be tested by the customer support specialists and receive mixed feedback. Some experts love the solution and mention that it solves most of the issues, while others criticize it, claiming it provides no value.
The P&F data science team faces a challenge — who speaks the truth? Is the chatbot any good? Then, they remember the second Measure:
“Measure the solution accuracy retrospectively”
NOTE: it is crucial to verify whether the root cause itself can be solved. If Paperclips & Friends found out that 90% of the support channel messages were related to unclear usage instructions, P&F could create a chatbot to answer those messages. However, the customers wouldn't have sent the questions in the first place if P&F had included a simple instruction guide with the paperclip shipment.
The P&F data science team faces a challenge: they must weigh each expert opinion equally, but they can't satisfy everyone. Instead of focusing on the experts' subjective opinions, they decide to evaluate the chatbot on historical customer questions. Now the experts do not need to invent questions to test the chatbot, which brings the evaluation closer to real-world conditions. The initial reason for involving the experts, after all, was that they understand real customer questions better than the P&F data science team does.
It turns out that commonly asked questions for P&F are related to paper clip technical instructions. P&F customers want to know detailed technical specifications of the paper clips. P&F has thousands of different paper clip types, and it takes a long time for customer support to answer the questions.
Following a test-driven development mindset, the data science team creates a dataset from the conversation history, containing each customer question and the corresponding customer support reply:
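As a rough illustration, such a dataset could be assembled with pandas; the column names and example rows below are hypothetical, not actual P&F data:

import pandas as pd

# hypothetical rows standing in for the real #P&FSupport history
qa_history = pd.DataFrame({
    "customer_question": [
        "What is the maximum sheet capacity of the PC-200 clip?",
        "Are the jumbo clips rust resistant?",
    ],
    "support_reply": [
        "The PC-200 holds up to 20 sheets of 80 gsm paper.",
        "Yes, the jumbo line has a nickel coating that prevents rust.",
    ],
})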
Having a dataset of questions and answers, P&F can test and evaluate the chatbot’s performance retrospectively. They create a new column, “Chatbot reply”, and store the chatbot example replies to the questions.
We can have the experts and GPT-4 evaluate the quality of the chatbot’s replies. The ultimate goal is to automate the chatbot accuracy evaluation by utilizing GPT-4. This is possible if experts and GPT-4 evaluate the replies similarly.
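A minimal sketch of what such an automated evaluation could look like, assuming the OpenAI Python client and an illustrative PASS/FAIL grading prompt (not the prompt P&F actually used):

from openai import OpenAI

client = OpenAI()

GRADING_PROMPT = (
    "You are grading a support chatbot. Given the customer question, the "
    "historical support reply, and the chatbot reply, answer only PASS or FAIL."
)

def grade_reply(question: str, support_reply: str, chatbot_reply: str) -> bool:
    # ask GPT-4 to compare the chatbot reply against the historical reply
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": GRADING_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Support reply: {support_reply}\n"
                f"Chatbot reply: {chatbot_reply}"
            )},
        ],
    )
    return response.choices[0].message.content.strip().upper() == "PASS"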
Experts create a new Excel sheet with each expert’s evaluation, and the data science team adds the GPT-4 evaluation.
There are conflicts on how different experts evaluate the same chatbot replies. GPT-4 evaluates similarly to expert majority voting, which indicates that we could do automatic evaluations with GPT-4. However, each expert’s opinion is valuable, and it’s important to address the conflicting evaluation preferences among the experts.
P&F organizes a workshop with the experts to create golden-standard responses to the historical question dataset and evaluation best-practice guidelines to which all experts agree.
With the insights from the workshop, the data science team can create a more detailed evaluation prompt for GPT-4 that covers edge cases (e.g., "the chatbot should not ask the customer to raise support tickets"). Now the experts can use their time to improve the paper clip documentation and define best practices, instead of performing laborious chatbot evaluations.
By measuring the percentage of correct chatbot replies, P&F can decide whether they want to deploy the chatbot to the support channel. They approve the accuracy and deploy the chatbot.
Finally, it's time to save all the chatbot responses and calculate how well the chatbot performs at solving real customer inquiries. As the customer can respond directly to the chatbot, it is also important to record the customer's response to understand their sentiment.
The same evaluation workflow can be used to measure the chatbot's success in production, this time without ground-truth replies. But now the customers are getting their initial reply from a chatbot, and we do not know whether they like it. We should therefore investigate how customers react to the chatbot's replies. We can detect negative sentiment in the customers' replies automatically and assign customer support specialists to handle angry customers.
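A minimal sketch of such an escalation check, again assuming the OpenAI Python client and an illustrative prompt:

from openai import OpenAI

client = OpenAI()

def is_negative(customer_reply: str) -> bool:
    # classify the sentiment of the customer's reply to the chatbot
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "Classify the sentiment of the customer message as POSITIVE, "
                "NEUTRAL, or NEGATIVE. Answer with one word."
            )},
            {"role": "user", "content": customer_reply},
        ],
    )
    return response.choices[0].message.content.strip().upper() == "NEGATIVE"

# route unhappy customers to a human specialist
if is_negative("This reply did not help at all, I want a refund."):
    print("Escalating to a customer support specialist")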
In this short article, I explained three steps to avoid failing your AI project: measure the magnitude of the problem, measure the solution accuracy retrospectively, and measure the solution accuracy in real time.
The first two steps are by far the most crucial and are the primary reasons why many projects fail. While it is possible to succeed without measuring the problem’s magnitude or the solution’s accuracy, subjective estimates are generally flawed due to hundreds of human biases. Correctly designed data-driven approaches almost always give better results.
If you are curious about how to implement chatbots like this, check out my blog post on RAG, and blog post on advanced RAG.
Lastly, feel free to network with me on LinkedIn and follow me here if you wish to read similar articles 🙂
LinkedIn: @sormunenteemu
Unless otherwise noted, all images and data are by the author.
How to Find and Solve Valuable Generative AI Use Cases was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
How to Find and Solve Valuable Generative AI Use Cases
How can we use Power BI to plan for the future using Stock-Measures? Here, I show you a possible approach.
Originally appeared here:
Using Power BI for Planning (Warehouse) with Stock Values
Two useful functions to nest and un-nest data sets in PySpark
Originally appeared here:
PySpark Explained: The explode and collect_list Functions
How to structure your Python package projects to ensure efficiency, effectiveness and future-proofing
Originally appeared here:
A Proposed Perfect Package Prototype for Python Projects
Have you ever wondered why the Vision Transformer (ViT) uses 16*16 size patches as input tokens?
It all dates back to the earlier days of Transformers. The original Transformer model was proposed in 2017 and only worked with natural language data. When the BERT model was released in 2018, it could only handle a maximum token sequence length of 512. Later, when GPT-3 was released in 2020, it could handle sequences of length 2048, and GPT-3.5 extended this to 4096. All these models showed amazing performance on sequence-to-sequence and text-generation tasks.
However, these sequences are far too short for images when tokens are taken at the pixel level. For example, in CIFAR-100 the image size is 32*32 = 1024 pixels, and in ImageNet it is 224*224 = 50,176 pixels. The sequence length would be an immediate barrier if the transformer were applied directly at the pixel level.
The ViT paper was released in 2020. It proposed using patches rather than pixels as the input tokens. For an image of size 224*224 and a patch size of 16*16, the sequence length is reduced to just 196, which solved the issue.
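The token counts are easy to verify:

image_size, patch_size = 224, 16
pixel_tokens = image_size * image_size          # 50176 tokens at the pixel level
patch_tokens = (image_size // patch_size) ** 2  # 196 tokens with 16*16 patches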
However, the issue was only partially solved. For tasks requiring finer-grained features, different approaches have to be used to recover pixel-level accuracy. Segformer proposed fusing features from hierarchical transformer encoders of different resolutions. SwinIR had to combine a CNN with multiple levels of skip connections around the transformer module for fine-grained feature extraction. The Swin Transformer, a model for universal computer vision tasks, started with a patch size of 4*4 in each local window and then gradually built toward the global 16*16 patch size to obtain both globality and granularity. Intrinsically, these efforts pointed to one fact: simply using 16*16 patches is insufficient.
The natural question is, can a "pixel" be used directly as a token for transformers? The question further splits into two: 1. Is it possible to feed an ultra-long sequence (e.g., 50k tokens) to a transformer? 2. Does feeding pixels as tokens provide more information than patches?
In this article, I will summarize two recent papers: 1. Pixel Transformer, a technical report released by Meta AI last week, which compares pixel-wise tokens with patch-wise tokens from the perspective of reducing the inductive bias of locality, on three different tasks: image classification, pre-training, and generation. 2. Ultra-long sequence distributed transformer: by distributing the query vectors, the authors showed that an input sequence of length 50k can be scaled across 3k GPUs.
Pixel Transformer (PiT) — from an inductive bias perspective
Meta AI released the technical report last week on arXiv: "An image is worth more than 16*16 patches". Instead of proposing a novel method, the report answers a long-standing question: does it make sense to use pixels instead of patches as input tokens? If so, why?
The paper takes the perspective of the inductive bias of locality. According to K. Murphy's well-known machine learning book, inductive bias is the "assumptions about the nature of the data distribution." In the early, "non-deep learning" era, the inductive bias was more "feature-related," coming from the manual features engineered for specific tasks. This inductive bias was not a bad thing, especially for specific tasks where very good prior knowledge from human experts was available, making the engineered features very useful. From the generalization perspective, however, engineered features are very hard to carry over to universal tasks like general image classification and segmentation.
But beyond feature bias, the architecture itself contains inductive bias as well. The ViT is a great example of having less inductive bias than CNN models in terms of architectural hierarchy, propagation uniformity, representation scale, and attention locality. See my previous Medium post for a detailed discussion. Still, ViT retains one particular type of inductive bias: locality. When the ViT processes a sequence of patch tokens, pixels within the same patch are naturally treated by the model differently than pixels from different patches. And that is where the locality comes from.
So, is it possible to remove the inductive bias of locality further? The answer is yes. The PiT proposed using the “pixel set” as input with different position embedding (PE) strategies: sin-cos, learnt, and none. It showed superior performance over ViT on supervised, self-supervised, and generation tasks. The proposed pipeline is shown in the figure below.
The idea seems simple and straightforward, and the authors claim they are “not introducing a new method” here. But still, the PiT shows great potential. On CIFAR-100 and ImageNet (reduced input size to 28*28) supervised classification tasks, the classification accuracy increased by more than 2% over ViT. See the table below.
Similar improvement was also observed in self-supervised learning tasks and image generation tasks. What’s more, the authors also showed the trend of a performance increase when reducing the patch size from 8*8 to 1*1 (single pixel) as below:
In terms of positional encoding:
As pointed out in this research paper, positional encoding is a prerequisite in transformer-based models for ordering the input token sequence and improving accuracy. However, PiT shows that even after dropping the PE, the drop in model performance is minimal:
Why drop the positional encoding? Not only because dropping it further reduces the locality bias: if we think of the self-attention computation in a distributed manner, dropping the PE also greatly reduces the cross-device communication effort, which we'll discuss in detail in the next section.
Ultra-long Sequences Transformers — a distributed query vector solution
The inductive bias of locality only tells part of the story. If we look closely at the results in the PiT paper, we see that the experiments were limited to images resized to 28*28 due to computational limits. But the real world rarely uses images of such a small size. So the natural question is, even though PiT might be useful and outperform ViT, could it work on natural images of standard resolution, e.g., 224*224?
The paper “Ultra-long sequence distributed transformer,” released in 2023, answers this question. It proposes a solution for scaling the transformer computation of a 50k-long sequence onto 3k GPUs.
The idea is simple: the transformer's bottleneck is the self-attention computation. To keep the attention computation global, the proposed method distributes only the query vectors across different devices while maintaining identical copies of the key and value vectors on all devices.
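As a toy, single-process sketch of the partitioning idea (not the authors' implementation): because each row of the attention output depends only on its own query, every device can compute attention for its local query chunk against a full replica of the keys and values:

import torch

def local_attention(q_chunk, k_full, v_full):
    # each device would hold only its slice of the query sequence,
    # plus a complete copy of the key and value tensors
    scores = q_chunk @ k_full.transpose(-2, -1) / k_full.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v_full  # rows correspond to the local queries

seq_len, dim, n_devices = 1024, 64, 4
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
chunks = [local_attention(q_chunk, k, v) for q_chunk in q.chunk(n_devices, dim=0)]
full_output = torch.cat(chunks, dim=0)  # matches attention over the full sequence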
Positional-encoding-aware double gradient averaging. For architectures with learnable positional encoding parameters, gradient backpropagation depends on how the positional encoding is distributed. The authors therefore proposed a double gradient averaging technique: when gradients are averaged over two different segments of the same sequence, no positional encoding parameters are involved, but when corresponding segments from two different sequences average their gradients, the positional encoding parameters are synced as well.
When we combine these two papers, things become interesting. Reducing the inductive bias not only helps with model performance but also plays a crucial role in distributed computation.
A Patch is More than 16*16 Pixels was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
A Patch is More than 16*16 Pixels
Keras is Back!! First released in 2015 as a high-level Python library for training ML models, Keras grew in popularity thanks to its clean and simple APIs. In contrast to the ML frameworks of the time, with their awkward and clunky APIs, Keras lowered the bar to entry for many ML developers (the author included). But somewhere along the way the use of Keras became virtually synonymous with TensorFlow development, and when developers began to turn to alternative frameworks, the relative popularity of Keras declined. Now, following a "complete rewrite", Keras has returned, and with its shiny new engine and renewed commitment to multi-backend support, it vies to return to its former glory.
In this post we will take a new look at Keras and assess its value offering in the current era of AI/ML development. We will demonstrate through example its ease of use and make note of its shortcomings. Importantly, this post is not intended to be an endorsement for or against the adoption of Keras (or any other framework, library, service, etc.). As usual, the best decision for your project development will depend on a great many details, many of which are beyond the scope of this post.
The recent release of Gemma, Google's family of open source NLP models, with Keras 3 as a core component of its API, gives us an opportunity to evaluate Keras's strengths and could serve as a springboard for its resurgence.
In our view, the most valuable feature offered by Keras 3 is its multi-framework support. This may surprise readers who recall Keras's distinctiveness to be its user experience. Keras 3 advertises itself as "simple", "flexible", and "designed for human beings, not machines", and indeed it owes its early success and meteoric rise in popularity to that user experience. But it is now 2024, and there are many high-level deep learning APIs offering "reduced cognitive load". In our view, the user experience, as good as it may be, is no longer a sufficient motivator to consider Keras over its alternatives. Its multi-framework support is.
Keras 3 supports multiple backends for training and running its models. At the time of this writing, these include JAX, TensorFlow, and PyTorch. The Keras 3 announcement does a pretty good job of explaining the advantages of this feature. We will expand on the documented benefits and add some of our own flavor.
Avoid the difficulty of choosing an AI/ML framework:
Choosing an AI/ML framework is probably one of the most important decisions you will need to make as an ML developer. It is also one of the hardest. There are many considerations that need to factor into this decision. These include user experience, API coverage, programmability, debuggability, the formats and types of input data that are supported, conformance with other components on the development pipeline (e.g., restrictions that may be imposed by the model deployment phase), and, perhaps most importantly, runtime performance. As we have discussed in many of our previous posts (e.g., here), AI/ML model development can be extremely expensive and the overall impact on cost of even the smallest speed-up due to the choice of framework can be dramatic. In fact, in many cases it may warrant the overhead of porting your model and code to a different framework and/or even maintaining support for multiple frameworks.
The problem is that it is extremely difficult, if not impossible, to know which framework will be most optimal for your model before you start your development. Moreover, even once you have committed to one framework, you will want to stay on top of the evolution and development of all frameworks and to continuously assess potential opportunities to improve your model and/or reduce the cost of development. The landscape of AI/ML development is extremely dynamic with optimizations and enhancements being designed and developed on a consistent basis. You will not want to fall behind.
Keras 3 solves the framework selection problem by enabling you to develop your model without committing to an underlying backend. The option to toggle between multiple framework-backends allows you to focus on the model definition and, once complete, choose the backend that best suits your needs. And even as the properties of the ML project change or the supported frameworks evolve, Keras 3 enables you to easily assess the impact of changing the backend.
Putting it colloquially, you could say that Keras 3 helps humans avoid one of the things they hate doing most — making decisions and committing to them. But humor aside, AI/ML model development using Keras 3 can certainly prevent you from choosing and being stuck with a suboptimal framework.
Enjoy the best of all worlds:
PyTorch, TensorFlow, and JAX, each have their own unique advantages and differentiating properties. JAX, for example, supports just-in-time (JIT) compilation in which the model operators are converted into an intermediate computation graph and then compiled together into machine code specifically targeted for the underlying hardware. For many models this results in a considerable boost in runtime performance. On the other hand, PyTorch, which is typically used in a manner in which the operators are executed immediately (a.k.a. “eagerly”) is often considered to: have the most Pythonic interface, be the easiest to debug, and offer the best overall user experience. By using Keras 3 you can enjoy the best of both worlds. You can set the backend to PyTorch during your initial model development and for debugging and switch to JAX for optimal performance when training in production mode.
Compatibility with the maximum number of AI accelerators and runtime environments:
As we have discussed in the past (e.g., here), our goal is to be compatible with as many AI accelerators and runtime environments as possible. This is especially important in an era of constrained AI-machine capacity, in which the ability to switch between different machine types is a huge advantage. When you develop with Keras 3 and its multi-backend support, you automatically increase the number of platforms on which you can potentially train and run your model. For example, while you may be most accustomed to running in PyTorch on GPUs, by simply changing the backend to JAX you can configure your model to run on Google Cloud TPUs as well (though this may depend on the details of the model).
Increase model adoption:
If you are targeting your model for use by other AI/ML teams, you will increase your potential audience by supporting multiple frameworks. For all sorts of reasons, some teams may be limited to a specific ML framework. By delivering your model in Keras you remove barriers for adoption. A great example of this is the recent release of Google’s Gemma models which we will discuss in greater detail below.
Decouple the data input pipeline from the model execution:
Some frameworks encourage the use of certain data storage formats and/or data loading practices. A classic example is TensorFlow's TFRecord format for storing a sequence of binary records, typically kept in .tfrecord files. While TensorFlow includes native support for parsing and processing data stored in TFRecord files, you might find feeding them into a PyTorch training loop to be a bit more difficult. A preferable format for PyTorch training could be WebDataset. But creating training data can be a long process, and maintaining it in more than one format could be prohibitively expensive. Thus, the manner in which your training data is stored and maintained might discourage teams from considering alternative frameworks.
Keras 3 helps teams overcome this obstacle by completely decoupling the data input pipeline from the training loop. You can define your input data pipelines in PyTorch, TensorFlow, Numpy, Keras, and other libraries without any consideration for the backend that will be used in your training loop. With Keras 3, having your training data stored in TFRecord files is no longer a barrier to adopting PyTorch as a backend.
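For example, here is a minimal sketch (with toy in-memory data standing in for a real TFRecord pipeline) of a tf.data input pipeline feeding a Keras model that runs on the PyTorch backend:

import os
os.environ["KERAS_BACKEND"] = "torch"  # the training loop runs on PyTorch

import numpy as np
import tensorflow as tf  # used only for the input pipeline
import keras

features = np.random.rand(256, 32).astype("float32")
labels = np.random.randint(0, 10, size=(256,))
ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

model = keras.Sequential([keras.layers.Dense(10)])
model.compile(optimizer="sgd",
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(ds, epochs=1)  # Keras 3 consumes the tf.data.Dataset directly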
As with any other new SW solution on the market, it is important to be aware of the potential downsides of Keras 3. A general rule of thumb in SW development is that the higher up the SW stack you go, the less control you have over the behavior and performance of your application. In AI/ML, where the degree of success is often determined by precise tuning of model hyperparameters, initialization settings, appropriate environment configuration, etc., such control could be critical. Here are just a few potential drawbacks to consider:
Potential drop in runtime performance:
Working with the high-level Keras APIs rather than directly with the framework APIs may limit your ability to optimize runtime performance. In our series of posts on analyzing and optimizing the performance of PyTorch models, we demonstrated a wide range of tools and techniques for increasing the speed of training. Sometimes these require the direct, unmediated use of PyTorch's APIs. For example, Keras's APIs currently include very limited support for PyTorch's JIT compilation option (via the jit_compile setting). Another example is PyTorch's built-in support for scaled dot-product attention, which is not exposed at the Keras level (as of the time of this writing).
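For reference, a minimal sketch of how compilation is requested at the Keras level; whether and how this maps to torch.compile or XLA under the hood depends on the backend and Keras version:

import keras

model = keras.Sequential([keras.layers.Dense(10)])
model.compile(
    optimizer="sgd",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,  # request JIT compilation; support varies per backend
)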
Limitations of cross-framework support:
Although Keras's cross-framework support is extensive, you may find that it is not all-encompassing. One gap in coverage (as of the time of this writing) is distributed training. Although Keras introduces the Keras distribution API to support data and model parallelism across all backends, it is currently implemented for the JAX backend only. To run distributed training with other backends, you will need to fall back to the standard distribution APIs of the relevant framework (e.g., PyTorch's distributed data parallel API).
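As a rough sketch of the Keras distribution API on the JAX backend, based on the documented data-parallel strategy (treat the exact calls as assumptions to verify against your Keras version):

import os
os.environ["KERAS_BACKEND"] = "jax"  # the distribution API currently targets JAX

import keras

# replicate the model across all local accelerators and shard each batch
data_parallel = keras.distribution.DataParallel()
keras.distribution.set_distribution(data_parallel)

# models built from this point on follow the distribution setting
model = keras.Sequential([keras.layers.Dense(10)])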
Overhead of maintaining cross-framework compatibility:
Keras 3 supports a wide variety of pre-built models that you can reuse (e.g., here). However, inevitably, you may want to introduce your own customizations. While Keras 3 supports customization of the model layers, metrics, training loop and more, you will need to take care not to break your cross-framework compatibility. For example, if you create a custom layer using Keras’s backend-agnostic APIs (keras.ops), you can rest assured that multi-backend support is retained. However, sometimes you may choose to rely on framework-specific operations. In such cases maintaining cross-framework compatibility will require a dedicated implementation for each framework and appropriate conditional programming based on the backend in use. The current methods for customizing a training step and a training loop are framework-specific, meaning that they too would require dedicated implementations for each backend to retain cross-framework compatibility. Thus, as your model grows in complexity, so might the overhead required to maintain this unique capability.
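For example, a custom layer might look something like the following sketch: the default path uses only the backend-agnostic keras.ops, while a (purely illustrative) PyTorch-specific branch is gated on the active backend:

import keras
from keras import ops

class FastGelu(keras.layers.Layer):
    def call(self, x):
        if keras.config.backend() == "torch":
            # hypothetical spot for a torch-only optimization;
            # torch tensors flow through call() under the torch backend
            import torch.nn.functional as F
            return F.gelu(x, approximate="tanh")
        # backend-agnostic path keeps multi-framework support intact
        return ops.gelu(x, approximate=True)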
We have noted just a few potential disadvantages of Keras 3 and its multi-backend support; you may very well come across others. While the multi-framework offering is certainly compelling, its adoption is not necessarily free of cost. Borrowing the name of a well-known theorem in the field of statistical inference, one could say that when it comes to choosing an AI/ML development methodology, there are "no free lunches".
As in many of our recent posts, the toy model we will define will be a Vision Transformer (ViT) backed classification model. We will rely on the reference implementation located in this Keras tutorial. We have configured our model according to the ViT-Base architecture (~86 million parameters), set the mixed_precision policy to use bfloat16, and defined a PyTorch dataloader with random input data.
The following block includes the configuration settings followed by definitions of the core ViT model components:
import os

# choose backend
backend = 'jax'  # 'torch'
os.environ["KERAS_BACKEND"] = backend

import keras
from keras import layers
from keras import ops

# set mixed precision policy
keras.mixed_precision.set_global_policy('mixed_bfloat16')

# use ViT Base settings
num_classes = 1000
image_size = 224
input_shape = (image_size, image_size, 3)
patch_size = 16  # Size of the patches to be extracted from the input images
num_patches = (image_size // patch_size) ** 2
projection_dim = 768
num_heads = 12
transformer_units = [
    projection_dim * 4,
    projection_dim,
]  # Size of the transformer layers
transformer_layers = 12

# set training hyperparams
batch_size = 128
multi_worker = False  # toggle to use multiple data loader workers
preproc_workers = 0 if backend == 'jax' else 16
# ViT model components:
# ---------------------
def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=keras.activations.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x


class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        input_shape = ops.shape(images)
        batch_size = input_shape[0]
        height = input_shape[1]
        width = input_shape[2]
        channels = input_shape[3]
        num_patches_h = height // self.patch_size
        num_patches_w = width // self.patch_size
        patches = keras.ops.image.extract_patches(images, size=self.patch_size)
        patches = ops.reshape(
            patches,
            (
                batch_size,
                num_patches_h * num_patches_w,
                self.patch_size * self.patch_size * channels,
            ),
        )
        return patches


class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = ops.expand_dims(
            ops.arange(start=0, stop=self.num_patches, step=1), axis=0
        )
        projected_patches = self.projection(patch)
        encoded = projected_patches + self.position_embedding(positions)
        return encoded
Using the core components, we define a ViT-backed Keras model:
# the attention layer we will use in our ViT classifier
attention_layer = layers.MultiHeadAttention
def create_vit_classifier():
    inputs = keras.Input(shape=input_shape)
    # Create patches.
    patches = Patches(patch_size)(inputs)
    # Encode patches.
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Create multiple layers of the Transformer block.
    for _ in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = attention_layer(
            num_heads=num_heads, key_dim=projection_dim//num_heads, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        # MLP.
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])

    # Create a [batch_size, projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.GlobalAveragePooling1D()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Classify outputs.
    logits = layers.Dense(num_classes)(representation)
    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model


# create the ViT model
model = create_vit_classifier()
model.summary()
In the next block we define the optimizer, loss, and dataset.
model.compile(
    optimizer=keras.optimizers.SGD(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)


def get_data_loader(batch_size):
    import torch
    from torch.utils.data import Dataset, DataLoader

    # create dataset of random image and label data
    class FakeDataset(Dataset):
        def __len__(self):
            return 1000000

        def __getitem__(self, index):
            rand_image = torch.randn([224, 224, 3], dtype=torch.float32)
            label = torch.tensor(data=[index % 1000], dtype=torch.int64)
            return rand_image, label

    ds = FakeDataset()

    dl = DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=preproc_workers if multi_worker else 0,
        pin_memory=True
    )
    return dl


dl = get_data_loader(batch_size)
Finally, we start the training using Keras’s Model.fit() function:
model.fit(
    dl,
    batch_size=batch_size,
    epochs=1
)
We ran the script above on a Google Cloud Platform (GCP) g2-standard-16 VM (with a single NVIDIA L4 GPU) with a dedicated deep learning VM image (common-cu121-v20240514-ubuntu-2204-py310) and installations of PyTorch (2.3.0), JAX (0.4.28), Keras (3.3.3), and KerasCV (0.9.0). Please see the official Keras documentation for full installation instructions. Note that we manually modified the format of step time reported by the Keras progress bar:
formatted += f" {time_per_unit:.3f}s/{unit_name}"
Using the backend flag, we were able to easily toggle between the backends supported by Keras and compare the runtime performance of each. For example, when configuring the PyTorch dataloader with 0 workers, we found the JAX backend to outperform PyTorch by ~24%. When setting the number of workers to 16, this gap drops to ~12%.
We now define a custom attention layer that replaces Keras’s default attention computation with PyTorch’s flash attention implementation. Note that this will only work when the backend is set to torch.
class MyAttention(layers.MultiHeadAttention):
    def _compute_attention(
        self, query, key, value, attention_mask=None, training=None
    ):
        from torch.nn.functional import scaled_dot_product_attention
        query = ops.multiply(
            query, ops.cast(self._inverse_sqrt_key_dim, query.dtype))
        return scaled_dot_product_attention(
            query.transpose(1, 2),
            key.transpose(1, 2),
            value.transpose(1, 2),
            dropout_p=self._dropout if training else 0.
        ).transpose(1, 2), None


attention_layer = MyAttention
The results of our experiments are summarized in the table below. Keep in mind that the relative performance results are likely to vary greatly based on the details of the model and the runtime environment.
When using our custom attention layer, the gap between the JAX and PyTorch backends virtually disappears. This highlights how the use of a multi-backend solution could come at the expense of optimizations uniquely supported by any of the individual frameworks (in our example, PyTorch SDPA).
Gemma is a family of lightweight, open source models recently released by Google. Keras 3 plays a prominent role in the Gemma release (e.g., see here), and its multi-framework support makes Gemma automatically accessible to AI/ML developers of all persuasions, whether they work in PyTorch, TensorFlow, or JAX. Please see the official documentation in KerasNLP for more details on the Gemma API offering.
The following code is loosely based on the official Gemma fine-tuning tutorial. In order to run the script, please follow the necessary setup instructions.
import os

backend = 'jax'  # 'torch'
os.environ["KERAS_BACKEND"] = backend

num_batches = 1000
batch_size = 4 if backend == 'jax' else 2

# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"
os.environ["KAGGLE_USERNAME"] = "chaimrand"
os.environ["KAGGLE_KEY"] = "29abebb28f899a81ca48bec1fb97faf1"

import keras
import keras_nlp

keras.mixed_precision.set_global_policy('mixed_bfloat16')

import json

data = []
with open("databricks-dolly-15k.jsonl") as file:
    for line in file:
        features = json.loads(line)
        # Filter out examples with context, to keep it simple.
        if features["context"]:
            continue
        # Format the entire example as a single string.
        template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
        data.append(template.format(**features))

# Only use 1000 training batches, to keep it fast.
data = data[:num_batches * batch_size]

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

# Limit the input sequence length to 512 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 512

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.SGD(learning_rate=5e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

gemma_lm.fit(data, epochs=1, batch_size=batch_size)
When running the script in the same GCP environment described above, we see a significant (and surprising) discrepancy between the runtime performance when using the JAX backend (6.87 samples per second) and the runtime performance when using the PyTorch backend (3.01 samples per second). This is due, in part, to the fact that the JAX backend allows for doubling the training batch size. A deep dive into the causes of this discrepancy is beyond the scope of this post.
As in our previous example, we demonstrate one way of optimizing the PyTorch runtime by prepending the following configuration of the matrix multiplication operations to the top of our script:
import torch
torch.set_float32_matmul_precision('high')
This simple change results in a 29% performance boost when running with the PyTorch backend. Once again, we can see the impact of applying framework-specific optimizations. The experiment results are summarized in the table below.
Our demonstrations have indicated that sticking with the backend agnostic Keras code could imply a meaningful runtime performance penalty. In each example, we have seen how a simple, framework-specific optimization had a significant impact on the relative performance of our chosen backends. At the same time, the arguments we have discussed for multi-framework AI/ML development are rather compelling.
If you do choose to adopt Keras as a development framework, you may want to consider designing your code in a manner that includes mechanisms for applying and assessing framework-specific optimizations. You might also consider designing your development process in a way that utilizes Keras during the early stages of the project and, as the project matures, optimizes for the one backend that is revealed to be the most appropriate.
In this post we have explored the new and revised Keras 3 release. No longer an appendage to TensorFlow, Keras 3 offers the ability to develop AI/ML models in a framework-agnostic manner. As we discussed, this capability has several significant advantages. However, as is often the case in the field of AI development, "there are no free lunches": the added level of abstraction can mean a reduced level of control over the inner workings of our code, which can imply slower training and higher costs. The best solution might be one that combines the use of Keras and its multi-framework support with dedicated mechanisms for incorporating framework-specific modifications.
Importantly, the applicability of Keras 3 to your project, and the cost-benefit analysis of the investment required, will depend greatly on a wide variety of factors, including the target audience, the model deployment process, project timelines, and more. Please view this post as a mere introduction to your own detailed exploration.
Multi-Framework AI/ML Development with Keras 3 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Multi-Framework AI/ML Development with Keras 3