Tag: AI

  • The Math Behind Convolutional Neural Networks

    Cristian Leo

    Dive into CNN, the backbone of Computer Vision, understand its mathematics, implement it from scratch, and explore its applications

    Image by DALL-E

    Index

    · 1: Introduction

    · 2: The Math Behind CNN Architecture
    2.1: Convolutional Layers
    2.2: Stride
    2.3: Padding
    2.4: Multiple Filters and Depth
    2.5: Weight Sharing
    2.6: Feature Map Creation
    2.7: Pooling Layers
    2.8: Fully Connected Layers

    · 3: Step-by-Step Guide to Building a CNN
    3.1: Setting Up the Environment
    3.2: Preparing the Data
    3.3: Designing the CNN Model
    3.4: Compiling the Model
    3.5: Training the CNN

    · 4: Improving Model Performance
    4.1: Data Augmentation
    4.2: Dropout
    4.3: Batch Normalization
    4.4: Transfer Learning

    · 5: Conclusion

    · Additional Resources

    1: Introduction

    Convolutional Neural Networks, or CNNs for short, are a big deal when it comes to working with images, like in photo recognition or sorting. They’re super good at picking up on the patterns and details in pictures automatically, which is why they’re a go-to for any project that deals with a bunch of images.

    The cool thing about CNNs is that they don’t just mash all the image data into one big pile. Instead, they keep the layout of the image intact, which means they’re great at noticing the specific patterns and where they’re located. This approach is a game-changer because it lets CNNs handle the tricky parts of working with images much more smoothly.

    One of the secret sauces of CNNs is something called convolutional layers. These layers move across the image and are able to spot different visual features, like lines, textures, and shapes. This beats the old-school way where people had to manually pick out these features, which was slow and often a bottleneck for getting things done. By having the network figure out these features on its own, CNNs not only get more accurate, they’re also simpler and can be used for a wider range of image-related tasks without much hassle.

    2: The Math Behind CNN Architecture

    CNN Image by Keiron O’Shea in “An Introduction to Convolutional Neural Networks”

    The architecture of Convolutional Neural Networks (CNNs) is designed to mimic the way the human visual system processes images, making them especially powerful for tasks involving visual recognition and classification.

    CNNs are composed of several types of layers, each serving a specific function in the image recognition process. The main layers include convolutional layers, activation functions, pooling layers, and fully connected layers. Together, these layers allow CNNs to detect features, reduce complexity, and make predictions.

    2.1: Convolutional Layers

    Convolutional layers are the cornerstone of Convolutional Neural Networks (CNNs), designed to automatically and efficiently extract spatial features like edges, textures, and shapes from images. Let’s dive deep into how convolutional layers work, including the underlying math.

    The Convolution Operation

    Convolution Operation — Image by Author

    At its core, the convolution operation involves sliding a filter (or kernel) over the input image and computing the dot product of the filter values and the original pixel values at each position. The filter is a small matrix of weights, typically of size 3×3 or 5×5, which is trained to detect specific features in the image.

    Mathematically, the convolution operation can be expressed as:

    Convolution Operation Formula — Image by Author
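    In plain text, and following the variable definitions below, the formula shown in the image is commonly written in the cross-correlation form used by most deep learning libraries:

    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)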

    Where:

    • S(i,j) is the output feature map.
    • I is the input image.
    • K is the kernel or filter.
    • i,j are the coordinates on the feature map.
    • m,n are the coordinates in the kernel.
    • ∗ denotes the convolution operation.

    This equation tells us that each element S(i,j) of the output feature map is the sum of the element-wise product of the kernel K and the portion of the input image I over which the kernel is currently positioned.

    Now, consider a matrix of pixel values that will serve as the input image. If it is a grayscale image (as in the image above), the matrix has a single channel; for color images, there are typically three channels (RGB), but the operation is often performed separately on each channel.

    The convolution operation applies a kernel (filter) to this matrix. The kernel is another, smaller matrix with predefined dimensions (e.g., 3×3). The values in this matrix are the weights, which are learned during the training process. The kernel is designed to detect specific types of features, such as edges, textures, or patterns, in the input image. The kernel then strides (we will cover this operation in a moment) over the entire input image, performing an element-wise multiplication followed by a sum at each position.

    From the convolution operation, we will get the output feature map. It’s a new matrix where each element represents the presence and intensity of a feature detected by the kernel at a specific location in the input image.
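    To make the operation concrete, here is a minimal NumPy sketch of a single-channel convolution with no padding and a stride of 1; the input matrix and the vertical-edge kernel are illustrative values, not taken from the figures above:

    import numpy as np

    def convolve2d(image, kernel):
        """Slide the kernel over the image and compute the dot product at each position."""
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        feature_map = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + kh, j:j + kw]           # region currently covered by the kernel
                feature_map[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
        return feature_map

    image = np.array([[0, 0, 0, 1],
                      [0, 0, 0, 1],
                      [0, 0, 0, 1],
                      [0, 0, 0, 1]], dtype=float)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=float)  # a simple vertical-edge detector

    print(convolve2d(image, kernel))
    # [[ 0. -3.]
    #  [ 0. -3.]] -- the kernel responds only where the dark-to-bright edge falls under it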

    2.2: Stride

    Stride on Input Image — Animation by Author

    Stride is a crucial concept in the architecture of CNNs, particularly within the convolutional layers. It fundamentally influences how the kernel traverses the input image or feature map.

    The stride specifies the number of pixels by which we move the filter across the input image or feature map in each step. It is applied both horizontally and vertically. A stride of 1 means the filter moves one pixel at a time, ensuring a detailed and dense scanning of the input. Larger strides result in the filter skipping pixels, leading to broader and less dense coverage.

    The stride plays a direct role in determining the dimensions of the output feature map:

    • With a Stride of 1: The filter moves across every pixel, often resulting in an output feature map that is relatively large or similar in size to the input, depending on padding, which we will talk about in the next section.
    • With a Larger Stride: The filter skips over pixels, which means it covers the input in fewer steps. This leads to a smaller output feature map since each step covers a larger area of the input with less overlap between positions where the filter is applied.

    Mathematical Representation
    The size of the output feature map (W_out, H_out) can be calculated from the input size (W_in, H_in), filter size (F), stride (S), and padding (P) using the formula:

    W_out = (W_in − F + 2P) / S + 1
    H_out = (H_in − F + 2P) / S + 1

    where:

    • W_out​ and H_out​ are the width and height of the output feature map, respectively.
    • W_in​ and H_in​ are the width and height of the input, respectively.
    • F is the size of the filter.
    • S is the stride.
    • P is the padding.

    A larger stride increases the field of view of each application of the filter, allowing the network to capture more global features of the input with fewer parameters.

    Using a larger stride reduces the computational load and memory usage since it decreases the size of the output feature map and, consequently, the number of operations required for convolution.

    A trade-off exists between spatial resolution and coverage. A smaller stride preserves spatial resolution and is better for detecting fine-grained features, while a larger stride offers broader coverage of the input at the expense of detail.

    2.3: Padding

    Padding plays a critical role in shaping the network’s architecture by influencing the spatial dimensions of the output feature maps.
    It involves adding layers of zeros (or other values, but zeros are most common) around the border of the input image or feature map before applying the convolution operation. This technique can be applied for various reasons, the most prominent being to control the size of the output feature maps and to allow the convolutional filters to have access to the edge pixels of the input.

    Therefore, our input image will now look like this:

    Padded image with Stride Filter — Animation by Author

    Notice how the 8×8 matrix from the earlier animation is now a 10×10 matrix, since we added a border of zeros around it.

    Without padding, each convolution operation reduces the size of the feature map. Padding allows us to apply filters to the input without shrinking its spatial dimensions, preserving more information, especially for deeper networks where many convolutional layers are applied sequentially.

    By padding the input, filters can properly process the edge pixels of the image, ensuring that features located at the borders are adequately captured and utilized in the network’s learning process.

    There are two main types of padding:

    Valid Padding (No Padding)
    In this case, no padding is applied to the input. The convolution operation is performed only where the filter fully fits within the bounds of the input. This usually results in a reduction of the output feature map size.

    Same Padding
    With same padding, enough zeros are added to the edges of the input to ensure that the output feature map has the same dimensions as the input (when the stride is 1). This is particularly useful for designing networks where the input and output sizes need to be consistent.

    The effect of padding on the output feature map size can be captured by adjusting the formula used to calculate the dimensions of the output feature map:

    Adjustment of Feature Map Formula with Padding — Image by Author

    where:

    • W_out​ and H_out​ are the width and height of the output feature map, respectively.
    • W_in​ and H_in​ are the width and height of the input, respectively.
    • F is the size of the filter/kernel.
    • S is the stride.
    • P is the amount of padding added to each side of the input.

    While padding helps in maintaining the spatial dimensions of the input through the layers, excessive padding might lead to computational inefficiency and an increase in the model’s complexity by adding more non-informative inputs (zeros) into the computation.

    The choice between valid and same padding typically depends on the specific requirements of the application, such as the importance of preserving the spatial dimensions of the input or the need to minimize computational overhead.
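    As a quick sanity check of the output-size formula, here is a small helper with illustrative numbers (the 28×28 input and 3×3 filter are assumptions for the example, not values from the article):

    def conv_output_size(w_in, f, s=1, p=0):
        """Width/height of the output feature map: (W_in - F + 2P) / S + 1, rounded down."""
        return (w_in - f + 2 * p) // s + 1

    # Valid padding: a 3x3 filter with stride 1 shrinks a 28x28 input to 26x26
    print(conv_output_size(28, f=3, s=1, p=0))  # 26

    # Same padding: P=1 keeps the 28x28 size when the stride is 1
    print(conv_output_size(28, f=3, s=1, p=1))  # 28

    # A larger stride reduces the map further: stride 2 with P=1 gives 14x14
    print(conv_output_size(28, f=3, s=2, p=1))  # 14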

    2.4: Multiple Filters and Depth

    CNNs employ multiple filters at each convolutional layer to capture a wide array of features from the input image or feature map. This multiplicity and the resultant depth are central to the network’s ability to process visual information in a comprehensive and nuanced manner.

    Each filter in a convolutional layer is designed to detect different features or patterns in the input, such as edges, colors, textures, or more complex shapes in deeper layers. By using multiple filters, a CNN can simultaneously look for various features at each layer, enriching the representation of the input data.

    The output of a convolutional layer with multiple filters is a stack of feature maps, one for each filter. This stack forms a three-dimensional volume where the depth corresponds to the number of filters used. This depth is crucial for building a hierarchical representation of the data, allowing subsequent layers to detect increasingly abstract features by combining the outputs of previous layers.

    How Multiple Filters Achieve Depth
    As the input image or feature map is processed, each filter slides across it, performing the convolution operation. Despite sharing the same input, each filter applies its unique weights, producing a distinct feature map that highlights different aspects of the input.

    The individual feature maps generated by each filter are stacked along the depth dimension, forming a 3D volume. This volume encapsulates the diverse features detected by the filters, providing a rich, multi-faceted representation of the input.

    The depth of the convolutional layer — determined by the number of filters — enables the network to capture a broad spectrum of features. Early layers might capture basic features like edges and textures, while deeper layers can interpret complex patterns by combining these basic features, thanks to the network’s depth.

    Implications of Depth
    More filters mean a deeper network with a higher capacity to learn complex features. However, this also increases the network’s computational complexity and the amount of training data needed to learn effectively.

    Each filter adds parameters to the model (the weights that define the filter). While more filters increase the network’s expressive power, they also raise the total number of parameters, which can impact training efficiency and the risk of overfitting.

    The allocation of filters across layers is strategic. Layers closer to the input might have fewer, more general filters, while deeper layers may use more filters to capture the complexity and variability of higher-order features within the data.

    2.5: Weight Sharing

    Weight sharing significantly enhances CNNs’ efficiency and effectiveness, especially in processing visual information. This concept is pivotal in allowing the model to detect features regardless of their spatial location in the input image.

    In the context of CNNs, weight sharing refers to using the same filter (and thus the same set of weights) across the entire input image or feature map. Instead of learning a unique set of weights for every possible location, a single filter scans the entire image, applying the same weights at each position. This operation is repeated for each filter in the convolutional layer.

    By reusing the same set of weights across different parts of the input image, weight sharing dramatically reduces the number of parameters in the model. This makes CNNs much more parameter-efficient compared to fully connected networks, especially when dealing with large input sizes.

    Weight sharing enables the network to detect features regardless of their position in the input image. If a filter learns to recognize an edge or a specific pattern, it can detect this feature anywhere in the image, making CNNs inherently translation invariant.

    With fewer parameters to learn, CNNs are less likely to overfit the training data. This improves the model’s ability to generalize from the training data to unseen data, enhancing its performance on real-world tasks.

    How Weight Sharing Works
    During the forward pass, a filter with a fixed set of weights slides over the input image, computing the dot product between the filter weights and the local regions of the image. This process generates a feature map that indicates the presence and intensity of the detected feature across the spatial extent of the image.

    Despite the extensive reuse of weights across the spatial domain, each weight is updated based on the aggregate gradient from all positions where it was applied. This ensures that the filter weights are optimized to detect features that are most relevant for the task, based on the entire dataset.
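    To see how much weight sharing saves, we can compare the parameter count of a convolutional layer with that of a fully connected layer producing an output of comparable size; the layer sizes below are illustrative (they mirror the MNIST-style setup used later in the article), not a benchmark:

    import torch.nn as nn

    # A 3x3 convolution mapping 1 input channel to 32 feature maps:
    # the same 3*3*1 weights (plus a bias) are reused at every spatial position.
    conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
    print(sum(p.numel() for p in conv.parameters()))  # 320 = 32 * (3*3*1) weights + 32 biases

    # A fully connected layer producing a comparable 32 x 28 x 28 output from a
    # 28x28 input needs a separate weight for every input-output pair.
    fc = nn.Linear(28 * 28, 32 * 28 * 28)
    print(sum(p.numel() for p in fc.parameters()))  # 19,694,080 -- several orders of magnitude more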

    2.6: Feature Map Creation

    As we saw previously, a feature map is an output generated by applying a filter or kernel to the input image or a preceding feature map within a CNN. It represents the responses of the filter across the spatial dimensions of the input, highlighting where and how specific features are detected in the image. Let’s now recap how each element in the CNN affects the resulting feature map.

    At the core of feature map creation is the convolution operation, where a filter with learned weights slides (or convolves) across the input image or feature map from a previous layer. At each position, the filter performs an element-wise multiplication with the part of the image it covers, and the results are summed up to produce a single output pixel in the new feature map.

    The weights in the filter determine the type of feature it detects, such as edges, textures, or more complex patterns in deeper layers. During training, these weights are adjusted through backpropagation, allowing the network to learn which features are most important for the task at hand.

    The size of the stride and the use of padding directly affect the spatial dimensions of the feature map. A larger stride results in broader coverage with less overlap between filter applications, reducing the feature map size. Padding can be used to preserve the spatial dimensions of the input, ensuring that features at the edges of the image are not lost.

    A convolutional layer typically contains multiple filters, each designed to detect different features. The output for each filter is a separate feature map, and these are stacked along the depth dimension to create a 3D volume. This multi-faceted approach allows the network to capture a rich representation of the input image.

    After a feature map is created through the convolution operation, it is often passed through an activation function, such as ReLU. This introduces non-linearity, enabling the network to learn and represent more complex patterns.

    If you want to learn more about ReLU and other activation functions, take a look at this article:

    The Math Behind Neural Networks

    The activated feature map then proceeds to the next layer or a pooling operation.

    2.7: Pooling Layers

    Pooling layers serve to reduce the spatial dimensions of the feature maps. This reduction is crucial for decreasing the computational load, minimizing overfitting, and retaining only the most essential information. Let’s delve into the specifics of pooling layers, their types, and their impact on CNN performance.

    Pooling layers reduce the size of the feature maps, thereby decreasing the number of parameters and computations required in the network. This simplification helps to focus on the most important features.

    By summarizing the presence of features in patches of the feature map, pooling helps the network to maintain robustness to minor variations and translations in the input image.

    There are a few types of pooling techniques you should know about when playing with CNNs:

    Max Pooling
    This is the most common form of pooling, where the maximum value from a set of values in the feature map is selected and forwarded to the next layer. Max pooling effectively captures the most pronounced feature in each patch of the feature map.

    Denoting the feature map by F and the pooling operation by P_max, the result of max pooling at position (i,j) for a window size of n×n can be expressed as:

    Max Pooling Formula — Image by Author

    Here, s is the stride of the pooling window, and a, b iterate over the window dimensions. This operation is applied independently for each window position across the feature map.

    Average Pooling
    Unlike max pooling, average pooling takes the average of the values in each patch of the feature map. This method provides a more generalized feature representation but might dilute the presence of smaller, yet significant features.

    For a feature map F and an n×n pooling window, the average pooling operation at position (i,j) can be mathematically represented as:

    Average Pooling Formula — Image by Author

    Similar to max pooling, s represents the stride, and a,b iterate over the window, but here the operation computes the mean of the values within each window.

    Global Pooling
    In global pooling, the entire feature map is reduced to a single value by taking the max (global max pooling) or average (global average pooling) of all values in the feature map. This approach is often used to reduce each feature map to a single value before a fully connected layer.

    For a feature map F of size M×N, global max pooling (P_gmax​) and global average pooling (P_gavg​) can be defined as:

    Global Pooling Formula (Top), Global Average Pooling Formula (Bottom) — Image by Author

    Global pooling operations compress the entire feature map into a single summary statistic, which is particularly useful for reducing model parameters before a fully connected layer for classification.

    How Pooling Works
    A pooling layer operates over each feature map independently, sliding a window (or filter) across the feature map and summarizing the values within that window into a single value (based on the pooling strategy used). This process reduces the spatial dimensions of the feature map.

    The size of the window and the stride (how far the window moves each time) determine how much the feature map is reduced. A common choice is a 2×2 window with a stride of 2, which reduces the size of the feature map by half.
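    The following short PyTorch sketch (using a made-up 4×4 feature map) shows how a 2×2 window with a stride of 2 halves each spatial dimension under both max and average pooling:

    import torch
    import torch.nn.functional as F

    # One feature map of size 4x4, shaped as (batch, channels, height, width)
    fmap = torch.tensor([[[[1., 3., 2., 0.],
                           [5., 6., 1., 2.],
                           [0., 2., 4., 4.],
                           [1., 1., 3., 8.]]]])

    print(F.max_pool2d(fmap, kernel_size=2, stride=2))
    # tensor([[[[6., 2.],
    #           [2., 8.]]]]) -- the strongest activation in each 2x2 patch

    print(F.avg_pool2d(fmap, kernel_size=2, stride=2))
    # tensor([[[[3.7500, 1.2500],
    #           [1.0000, 4.7500]]]]) -- the mean of each 2x2 patch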

    2.8: Fully Connected Layers

    Fully Connected Layer Graph — Image by Author

    Fully connected layers are often positioned towards the end of CNNs. These layers are where the high-level reasoning based on the learned features takes place, ultimately leading to classification or prediction.

    In a fully connected layer, every neuron is connected to every activation from the previous layer. This dense connectivity ensures that the layer has the complete context of the extracted features, allowing it to learn complex patterns that are distributed across the feature map.

    Fully connected layers integrate the spatially distributed features identified by convolutional and pooling layers into a global representation of the input. This integration is crucial for tasks that require an understanding of the entire input, such as classification.

    From Convolutional to Fully Connected Layers
    Before entering a fully connected layer, the output from the preceding convolutional or pooling layers, typically a multi-dimensional feature map, is flattened into a single vector. This step transforms the spatially structured data into a format suitable for processing by fully connected layers.

    The neurons in fully connected layers can learn high-level patterns in the data by considering the global information presented by the flattened feature map. This ability is fundamental to making predictions or classifications based on the entire input image.

    Role in CNNs
    In many CNN architectures, the final fully connected layer serves as the classification layer, where each neuron represents a specific class. The network’s prediction is determined by the activation of these neurons, typically through a softmax function that converts the activations into probabilities.

    Fully connected layers synthesize the localized, abstract features extracted by the convolutional layers into a cohesive understanding of the input data. This synthesis is essential for the network to reason about the input as a whole and make informed decisions.

    3: Step-by-Step Guide to Building a CNN

    Let’s get down to business and build our CNN. We will set up, train, and evaluate a Convolutional Neural Network (CNN) using PyTorch for image classification on the MNIST dataset, a large open-source database of handwritten digits. [The MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license]

    Feel free to have this Jupyter Notebook on the side, which contains all the code we will cover today:

    GitHub – cristianleoo/models-from-scratch-python: Repo where I recreate some popular machine learning models from scratch in Python

    3.1: Setting Up the Environment

    Let’s start with the necessary libraries and modules. PyTorch (torch), its neural network module (nn), and optimization module (optim) are imported for constructing and training the neural network. Functionalities from torch.nn.functional are used for operations like ReLU activation and max pooling. DataLoader utilities facilitate batch processing and data management, and torchvision is used for handling datasets and image transformations.

    import numpy as np
    import matplotlib.pyplot as plt
    import torch
    from torch import nn, optim
    from torch.nn import functional as F
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    3.2: Preparing the Data

    The MNIST dataset is loaded with a transformation pipeline that first converts images to tensor format, and then normalizes their pixel values. Normalization parameters (mean=0.1307, std=0.3081) are chosen specifically for the MNIST dataset to standardize its grayscale images for optimal neural network performance.

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    mnist_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

    A sample image from the dataset is displayed using matplotlib, illustrating the type of data the network will be trained on.

    image, label = mnist_dataset[0]
    plt.imshow(image.squeeze().numpy(), cmap='gray')
    plt.title(f'Label: {label}')
    plt.show()

    This code will show the following image:

    First Image in MNIST Dataset — Image by Author

    The dataset is divided into training and validation sets to enable model evaluation during training. DataLoader instances handle batching, shuffling, and preparing the dataset for efficient processing by the neural network.

    from torch.utils.data import random_split  # needed below; not included in the imports above

    train_size = int(0.8 * len(mnist_dataset))
    val_size = len(mnist_dataset) - train_size
    train_dataset, val_dataset = random_split(mnist_dataset, [train_size, val_size])
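
    The DataLoader instances mentioned above are not shown in the snippet; a minimal version could look like this (the batch size of 64 is an assumption for illustration, not a value taken from the article):

    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)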

    3.3: Designing the CNN Model

    Once we have preprocessed the data, we can proceed to model creation. We initialize a MyCNN class, which inherits from nn.Module, PyTorch’s way of defining a model. This inheritance gives MyCNN all the functionalities of a PyTorch model, including the ability to train, make predictions, and more.

    The __init__ function is the constructor of the MyCNN class. It’s where the layers of the neural network are defined. The super(MyCNN, self).__init__() line calls the constructor of the base nn.Module class, which is necessary for PyTorch to initialize everything correctly.

    class MyCNN(nn.Module):
        def __init__(self):
            super(MyCNN, self).__init__()
            self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
            self.fc1 = nn.Linear(7*7*64, 128)
            self.fc2 = nn.Linear(128, 10)

    As you can notice from the code above, the network includes two convolutional layers, conv1 and conv2.

    conv1 takes a single-channel image (like a grayscale image) as input and produces 32 feature maps using a filter (or kernel) size of 3×3, with a stride of 1 and padding of 1. Padding is added to ensure the output feature maps are the same size as the input.

    conv2 takes the 32 feature maps from conv1 as input and produces 64 feature maps, also with a 3×3 kernel, stride of 1, and padding of 1. This layer further extracts features from the input provided by conv1.

    After the convolutional layers, there are two fully connected (fc) layers.

    fc1 is the first fully connected layer that transforms the output from the convolutional layers into a vector of size 128. The input size is 7*7*64, which implies that before reaching this layer, the feature maps are flattened into a single vector and that the dimensionality of the feature maps before flattening is 7×7 with 64 channels. This step is crucial for transitioning from spatial feature extraction to making decisions (classifications) based on those features.

    fc2 is the second fully connected layer, which takes the 128-dimensional vector from fc1 and outputs a 10-dimensional vector. This output size typically corresponds to the number of classes in a classification problem, suggesting this network is designed to classify images into one of 10 categories.

        def _initialize_weights(self):
            for m in self.modules():
                if isinstance(m, nn.Conv2d):
                    nn.init.normal_(m.weight, 0, 0.01)
                    if m.bias is not None:
                        nn.init.constant_(m.bias, 0)
                elif isinstance(m, nn.Linear):
                    nn.init.xavier_uniform_(m.weight)
                    if m.bias is not None:
                        nn.init.constant_(m.bias, 0)

    Weight initialization is applied to ensure the network starts with weights in a range that neither vanishes nor explodes the gradients. Convolutional layers are initialized with normal distribution, while fully connected layers use Xavier uniform initialization.

    To learn more about Xavier initialization and other types of initialization consider diving into my previous article:

    The Math Behind Fine-Tuning Deep Neural Networks

    The forward method within the MyCNN class defines the sequence of operations that input data undergoes as it passes through the CNN.

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(x.size(0), -1)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return x

    Let’s dissect this method step by step, focusing on each operation to understand how input images are transformed into output predictions.

    First Convolutional Layer

    x = F.relu(self.conv1(x))

    The input tensor x, representing the batch of images, is passed through the first convolutional layer (conv1). This layer applies learned filters to the input, capturing basic visual features like edges and textures. The convolution operation is immediately followed by a ReLU activation function. ReLU sets all negative values in the output tensor to zero, enhancing the network’s ability to distinguish features.

    First Pooling Operation

    x = F.max_pool2d(x, 2, 2)

    Following the first convolution and activation, a max pooling operation is applied. This operation reduces the spatial dimensions of the feature map by half (due to the pool size and stride of 2), summarizing the most significant features within 2×2 patches of the feature map. Max pooling helps to make the representation somewhat invariant to small shifts and distortions.

    Second Convolutional Layer

    x = F.relu(self.conv2(x))

    The process repeats with a second convolutional layer (conv2), which applies another set of learned filters to the now-reduced feature map. This layer typically captures more complex features, building upon the basic patterns identified by the first layer. Again, ReLU activation follows to maintain non-linearity in the learning process.

    Second Pooling Operation

    x = F.max_pool2d(x, 2, 2)

    Another max pooling step further reduces the spatial dimensions of the resulting feature map, compacting the feature representation and reducing computational complexity for subsequent layers.

    Flattening

    x = x.view(x.size(0), -1)

    Before transitioning to fully connected layers, the multidimensional feature map must be flattened into a single vector per image in the batch. This operation reshapes the tensor so that each image’s feature map becomes a single row in the tensor, preserving all feature information in a format suitable for fully connected processing.

    First Fully Connected Layer

    x = F.relu(self.fc1(x))

    The flattened tensor is passed through the first fully connected layer (fc1), where neurons can learn complex patterns from the entire feature set. The ReLU function is applied once more to introduce non-linearity, enabling the network to learn and represent more complex functions.

    Second Fully Connected Layer (Output Layer)

    x = self.fc2(x)

    Finally, the tensor passes through a second fully connected layer (fc2), which acts as the output layer. This layer has as many neurons as there are classes to predict (10 for MNIST digits). The output of this layer represents the network’s predictions for each class.
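    To see where the 7*7*64 figure used by fc1 comes from, here is a quick shape trace through the forward pass for a single 28×28 MNIST image (a sanity-check snippet, not part of the original article):

    net = MyCNN()
    x = torch.randn(1, 1, 28, 28)   # one grayscale MNIST-sized image
    x = F.relu(net.conv1(x))        # -> (1, 32, 28, 28): padding=1 preserves the 28x28 size
    x = F.max_pool2d(x, 2, 2)       # -> (1, 32, 14, 14): pooling halves each spatial dimension
    x = F.relu(net.conv2(x))        # -> (1, 64, 14, 14)
    x = F.max_pool2d(x, 2, 2)       # -> (1, 64, 7, 7)
    x = x.view(x.size(0), -1)       # -> (1, 3136), i.e. 7 * 7 * 64 features per image
    print(x.shape)                  # torch.Size([1, 3136])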

    3.4: Compiling the Model

    The model is compiled with CrossEntropyLoss for classification and the Adam optimizer for adjusting weights, along with specific parameters like learning rate and weight decay.

    model = MyCNN()  # instantiate the network defined above (call .to(device) to move it to a GPU if one is available)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5, amsgrad=True, eps=1e-8, betas=(0.9, 0.999))

    The Adam optimizer is a popular algorithm for training deep learning models, combining the best properties of the AdaGrad and RMSProp algorithms to efficiently handle sparse gradients on noisy problems. It adjusts the learning rate on a per-parameter basis, making it highly effective and well-suited for a wide range of tasks and models. If you want to learn more about Adam take a look at my article where I go through its math and build it from scratch:

    The Math behind Adam Optimizer

    3.5: Training the CNN

    The Trainer class provides the logic necessary for training the CNN model, including the forward pass and backward pass (gradient calculation and weight update), monitoring the training and validation losses, implementing early stopping, adjusting the learning rate, and evaluating the model’s performance. Let’s dissect this class to understand its structure and functionality in depth.

    from torch.optim.lr_scheduler import ReduceLROnPlateau  # scheduler used below; not included in the imports above

    class Trainer:
        def __init__(self, model, criterion, optimizer, device, patience=7):
            self.model = model
            self.criterion = criterion
            self.optimizer = optimizer
            self.device = device
            self.early_stopping = EarlyStopping(patience=patience)  # custom helper class (a minimal version is sketched later)
            self.scheduler = ReduceLROnPlateau(self.optimizer, 'min', patience=3, verbose=True, factor=0.5, min_lr=1e-6)
            self.train_losses = []
            self.val_losses = []
            self.gradient_norms = []

    In the initialization method __init__, the Trainer class takes the CNN model, the loss function (criterion), and the optimizer as arguments, alongside the device on which to run the training (CPU or GPU) and the patience for early stopping. An EarlyStopping instance is created to monitor validation loss and halt training if the model ceases to improve, preventing overfitting. A learning rate scheduler (ReduceLROnPlateau) is also initialized to dynamically adjust the learning rate based on the validation loss, helping to find the optimal learning rate during training. Lists to track training and validation losses, as well as gradient norms, are initialized for analysis and debugging purposes.

        def train(self, train_loader, val_loader, epochs):
            for epoch in range(epochs):
                self.model.train()
                for images, labels in train_loader:
                    images, labels = images.to(self.device), labels.to(self.device)
                    self.optimizer.zero_grad()
                    outputs = self.model(images)
                    loss = self.criterion(outputs, labels)
                    self.train_losses.append(loss.item())
                    loss.backward()
                    self.optimizer.step()

    The train method orchestrates the training process over a specified number of epochs. For each epoch, it sets the model to training mode and iterates over the training dataset using the train_loader. Input images and labels are moved to the specified device. The optimizer’s gradients are zeroed before each forward pass to prevent accumulation from previous iterations. The model’s predictions are obtained, and the loss is calculated using the specified criterion. The loss value is appended to the train_losses list for tracking. Backpropagation is performed by calling loss.backward(), and the optimizer updates the model weights with optimizer.step().

                val_loss = self.evaluate(val_loader)
                self.val_losses.append(val_loss)
                self.scheduler.step(val_loss)
                self.early_stopping(val_loss)

    After processing the training data, the model is evaluated on the validation dataset using the evaluate method, which calculates the average validation loss. This loss is used to adjust the learning rate with the scheduler and to determine if early stopping conditions are met. Validation loss is tracked for analysis.

                if self.early_stopping.early_stop:
                    print("Early stopping")
                    break

    If early stopping is triggered, training is halted to prevent overfitting. This decision is based on whether the validation loss has stopped improving over several epochs defined by the patience parameter.

        def evaluate(self, test_loader):
            self.model.eval()
            total_loss = 0
            with torch.no_grad():
                for images, labels in test_loader:
                    images, labels = images.to(self.device), labels.to(self.device)
                    outputs = self.model(images)
                    loss = self.criterion(outputs, labels)
                    total_loss += loss.item()
            return total_loss / len(test_loader)

    The evaluate method calculates the average loss over the validation or test dataset without updating the model’s weights. It sets the model to evaluation mode and disables gradient computations for efficiency.
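    The EarlyStopping helper used by the Trainer is not defined in the snippets above; a minimal implementation consistent with how it is called (early_stopping(val_loss) and the early_stop flag) could look like this:

    class EarlyStopping:
        """Stops training when the validation loss has not improved for `patience` epochs."""
        def __init__(self, patience=7, min_delta=0.0):
            self.patience = patience
            self.min_delta = min_delta
            self.best_loss = float('inf')
            self.counter = 0
            self.early_stop = False

        def __call__(self, val_loss):
            if val_loss < self.best_loss - self.min_delta:
                self.best_loss = val_loss   # improvement: remember it and reset the counter
                self.counter = 0
            else:
                self.counter += 1           # no improvement this epoch
                if self.counter >= self.patience:
                    self.early_stop = True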

    4: Improving Model Performance

    Improving the performance of Convolutional Neural Networks (CNNs) and preventing overfitting are critical challenges in training deep learning models. The code snippet provided does not explicitly detail techniques like data augmentation, dropout, and batch normalization, nor does it delve into transfer learning. However, these strategies are essential for enhancing CNNs, so let’s explore how they can be integrated into the training process and their potential impact on model performance.

    4.1: Data Augmentation

    Data augmentation artificially increases the diversity of the training dataset by applying random transformations (e.g., rotation, flipping, scaling) to the existing images. This diversity helps the model generalize better to new, unseen data by learning from a broader range of input variations.

    To implement data augmentation in PyTorch, you can extend the transforms.Compose used in preparing the dataset:

    transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(10),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    Adding random flips and rotations diversifies the training data, encouraging the model to learn more robust features.

    4.2: Dropout

    Dropout is a regularization technique that randomly sets a fraction of input units to 0 during training, preventing units from co-adapting too much. This randomness forces the network to learn more robust features that are useful in conjunction with various random subsets of the other neurons.

    In PyTorch, dropout can be added to the CNN model by including nn.Dropout layers:

    class MyCNN(nn.Module):
        def __init__(self):
            super(MyCNN, self).__init__()
            # Convolutional layers
            self.fc1 = nn.Linear(7*7*64, 128)
            self.dropout = nn.Dropout(0.5)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            # Convolutional and pooling operations
            x = x.view(x.size(0), -1)
            x = F.relu(self.fc1(x))
            x = self.dropout(x)
            x = self.fc2(x)
            return x

    Adding a dropout layer before the final fully connected layer helps mitigate overfitting by encouraging the model to distribute the learned representation across multiple neurons.

    4.3: Batch Normalization

    Batch normalization standardizes the inputs to a layer for each mini-batch, stabilizing the learning process and significantly reducing the number of training epochs required to train deep networks.

    Batch normalization can be incorporated into the model as follows:

    class MyCNN(nn.Module):
        def __init__(self):
            super(MyCNN, self).__init__()
            # Convolutional layers
            self.conv1_bn = nn.BatchNorm2d(32)
            # Fully connected layers

        def forward(self, x):
            x = F.relu(self.conv1_bn(self.conv1(x)))
            # Continue through model

    Applying batch normalization after convolutional layers but before the activation function helps in normalizing the output, contributing to faster convergence and improved overall performance.

    4.4: Transfer Learning

    Transfer learning involves using a model trained on one task as the starting point for training on a different but related task. This technique is particularly useful when you have a limited dataset for the new task. PyTorch facilitates transfer learning by allowing models pre-trained on large datasets (like ImageNet) to be easily loaded and adapted.

    To leverage a pre-trained model in PyTorch:

    from torchvision import models

    model = models.resnet18(pretrained=True)

    # Freeze all layers first, so that only the new classifier layer will be trained
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer; the parameters of the new layer
    # require gradients by default, so it remains trainable
    num_ftrs = model.fc.in_features
    model.fc = nn.Linear(num_ftrs, 10)  # Assuming 10 classes for the new task

    Here, a pre-trained ResNet-18 model is adapted for a new task with 10 classes by replacing its final layer. Freezing the weights of all layers except the last one allows us to fine-tune only the classifier layer, leveraging the feature extraction capabilities learned from the original dataset.

    Incorporating these strategies into the CNN training process not only combats overfitting but also enhances model performance by ensuring robust feature learning and leveraging knowledge from pre-trained models.

    5: Conclusion

    Wrapping up our deep dive into Convolutional Neural Networks, we’ve covered a lot. From setting up and preparing data to dissecting CNN architecture and its layers, we’ve seen what makes these models tick. We’ve looked into how tweaking things like weight initialization and using techniques like data augmentation and transfer learning can seriously boost a model’s performance. These methods help make our models smarter, avoiding common pitfalls like overfitting and making them more versatile.

    CNNs are pretty much everywhere in AI now, helping with everything from spotting faces in photos to diagnosing diseases from medical images. Their knack for picking up on visual cues makes them super valuable for a whole range of tasks.

    Additional Resources

    1. LeCun et al., “Gradient-Based Learning Applied to Document Recognition”
      This seminal paper by Yann LeCun and colleagues introduces LeNet-5, one of the first convolutional neural networks, and demonstrates its application to document recognition tasks.
      Research Gate Link
    2. Simonyan and Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition” (VGGNet)
      This work introduces VGGNet, highlighting the importance of depth in CNN architectures for improving image recognition performance.
      arXiv Link
    3. He et al., “Deep Residual Learning for Image Recognition” (ResNet)
      ResNet introduces the concept of residual learning, enabling the training of much deeper networks by addressing the vanishing gradient problem.
      arXiv Link

    If you liked this article consider leaving a like, and follow me to be updated on my latest posts. My goal is to recreate all the most popular algorithms from scratch and make machine learning accessible to everyone.



  • Reinforcement Learning: Introduction and Main Concepts

    Reinforcement Learning: Introduction and Main Concepts

    Vyacheslav Efimov

    Making the first step into the world of reinforcement learning

    Introduction

    Reinforcement learning is a special domain in machine learning that differs a lot from the classic methods used in supervised or unsupervised learning.

    The ultimate objective consists of developing a so-called agent that will perform optimal actions in its environment. At the start, the agent usually performs very poorly, but as time goes on, it adapts its strategy through trial and error by interacting with the environment.

    The beauty of reinforcement learning is that the same algorithms can be used to make the agent adapt to absolutely different, unknown, and complex conditions.

    Reinforcement learning has a wide range of applications and is mostly used when a given problem cannot be solved by classic methods:

    • Games. Existing approaches can design optimal game strategies and outperform humans. The most well-known examples are chess and Go.
    • Robotics. Advanced algorithms can be incorporated into robots to help them move, carry objects or complete routine tasks at home.
    • Autopilot. Reinforcement learning methods can be developed to automatically drive cars, control helicopters or drones.
    Some of the reinforcement learning applications

    About this article

    Though reinforcement learning is a very exciting and unique area, it is still one of the most sophisticated topics in machine learning. In addition, it is absolutely critical from the beginning to understand all of its basic terminology and concepts.

    For these reasons, this article introduces only the key theoretical concepts and ideas that will help the reader to further advance in reinforcement learning.

    Additionally, this article is based on the third chapter of the famous book “Reinforcement Learning” written by Richard S. Sutton and Andrew G. Barto, which I would highly recommend to everyone interested in delving deeper.

    Apart from that, this book contains practical exercises. Their solutions can be found in this repository.

    Reinforcement learning framework

    First of all, let us understand the reinforcement learning framework which contains several important terms:

    • Agent represents an object whose goal is to learn a strategy to optimize a certain process;
    • Environment acts as a world in which the agent is located and consists of a set of different states;
    • At each timestamp, the agent can perform an action in the environment that will change the environment’s state to a new one. Additionally, the agent will receive feedback indicating how good or bad the chosen action was. This feedback is called a reward and is represented in numerical form.
    • By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximize the total reward over time.
    Reinforcement learning framework. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    In many cases, given the current state and the agent’s action in that state, the change to a new state can result in different rewards (rather a single one) where each of them corresponds to its own probability.

    The formula below considers this fact by summing up over all possible next states and rewards that correspond to them.

    For a given state and action, the sum of all probabilities of transitioning to any other state s’ with reward r is equal to 1. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
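    Written out explicitly, this is the standard normalization condition over next states s’ and rewards r:

    Σ_s’ Σ_r p(s’, r | s, a) = 1, for every state s and action a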

    To make things more clear, we are going to use the prime symbol to designate a variable in its next step. For example, if s represents the agent’s current state, then s’ will refer to the next agent’s state.

    Reward types

    To formally define the total reward in the long run, we need to introduce the term “cumulative reward” (also called “return”), which can take several forms.

    Simple formulation

    Let us denote Rᵢ as the reward received by the agent at timestamp i, then the cumulative reward can be defined as the sum of rewards received between the next timestamp and the final timestamp T:

    Cumulative reward. Image is taken from the book Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    Discounted cumulative reward

    Most of the time, the discounted cumulative reward is used. It represents the same reward system as before except for the fact that every individual reward in the sum is now multiplied by an exponentially decayed discount coefficient.

    Discounted cumulative reward. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    The γ (also sometimes denoted as α) parameter in the formula above is called the discount rate and can take a value between 0 and 1. The introduction of the discount makes sure that the agent gives more weight to rewards received in the near future than to those received much later. Ultimately, the discounted cumulative reward can be expressed in recursive form:

    Recursive equation for discounted cumulative reward. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
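    As a small illustration of this recursion Gₜ = Rₜ₊₁ + γ·Gₜ₊₁, the return for every timestamp of an episode can be computed by walking backwards through the rewards (the reward sequence and γ = 0.9 below are made-up values):

    def discounted_returns(rewards, gamma=0.9):
        """Compute G_t for every timestamp using G_t = R_{t+1} + gamma * G_{t+1}."""
        returns = [0.0] * len(rewards)
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
            returns[t] = g
        return returns

    print(discounted_returns([1.0, 0.0, 0.0, 10.0]))
    # approximately [8.29, 8.1, 9.0, 10.0] -- earlier timestamps see the final reward discounted more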

    Types of tasks

    Episodic tasks

    In some cases, the interaction between an agent and the environment can include a set of independent episodes. In this scenario, every episode starts independently from others and its beginning state is sampled from the distribution of states.

    For instance, imagine that we want the agent to learn the optimal strategy to play a game. To do that, we will run a set of independent games where in each of them the agent can either win or lose. The rewards received in every episode will gradually influence the strategy that the agent will be using in the following games.

    Episodes are also referred to as trials.

    Continuing tasks

    At the same time, not all types of tasks are episodic: some of them can be continuing, meaning that they do not have a terminal state. In such cases, it is not always possible to define the cumulative return because the number of timestamps is infinite.

    Policies and value functions

    Policy

    Policy π is a mapping from every possible state s ∈ S to the probabilities p of performing each possible action from that state s.

    If an agent follows a policy π, then the agent’s probability p(a | s) of performing the action a from state s is given by the policy: p(a | s) = π(a | s).

    By definition, any policy can be represented in the form of a table of size |S| x |A|.

    Let us look at the example of the maze game below. The agent is initially located at cell A1. During each step, the agent has to move either horizontally or vertically (not diagonally) to an adjacent cell. The game ends when the agent reaches the terminal state located at C1. The cell A3 contains a large reward that the agent can collect if it steps on it. The cells B1 and C3 are maze walls that cannot be reached.

    Maze example with 7 possible states (the cells B1 and C3 are maze walls). The game starts with the agent being put at A1 and ends when the agent reaches C1. The cell A3 contains a large reward.

    One of the simplest policies that can be used is random: at each state, the agent randomly moves to any allowed cell with equal probability. The corresponding policy for this strategy is illustrated in the figure above.

    The demonstrated maze is also an example of an episodic task. After reaching the terminal state and obtaining a certain reward, a new independent game can be initialized.

    Apart from policies, in reinforcement learning, it is common to use the notion of value functions which describe how good or bad (in terms of the expected reward) it is for the agent to be in a given state or to take a certain action given the current state.

    State-value function

    State-value function v(s) (or simply V-function) is a mapping from every environment state to the cumulative expected reward the agent would receive if it were initially placed at that state following a certain policy π.

    V-function can be represented as a 1-dimensional table of size |S|.

    V-function outputs the expected reward given an input state s under the condition that the agent follows the policy π. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    To better understand the definition of the V-function, let us refer to the previous example. We can see that cells located in the neighbourhood of A3 (which are A2, A3 and B3) have higher V-values than those located further from it (like A1, B2 and C2). This makes sense because being located near a large reward at A3, the agent has a higher chance to collect it.

    V-function example. Every game state corresponds to a cumulative reward the agent would receive if it were initially placed in it.

    The V-value for terminal states is equal to 0.

    Action-value function

    Action-value functions are similar in concept to state-value functions. However, they also take into account a possible action the agent can take under a given policy.

    Action-value function q(s, a) (or simply Q-function) is a mapping from each environment state s ∈ S and each possible agent’s action a ∈ A to the expected reward the agent would receive if it were initially placed at that state and had to take that action following a certain policy π.

    Q-function can be represented in the form of table of size |S| x |A|.

    Q-function definition. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
    Q-function example. For every pair (state, action), the Q-function outputs the corresponding expected reward.

    The difference between state and action functions is only in the fact that the action-value function takes additional information about the action the agent is going to take in the current state. The state function only considers the current state and does not take into account the next agent’s action.

    Both V- and Q-functions are learned from the agent’s experience.

    A subtlety of V- and Q-values

    Why is q(s, a) ≠ v(s’)? In other words, why is the expected reward of an agent that is in state s and takes the action a leading to the next state s’ not equal to the expected reward of an agent that is simply in state s’?

    This question might seem logical at first. Indeed, let us take the agent from the example above who is at cell B2 and assume that it then makes a transition to B3. From the Q-table we can see that the expected reward q(B2, “up”) = 5.6. At the same time, the V-table shows the expected reward v(B3) = 6.2. While 5.6 is relatively close to 6.2, both values are still not equal. So the ultimate question is why q(B2, “up”) ≠ v(B3)?

    The answer to this question lies in the fact that, even when choosing an action in the current state s that deterministically leads to the next state s’, the reward received by the agent from that transition is taken into account by the Q-function but not by the V-function. In other words, if the current timestamp is t, then the expected reward q(s, a) considers the discounted sum starting from the reward of that transition: Rₜ₊₁ + γRₜ₊₂ + … . The expected reward corresponding to v(s’) does not contain that first term: it is Rₜ₊₂ + γRₜ₊₃ + … .

    It is worth additionally noting that sometimes an action a taken at some state s can lead to multiple possible next states. The simple maze example above does not demonstrate this concept. But we can add the following condition, for instance, to the agent’s actions: when the agent chooses a direction to move in the maze, there is a 20% chance that the light in the new cell will be turned off and, because of that, the agent will ultimately move at 90° relative to that direction.

    The introduced concept demonstrates how the same action from the same state can lead to different states. As a consequence, the rewards received by the agent from the same action and state can differ. That is another aspect that contributes to the inequality between q(s, a) and v(s’).

    Bellman equation

    The Bellman equation is one of the fundamental equations in reinforcement learning! In simple words, it recursively relates the state / action function values at the current and next timestamps.

    V-function

    By using the definition of the expected value, we can rewrite the expression of the state value function to use the V-function of the next step:

    Bellman equation for the V-function. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    What this equality states is simply that the v-value of the current state s equals the expected value of the sum of the reward the agent receives for transitioning out of s to the next state s’ and the discounted v-value of that next state s’.

    In their book, Richard S. Sutton and Andrew G. Barto use so-called “backup diagrams” that help to better understand the flow of value functions and capture the logic behind the probability multiplications that take place in the Bellman equation. The one used for the V-function is demonstrated below.

    Backup diagram for V-function. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    The Bellman equation plays an important role in computing and approximating the V-function.
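    To make the recursion concrete, here is a minimal tabular sketch in Python. The two-state, two-action MDP below (transition tensor P, reward tensor R, policy and discount factor gamma) is entirely made up for illustration; it is not the maze example from earlier.

    import numpy as np

    # Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a, s'] = reward
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    R = np.array([[[1.0, 0.0], [0.0, 2.0]],
                  [[0.5, 0.5], [0.0, 0.0]]])
    policy = np.array([[0.5, 0.5], [0.9, 0.1]])  # pi(a|s)
    gamma = 0.9                                  # discount factor

    def bellman_backup_v(v):
        # v(s) = sum_a pi(a|s) * sum_s' p(s'|s,a) * [r(s,a,s') + gamma * v(s')]
        return np.einsum('sa,sax,sax->s', policy, P, R + gamma * v[None, None, :])

    # Iterating the backup to its fixed point gives v_pi (iterative policy evaluation)
    v = np.zeros(2)
    for _ in range(200):
        v = bellman_backup_v(v)
    print(v)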

    Q-function

    Similarly to V-functions, we can derive the Bellman equation for Q-functions.

    Bellman equation for the Q-function. Source: Reinforcement Learning. Exercise Solutions | GitHub repository (@LyWangPX).
    Backup diagram for Q-function. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
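    Continuing the toy sketch above, the Q-function backup differs only in that the first action is fixed instead of being averaged over the policy:

    def bellman_backup_q(v):
        # q(s, a) = sum_s' p(s'|s,a) * [r(s,a,s') + gamma * v(s')]
        return np.einsum('sax,sax->sa', P, R + gamma * v[None, None, :])

    q = bellman_backup_q(v)
    print(q)  # one expected reward per (state, action) pair, as in the Q-table earlier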

    Optimal policy

    Let us define the comparison operation between different policies.

    A policy π₁ is said to be better than or equal to policy π₂ if the expected reward of π₁ is greater than or equal to the expected reward of π₂ for all states s ∈ S.

    A policy π⁎ is said to be optimal if it is better than or equal to any other policy.

    Every optimal policy also has the optimal V⁎- and Q⁎-functions associated with it.

    Bellman optimality equation

    We can rewrite the Bellman equations for optimal policies. In reality, they look very similar to the normal Bellman equations we saw before, except that the policy term π(a|s) is removed and the max function is added to deterministically select the best action a from the current state s, i.e. the action that maximises the expected reward.

    Bellman optimality equation for the V-function. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
    Bellman optimality equation for the Q-function. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

    These equations can be solved mathematically for every state. If either the optimal V⁎-function or the optimal Q⁎-function is found, the optimal policy π⁎ can also be easily derived: it always greedily chooses the action that maximises the expected reward.

    Unfortunately, it is very hard in practice to solve the Bellman equations exactly because the number of states in most problems is usually huge. For this reason, reinforcement learning methods are used that can approximate optimal policies with far less computation and memory.
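    For tiny state spaces, however, the optimality equations can be solved directly by repeatedly applying the max-backup. A minimal value-iteration sketch on the toy MDP assumed above:

    def value_iteration(P, R, gamma, n_iters=500):
        v = np.zeros(P.shape[0])
        for _ in range(n_iters):
            q = np.einsum('sax,sax->sa', P, R + gamma * v[None, None, :])
            v = q.max(axis=1)              # v*(s) = max_a q*(s, a)
        return v, q.argmax(axis=1)         # optimal values and the greedy (optimal) policy

    v_star, pi_star = value_iteration(P, R, gamma)
    print(v_star, pi_star)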

    Conclusion

    In this article, we have discussed how agents learn through experience by trial and error. The simplicity of the introduced reinforcement learning framework lets it generalize to many problems, while providing a flexible way to use the notions of policy and value functions. At the same time, the ultimate objective of an algorithm is to compute the optimal V- and Q-functions that maximize the expected reward.

    Most of the existing algorithms try to approximate the optimal policy function. While the best solution is almost impossible to get in real-world problems due to memory and computation constraints, approximation methods work very well in most situations.

    Resources

    All images unless otherwise noted are by the author.



  • Generate “Verified” Python Code Using AutoGen Conversable Agents

    Generate “Verified” Python Code Using AutoGen Conversable Agents

    Shahzeb Naveed

    Leverage multi-agentic workflows for code testing and debugging

    “Two AI bots solving an Error” —Source: Adobe Firefly (Image generated by author)

    It’s April 2024, and it’s been about 17 months since we started using LLMs like ChatGPT to aid us in code generation and debugging tasks. While this has added a great deal of productivity, there are indeed times when the generated code is full of bugs and makes us take the good ole StackOverflow route.

    In this article, I’ll give a quick demonstration on how we can address this lack of “verification” using Conversable Agents offered by AutoGen.

    What is AutoGen?

    “AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.”

    Presenting LeetCode Problem Solver:

    Start by quietly installing autogen:

    !pip install pyautogen -q --progress-bar off

    I’m using Google Colab, so I entered my OPENAI_API_KEY in the Secrets tab and securely loaded it along with other modules:

    import os
    import csv
    import autogen
    from autogen import Cache
    from google.colab import userdata
    userdata.get('OPENAI_API_KEY')

    I’m using gpt-3.5-turbo only because it’s cheaper than GPT-4. If you can afford more expensive experimentation and/or you’re doing things more “seriously”, you should obviously use a stronger model.

    llm_config = {
        "config_list": [{"model": "gpt-3.5-turbo", "api_key": userdata.get('OPENAI_API_KEY')}],
        "cache_seed": 0,   # seed for reproducibility
        "temperature": 0,  # temperature to control randomness
    }

    Now, I’ll copy the problem statement from my favourite LeetCode problem Two Sum. It’s one of the most commonly asked questions in leetcode-style interviews and covers basic concepts like caching using hashmaps and basic equation manipulation.

    LEETCODE_QUESTION = """
    Title: Two Sum

    Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order.

    Example 1:
    Input: nums = [2,7,11,15], target = 9
    Output: [0,1]
    Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].

    Example 2:
    Input: nums = [3,2,4], target = 6
    Output: [1,2]

    Example 3:
    Input: nums = [3,3], target = 6
    Output: [0,1]

    Constraints:

    2 <= nums.length <= 10^4
    -10^9 <= nums[i] <= 10^9
    -10^9 <= target <= 10^9
    Only one valid answer exists.

    Follow-up: Can you come up with an algorithm that is less than O(n^2) time complexity?
    """

    We can now define both of our agents. One agent acts as the “assistant” that suggests the solution, and the other serves as a proxy for us, the user, and is also responsible for executing the suggested Python code.

    # create an AssistantAgent named "assistant"

    SYSTEM_MESSAGE = """You are a helpful AI assistant.
    Solve tasks using your coding and language skills.
    In the following cases, suggest python code (in a python coding block) or shell script (in a sh coding block) for the user to execute.
    1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, get the current date/time, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
    2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
    Solve the task step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
    When using code, you must indicate the script type in the code block. The user cannot provide any other feedback or perform any other action beyond executing the code you suggest. The user can't modify your code. So do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user.
    If you want the user to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line. Don't include multiple code blocks in one response. Do not ask users to copy and paste the result. Instead, use 'print' function for the output when relevant. Check the execution result returned by the user.
    If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.
    When you find an answer, verify the answer carefully. Include verifiable evidence in your response if possible.

    Additional requirements:
    1. Within the code, add functionality to measure the total run-time of the algorithm in python function using "time" library.
    2. Only when the user proxy agent confirms that the Python script ran successfully and the total run-time (printed on stdout console) is less than 50 ms, only then return a concluding message with the word "TERMINATE". Otherwise, repeat the above process with a more optimal solution if it exists.
    """

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config=llm_config,
        system_message=SYSTEM_MESSAGE,
    )

    # create a UserProxyAgent instance named "user_proxy"
    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=4,
        is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
        code_execution_config={
            "work_dir": "coding",
            "use_docker": False,
        },
    )

    I set human_input_mode to “NEVER” because I’m not planning to give any inputs myself, and max_consecutive_auto_reply to 4 to limit the back-and-forth turns in the conversation. The Assistant agent has been instructed to respond with the word “TERMINATE”, which tells the UserProxyAgent when to conclude the conversation.

    Now, the fun time! We’ll initiate the conversation by sending a message from our UserProxyAgent to our Assistant.

    One added benefit of using AutoGen (even for non-agentic workflows) is that it provides an explicit caching capability to help you save API costs during development. Here, I’m caching responses on disk, but you can also integrate Redis for this purpose.

    # Use DiskCache as cache
    with Cache.disk(cache_seed=7) as cache:
        # the assistant receives a message from the user_proxy, which contains the task description
        chat_res = user_proxy.initiate_chat(
            assistant,
            message="""Solve the following leetcode problem and also comment on it's time and space complexity:\n\n""" + LEETCODE_QUESTION
        )

    Output:

    user_proxy (to assistant):

    Solve the following leetcode problem and also comment on it's time and space complexity:


    Title: Two Sum

    ...

    Follow-up: Can you come up with an algorithm that is less than O(n^2) time complexity?


    --------------------------------------------------------------------------------
    assistant (to user_proxy):

    To solve this problem efficiently, we can use a dictionary to store the elements we have seen so far along with their indices. While iterating through the array, we can check if the complement of the current element (target - current element) is already in the dictionary. If it is, we have found the pair that sums up to the target.

    Here's the Python code to solve the Two Sum problem with a time complexity less than O(n^2):

    ```python
    # filename: two_sum.py
    def two_sum(nums, target):
        num_dict = {}
        for i, num in enumerate(nums):
            complement = target - num
            if complement in num_dict:
                return [num_dict[complement], i]
            num_dict[num] = i

    # Test the function with the given examples
    nums1 = [2, 7, 11, 15]
    target1 = 9
    print(two_sum(nums1, target1))  # Output: [0, 1]

    nums2 = [3, 2, 4]
    target2 = 6
    print(two_sum(nums2, target2))  # Output: [1, 2]

    nums3 = [3, 3]
    target3 = 6
    print(two_sum(nums3, target3))  # Output: [0, 1]
    ```

    The time complexity of this solution is O(n) because we iterate through the list of numbers only once. The space complexity is also O(n) because we store elements in a dictionary.

    You can run this code and check if it provides the correct output. Let me know the total run-time, and if it's less than 50 ms, you can reply with "TERMINATE". If it's not, I can suggest a more optimal solution.

    --------------------------------------------------------------------------------

    >>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is python)...
    user_proxy (to assistant):

    exitcode: 0 (execution succeeded)
    Code output:
    [0, 1]
    [1, 2]
    [0, 1]


    --------------------------------------------------------------------------------
    assistant (to user_proxy):

    The Python script ran successfully and provided the correct output for the given examples. Since the total run-time is not provided, I will suggest measuring the total run-time of the algorithm using the "time" library to determine if it's less than 50 ms.

    You can add the functionality to measure the total run-time by importing the "time" library and using it to calculate the elapsed time before and after calling the `two_sum` function. Here's how you can modify the code:

    ```python
    # filename: two_sum.py
    import time

    def two_sum(nums, target):
        num_dict = {}
        for i, num in enumerate(nums):
            complement = target - num
            if complement in num_dict:
                return [num_dict[complement], i]
            num_dict[num] = i

    # Test the function with the given examples and measure the total run-time
    nums1 = [2, 7, 11, 15]
    target1 = 9
    start_time = time.time()
    print(two_sum(nums1, target1))  # Output: [0, 1]
    end_time = time.time()
    print("Total run-time:", (end_time - start_time) * 1000, "ms")

    nums2 = [3, 2, 4]
    target2 = 6
    start_time = time.time()
    print(two_sum(nums2, target2))  # Output: [1, 2]
    end_time = time.time()
    print("Total run-time:", (end_time - start_time) * 1000, "ms")

    nums3 = [3, 3]
    target3 = 6
    start_time = time.time()
    print(two_sum(nums3, target3))  # Output: [0, 1]
    end_time = time.time()
    print("Total run-time:", (end_time - start_time) * 1000, "ms")
    ```

    Please run this modified code and provide the total run-time for each example. If the total run-time is less than 50 ms for all examples, you can reply with "TERMINATE".

    --------------------------------------------------------------------------------

    >>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is python)...
    user_proxy (to assistant):

    exitcode: 0 (execution succeeded)
    Code output:
    [0, 1]
    Total run-time: 0.01239776611328125 ms
    [1, 2]
    Total run-time: 0.00286102294921875 ms
    [0, 1]
    Total run-time: 0.0016689300537109375 ms


    --------------------------------------------------------------------------------
    assistant (to user_proxy):

    The Python script ran successfully, and the total run-time for each example is significantly less than 50 ms. Therefore, you can reply with "TERMINATE". The algorithm provided an efficient solution with a time complexity of O(n) and a space complexity of O(n).

    --------------------------------------------------------------------------------

    This is what’s happening here:

    1. The UserProxyAgent asks the Assistant to solve the problem based on the task description.
    2. The Assistant suggests a solution in a Python code block.
    3. The UserProxyAgent executes the Python code.
    4. The Assistant reads the console output and responds with a modified solution (now with time-measurement functionality). Honestly, I would have expected this modified solution right away, but this behavior can be tuned through prompt engineering or by employing a stronger LLM.

    With AutoGen, you can also display the cost of the agentic workflow.

    chat_res.cost

    ({'total_cost': 0,
    'gpt-3.5-turbo-0125': {'cost': 0,
    'prompt_tokens': 14578,
    'completion_tokens': 3460,
    'total_tokens': 18038}}

    Concluding Remarks:

    Thus, by using AutoGen’s conversable agents:

    1. We automatically verified that the Python code suggested by the LLM actually works.
    2. We created a framework through which the LLM can further respond to syntax or logical errors by reading the console output.

    Thanks for reading! Please follow me and subscribe to be the first when I post a new article! 🙂

    Check out my other articles:

    1. A Deep Dive into Evaluation in Azure Prompt Flow
    2. Develop a UI for Azure Prompt Flow with Streamlit
    3. Build a custom Chatbot using Hugging Face Chat UI and Cosmos DB on Azure Kubernetes Service
    4. Deploy Hugging Face Text Generation Inference on Azure Container Instance



  • Estimating Individualized Treatment Rules Using Outcome Weighted Learning

    Estimating Individualized Treatment Rules Using Outcome Weighted Learning

    Nadav Har-Tuv

    A non-parametric approach for fitting personalized treatments to patients

    In many diseases, different patients will react differently to different treatments. A drug that is beneficial for some patients may not work for other patients with different characteristics. Therefore, healthcare can significantly improve by treating patients based on their characteristics, rather than treating all patients with the same treatment.

    In this article, I will try to show you how we can train a machine-learning model to learn the optimal personalized treatment.

    This article is about the field of personalized health care, but the results can be used in any field. For example: Different people will react differently to different ads on social media, so, in cases where there are multiple ads for the same product, how do you choose which ad to show to which viewers?

    This method is useful whenever you have to give a treatment but can only give one treatment to each individual in the sample, and therefore have no way of knowing how that individual would have responded to the other treatments.

    Let’s formalize the problem

    An experiment was performed to compare two (or more) treatments. We’ll name them T = 1, 2, … . A vector of covariates X represents every patient. Every patient i with a covariate vector Xᵢ who was given a treatment Tᵢ has a recorded response to the treatment, Rᵢ.

    For example, let’s assume that you want to test 3 different drugs for diabetes, we’ll name these drugs “1”, “2”, “3”.

    We have a patient named Esther. She is 64 years old, was diagnosed with diabetes 8 years ago, weighs 65 kilos, and is 1.54 meters tall. Esther received drug “1” and her blood sugar was reduced by 10 points after being given the new drug.

    In our example, the data point we have on Esther is X = {Female, 64 years old, 8 years since diagnosis, 65 kg, 1.54 meters}, T = “1”, R = 10.

    In this setting, we would like to learn an optimal decision rule D(x), that assigns a treatment “1”, “2”, or “3” to every patient to optimize the outcome for that patient.

    The old way of solving this problem was to model the outcome as a function of the data and the treatment and denote the predicted outcome as f(X,T). Once we have a model we can create a decision rule D(x): we compute f(X,1), f(X,2), and f(X,3) and give the patient the drug that maximizes their expected outcome.
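    As a rough sketch of this outcome-modeling approach (the random-forest regressor and the 0/1/2 treatment encoding below are illustrative assumptions; any regressor and encoding would do):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def fit_outcome_model(X, T, R):
        # model the observed response as f(X, T) by appending the treatment as an extra feature
        features = np.column_stack([X, T])
        return RandomForestRegressor().fit(features, R)

    def decision_rule(model, X_new, treatments=(0, 1, 2)):
        # D(x) = argmax_t f(x, t): predict the outcome under every treatment and pick the best
        preds = [model.predict(np.column_stack([X_new, np.full(len(X_new), t)])) for t in treatments]
        return np.argmax(np.vstack(preds), axis=0)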

    This solution can work when we have a fairly good understanding of the underlying model that created the data. In this case, all we need is some finetuning to find the best parameters for our case.

    However, if the model is bad then our results will be bad, regardless of the amount of data at hand.

    Can we come up with a decision rule that is not parametric and does not assume any prior knowledge of the relationship between the data and the treatment result?

    The answer is yes, we can use machine learning to find a decision rule that does not make any assumptions about the relationship between the response and the treatment!

    Solving with a non-parametric approach using Outcome Weighted Learning

    The way to solve this problem is to solve a classification problem where the labels are the treatments given in the experiment and every data point i is weighted by Rᵢ/π(Tᵢ|Xᵢ), where π(Tᵢ|Xᵢ) is the propensity of getting treatment Tᵢ, given that you have the characteristics Xᵢ, which can be computed from the data.

    This makes sense because we try to follow the experiment’s results, but only where they worked best. The reason we divide by the propensities is to correct for the category size bias. If you’ve learned some reinforcement learning, this whole process should look familiar to you.

    Here is an example of an OWL classifier using SVM. Feel free to use any classifier you like.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn import svm

    def owl_classifier(X_train, T, R, kernel, gamma):
        n = len(T)
        pi = np.zeros(n)  # initialize pi as a vector of zeros
        # n * len(unique(T)) matrix giving every person the probability of getting each treatment
        probs = LogisticRegression().fit(X_train, T).predict_proba(X_train)
        for t in np.unique(T):  # assumes treatments are encoded as 0, 1, 2, ...
            # every data point is assigned the probability of the treatment it actually got, given the covariates
            pi += probs[:, t] * (T == t)
        clf = svm.SVC(kernel=kernel, gamma=gamma)  # the SVM hyperparameters should be found by cross-validation
        clf.fit(X_train, T, sample_weight=R / pi)  # fit with the treatments as labels and R/pi as sample weights
        return clf  # return the fitted classifier so it can be used for prediction

    Simulation to test the OWL method

    We can test the OWL method by simulating data. We create the reward function so that we know the optimal treatment for every patient. We can then train the OWL classifier on the data and check how well it matches the optimal classifier.

    For example:

    I created 50 features that are all sampled from a U([-1,1]) distribution. I gave the patients one of three treatments {1,2,3} at random, uniformly.

    The response function is sampled from a N(μ, 1) distribution, where μ = (X₁ + X₂)*I(T=1) + (X₁ − X₂)*I(T=2) + (X₂ − X₁)*I(T=3)

    # This code block creates the data for the simulation
    import numpy as np

    n_train = 500 # I purposely chose a small training set to simulate a medical trial
    n_col = 50 # This is the number of features
    n_test = 1000
    X_train = np.random.uniform(low = -1, high = 1, size = (n_train, n_col))
    T = np.random.randint(3, size = n_train) # Treatments given at random uniformly
    R_mean = (X_train[:,0]+X_train[:,1])*(T==0) + (X_train[:,0]-X_train[:,1])*(T==1) + (X_train[:,1]-X_train[:,0])*(T==2)
    R = np.random.normal(loc = R_mean, scale = .1) # The standard deviation can be tweaked
    X_test = np.random.uniform(low = -1 , high = 1, size = (n_test, n_col))

    # The optimal classifier can be deduced from the design of R
    optimal_classifier = (1-(X_test[:,0] >0)*(X_test[:,1]>0))*((X_test[:,0] > X_test[:,1]) + 2*(X_test[:,1] > X_test[:,0]))

    It is not hard to see that the optimal treatment regime is to give treatment 1 if both X₁ and X₂ are positive. If they are both negative, give treatment 2 if X₂<X₁ and give treatment 3 if X₁<X₂. If X₁ is positive and X₂ is negative, give treatment 2. If X₂ is positive and X₁ is negative, give treatment 3.

    Or we can show this with an image. These are the different ranges of the optimal treatment, shown for ranges of X₁, X₂:

    Optimal treatment ranges for combinations of X₁, X₂

    I sampled 500 data points with 50 features and the reward function that I described above. I fit an OWL classifier with a Gaussian (‘rbf’) kernel and got the following classifications, which I visualized for values of X₁, X₂:

    Classification of treatment groups visualized for values of X₁, X₂
    # Code for the plot
    import seaborn as sns

    kernel = 'rbf'
    gamma = 1 / X_train.shape[1]
    # gamma is a hyperparameter that should be found by cross-validation, but this is a good starting point
    D = owl_classifier(X_train, T, R, kernel, gamma)
    prediction = D.predict(X_test)
    sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=prediction)

    In case you missed what happened here: the data was composed of 2 features that affect the response and 48 features of pure noise. The model managed to learn the effect of the two important features without us modeling this relationship in any way!

    This is just one simple example. I made the reward function depend on X₁ and X₂ so that it’s easy to understand and visualize, but feel free to use other examples and try out different classifiers.

    Conclusion

    Outcome-weighted learning can be used to learn an optimal treatment in cases where we only see one treatment per patient in the training data, without having to model the response as a function of the features and the treatment.

    There is some math, left out of this article, that justifies this whole process; I did not just make it up off the top of my head.

    Future research on this topic should include:

    1. Exploitation vs. exploration: Even after we learned a treatment rule, it’s still beneficial to sometimes explore options that are considered not optimal according to our model. The model can be wrong.
    2. Sequential treatment: when there is a sequence of treatments, each one of them changes the state of the patient. The solution for the whole sequence should be found via dynamic programming.
    3. Design: in this article, I just assumed the treatments were given according to a given rule. Perhaps we can find some design that can improve the learning process.



  • Interpretable Latent Spaces Using Space-Filling Vector Quantization

    Interpretable Latent Spaces Using Space-Filling Vector Quantization

    Mohammad Hassan Vali

    A new unsupervised method that combines two concepts of vector quantization and space-filling curves to interpret the latent space of DNNs

    This post is a short explanation of our novel unsupervised distribution modeling technique called space-filling vector quantization (SFVQ) [1], published at the Interspeech 2023 conference. For more details, please see the paper at this link.

    Image from StockSnap.io

    Deep generative models are well-known neural-network-based architectures that learn a latent space whose samples can be mapped to sensible real-world data such as images, video, and speech. Such latent spaces act as a black box and are often difficult to interpret. In this post, we introduce our novel unsupervised distribution modeling technique, Space-Filling Vector Quantization (SFVQ), which combines the two concepts of space-filling curves and vector quantization (VQ). SFVQ helps make the latent space interpretable by capturing its underlying morphological structure. It is important to note that SFVQ is a generic tool for modeling distributions; its use is not restricted to any specific neural network architecture or data type (e.g., images, video, speech, etc.). In this post, we demonstrate the application of SFVQ to interpreting the latent space of a voice conversion model. You do not need technical knowledge of speech signals to understand this post, because we explain everything in general terms. Before anything else, let me explain what the SFVQ technique is and how it works.

    Space-Filling Vector Quantization (SFVQ)

    Vector quantization (VQ) is a data compression technique, similar to the k-means algorithm, which can model any data distribution. The figure below shows VQ applied to a Gaussian distribution. VQ clusters this distribution (gray points) using 32 codebook vectors (blue points), one per cluster. Each Voronoi cell (green lines) contains one codebook vector, such that this codebook vector is the closest one (in terms of Euclidean distance) to all data points located in that cell. In other words, each codebook vector is the representative of all data points located in its Voronoi cell. Therefore, applying VQ to this Gaussian distribution means mapping each data point to its closest codebook vector, i.e. representing each data point with its closest codebook vector. For more information about VQ and its other variants, you can check out this post.

    Vector Quantization applied on a Gaussian distribution using 32 codebook vectors. (image by author)
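    As a minimal numerical sketch of that mapping step (toy Gaussian samples and a randomly initialised, untrained codebook, not the trained codebook shown in the figure):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2))      # gray points: samples from a 2D Gaussian
    codebook = rng.normal(size=(32, 2))    # blue points: 32 codebook vectors (untrained here)

    # quantize: map every data point to its closest codebook vector (Euclidean distance)
    distances = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=-1)
    indices = distances.argmin(axis=1)     # index of the nearest codebook vector per point
    quantized = codebook[indices]          # each point is represented by that codebook vector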

    A space-filling curve is a piecewise-continuous line generated by a recursive rule; if the recursion is repeated infinitely, the curve keeps bending until it completely fills a multi-dimensional space. The following figure illustrates the Hilbert curve [2], a well-known type of space-filling curve whose corner points are defined by a specific mathematical formulation at each recursion iteration.

    Five first iterations of Hilbert curve to fill a 2D square distribution. (image by author)

    Taking intuition from space-filling curves, we can think of vector quantization (VQ) as mapping input data points onto a space-filling curve (rather than mapping them exclusively onto codebook vectors, as in normal VQ). We therefore incorporate vector quantization into space-filling curves, such that our proposed space-filling vector quantizer (SFVQ) models a D-dimensional data distribution by a continuous piecewise-linear curve whose corner points are vector quantization codebook vectors. The following figure illustrates VQ and SFVQ applied to a Gaussian distribution.

    Codebook vectors (blue points) of a vector quantizer, and a space-filling vector quantizer (curve in black) on a Gaussian distribution (gray points). Voronoi regions for VQ are shown in green. (image by author)

    For technical details on how to train SFVQ and how to map data points on SFVQ’s curve, please see section 2 in our paper [1].

    Note that when we train a normal VQ on a distribution, adjacent codebook vectors inside the learned codebook matrix can refer to totally different contents. For example, the first codebook element could refer to a vowel phone and the second one to a silent part of the speech signal. However, when we train SFVQ on a distribution, the learned codebook vectors are arranged such that adjacent elements in the codebook matrix (i.e. adjacent codebook indices) refer to similar contents in the distribution. We can use this property of SFVQ to interpret and explore the latent spaces of Deep Neural Networks (DNNs). As a typical example, in the following we explain how we used our SFVQ method to interpret the latent space of a voice conversion model [3].

    Voice Conversion

    The following figure shows a voice conversion model [3] based on the vector quantized variational autoencoder (VQ-VAE) [4] architecture. In this model, the encoder takes the speech signal of speaker A as input and passes its output into the vector quantization (VQ) block to extract the phonetic information (phones) from this speech signal. Then, this phonetic information, together with the identity of speaker B, goes into the decoder, which outputs the converted speech signal. The converted speech contains the phonetic content of speaker A with the identity of speaker B.

    Voice conversion model based on VQ-VAE architecture. (image by author)

    In this model, the VQ module acts as an information bottleneck that learns a discrete representation of speech, capturing only phonetic content and discarding speaker-related information. In other words, the VQ codebook vectors are expected to capture only the phone-related content of the speech. Here, the VQ output representation is considered the latent space of this model. Our objective is to replace the VQ module with our proposed SFVQ method to interpret the latent space. By interpretation we mean figuring out which phone each latent vector (codebook vector) corresponds to.

    Interpreting the Latent Space using SFVQ

    We evaluate the performance of our space-filling vector quantizer (SFVQ) on its ability to find structure in the latent space (representing phonetic information) of the above voice conversion model. For our evaluations, we used the TIMIT dataset [5], since it contains phone-wise labeled data using the phone set from [6]. For our experiments, we use the following phonetic grouping:

    • Plosives (Stops): {p, b, t, d, k, g, jh, ch}
    • Fricatives: {f, v, th, dh, s, z, sh, zh, hh, hv}
    • Nasals: {m, em, n, nx, ng, eng, en}
    • Vowels: {iy, ih, ix, eh, ae, aa, ao, ah, ax, ax-h, uh, uw, ux}
    • Semi-vowels (Approximants): {l, el, r, er, axr, w, y}
    • Diphthongs: {ey, aw, ay, oy, ow}
    • Silence: {h#}.

    To analyze the performance of our proposed SFVQ, we pass the labeled TIMIT speech files through the trained encoder and SFVQ modules, respectively, and extract the codebook vector indices corresponding to all phones present in the speech. In other words, we pass a speech signal with labeled phones and compute the index of the learned SFVQ codebook vector to which each phone is mapped. As explained above, we expect our SFVQ to map similar phonetic contents next to each other (index-wise in the learned codebook matrix). To examine this expectation, in the following figure we visualize the spectrogram of the sentence “she had your dark suit” and its corresponding codebook vector indices for the ordinary vector quantizer (VQ) and our proposed SFVQ.

    (Top) Codebook vector indices for the speech signal using our proposed SFVQ (in dark blue circles) and the ordinary VQ (in gray crosses). (Bottom) Spectrogram of the speech signal including codebook vector indices corresponding to speech frames. (image by author)

    We observe that the indices of the ordinary VQ do not have any particular structure. However, when using our proposed SFVQ, there is a clear structure in the codebook vector indices. The indices for the speech frames containing the fricative phones {sh, s} within the words {she, suit} are located next to each other throughout the frames. In addition, silence frames containing the phone {h#} and some other low-energy frames containing {kcl, tcl: k, t closures} within the words {dark, suit} are located next to each other in the range [0–20]. Notice that the figure above remains largely consistent for sentences with the same phonetic content, even across speakers with different genders, speech rhythms, and dialects.

    The figure below shows the histogram of SFVQ’s codebook indices for each phonetic group (explained above) over all TIMIT speech files. At first glance, we observe that consonants {silence, plosives, fricatives, nasals} and vowels {vowels, diphthongs, approximants} can be separated around index 125 (apart from the peak near index 20). We also observe that the most prominent peaks of the different groups are separated into different parts of the histogram.

    Histogram of SFVQ’s codebook vector indices for different phonetic groups. (image by author)

    With this visualization, we have a better understanding of the latent space, and we can now distinguish which part of the latent space refers to which phonetic group. We can even go into further detail and explore the latent space at the phone level. As an example of the distribution of phones within a phonetic group, the figure below illustrates the histograms of all phones in the fricatives group.

    Histogram of SFVQ’s codebook vector indices for fricative phones. (image by author)

    By observing the most prominent peak for each phone, we find that the peaks of similar phones are located next to each other. To elaborate, we list similar phones and their corresponding peak indices here: {f:51, v:50}, {th:78, dh:80}, {s:71, z:67, sh:65, zh:67}, {hh:46, hv:50}. Except for the {hh, hv} phones, fricatives are mainly located in the range [50–85]. Further structure can readily be identified from all provided figures by visual inspection.

    These experiments demonstrate that our proposed SFVQ achieves a coherently structured and easily interpretable representation for the latent codebook vectors, which represent the phonetic information of the input speech. Accordingly, there is an obvious distinction between various phonetic groupings such as {consonants vs. vowels} and {fricatives vs. nasals vs. diphthongs vs. …}, and we can simply tell which phone each codebook vector represents. In addition, similar phones within a specific phonetic group are encoded next to each other in the latent codebook space. This is the main interpretability that we aimed to obtain from the black box called the latent space.

    One advantage of SFVQ over supervised approaches that try to make the latent space interpretable is that SFVQ does not require any human labeling or manual restrictions on the learned latent space. To interpret our method, the user only has to study the unsupervised learned latent space once by observation. This observation needs much less labeled data than is necessary for supervised training of big models. Again, we want to note that SFVQ is a generic tool for modeling distributions; its use is not restricted to any specific neural network architecture or data type (e.g., images, video, speech, etc.).

    GitHub Repository

    A PyTorch implementation of our SFVQ technique is publicly available on GitHub at the following link:

    GitHub – MHVali/Space-Filling-Vector-Quantizer

    Acknowledgement

    Special thanks to my doctoral supervisor, Prof. Tom Bäckström, who supported me and was the other contributor to this work.

    References

    [1] M.H. Vali, T. Bäckström, “Interpretable Latent Space Using Space-Filling Curves for Phonetic Analysis in Voice Conversion”, in Proceedings of Interspeech, 2023.

    [2] H. Sagan, “Space-filling curves”, Springer Science & Business Media, 2012.

    [3] B. Van Niekerk, L. Nortje, and H. Kamper, “Vector-quantized neural networks for acoustic unit discovery in the Zerospeech 2020 challenge”, in Proceedings of Interspeech, 2020.

    [4] A. Van Den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

    [5] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CDROM”, Linguistic Data Consortium, 1993.

    [6] C. Lopes and F. Perdigao, “Phoneme recognition on the TIMIT database”, in Speech Technologies. IntechOpen, 2011, ch. 14. [Online]. Available: https://doi.org/10.5772/17600

