Tag: AI

  • How BigBasket improved AI-enabled checkout at their physical stores using Amazon SageMaker

    Santosh Waddi

    This post is co-written with Santosh Waddi and Nanda Kishore Thatikonda from BigBasket. BigBasket is India’s largest online food and grocery store. They operate in multiple ecommerce channels such as quick commerce, slotted delivery, and daily subscriptions. You can also buy from their physical stores and vending machines. They offer a large assortment of over […]

    Originally appeared here:
    How BigBasket improved AI-enabled checkout at their physical stores using Amazon SageMaker

  • Amazon SageMaker Feature Store now supports cross-account sharing, discovery, and access

    Ioan Catana

    Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used […]

    Originally appeared here:
    Amazon SageMaker Feature Store now supports cross-account sharing, discovery, and access

  • The Math and Code Behind K-Means Clustering

    Cristian Leo

    Why is K-Means the most popular algorithm in Unsupervised Learning? Let’s dive into its math, and build it from scratch.

    Image generated by DALL-E

    K-means clustering is a staple in machine learning for its straightforward approach to organizing complex data. In this article we’ll explore the core of the algorithm. We will delve into its applications, dissect the math behind it, build it from scratch, and discuss its relevance in the fast-evolving field of data science.

    Index
    · 1: Understanding the Basics
    1.1: What is K-Means Clustering?
    1.2: How Does K-Means Work?

    · 2: Implementing K-Means
    2.1: The Mathematics Behind K-Means
    2.2: Choosing the Optimal Number of Clusters

    · 3: K-Means in Practice
    3.1 K-Means From Scratch in Python
    3.2 Implementing K-Means with Scikit-Learn

    · 4: Advantages and Challenges
    4.1 Benefits of Using K-Means
    4.2 Overcoming K-Means Limitations

    · 5: Beyond Basic K-Means
    5.1 Variants of K-Means

    · 6: Conclusion

    · Bibliography

    1: Understanding the Basics

    1.1: What is K-Means Clustering?

    Image generated by DALL-E

    Imagine you’re at a farmer’s market with a bunch of fruits and veggies that you need to sort out. You want to organize them so all the apples, oranges, and carrots are hanging out with their kind. This task is a lot like what K-Means Clustering does in data science.

    So, what’s K-Means Clustering? It’s a technique used to naturally group data. Think of it as a way to sort unlabeled data into different groups or clusters. Each cluster is a bunch of data points that are more similar to one another than to those in other groups. The “K” in K-Means? It represents the number of groups you think there should be.

    1.2: How Does K-Means Work?

    Now, keep the farmer’s market example in mind, and let’s dig a bit deeper into the mechanics of K-Means.

    Think of K-Means as setting up the initial fruit stands (centroids) in our market. It randomly picks a few spots (data points) to place these stands. The number of stands you decide to set up is the “K” in K-Means, and this number depends on how many different types of produce (groups) you think you have.

    Now, K-Means takes each fruit and veggie (data point) and figures out which stand (centroid) it belongs to based on how close it is. It’s like each apple, orange, or carrot choosing the closest stand to hang out at, forming little groups around them.

    After all the produce has found a stand, K-Means checks if the stands are in the best spots. It does this by finding a new spot for each stand based on the average position of all the produce grouped around them. It’s similar to moving the stands to the exact middle of each produce group to make sure they’re truly at the heart of the action.

    This process of sorting produce and adjusting stands keeps going. Fruits and veggies might move to a different stand if they find one closer, and the stands might shift positions to better represent their groups. This cycle continues until the stands and their produce groups don’t change much anymore. That’s when K-Means knows the market is perfectly organized, and every piece of produce is with its kind.

    Image generated by DALL-E

    What you end up with are nicely organized stands (clusters) with fruits and veggies grouped around them. Each stand represents a type of produce, showing you how your market (data) can be divided into distinct sections (categories) based on similarities.

    2: Implementing K-Means

    K-Means Step Chart — Image by Author

    2.1: The Mathematics Behind K-Means

    Let’s put on our math hats for a moment and peek into the engine room of the K-Means algorithm to see what makes it tick. K-Means is all about finding the sweet spot:

    the optimal placement of centroids that minimizes the distance between points in a cluster and their central point.

    Here’s how the math plays out:

    Step 1: Initialization
    Initially, K-Means needs to decide where to start, choosing k initial centroids (μ1​,μ2​,…,μk​) from the dataset. This can be done randomly or more strategically with methods like K-Means++, which we will cover later.

    While no specific formula is used here, the choice of initial centroids can significantly affect the outcome. But don’t worry, because, in the next section, we will guide you through the best tips to choose the optimal k.

    Step 2: Assignment of Points to Closest Centroid
    Next, each data point x is assigned to the nearest centroid, forming k clusters, where the “nearest” centroid is determined by the minimum distance between the point and all centroids.

    Here, the distance d between a point x and a centroid μi is typically calculated using the Euclidean distance:

    d(x, μi) = √( Σj (xj − μij)² ), summing over the j = 1, …, m dimensions

    Euclidean distance formula — Image by Author

    where xj and μij are the j-th dimensions of point x and centroid μi, respectively. Now, while Euclidean distance is the most popular choice for K-Means, you could explore its application with other distance metrics. However, keep in mind that Euclidean distance is usually the suggested choice for optimal performance, whereas other variants of the algorithm are more flexible about the distance metric. See subsection 2.1 of my previous article on K-Nearest Neighbors to learn more about distance metrics:

    The Math Behind K-Nearest Neighbors

    Keeping the Euclidean distance in mind, let’s see how the algorithm assigns a point x to the cluster Ci​:

    x ∈ Ci ⟺ d(x, μi) ≤ d(x, μj) for all j, 1 ≤ j ≤ k

    Assignment of x to cluster condition — Image by Author

    Here’s what it means:

    • Ci​: This represents the i-th cluster, a set of points grouped based on their similarity.
    • x: This is a point in the dataset that the K-Means algorithm is trying to assign to one of the k clusters.
    • d(x,μi​): This calculates the distance between the point x and the centroid μi​ of cluster Ci​. The centroid is essentially the mean position of all the points currently in cluster Ci​.
    • d(x,μj​): This calculates the distance between the point x and the centroid μj​ of any other cluster Cj​.
    • k: This is the total number of clusters the algorithm is trying to partition the data into.

    The condition d(x,μi) ≤ d(x,μj) ∀ j, 1 ≤ j ≤ k specifies that a point x is assigned to the cluster Ci if and only if the distance from x to the centroid μi of Ci is less than or equal to the distance from x to the centroid μj of every other cluster Cj, for every j from 1 to k.

    In simpler terms, the formula says:

    Look at all the clusters and their centers (centroids). Now, find which centroid is closest to point x. Whichever cluster has this closest centroid, that’s the cluster x belongs to.

    This step ensures that each point is assigned to the cluster with the nearest centroid, minimizing within-cluster distances and making the clusters as tight and homogenous as possible.
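
    To make this assignment step concrete, here is a minimal NumPy sketch (the data and centroid values are made up for illustration):

    import numpy as np

    # Toy data: 6 points in 2D and k = 2 centroids (illustrative values)
    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                  [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
    centroids = np.array([[1.0, 1.0], [8.0, 9.0]])

    # Euclidean distance from every point to every centroid: shape (n_points, k)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

    # Each point joins the cluster of its closest centroid
    labels = np.argmin(distances, axis=1)
    print(labels)  # [0 0 1 1 0 1]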

    Step 3: Updating the Centroids
    After all points have been assigned to clusters, the positions of the centroids are recalculated to be at the center of their respective clusters.

    The new position μi′​ of centroid μi​ is calculated as the mean of all points assigned to cluster Ci​:

    μi′ = (1 / |Ci|) Σ x, summing over all points x ∈ Ci

    Centroid position’s update formula — Image by Author

    where ∣Ci​∣ is the number of points in cluster Ci​, and the summation aggregates the positions of all points in the cluster.

    Step 4: Iteration Until Convergence
    The assignment and update steps (Steps 2 and 3) are repeated until the centroids no longer change significantly, indicating that the algorithm has converged.
    Here, convergence is typically assessed by measuring the change in centroid positions between iterations. If the change falls below a predetermined threshold, the algorithm stops.

    In the context of the K-Means algorithm, there isn’t a single, universally defined formula for convergence that applies in all cases. Instead, convergence is determined based on the algorithm’s behavior over successive iterations, specifically whether the centroids of the clusters stop changing significantly.

    However, a common way to assess convergence is by looking at the change in the within-cluster sum of squares (WCSS) or the positions of the centroids between iterations. Here, we set a threshold for changes in the centroid positions or the WCSS. If the change falls below this threshold from one iteration to the next, the algorithm can be considered to have converged. This can be expressed as:

    ∥μi(t) − μi(t−1)∥ < ϵ for all i

    Convergence formula — Image by Author

    where:

    • μi(t)​ is the position of centroid i at iteration t,
    • μi(t−1)​ is the position of centroid i at iteration t−1,
    • ϵ is a small positive value chosen as the convergence threshold.
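
    Putting Steps 2–4 together, a compact NumPy sketch of the whole loop might look like this (toy data; it assumes no cluster ever goes empty):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))          # toy data (illustrative)
    k, eps = 3, 1e-4                       # number of clusters and threshold ϵ
    centroids = X[rng.choice(len(X), k, replace=False)]  # random init

    while True:
        # Step 2: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once every centroid moved less than ϵ
        if np.linalg.norm(new_centroids - centroids, axis=1).max() < eps:
            break
        centroids = new_centroids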

    2.2: Choosing the Optimal Number of Clusters

    Deciding on the right number of clusters, k, in K-Means clustering is more art than science. It’s like Goldilocks trying to find the bed that’s just right, with not too many clusters to overcomplicate things, and not too few to oversimplify. Here’s how you can find that sweet spot:

    The Elbow Method
    One popular technique is the Elbow Method. It involves running K-Means several times, each time with a different number of clusters, and then plotting the Within-Cluster Sum of Squares (WCSS) against the number of clusters. WCSS measures how tight your clusters are, with lower values indicating that points are closer to their respective centroids.

    The formula for calculating WCSS is :

    WCSS = Σi Σx∈Ci ∥x − μi∥², summed over all k clusters

    Within-Cluster Sum of Squares Formula — Image by Author

    Here’s a breakdown of this formula:

    • k represents the total number of clusters.
    • Ci​ denotes the set of all points belonging to cluster i.
    • x is a point within cluster Ci​.
    • μi​ is the centroid of cluster i.
    • ∥x − μi∥² calculates the squared Euclidean distance between a point x and the centroid μi, which quantifies how far the point is from the centroid.

    The WCSS is essentially a sum of these squared distances for all points in each cluster, aggregated across all k clusters. A lower WCSS value indicates that points are on average closer to their cluster centroids, suggesting tighter and possibly more meaningful clusters.

    The sum of Squared Distances by K Plot — Image by Author

    As you increase k, WCSS decreases because the points are closer to centroids. However, there’s a point where adding more clusters doesn’t give you much bang for your buck. This point, where the rate of decrease sharply changes, looks like an elbow on the graph. In the graph above, the elbow point is 3, as the rate of decrease slows down significantly afterward. That’s your Goldilocks k:

    Elbow point ≈ Optimal k
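
    In code, the elbow search is just a loop over k; here is a minimal sketch using scikit-learn, which exposes the WCSS of a fitted model as inertia_:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Run K-Means for a range of k values and record the WCSS
    wcss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)

    for k, w in zip(range(1, 11), wcss):
        print(f"k={k}: WCSS={w:.1f}")  # look for the 'elbow' where the drop levels off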

    The Silhouette Method
    Another approach is the Silhouette Method, which evaluates how similar an object is to its cluster compared to other clusters. The silhouette score ranges from -1 to 1, where a high value indicates that the object is well-matched to its cluster and poorly matched to neighboring clusters.

    The formula for calculating the Silhouette Score for a single data point is:

    s(i) = (b(i) − a(i)) / max{a(i), b(i)}

    Silhouette Score Formula — Image by Author
    • s(i) is the silhouette score for a single data point i.
    • a(i) is the average distance of data point i to the other points in the same cluster, measuring how well i fits into its cluster.
    • b(i) is the smallest average distance of i to points in a different cluster, minimized over all clusters. This represents the distance to the nearest cluster that i is not a part of, indicating how poorly i matches its neighboring clusters.
    • The score s(i) ranges from −1 to 1, where a high value suggests that point i is well matched to its cluster and poorly matched to neighboring clusters.

    Then, we take the average silhouette score for the dataset, which is calculated by averaging the silhouette scores of all points. In mathematical terms:

    s̄ = (1/N) Σi s(i), averaging over all N points

    Average Silhouette Score Formula — Image by Author

    where N is the total number of points in the dataset.

    Silhouette Score by K — Image by Author

    For each k, calculate the average silhouette score of all points. The k that gives you the highest average silhouette score is considered the optimal number of clusters. It’s like finding the party where everyone feels they fit in best. In the graph above, the highest score is given by 3 clusters.

    Max average silhouette score → Optimal k
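
    A similar loop works for the silhouette search; a minimal scikit-learn sketch using silhouette_score:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # The silhouette score needs at least 2 clusters, so start the search at k=2
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(f"k={k}: average silhouette = {silhouette_score(X, labels):.3f}")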

    K-Means++
    K-Means++ is an algorithm for choosing the initial values (or “seeds”) for the K-Means clustering algorithm. The standard K-Means algorithm starts with a random selection of centroids, which can sometimes result in poor convergence speed and suboptimal clustering. K-Means++ addresses this by spreading out the initial centroids before proceeding with the standard K-Means optimization iterations.

    It begins by randomly selecting the first centroid from the dataset. Then, for each subsequent centroid, K-Means++ chooses data points as new centroids with a probability proportional to their squared distance from the nearest existing centroid. This step increases the chances of selecting centroids that are far away from each other, reducing the likelihood of poor clustering. This process is repeated until all k centroids are chosen.

    Lastly, once the initial centroids are selected using the K-Means++ approach, continue with the standard K-Means algorithm to adjust the centroids and assign points to clusters.
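
    To make the seeding idea concrete, here is a minimal NumPy sketch of the D²-weighted selection (the function and variable names are my own, not from a library):

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        """Sketch of k-means++ seeding: each new centroid is drawn with
        probability proportional to its squared distance (D²) to the
        nearest centroid chosen so far."""
        rng = np.random.default_rng(seed)
        centroids = [X[rng.integers(len(X))]]    # first centroid: uniform pick
        for _ in range(k - 1):
            d2 = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(axis=2).min(axis=1)
            probs = d2 / d2.sum()                # D² weighting
            centroids.append(X[rng.choice(len(X), p=probs)])
        return np.array(centroids)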

    In this post, I won’t delve deep into K-Means++ math, but expect it in a new post. In the meantime, you could read the following article for a better understanding:
    k-means++: The Advantages of Careful Seeding

    3: K-Means in Practice

    3.1 K-Means From Scratch in Python

    As I like to do in this “Machine Learning From Scratch” series, let’s now recreate a simpler version of the algorithm from scratch. Let’s first look at the whole code:

    import numpy as np

    class KMeans:
        def __init__(self, K, max_iters=100, tol=1e-4):
            """
            Constructor for KMeans class

            Parameters
            ----------
            K : int
                Number of clusters
            max_iters : int, optional
                Maximum number of iterations, by default 100
            tol : float, optional
                Tolerance to declare convergence, by default 1e-4
            """
            self.K = K
            self.max_iters = max_iters
            self.tol = tol
            self.centroids = None
            self.labels = None
            self.inertia_ = None

        def initialize_centroids(self, X):
            """
            Initialize centroids

            Parameters
            ----------
            X : array-like
                Input data

            Returns
            -------
            array-like
                Initial centroids
            """
            # Simple random initialization, consider k-means++ for improvement
            indices = np.random.permutation(X.shape[0])
            centroids = X[indices[:self.K]]
            return centroids

        def compute_centroids(self, X, labels):
            """
            Compute centroids

            Parameters
            ----------
            X : array-like
                Input data
            labels : array-like
                Cluster labels

            Returns
            -------
            array-like
                Updated centroids
            """
            centroids = np.zeros((self.K, X.shape[1]))
            for k in range(self.K):
                if np.any(labels == k):
                    centroids[k] = np.mean(X[labels == k], axis=0)
                else:
                    centroids[k] = X[np.random.choice(X.shape[0])]
            return centroids

        def compute_distance(self, X, centroids):
            """
            Compute distances between data points and centroids

            Parameters
            ----------
            X : array-like
                Input data
            centroids : array-like
                Centroids

            Returns
            -------
            array-like
                Distances
            """
            distances = np.zeros((X.shape[0], self.K))
            for k in range(self.K):
                distances[:, k] = np.linalg.norm(X - centroids[k], axis=1) ** 2
            return distances

        def find_closest_cluster(self, distances):
            """
            Find the closest cluster for each data point

            Parameters
            ----------
            distances : array-like
                Distances

            Returns
            -------
            array-like
                Cluster labels
            """
            return np.argmin(distances, axis=1)

        def fit(self, X):
            """
            Fit the model

            Parameters
            ----------
            X : array-like
                Input data
            """
            self.centroids = self.initialize_centroids(X)
            for i in range(self.max_iters):
                distances = self.compute_distance(X, self.centroids)
                self.labels = self.find_closest_cluster(distances)
                new_centroids = self.compute_centroids(X, self.labels)
                # Compute inertia (sum of squared distances)
                inertia = np.sum([distances[i, label] for i, label in enumerate(self.labels)])
                # Check for convergence: if the centroids don't change much (within tolerance)
                if np.allclose(self.centroids, new_centroids, atol=self.tol) or inertia <= self.tol:
                    break
                self.centroids = new_centroids
            self.inertia_ = inertia

        def predict(self, X):
            """
            Predict the closest cluster for each data point

            Parameters
            ----------
            X : array-like
                Input data

            Returns
            -------
            array-like
                Cluster labels
            """
            distances = self.compute_distance(X, self.centroids)
            return self.find_closest_cluster(distances)

    Let’s break down each part of the class to explain its functionality:

    1. __init__ Method:
    def __init__(self, K, max_iters=100, tol=1e-4):
        self.K = K
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None
        self.labels = None
        self.inertia_ = None

    This method initializes a new instance of the KMeans class.

    Here, the parameters are:

    • K: The number of clusters to form.
    • max_iters: The maximum number of iterations to run the algorithm for.
    • tol: The convergence tolerance. If the change in the sum of squared distances of samples to their closest cluster center (inertia) is less than or equal to this value, the algorithm will stop.

    We also initialize a few attributes to None:

    • self.centroids: Stores the centroids of the clusters.
    • self.labels: Stores the labels of each data point indicating which cluster it belongs to.
    • self.inertia_: Stores the sum of squared distances of samples to their closest cluster center after fitting the model.

    2. initialize_centroids and compute_centroids Methods:

    def initialize_centroids(self, X):
        # Simple random initialization, consider k-means++ for improvement
        indices = np.random.permutation(X.shape[0])
        centroids = X[indices[:self.K]]
        return centroids

    def compute_centroids(self, X, labels):
        centroids = np.zeros((self.K, X.shape[1]))
        for k in range(self.K):
            if np.any(labels == k):
                centroids[k] = np.mean(X[labels == k], axis=0)
            else:
                centroids[k] = X[np.random.choice(X.shape[0])]
        return centroids

    First, we initialize the centroids randomly from the data points by permuting the indices of the input data and selecting the first K as the initial centroids. As mentioned before, random selection is just one possible centroid initialization; you could go ahead and experiment with other techniques to see how they affect the performance of the model.

    Then, we compute the new centroids as the mean of the data points assigned to each cluster. Here, if a cluster ends up with no points, a random data point from X is chosen as the new centroid for that cluster.

    3. compute_distance and find_closest_cluster Methods:

    def compute_distance(self, X, centroids):
        distances = np.zeros((X.shape[0], self.K))
        for k in range(self.K):
            distances[:, k] = np.linalg.norm(X - centroids[k], axis=1) ** 2
        return distances

    def find_closest_cluster(self, distances):
        return np.argmin(distances, axis=1)

    We define the method to calculate the squared Euclidean distance between each data point and each centroid, which returns an array where each element contains the squared distances from a data point to all centroids.

    Then, we assign each data point to the closest cluster based on the computed distances. The output of the operation consists of an array of labels indicating the closest cluster for each data point.

    4. fit and predict Method:

    def fit(self, X):
        self.centroids = self.initialize_centroids(X)
        for i in range(self.max_iters):
            distances = self.compute_distance(X, self.centroids)
            self.labels = self.find_closest_cluster(distances)
            new_centroids = self.compute_centroids(X, self.labels)
            # Compute inertia (sum of squared distances)
            inertia = np.sum([distances[i, label] for i, label in enumerate(self.labels)])
            # Check for convergence: if the centroids don't change much (within tolerance)
            if np.allclose(self.centroids, new_centroids, atol=self.tol) or inertia <= self.tol:
                break
            self.centroids = new_centroids
        self.inertia_ = inertia

    def predict(self, X):
        distances = self.compute_distance(X, self.centroids)
        return self.find_closest_cluster(distances)

    Now we build the core method of this class: the fit method. It first initializes the centroids, then iterates up to max_iters times, where each iteration:

    • Computes distances of all points to the centroids.
    • Assigns points to the closest cluster.
    • Computes new centroids.
    • Calculates inertia (sum of squared distances to the closest centroid).
    • Checks for convergence (whether the centroids changed by less than tol, or the inertia is less than or equal to tol), and stops if so.

    Lastly, we predict the closest cluster for each point in a new dataset, using the predict method.
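
    To see the class in action, here is a short usage sketch (assuming the KMeans class above is in scope; the synthetic data mirrors the scikit-learn example in the next section):

    from sklearn.datasets import make_blobs

    # Synthetic data: 300 points around 3 centers
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    km = KMeans(K=3, max_iters=100, tol=1e-4)  # the from-scratch class above
    km.fit(X)
    labels = km.predict(X)

    print("Centroids:\n", km.centroids)
    print("Inertia:", km.inertia_)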

    You can find the whole code and a practical implementation in this Jupyter Notebook:

    models-from-scratch-python/K-Means/demo.ipynb at main · cristianleoo/models-from-scratch-python

    3.2 Implementing K-Means with Scikit-Learn

    Ok, now you have a solid understanding of what the algorithm does, and you can recreate it from scratch. Pretty cool, right? Well, you won’t be using that code anytime soon, as our friend scikit-learn already provides a much more efficient implementation of K-Means, which only takes a few lines of code. It also lets us switch to variants of K-Means simply by specifying a parameter. Let’s look at one implementation of it.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Generate a synthetic dataset
    X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    # Initialize and fit KMeans
    kmeans = KMeans(n_clusters=3, max_iter=100, tol=1e-4)
    kmeans.fit(X)

    # Predict the cluster labels
    labels = kmeans.predict(X)

    # Print the cluster centers
    print("Cluster centers:n", kmeans.cluster_centers_)

    We first import KMeans from scikit-learn, along with make_blobs, a function for generating synthetic clustering datasets. We generate a synthetic dataset with 300 samples, 3 centers (clusters), and a standard deviation of 0.60 for each cluster. The random_state parameter ensures that the results are reproducible.

    We initialize the KMeans algorithm with 3 clusters (n_clusters=3), a maximum of 100 iterations (max_iter=100), and a tolerance of 1e-4 (tol=1e-4). The tolerance is the threshold to declare convergence — if the cluster centers move less than this amount between consecutive iterations, the algorithm stops.

    We fit the KMeans algorithm to our data using the fit method. This is where the actual clustering happens. We predict the cluster labels for our data using the predict method. This assigns each data point to the nearest cluster.

    sklearn.cluster.KMeans

    4: Advantages and Challenges

    4.1 Benefits of Using K-Means

    At this point, you probably have an idea of why K-Means is so popular, but let’s make sure you know when to consider K-Means for your data science journey:

    Simplicity and Ease of Implementation
    First off, K-Means is straightforward. It doesn’t ask for much, just the data and the number of clusters k you’re aiming for. This simplicity makes it super approachable, even for beginners in the data science field.

    Efficiency at Scale
    K-Means is pretty efficient, especially with large datasets. It’s got a way of cutting through data to find patterns without needing a supercomputer to do its thing. This efficiency makes it a practical choice for real-world applications where time and resources are of the essence.

    Good for Pre-Processing and Simplification
    The algorithm can also be a great pre-processing step. By clustering your data first, you can simplify complex problems, reduce dimensionality, or even improve the performance of other algorithms that you might run afterward.

    Discovering Patterns and Insights
    It can help uncover hidden structures that might not be immediately obvious, providing valuable insights into the nature of the data. This can inform decision-making and strategy development, and even lead to discoveries within the data.

    Easy Interpretation of Results
    The results from K-Means are pretty straightforward to interpret. Clusters are defined clearly, and each data point is assigned to a specific group. This clarity can be particularly beneficial when explaining outcomes to stakeholders who might not have a deep technical background.

    4.2 Overcoming K-Means Limitations

    While K-Means clustering is a powerful tool for data analysis, it’s not without its quirks and challenges. Here’s a look at some common issues and strategies for addressing them:

    Sensitivity to Initial Centroids
    One of the main challenges with K-Means is that it’s quite sensitive to the initial placement of centroids. Random initialization can lead to suboptimal clustering or different results each time you run the algorithm.

    Think about using the K-Means++ technique for initializing centroids. It’s a smarter way to start the clustering process, reducing the chances of poor outcomes by spreading out the initial centroids.

    Fixed Number of Clusters
    Deciding on the number of clusters k in advance can be tricky. Choose wrong, and you might end up forcing your data into too many or too few clusters.

    As we said before, experiment with methods like the Elbow Method and the Silhouette Score to find a suitable k. These approaches can provide more insight into the optimal number of clusters for your data.

    Difficulty with Non-Spherical Clusters

    A spherical cluster is one where the data points are roughly equidistant from the cluster’s center, resulting in a shape that is, in a multidimensional space, spherical or circular.

    K-Means assumes that clusters are spherical and of similar size, which isn’t always the case in real-world data. This assumption can lead to less accurate clustering when the true clusters have irregular shapes.

    In this case, consider using clustering algorithms designed for complex cluster shapes, such as Spectral Clustering or DBSCAN.
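
    As a quick sketch of the difference, consider the classic two-moons dataset, where K-Means struggles but DBSCAN succeeds (the eps value is illustrative and dataset-dependent):

    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_moons

    # Two interleaved half-moons: non-spherical clusters
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

    # K-Means cuts the moons with a straight boundary; DBSCAN follows their shape
    print("K-Means clusters:", set(km_labels))
    print("DBSCAN clusters:", set(db_labels))  # -1, if present, marks noise points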

    Sensitivity to Scale and Outliers
    The performance of K-Means can be significantly affected by the scale of the data and the presence of outliers. Large-scale differences between features can skew the clustering, and outliers can pull centroids away from the true cluster center.

    As usual, make sure to normalize the features to ensure they’re on a similar scale.
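
    A minimal sketch of this preprocessing step, chaining StandardScaler and KMeans in a scikit-learn pipeline (the synthetic features are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # Feature 1 on a unit scale, feature 2 roughly 1000x larger
    X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

    # Standardizing first keeps one feature from dominating the distance metric
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
    labels = model.fit_predict(X)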

    5: Beyond Basic K-Means

    5.1 Variants of K-Means

    The classic K-Means algorithm has inspired a range of variants designed to tackle its limitations and adapt to specific challenges or data types. Let’s explore some of these adaptations:

    K-Means++
    As introduced earlier, K-Means++ is all about giving K-Means a better starting point. The main idea here is to choose initial centroids more strategically to improve the chances of converging to an optimal solution. By spreading out the initial centroids, K-Means++ reduces the risk of poor clustering outcomes and often leads to faster convergence.

    k-means++: The Advantages of Careful Seeding

    K-Medoids or PAM (Partitioning Around Medoids)
    While K-Means focuses on minimizing the variance within clusters, K-Medoids aims to minimize the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster. This method is more robust to noise and outliers because medoids are actual data points, unlike centroids, which are the mean values of the points in a cluster.

    Link to the research paper.

    Fuzzy C-Means (FCM)
    Fuzzy C-Means introduces the concept of partial membership, where each point can belong to multiple clusters with varying degrees of membership. This approach is useful when the boundaries between clusters are not clear-cut, allowing for a softer, more nuanced clustering solution.

    Fuzzy Clustering

    Bisecting K-Means
    This variant adopts a “divide and conquer” strategy, starting with all points in a single cluster and iteratively splitting clusters until the desired number is reached. At each step, the algorithm chooses the best cluster to split based on a criterion, such as the one that will result in the largest decrease in the total sum of squared errors.
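
    For reference, recent versions of scikit-learn (1.1 and later) ship a BisectingKMeans estimator; a minimal sketch:

    # Requires scikit-learn >= 1.1
    from sklearn.cluster import BisectingKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # "biggest_inertia" (the default) splits the cluster with the highest SSE;
    # "largest_cluster" always splits the cluster with the most points
    bkm = BisectingKMeans(n_clusters=4, bisecting_strategy="biggest_inertia", random_state=0)
    labels = bkm.fit_predict(X)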

    Performance Analysis of K-Means and Bisecting K-Means Algorithms in Weblog Data

    6: Conclusion

    Despite its limitations, the array of strategies and variants available ensures that K-Means remains a versatile tool in your data science toolkit. Whether you’re tackling customer segmentation, image classification, or any number of clustering tasks, K-Means offers a starting point that is both accessible and powerful.

    Bibliography

    1. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264–323.
    2. Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027–1035). Society for Industrial and Applied Mathematics.
    3. MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, №14, pp. 281–297).

    You made it to the end. Congrats! I hope you enjoyed this article, if so consider leaving a like and following me, as I will regularly post similar articles. My goal is to recreate all the most popular algorithms from scratch and make machine learning accessible to everyone.


    Originally appeared here:
    The Math and Code Behind K-Means Clustering

  • Finite Automata Simulation for Leveraging AI-Assisted Systems

    Sofya Lipnitskaya

    Design, modelling, and simulation of real-world AI systems to improve performance on object detection tasks using finite state machines.

    Image by author

    Background

    Problem understanding

    Recently, I came across one of the great cases of using Raspberry Pi and Python for creating a computer vision-based system for object detection. In short, one guy created a device that keeps the neighbor’s chickens away from his property. After following the Reddit thread, it became apparent that the problem is quite pervasive and not limited to certain bird species, if at all. The top rated comments include:

    “I need one of these for the ducks my neighbor feeds and then they sh*t all over my lawn.” — Light_Beard

    “I need this for cats in my yard at night.” — Buddha_

    “Does it work on kids on Halloween? Asking for a friend.” — HardPawns

    Well, one might argue that the problem is not that important and quite legitimately suggest simply asking the neighbors to sort out those hens. However, this is clearly not an engineer’s way to tackle this. Suppose you had already built an AI-assisted bird detection system equipped with a water blaster to chase unwelcome hens away from the yard. The caveat is that its operating version does not perform as well as expected, resulting in still noticeable water usage for sprinkling and lawn cleanup. Hence, chickens run and water bills remain lofty. How to address the challenge?

    Model-based engineering for complex systems

    Here, we are about to tackle this real-world problem by designing a computational model to simulate the complete chicken-on-the-lawn cycle and later optimize its parameters so that we can reduce water consumption. To do so, we will employ different techniques, including those from automata theory and randomized algorithms.

    In particular, this article focuses on the modelling and simulation aspects, so that you learn how to describe the behavior of a real system and design a finite state machine reflecting its dynamics. Then, you will explore the path of implementing such systems in Python and discover ways to leverage computer vision-based models by optimizing their performance on object detection. This should be fun!

    Disclaimer: This work is a part of the “Bird by Bird using Deep Learning” series and is devoted to modelling and simulation of real-life systems for computer vision applications using finite automata. All actors, states, events and outputs are the products of the FSM design process for educational purposes only. Any resemblance to actual persons, birds, or real events is purely coincidental.

    Introducing the related work

    Finite state machines for modelling and simulation

    Finite-state machine (FSM) or finite automaton is a mathematical model that can be used to represent and analyse dynamic behavior of a system by describing it through discrete states, transitions between those states, and a set of rules triggering these transitions.

    The history of FSM traces back to the mid-20th century, marked by pivotal milestones in automata theory and the theory of computation. Early contributions by luminaries such as Alan Turing and John von Neumann laid the foundation, but it was in the 1950s and 1960s that FSM took a significant leap forward. Notably, Edward F. Moore and George H. Mealy independently introduced two primary types of FSMs, Moore and Mealy machines, respectively.

    These FSM types differ in their outputs: Moore machines produce outputs based solely on the current state, while Mealy machines associate outputs with both the current state and the input, offering enhanced adaptability. Originally used in digital circuits, FSMs, in particular Mealy machines, with their responsiveness to external input signals, have become widespread in the design of complex systems that accompany our daily lives.

    FSMs are found in both hardware and software applications. Look around — almost all electronic and computing devices contain some sort of finite automaton — from vending machines to CPUs, from basic electronic devices to programmable logic controllers for smart home automation. They are also used in software and game development and, of course, can be used to create adaptive AI-assisted systems for real-time object detection.

    Discrete math strikes back

    At its core, a deterministic finite automaton includes states, inputs, and a transition function. States represent distinct conditions of the system, while inputs trigger switches between states. The transition function defines the rules of how the machine transitions between states. Mathematically, such a state machine can be represented using a 5-tuple, denoted as M=(Q, Σ, δ, q₀, F), where:

    • Q is a set of states representing distinct configurations of the system.
    • Σ is a set of inputs consisting of events that trigger state changes.
    • Transition function δ governs how the system switches between states given an input (δ:Q×Σ→Q).
    • Initial state q₀ ∈ Q is the state the system starts in.
    • F is a subset of Q (F⊆Q) consisting of final states.

    This way, for any given state and specific input symbol, there is a unique next state determined by the transition function δ, which is typically represented by a state transition table or diagram, specifying transitions given a combination of the current state and inputs.
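
    To ground the notation, here is a minimal Python sketch of the 5-tuple (the class and names are illustrative, not from a library):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DFA:
        """M = (Q, Σ, δ, q0, F) written out as plain Python data."""
        Q: frozenset        # set of states
        sigma: frozenset    # input alphabet Σ
        delta: dict         # transition function δ as {(state, input): next_state}
        q0: str             # initial state
        F: frozenset        # set of final states

        def accepts(self, inputs) -> bool:
            """Run δ over an input sequence and report whether we end in F."""
            state = self.q0
            for symbol in inputs:
                state = self.delta[(state, symbol)]   # δ: Q × Σ → Q
            return state in self.F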

    FSM design process

    The design process of an FSM involves identifying the states (and inputs if applicable), defining the transition function, as well as specifying the initial and final states. The following methodology can be employed for translating complex systems into comprehensible models, aiding in the further analysis, design, and implementation phases. A 5-step FSM design process includes:

    1. Understand the problem and analyse the structure of the system.
    2. Define the key components of the conceptual model to be designed.
    3. Create a state diagram or define a state-transition table.
    4. Implement the machine’s states, inputs, outputs, and transition logic.
    5. Test and experimentally validate the FSM.

    This iterative process lets you design a concise representation of the real system’s behavior, with room for approximation and refinement along the way. For instance, after implementing an FSM (Step 4), you may want to further verify and update the specifications (Steps 2–3); the same applies to moving from the experimentation (Step 5) back to the problem definition (Step 1), in order to create a working model that is sufficiently detailed and useful to solve a particular problem.

    State machine example

    Let us take a simple case of the chicken-on-the-lawn scenario, where a bird can either be present on the lawn or not, as a function of external stimuli initiated by the engineer, who, in turn, can either rest or chase away unwelcome guests trespassing on his property. Thus, the controlling object (the engineer) is intended to compensate for the uncertainty of the independent actors’ (the chickens’) parameters. In the given example, the set of final states F includes only one state in which the system terminates, say at the end of the day when there are no chickens around. In this way:

    • Q = {q₀, q₁, ⊙}: Set of states — No Bird, Bird Present, and End of Day.
    • Σ = {α₀, α₁, α₂}: Set of input events — Engineer Rest/Chase, and Sunset.
    • F = {⊙} contains the final state representing the end of the day.

    Figure 1 provides a state transition diagram consisting of nodes (states) connected by edges (next-state transitions), with the labels above the arcs specifying input events triggering the transitions.

    Figure 1. Graphical representation of a simple state machine (Image by author)

    This representation captures the binary nature of the problem, in which a chicken can either be present or not on the lawn. The system responds to the events triggered by the engineer or the sunset. In the diagram, the initial and final states are indicated by circles. The transition function δ for this FSM can also be written in a tabular form, showing state transitions and control actions for the system, as shown in Table 1.


    Table 1. State-transition table for the chicken-on-the-lawn FSM example

    +========================+========================+========================+
    | Current State | Input Event | Next State |
    +========================+========================+========================+
    | q₀ ≡ ”No Bird” | α₀ ≡ ”Engineer Rest” | q₁ |
    +------------------------+------------------------+------------------------+
    | q₁ ≡ ”Bird Present” | α₁ ≡ ”Engineer Chase” | q₀ |
    +------------------------+------------------------+------------------------+
    | q₀ | α₂ ≡ ”Sunset” | ⊙ |
    +------------------------+------------------------+------------------------+
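
    As a quick sanity check of the design, Table 1 can be encoded directly as a transition dictionary (a minimal sketch; the state and event names mirror the table):

    # Table 1 as a transition dict: (current state, input event) -> next state
    delta = {
        ("no_bird", "engineer_rest"): "bird_present",    # q0 --α0--> q1
        ("bird_present", "engineer_chase"): "no_bird",   # q1 --α1--> q0
        ("no_bird", "sunset"): "end_of_day",             # q0 --α2--> ⊙
    }

    state = "no_bird"
    for event in ["engineer_rest", "engineer_chase", "sunset"]:
        state = delta[(state, event)]
    print(state)  # end_of_day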

    Thus, by completing five straightforward steps we designed a simple state machine. Now that everything has been explained, let’s finally create an FSM-based model representing our challenge with birds on the lawn.

    On dealing with the birds-on-the-lawn challenge

    What they do on the lawn

    As you have learnt in the previous section, finite automata can be used to model almost any process. Imagine you have some chickens hopping around in your backyard this afternoon. What are they doing? Just observe. They’re always moving, singing, or interacting. They’re often flying, probing, or eating. On some occasions, they’re displaying, or doing something that draws our attention, like those neighbor’s chickens messing up with the grass, but let’s set those particulars aside for now. Well, eventually, all the birds are s***ing (nothing personal, feathery ones). For the FSM design, we won’t get into the finer details, instead distilling essential components with logic for our simulation. Let the FSM take water adventures to the next level of play!

    System description

    Concerning the chickens, here we are going to describe the system to reflect our down-to-earth scenario, in order to optimize the parameters of the object detection system and reduce water costs for lawn cleaning. For reference, take another look at the previous FSM example. This simple machine differs from the real-life system in a few aspects. First, we want to update the controlling object to include an AI-based device aimed at detecting and repelling birds, this time by means of a high-pressure water sprinkler gun (so the engineer can “self-loop” into the rest state).

    Second, we will need to update and/or extend the set of possible states, events, and transitions to reflect the complexity of the updated system’s setup. For the latter, why don’t we consider additional bird categories that can be recognized by a computer vision model (e.g., turkeys), which thus become potential independent actors for our FSM. Moreover, assuming that bird size varies across species, an irrigation control system would need a more intense water flow and/or pressure to successfully chase a bulky turkey off the lawn than it would for a chicken. Hereafter, for brevity’s sake, we will refer to the chicken-and-turkey-on-the-lawn problem as the CaT problem.

    Conceptual modelling

    In order to model scenarios where the object detection system has to monitor, classify, and interact with objects trespassing on the property, we will define states, events, and transitions that represent the different aspects of this situation. Our goal is to capture the various states the object detection system and chickens can be in, as well as the events that trigger state transitions.

    For the logic design scenarios, consider that at any given moment a bird can enter the yard, mess up the lawn (or not), and leave the property either on its own or after being successfully detected and chased away by the AI-based lawn security system. Now, let’s define the main components of the FSM simulation.

    States represent possible conditions reflecting the CaT scenarios:

    • For hopping targets: Spawn and Intrusion statuses, Attacking and its result, Leaving the lawn.
    • For the AI system: Detection State, Sprinkling State.
    • Initial state “Start” relates to the entry point of the simulation.
    • Final state “End” denotes the termination point of the simulation.

    Transitions govern how the system switches between states based on inputs. For instance, the AI model may overlook a bird and skip the sprinkling process, resulting in a number of consequences for the lawn. Here are some other scenarios and conditions we can anticipate:

    • From “Intrusion” to “Target Detected” on the “Detection” event.
    • From “Target Detected” to “Bird Leaves” through the sequence of “Sprinkling” and “Hit” events, after an intruded bird has been detected and successfully hit by the water sprinkler.
    • From “Bird Present” to “Attack”, in case the system failed at the target detection and prediction steps while the bird was actually on the lawn. The same transition takes place if the bird-intruder was detected but the system missed the shot.

    This way, the FSM will dynamically progress from one state to another as the AI system interacts with the chickens hopping around. To make the task easier and less error-prone, we create a combined state transition and conditions table:

    Table 2. FSM inputs describing the events triggering state changes

    +====+==================================+==================+================+
    | ID | Input Description | Defence System | Enemy Tactics |
    | | | Operation Mode | and Waypoints |
    +====+==================================+==================+================+
    | X₁ | Bird is present on the lawn | | Hopping |
    +----+----------------------------------+ Object detection +----------------+
    | X₂ | Bird intrudes the lawn | | Start hopping |
    +----+----------------------------------+------------------+----------------+
    | X₃ | AI-powered detector spots a bird | Start sprinkling | Hopping (still)|
    +----+----------------------------------+------------------+----------------+
    | X₄ | Bird is hit successfully¹ | | |
    +----+----------------------------------+ - | Intimidation |
    | X₅ | Target is susceptible² | | |
    +----+----------------------------------+------------------+----------------+
    | X₆ | Bird spoiled the lawn | | Hopping merrily|
    +----+----------------------------------+ Object detection +----------------+
    | X₇ | Bird leaves the lawn | | Retreat |
    +----+----------------------------------+------------------+----------------+
    | X₈ | Simulation period ends (sunset) | - | - |
    +----+----------------------------------+------------------+----------------+
    ID - input identifier
    ¹ - aiming and sprinkling modules operated correctly
    ² - water flow rate is strong enough to chase the bird away

    State transition tables

    Now, after identifying states and events, we’ll write a combined state transition table with Boolean expressions for the next states. In Table 3, we see how the inputs described in Table 2 steer the transitions between the simulation states.

    Table 3. FSM state transition table with next-stage transition logic

    +========================+========================+========================+
    | Current State | Transition Formula | Next State |
    +========================+========================+========================+
    | Start | TRUE | Spawn |
    +------------------------+------------------------+------------------------+
    | | X₁ ∨ X₂ | Intrusion |
    | |------------------------+------------------------+
    | Spawn | ¬X₁ ∧ ¬X₂ | Empty lawn |
    + |------------------------+------------------------+
    | | X₈ | End |
    +------------------------+------------------------+------------------------+
    | | X₃ | Target detected |
    | Intrusion |------------------------+------------------------+
    | | ¬X₃ | Not detected |
    +------------------------+------------------------+------------------------+
    | | X₃ | Target detected |
    | Empty lawn |------------------------+------------------------+
    | | ¬X₃ | Not detected |
    +------------------------+------------------------+------------------------+
    | Target detected | TRUE | Sprinkling |
    +------------------------+------------------------+------------------------+
    | | X₁ | Attacking |
    | Not detected |------------------------+------------------------+
    | | ¬X₁ | Not attacked |
    +------------------------+------------------------+------------------------+
    | | ¬X₁ ∨ ¬X₄ ∨ ¬X₅ | Miss |
    | Sprinkling |------------------------+------------------------+
    | | X₁ ∧ X₄ ∧ X₅ | Hit |
    +------------------------+------------------------+------------------------+
    | | ¬X₁ | Spawn |
    | Miss |------------------------+------------------------+
    | | X₁ | Attacking |
    +------------------------+------------------------+------------------------+
    | Hit | TRUE | Bird leaves |
    +------------------------+------------------------+------------------------+
    | | ¬X₆ | Not attacked |
    | Attacking |------------------------+------------------------+
    | | X₆ | Bird attacked |
    +------------------------+------------------------+------------------------+
    | | ¬X₇ | Spawn |
    | Not attacked |------------------------+------------------------+
    | | X₇ | Bird leaves |
    +------------------------+------------------------+------------------------+
    | | ¬X₇ | Spawn |
    | Bird attacked |------------------------+------------------------+
    | | X₇ | Bird leaves |
    +------------------------+------------------------+------------------------+
    | Bird leaves | TRUE | Spawn |
    +------------------------+------------------------+------------------------+

    In most cases, a single input determines the next state. However, we have to consider several conditions simultaneously when switching from “Spawn” or “Sprinkling”. You may also notice that for some states, such as “Start” or “Hit”, transitions don’t depend on external information. These states are either special (as “Start”) or trigger auxiliary actions. The latter don’t have a direct influence on the story we simulate (in that regard, they could be merged with the subsequent states) but are important for gathering simulation statistics.
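
    One way to carry Table 3 into code is to attach a Boolean predicate over the inputs X₁…X₈ to each outgoing edge; a hedged sketch (the state and input names are mine, ahead of the full implementation promised for the next article):

    # Inputs X1..X8 as a dict of booleans; each transition carries a predicate.
    TRANSITIONS = {
        "sprinkling": [
            # X1 ∧ X4 ∧ X5: bird present, hit, and susceptible -> Hit
            (lambda x: x["X1"] and x["X4"] and x["X5"], "hit"),
            # ¬X1 ∨ ¬X4 ∨ ¬X5 (De Morgan of the row above) -> Miss
            (lambda x: not (x["X1"] and x["X4"] and x["X5"]), "miss"),
        ],
        "hit": [(lambda x: True, "bird_leaves")],
    }

    def step(state, x):
        # Take the first outgoing edge whose predicate holds for the inputs x
        for predicate, nxt in TRANSITIONS[state]:
            if predicate(x):
                return nxt
        return state

    print(step("sprinkling", {"X1": True, "X4": True, "X5": False}))  # miss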

    Finally, let’s look at its visual representation. Figure 2 presents the state transition graph of the states the CaT system goes through during its lifetime. You can probably see the connection already. In the following article, you will learn how to implement this FSM in Python and how to use it to optimize the parameters of the AI-assisted bird detection system in order to reduce water costs.

    Figure 2. State transition graph for an FSM representing the AI-assisted lawn security system (Image by author)

    Conclusions

    In this article, we explored how to apply finite state machines in practice to build a model for addressing the CaT problem, allowing for high-level problem analysis and solution design.

    We have described complex yard processes by applying the FSM formalism to individual actors and the interactions between them, thereby creating a holistic view of the internal dynamics of the real-world situation, in which we had to deal with neighborhood birds trespassing on the property.

    This allowed us to create a simulation that reflects, among other things, the operation of the AI-assisted security system equipped with a water pressure controller for sprinkling and aimed at object detection and chasing away unwelcome guests spoiling the lawn.

    What’s next?

    In the upcoming series of articles, we will further investigate the topic of simulation of real-life scenarios using FSMs and its practical applications for addressing a non-analytic optimization problem of water cost reduction.

    Specifically, the next article will feature a Python tutorial from which you will learn how to implement an FSM-driven simulation from scratch as well as how to employ it as a part of a stochastic optimization pipeline. Based on the simulation created, we then examine how to leverage it for improving the resource efficiency of our lawn security system by applying Monte-Carlo and eXplainable AI (XAI) techniques to optimize performance of a computer vision-based subsystem for bird detection.

    Interested to keep it on? Stay updated on more materials at — https://github.com/slipnitskaya/caltech-birds-advanced-classification and https://medium.com/@slipnitskaya.

    References

    1. Moore, Edward F. “Gedanken-experiments on sequential machines.” Automata studies 34 (1956): 129–153.
    2. Mealy, George H. “A method for synthesizing sequential circuits.” The Bell System Technical Journal 34.5 (1955): 1045–1079.
    3. Sipser, M. “Introduction to the Theory of Computation.” 2nd ed., Thomson Course Technology (2006).


    Originally appeared here:
    Finite Automata Simulation for Leveraging AI-Assisted Systems

  • Setting A Dockerized Python Environment — The Hard Way

    Rami Krispin

    This post will review different methods to run a dockerized Python environment from the command line (CLI). Am I recommending you run your Python environment from the CLI? Absolutely not!

    There are better ways to set up a Python development environment, such as using VS Code with the Dev Containers extension. We will take the “hard way” and set up a Python environment from the CLI for learning purposes. We will review different approaches to launching a container with the run command and see how to customize an official image using a Dockerfile.

    A Pixel Python Snake (created by the author with Midjourney)

    Prerequisites

    To follow along with this tutorial, you will need the following:

    • Docker Desktop (or equivalent) if you are using a macOS or Windows machine, or Docker installed if you are using Linux
    • A Docker Hub account to pull the image from.

    Throughout this tutorial, we will use the official Python image — python:3.10.

    Getting Started

    Let’s get started by pulling the official Python 3.10 image from Docker Hub. First, let’s log in to Docker Hub using the docker login command:

    docker login docker.io
    Authenticating with existing credentials...
    Login Succeeded

    Next, we will use the docker pull command from the terminal:

    docker pull python:3.10

    If this is the first time you pull the image, you should expect the following output:

    3.10: Pulling from library/python
    66932e2b787d: Pull complete
    4afa7e263db1: Pull complete
    c812910e5e62: Pull complete
    f4e4299bb649: Pull complete
    5213cc2f9120: Pull complete
    4a3b5b2f0e66: Pull complete
    c214ceb1cabf: Pull complete
    f5336038b15c: Pull complete
    Digest: sha256:f94601bb6374b0b63835a70c9e5c3ba1b19bc009133900a9473229a406018e46
    Status: Downloaded newer image for python:3.10
    docker.io/library/python:3.10

    You can review the image details with the docker images command:

    docker images
    REPOSITORY   TAG    IMAGE ID       CREATED      SIZE
    python       3.10   f7537c504c9a   7 days ago   1.01GB

    Before running the container, let’s review the image metadata with the docker inspect command:

    docker inspect python:3.10

    This will return the below JSON output:

    [
      {
        "Id": "sha256:f7537c504c9a91a22c9a255ee02048e7079cacdee583290e8238c605d17f9600",
        "RepoTags": [
          "python:3.10"
        ],
        "RepoDigests": [
          "python@sha256:f94601bb6374b0b63835a70c9e5c3ba1b19bc009133900a9473229a406018e46"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2024-02-03T10:49:13Z",
        "Container": "",
        "ContainerConfig": {
          "Hostname": "",
          "Domainname": "",
          "User": "",
          "AttachStdin": false,
          "AttachStdout": false,
          "AttachStderr": false,
          "Tty": false,
          "OpenStdin": false,
          "StdinOnce": false,
          "Env": null,
          "Cmd": null,
          "Image": "",
          "Volumes": null,
          "WorkingDir": "",
          "Entrypoint": null,
          "OnBuild": null,
          "Labels": null
        },
        "DockerVersion": "",
        "Author": "",
        "Config": {
          "Hostname": "",
          "Domainname": "",
          "User": "",
          "AttachStdin": false,
          "AttachStdout": false,
          "AttachStderr": false,
          "Tty": false,
          "OpenStdin": false,
          "StdinOnce": false,
          "Env": [
            "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "LANG=C.UTF-8",
            "GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D",
            "PYTHON_VERSION=3.10.13",
            "PYTHON_PIP_VERSION=23.0.1",
            "PYTHON_SETUPTOOLS_VERSION=65.5.1",
            "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/dbf0c85f76fb6e1ab42aa672ffca6f0a675d9ee4/public/get-pip.py",
            "PYTHON_GET_PIP_SHA256=dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b47a5116007395bfacff2ab9"
          ],
          "Cmd": [
            "python3"
          ],
          "ArgsEscaped": true,
          "Image": "",
          "Volumes": null,
          "WorkingDir": "",
          "Entrypoint": null,
          "OnBuild": null,
          "Labels": null
        },
        "Architecture": "arm64",
        "Variant": "v8",
        "Os": "linux",
        "Size": 1005570383,
        "VirtualSize": 1005570383,
        "GraphDriver": {
          "Data": {
            "LowerDir": "/var/lib/docker/overlay2/d2fd76e7396796018a959209b51fe8311a188c8eae8e339e9e556de0889ca0bd/diff:/var/lib/docker/overlay2/bbedb25c5aa6ec3f2fc632e62a53989a329b907143fac165f899209293627a69/diff:/var/lib/docker/overlay2/ed6a4bf49214e6b496b7716443b8de380481cd9416bc4a378f29183c9129786f/diff:/var/lib/docker/overlay2/ac9543f44a835c203fb0b0b28958d94df72d206c9060c9d83307b39f50355102/diff:/var/lib/docker/overlay2/94a9f92c36ea6251feda52be8e76ec4da4a9c66b744a29472e1ccfdf34a6f69d/diff:/var/lib/docker/overlay2/6ee22c274256834a64008022856d365d91373bb490ae9f2f1723cb524b246a29/diff:/var/lib/docker/overlay2/2fa272376e0ce68f4f34f18e5ecb1ddd58a32fb20a82e5a417514047f8e684a3/diff",
            "MergedDir": "/var/lib/docker/overlay2/f2d64d1affbe99afb05251435f7705cb97e2efa4f8febb494b4cbaa21e7f742a/merged",
            "UpperDir": "/var/lib/docker/overlay2/f2d64d1affbe99afb05251435f7705cb97e2efa4f8febb494b4cbaa21e7f742a/diff",
            "WorkDir": "/var/lib/docker/overlay2/f2d64d1affbe99afb05251435f7705cb97e2efa4f8febb494b4cbaa21e7f742a/work"
          },
          "Name": "overlay2"
        },
        "RootFS": {
          "Type": "layers",
          "Layers": [
            "sha256:9f8c60461a42fd9f275c56f4ec8fea8a8ea2d938493e316e830994a3814cf0aa",
            "sha256:927a28cdbf6c1705342b2cba0069457313815058dcebe1996e46cade38a09370",
            "sha256:e85139d96aee18e99fc964d02546be48cc6a4d4dfd9f465a91f172b7c535e55f",
            "sha256:f3168ba6a8d2ec30e12002ad5b7b497cf7409f3e19cc8d8f447f6cf4231a2482",
            "sha256:acbc6c8127209b09fa336e354037fdc630d3594e15174f2bc1bdbf31d5591376",
            "sha256:06c4da96c7dd2fbbbb541e000bd0ea4cfbf7c80b24f098a9d67f677ef5e6c63e",
            "sha256:1cdf208dc10679cf5db6d4f0e17ff6d5bfe754b4195ddd3b153b6d1ff51ce909",
            "sha256:5d6f554f67c7da9d34763916bac632a450902c6e6fdbf9244f888f110fd37740"
          ]
        },
        "Metadata": {
          "LastTagTime": "0001-01-01T00:00:00Z"
        }
      }
    ]

    The inspect command provides a lot of useful information about the image, such as the layer information, the image size, the hardware architecture, etc. As we want to run the image, the most interesting detail is the CMD setting. The CMD command in the Dockerfile defines what command to execute when the container launches. We can parse the CMD information from the above output with the jq package:

    docker inspect python:3.10 | jq '.[] | .Config | .Cmd'
    [
    "python3"
    ]

    As you can see, the image is set to execute the python3 command during the container run time. Let’s now launch the container using the run command:

    docker run python:3.10

    And… nothing happens. The reason, in short, is that we did not give Docker access to a terminal: the container starts the python3 interpreter, but with no terminal attached, it exits immediately. We will use the interactive and TTY arguments to run the image in an interactive mode:

    docker run --interactive --tty python:3.10

    This will attach the terminal to the container and open Python inside the container:

    Running the Python image in interactive mode (screenshot by the author)

    While we were able to launch Python inside a container, it is not very useful:

    • We cannot create, edit, and run scripts inside the Python interpreter
    • By default, the Python image comes with a limited number of libraries. In this mode, you cannot add additional ones
    • Last but not least, the container is ephemeral. Once you stop it, all the work is lost

    For example, if we try to load pandas, we will get the following error:

    Trying to load the Pandas library (screenshot by the author)

    In the following sections, we will address those issues by creating a Dockerfile and customizing the base image functionality. This includes adding the following features:

    • Set a virtual environment and install packages with a requirements file. For simplicity, we will install only the wheel and pandas packages
    • Install a vim editor to edit files
    • Change the CMD command to open a shell terminal upon launch (as opposed to the Python interpreter). This will enable us to create new scripts, edit, and execute from the terminal

    Customize the Base Image

    To customize the Python environment and make the above changes, we will create a Dockerfile with the following functionality:

    • Import the Python image — python:3.10
    • Set a virtual environment
    • Install required libraries
    • Install vim editor
    • Expose a bash terminal

    Setting a Python Virtual Environment

    To set up a Python virtual environment, we will use the following two helper files:

    requirements.txt

    wheel==0.40.0
    pandas==2.0.3

    This file defines the list of Python libraries to install in the virtual environment. For example, in this case, we will install the pandas library, version 2.0.3. I also generally install the wheel library, which supports installing pre-built binary packages (for example, ones with C dependencies).

    The next helper file is the below bash script, which sets the virtual environment and installs the required libraries from the requirements.txt file.

    set_python_env.sh

    #!/usr/bin/env bash

    PYTHON_ENV=$1

    python3 -m venv /opt/$PYTHON_ENV \
        && export PATH=/opt/$PYTHON_ENV/bin:$PATH \
        && echo "source /opt/$PYTHON_ENV/bin/activate" >> ~/.bashrc

    source /opt/$PYTHON_ENV/bin/activate

    pip3 install -r ./requirements/requirements.txt

    Note: We use a variable (marked as $1) to define the environment name, which will be assigned to the PYTHON_ENV variable. Using variables during the build is a good practice, as it enables us to modify some of the image characteristics without modifying the code. We will assign the variable via the Dockerfile.

    Let’s explain the following chained code from the above bash script that sets up the virtual environment:

    python3 -m venv /opt/$PYTHON_ENV \
        && export PATH=/opt/$PYTHON_ENV/bin:$PATH \
        && echo "source /opt/$PYTHON_ENV/bin/activate" >> ~/.bashrc

    The above three lines of code include three chained expressions:

    • First, the python3 -m venv /opt/$PYTHON_ENV command sets up a virtual environment with the venv module
    • Second, add the virtual environment path to the PATH variable
    • Third, add to the .bashrc file the activate command of the environment. This will ensure that whenever we launch the terminal, it will activate this virtual environment by default (otherwise, you will have to do it manually upon the launch of the environment)

    Once the environment is set, we use the source command to activate the environment, and the pip3 command to install the libraries inside the environment.

    Creating a Dockerfile

    After we review the helper files, let’s see how they are incorporated inside the below Dockerfile.

    Dockerfile

    FROM python:3.10

    ARG PYTHON_ENV=my_env
    ENV PYTHON_ENV=$PYTHON_ENV

    RUN mkdir requirements

    COPY requirements.txt set_python_env.sh /requirements/

    RUN bash ./requirements/set_python_env.sh $PYTHON_ENV

    RUN apt-get update && \
        apt-get install -y \
        vim \
        && apt update


    CMD ["/bin/sh", "-c", "bash"]

    As you can see, we are using the same image, python:3.10, as our base image.

    Next, we set an argument named PYTHON_ENV with the ARG command to define the virtual environment name. We set my_env as the default value, which can be modified at build time using the --build-arg argument. We use the PYTHON_ENV argument to set an environment variable as well.
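    For example, to build the image with a different (hypothetical) environment name, we could override the default at build time:

    docker build . -f Dockerfile -t my_python_env:3.10 --build-arg PYTHON_ENV=another_env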

    Before setting the virtual environment, we will create a new folder named requirements under the image's root directory and use the COPY command to copy the above helper files, requirements.txt and set_python_env.sh, into it.

    Next, we call the bash script set_python_env.sh, which sets up the virtual environment and installs the required libraries. As mentioned above, we pass the PYTHON_ENV variable as an argument to set_python_env.sh to set the virtual environment name dynamically.

    We use the apt command to install vim, a CLI editor. This will enable us to edit code via the container CLI.

    Last but not least, use the CMD command to launch a shell terminal using bash:

    CMD ["/bin/sh", "-c", "bash"]

    At this point, we have the below files in the local folder:

    .
    ├── Dockerfile
    ├── requirements.txt
    └── set_python_env.sh

    Let’s now go ahead and build the image with the docker build command:

    docker build . -f Dockerfile -t my_python_env:3.10
    [+] Building 47.3s (10/10) FINISHED
    => [internal] load build definition from Dockerfile 0.2s
    => => transferring dockerfile: 389B 0.0s
    => [internal] load .dockerignore 0.2s
    => => transferring context: 2B 0.0s
    => [internal] load metadata for docker.io/library/python:3.10 0.0s
    => [1/5] FROM docker.io/library/python:3.10 1.3s
    => [internal] load build context 0.4s
    => => transferring context: 460B 0.0s
    => [2/5] RUN mkdir requirements 0.5s
    => [3/5] COPY requirements.txt set_python_env.sh /requirements/ 0.2s
    => [4/5] RUN bash ./requirements/set_python_env.sh my_env 26.8s
    => [5/5] RUN apt-get update && apt-get install -y vim && apt update 17.4s
    => exporting to image 0.7s
    => => exporting layers 0.6s
    => => writing image sha256:391879baceea6154c191692d4bcb9ec9690de6dc4d5edd5b2ed13f6c579dd05c 0.0s
    => => naming to docker.io/library/my_python_env:3.10

    Let’s run again the docker images command to review the current images:

    docker images
    REPOSITORY      TAG    IMAGE ID       CREATED         SIZE
    my_python_env   3.10   391879baceea   7 minutes ago   1.23GB
    python          3.10   f7537c504c9a   8 days ago      1.01GB

    As you can see, adding the virtual environment and installing the packages added about 220 MB to the image size (1.23 GB vs. 1.01 GB).

    Running the Python Environment

    After we built the image, let’s launch the image with the docker run command and check if the above properties are defined as expected:

    docker run --interactive --tty my_python_env:3.10

    This launches the image in interactive mode, and opens a bash terminal as expected:

    Launching the container via the shell terminal (screenshot by the author)

    As you can see in the above screenshot, the container launched inside the bash terminal, and the virtual environment is set as expected to my_env. The pandas library was installed and can be loaded, and we can now edit files from the terminal.

    One issue to take care of is that the container is still ephemeral. Any code we create inside the container is not persisted and will be lost once we stop the container.

    A simple solution is to mount a volume with the volume argument. For simplicity, we will go ahead and mount the local folder, where we keep the Dockerfile and the helper files, to a new folder inside the container named my_scripts:

    docker run -v .:/my_scripts  --interactive --tty my_python_env:3.10

    And here is the output:

    Mounting a local volume to the container (screenshot by the author)

    Once the folder is mounted, any file that is created, modified, or deleted in the mounted folder inside the container will be reflected in the local folder. This enables you to keep your code when you stop the container.

    Summary

    In this tutorial, we reviewed how to set up a dockerized Python environment using the command line. While this is neither a practical nor a recommended approach for Python development, it is a great way to learn Docker's core commands and basic functionality. We showed how to take a built-in image and customize it to our needs. Last but not least, we saw how to mount a local folder to the container with the volume argument to move the container from an ephemeral mode to a persistent one.

    Setting A Dockerized Python Environment — The Hard Way was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Setting A Dockerized Python Environment — The Hard Way

    Go Here to Read this Fast! Setting A Dockerized Python Environment — The Hard Way

  • Text Embeddings: Comprehensive Guide

    Text Embeddings: Comprehensive Guide

    Mariya Mansurova

    Evolution, visualisation, and applications of text embeddings

    Image by DALL-E 3

    As human beings, we can read and understand texts (at least some of them). Computers, in contrast, “think in numbers”, so they can’t automatically grasp the meaning of words and sentences. If we want computers to understand natural language, we need to convert this information into a format that computers can work with — vectors of numbers.

    People learned how to convert texts into a machine-understandable format many years ago (one of the first versions was ASCII). Such an approach helps render and transfer texts but doesn’t encode their meaning. At that time, the standard search technique was keyword search, where you just looked for all the documents that contained specific words or N-grams.

    Then, after decades, embeddings have emerged. We can calculate embeddings for words, sentences, and even images. Embeddings are also vectors of numbers, but they can capture the meaning. So, you can use them to do a semantic search and even work with documents in different languages.

    In this article, I would like to dive deeper into the embedding topic and discuss all the details:

    • what preceded the embeddings and how they evolved,
    • how to calculate embeddings using OpenAI tools,
    • how to define whether sentences are close to each other,
    • how to visualise embeddings,
    • the most exciting part is how you could use embeddings in practice.

    Let’s move on and learn about the evolution of embeddings.

    Evolution of Embeddings

    We will start our journey with a brief tour into the history of text representations.

    Bag of Words

    The most basic approach to converting texts into vectors is a bag of words. Let’s look at one of the famous quotes of Richard P. Feynman: “We are lucky to live in an age in which we are still making discoveries”. We will use it to illustrate the bag of words approach.

    The first step to get a bag of words vector is to split the text into words (tokens) and then reduce words to their base forms. For example, “running” will transform into “run”. This process is called stemming. We can use the NLTK Python package for it.

    from nltk.stem import SnowballStemmer
    from nltk.tokenize import word_tokenize

    text = 'We are lucky to live in an age in which we are still making discoveries'

    # tokenization - splitting text into words
    words = word_tokenize(text)
    print(words)
    # ['We', 'are', 'lucky', 'to', 'live', 'in', 'an', 'age', 'in', 'which',
    # 'we', 'are', 'still', 'making', 'discoveries']

    stemmer = SnowballStemmer(language = "english")
    stemmed_words = list(map(lambda x: stemmer.stem(x), words))
    print(stemmed_words)
    # ['we', 'are', 'lucki', 'to', 'live', 'in', 'an', 'age', 'in', 'which',
    # 'we', 'are', 'still', 'make', 'discoveri']

    Now, we have a list of base forms of all our words. The next step is to calculate their frequencies to create a vector.

    import collections
    bag_of_words = collections.Counter(stemmed_words)
    print(bag_of_words)
    # {'we': 2, 'are': 2, 'in': 2, 'lucki': 1, 'to': 1, 'live': 1,
    # 'an': 1, 'age': 1, 'which': 1, 'still': 1, 'make': 1, 'discoveri': 1}

    Actually, if we wanted to convert our text into a vector, we would have to take into account not only the words we have in the text but the whole vocabulary. Let’s assume we also have “i”, “you” and ”study” in our vocabulary and let’s create a vector from Feynman’s quote.

    Graph by author
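    To make this step concrete, here is a small sketch (with a toy extended vocabulary of my own choosing) that turns the counts above into a fixed-length vector over the whole vocabulary:

    # toy vocabulary: the stems from the quote plus a few extra words ('i', 'you', 'studi')
    vocabulary = sorted(set(stemmed_words) | {'i', 'you', 'studi'})

    # one count per vocabulary word; 0 if the word is absent from the text
    vector = [bag_of_words.get(word, 0) for word in vocabulary]
    print(vector)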

    This approach is quite basic, and it doesn’t take into account the semantic meaning of the words, so the sentences “the girl is studying data science” and “the young woman is learning AI and ML” won’t be close to each other.

    TF-IDF

    A slightly improved version of the bag of words approach is TF-IDF (Term Frequency — Inverse Document Frequency). It’s the multiplication of two metrics.

    • Term Frequency shows the frequency of the word in the document. The most common way to calculate it is to divide the raw count of the term in this document (like in the bag of words) by the total number of terms (words) in the document. However, there are many other approaches like just raw count, boolean “frequencies”, and different approaches to normalisation. You can learn more about different approaches on Wikipedia.
    • Inverse Document Frequency denotes how much information the word provides. For example, the words “a” or “that” don’t give you any additional information about the document’s topic. In contrast, words like “ChatGPT” or “bioinformatics” can help you define the domain (but not for this sentence). It’s calculated as the logarithm of the ratio of the total number of documents to those containing the word. The closer IDF is to 0 — the more common the word is and the less information it provides.

    So, in the end, we will get vectors where common words (like “I” or “you”) have low weights, while rare words that occur in the document multiple times have higher weights. This strategy gives slightly better results, but it still can’t capture semantic meaning.
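    As a hedged illustration (not part of the original article), sklearn's TfidfVectorizer computes such weights out of the box; the exact numbers depend on its smoothing and normalisation settings.

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'the girl is studying data science',
        'the young woman is learning AI and ML',
    ]

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # each row is a sparse TF-IDF vector over the corpus vocabulary
    print(vectorizer.get_feature_names_out())
    print(tfidf_matrix.toarray().round(2))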

    The other challenge with this approach is that it produces pretty sparse vectors. The length of the vectors is equal to the corpus size. There are about 470K unique words in English (source), so we will have huge vectors. Since the sentence won’t have more than 50 unique words, 99.99% of the values in vectors will be 0, not encoding any info. Looking at this, scientists started to think about dense vector representation.

    Word2Vec

    One of the most famous approaches to dense representation is word2vec, proposed by Google in 2013 in the paper “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al.

    There are two different word2vec approaches mentioned in the paper: Continuous Bag of Words (when we predict the word based on the surrounding words) and Skip-gram (the opposite task — when we predict context based on the word).

    Figure from the paper by Mikolov et al. 2013 | source

    The high-level idea of dense vector representation is to train two models: encoder and decoder. For example, in the case of skip-gram, we might pass the word “christmas” to the encoder. Then, the encoder will produce a vector that we pass to the decoder expecting to get the words “merry”, “to”, and “you”.

    Scheme by author

    This model started to take into account the meaning of the words since it’s trained on the context of the words. However, it ignores morphology (information we can get from the word parts, for example, that “-less” means the lack of something). This drawback was addressed later in fastText by looking at subword skip-grams.
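    If you would like to play with such a model yourself, here is a minimal sketch using the gensim library (an assumption on my side; it is not used elsewhere in this article) that trains a skip-gram word2vec on a toy corpus:

    from gensim.models import Word2Vec

    # toy corpus of tokenised sentences; a real model needs far more data
    sentences = [
        ['we', 'wish', 'you', 'a', 'merry', 'christmas'],
        ['merry', 'christmas', 'to', 'you'],
    ]

    # sg=1 selects the skip-gram architecture described above
    model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
    print(model.wv['christmas'][:5])  # the first values of the dense vector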

    Also, word2vec was capable of working only with words, but we would like to encode whole sentences. So, let’s move on to the next evolutional step with transformers.

    Transformers and Sentence Embeddings

    The next evolution was related to the transformers approach introduced in the “Attention Is All You Need” paper by Vaswani et al. Transformers were able to produce information-rich dense vectors and became the dominant technology for modern language models.

    I won’t cover the details of the transformers’ architecture since it’s not so relevant to our topic and would take a lot of time. If you’re interested in learning more, there are a lot of materials about transformers, for example, “Transformers, Explained” or “The Illustrated Transformer”.

    Transformers allow you to use the same “core” model and fine-tune it for different use cases without retraining the core model (which takes a lot of time and is quite costly). It led to the rise of pre-trained models. One of the first popular models was BERT (Bidirectional Encoder Representations from Transformers) by Google AI.

    Internally, BERT still operates on a token level similar to word2vec, but we still want to get sentence embeddings. So, the naive approach could be to take an average of all tokens’ vectors. Unfortunately, this approach doesn’t show good performance.

    This problem was solved in 2019 when Sentence-BERT was released. It outperformed all previous approaches to semantic textual similarity tasks and allowed the calculation of sentence embeddings.

    It’s a huge topic, so we won’t be able to cover it all in this article. If you’re really interested, you can learn more about sentence embeddings in this article.

    We’ve briefly covered the evolution of embeddings and got a high-level understanding of the theory. Now, it’s time to move on to practice and learn how to calculate embeddings using OpenAI tools.

    Calculating embeddings

    In this article, we will be using OpenAI embeddings. We will try a new model text-embedding-3-small that was released just recently. The new model shows better performance compared to text-embedding-ada-002:

    • The average score on a widely used multi-language retrieval (MIRACL) benchmark has risen from 31.4% to 44.0%.
    • The average performance on a frequently used benchmark for English tasks (MTEB) has also improved, rising from 61.0% to 62.3%.

    OpenAI also released a new larger model text-embedding-3-large. Now, it’s their best performing embedding model.

    As a data source, we will be working with a small sample of the Stack Exchange Data Dump — an anonymised dump of all user-contributed content on the Stack Exchange network. I’ve selected a bunch of topics that look interesting to me and sampled 100 questions from each of them. Topics range from Generative AI to coffee or bicycles, so we will see quite a wide variety of topics.

    First, we need to calculate embeddings for all our Stack Exchange questions. It’s worth doing it once and storing results locally (in a file or vector storage). We can generate embeddings using the OpenAI Python package.

    from openai import OpenAI
    client = OpenAI()

    def get_embedding(text, model="text-embedding-3-small"):
        text = text.replace("\n", " ")
        return client.embeddings.create(input = [text], model=model).data[0].embedding

    get_embedding("We are lucky to live in an age in which we are still making discoveries.")

    As a result, we get a 1536-dimensional vector of floats. We can now repeat this for all our data and start analysing the values.

    The primary question you might have is how close the sentences are to each other by meaning. To uncover answers, let’s discuss the concept of distance between vectors.

    Distance between vectors

    Embeddings are actually vectors. So, if we want to understand how close two sentences are to each other, we can calculate the distance between vectors. A smaller distance would be equivalent to a closer semantic meaning.

    Different metrics can be used to measure the distance between two vectors:

    • Euclidean distance (L2),
    • Manhattan distance (L1),
    • Dot product,
    • Cosine distance.

    Let’s discuss them. As a simple example, we will be using two 2D vectors.

    vector1 = [1, 4]
    vector2 = [2, 2]

    Euclidean distance (L2)

    The most standard way to define distance between two points (or vectors) is Euclidean distance or L2 norm. This metric is the most commonly used in day-to-day life, for example, when we are talking about the distance between 2 towns.

    Here’s a visual representation and formula for L2 distance.

    Image by author

    We can calculate this metric using vanilla Python or leveraging the numpy function.

    import numpy as np

    sum(list(map(lambda x, y: (x - y) ** 2, vector1, vector2))) ** 0.5
    # 2.2361

    np.linalg.norm((np.array(vector1) - np.array(vector2)), ord = 2)
    # 2.2361

    Manhattan distance (L1)

    The other commonly used distance is the L1 norm, or Manhattan distance. It is named after the island of Manhattan (New York), which has a grid layout of streets: the shortest route between two points in Manhattan follows the grid, so its length equals the L1 distance.

    Image by author

    We can also implement it from scratch or use the numpy function.

    sum(list(map(lambda x, y: abs(x - y), vector1, vector2)))
    # 3

    np.linalg.norm((np.array(vector1) - np.array(vector2)), ord = 1)
    # 3.0

    Dot product

    Another way to look at the distance between vectors is to calculate a dot (or scalar) product. Here’s the formula, which we can easily implement.

    Image by author
    sum(list(map(lambda x, y: x*y, vector1, vector2)))
    # 11

    np.dot(vector1, vector2)
    # 11

    This metric is a bit tricky to interpret. On the one hand, it shows you whether vectors are pointing in one direction. On the other hand, the results highly depend on the magnitudes of the vectors. For example, let’s calculate the dot products between two pairs of vectors:

    • (1, 1) vs (1, 1)
    • (1, 1) vs (10, 10).

    In both cases, vectors are collinear, but the dot product is ten times bigger in the second case: 2 vs 20.
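    We can check this quickly with numpy:

    print(np.dot([1, 1], [1, 1]))    # 2
    print(np.dot([1, 1], [10, 10]))  # 20: same direction, ten times the magnitude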

    Cosine similarity

    Quite often, cosine similarity is used. Cosine similarity is a dot product normalised by the vectors’ magnitudes (or norms).

    Image by author

    We can either calculate everything ourselves (as previously) or use the function from sklearn.

    dot_product = sum(list(map(lambda x, y: x*y, vector1, vector2)))
    norm_vector1 = sum(list(map(lambda x: x ** 2, vector1))) ** 0.5
    norm_vector2 = sum(list(map(lambda x: x ** 2, vector2))) ** 0.5

    dot_product/norm_vector1/norm_vector2

    # 0.8575

    from sklearn.metrics.pairwise import cosine_similarity

    cosine_similarity(
    np.array(vector1).reshape(1, -1),
    np.array(vector2).reshape(1, -1))[0][0]

    # 0.8575

    The function cosine_similarity expects 2D arrays. That’s why we need to reshape the numpy arrays.

    Let’s talk a bit about the physical meaning of this metric. Cosine similarity is equal to the cosine between two vectors. The closer the vectors are, the higher the metric value.

    Image by author

    We can even calculate the exact angle between our vectors in degrees. We get results around 30 degrees, and it looks pretty reasonable.

    import math
    math.degrees(math.acos(0.8575))

    # 30.96

    What metric to use?

    We’ve discussed different ways to calculate the distance between two vectors, and you might start thinking about which one to use.

    You can use any distance to compare the embeddings you have. For example, I calculated the average distances between the different clusters. Both L2 distance and cosine similarity show us similar pictures:

    • Objects within a cluster are closer to each other than to other clusters. It’s a bit tricky to interpret our results since for L2 distance, closer means lower distance, while for cosine similarity — the metric is higher for closer objects. Don’t get confused.
    • We can spot that some topics are really close to each other, for example, “politics” and “economics” or “ai” and “datascience”.
    Image by author
    Image by author

    However, for NLP tasks, the best practice is usually to use cosine similarity. Some reasons behind it:

    • Cosine similarity is between -1 and 1, while L1 and L2 are unbounded, so it’s easier to interpret.
    • From the practical perspective, it’s more efficient to compute dot products than the square roots required for Euclidean distance.
    • Cosine similarity is less affected by the curse of dimensionality (we will talk about it in a second).

    OpenAI embeddings are already normalised, so the dot product and cosine similarity are equal in this case.
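    A quick sanity check of this property, reusing the get_embedding helper defined above (the exact value may differ slightly due to floating point):

    embedding = get_embedding("We are lucky to live in an age in which we are still making discoveries.")
    print(np.linalg.norm(embedding))  # ~1.0, since the embeddings are unit-normalised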

    You might spot in the results above that the difference between inter- and intra-cluster distances is not so big. The root cause is the high dimensionality of our vectors. This effect is called “the curse of dimensionality”: the higher the dimension, the narrower the distribution of distances between vectors. You can learn more details about it in this article.

    I would like to briefly show you how it works so that you get some intuition. I calculated the distribution of OpenAI embedding values and generated sets of 300 vectors with different dimensionalities. Then, I calculated the distances between all the vectors and drew a histogram. You can easily see that increasing the vector dimensionality makes the distribution narrower.

    Graph by author
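    If you would like to reproduce this intuition yourself, here is a small self-contained sketch (my own simulation with random Gaussian vectors, not the exact setup above): the relative spread of pairwise distances shrinks as the dimensionality grows.

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(42)
    for dim in [2, 100, 1536]:
        vectors = rng.normal(size=(300, dim))  # 300 random vectors of dimension dim
        distances = pdist(vectors)             # all pairwise Euclidean distances
        # relative spread of the distances; it shrinks as dimensionality grows
        print(dim, round(distances.std() / distances.mean(), 3))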

    We’ve learned how to measure the similarity between embeddings. With that, we’ve finished the theoretical part and are moving on to the more practical part (visualisations and practical applications). Let’s start with visualisations, since it’s always better to see your data first.

    Visualising embeddings

    The best way to understand the data is to visualise it. Unfortunately, embeddings have 1536 dimensions, so it’s pretty challenging to look at the data. However, there’s a way: we could use dimensionality reduction techniques to project vectors in two-dimensional space.

    PCA

    The most basic dimensionality reduction technique is PCA (Principal Component Analysis). Let’s try to use it.

    First, we need to convert our embeddings into a 2D numpy array to pass it to sklearn.

    import numpy as np
    embeddings_array = np.array(df.embedding.values.tolist())
    print(embeddings_array.shape)
    # (1400, 1536)

    Then, we need to initialise a PCA model with n_components = 2 (because we want to create a 2D visualisation), train the model on the whole data and predict new values.

    from sklearn.decomposition import PCA

    pca_model = PCA(n_components = 2)
    pca_model.fit(embeddings_array)

    pca_embeddings_values = pca_model.transform(embeddings_array)
    print(pca_embeddings_values.shape)
    # (1400, 2)

    As a result, we got a matrix with just two features for each question, so we could easily visualise it on a scatter plot.

    import plotly
    import plotly.express as px

    fig = px.scatter(
        x = pca_embeddings_values[:,0],
        y = pca_embeddings_values[:,1],
        color = df.topic.values,
        hover_name = df.full_text.values,
        title = 'PCA embeddings', width = 800, height = 600,
        color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
    )

    fig.update_layout(
        xaxis_title = 'first component',
        yaxis_title = 'second component')
    fig.show()
    Image by author

    We can see that questions from each topic are pretty close to each other, which is good. However, all the clusters are mixed, so there’s room for improvement.
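    One hedged way to quantify how much information the 2D projection keeps is PCA's explained variance ratio (the values below are illustrative, not from the article):

    # share of the total variance captured by each of the two components
    print(pca_model.explained_variance_ratio_)
    # e.g. [0.05, 0.04]: only a small share, consistent with the mixed clusters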

    t-SNE

    PCA is a linear algorithm, while most of the relations are non-linear in real life. So, we may not be able to separate the clusters because of non-linearity. Let’s try to use a non-linear algorithm t-SNE and see whether it will be able to show better results.

    The code is almost identical. I just used the t-SNE model instead of PCA.

    from sklearn.manifold import TSNE

    tsne_model = TSNE(n_components=2, random_state=42)
    tsne_embeddings_values = tsne_model.fit_transform(embeddings_array)

    fig = px.scatter(
        x = tsne_embeddings_values[:,0],
        y = tsne_embeddings_values[:,1],
        color = df.topic.values,
        hover_name = df.full_text.values,
        title = 't-SNE embeddings', width = 800, height = 600,
        color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
    )

    fig.update_layout(
        xaxis_title = 'first component',
        yaxis_title = 'second component')
    fig.show()

    The t-SNE result looks way better. Most of the clusters are separated except “genai”, “datascience” and “ai”. However, it’s pretty expected — I doubt I could separate these topics myself.

    Looking at this visualisation, we see that embeddings are pretty good at encoding semantic meaning.

    Also, you can make a projection to three-dimensional space and visualise it. I’m not sure whether it would be practical, but it can be insightful and engaging to play with the data in 3D.

    tsne_model_3d = TSNE(n_components=3, random_state=42)
    tsne_3d_embeddings_values = tsne_model_3d.fit_transform(embeddings_array)

    fig = px.scatter_3d(
        x = tsne_3d_embeddings_values[:,0],
        y = tsne_3d_embeddings_values[:,1],
        z = tsne_3d_embeddings_values[:,2],
        color = df.topic.values,
        hover_name = df.full_text.values,
        title = 't-SNE embeddings', width = 800, height = 600,
        color_discrete_sequence = plotly.colors.qualitative.Alphabet_r,
        opacity = 0.7
    )
    fig.update_layout(xaxis_title = 'first component', yaxis_title = 'second component')
    fig.show()

    Barcodes

    Another way to understand embeddings is to visualise a couple of them as barcodes and look at the correlations. I picked three examples of embeddings: two that are closest to each other, and one that is the farthest example in our dataset.

    embedding1 = df.loc[1].embedding
    embedding2 = df.loc[616].embedding
    embedding3 = df.loc[749].embedding

    import seaborn as sns
    import matplotlib.pyplot as plt

    embed_len_thr = 1536

    sns.heatmap(np.array(embedding1[:embed_len_thr]).reshape(-1, embed_len_thr),
                cmap = "Greys", center = 0, square = False,
                xticklabels = False, cbar = False)
    plt.gcf().set_size_inches(15,1)
    plt.yticks([0.5], labels = ['AI'])
    plt.show()

    sns.heatmap(np.array(embedding3[:embed_len_thr]).reshape(-1, embed_len_thr),
                cmap = "Greys", center = 0, square = False,
                xticklabels = False, cbar = False)
    plt.gcf().set_size_inches(15,1)
    plt.yticks([0.5], labels = ['AI'])
    plt.show()

    sns.heatmap(np.array(embedding2[:embed_len_thr]).reshape(-1, embed_len_thr),
                cmap = "Greys", center = 0, square = False,
                xticklabels = False, cbar = False)
    plt.gcf().set_size_inches(15,1)
    plt.yticks([0.5], labels = ['Bioinformatics'])
    plt.show()
    Graph by author

    It’s not easy to see whether vectors are close to each other in our case because of high dimensionality. However, I still like this visualisation. It might be helpful in some cases, so I am sharing this idea with you.

    We’ve learned how to visualise embeddings and have no doubts left about their ability to grasp the meaning of the text. Now, it’s time to move on to the most interesting and fascinating part and discuss how you can leverage embeddings in practice.

    Practical applications

    Of course, embeddings’ primary goal is not to encode texts as vectors of numbers or visualise them just for the sake of it. We can benefit a lot from our ability to capture the texts’ meanings. Let’s go through a bunch of more practical examples.

    Clustering

    Let’s start with clustering. Clustering is an unsupervised learning technique that allows you to split your data into groups without any initial labels. Clustering can help you understand the internal structural patterns in your data.

    We will use one of the most basic clustering algorithms — K-means. For the K-means algorithm, we need to specify the number of clusters. We can define the optimal number of clusters using silhouette scores.

    Let’s try k (the number of clusters) between 2 and 50. For each k, we will train a model and calculate silhouette scores. The higher the silhouette score, the better the clustering.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import pandas as pd
    import tqdm

    silhouette_scores = []
    for k in tqdm.tqdm(range(2, 51)):
        kmeans = KMeans(n_clusters=k,
                        random_state=42,
                        n_init = 'auto').fit(embeddings_array)
        kmeans_labels = kmeans.labels_
        silhouette_scores.append(
            {
                'k': k,
                'silhouette_score': silhouette_score(embeddings_array,
                    kmeans_labels, metric = 'cosine')
            }
        )

    fig = px.line(pd.DataFrame(silhouette_scores).set_index('k'),
                  title = '<b>Silhouette scores for K-means clustering</b>',
                  labels = {'value': 'silhouette score'},
                  color_discrete_sequence = plotly.colors.qualitative.Alphabet)
    fig.update_layout(showlegend = False)
    In our case, the silhouette score reaches a maximum when k = 11. So, let’s use this number of clusters for our final model.

    Graph by author

    Let’s visualise the clusters using t-SNE for dimensionality reduction as we already did before.

    # refit K-means with the chosen number of clusters, k = 11
    # (otherwise kmeans_labels would still hold the k = 50 fit from the loop above)
    kmeans = KMeans(n_clusters = 11, random_state = 42, n_init = 'auto').fit(embeddings_array)
    kmeans_labels = kmeans.labels_

    tsne_model = TSNE(n_components=2, random_state=42)
    tsne_embeddings_values = tsne_model.fit_transform(embeddings_array)

    fig = px.scatter(
        x = tsne_embeddings_values[:,0],
        y = tsne_embeddings_values[:,1],
        color = list(map(lambda x: 'cluster %s' % x, kmeans_labels)),
        hover_name = df.full_text.values,
        title = 't-SNE embeddings for clustering', width = 800, height = 600,
        color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
    )
    fig.update_layout(
        xaxis_title = 'first component',
        yaxis_title = 'second component')
    fig.show()

    Visually, we can see that the algorithm was able to define clusters quite well — they are separated pretty well.

    We have the factual topic labels, so we can even assess how good the clusterisation is. Let’s look at the topics’ mixture for each cluster.

    df['cluster'] = list(map(lambda x: 'cluster %s' % x, kmeans_labels))
    cluster_stats_df = df.reset_index().pivot_table(
        index = 'cluster', values = 'id',
        aggfunc = 'count', columns = 'topic').fillna(0).applymap(int)

    cluster_stats_df = cluster_stats_df.apply(
        lambda x: 100*x/cluster_stats_df.sum(axis = 1))

    fig = px.imshow(
        cluster_stats_df.values,
        x = cluster_stats_df.columns,
        y = cluster_stats_df.index,
        text_auto = '.2f', aspect = "auto",
        labels=dict(x="cluster", y="fact topic", color="share, %"),
        color_continuous_scale='pubugn',
        title = '<b>Share of topics in each cluster</b>', height = 550)

    fig.show()

    In most cases, clusterisation worked perfectly. For example, cluster 5 contains almost only questions about bicycles, while cluster 6 is about coffee. However, it wasn’t able to distinguish close topics:

    • “ai”, “genai” and “datascience” are all in one cluster,
    • the same story with “economics” and “politics”.

    We used only embeddings as the features in this example, but if you have any additional information (for example, age, gender or country of the user who asked the question), you can include it in the model, too.

    Classification

    We can use embeddings for classification or regression tasks. For example, you can do it to predict customer reviews’ sentiment (classification) or NPS score (regression).

    Since classification and regression are supervised learning, you will need to have labels. Luckily, we know the topics for our questions and can fit a model to predict them.

    I will use a Random Forest Classifier. If you need a quick refresher about Random Forests, you can find it here. To assess the classification model’s performance correctly, we will split our dataset into train and test sets (80% vs 20%). Then, we can train our model on a train set and measure the quality on a test set (questions that the model hasn’t seen before).

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    class_model = RandomForestClassifier(max_depth = 10)

    # defining features and target
    X = embeddings_array
    y = df.topic

    # splitting data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 42, test_size=0.2, stratify=y
    )

    # fit & predict
    class_model.fit(X_train, y_train)
    y_pred = class_model.predict(X_test)

    To estimate the model’s performance, let’s calculate a confusion matrix. In an ideal situation, all non-diagonal elements should be 0.

    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)

    fig = px.imshow(
        cm, x = class_model.classes_,
        y = class_model.classes_, text_auto='d',
        aspect="auto",
        labels=dict(
            x="predicted label", y="true label",
            color="cases"),
        color_continuous_scale='pubugn',
        title = '<b>Confusion matrix</b>', height = 550)

    fig.show()

    We can see similar results to clusterisation: some topics are easy to classify, and accuracy is 100%, for example, “bicycles” or “travel”, while some others are difficult to distinguish (especially “ai”).

    However, we achieved 91.8% overall accuracy, which is quite good.
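    For reference, the overall accuracy figure above can be computed with sklearn:

    from sklearn.metrics import accuracy_score
    print(accuracy_score(y_test, y_pred))  # ~0.918 in this experiment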

    Finding anomalies

    We can also use embeddings to find anomalies in our data. For example, on the t-SNE graph, we saw that some questions are pretty far from their clusters, for instance, in the “travel” topic. Let’s look at this theme and try to find anomalies. We will use the Isolation Forest algorithm for it.

    from sklearn.ensemble import IsolationForest

    topic_df = df[df.topic == 'travel']
    topic_embeddings_array = np.array(topic_df.embedding.values.tolist())

    clf = IsolationForest(contamination = 0.03, random_state = 42)
    topic_df['is_anomaly'] = clf.fit_predict(topic_embeddings_array)

    topic_df[topic_df.is_anomaly == -1][['full_text']]

    So, here we are. We’ve found the most uncommon comment for the travel topic (source).

    Is it safe to drink the water from the fountains found all over 
    the older parts of Rome?

    When I visited Rome and walked around the older sections, I saw many
    different types of fountains that were constantly running with water.
    Some went into the ground, some collected in basins, etc.

    Is the water coming out of these fountains potable? Safe for visitors
    to drink from? Any etiquette regarding their use that a visitor
    should know about?

    Since it talks about water, the embedding of this comment is close to the coffee topic, where people also discuss what water to use for brewing coffee. So, the embedding representation is quite reasonable.

    We could find it on our t-SNE visualisation and see that it’s actually close to the coffee cluster.

    Graph by author

    RAG — Retrieval Augmented Generation

    With the recently increased popularity of LLMs, embeddings have been broadly used in RAG use cases.

    We need Retrieval Augmented Generation when we have a lot of documents (for example, all the questions from Stack Exchange), and we can’t pass them all to an LLM because

    • LLMs have limits on the context size (right now, it’s 128K for GPT-4 Turbo).
    • We pay for tokens, so it’s more expensive to pass all the information all the time.
    • LLMs show worse performance with a bigger context. You can check Needle In A Haystack — Pressure Testing LLMs to learn more details.

    To be able to work with an extensive knowledge base, we can leverage the RAG approach:

    • Compute embeddings for all the documents and store them in vector storage.
    • When we get a user request, we can calculate its embedding and retrieve relevant documents from the storage for this request.
    • Pass only relevant documents to LLM to get a final answer.
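    Here is a minimal retrieval sketch of the second step, under stated assumptions: the dataframe df with full_text and embedding columns built earlier plays the role of the vector storage, and we reuse the get_embedding helper; a production setup would use a dedicated vector database instead.

    def retrieve(query, df, top_n = 3):
        # embed the user request with the same model as the documents
        query_embedding = np.array(get_embedding(query))
        doc_embeddings = np.array(df.embedding.values.tolist())
        # OpenAI embeddings are normalised, so the dot product equals cosine similarity
        similarities = doc_embeddings @ query_embedding
        top_idx = np.argsort(similarities)[::-1][:top_n]
        return df.iloc[top_idx].full_text

    # the retrieved documents would then be passed to the LLM together with the question
    print(retrieve('Is tap water in Rome drinkable?', df))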

    To learn more about RAG, don’t hesitate to read my article with much more details here.

    Summary

    In this article, we’ve discussed text embeddings in much detail. Hopefully, now you have a complete and deep understanding of this topic. Here’s a quick recap of our journey:

    • Firstly, we went through the evolution of approaches to work with texts.
    • Then, we discussed how to understand whether texts have similar meanings to each other.
    • After that, we saw different approaches to text embedding visualisation.
    • Finally, we tried to use embeddings as features in different practical tasks such as clustering, classification, anomaly detection and RAG.

    Thank you a lot for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

    Reference

    In this article, I used a dataset from Stack Exchange Data Dump, which is available under the Creative Commons license.



    Text Embeddings: Comprehensive Guide was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Text Embeddings: Comprehensive Guide

    Go Here to Read this Fast! Text Embeddings: Comprehensive Guide