Tag: artificial intelligence

  • Deploy a Microsoft Teams gateway for Amazon Q, your business expert

    Deploy a Microsoft Teams gateway for Amazon Q, your business expert

    Bob Strahan

    In this post, we show you how to bring Amazon Q, your business expert, to users in Microsoft Teams. (If you use Slack, refer to Deploy a Slack gateway for Amazon Q, your business expert.) You’ll be able converse with Amazon Q business expert using Teams direct messages (DMs) to ask questions and get answers based on company data, get help creating new content such as email drafts, summarize attached files, and perform tasks.

    Originally appeared here:
    Deploy a Microsoft Teams gateway for Amazon Q, your business expert

    Go Here to Read this Fast! Deploy a Microsoft Teams gateway for Amazon Q, your business expert

  • The New Frontiers of LLMs: Challenges, Solutions, and Tools

    TDS Editors

    Large language models have been around for several years, but it wasn’t until 2023 that their presence became truly ubiquitous both within and outside machine learning communities. Previously opaque concepts like fine-tuning and RAG have gone mainstream, and companies big and small have been either building or integrating LLM-powered tools into their workflows.

    As we look ahead at what 2024 might bring, it seems all but certain that these models’ footprint is poised to grow further, and that alongside exciting innovations, they’ll also generate new challenges for practitioners. The standout posts we’re highlighting this week point at some of these emerging aspects of working with LLMs; whether you’re relatively new to the topic or have already experimented extensively with these models, you’re bound to find something here to pique your curiosity.

    • Democratizing LLMs: 4-bit Quantization for Optimal LLM Inference
      Quantization is one of the main approaches for making the power of massive models accessible to a wider user base of ML professionals, many of whom might not have access to limitless memory and compute. Wenqi Glantz walks us through the process of quantizing the Mistral-7B-Instruct-v0.2 model, and explains this method’s inherent tradeoffs between efficiency and performance.
    • Navigating the World of LLM Agents: A Beginner’s Guide
      How can we get LLMs “to the point where they can solve more complex questions on their own?” Dominik Polzer’s accessible primer shows how to build LLM agents that can leverage disparate tools and functionalities to automate complex workflows with minimal human intervention.
    Photo by Beth Macdonald on Unsplash
    • Leverage KeyBERT, HDBSCAN and Zephyr-7B-Beta to Build a Knowledge Graph
      LLMs are very powerful on their own, of course, but their potential becomes even more striking when combined with other approaches and tools. Silvia Onofrei’s recent guide on building a knowledge graph with the aid of the Zephyr-7B-Beta model is a case in point; it demonstrates how bringing together LLMs and traditional NLP methods can produce impressive results.
    • Merge Large Language Models with mergekit
      As unlikely as it may sound, sometimes a single LLM might not be enough for your project’s specific needs. As Maxime Labonne shows in his latest tutorial, model merging, a “relatively new and experimental method to create new models for cheap,” might just be the solution for those moments when you need to mix-and-match elements from multiple models.
    • Does Using an LLM During the Hiring Process Make You a Fraud as a Candidate?
      The types of questions LLMs raise go beyond the technical—they also touch on ethical and social issues that can get quite thorny. Christine Egan focuses on the stakes for job candidates who take advantage of LLMs and tools like ChatGPT as part of the job search, and explores the sometimes blurry line between using and misusing technology to streamline tedious tasks.

    As always, the range and depth of topics our authors covered in recent weeks is staggering—here’s a representative sample of must-reads:

    Thank you for supporting the work of our authors! If you’re feeling inspired to join their ranks, why not write your first post? We’d love to read it.

    Until the next Variable,

    TDS Team


    The New Frontiers of LLMs: Challenges, Solutions, and Tools was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    The New Frontiers of LLMs: Challenges, Solutions, and Tools

    Go Here to Read this Fast! The New Frontiers of LLMs: Challenges, Solutions, and Tools

  • Performant IPv4 Range Spark Joins

    Performant IPv4 Range Spark Joins

    Jean-Claude Cote

    A Practical guide to optimizing non-equi joins in Spark

    Photo by John Lee on Unsplash

    Enriching network events with IP geolocation information is a crucial task, especially for organizations like the Canadian Centre for Cyber Security, the national CSIRT of Canada. In this article, we will demonstrate how to optimize Spark SQL joins, specifically focusing on scenarios involving non-equality conditions — a common challenge when working with IP geolocation data.

    As cybersecurity practitioners, our reliance on enriching network events with IP geolocation databases necessitates efficient strategies for handling non-equi joins. While numerous articles shed light on various join strategies supported by Spark, the practical application of these strategies remains a prevalent concern for professionals in the field.

    David Vrba’s insightful article, “About Joins in Spark 3.0”, published on Towards Data Science, serves as a valuable resource. It explains the conditions guiding Spark’s selection of specific join strategies. In his article, David briefly suggests that optimizing non-equi joins involves transforming them into equi-joins.

    This write-up aims to provide a practical guide for optimizing the performance of a non-equi JOIN, with a specific focus on joining with IP ranges in a geolocation table.

    To exemplify these optimizations, we will revisit the geolocation table introduced in our previous article.

    +----------+--------+---------+-----------+-----------+
    | start_ip | end_ip | country | city | owner |
    +----------+--------+---------+-----------+-----------+
    | 1 | 2 | ca | Toronto | Telus |
    | 3 | 4 | ca | Quebec | Rogers |
    | 5 | 8 | ca | Vancouver | Bell |
    | 10 | 14 | ca | Montreal | Telus |
    | 19 | 22 | ca | Ottawa | Rogers |
    | 23 | 29 | ca | Calgary | Videotron |
    +----------+--------+---------+-----------+-----------+

    Equi-Join

    To illustrate Spark’s execution of an equi-join, we’ll initiate our exploration by considering a hypothetical scenario. Suppose we have a table of events, each event being associated with a specific ownerdenoted by the event_owner column.

    +------------+--------------+
    | event_time | event_owner |
    +------------+--------------+
    | 2024-01-01 | Telus |
    | 2024-01-02 | Bell |
    | 2024-01-03 | Rogers |
    | 2024-01-04 | Videotron |
    | 2024-01-05 | Telus |
    | 2024-01-06 | Videotron |
    | 2024-01-07 | Rogers |
    | 2024-01-08 | Bell |
    +------------+--------------+

    Let’s take a closer look at how Spark handles this equi-join:

    SELECT
    *
    FROM
    events
    JOIN geolocation
    ON (event_owner = owner)

    In this example, the equi-join is established between the events table and the geolocation table. The linking criterion is based on the equality of the event_owner column in the events table and the owner column in the geolocation table.

    As explained by David Vrba in his blog post:

    Spark will plan the join with SMJ if there is an equi-condition and the joining keys are sortable

    Spark will execute a Sort Merge Join, distributing the rows of the two tables by hashing the event_owner on the left side and the owner on the right side. Rows from both tables that hash to the same Spark partition will be processed by the same Spark task—a unit of work. For example, Task-1 might receive:

    +----------+-------+---------+-----------+-----------+
    | start_ip | end_ip| country | city | owner |
    +----------+-------+---------+-----------+-----------+
    | 1 | 2 | ca | Toronto | Telus |
    | 10 | 14 | ca | Montreal | Telus |
    +----------+-------+---------+-----------+-----------+

    +------------+--------------+
    | event_time | event_owner |
    +------------+--------------+
    | 2024-01-01 | Telus |
    | 2024-01-05 | Telus |
    +------------+--------------+

    Notice how Task-1 handles only a subset of the data. The join problem is divided into multiple smaller tasks, where only a subset of the rows from both the left and right sides is required. Furthermore, the left and right side rows processed by Task-1 have to match. This is true because every occurrence of “Telus” will hash to the same partition, regardless of whether it comes from the events or geolocation tables. We can be certain that no other Task-X will have rows with an owner of “Telus”.

    Once the data is divided as shown above, Spark will sort both sides, hence the name of the join strategy, Sort Merge Join. The merge is performed by taking the first row on the left and testing if it matches the right. Once the rows on the right no longer match, Spark will pull rows from the left. It will keep dequeuing each side until no rows are left on either side.

    Non-equi Join

    Now that we have a better understanding of how equi-joins are performed, let’s contrast it with a non-equi join. Suppose we have events with an event_ip, and we want to add geolocation information to this table.

    +------------+----------+
    | event_time | event_ip |
    +------------+----------+
    | 2024-01-01 | 6 |
    | 2024-01-02 | 14 |
    | 2024-01-03 | 18 |
    | 2024-01-04 | 27 |
    | 2024-01-05 | 9 |
    | 2024-01-06 | 23 |
    | 2024-01-07 | 15 |
    | 2024-01-08 | 1 |
    +------------+----------+

    To execute this join, we need to determine the IP range within which the event_ip falls. We accomplish this with the following condition:

    SELECT
    *
    FROM
    events
    JOIN geolocation
    ON (event_ip >= start_ip and event_ip <= end_ip)

    Now, let’s consider how Spark will execute this join. On the right side (the geolocation table), there is no key by which Spark can hash and distribute the rows. It is impossible to divide this problem into smaller tasks that can be distributed across the compute cluster and performed in parallel.

    In a situation like this, Spark is forced to employ more resource-intensive join strategies. As stated by David Vrba:

    If there is no equi-condition, Spark has to use BroadcastNestedLoopJoin (BNLJ) or cartesian product (CPJ).

    Both of these strategies involve brute-forcing the problem; for every row on the left side, Spark will test the “between” condition on every single row of the right side. It has no other choice. If the table on the right is small enough, Spark can optimize by copying the right-side table to every task reading the left side, a scenario known as the BNLJ case. However, if the left side is too large, each task will need to read both the right and left sides of the table, referred to as the CPJ case. In either case, both strategies are highly costly.

    So, how can we improve this situation? The trick is to introduce an equality in the join condition. For example, we could simply unroll all the IP ranges in the geolocation table, producing a row for every IP found in the IP ranges.

    This is easily achievable in Spark; we can execute the following SQL to unroll all the IP ranges:

    SELECT
    country,
    city,
    owner,
    explode(sequence(start_ip, end_ip)) AS ip
    FROM
    geolocation

    The sequence function creates an array with the IP values from start_ip to end_ip. The explode function unrolls this array into individual rows.

    +---------+---------+---------+-----------+
    | country | city | owner | ip |
    +---------+---------+---------+-----------+
    | ca | Toronto | Telus | 1 |
    | ca | Toronto | Telus | 2 |
    | ca | Quebec | Rogers | 3 |
    | ca | Quebec | Rogers | 4 |
    | ca | Vancouver | Bell | 5 |
    | ca | Vancouver | Bell | 6 |
    | ca | Vancouver | Bell | 7 |
    | ca | Vancouver | Bell | 8 |
    | ca | Montreal | Telus | 10 |
    | ca | Montreal | Telus | 11 |
    | ca | Montreal | Telus | 12 |
    | ca | Montreal | Telus | 13 |
    | ca | Montreal | Telus | 14 |
    | ca | Ottawa | Rogers | 19 |
    | ca | Ottawa | Rogers | 20 |
    | ca | Ottawa | Rogers | 21 |
    | ca | Ottawa | Rogers | 22 |
    | ca | Calgary | Videotron | 23 |
    | ca | Calgary | Videotron | 24 |
    | ca | Calgary | Videotron | 25 |
    | ca | Calgary | Videotron | 26 |
    | ca | Calgary | Videotron | 27 |
    | ca | Calgary | Videotron | 28 |
    | ca | Calgary | Videotron | 29 |
    +---------+---------+---------+-----------+

    With a key on both sides, we can now execute an equi-join, and Spark can efficiently distribute the problem, resulting in optimal performance. However, in practice, this scenario is not realistic, as a genuine geolocation table often contains billions of rows.

    To address this, we can enhance the efficiency by increasing the coarseness of this mapping. Instead of mapping IP ranges to each individual IP, we can map the IP ranges to segments within the IP space. Let’s assume we divide the IP space into segments of 5. The segmented space would look something like this:

    +---------------+-------------+-----------+
    | segment_start | segment_end | bucket_id |
    +---------------+-------------+-----------+
    | 1 | 5 | 0 |
    | 6 | 10 | 1 |
    | 11 | 15 | 2 |
    | 16 | 20 | 3 |
    | 21 | 25 | 4 |
    | 26 | 30 | 5 |
    +---------------+-------------+-----------+

    Now, our objective is to map the IP ranges to the segments they overlap with. Similar to what we did earlier, we can unroll the IP ranges, but this time, we’ll do it in segments of 5.

    SELECT
    country,
    city,
    owner,
    explode(sequence(start_ip / 5, end_ip / 5)) AS bucket_id
    FROM
    geolocations

    We observe that certain IP ranges share a bucket_id. Ranges 1–2 and 3–4 both fall within the segment 1–5.

    +----------+--------+---------+-----------+-----------+-----------+
    | start_ip | end_ip | country | city | owner | bucket_id |
    +----------+--------+---------+-----------+-----------+-----------+
    | 1 | 2 | ca | Toronto | Telus | 0 |
    | 3 | 4 | ca | Quebec | Rogers | 0 |
    | 5 | 8 | ca | Vancouver | Bell | 1 |
    | 10 | 14 | ca | Montreal | Telus | 2 |
    | 19 | 22 | ca | Ottawa | Rogers | 3 |
    | 19 | 22 | ca | Ottawa | Rogers | 4 |
    | 23 | 29 | ca | Calgary | Videotron | 4 |
    | 23 | 29 | ca | Calgary | Videotron | 5 |
    +----------+--------+---------+-----------+-----------+-----------+

    Additionally, we notice that some IP ranges are duplicated. The last two rows for the IP range 23–29 overlap with segments 20–25 and 26–30. Similar to the scenario where we unrolled individual IPs, we are still duplicating rows, but to a much lesser extent.

    Now, we can utilize this bucketed table to perform our join.

    SELECT
    *
    FROM
    events
    JOIN geolocation
    ON (
    event_ip / 5 = bucket_id
    AND event_ip >= start_ip
    AND event_ip <= end_ip
    )

    The equality in the join enables Spark to perform a Sort Merge Join (SMJ) strategy. The “between” condition eliminates cases where IP ranges share the same bucket_id.

    In this illustration, we used segments of 5; however, in reality, we would segment the IP space into segments of 256. This is because the global IP address space is overseen by the Internet Assigned Numbers Authority (IANA), and traditionally, IANA allocates address space in blocks of 256 IPs.

    Analyzing the IP ranges in a genuine geolocation table using the Spark approx_percentile function reveals that most records have spans of less than 256, while very few are larger than 256.

    SELECT 
    approx_percentile(
    end_ip - start_ip,
    array(0.800, 0.900, 0.950, 0.990, 0.999, 0.9999),
    10000)
    FROM
    geolocation

    This implies that most IP ranges are assigned a bucket_id, while the few larger ones are unrolled, resulting in the unrolled table containing approximately an extra 10% of rows.

    A query executed with a genuine geolocation table might resemble the following:

    WITH
    b_geo AS (
    SELECT
    explode(
    sequence(
    CAST(start_ip / 256 AS INT),
    CAST(end_ip / 256 AS INT))) AS bucket_id,
    *
    FROM
    geolocation
    ),
    b_events AS (
    SELECT
    CAST(event_ip / 256 AS INT) AS bucket_id,
    *
    FROM
    events
    )

    SELECT
    *
    FROM
    b_events
    JOIN b_geo
    ON (
    b_events.bucket_id = b_geo.bucket_id
    AND b_events.event_ip >= b_geo.start_ip
    AND b_events.event_ip <= b_geo.end_ip
    );

    Conclusion

    In conclusion, this article has presented a practical demonstration of converting a non-equi join into an equi-join through the implementation of a mapping technique that involves segmenting IP ranges. It’s crucial to note that this approach extends beyond IP addresses and can be applied to any dataset characterized by bands or ranges.

    The ability to effectively map and segment data is a valuable tool in the arsenal of data engineers and analysts, providing a pragmatic solution to the challenges posed by non-equality conditions in Spark SQL joins.


    Performant IPv4 Range Spark Joins was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Performant IPv4 Range Spark Joins

    Go Here to Read this Fast! Performant IPv4 Range Spark Joins

  • Stochastic Gradient Descent: Math and Python Code

    Cristian Leo

    Deep Dive on Stochastic Gradient Descent. Algorithm, assumptions, benefits, formula, and practical implementation

    Image by DALL-E-2

    Introduction

    The image above is not just an appealing visual that drew you to this article (despite its length), but it also represents a potential journey of the SGD algorithm in search of a global minimum. In this journey, it navigates rocky paths where the height symbolizes the loss. If this doesn’t sound clear now, don’t worry, it will be by the end of this article.

    Index:
    · 1: Understanding the Basics
    1.1: What is Gradient Descent
    1.2: The ‘Stochastic’ in Stochastic Gradient Descent
    · 2: The Mechanics of SGD
    2.1: The Algorithm Explained
    2.2: Understanding Learning Rate
    · 3: SGD in Practice
    3.1: Implementing SGD in Machine Learning Models
    3.2: SGD in Sci-kit Learn and Tensorflow
    · 4: Advantages and Challenges
    4.1: Why Choose SGD?
    4.2: Overcoming Challenges in SGD
    · 5: Beyond Basic SGD
    5.1: Variants of SGD
    5.2: Future of SGD
    · Conclusion

    1: Understanding the Basics

    1.1: What is Gradient Descent

    Image by DALL-E-2

    In machine learning , Gradient Descent is a star player. It’s an optimization algorithm used to minimize a function by iteratively moving towards the steepest descent as defined by the negative of the gradient. Like in the picture, imagine you’re at the top of a mountain, and your goal is to reach the lowest point. Gradient Descent helps you find the best path down the hill.

    The beauty of Gradient Descent is its simplicity and elegance. Here’s how it works, you start with a random point on the function you’re trying to minimize, for example a random starting point on the mountain. Then, you calculate the gradient (slope) of the function at that point. In the mountain analogy, this is like looking around you to find the steepest slope. Once you know the direction, you take a step downhill in that direction, and then you calculate the gradient again. Repeat this process until you reach the bottom.

    The size of each step is determined by the learning rate. However, if the learning rate is too small, it might take a long time to reach the bottom. If it’s too large, you might overshoot the lowest point. Finding the right balance is key to the success of the algorithm.

    One of the most appealing aspects of Gradient Descent is its generality. It can be applied to almost any function, especially those where an analytical solution is not feasible. This makes it incredibly versatile in solving various types of problems in machine learning, from simple linear regression to complex neural networks.

    1.2: The ‘Stochastic’ in Stochastic Gradient Descent

    Stochastic Gradient Descent (SGD) adds a twist to the traditional gradient descent approach. The term ‘stochastic’ refers to a system or process that is linked with a random probability. Therefore, this randomness is introduced in the way the gradient is calculated, which significantly alters its behavior and efficiency compared to standard gradient descent.

    In traditional batch gradient descent, you calculate the gradient of the loss function with respect to the parameters for the entire training set. As you can imagine, for large datasets, this can be quite computationally intensive and time-consuming. This is where SGD comes into play. Instead of using the entire dataset to calculate the gradient, SGD randomly selects just one data point (or a few data points) to compute the gradient in each iteration.

    Think of this process as if you were again descending a mountain, but this time in thick fog with limited visibility. Rather than viewing the entire landscape to decide your next step, you make your decision based on where your foot lands next. This step is small and random, but it’s repeated many times, each time adjusting your path slightly in response to the immediate terrain under your feet.

    This stochastic nature of the algorithm provides several benefits:

    • Speed: By using only a small subset of data at a time, SGD can make rapid progress in reducing the loss, especially for large datasets.
    • Escape from Local Minima: The randomness helps SGD to potentially escape local minima, a common problem in complex optimization problems.
    • Online Learning: SGD is well-suited for online learning, where the model needs to be updated as new data comes in, due to its ability to update the model incrementally.

    However, the stochastic nature also introduces variability in the path to convergence. The algorithm doesn’t smoothly descend towards the minimum; rather, it takes a more zigzag path, which can sometimes make the convergence process appear erratic.

    2: The Mechanics of SGD

    2.1: The Algorithm Explained

    Stochastic Gradient Descent (SGD) might sound complex, but its algorithm is quite straightforward when broken down. Here’s a step-by-step guide to understanding how SGD works:

    Initialization (Step 1)
    First, you initialize the parameters (weights) of your model. This can be done randomly or by some other initialization technique. The starting point for SGD is crucial as it influences the path the algorithm will take.

    Random Selection (Step 2)
    In each iteration of the training process, SGD randomly selects a single data point (or a small batch of data points) from the entire dataset. This randomness is what makes it ‘stochastic’.

    Compute the Gradient (Step 3)
    Calculate the gradient of the loss function, but only for the randomly selected data point(s). The gradient is a vector that points in the direction of the steepest increase of the loss function. In the context of SGD, it tells you how to tweak the parameters to make the model more accurate for that particular data point.

    Gradient Formula

    Here, ∇θJ(θ) represents the gradient of the loss function J(θ) with respect to the parameters θ. This gradient is a vector of partial derivatives, where each component of the vector is the partial derivative of the loss function with respect to the corresponding parameter in θ.

    Update the Parameters (Step 4)
    Adjust the model parameters in the opposite direction of the gradient. Here’s where the learning rate η plays a crucial role. The formula for updating each parameter is:

    where:

    • θnew​ represents the updated parameters.
    • θold​ represents the current parameters before the update.
    • η is the learning rate, a positive scalar determining the size of the step in the direction of the negative gradient.
    • θJ(θ) is the gradient of the loss function J(θ) with respect to the parameters θ.

    The learning rate determines the size of the steps you take towards the minimum. If it’s too small, the algorithm will be slow; if it’s too large, you might overshoot the minimum.

    Repeat until convergence (Step 5)
    Repeat steps 2 to 4 for a set number of iterations or until the model performance stops improving. Each iteration provides a slightly updated model.
    Ideally, after many iterations, SGD converges to a set of parameters that minimize the loss function, although due to its stochastic nature, the path to convergence is not as smooth and may oscillate around the minimum.

    2.2: Understanding Learning Rate

    One of the most crucial hyperparameters in the Stochastic Gradient Descent (SGD) algorithm is the learning rate. This parameter can significantly impact the performance and convergence of the model. Understanding and choosing the right learning rate is a vital step in effectively employing SGD.

    What is Learning Rate?
    At this point you should have an idea of what learning rate is, but let’s better define it for clarity. The learning rate in SGD determines the size of the steps the algorithm takes towards the minimum of the loss function. It’s a scalar that scales the gradient, dictating how much the weights in the model should be adjusted during each update. If you visualize the loss function as a valley, the learning rate decides how big a step you take with each iteration as you walk down the valley.

    Too High Learning Rate
    If the learning rate is too high, the steps taken might be too large. This can lead to overshooting the minimum, causing the algorithm to diverge or oscillate wildly without finding a stable point.
    Think of it as taking leaps in the valley and possibly jumping over the lowest point back and forth.

    Too Low Learning Rate
    On the other hand, a very low learning rate leads to extremely small steps. While this might sound safe, it significantly slows down the convergence process.
    In a worst-case scenario, the algorithm might get stuck in a local minimum or even stop improving before reaching the minimum.
    Imagine moving so slowly down the valley that you either get stuck or it takes an impractically long time to reach the bottom.

    Finding the Right Balance
    The ideal learning rate is neither too high nor too low but strikes a balance, allowing the algorithm to converge efficiently to the global minimum.
    Typically, the learning rate is chosen through experimentation and is often set to decrease over time. This approach is called learning rate annealing or scheduling.

    Learning Rate Scheduling
    Learning rate scheduling involves adjusting the learning rate over time. Common strategies include:

    • Time-Based Decay: The learning rate decreases over each update.
    • Step Decay: Reduce the learning rate by some factor after a certain number of epochs.
    • Exponential Decay: Decrease the learning rate exponentially.
    • Adaptive Learning Rate: Methods like AdaGrad, RMSProp, and Adam adjust the learning rate automatically during training.

    3: SGD in Practice

    3.1: Implementing SGD in Machine Learning Models

    Link to the full code (Jupyter Notebook): https://github.com/cristianleoo/models-from-scratch-python/blob/main/sgd.ipynb

    Implementing Stochastic Gradient Descent (SGD) in machine learning models is a practical step that brings the theoretical aspects of the algorithm into real-world application. This section will guide you through the basic implementation of SGD and provide tips for integrating it into machine learning workflows.

    Now let’s consider a simple case of SGD applied to Linear Regression:

    class SGDRegressor:
    def __init__(self, learning_rate=0.01, epochs=100, batch_size=1, reg=None, reg_param=0.0):
    """
    Constructor for the SGDRegressor.

    Parameters:
    learning_rate (float): The step size used in each update.
    epochs (int): Number of passes over the training dataset.
    batch_size (int): Number of samples to be used in each batch.
    reg (str): Type of regularization ('l1' or 'l2'); None if no regularization.
    reg_param (float): Regularization parameter.

    The weights and bias are initialized as None and will be set during the fit method.
    """
    self.learning_rate = learning_rate
    self.epochs = epochs
    self.batch_size = batch_size
    self.reg = reg
    self.reg_param = reg_param
    self.weights = None
    self.bias = None

    def fit(self, X, y):
    """
    Fits the SGDRegressor to the training data.

    Parameters:
    X (numpy.ndarray): Training data, shape (m_samples, n_features).
    y (numpy.ndarray): Target values, shape (m_samples,).

    This method initializes the weights and bias, and then updates them over a number of epochs.
    """
    m, n = X.shape # m is number of samples, n is number of features
    self.weights = np.zeros(n)
    self.bias = 0

    for _ in range(self.epochs):
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

    for i in range(0, m, self.batch_size):
    X_batch = X_shuffled[i:i+self.batch_size]
    y_batch = y_shuffled[i:i+self.batch_size]

    gradient_w = -2 * np.dot(X_batch.T, (y_batch - np.dot(X_batch, self.weights) - self.bias)) / self.batch_size
    gradient_b = -2 * np.sum(y_batch - np.dot(X_batch, self.weights) - self.bias) / self.batch_size

    if self.reg == 'l1':
    gradient_w += self.reg_param * np.sign(self.weights)
    elif self.reg == 'l2':
    gradient_w += self.reg_param * self.weights

    self.weights -= self.learning_rate * gradient_w
    self.bias -= self.learning_rate * gradient_b

    def predict(self, X):
    """
    Predicts the target values using the linear model.

    Parameters:
    X (numpy.ndarray): Data for which to predict target values.

    Returns:
    numpy.ndarray: Predicted target values.
    """
    return np.dot(X, self.weights) + self.bias

    def compute_loss(self, X, y):
    """
    Computes the loss of the model.

    Parameters:
    X (numpy.ndarray): The input data.
    y (numpy.ndarray): The true target values.

    Returns:
    float: The computed loss value.
    """
    return (np.mean((y - self.predict(X)) ** 2) + self._get_regularization_loss()) ** 0.5

    def _get_regularization_loss(self):
    """
    Computes the regularization loss based on the regularization type.

    Returns:
    float: The regularization loss.
    """
    if self.reg == 'l1':
    return self.reg_param * np.sum(np.abs(self.weights))
    elif self.reg == 'l2':
    return self.reg_param * np.sum(self.weights ** 2)
    else:
    return 0

    def get_weights(self):
    """
    Returns the weights of the model.

    Returns:
    numpy.ndarray: The weights of the linear model.
    """
    return self.weights

    Let’s break it down into smaller steps:

    Initialization (Step 1)

    def __init__(self, learning_rate=0.01, epochs=100, batch_size=1, reg=None, reg_param=0.0):
    self.learning_rate = learning_rate
    self.epochs = epochs
    self.batch_size = batch_size
    self.reg = reg
    self.reg_param = reg_param
    self.weights = None
    self.bias = None

    The constructor (__init__ method) initializes the SGDRegressor with several parameters:

    • learning_rate: The step size used in updating the model.
    • epochs: The number of passes over the entire dataset.
    • batch_size: The number of samples used in each batch for SGD.
    • reg: The type of regularization (either ‘l1’ or ‘l2’; None if no regularization is used).
    • reg_param: The regularization parameter.
    • weights and bias are set to None initially and will be initialized in the fit method.

    Fit the Model(Step 2)

    def fit(self, X, y):
    m, n = X.shape # m is number of samples, n is number of features
    self.weights = np.zeros(n)
    self.bias = 0

    for _ in range(self.epochs):
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

    for i in range(0, m, self.batch_size):
    X_batch = X_shuffled[i:i+self.batch_size]
    y_batch = y_shuffled[i:i+self.batch_size]

    gradient_w = -2 * np.dot(X_batch.T, (y_batch - np.dot(X_batch, self.weights) - self.bias)) / self.batch_size
    gradient_b = -2 * np.sum(y_batch - np.dot(X_batch, self.weights) - self.bias) / self.batch_size

    if self.reg == 'l1':
    gradient_w += self.reg_param * np.sign(self.weights)
    elif self.reg == 'l2':
    gradient_w += self.reg_param * self.weights

    self.weights -= self.learning_rate * gradient_w
    self.bias -= self.learning_rate * gradient_b

    This method fits the model to the training data. It starts by initializing weights as a zero vector of length n (number of features) and bias to zero. The model’s parameters are updated over a number of epochs through SGD.

    Random Selection and Batches(Step 3)

    for _ in range(self.epochs):
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

    In each epoch, the data is shuffled, and batches are created to update the model parameters using SGD.

    Compute the Gradient and Update the parameters (Step 4)

    gradient_w = -2 * np.dot(X_batch.T, (y_batch - np.dot(X_batch, self.weights) - self.bias)) / self.batch_size
    gradient_b = -2 * np.sum(y_batch - np.dot(X_batch, self.weights) - self.bias) / self.batch_size

    Gradients for weights and bias are computed in each batch. These are then used to update the model’s weights and bias. If regularization is used, it’s also included in the gradient calculation.

    Repeat and converge (Step 5)

    def predict(self, X):

    return np.dot(X, self.weights) + self.bias

    The predict method calculates the predicted target values using the learned linear model.

    Compute Loss (Step 6)

    def compute_loss(self, X, y):
    return (np.mean((y - self.predict(X)) ** 2) + self._get_regularization_loss()) ** 0.5

    It calculates the mean squared error between the predicted values and the actual target values y. Additionally, it incorporates the regularization loss if regularization is specified.

    Regularization Loss Calculation (Step 7)

    def _get_regularization_loss(self):
    if self.reg == 'l1':
    return self.reg_param * np.sum(np.abs(self.weights))
    elif self.reg == 'l2':
    return self.reg_param * np.sum(self.weights ** 2)
    else:
    return 0

    This private method computes the regularization loss based on the type of regularization (l1 or l2) and the regularization parameter. This loss is added to the main loss function to penalize large weights, thereby avoiding overfitting.

    3.2: SGD in Sci-kit Learn and Tensorflow

    Now, while the code above is very useful for educational purposes, data scientists definitely don’t use it on a daily basis. Indeed, we can directly call SGD with few lines of code from popular libraries such as scikit learn (machine learning) or tensorflow (deep learning).

    SGD for linear regression in scikit-learn

    from sklearn.linear_model import SGDRegressor

    # Create and fit the model
    model = SGDRegressor(max_iter=1000)
    model.fit(X, y)

    # Making predictions
    predictions = model.predict(X)

    SGD regressor is directly called from sklearn library, and follows the same structure of other algorithms in the same library.
    The parameter ‘max_iter’ is the number of epochs (rounds). By specifying max_iter to 1000 we will make the algorithm update the linear regression weights and bias 1000 times.

    Neural Network with SGD optimization in Tensorflow

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import SGD

    # Create a simple neural network model
    model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(1)
    ])

    sgd = SGD(learning_rate=0.01)

    # Compile the model with SGD optimizer
    model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit(X, y, epochs=10)

    In this code we are defining a Neural Network with one Dense Layer and 64 nodes. However, besides the specifics of the neural network, here we are again calling SGD with just two lines of code:

    from tensorflow.keras.optimizers import SGD
    sgd = SGD(learning_rate=0.01)

    4: Advantages and Challenges

    4.1: Why Choose SGD?

    Efficiency with Large Datasets:
    Scalability
    : One of the primary advantages of SGD is its efficiency in handling large-scale data. Since it updates parameters using only a single data point (or a small batch) at a time, it is much less memory-intensive than algorithms requiring the entire dataset for each update.
    Speed: By frequently updating the model parameters, SGD can converge more quickly to a good solution, especially in cases where the dataset is enormous.

    Flexibility and Adaptability:
    Online Learning
    : SGD’s ability to update the model incrementally makes it well-suited for online learning, where the model needs to adapt continuously as new data arrives.
    Handling Non-Static Datasets: For datasets that change over time, SGD’s incremental update approach can adjust to these changes more effectively than batch methods.

    Overcoming Challenges of Local Minima:
    The stochastic nature of SGD helps it to potentially escape local minima, a significant challenge in many optimization problems. The random fluctuations allow the algorithm to explore a broader range of the solution space.

    General Applicability:
    SGD can be applied to a wide range of problems and is not limited to specific types of models. This general applicability makes it a versatile tool in the machine learning toolbox.

    Simplicity and Ease of Implementation:
    Despite its effectiveness, SGD remains relatively simple to understand and implement. This ease of use is particularly appealing for those new to machine learning.

    Improved Generalization:
    By updating the model frequently with a high degree of variance, SGD can often lead to models that generalize better on unseen data. This is because the algorithm is less likely to overfit to the noise in the training data.

    Compatibility with Advanced Techniques:
    SGD is compatible with a variety of enhancements and extensions, such as momentum, learning rate scheduling, and adaptive learning rate methods like Adam, which further improve its performance and versatility.

    4.2: Overcoming Challenges in SGD

    While Stochastic Gradient Descent (SGD) is a powerful and versatile optimization algorithm, it comes with its own set of challenges. Understanding these hurdles and knowing how to overcome them can greatly enhance the performance and reliability of SGD in practical applications.

    Choosing the Right Learning Rate
    Selecting an appropriate learning rate is crucial for SGD. If it’s too high, the algorithm may diverge; if it’s too low, it might take too long to converge or get stuck in local minima.
    Use a learning rate schedule or adaptive learning rate methods. Techniques like learning rate annealing, where the learning rate decreases over time, can help strike the right balance.

    Dealing with Noisy Updates
    The stochastic nature of SGD leads to noisy updates, which can cause the algorithm to be less stable and take longer to converge.
    Implement mini-batch SGD, where the gradient is computed on a small subset of the data rather than a single data point. This approach can reduce the variance in the updates.

    Risk of Local Minima and Saddle Points
    In complex models, SGD can get stuck in local minima or saddle points, especially in high-dimensional spaces.
    Use techniques like momentum or Nesterov accelerated gradients to help the algorithm navigate through flat regions and escape local minima.

    Sensitivity to Feature Scaling
    SGD is sensitive to the scale of the features, and having features on different scales can make the optimization process inefficient.
    Normalize or standardize the input features so that they are on a similar scale. This practice can significantly improve the performance of SGD.

    Hyperparameter Tuning
    SGD requires careful tuning of hyperparameters, not just the learning rate but also parameters like momentum and the size of the mini-batch.
    Utilize grid search, random search, or more advanced methods like Bayesian optimization to find the optimal set of hyperparameters.

    Overfitting
    Like any machine learning algorithm, there’s a risk of overfitting, where the model performs well on training data but poorly on unseen data.
    Use regularization techniques such as L1 or L2 regularization, and validate the model using a hold-out set or cross-validation.

    5: Beyond Basic SGD

    5.1: Variants of SGD

    Stochastic Gradient Descent (SGD) has several variants, each designed to address specific challenges or to improve upon the basic SGD algorithm in certain aspects. These variants enhance SGD’s efficiency, stability, and convergence rate. Here’s a look at some of the key variants:

    Mini-Batch Gradient Descent
    This is a blend of batch gradient descent and stochastic gradient descent. Instead of using the entire dataset (as in batch GD) or a single sample (as in SGD), it uses a mini-batch of samples.
    It reduces the variance of the parameter updates, which can lead to more stable convergence. It can also take advantage of optimized matrix operations, which makes it more computationally efficient.

    Momentum SGD
    Momentum is an approach that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction of the previous update vector to the current update.
    It helps in faster convergence and reduces oscillations. It is particularly useful for navigating the ravines of the cost function, where the surface curves much more steeply in one dimension than in another.

    Nesterov Accelerated Gradient (NAG)
    A variant of momentum SGD, Nesterov momentum is a technique that makes a more informed update by calculating the gradient of the future approximate position of the parameters.
    It can speed up convergence and improve the performance of the algorithm, particularly in the context of convex functions.

    Adaptive Gradient (Adagrad)
    Adagrad adapts the learning rate to each parameter, giving parameters that are updated more frequently a lower learning rate.
    It’s particularly useful for dealing with sparse data and is well-suited for problems where data is scarce or features have very different frequencies.

    RMSprop
    RMSprop (Root Mean Square Propagation) modifies Adagrad to address its radically diminishing learning rates. It uses a moving average of squared gradients to normalize the gradient.
    It works well in online and non-stationary settings and has been found to be an effective and practical optimization algorithm for neural networks.

    Adam (Adaptive Moment Estimation)
    Adam combines ideas from both Momentum and RMSprop. It computes adaptive learning rates for each parameter.
    Adam is often considered as a default optimizer due to its effectiveness in a wide range of applications. It’s particularly good at solving problems with noisy or sparse gradients.

    Each of these variants has its own strengths and is suited for specific types of problems. Their development reflects the ongoing effort in the machine learning community to refine and enhance optimization algorithms to achieve better and faster results. Understanding these variants and their appropriate applications is crucial for anyone looking to delve deeper into machine learning optimization techniques.

    5.2: Future of SGD

    As we delve into the future of Stochastic Gradient Descent (SGD), it’s clear that this algorithm continues to evolve, reflecting the dynamic and innovative nature of the field of machine learning. The ongoing research and development in SGD focus on enhancing its efficiency, accuracy, and applicability to a broader range of problems. Here are some key areas where we can expect to see significant advancements:

    Automated Hyperparameter Tuning
    There’s increasing interest in automating the process of selecting optimal hyperparameters, including the learning rate, batch size, and other SGD-specific parameters.
    This automation could significantly reduce the time and expertise required to effectively deploy SGD, making it more accessible and efficient.

    Integration with Advanced Models
    As machine learning models become more complex, especially with the growth of deep learning, there’s a need to adapt and optimize SGD for these advanced architectures.
    Enhanced versions of SGD that are tailored for complex models can lead to faster training times and improved model performance.

    Adapting to Non-Convex Problems
    Research is focusing on making SGD more effective for non-convex optimization problems, which are prevalent in real-world applications.
    Improved strategies for dealing with non-convex landscapes could lead to more robust and reliable models in areas like natural language processing and computer vision.

    Decentralized and Distributed SGD
    With the increase in distributed computing and the need for privacy-preserving methods, there’s a push towards decentralized SGD algorithms that can operate over networks.
    This approach can lead to more scalable and privacy-conscious machine learning solutions, particularly important for big data applications.

    Quantum SGD
    The advent of quantum computing presents an opportunity to explore quantum versions of SGD, leveraging quantum algorithms for optimization.
    Quantum SGD has the potential to dramatically speed up the training process for certain types of models, though this is still largely in the research phase.

    SGD in Reinforcement Learning and Beyond
    Adapting and applying SGD in areas like reinforcement learning, where the optimization landscapes are different from traditional supervised learning tasks.
    This could open new avenues in developing more efficient and powerful reinforcement learning algorithms.

    Ethical and Responsible AI
    There’s a growing awareness of the ethical implications of AI models, including those trained using SGD.
    Research into SGD might also focus on ensuring that models are fair, transparent, and responsible, aligning with broader societal values.

    Conclusion

    As we wrap up our exploration of Stochastic Gradient Descent (SGD), it’s clear that this algorithm is much more than just a method for optimizing machine learning models. It stands as a testament to the ingenuity and continuous evolution in the field of artificial intelligence. From its basic form to its more advanced variants, SGD remains a critical tool in the machine learning toolkit, adaptable to a wide array of challenges and applications.

    If you liked the article please leave a clap, and let me know in the comments what you think about it!


    Stochastic Gradient Descent: Math and Python Code was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Stochastic Gradient Descent: Math and Python Code

    Go Here to Read this Fast! Stochastic Gradient Descent: Math and Python Code

  • From Data Scientist to AI Developer: Lessons Building an Generative AI Web App in 2023

    From Data Scientist to AI Developer: Lessons Building an Generative AI Web App in 2023

    Isaac Tham

    From Data Scientist to AI Developer: Lessons Building a Generative AI Web App in 2023

    A guide of technical tips for any data science enthusiast wanting to build an AI web app serving thousands of users

    Source: DALLE-3

    If you, like me, ventured into the world of data science (be it through college or one of the countless online courses), you may have harbored the dream of creating an ML/AI software product that people can use. A product just like those our CS friends seem to code up effortlessly.

    But if you ever tried your hand at full-stack web development, you’d soon face the seemingly insurmountable hurdles of configuration, deployment, terminal commands and servers and so on.

    Just one of the many frustrated convos with my college roommate during the initial stages coding my app. Image by author.

    I know this all too well, having spent countless hours floundering helplessly, which only deepened my inferiority complex that I’d never manage to craft a functioning software app.

    But exactly one year ago, on the 21st of January, a weekend unexpectedly made free by passport-troubles and a cancelled trip, I embarked on a journey to make an AI app. It was a journey that led me to unexpected places — teaming up with a co-founder halfway across the world, joining a San Francisco startup accelerator, and eventually growing to thousands of users with a significant annual revenue (check out my app, Podsmart! we summarize podcasts).

    My app’s demo page on the Buildspace startup accelerator. Source: https://buildspace.so/s3/demoday/Podsmart

    But most importantly, it was a journey full of frustrations, backtracking, mistakes and rework. It was about navigating the bewildering world of software development without a formal CS/SWE background.

    So, looking back at the past year building my first software product, I’ve compiled a guide of some technical tips — this is for any data science enthusiast who wants to build a functional web app serving thousands of users.

    This guide is born from my struggles and learnings over a year, and represents advice I would have loved to tell my one-year-younger self.

    Disclaimer: These tips are from my specific personal experiences, and may work differently for others. I also do not have any relationship or affiliation with any of the tools recommended here.

    Table of Contents

    · What you want to build
    · The dangers of YouTube web development tutorials
    Tip #1: Use Next.js instead of React
    Tip #2: Opt for Tailwind CSS instead of Bootstrap for styling
    · The trappings of the data science mindset
    Tip #3: Choose FastAPI over Flask for your backend, and rigorously define response models
    Tip #4: Use TypeScript instead of JavaScript
    · About deployment…
    Tip #5: Use Modal for GPU backend
    Tip #6: Use AWS Lambda for backend deployment and Vercel for frontend
    · Making life easier
    Tip #7: don’t build your own landing page using React
    Tip #8: Firebase + Stripe for user authentication and payments
    Tip #9: Implement Sentry for error monitoring
    · Conclusion

    What you want to build

    To build said functional web app, you need a web interface (frontend or client) for users to interact with, as well as a server (backend) which does data processing, data storage, and calling the ML/AI models.

    (You might have heard of Streamlit. It’s great for the simplest demos, but it really lacks the customizability to make a viable production app)

    The dangers of YouTube web development tutorials

    As a data scientist, many aspects of software development fill me with trepidation, such as the prospect of wasting days on broken configuration. Nothing is more frustrating than seeing something break and not know why it broke and how to fix it.

    As a result, I relied desperately on walkthrough-style tutorials, especially on YouTube, that depicted the entire process, from start to end, of setting up a React project, deploying a backend or website etc.

    Looking back, there are two main downsides to this:

    Firstly, confusion at multiple conflicting and potentially outdated tutorials (for instance, as newer versions of React come out). This has often led to me following a tutorial until realizing it no longer works.

    Secondly, most tutorials are aimed at building cool classroom demos which are beginner-friendly. Hence, they use frameworks and reinforce coding patterns which have a low performance ceiling, which will be lacking for production and scaling. On hindsight, I’ve picked up many bad coding habits from YouTube tutorials, that are now obstacles to further developing my app as a live product serving thousands of users.

    Since you learn best from failures, this process, though frustrating, was a massive learning experience for me throughout the year. Hopefully you can save lots of time learning from my failures.

    Tip #1: Use Next.js instead of React

    Searching for ‘full stack app tutorial’ on YouTube gives you lots of React tutorials. Source: https://www.youtube.com/results?search_query=full+stack+app+tutorial

    Many YouTube tutorials advocate for React, and I initially followed suit.

    However, eventually I wanted to improve my site’s SEO performance — which is crucial to gaining more users. React’s limitations, such as inability to change meta tags dynamically, and lack of server-side rendering, were frustrating, necessitating a tedious change to Next.js. After switching, the differences in performance were just night-and-day.

    Vercel has lots of Next.js templates for you to jumpstart your web development. Source: https://vercel.com/templates/next.js

    Some people say React is more beginner-friendly, but there are lots of Next.js templates online, for example by Vercel (Next.js creators), especially AI applications. Next.js is really the modern web framework used for nearly every AI application.

    Tip #2: Opt for Tailwind CSS instead of Bootstrap for styling

    Embarking on my front-end UI journey, I initially, and somewhat naively, followed the herd of frontend tutorials, towards Bootstrap. Its allure? The promise of ease with ready-made components like dropdowns and accordions.

    The ‘Bootstrap look’ — how ugly my website looked on Feb 20, 2023. Image by author.

    However, after a while, I realized that my website just looked … really ugly, especially when compared to the sleek, modern AI demo pages out there. There was this unmistakable ‘Bootstrap look’ — a sort of aesthetic stubbornness that resisted customization, entangled in a web of confusingly named CSS classes. So eventually, I once again bit the bullet and redid my entire frontend with Tailwind CSS, taking 3 whole days.

    This AI demo page was definitely not built by Bootstrap. Source: restorephotos.io

    If you’ve ever seen an AI demo page with a modern and clean UI, it’s highly likely they used Tailwind CSS.

    Tailwind CSS and its utility classes make customizing every component extremely easy. Image by author.

    Initially, I was intimidated by Tailwind — its long component definitions brimming with what seemed like cryptic utility classes appeared anything but beginner-friendly… I thought that Tailwind lacked pre-built components and it would be onerous to memorize the utility classes. However, this couldn’t be more untrue! There are many great UI component libraries built on Tailwind CSS — I used Flowbite React (it has all the components I need!)

    The trappings of the data science mindset

    As a data science student, I’ve grown to love Python with its minimalist, powerful code syntax. Python’s type-inference spared me the tedium of defining types for every variable (a task I found cumbersome, especially in languages I encountered in intro CS classes like Java).

    Hence, I used JavaScript for my frontend and Python for my backend, avoiding defining the types of my API endpoints unless necessary.

    However, as my app grew in complexity, tons of unexpected type errors between my frontend and backend eroded my coding productivity. I’m finally understanding my CS friends’ insistence on the importance of explicit types. It turns out, the meticulous detail in type definition isn’t just pedantic — it’s essential.

    Tip #3: Choose FastAPI over Flask for your backend, and rigorously define response models

    If you search for Python backend tutorials on YouTube, most videos would point you to Flask. Just like how a broken clock is right twice a day, I somehow happened to choose FastAPI as my Python backend, which was definitely correct decision on hindsight.

    (Though hilariously, I had totally disregarded the benefit of FastAPI. Until only recently, I didn’t understand the need to define Pydantic classes for POST requests and thought it more of a hassle than a help.)

    FastAPI has several game-changing benefits:

    • automatically-generated API documentation — this will be very useful for future engineers you onboard (or your future self) to understand the backend structure!
    • easier to write code — since FastAPI is built on Json schema, defining routes is much easier and shorter using FastAPI than Flask — resultantly, there’s lower learning curve for newbies like me
    • better performance — FastAPI is apparently much faster than Flask and consumes less memory — which is great as my app sends around large payloads
    Use Pydantic to build data models, which you can use to define response types for your FastAPI routes. Image by author.

    But the most important thing is FastAPI’s type annotations.

    • FastAPI is built on Pydantic, a data validation library allowing you to define the ‘shape’ of data as classes with attributes.
    • With FastAPI, you can annotate the input and output types for each API route, using Python type hints and Pydantic-defined classes.

    This ensures that each route has outputs of a consistent data structure. But to unleash the full power of this feature, we need to…

    Tip #4: Use TypeScript instead of JavaScript

    For the longest time, I’ve manually written my frontend fetcher methods (once again learning from full-stack tutorials), hence adding new routes to my app was a long and error-prone process.

    You can hence imagine my shock when my big-tech SWE friend told me that you can auto-generate Typescript client code using your API specification. (see here for more FastAPI’s documentation, one such package is openapi-typescript-codegen)

    With auto-generated TypeScript client code, your fetcher methods have autocompletion and documentation based on your FastAPI endpoint response models. Image by author.

    In an instant, I realized that this would solve two major challenges simultaneously: removing my manual and error-prone client fetcher writing, and ensuring type consistency between my backend and frontend. This significantly reduced the persistent type errors that were undermining my app’s reliability.

    Of course, having type constraints for your backend routes only helps if your frontend enforces those type constraints — which naturally requires TypeScript.

    Hence, I’m currently undergoing the arduous process of defining response models for my FastAPI backend, and converting my frontend from JavaScript to TypeScript, a process that you can avoid if you start with FastAPI and TypeScript from the start!

    About deployment…

    Through my data science / ML classes, I’ve grown used to hopping onto Google Colab, pressing play, and voila, the code runs. So, it’s no surprise that the very thought of deployment fills me with dread. But as the founder of the Buildspace accelerator puts it, you need to “GTFOL” (Get The F Off Localhost) to make your software apps accessible to the world. Hence, I naturally wanted the deployment to be as painless as possible.

    Tip #5: Use Modal for GPU backend

    If you want to deploy your own models (e.g. ML models, image recognition, Whisper for transcription, or more recently, open-source LLMs like Llama), you will need a GPU cloud provider to host your model.

    My advice is to choose Modal and never look back.

    Modal stands out with its superb documentation and learning resources, complete with up-to-date sample code for the latest applications — from fine-tuning open-source LLMs to serving LLM chatbots, and more.

    I actually started my entire podcast-transcribing app forking Modal’s sample audio-transcription code, and so it isn’t an exaggeration to say that without Modal I wouldn’t have built my app.

    Modal’s dashboard is very user-friendly when monitoring and error tracking. Image by author. Source: modal.com

    Modal shines in its user-friendliness (and coming from someone who loathes deployment, that’s saying a lot). Just write cloud functions on my local code editor, and deploy it using one terminal command. Its dashboard is so user-friendly (especially compared to AWS), allowing me to track my app’s usage, analyze performance, and trace errors very easily.

    Last of all, Modal serves as my escape valve when it comes to functionality that Lambda doesn’t have, or is tedious to implement, e.g. file storage (this will come in useful in the next point…) and scheduling functions.

    Tip #6: Use AWS Lambda for backend deployment and Vercel for frontend

    When hosting my Python backend, I was confused over whether to use Amazon EC2 or AWS Lambda. My app requires the storage of audio files (which could get big), and since Lambda’s serverless architecture isn’t meant to store files (it had 2 GB of ephemeral storage, but it isn’t persistent), I had thought I had to use Amazon EC2. However, EC2 was much more cumbersome to configure, and being an always-on dedicated instance, it would be much more expensive and difficult to scale.

    This is where Modal’s free file storage came into the rescue, and I was able to structure my backend to be compatible with Lambda, while downloading and storing files when needed on Modal.

    Thankfully, this video was really good, and following their instructions exactly enabled me to successfully deploy my backend.

    For my frontend, Vercel was all I needed. The process was hassle-free and, aside from domain name costs, entirely free.

    Making life easier

    The last 3 miscellaneous tips that would save you from wasting massive amounts of time in development…

    Tip #7: don’t build your own landing page using React

    Yet another mistake I did because all those full-stack tutorials fooled me into thinking I had to code my own landing page with React. Sure, you can (and I did), but there’s a low ceiling of performance and aesthetics — precisely the important traits you need for a successful landing page.

    React is only better for custom functionality like the actual AI app interface. For the landing page with purely static content, you should instead, use no-code site builders like Webflow or Framer to rapidly build landing pages (and outsource landing page creation to your designer so you can work on other things!)

    Tip #8: Firebase + Stripe for user authentication and payments

    When it comes to user authentication, the number of options and tutorials out there can once again be overwhelming. I needed a solution that not only handled authentication but also integrated with a payment system to control access based on user subscription status.

    After spending days trying and failing to use several different authentication solutions e.g. auth0, I found that Stripe + Firebase worked well. Firebase has a Stripe integration that updates users’ subscription status upon successful payment, and Firebase’s React client does client-side authentication, and Python client does server access control well. Following these two videos (here and here) enabled me to successfully implement this on my app.

    Tip #9: Implement Sentry for error monitoring

    For months, I had no clue what bugs users encountered with my app in production. Only when myself or a user spots a bug, do I comb through AWS Cloudwatch interface to try to find the backend bug.

    Sentry tracks errors in your apps in production (both frontend and backend). Image by author. Source: sentry.io.

    This continued until my co-founder introduced me to Sentry, a tool for performance monitoring and error tracking of cloud apps. It’s really easy to initialize for your frontend and backend, and you can even integrate it with Slack to get instant error notifications. Just be careful not to deplete your free plan’s monthly error budget on a trivial but frequent error like authentication timeout. That’s what happened to me — and I had to subscribe to the paid plan to find logs for the important bugs I actually wanted to solve.

    Bonus Tip #10: don’t try to build a web app using Spotify’s API! I wasted my app for 2 months assuming I could integrate Spotify’s API to allow users to load their saved podcasts. But to productionize this, you need to apply for a quota extension request, which takes more than a month for Spotify to review. And they’ll probably reject the application anyway if your app involves any AI/ML model (despite my app not actually using Spotify data to train any model, the wording that is prohibited in their Developer Policy).

    Conclusion

    Thank you for reading this technical guide! I hope this guide demystifies some aspects of web app development for fellow data science enthusiasts.

    If you found this post helpful, let me know in the comments what other topics related to data science and AI you’d like me to explore!

    Lastly, I’d love for you to check out my app — Podsmart transcribes and summarizes podcasts and YouTube videos to help you discover knowledge from the audio medium, saving busy intellectuals hours of listening. We’re expanding our team and looking for frontend and backend developers, reach out if you’re interested.


    From Data Scientist to AI Developer: Lessons Building an Generative AI Web App in 2023 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    From Data Scientist to AI Developer: Lessons Building an Generative AI Web App in 2023

    Go Here to Read this Fast! From Data Scientist to AI Developer: Lessons Building an Generative AI Web App in 2023

  • Build enterprise-ready generative AI solutions with Cohere foundation models in Amazon Bedrock and Weaviate vector database on AWS Marketplace

    Build enterprise-ready generative AI solutions with Cohere foundation models in Amazon Bedrock and Weaviate vector database on AWS Marketplace

    James Yi

    This post discusses how enterprises can build accurate, transparent, and secure generative AI applications while keeping full control over proprietary data. The proposed solution is a RAG pipeline using an AI-native technology stack, whose components are designed from the ground up with AI at their core, rather than having AI capabilities added as an afterthought. We demonstrate how to build an end-to-end RAG application using Cohere’s language models through Amazon Bedrock and a Weaviate vector database on AWS Marketplace.

    Originally appeared here:
    Build enterprise-ready generative AI solutions with Cohere foundation models in Amazon Bedrock and Weaviate vector database on AWS Marketplace

    Go Here to Read this Fast! Build enterprise-ready generative AI solutions with Cohere foundation models in Amazon Bedrock and Weaviate vector database on AWS Marketplace