Category: Artificial Intelligence

  • Using PCA for Outlier Detection

    Using PCA for Outlier Detection

    W Brett Kennedy

    A surprisingly effective means to identify outliers in numeric data

    PCA (principal component analysis) is commonly used in data science, generally for dimensionality reduction (and often for visualization), but it is actually also very useful for outlier detection, which I’ll describe in this article.

    This article continues my series on outlier detection, which also includes articles on FPOF, Counts Outlier Detector, Distance Metric Learning, Shared Nearest Neighbors, and Doping. This also includes another excerpt from my book Outlier Detection in Python.

    The idea behind PCA is that most datasets have much more variance in some columns than others, and also have correlations between the features. An implication of this is: to represent the data, it’s often not necessary to use as many features as we have; we can often approximate the data quite well using fewer features — sometimes far fewer. For example, with a table of numeric data with, say, 100 features, we may be able to represent the data reasonably well using perhaps 30 or 40 features, possibly fewer, and possibly far fewer.

    To allow for this, PCA transforms the data into a different coordinate system, where the dimensions are known as components.

    Given the issues we often face with outlier detection due to the curse of dimensionality, working with fewer features can be very beneficial. As described in Shared Nearest Neighbors and Distance Metric Learning for Outlier Detection, working with many features can make outlier detection unreliable; among the issues with high-dimensional data is that it leads to unreliable distance calculations between points (which many outlier detectors rely on). PCA can mitigate these effects.

    As well, and surprisingly, using PCA can often create a situation where outliers are actually easier to detect. The PCA transformations often reshape the data so that any unusual points are more easily identified.

    An example is shown here.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Create two arrays of 100 random values: x_data spans [0, 1) and y_data
    # adds a small amount of noise
    x_data = np.random.random(100)
    y_data = np.random.random(100) / 10.0

    # Create a dataframe with two highly correlated features ('A' and 'B'),
    # plus two additional points that will act as outliers
    data = pd.DataFrame({'A': x_data, 'B': x_data + y_data})
    data = pd.concat([data,
                      pd.DataFrame([[1.8, 1.8], [0.5, 0.1]], columns=['A', 'B'])])

    # Use PCA to transform the data to another 2D space
    pca = PCA(n_components=2)
    pca.fit(data)
    print(pca.explained_variance_ratio_)

    # Create a dataframe with the PCA-transformed data
    new_data = pd.DataFrame(pca.transform(data), columns=['0', '1'])

    This first creates the original data, as shown in the left pane. It then transforms it using PCA. Once this is done, we have the data in the new space, shown in the right pane.

    Here I created a simple synthetic dataset, with the data highly correlated. There are two outliers, one following the general pattern, but extreme (Point A) and one with typical values in each dimension, but not following the general pattern (Point B).

    We then use scikit-learn’s PCA class to transform the data. The output of this is placed in another pandas dataframe, which can then be plotted (as shown), or examined for outliers.
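
    The original article’s figure is not reproduced here, but a minimal sketch of similar plots can be produced with matplotlib (the import and styling are assumptions; only data and new_data from the snippets above are needed):

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    axes[0].scatter(data['A'], data['B'])
    axes[0].set_title('Original data')
    axes[1].scatter(new_data['0'], new_data['1'])
    axes[1].set_title('PCA-transformed data')
    plt.tight_layout()
    plt.show()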

    Looking at the original data, the data tends to appear along a diagonal. Drawing a line from the bottom-left to the top-right (the blue line in the plot), we can create a new, single dimension that represents the data very well. In fact, executing PCA, this will be the first component, with the line orthogonal to this (the orange line, also shown in the left pane) as the second component, which represents the remaining variance.

    With more realistic data, we will not have such strong linear relationships, but we do almost always have some associations between the features — it’s rare for the features to be completely independent. And given this, PCA can usually be an effective way to reduce the dimensionality of a dataset. That is, while it’s usually necessary to use all components to completely describe each item, using only a fraction of the components can often describe every record (or almost every record) sufficiently well.

    The right pane shows the data in the new space created by the PCA transformation, with the first component (which captures most of the variance) on the x-axis and the second (which captures the remaining variance) on the y-axis. In the case of 2D data, a PCA transformation will simply rotate and stretch the data. The transformation is harder to visualize in higher dimensions, but works similarly.

    Printing the explained variance (the code above included a print statement to display this) indicates component 0 contains 0.99 of the variance and component 1 contains 0.01, which matches the plot well.

    Often the components would be examined one at a time (for example, as histograms), but in this example, we use a scatter plot, which saves space as we can view two components at a time. The outliers stand out as extreme values in the two components.

    Looking a little closer at the details of how PCA works, it first finds the line through the data that best describes the data: the line that minimizes the sum of squared distances from all points to the line. This is, then, the first component. The process then finds a line orthogonal to this that best captures the remaining variance. This dataset contains only two dimensions, and so there is only one choice for the direction of the second component: at right angles to the first component.

    Where the original data has more dimensions, this process continues for additional steps: it continues until all the variance in the data is captured, which creates as many components as the original data had dimensions. Given this, PCA has three properties (checked briefly in the snippet after this list):

    • All components are uncorrelated.
    • The first component has the most variation, then the second, and so on.
    • The total variance of the components equals the variance in the original features.
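
    A quick, optional check of these properties on the example above (reusing data and new_data from the earlier snippets; exact numbers will vary from run to run):

    # Components are uncorrelated: off-diagonal values near 0
    print(np.corrcoef(new_data['0'], new_data['1']))

    # Component 0 has far more variance than component 1
    print(new_data.var())

    # Total variance is preserved by the transformation
    print(data.var().sum(), new_data.var().sum())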

    PCA also has some nice properties that lend themselves well to outlier detection. As we can see in the figure, the outliers become separated well within the components, which allows simple tests to identify them.

    We can also see another interesting result of PCA transformation: points that are in keeping with the general pattern tend to fall along the early components, but can be extreme in these (such as Point A), while points that do not follow the general patterns of the data tend to not fall along the main components, and will be extreme values in the later components (such as Point B).

    There are two common ways to identify outliers using PCA:

    • We can transform the data using PCA and then use a set of tests (conveniently, these can generally be very simple tests) on each component to score each row. This is quite straightforward to code.
    • We can look at the reconstruction error. In the figure, we can see that using only the first component describes the majority of the data quite well. The second component is necessary to fully describe all the data, but by simply projecting the data onto the first component, we can describe reasonably well where most data is located. The exception is point B; its position on the first component does not describe its full location well and there would be a large reconstruction error using only a single component for this point, though not for the other points. In general, the more components necessary to describe a point’s location well (or the higher the error given a fixed number of components), the stronger an outlier the point is. A short sketch of this calculation follows below.
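
    As a rough sketch of the reconstruction-error idea, using scikit-learn’s PCA on the 2D example above and keeping only the first component (the choice of one component is specific to this example):

    # Fit PCA keeping only the first component
    pca_1 = PCA(n_components=1)
    pca_1.fit(data)

    # Project onto the first component, then map back to the original space
    reconstructed = pca_1.inverse_transform(pca_1.transform(data))

    # Per-row squared reconstruction error; the last two rows are Points A and B,
    # and only Point B should show a large error
    errors = ((data.values - reconstructed) ** 2).sum(axis=1)
    print(errors[-2:])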

    Another method is possible where we remove rows one at a time and identify which rows affect the final PCA calculations the most significantly. Although this can work well, it is often slow and not commonly used. I may cover this in future articles, but this article will look at reconstruction error, and the next article at running simple tests on the PCA components.

    Assumptions behind PCA for outlier detection

    PCA does assume there are correlations between the features. It was only possible to transform the data above such that the first component captures much more variance than the second because the data is correlated. PCA provides little value for outlier detection where the features have no associations, but, given most datasets have significant correlation, it is very often applicable. And given this, we can usually find a reasonably small number of components that capture the bulk of the variance in a dataset.

    As with some other common techniques for outlier detection, including Elliptic Envelope methods, Gaussian mixture models, and Mahalanobis distance calculations, PCA works by creating a covariance matrix representing the general shape of the data, which is then used to transform the space. In fact, there is a strong correspondence between elliptic envelope methods, the Mahalanobis distance, and PCA.

    The covariance matrix is a d x d matrix (where d is the number of features, or dimensions, in the data), that stores the covariance between each pair of features, with the variance of each feature stored on the main diagonal (that is, the covariance of each feature to itself). The covariance matrix, along with the data center, is a concise description of the data — that is, the variance of each feature and the covariances between the features are very often a very good description of the data.

    A covariance matrix for a dataset with three features may look like:

    Example covariance matrix for a dataset with three features

    Here the variance of the three features are shown on the main diagonal: 1.57, 2.33, and 6.98. We also have the covariance between each feature. For example, the covariance between the 1st & 2nd features is 1.50. The matrix is symmetrical across the main diagonal, as the covariance between the 1st and 2nd features is the same as between the 2nd & 1st features, and so on.
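
    For illustration, a covariance matrix can be computed directly from a dataframe with pandas (this reuses the numpy and pandas imports from earlier; the three-feature dataset below is synthetic and purely for demonstration, so its values will not match the matrix described above):

    rng = np.random.default_rng(0)
    df3 = pd.DataFrame(rng.normal(size=(500, 3)), columns=['F1', 'F2', 'F3'])
    df3['F2'] = df3['F1'] + rng.normal(scale=0.5, size=500)  # induce a correlation

    # 3 x 3 covariance matrix; the variances sit on the main diagonal
    print(df3.cov())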

    Scikit-learn (and other packages) provide tools that can calculate the covariance matrix for any given numeric dataset, but it isn’t necessary to compute it directly when using the techniques described in this and the next article. In this article, we look at tools provided by a popular package for outlier detection called PyOD (probably the most complete and well-used tool for outlier detection on tabular data available in Python today). These tools handle the PCA transformations, as well as the outlier detection, for us.

    Limitations of PCA for outlier detection

    One limitation of PCA is that it is sensitive to outliers. It’s based on minimizing squared distances of the points to the components, so it can be heavily affected by outliers (remote points can have very large squared distances). To address this, robust PCA is often used, where the extreme values in each dimension are removed before performing the transformation. The example below includes this.

    Another limitation of PCA (as well as of Mahalanobis distances and similar methods) is that it can break down if the correlations exist in only certain regions of the data, which is frequently true if the data is clustered. Where data is well-clustered, it may be necessary to cluster (or segment) the data first, and then perform PCA on each subset of the data.

    PCA-based tests for outliers in PyOD

    Now that we’ve gone over how PCA works and, at a high level, how it can be applied to outlier detection, we can look at the detectors provided by PyOD.

    PyOD actually provides three classes based on PCA: PyODKernelPCA, PCA, and KPCA. We’ll look at each of these.

    PyODKernelPCA

    PyOD provides a class called PyODKernelPCA, which is simply a wrapper around scikit-learn’s KernelPCA class. Either may be more convenient in different circumstances. This is not an outlier detector in itself and provides only PCA transformation (and inverse transformation), similar to scikit-learn’s PCA class, which was used in the previous example.

    The KernelPCA class, though, is different from the PCA class, in that KernelPCA allows for nonlinear transformations of the data and can better model some more complex relationships. Kernels work similarly in this context as they do with SVM models: they transform the space (in a very efficient manner) in a way that allows outliers to be separated more easily.

    Scikit-learn provides several kernels. These are beyond the scope of this article, but they can improve the PCA process where there are complex, nonlinear relationships between the features. If a kernel is used, outlier detection otherwise works the same as with the PCA class. That is, we can either directly run outlier detection tests on the transformed space, or measure the reconstruction error.
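
    As a brief sketch of how such a transformation looks in code, using scikit-learn’s KernelPCA on the earlier 2D example (the RBF kernel here is just an assumption; any of the available kernels can be substituted):

    from sklearn.decomposition import KernelPCA

    # Nonlinear transformation of the earlier dataframe into two components
    kpca = KernelPCA(n_components=2, kernel='rbf')
    kpca_data = pd.DataFrame(kpca.fit_transform(data), columns=['0', '1'])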

    The former method, running tests on the transformed space, is quite straightforward and effective. We look at this in more detail in the next article. The latter method, checking for reconstruction error, is a bit more difficult. It’s not unmanageable, but the two detectors provided by PyOD that we look at next handle the heavy lifting for us.

    The PCA detector

    PyOD provides two PCA-based outlier detectors: the PCA class and KPCA. The latter, as with PyODKernelPCA, allows kernels to handle more complex data. PyOD recommends using the PCA class where the data contains linear relationships, and KPCA otherwise.

    Both classes use the reconstruction error of the data, using the Euclidean distance of points to the hyperplane that’s created using the first k components. The idea, again, is that the first k components capture the main patterns of the data well, and any points not well modeled by these are outliers.

    In the plot above, this would not capture Point A, but would capture Point B. If we set k to 1, we’d use only one component (the first component), and would measure the distance of every point from its actual location to its location on this component. Point B would have a large distance, and so can be flagged as an outlier.

    As with PCA generally, it’s best to remove any obvious outliers before fitting the data. In the example below, we use another detector provided by PyOD called ECOD (Empirical Cumulative Distribution Functions) for this purpose. ECOD is a detector you may not be familiar with, but it is quite a strong tool. In fact, when evaluating detectors for a project, PyOD recommends starting with Isolation Forest and ECOD.

    ECOD is beyond the scope of this article. It’s covered in Outlier Detection in Python, and PyOD also provides a link to the original journal paper. But, as a quick sketch: ECOD is based on empirical cumulative distributions, and is designed to find the extreme (very small and very large) values in columns of numeric values. It does not check for rare combinations of values, only extreme values. As such, it is not able to find all outliers, but it is quite fast, and quite capable of finding outliers of this type. In this case, we remove the top 1% of rows identified by ECOD before fitting a PCA detector.

    In general when performing outlier detection (not just when using PCA), it’s useful to first clean the data, which in the context of outlier detection often refers to removing any strong outliers. This allows the outlier detector to be fit on more typical data, which allows it to better capture the strong patterns in the data (so that it is then better able to identify exceptions to these strong patterns). In this case, cleaning the data allows the PCA calculations to be performed on more typical data, so as to better capture the main distribution of the data.

    Before executing, it’s necessary to install PyOD, which may be done with:

    pip install pyod

    The code here uses the speech dataset (Public license) from OpenML, which has 400 numeric features. Any numeric dataset, though, may be used (any categorical columns will need to be encoded). As well, generally, any numeric features will need to be scaled, to be on the same scale as each other (skipped for brevity here, as all features here use the same encoding).

    import pandas as pd
    from pyod.models.pca import PCA
    from pyod.models.ecod import ECOD
    from sklearn.datasets import fetch_openml

    # Collect the data
    data = fetch_openml("speech", version=1, parser='auto')
    df = pd.DataFrame(data.data, columns=data.feature_names)
    scores_df = df.copy()

    # Creates an ECOD detector to clean the data
    clf = ECOD(contamination=0.01)
    clf.fit(df)
    scores_df['ECOD Scores'] = clf.predict(df)

    # Creates a clean version of the data, removing the top
    # outliers found by ECOD
    clean_df = df[scores_df['ECOD Scores'] == 0]

    # Fits a PCA detector to the clean data
    clf = PCA(contamination=0.02)
    clf.fit(clean_df)

    # Predicts on the full data
    pred = clf.predict(df)

    Running this, the pred variable will contain the predicted outlier label (0 or 1) for each record in the data.
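
    If continuous scores are preferred over binary labels, PyOD detectors also expose a decision_function() method (and store scores for the training data in decision_scores_), for example:

    # Continuous outlier scores for the full data; higher means more anomalous
    scores = clf.decision_function(df)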

    The KPCA detector

    The KPCA detector works very much the same as the PCA detector, with the exception that a specified kernel is applied to the data. This can transform the data quite significantly. The two detectors can flag very different records, and, as both have low interpretability, it can be difficult to determine why. As is common with outlier detection, it may take some experimentation to determine which detector and parameters work best for your data. As both are strong detectors, it may also be useful to use both. Likely this can best be determined (along with the best parameters to use) using doping, as described in Doping: A Technique to Test Outlier Detectors.

    To create a KPCA detector using a linear kernel, we use code such as:

    det = KPCA(kernel='linear')

    KPCA also supports polynomial, radial basis function, sigmoidal, and cosine kernels.
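
    Putting this together with the earlier workflow, a KPCA detector can be fit in much the same way as the PCA detector (the contamination value and RBF kernel below are assumptions, and the import path follows PyOD’s usual module convention):

    from pyod.models.kpca import KPCA

    det = KPCA(contamination=0.02, kernel='rbf')
    det.fit(clean_df)
    kpca_pred = det.predict(df)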

    Conclusions

    In this article we went over the ideas behind PCA and how it can aid outlier detection, particularly looking at standard outlier detection tests on PCA-transformed data and at reconstruction error. We also looked at two outlier detectors provided by PyOD for outlier detection based on PCA (both using reconstruction error), PCA and KPCA, and provided an example using the former.

    PCA-based outlier detection can be very effective, but does suffer from low interpretability. The PCA and KPCA detectors produce outliers that are very difficult to understand.

    In fact, even when using interpretable outlier detectors (such as Counts Outlier Detector, or tests based on z-score or interquartile range), on the PCA-transformed data (as we’ll look at in the next article), the outliers can be difficult to understand since the PCA transformation itself (and the components it generates) are nearly inscrutable. Unfortunately, this is a common theme in outlier detection. The other main tools used in outlier detection, including Isolation Forest, Local Outlier Factor (LOF), k Nearest Neighbors (KNN), and most others are also essentially black boxes (their algorithms are easily understandable — but the specific scores given to individual records can be difficult to understand).

    In the 2d example above, when viewing the PCA-transformed space, it can be easy to see how Point A and Point B are outliers, but it is difficult to understand the two components that are the axes.

    Where interpretability is necessary, it may be impossible to use PCA-based methods. Where this is not necessary, though, PCA-based methods can be extremely effective. And again, PCA has no lower interpretability than most outlier detectors; unfortunately, only a handful of outlier detectors provide a high level of interpretability.

    In the next article, we will look further at performing tests on the PCA-transformed space. This includes simple univariate tests, as well as other standard outlier detectors, considering the time required (for PCA transformation, model fitting, and prediction), and the accuracy. Using PCA can very often improve outlier detection in terms of speed, memory usage, and accuracy.

    All images are by the author



  • Product-Oriented ML: A Guide for Data Scientists

    Product-Oriented ML: A Guide for Data Scientists

    Jake Minns

    How to build ML products users love.

    Photo by Pavel Danilyuk: https://www.pexels.com/photo/a-robot-holding-a-flower-8438979/

    Data science offers rich opportunities to explore new concepts and demonstrate their viability, all towards building the ‘intelligence’ behind features and products. However, most machine learning (ML) projects fail! And this isn’t just because of the inherently experimental nature of the work. Projects may lack purpose or grounding in real-world problems, while integration of ML into products requires a commitment to long-term problem-solving, investment in data infrastructure, and the involvement of multiple technical experts. This post is about mitigating these risks at the planning stage (failing here, and failing fast) while developing into a product-oriented data scientist.

    This article provides a structured approach to planning ML products by walking through the key areas of a product design document. We’ll cover clarifying requirements, understanding data constraints and defining what success looks like, all of which dictates your approach to building successful ML products. These documents should be flexible; use them to figure out what works for your team.

    I’ve been fortunate to work in startups, part of small scrappy teams where roles and ownership become blended. I mention this because the topics covered below cross over traditional boundaries into project management, product, UI/UX, marketing and more. I’ve found that people who can cross these boundaries and approach collaboration with empathy make great products and better colleagues.

    To illustrate the process, we will work through a feature request, set out by a hypothetical courier company:

    “As a courier company, we’d like to improve our ability to provide users with advanced warning if their package delivery is expected to be delayed.”

    Problem Definition

    This section is about writing a concise description of the problem and the project’s motivation. As development spans months or years, not only does this start everyone on the same page, but unique to ML, it serves to anchor you as challenges arise and experiments fail. Start with a project kickoff. Encourage open collaboration and aim to surface the assumptions present in all cross-functional teams, ensuring alignment on product strategy and vision from day one.

    Actually writing the statement starts with reiterating the problem in your own words. For me, making this long form and then whittling it down makes it easier to narrow down on the specifics. In our example, we are starting with a feature request. It provides some direction but leaves room for ambiguity around specific requirements. For instance, “improve our ability” suggests an existing system — do we have access to an existing dataset? “Advanced warning” is vague on information but tells us customers will be actively prompted in the event of a delayed delivery. These all have implications for how we build the system, and provide an opportunity to assess the feasibility of the project.

    We also need to understand the motivation behind the project. While we can assume the new feature will provide a better user experience, what’s the business opportunity? When defining the problem, always tie it back to the larger business strategy. For example, improving delivery delay notifications isn’t just about building a better product — it’s about reducing customer churn and increasing satisfaction, which can boost brand loyalty and lower support costs. This is your real measure of success for the project.

    Working within a team to unpack a problem is a skill all engineers should develop — not only is it commonly tested as part of interview processes, but, as discussed, it sets expectations for a project and strategy that everyone, top-down, can buy into. A lack of alignment from the start can be disastrous for a project, even years later. Unfortunately, this was the fate of a health chatbot developed by Babylon. Babylon set out with the ambitious goal of revolutionising healthcare by using AI to deliver accurate diagnostics. To its detriment, the company oversimplified the complexity of healthcare, especially across different regions and patient populations. For example, symptoms like fever might indicate a minor cold in the UK, but could signal something far more serious in Southeast Asia. This lack of clarity and overpromising on AI capabilities led to a major mismatch between what the system could actually do and what was needed in real-world healthcare environments (https://sifted.eu/articles/the-rise-and-fall-of-babylon).

    Requirements and Constraints

    With your problem defined and why it matters, we can now document the requirements for delivering the project and set the scope. These typically fall into two categories:

    1. Functional requirements, which define what the system should do from the user’s perspective. These are directly tied to the features and interactions the user expects.
    2. Non-functional requirements, which address how the system operates — performance, security, scalability, and usability.

    If you’ve worked with agile frameworks, you’ll be familiar with user stories — short, simple descriptions of a feature told from the user’s perspective. I’ve found defining these as a team is a great way to align; this starts with documenting functional requirements from a user perspective. Then, map them across the user journey, and identify key moments your ML model will add value. This approach helps establish clear boundaries early on, reducing the likelihood of “scope creep”. If your project doesn’t have traditional end-users, perhaps you’re replacing an existing process? Talk to people with boots on the ground — be that operational staff or process engineers; they are your domain experts.

    From a simple set of stories we can build actionable model requirements:

    What information is being sent to users?

    As a customer awaiting delivery, I want to receive clear and timely notifications about whether my package is delayed or on time, so that I can plan my day accordingly.

    How will users be sent the warnings?

    As a customer awaiting delivery, I want to receive notifications via my preferred communication channel (SMS or native app) about the delay of my package, so that I can take action without constantly checking the app.

    What user-specific data can the system use?

    As a customer concerned about privacy, I only want essential information like my address to be used to predict whether my package is delayed.

    Done right, these requirements should constrain your decisions regarding data, models and training evaluation. If you find conflicts, balance them based on user impact and feasibility. Let’s unpack the user stories above to find how our ML strategy will be constrained:

    What information is being sent to users?

    • The model can remain simple (binary classification) if only a delay notification is needed; more detailed outputs require a more complex model and additional data.

    How will users be sent the warnings?

    • Real-time warnings necessitate low-latency systems, which creates constraints around model and preprocessing complexity.

    What user-specific data can the system use?

    • If we can only use limited user-specific information, our model accuracy might suffer. Alternatively, using more detailed user-specific data requires consent from users and increased complexity around how data is stored in order to adhere to data privacy best practices and regulations.

    Thinking about users prompts us to embed ethics and privacy into our design while building products people trust. Does our training data result in outputs that contain bias, discriminating against certain user groups? For instance, low-income areas may have worse infrastructure affecting delivery times — is this represented fairly in the data? We need to ensure the model does not perpetuate or amplify existing biases. Unfortunately, there is a litany of such cases. Take the ML-based recidivism tool COMPAS, used across the US, which was shown to overestimate the recidivism risk for Black defendants while underestimating it for white defendants (https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm).

    In addition to ethics, we also need to consider other non-functional requirements such as performance and explainability:

    • Transparency and Explainability: How much of a “black-box” do we present the model as? What are the implications of a wrong prediction or bug? These aren’t easy questions to answer. Showing more information about how a model arrives at its decisions requires robust models and the use of explainable models like decision trees. Techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) can help explain how different features contribute to a prediction, at the risk of overwhelming users. For our example, would telling users why a package is delayed build trust? Generally, model explainability increases buy-in from internal stakeholders.
    • Real-time or Batch Processing: Real-time predictions require low-latency infrastructure and streaming data pipelines. Batch predictions can be processed at regular intervals, which might be sufficient for less time-sensitive needs. Choosing between real-time or batch predictions affects the complexity of the solution and influences which models are feasible to deploy. For instance, simpler models or optimisation techniques reduce latency. More on this later.

    A tip borrowed from marketing is the creation of user personas. This typically builds on market research collected through formal interviews and surveys to understand the needs, behaviours, and motivations of users. It’s then segmented based on common characteristics like demographics, goals and challenges. From this we can develop detailed profiles for each segment, giving them names and backstories. During planning, personas help us empathise with how model predictions will be received and the actions they elicit in various contexts.

    Take Sarah, a “Busy Parent” persona. She prioritises speed and simplicity. Hence, she values timely, concise notifications about package delays. This means our model should focus on quick, binary predictions (delayed or on-time) rather than detailed outputs. Finally, since Sarah prefers real-time notifications via her mobile, the model needs to integrate seamlessly with low-latency systems to deliver instant updates.

    By documenting functional and non-functional requirements, we define “What” we are building to meet the needs of users, combined with “Why” this aligns with business objectives.

    Modelling Approach

    It’s now time to think about “How” we meet our requirements. This starts with framing the problem in ML terms by documenting the type of inputs (features), outputs (predictions) and a strategy for learning the relationship between them. At least something to get us started, we know it’s going to be experimental.

    For our example, the input features could include traffic data, weather reports or package details while a binary prediction is required: “delayed” or “on-time”. It’s clear that our problem requires a binary classification model. For us this was simple, but for other product contexts a range of approaches exist:

    Supervised Learning Models: Requires a labeled dataset to train.

    • Classification Models: Binary classification is simple to implement and interpret for stakeholders, making it ideal for an MVP. This comes at the cost of more nuanced insights provided by multi-class classification, like a reason for delay in our case. However, this often requires more data, meaning higher costs and development time.
    • Regression Models: If the target is a continuous value, like the exact time a package will be delayed (e.g., “Your package will be delayed by 20 minutes”), a regression model would be the appropriate choice. These outputs are also subject to more uncertainty.

    Unsupervised Learning Models: Works with unlabelled data.

    • Clustering Models: In the context of delivery delays, clustering could be used during the exploratory phase to group deliveries based on similar characteristics, such as region or recurring traffic issues. Discovering patterns can inform product improvements or guide user segmentation for personalising features/notifications.
    • Dimensionality Reduction: For noisy datasets with a large feature space, dimensionality reduction techniques like Principal Component Analysis (PCA) or autoencoders can be used to reduce computational costs and overfitting, allowing for smaller models at the cost of some loss in feature context.

    Generative Models: Generate new data by training on either labelled or unlabelled data.

    • Generative Adversarial Networks (GANs): For us, GANs could be used sparingly to simulate rare but impactful delivery delay scenarios, such as extreme weather conditions or unforeseen traffic events, if a tolerance to edge cases is required. However, these are notoriously difficult to train, come with high computational costs, and care must be taken that the generated data is realistic. This isn’t typically appropriate for early-stage products.
    • Variational Autoencoders (VAEs): VAEs have a similar use case to GANs, with the added benefit of more control over the range of outputs generated.
    • Large Language Models (LLMs): If we wanted to incorporate text-based data like customer feedback or driver notes into our predictions, LLMs could help generate summaries or insights. But, real-time processing is a challenge with heavy computational loads.

    Reinforcement Learning Models: These models learn by interacting with an environment, receiving feedback through rewards or penalties. For a delivery company, reinforcement learning could be used to optimise the system based on the real outcome of the delivery. Again, this isn’t really appropriate for an MVP.

    It’s normal for the initial framing of a problem to evolve as we gain insights from data exploration and early model training. Therefore, start with a simple, interpretable model, to test feasibility. Then, incrementally increase complexity by adding more features, tuning hyperparameters, and then explore more complex models like ensemble methods or deep learning architectures. This keeps costs and development time low while making for a quick go to market.

    ML differs significantly from traditional software development when it comes to estimating development time, with a large chunk of the work being made up of experiments, where the outcome is always unknown, and so is the number required. This means any estimate you provide should have a large contingency baked in, or the expectation that it’s subject to change. If the product feature isn’t critical we can afford to give tighter time estimates by starting with simple models while planning for incremental improvements later.

    The time taken to develop your model is a significant cost to any project. In my experience, getting results from even a simple model fast will be massively beneficial downstream, allowing you to hand over to frontend developers and ops teams. To help, I have a few tips. First, fail fast and prioritise experiments by least effort and maximum likelihood of success. Then adjust your plan on the go based on what you learn. Although obvious, people do struggle to embrace failure. So, be supportive of your team, it’s part of the process. My second tip is, do your research! Find examples of similar problems and how they were solved, or not. Despite the recent boom in popularity of ML, the field has been around for a long time, and 9 times out of 10 someone has solved a problem at least marginally related to yours. Keep up with the literature, use sites like Papers with Code, daily papers from Hugging Face or AlphaSignal, which provides a nice email newsletter. For databases, try Google Scholar, Web of Science or ResearchGate. Frustratingly, the cost of accessing major journals is a significant barrier to a comprehensive literature review. Sci-Hub…

    Data Requirements

    Now that we know what our “black box” will do, what shall we put in it? It’s time for data, and from my experience this is the most critical part of the design with respect to mitigating risk. The goal is to create an early roadmap for sourcing sufficient, relevant, high-quality data. This covers training data, potential internal or external sources, and evaluating data relevance, quality, completeness, and coverage. Address privacy concerns and plan for data collection, storage, and preprocessing, while considering strategies for limitations like class imbalances.

    Without proper accounting for the data requirements of a project, you risk exploding budgets and never fully delivering; take Tesla Autopilot as one such example. Their challenge with data collection highlights the risks of underestimating real-world data needs. From the start, the system was limited by the data captured from early adopters’ vehicles, which, to date, has lacked the sensor depth required for true autonomy (https://spectrum.ieee.org/tesla-autopilot-data-deluge).

    Data sourcing is made significantly easier if the feature you’re developing is already part of a manual process. If so, you’ll likely have existing datasets and a performance benchmark. If not, look internally. Most organisations capture vast amounts of data, this could be system logs, CRM data or user analytics. Remember though, garbage in, garbage out! Datasets not built for ML from the beginning often lack the quality required for training. They might not be rich enough, or fully representative of the task at hand.

    If unsuccessful, you’ll have to look externally. Start with high-quality public repositories specifically designed for ML, such as Kaggle, UCI ML Repository and Google Dataset Search.

    If problem-specific data isn’t available, try more general publicly available datasets. Look through data leaks like the Enron email dataset (for text analysis and natural language processing), government census data (for population-based studies), or commercially released datasets like the IMDb movie review dataset (for sentiment analysis). If that fails, you can start to aggregate from multiple sources to create an enriched dataset. This might involve pulling data from spreadsheets, APIs, or even scraping the web. The challenge for both cases is to ensure your data is clean, consistent, and appropriately formatted for ML purposes.

    Worst case, you’re starting from scratch and need to collect your own raw data. This will be expensive and time-consuming, especially when dealing with unstructured data like video, images, or text. In some cases, data collection can be automated by conducting surveys, setting up sensors or IoT devices, or even launching crowd-sourced labelling challenges.

    Regardless, manual labelling is almost always necessary. There are many highly recommended, off the shelf solutions here, including LabelBox, Amazon SageMaker Ground Truth and Label Studio. Each of these can speed up labelling and help maintain quality, even across large datasets with random sampling.

    If it’s not clear already, as you move from internal sources to manual collection, the cost and complexity of building a dataset appropriate for ML grows significantly, and so does the risk for your project. While this isn’t a project-killer, it’s important to take into account what your timelines and budgets allow. If you can only collect a small dataset you’ll likely be restricted to smaller model solutions, or the fine-tuning of foundation models from platforms like Hugging Face and Ollama. In addition, ensure you have a costed contingency for obtaining more data later in the project. This is important because understanding how much data is required for your project can only be answered by solving the ML problem. Therefore, mitigate the risk upfront by ensuring you have a route to gathering more. It’s common to see back-of-the-napkin calculations quoted as a reasonable estimate for how much data is required. But this really only applies to very well understood problems like image classification and classical ML problems.

    If it becomes clear you won’t be able to gather enough data, there has been some limited success with generative models for producing synthetic training data. Fraud detection systems developed by American Express have used this technique to simulate card numbers and transactions in order to detect discrepancies or similarities with actual fraud (https://masterofcode.com/blog/generative-ai-for-fraud-detection).

    Once a basic dataset has been established you’ll need to understand the quality. I have found manually working the problem to be very effective: it provides insight into useful features and future challenges, sets realistic expectations for model performance, and uncovers data quality issues and gaps in coverage early on. Get hands on with the data and build up domain knowledge while taking note of the following:

    • Data relevance: Ensure the available data reflects your attempts to solve the problem. For our example, traffic reports and delivery distances are useful, but customer purchase history may be irrelevant. Identifying the relevance of data helps reduce noise, while allowing smaller data sets and models to be more effective.
    • Data quality: Pay attention to any biases, missing data, or anomalies that you find, this will be useful when building data preprocessing pipelines later on.
    • Data completeness and coverage: Check the data sufficiently covers all relevant scenarios. For our example, data might be required for both city centres and more rural areas, failing to account for this impacts the model’s ability to generalise.
    • Class imbalance: Understand the distribution of classes or the target variable so that you can collect more data if possible. Hopefully for our case, “delayed” packages will be a rare event. While training we can implement cost-sensitive learning to counter this. Personally, I have always had more success oversampling minority classes with techniques like SMOTE (Synthetic Minority Over-sampling Technique) or Adaptive Synthetic (ADASYN) sampling (see the sketch after this list).
    • Timeliness of data: Consider how up-to-date the data needs to be for accurate predictions. For instance, it might be that real-time traffic data is required for the most accurate predictions.
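
    As a minimal sketch of the oversampling step mentioned above, using the imbalanced-learn package (an assumption; the synthetic dataset below simply stands in for a feature matrix with a rare “delayed” class):

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Synthetic stand-in: roughly 5% of samples belong to the rare class
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print(Counter(y))

    # Oversample the minority class with SMOTE
    X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_resampled))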

    When it comes to a more comprehensive look at quality, Exploratory Data Analysis (EDA) is the way to uncover patterns, spot anomalies, and better understand data distributions. I will cover EDA in more detail in a separate post, but visualising data trends, using correlation matrices, and understanding outliers can reveal potential feature importance or challenges.

    Finally, think beyond just solving the immediate problem — consider the long-term value of the data. Can it be reused for future projects or scaled for other models? For example, traffic and delivery data could eventually help optimise delivery routes across the whole logistics chain, improving efficiency and cutting costs in the long run.

    Success Metrics — Finding Good Enough

    When training models, quick performance gains are often followed by a phase of diminishing returns. This can lead to directionless trial-and-error while killing morale. The solution? Define “good enough” training metrics from the start, such that you meet the minimum threshold to deliver the business goals for the project.

    Setting acceptable thresholds for these metrics requires a broad understanding of the product and soft skills to communicate the gap between technical and business perspectives. Within agile methodologies, we call these acceptance criteria. Doing so allows us to ship quick to the minimum spec and then iterate.

    What are business metrics? Business metrics are the real measure of success for any project. These could be reducing customer support costs or increasing user engagement, and are measured once the product is live, hence referred to as online metrics. For our example, 80% accuracy might be acceptable if it reduces customer service costs by 15%. In practice, you should track a single model with a single business metric; this keeps the project focused and avoids ambiguity about when you have successfully delivered. You’ll also want to establish how you track this metric: look for internal dashboards and analytics that business teams should have available; if they aren’t available, maybe it isn’t a driver for the business.

    Balancing business and technical metrics: Finding a “good enough” performance starts with understanding the distribution of events in the real world, and then relating this to how it impacts users (and hence the business). Take our courier example: we expect delayed packages to be a rare event, and so for our binary classifier there is a class imbalance. This makes accuracy alone inappropriate, and we need to factor in how our users respond to predictions:

    • False positives (predicting a delay when there isn’t one) could generate annoying notifications for customers, but when a package subsequently arrives on time, the inconvenience is minor. Avoiding false positives means prioritising high precision.
    • False negatives (failing to predict a delay) are likely to cause much higher frustration when customers don’t receive a package without warning, reducing the chance of repeat business and increasing customer support costs. Avoiding false negatives means prioritising high recall.

    For our example, it’s likely the business values high-recall models. Still, for models that are less than 100% accurate, a balance between precision and recall is still necessary (we can’t notify every customer their package is delayed). This trade-off is best illustrated with a precision-recall curve (the ROC curve illustrates the related trade-off between true and false positive rates). For classification problems, we measure the balance of precision and recall with the F1 score, and for imbalanced classes we can extend this to a weighted F1 score.
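
    These metrics are straightforward to compute with scikit-learn; the labels below are placeholders, with 1 meaning “delayed”:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [0, 0, 1, 1, 0, 1, 0, 0]  # actual outcomes (placeholder)
    y_pred = [0, 1, 1, 0, 0, 1, 0, 0]  # model predictions (placeholder)

    print(precision_score(y_true, y_pred))               # how many flagged delays were real
    print(recall_score(y_true, y_pred))                  # how many real delays were caught
    print(f1_score(y_true, y_pred, average='weighted'))  # weighted F1 for imbalanced classes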

    Balancing precision and recall is a fine art, and can lead to unintended consequences for your users. To illustrate this point, consider a service like Google Calendar that offers both company and personal user accounts. In order to reduce the burden on businesses that frequently receive fake meeting requests, engineers might prioritise high-precision spam filtering. This ensures most fake meetings are correctly flagged as spam, at the cost of lower recall, where some legitimate meetings will be mislabeled as spam. However, for personal accounts, receiving fake meeting requests is far less common. Over the lifetime of the account, the risk of a legitimate meeting being flagged becomes significant due to the trade-off of a lower recall model. Here, the negative impact on the user’s perception of the service is significant.

    If we consider our courier example as a regression task, with the aim of predicting a delay time, metrics like MAE and MSE are the natural choices, with slightly different implications for your product (see the short example after this list):

    1. Mean Absolute Error (MAE): This is a fairly intuitive measure of how close the average prediction is to the actual value, and therefore a simple indicator of the accuracy of delay estimates sent to users.
    2. Mean Squared Error (MSE): This penalises larger errors more heavily due to the squaring of differences, and therefore important if significant errors in delay predictions are deemed more costly to user satisfaction. However, this does mean the metric is more sensitive to outliers.
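
    A short illustration of the difference, using scikit-learn and hypothetical delay times in minutes:

    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = [0, 10, 25, 0, 40]   # actual delays in minutes (placeholder)
    y_pred = [5, 12, 15, 0, 70]   # predicted delays in minutes (placeholder)

    print(mean_absolute_error(y_true, y_pred))  # average miss, in minutes
    print(mean_squared_error(y_true, y_pred))   # the 30-minute miss dominates this score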

    As stated above, this is about translating model metrics into terms everyone can understand and communicating trade-offs. This is a collaborative process, as team members closer to the users and product will have a better understanding of the business metrics to drive. Find the single model metric that points the project in the same direction.

    One final point, I have seen a tendency for projects involving ML to overpromise on what can be delivered. Generally this comes from the top of an organisation, where hype is generated for a product or amongst investors. This is detrimental to a project and your sanity. Your best chance to counter this is by communicating in your design realistic expectations that match the complexity of the problem. It’s always better to underpromise and overdeliver.

    High Level System Design

    At this point, we’ve covered data, models, and metrics, and addressed how we will approach our functional requirements. Now, it’s time to focus on non-functional requirements, specifically scalability, performance, security, and deployment strategies. For ML systems, this involves documenting the system architecture with system-context or data-flow diagrams. These diagrams represent key components as blocks, with defined inputs, transformations, and outputs, illustrating how different parts of the system interact, including data ingestion, processing pipelines, model serving, and user interfaces. This approach ensures a modular system, allowing teams to isolate and address issues without affecting the entire pipeline as data volume or user demand grows, thereby minimising risks related to bottlenecks or escalating costs.

    Once our models are trained we need a plan for deploying the model into production, allowing it to be accessible to users or downstream systems. A common method is to expose your model through a REST API that other services or front-ends can call. For real-time applications, serverless platforms like AWS Lambda or Google Cloud Functions are ideal for low latency (just manage your cold starts). If high throughput is a requirement, then use batch processing with scalable data pipelines like AWS Batch or Apache Spark. We can break down the considerations for ML system design into the following:

    Infrastructure and Scalability:

    Firstly, we need to make a choice about system infrastructure. Specifically, where will we deploy our system: on-premise, in the cloud, or as a hybrid solution. Cloud platforms, such as AWS or Google Cloud, offer automated scaling in response to demand, both vertically (bigger machines) and horizontally (adding more machines). Think about how the system would handle 10x or 100x the data volume. Netflix provides excellent insight via their technical blog into how they operate at scale. For instance, they have open sourced their container orchestration platform Titus, which automates deployment of thousands of containers across AWS EC2 instances using Autoscaling groups (https://netflixtechblog.com/auto-scaling-production-services-on-titus-1f3cd49f5cd7). Sometimes on-premises infrastructure is required if you’re handling sensitive data. This provides more control over security while being costly to maintain and scale. Regardless, prepare to version control your infrastructure with infrastructure-as-code tools like Terraform and AWS CloudFormation and automate deployment.

    Performance (Throughput and Latency):

    For real-time predictions, performance is critical. There are two key metrics to consider: throughput, measuring how many requests your system can handle per second, and latency, measuring how long it takes to return a prediction. If you expect to make repeated predictions with the same inputs then consider adding caching for part or all of the pipeline to reduce latency. In general, horizontal scaling is preferred in order to respond to spikes in traffic at peak times and to reduce single-point bottlenecks. This highlights how key decisions taken during your system design process will have direct implications on performance. Take Uber, who built their core service around the Cassandra database specifically to optimise for low-latency real-time data replication, ensuring quick access to relevant data (https://www.uber.com/en-GB/blog/how-uber-optimized-cassandra-operations-at-scale/).

    Security:

    For ML systems, security applies to API authentication for user requests. This is relatively standard, with methods like OAuth2, protecting endpoints with rate limiting and blocked IP address lists, and following OWASP standards. Additionally, ensure that any stored user data is encrypted at rest and in transit, and that strict access control policies are in place for both internal and external users.

    Monitoring and Alerts:

    It’s also key to consider monitoring for maintaining system health. Track key performance indicators (KPIs) like throughput, latency, and error rates, with alerts set up to notify engineers if any of these metrics fall below acceptable thresholds. This can be done server-side (e.g., at your model endpoint) or client-side (e.g., at the user’s end) to include network latency.

    Cost Considerations:

    In return for simpler infrastructure management, the cost of cloud-based systems can quickly spiral. Start by estimating the number of instances required for data processing, model training, and serving, and balance these against project budgets and growing user demands. Most cloud platforms provide cost-management tools to help you keep track of spending and optimise resources.

    MLOps:

    From the beginning, include a plan to efficiently manage the model lifecycle. The goal is to accelerate model iteration, automate deployment, and maintain robust monitoring for metrics and data drift. This allows you to start simple and iterate fast! Implement version control with Git for code and DVC (Data Version Control) for tracking changes to data and model artefacts. Tools like MLflow or Weights & Biases track experiments, while CI/CD pipelines automate testing and deployment. Once deployed, models require real-time monitoring with tools like Prometheus and Grafana to detect issues like data drift.
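
    A minimal sketch of experiment tracking with MLflow (the parameter and metric names are placeholders; by default runs are written to a local mlruns directory):

    import mlflow

    with mlflow.start_run():
        mlflow.log_param("model_type", "gradient_boosting")  # hypothetical choice
        mlflow.log_param("max_depth", 6)                     # hypothetical hyperparameter
        mlflow.log_metric("recall", 0.87)                    # hypothetical validation result
        # Trained models, plots and other artefacts can be logged alongside the run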

    A high-level system design mitigates risks and ensures your team can adapt and evolve as the system grows. This means designing a system that is model agnostic and ready to scale, by breaking down the system into modular components for a robust architecture that supports rapid trial and error, scalable deployment, and effective monitoring.

    Prototyping By Mocking the ML

    We now have an approach for delivering the project requirements, at least from an ML perspective. To round our design off, we can now outline a product prototype, focusing on the user interface and experience (UI/UX). Where possible, this should be interactive, validating whether the feature provides real value to users, ready to iterate on the UX. Since we know ML to be time-consuming and resource-intensive, you can set aside your model design and prototype without a working ML component. Document how you’ll simulate these outputs and test the end-to-end system, detailing the tools and methods used for prototyping in your design document. This is important, as the prototype will likely be your first chance to gather feedback and refine the design, likely evolving into V1.

    To mock our ML we replace predictions with a simple placeholder and simulate outputs. This can be as simple as generating random predictions or building a rule-based system. Prototyping the UI/UX involves creating mockups with design tools like Figma, or prototyping APIs with Postman and Swagger.
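
    A mocked predictor can be as small as a few lines; the rule and probability below are placeholders chosen purely to exercise the UI, not a real model:

    import random

    def mock_predict_delay(distance_km: float, bad_weather: bool) -> str:
        """Placeholder for the real model: returns 'delayed' or 'on-time'."""
        if bad_weather and distance_km > 50:  # arbitrary rule for the prototype
            return "delayed"
        return "delayed" if random.random() < 0.05 else "on-time"

    print(mock_predict_delay(80, True))    # 'delayed'
    print(mock_predict_delay(5, False))    # almost always 'on-time'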

    Once your prototype is ready, put it in the hands of people, no matter how embarrassed you are of it. Larger companies often have resources for this, but smaller teams can create their own user panels. I’ve had great success with local universities — students love to engage with something new, Amazon vouchers also help! Gather feedback, iterate, and start basic A/B testing. As you approach a live product, consider more advanced methods like multi-armed bandit testing.

    There is an excellent write-up by Apple as an example of mocking ML in this way. During user testing of a conversational digital assistant similar to Siri, they used human operators to impersonate a prototype assistant, varying the response style between chatty, non-chatty, and mirroring the user’s own style. With this approach they showed that users preferred assistants that mirror their own level of chattiness, improving trustworthiness and likability, all without investing in extensive ML development to test the UX (https://arxiv.org/abs/1904.01664).

    From this we see that mocking the ML component puts the emphasis on outcomes, allowing us to change output formats, test positive and negative flows, and find edge cases. We can also gauge the limits of perceived performance and how we manage user frustration, which has implications for the complexity of the models we can build and for infrastructure costs, all without concern for model accuracy. Finally, sharing prototypes internally helps get buy-in from business leaders; nothing sparks support and commitment for a project more than putting it in people’s hands.

    Gather Feedback and Iterate

    As you move into development and deployment, you’ll inevitably find that requirements evolve and your experiments throw up the unexpected. You’ll need to iterate! Document changes with version control, and incorporate feedback loops by revisiting the problem definition, re-evaluating data quality, and re-assessing user needs. This starts with continuous monitoring: as your product matures, look for performance degradation by applying statistical tests to detect shifts in prediction distributions (data drift). Implement online learning to counter this, or, where possible, bake user feedback methods into the UI to help reveal real bias and build trust, so-called human-in-the-loop. Actively seek feedback internally first, then from users, using interviews and panels to understand how they interact with the product and what new problems this creates. Use A/B testing to compare selected versions of your model and understand the impact on user behaviour and the relevant product/business metrics.
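    As a minimal sketch of such a statistical test, the snippet below compares a reference window of predictions against a recent window using a two-sample Kolmogorov-Smirnov test from SciPy; the synthetic windows and the 0.05 threshold are illustrative assumptions.

    import numpy as np
    from scipy.stats import ks_2samp

    # Placeholder data: predictions captured at deployment vs. the most recent week
    reference = np.random.normal(loc=0.40, scale=0.10, size=5000)
    recent = np.random.normal(loc=0.47, scale=0.10, size=5000)

    statistic, p_value = ks_2samp(reference, recent)
    if p_value < 0.05:  # illustrative significance threshold
        print(f"Possible data drift (KS statistic={statistic:.3f}, p={p_value:.4f})")
    else:
        print("No significant shift detected in the prediction distribution")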

    ML projects benefit from adopting agile methodologies across the model lifecycle, allowing us to manage the uncertainty and change that are inherent in ML, and this starts with the planning process. Start small, test quickly, and don’t be afraid to fail fast. Applying this to the planning and discovery phase will reduce risk while delivering a product that not only works but resonates with your users.



  • Calculating the Uncertainty Coefficient (Theil’s U) in Python

    Calculating the Uncertainty Coefficient (Theil’s U) in Python

    Marc Linder

    A measure of correlation between discrete (categorical) variables

    Introduction

    Theil’s U, also known as the uncertainty coefficient or entropy coefficient, quantifies the strength of association between two nominal variables. It assesses how much knowing the value of one variable reduces uncertainty about the other, providing a measure of association that ranges from 0 to 1. A higher value indicates a stronger relationship, making Theil’s U particularly useful in fields such as statistics and data science for exploring relationships within categorical data.

    Theory

    Theil’s U is a measure of nominal association based on the concept of information entropy. Suppose we have samples from two discrete random variables, X and Y.

    Then the entropy of X is defined as:

    Entropy of a single distribution X
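    Reconstructed in notation (the formula shown above is the standard Shannon entropy):

    H(X) = -\sum_{x} P_X(x) \log P_X(x)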

    And the conditional entropy of X given Y is defined as:

    Conditional entropy of X given Y
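    Written out (again a reconstruction of the formula shown above):

    H(X \mid Y) = -\sum_{y} P_Y(y) \sum_{x} P_{X \mid Y}(x \mid y) \log P_{X \mid Y}(x \mid y)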

    We can then combine the joint distribution of X and Y (the numerator) with the marginal probabilities of Y or X (the denominator) to calculate the conditional distributions of X given Y or of Y given X, respectively, as follows:

    Conditional distribution of x given y
    Conditional distribution of y given x
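    In formula form, these conditional distributions are (a reconstruction of the formulas shown above):

    P_{X \mid Y}(x \mid y) = \frac{P_{X,Y}(x, y)}{P_Y(y)}, \qquad P_{Y \mid X}(y \mid x) = \frac{P_{X,Y}(x, y)}{P_X(x)}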

    The result captures how the probability of one variable changes given the value of the other. We obtain the probability of X given Y by dividing the joint probability of X and Y, that is, the probability of each combination of X and Y, by the marginal probability of Y. Inserting the result of this division into our formula for entropy, we obtain:

    Conditional entropy of X given Y
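    Combining the two, this is the form that the code below implements (reconstructed in LaTeX notation):

    H(X \mid Y) = -\sum_{x, y} P_{X,Y}(x, y) \log \frac{P_{X,Y}(x, y)}{P_Y(y)}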

    So much for the theory; here’s how we can calculate the conditional entropy of X given Y in Python.

    from typing import List, Union
    from collections import Counter
    import math

    def conditional_entropy(
        x: List[Union[int, float]],
        y: List[Union[int, float]]
    ) -> float:
        """Calculates the conditional entropy H(X|Y)."""

        # Count unique values
        y_counter = Counter(y)  # counts of unique values in y
        xy_counter = Counter(list(zip(x, y)))  # counts of unique (x, y) pairs
        # Total number of observations
        total_occurrences = sum(y_counter.values())
        # (Re-)set entropy to 0
        entropy = 0.0

        # For every unique value pair of x and y
        for xy in xy_counter.keys():
            # Joint probability of x AND y
            p_xy = xy_counter[xy] / total_occurrences
            # Marginal probability of y
            p_y = y_counter[xy[1]] / total_occurrences
            # Conditional probability of x given y
            p_x_given_y = p_xy / p_y
            # Accumulate p(x, y) * log2 p(x|y); negated below to give H(X|Y)
            entropy += p_xy * math.log(p_x_given_y, 2)  # use base 2 instead of natural log (base e)

        return -entropy

    Once we have calculated the conditional entropy of X given Y, we can calculate Theil’s U. One last step is to calculate the entropy of X, which we defined at the beginning of this article. The uncertainty coefficient, or proficiency, is then calculated as follows:

    Theil’s U — Uncertainty coefficient or proficiency
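    Reconstructed in notation, this reads:

    U(X \mid Y) = \frac{H(X) - H(X \mid Y)}{H(X)}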

    Switching from theory to practice, this can be accomplished in Python using the following code:

    import scipy.stats as ss

    def theils_u(
        x: List[Union[int, float]],
        y: List[Union[int, float]]
    ) -> float:
        """Calculates Theil's U, the uncertainty coefficient U(X|Y)."""

        # Conditional entropy H(X|Y)
        H_xy = conditional_entropy(x, y)

        # Count unique values
        x_counter = Counter(x)

        # Total number of observations
        total_occurrences = sum(x_counter.values())

        # Convert the absolute counts of x values in x_counter to probabilities
        p_x = list(map(lambda count: count / total_occurrences, x_counter.values()))

        # Entropy of the single distribution X (base 2, to match conditional_entropy)
        H_x = ss.entropy(p_x, base=2)

        return (H_x - H_xy) / H_x if H_x != 0 else 0

    Lastly, we can define a function that calculates Theil’s U for every feature combination within a given dataset. We can do this in Python with the following code:

    import itertools
    import pandas as pd

    def get_theils_u_for_df(df: pd.DataFrame) -> pd.DataFrame:
        """Compute Theil's U for every feature combination in the input df."""

        # Create an empty dataframe to fill
        theilu = pd.DataFrame(index=df.columns, columns=df.columns)

        # Insert Theil's U values into the empty dataframe
        for var1, var2 in itertools.combinations(df, 2):
            u = theils_u(df[var1], df[var2])
            theilu.loc[var2, var1] = round(u, 2)  # fill lower triangle

            u = theils_u(df[var2], df[var1])
            theilu.loc[var1, var2] = round(u, 2)  # fill upper triangle

        # The diagonal (each variable with itself) is 1 by definition
        for i in range(len(theilu.columns)):
            theilu.iloc[i, i] = 1

        # Convert all values in the DataFrame to float
        return theilu.astype(float)

    Code Example

    We will demonstrate the functionality of the code using the well-known Iris dataset. In addition to its numeric variables, the dataset contains a categorical variable, “species.” Traditional correlation measures, such as Pearson’s correlation, are limited in capturing relationships between categorical and numerical features. However, Theil’s U can effectively measure the association between “species” and the other numerical features.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the Iris dataset from seaborn
    df = sns.load_dataset('iris')

    # Compute Theil's U for every feature combination in the input df
    theilu = get_theils_u_for_df(df)

    # Create a heatmap of the Theil's U values
    plt.figure(figsize=(10, 4))
    sns.heatmap(theilu, annot=True, cmap='Reds', fmt='.2f')
    plt.title("Heatmap of Theil's U for all variable pairs")
    plt.show()

    The result is a heatmap of Theil’s U for all variable pairs. Note that this measure has the advantage of being asymmetric, meaning the relationship between two variables can differ depending on the direction of analysis. For example, Theil’s U can quantify how much information X provides about Y, which may not be the same as how much information Y provides about X.

    Heatmap of Theil’s U values for all variable pairs

    The interpretation of the results is relatively straightforward: Petal Length and Petal Width have the strongest associations with the categorical variable “species,” both with a value of 0.91. This indicates that knowing the petal dimensions provides a high degree of information about the flower species. Sepal Length also has a moderate relationship with species at 0.55, meaning it offers some information about the species, though less than the petal measurements. Sepal Width has the weakest association with species at 0.33, indicating it provides relatively little information about the flower type. The relatively lower values between the sepal measurements and species highlight that petal dimensions are more informative for predicting species, which is consistent with the known characteristics of the Iris dataset.

    Conclusion

    In this article, we demonstrated how to calculate Theil’s U to assess associations between categorical and numerical variables. By applying this measure to the Iris dataset, we showed that petal dimensions provide significant insights into predicting flower species, highlighting the effectiveness of Theil’s U compared to traditional correlation methods.

    Sources

    • Theil, H. (1958): Economic Forecasts and Policy. Amsterdam: North Holland.
    • Theil, H. (1966): Applied Economic Forecasting. Chicago: Rand McNally.
    • Bliemel, F. (1973): Theil’s Forecast Accuracy Coefficient: A Clarification, Journal of Marketing Research 10(4), pp. 444–446

    Note: Unless otherwise noted, all images are by the author.



  • Self-Service ML with Relational Deep Learning

    Self-Service ML with Relational Deep Learning

    Laurin Brechter

    Do ML directly on your relational database

    Relational Schema of our Dataset, source; Image by Author

    In this blog post, we will dive into an interesting new approach to Deep Learning (DL) called Relational Deep Learning (RDL). We will also gain some hands-on experience by doing some RDL on a real-world database (not a dataset!) of an e-commerce company.

    Introduction

    In the real world, we usually have a relational database against which we want to run some ML task. But especially when the database is highly normalized, this implies lots of time-consuming feature engineering and a loss of granularity, as we have to do many aggregations. What’s more, there is a myriad of possible feature combinations that we can construct, each of which might yield good performance [2]. That means we are likely to leave some information relevant to the ML task on the table.

    This is similar to the early days of computer vision, before the advent of deep neural networks, when features were hand-crafted from the pixel values. Nowadays, models work directly with the raw pixels instead of relying on this intermediate layer.

    Relational Deep Learning

    RDL promises to do the same for tabular learning. That is, it removes the extra step of constructing a feature matrix by learning directly on top of your relational database. It does so by transforming the database with its relations into a graph where a row in a table becomes a node and relations between tables become edges. The row values are stored inside the nodes as node features.

    In this blog post, we will be using this e-commerce dataset from kaggle which contains transactional data about an e-commerce platform in a star schema with a central fact table (transactions) and some dimension tables. The full code can be found in this notebook.

    Throughout this blog post, we will be using the relbench library to do RDL. The first thing we have to do in relbench is to specify the schema of our relational database. Below is an example of how we can do so for the ‘transactions’ table in the database. We give the table as a pandas dataframe and specify the primary key and the timestamp column. The primary key column is used to uniquely identify the entity. The timestamp ensures that we can only learn from past transactions when we want to forecast future transactions. In the graph, this means that information can only flow from nodes with a lower timestamp (i.e. in the past) to ones with a higher timestamp. Additionally, we specify the foreign keys that exist in the relation. In this case, the transactions table has the column ‘customer_key’ which is a foreign key that points to the ‘customer_dim’ table.

    tables['transactions'] = Table(
        df=pd.DataFrame(t),
        pkey_col='t_id',
        fkey_col_to_pkey_table={
            'customer_key': 'customers',
            'item_key': 'products',
            'store_key': 'stores'
        },
        time_col='date'
    )

    The rest of the tables need to be defined in the same way. Note that this could also be automated if you already have a database schema. Since the dataset is from Kaggle, I needed to create the schema manually. We also need to convert the date columns to actual pandas datetime objects and remove any NaN values.

    class EcommerceDataBase(Dataset):
        # example of creating your own dataset: https://github.com/snap-stanford/relbench/blob/main/tutorials/custom_dataset.ipynb

        val_timestamp = pd.Timestamp(year=2018, month=1, day=1)
        test_timestamp = pd.Timestamp(year=2020, month=1, day=1)

        def make_db(self) -> Database:

            tables = {}

            customers = load_csv_to_db(BASE_DIR + '/customer_dim.csv').drop(columns=['contact_no', 'nid']).rename(columns={'coustomer_key': 'customer_key'})
            stores = load_csv_to_db(BASE_DIR + '/store_dim.csv').drop(columns=['upazila'])
            products = load_csv_to_db(BASE_DIR + '/item_dim.csv')
            transactions = load_csv_to_db(BASE_DIR + '/fact_table.csv').rename(columns={'coustomer_key': 'customer_key'})
            times = load_csv_to_db(BASE_DIR + '/time_dim.csv')

            t = transactions.merge(times[['time_key', 'date']], on='time_key').drop(columns=['payment_key', 'time_key', 'unit'])
            t['date'] = pd.to_datetime(t.date)
            t = t.reset_index().rename(columns={'index': 't_id'})
            t['quantity'] = t.quantity.astype(int)
            t['unit_price'] = t.unit_price.astype(float)
            products['unit_price'] = products.unit_price.astype(float)
            t['total_price'] = t.total_price.astype(float)

            print(t.isna().sum(axis=0))
            print(products.isna().sum(axis=0))
            print(stores.isna().sum(axis=0))
            print(customers.isna().sum(axis=0))

            tables['products'] = Table(
                df=pd.DataFrame(products),
                pkey_col='item_key',
                fkey_col_to_pkey_table={},
                time_col=None
            )

            tables['customers'] = Table(
                df=pd.DataFrame(customers),
                pkey_col='customer_key',
                fkey_col_to_pkey_table={},
                time_col=None
            )

            tables['transactions'] = Table(
                df=pd.DataFrame(t),
                pkey_col='t_id',
                fkey_col_to_pkey_table={
                    'customer_key': 'customers',
                    'item_key': 'products',
                    'store_key': 'stores'
                },
                time_col='date'
            )

            tables['stores'] = Table(
                df=pd.DataFrame(stores),
                pkey_col='store_key',
                fkey_col_to_pkey_table={}
            )

            return Database(tables)

    Crucially, the authors introduce the idea of a training table. This training table essentially defines the ML task. The idea here is that we want to predict the future state (i.e. a future value) of some entity in the database. We do this by specifying a table where each row has a timestamp, the identifier of the entity, and some value we want to predict. The id identifies the entity, and the timestamp specifies the point in time at which we make the prediction. This also limits the data that can be used to infer the value for this entity (i.e. only past data). The value itself is what we want to predict (i.e. the ground truth).

    In our case, we have an online platform with customers. We want to predict a customer’s revenue in the next 30 days. We can create the training table with a SQL statement executed with DuckDB. This is the big advantage of RDL, as we can create any kind of ML task with just SQL. For example, we could define a query that counts the number of purchases each buyer makes in the next 30 days to frame a churn prediction.

    df = duckdb.sql(f"""
        select
            timestamp,
            customer_key,
            sum(total_price) as revenue
        from
            timestamp_df t
        left join
            transactions ta
        on
            ta.date <= t.timestamp + INTERVAL '{self.timedelta}'
            and ta.date > t.timestamp
        group by timestamp, customer_key
    """).df().dropna()

    The result will be a table with customer_key as the key of the entity we want to predict, revenue as the target, and the timestamp as the time at which we need to make the prediction (i.e. we can only use data up until this point to make the prediction).

    Training Table; Image by Author

    Below is the complete code for creating the ‘customer_revenue’ task.

    class CustomerRevenueTask(EntityTask):
        # example of custom task: https://github.com/snap-stanford/relbench/blob/main/tutorials/custom_task.ipynb

        task_type = TaskType.REGRESSION
        entity_col = "customer_key"
        entity_table = "customers"
        time_col = "timestamp"
        target_col = "revenue"
        timedelta = pd.Timedelta(days=30)  # how far into the future we want to predict revenue
        metrics = [r2, mae]
        num_eval_timestamps = 40

        def make_table(self, db: Database, timestamps: "pd.Series[pd.Timestamp]") -> Table:

            timestamp_df = pd.DataFrame({"timestamp": timestamps})

            transactions = db.table_dict["transactions"].df

            df = duckdb.sql(f"""
                select
                    timestamp,
                    customer_key,
                    sum(total_price) as revenue
                from
                    timestamp_df t
                left join
                    transactions ta
                on
                    ta.date <= t.timestamp + INTERVAL '{self.timedelta}'
                    and ta.date > t.timestamp
                group by timestamp, customer_key
            """).df().dropna()

            print(df)

            return Table(
                df=df,
                fkey_col_to_pkey_table={self.entity_col: self.entity_table},
                pkey_col=None,
                time_col=self.time_col,
            )

    With that, we have done the bulk of the work. The rest of the workflow will be similar, independent of the ML task. I was able to copy most of the code from the example notebook that relbench provides.

    For example, we need to encode the node features. Here, we can use GloVe embeddings to encode all the text features, such as the product descriptions and the product names.

    from typing import List, Optional

    import torch
    from sentence_transformers import SentenceTransformer
    from torch import Tensor


    class GloveTextEmbedding:
        def __init__(self, device: Optional[torch.device] = None):
            self.model = SentenceTransformer(
                "sentence-transformers/average_word_embeddings_glove.6B.300d",
                device=device,
            )

        def __call__(self, sentences: List[str]) -> Tensor:
            return torch.from_numpy(self.model.encode(sentences))
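
    As a quick usage sketch (the product names here are made up), calling the embedder on a list of strings should return one 300-dimensional GloVe vector per string:

    embedder = GloveTextEmbedding()
    vectors = embedder(["cotton t-shirt", "stainless steel kettle"])  # made-up example strings
    print(vectors.shape)  # expected: torch.Size([2, 300])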

    After that, we can apply those transformations to our data and build out the graph.

    from torch_frame.config.text_embedder import TextEmbedderConfig
    from relbench.modeling.graph import make_pkey_fkey_graph

    text_embedder_cfg = TextEmbedderConfig(
        text_embedder=GloveTextEmbedding(device=device), batch_size=256
    )

    data, col_stats_dict = make_pkey_fkey_graph(
        db,
        col_to_stype_dict=col_to_stype_dict,  # specified column types
        text_embedder_cfg=text_embedder_cfg,  # our chosen text encoder
        cache_dir=os.path.join(
            root_dir, f"rel-ecomm_materialized_cache"
        ),  # store materialized graph for convenience
    )

    The rest of the code will be building the GNN from standard layers, coding the training loop, and doing some evaluations. I will leave this code out of this blog post for brevity since it is very standard and will be the same across tasks. You can check out the notebook here.

    Result of Training, Image by Author

    As a result, we can train this GNN to reach an r2 of around 0.3 and an MAE of 500. This means that it predicts each customer’s revenue over the next 30 days with an average error of ±$500. Of course, we can’t know whether this is good or not; perhaps we could have reached an r2 of 80% with a combination of classical ML and feature engineering.

    Conclusion

    Relational Deep Learning is an interesting new approach to ML, especially when we have a complex relational schema where manual feature engineering would be too laborious. It gives us the ability to define an ML task with just SQL, which can be especially useful for individuals who are not deep into data science but know some SQL. It also means that we can iterate quickly and experiment a lot with different tasks.

    At the same time, this approach presents its own problems, such as the difficulty of training GNNs and of constructing the graph from the relational schema. Additionally, the question is to what extent RDL can compete in terms of performance with classical ML models. In the past, we have seen that models such as XGBoost have proven to be better than neural networks on tabular prediction problems.

    References

    • [1] Robinson, Joshua, et al. “RelBench: A Benchmark for Deep Learning on Relational Databases.” arXiv, 2024, https://arxiv.org/abs/2407.20060.
    • [2] Fey, Matthias, et al. “Relational deep learning: Graph representation learning on relational databases.” arXiv preprint arXiv:2312.04615 (2023).
    • [3] Schlichtkrull, Michael, et al. “Modeling relational data with graph convolutional networks.” The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15. Springer International Publishing, 2018.



  • Generative AI foundation model training on Amazon SageMaker

    Generative AI foundation model training on Amazon SageMaker

    Trevor Harvey

    In this post, we explore how organizations can cost-effectively customize and adapt FMs using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these powerful tools enable organizations to optimize compute resources and reduce the complexity of model training and fine-tuning. We explore how you can make an informed decision about which Amazon SageMaker service is most applicable to your business needs and requirements.


  • Automate fine-tuning of Llama 3.x models with the new visual designer for Amazon SageMaker Pipelines

    Automate fine-tuning of Llama 3.x models with the new visual designer for Amazon SageMaker Pipelines

    Lauren Mullennex

    In this post, we will show you how to set up an automated LLM customization (fine-tuning) workflow so that the Llama 3.x models from Meta can provide a high-quality summary of SEC filings for financial applications. Fine-tuning allows you to configure LLMs to achieve improved performance on your domain-specific tasks.


  • Implement Amazon SageMaker domain cross-Region disaster recovery using custom Amazon EFS instances

    Implement Amazon SageMaker domain cross-Region disaster recovery using custom Amazon EFS instances

    Jinzhao Feng

    In this post, we guide you through a step-by-step process to seamlessly migrate and safeguard your SageMaker domain from one active Region to another passive or active Region, including all associated user profiles and files.
