How to visualise MediaPipe’s Hand Tracking and Gesture Recognition with Rerun
Originally appeared here:
Real-Time Hand Tracking and Gesture Recognition with MediaPipe: Rerun Showcase
Interconnected graphical data is all around us, ranging from molecular structures to social networks and the design structures of cities. Graph Neural Networks (GNNs) are emerging as a powerful method for modelling and learning the spatial and graphical structure of such data. They have been applied to protein structures and other molecular applications such as drug discovery, as well as to modelling systems such as social networks. Recently, the standard GNN has been combined with ideas from other ML models to develop exciting, innovative applications. One such development is the integration of GNNs with sequential models: the spatio-temporal GNN, which captures both the temporal and spatial (hence the name) dependencies of data and could be applied to a wide range of problems in industry and research.
Despite the exciting developments in GNNs, there are very few resources on the topic, which makes it inaccessible to many. In this short article, I want to provide a brief introduction to GNNs covering both the mathematical description and a regression problem implemented with the PyTorch library. By unravelling the principles behind GNNs, we unlock a deeper comprehension of their capabilities and applications.
A graph G can be defined as G = (V, E), where V is the set of nodes and E is the set of edges between them. A graph is often represented by an adjacency matrix A, which encodes the presence or absence of edges between nodes, i.e. a_ij takes the value 1 to indicate an edge (connection) between nodes i and j, and 0 otherwise. If a graph has n nodes, A has dimension (n × n). The adjacency matrix is demonstrated in Figure 1.
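For example, a small undirected graph with three nodes, in which node 1 is connected to nodes 2 and 3 but nodes 2 and 3 are not connected to each other, has the adjacency matrix
A =
[ 0 1 1 ]
[ 1 0 0 ]
[ 1 0 0 ]
Each row and column corresponds to a node, and the symmetric entries reflect the undirected edges.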
Each node (and each edge, though we will set edges aside for now for simplicity) has a set of features (e.g. if the node is a person, the features might be age, gender, height, job, etc.). If each node has f features, then the feature matrix X is (n × f). In some problems, each node may also have a target label, which may be a set of categorical labels or numerical values (shown in Figure 2).
Single Node Calculations
To learn the interdependence between a node and its neighbours, we need to consider the features of those neighbours. This is what enables GNNs to learn the structural representation of the data through a graph. Consider a node j with N_j neighbours: GNNs transform the features of each neighbour, aggregate them, and then update node j's feature space. Each of these steps is described below.
Neighbour feature transformation can be done in a number of ways, for example by passing the features through an MLP network or by applying a linear transformation with weights w and bias b. In the information aggregation step, the information from each neighbouring node is then aggregated.
The aggregation itself can take a number of forms, such as summation, averaging, min/max pooling or concatenation.
Following the aggregation step, the final step is to update node j. This update can be done using an MLP on the concatenation of the node's own features and the aggregated neighbour information (m_j), or with a linear transformation, where a learnable weight matrix U combines the original node features (x_j) with the aggregated neighbour features (m_j) through a non-linear activation function (ReLU in this case). That completes the process of updating a single node in a single layer; the same process is applied to all other nodes in the graph and, mathematically, can be expressed using the adjacency matrix.
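Written out for node j with neighbourhood N_j, and assuming a simple linear transformation, sum aggregation and a concatenation-based update (one of several possible choices described above), the three steps are:
x̄_i = w x_i + b for each neighbour i ∈ N_j (transform)
m_j = Σ_{i ∈ N_j} x̄_i (aggregate)
x_j′ = ReLU( U [ x_j ∥ m_j ] ) (update)
where ∥ denotes concatenation.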
Graph Level Calculation
For a graph with n nodes, each with f features, we can concatenate all the features into a single (n × f) matrix X.
The neighbour feature transformation and aggregation steps can then be written in matrix form using A + I, where I is the identity matrix. Adding the identity ensures that each node's own features are included; otherwise, we would only be considering the transformed features of node j's neighbours and not its own. One final step is to normalise each node based on its number of connections, i.e. for node j with N_j connections, the aggregated features are divided by N_j. In matrix form this normalisation is expressed through the degree matrix D, a diagonal matrix containing the number of connections of each node. More commonly, however, a symmetric normalisation using D^(-1/2) on both sides is applied.
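Putting these steps together in matrix form, one common formulation consistent with the description above (with W the shared weight matrix and σ a non-linear activation) is:
H′ = σ( (A + I) X W ) (transform and aggregate, including each node's own features)
H′ = σ( D⁻¹ (A + I) X W ) (normalised by the number of connections)
H′ = σ( D^(-1/2) (A + I) D^(-1/2) X W ) (symmetric normalisation, the form commonly used in GCNs)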
This is the graph convolutional network (GCN) method that enables GNNs to learn the structure and relationships between nodes. However, an issue with GCN is that the weight matrix used for the neighbour feature transformation is shared across all neighbours, i.e. all neighbours are treated as equally important, which is usually not the case and therefore not a good representation of real systems. To address this, graph attention networks (GATs) compute the importance of each neighbour's features to the target node, allowing different neighbours to contribute differently to the feature update of the target node based on their relevance. A raw attention score e_ij, indicating the importance of node i's features to node j, is computed using the shared learnable linear feature transformation W and a learnable weight vector w_a. The attention scores are then normalised using the SoftMax function, and the feature aggregation step is calculated using the resulting attention coefficients.
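In the notation used above, one common way of writing the attention mechanism (following the original GAT formulation) is:
e_ij = LeakyReLU( w_aᵀ [ W x_i ∥ W x_j ] )
α_ij = exp(e_ij) / Σ_{k ∈ N_j} exp(e_kj)
x_j′ = σ( Σ_{i ∈ N_j} α_ij W x_i )
where ∥ again denotes concatenation and α_ij are the normalised attention coefficients.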
That completes a single layer. We can stack multiple layers to increase the complexity of the model, as demonstrated in Figure 3. Increasing the number of layers allows the model to learn more global features and capture more complex relationships; however, it also makes the model more likely to overfit, so regularisation techniques should always be used.
Finally, once the final feature vectors for all nodes are obtained from the network, they can be stacked into a feature matrix H.
This feature matrix can be used for a number of tasks, e.g. node or graph classification. This brings us to the end of the introduction to the mathematical description of GCNs/GATs.
Let's implement a regression example where the aim is to train a network to predict the value of a node given the values of all other nodes, i.e. each node has a single feature (a scalar value). The aim of this example is to leverage the inherent relational information encoded in the graph to accurately predict numerical values for each node. The key thing to note is that we input the numerical values for all nodes except the target node (we mask the target node's value with 0) and then predict the target node's value. For each data point, we repeat the process for all nodes. Perhaps this comes across as a bizarre task, but let's see if we can predict the expected value of any node given the values of the other nodes. The data used is simulation data corresponding to a series of sensors from industry, and the graph structure I have chosen in the example below is based on the actual process structure. I have provided comments in the code to make it easy to follow. You can find a copy of the dataset here (note: this is my own data, generated from simulations).
This code and training procedure is far from optimised, but its aim is to illustrate the implementation of GNNs and to build an intuition for how they work. One aspect of the current approach that should definitely not be replicated beyond learning purposes is the way each node's feature value is masked and predicted from the neighbours' features: you have to loop over each node, which is not very efficient. A much better way is to stop the model from including a node's own features in the aggregation step, so that you would not need to process one node at a time, but I thought the current method makes it easier to build intuition for the model :)
Preprocessing Data
Import the necessary libraries and the sensor data from the CSV file, then normalise all data to the range 0 to 1.
import pandas as pd
import torch
from torch_geometric.data import Data, Batch
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
from torch_geometric.loader import DataLoader
# load and scale the dataset
df = pd.read_csv('SensorDataSynthetic.csv').dropna()
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
Defining the connectivity (edge index) between nodes in the graph using a PyTorch tensor — i.e. this provides the system’s graphical topology.
nodes_order = [
'Sensor1', 'Sensor2', 'Sensor3', 'Sensor4',
'Sensor5', 'Sensor6', 'Sensor7', 'Sensor8'
]
# define the graph connectivity for the data
edges = torch.tensor([
[0, 1, 2, 2, 3, 3, 6, 2], # source nodes
[1, 2, 3, 4, 5, 6, 2, 7] # target nodes
], dtype=torch.long)
The data imported from the CSV has a tabular structure, but to use it in GNNs it must be transformed into a graphical structure. Each row of data (one observation) is represented as one graph. We iterate through each row to create the graphical representation of the data.
A mask is created for each node/sensor to indicate the presence (1) or absence (0) of data, allowing for flexibility in handling missing data; in most systems there may be items with no data available, hence the need for this flexibility. We then split the data into training and testing sets.
graphs = []
# iterate through each row of data to create a graph for each observation
# some nodes will not have any data, not the case here but created a mask to allow us to deal with any nodes that do not have data available
for _, row in df_scaled.iterrows():
    node_features = []
    node_data_mask = []
    for node in nodes_order:
        if node in df_scaled.columns:
            node_features.append([row[node]])
            node_data_mask.append(1)  # mask value of 1 to indicate presence of data
        else:
            # placeholder feature for missing nodes if necessary
            node_features.append([0])
            node_data_mask.append(0)  # data not present

    node_features_tensor = torch.tensor(node_features, dtype=torch.float)
    node_data_mask_tensor = torch.tensor(node_data_mask, dtype=torch.float)

    # Create a Data object for this row/graph
    graph_data = Data(x=node_features_tensor, edge_index=edges.t().contiguous(), mask=node_data_mask_tensor)
    graphs.append(graph_data)
#### splitting the data into train, test observation
# Split indices
observation_indices = df_scaled.index.tolist()
train_indices, test_indices = train_test_split(observation_indices, test_size=0.05, random_state=42)
# Create training and testing graphs
train_graphs = [graphs[i] for i in train_indices]
test_graphs = [graphs[i] for i in test_indices]
Graph Visualisation
The graph structure created above using the edge indices can be visualised using networkx.
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
for src, dst in edges.t().numpy():
    G.add_edge(nodes_order[src], nodes_order[dst])
plt.figure(figsize=(10, 8))
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', node_size=2000, font_weight='bold')
plt.title('Graph Visualization')
plt.show()
Model Definition
Let's define the model. The model incorporates two GAT convolutional layers. The first layer transforms the node features to a 16-dimensional space, and the second GAT layer reduces this to an 8-dimensional representation.
GNNs are highly susceptible to overfitting, so regularisation (dropout) is applied after each GAT layer with a user-defined probability. The dropout layer essentially randomly zeroes some of the elements of the input tensor during training.
The output of the GAT convolution layers is passed through a fully connected (linear) layer to map the 8-dimensional output to the final node feature, which in this case is a scalar value per node.
Masking the value of the target node: as mentioned earlier, the aim of this task is to regress the value of the target node based on the values of its neighbours. This is the reason for masking/replacing the target node's value with zero.
from torch_geometric.nn import GATConv
import torch.nn.functional as F
import torch.nn as nn
class GNNModel(nn.Module):
    def __init__(self, num_node_features):
        super(GNNModel, self).__init__()
        self.conv1 = GATConv(num_node_features, 16)
        self.conv2 = GATConv(16, 8)
        self.fc = nn.Linear(8, 1)  # Outputting a single value per node

    def forward(self, data, target_node_idx=None):
        x, edge_index = data.x, data.edge_index
        edge_index = edge_index.T  # edge_index was stored transposed, so restore the (2, num_edges) shape
        x = x.clone()

        # Mask the target node's feature with a value of zero!
        # Aim is to predict this value from the features of the neighbours
        if target_node_idx is not None:
            x[target_node_idx] = torch.zeros_like(x[target_node_idx])

        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.05, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        x = F.dropout(x, p=0.05, training=self.training)
        x = self.fc(x)

        return x
Training the model
Initialise the model and define the optimiser, loss function and hyperparameters, including the learning rate, weight decay (for regularisation), batch size and number of epochs.
model = GNNModel(num_node_features=1)
batch_size = 8
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002, weight_decay=1e-6)
criterion = torch.nn.MSELoss()
num_epochs = 200
train_loader = DataLoader(train_graphs, batch_size=1, shuffle=True)
model.train()
The training process is fairly standard: each graph (one data point) is passed through the forward pass of the model, iterating over each node and predicting the target node's value. The loss from the predictions is accumulated over the defined batch size before updating the GNN through backpropagation.
for epoch in range(num_epochs):
    accumulated_loss = 0
    optimizer.zero_grad()
    loss = 0
    for batch_idx, data in enumerate(train_loader):
        mask = data.mask
        for i in range(1, data.num_nodes):
            if mask[i] == 1:  # Only train on nodes with data
                output = model(data, i)  # get predictions with the target node masked
                target = data.x[i]
                prediction = output[i].view(1)
                loss += criterion(prediction, target)
        # Update parameters at the end of each set of batches
        if (batch_idx + 1) % batch_size == 0 or (batch_idx + 1) == len(train_loader):
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            accumulated_loss += loss.item()
            loss = 0

    average_loss = accumulated_loss / len(train_loader)
    print(f'Epoch {epoch+1}, Average Loss: {average_loss}')
Testing the trained model
Using the test dataset, pass each graph through the forward pass of the trained model and predict each node's value based on its neighbours' values.
test_loader = DataLoader(test_graphs, batch_size=1, shuffle=True)
model.eval()
actual = []
pred = []
with torch.no_grad():  # no gradients needed for evaluation
    for data in test_loader:
        mask = data.mask
        for i in range(1, data.num_nodes):
            output = model(data, i)
            prediction = output[i].view(1)
            target = data.x[i]

            actual.append(target)
            pred.append(prediction)
Visualising the test results
Using iplot we can visualise the predicted values of nodes against the ground truth values.
import plotly.graph_objects as go
from plotly.offline import iplot
actual_values_float = [value.item() for value in actual]
pred_values_float = [value.item() for value in pred]
scatter_trace = go.Scatter(
x=actual_values_float,
y=pred_values_float,
mode='markers',
marker=dict(
size=10,
opacity=0.5,
color='rgba(255,255,255,0)',
line=dict(
width=2,
color='rgba(152, 0, 0, .8)',
)
),
name='Actual vs Predicted'
)
line_trace = go.Scatter(
x=[min(actual_values_float), max(actual_values_float)],
y=[min(actual_values_float), max(actual_values_float)],
mode='lines',
marker=dict(color='blue'),
name='Perfect Prediction'
)
data = [scatter_trace, line_trace]
layout = dict(
title='Actual vs Predicted Values',
xaxis=dict(title='Actual Values'),
yaxis=dict(title='Predicted Values'),
autosize=False,
width=800,
height=600
)
fig = dict(data=data, layout=layout)
iplot(fig)
Despite the lack of any fine-tuning of the model architecture or hyperparameters, it has actually done a decent job; we could tune the model further to get improved accuracy.
This brings us to the end of this article. GNNs are relatively newer than other branches of machine learning, and it will be very exciting to see the developments in this field as well as its application to different problems. Finally, thank you for taking the time to read this article; I hope you found it useful in building your understanding of GNNs and their mathematical background.
Unless otherwise noted, all images are by the author
Structure and Relationships: Graph Neural Networks and a Pytorch Implementation was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Structure and Relationships: Graph Neural Networks and a Pytorch Implementation
Discover the architecture of Time-LLM and apply it in a forecasting project with Python
Originally appeared here:
Time-LLM: Reprogram an LLM for Time Series Forecasting
When I decided to write an article on building scalable pipelines with Vertex AI last year, I contemplated the different formats it could take. I finally settled on building a fully functioning MLOps platform, as lean as possible due to time restrictions, and open-sourcing the platform for the community to gradually develop. But time proved a limiting factor and I kept dillydallying. On some weekends, when I finally decided to put together the material, I ran into a litany of issues, which I have now documented to serve as a guide for others who might tread the same path.
This is what led to the development of mlops-platform, an initiative designed to demonstrate a streamlined, end-to-end process of building scalable and operationalised machine learning models on Vertex AI using Kubeflow pipelines. The major features of the platform are fourfold: firstly, it encapsulates a modular and flexible pipeline architecture that accommodates various stages of the machine learning lifecycle, from data loading and preprocessing to model training, evaluation, deployment and inference. Secondly, it leverages Google Cloud's Vertex AI services for seamless integration, ensuring optimal performance, scalability and resource efficiency. Thirdly, it is scaffolded with a series of operations that are frequently used to automate ML workflows. Lastly, it documents common challenges experienced when building projects of this scale and their respective workarounds.
I have built the mlops platform with two major purposes in mind:
I hope the platform will continue to grow from contributions from the community.
Though Google has a GitHub repo containing numerous examples of using Vertex AI pipelines, the repo is daunting to navigate. Moreover, you often need multiple ops wrappers around your application for organisational purposes, since several teams may be using the platform. More often than not, issues crop up during development that do not get addressed adequately, leaving developers frustrated. Google support might be insufficient, especially when chasing production deadlines. From personal experience, even though my company has enhanced support, I have had an issue raised with the Google Vertex engineering team drag on for more than four months. In addition, due to the rapid pace at which the technology is evolving, posting on forums might not yield the desired solution, since few people may have experienced the issue being posted about. So having a working end-to-end platform to build upon, with community support, is invaluable.
By the way, have you heard about pain driven development (PDD)? It is analogous to test or behaviour driven development. In PDD, development is driven by pain points: changes are made to the codebase when the team feels impacted and can justify the trade-off. It follows the mantra of 'if it ain't broke, don't fix it'. Not to worry, this post will save you some pain (and the frustration that comes with it) when using Google Vertex AI, especially the prebuilt containers, for building scalable ML pipelines. More appropriately, in line with the PDD principle, I have deliberately made it a working platform with some pain points. I have detailed those pain points hoping that interested parties from the community will join me in gradually integrating the fixes. With that housekeeping out of the way, let's cut to the chase!
Google Vertex AI Pipelines provides a framework for running ML workflows using pipelines that are designed with the Kubeflow or TensorFlow Extended frameworks. In this way, Vertex AI serves as an orchestration platform that allows composing a number of ML tasks and automating their execution on GCP infrastructure. This is an important distinction to make: we don't write the pipelines with Vertex AI; rather, it serves as the platform for orchestrating the pipelines. The underlying Kubeflow or TensorFlow Extended pipeline follows a common framework used for orchestrating tasks in modern architectures. The framework separates the logic from the computing environment. The logic, in the case of an ML workflow, is the ML code, while the computing environment is a container. Together they are referred to as a component, and when multiple components are grouped together, they are referred to as a pipeline. There is a mechanism in place, similar to other orchestration platforms, to pass data between the components. The best place to learn in depth about pipelines is the Kubeflow documentation and several other blog posts which I have linked in the references section.
I mentioned the general architecture of orchestration platforms previously. Some other tools using a similar architecture to Vertex AI, where logic is separated from compute, are Airflow (tasks and executors), GitHub Actions (jobs and runners), CircleCI (jobs and executors) and so on. I have an article in the pipeline on how a good grasp of the principle of separation of concerns, integrated into this modern workflow architecture, can significantly help in the day-to-day use of these tools and their troubleshooting. Though Vertex AI is synonymous with orchestrating ML pipelines, in theory any logic, such as a Python script, a data pipeline or any containerised application, could be run on the platform. Composer, which is a managed Apache Airflow environment, was the main orchestration platform on GCP prior to Vertex AI. The two platforms have pros and cons that should be considered when deciding which to use.
I am going to avoid spamming this post with code, which is easily accessible from the platform repository. However, I will run through the important parts of the mlops-platform architecture. Please refer to the repo to follow along.
Components
The architecture of the platform revolves around a set of well-defined components housed within the components directory. These components, such as data loading, preprocessing, model training, evaluation and deployment, provide a modular structure, allowing for easy customisation and extension. Let's look through one of the components, preprocess_data.py, to understand the general structure of a component.
from config.config import base_image
from kfp.v2 import dsl
from kfp.v2.dsl import Dataset, Input, Output
@dsl.component(base_image=base_image)
def preprocess_data(
    input_dataset: Input[Dataset],
    train_dataset: Output[Dataset],
    test_dataset: Output[Dataset],
    train_ratio: float = 0.7,
):
    """
    Preprocess data by partitioning it into training and testing sets.
    """
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv(input_dataset.path)
    df = df.dropna()

    if set(df.iloc[:, -1].unique()) == {'Yes', 'No'}:
        df.iloc[:, -1] = df.iloc[:, -1].map({'Yes': 1, 'No': 0})

    train_data, test_data = train_test_split(df, train_size=train_ratio, random_state=42)
    train_data.to_csv(train_dataset.path, index=False)
    test_data.to_csv(test_dataset.path, index=False)
A closer look at the script above shows a familiar data science workflow. All the script does is read in some data, split it for model development, and write the splits to a path where they can be readily accessed by downstream tasks. However, since this function will run on Vertex AI, it is decorated with @dsl.component(base_image=base_image), which marks the function as a Kubeflow pipeline component to be run within the base_image container. I will talk about the base_image later. This is all that is required to run a function within a container on Vertex AI. Once we have structured all our other functions in a similar manner and decorated them as Kubeflow pipeline components, the mlpipeline.py script imports each component to structure the pipeline.
#mlpipeline.py
from kfp.v2 import dsl, compiler
from kfp.v2.dsl import pipeline
from components.load_data import load_data
from components.preprocess_data import preprocess_data
from components.train_random_forest import train_random_forest
from components.train_decision_tree import train_decision_tree
from components.evaluate_model import evaluate_model
from components.deploy_model import deploy_model
from config.config import gcs_url, train_ratio, project_id, region, serving_image, service_account, pipeline_root
from google.cloud import aiplatform
@pipeline(
    name="ml-platform-pipeline",
    description="A pipeline that performs data loading, preprocessing, model training, evaluation, and deployment",
    pipeline_root=pipeline_root
)
def mlplatform_pipeline(
    gcs_url: str = gcs_url,
    train_ratio: float = train_ratio,
):
    load_data_op = load_data(gcs_url=gcs_url)
    preprocess_data_op = preprocess_data(
        input_dataset=load_data_op.output,
        train_ratio=train_ratio
    )
    train_rf_op = train_random_forest(train_dataset=preprocess_data_op.outputs['train_dataset'])
    train_dt_op = train_decision_tree(train_dataset=preprocess_data_op.outputs['train_dataset'])
    evaluate_op = evaluate_model(
        test_dataset=preprocess_data_op.outputs['test_dataset'],
        dt_model=train_dt_op.output,
        rf_model=train_rf_op.output
    )
    deploy_model_op = deploy_model(
        optimal_model_name=evaluate_op.outputs['optimal_model'],
        project=project_id,
        region=region,
        serving_image=serving_image,
        rf_model=train_rf_op.output,
        dt_model=train_dt_op.output
    )


if __name__ == "__main__":
    pipeline_filename = "mlplatform_pipeline.json"
    compiler.Compiler().compile(
        pipeline_func=mlplatform_pipeline,
        package_path=pipeline_filename
    )

    aiplatform.init(project=project_id, location=region)
    _ = aiplatform.PipelineJob(
        display_name="ml-platform-pipeline",
        template_path=pipeline_filename,
        parameter_values={
            "gcs_url": gcs_url,
            "train_ratio": train_ratio
        },
        enable_caching=True
    ).submit(service_account=service_account)
The @pipeline decorator enables the function mlplatform_pipeline to be run as a pipeline. The pipeline is then compiled to the specified pipeline filename. Here, I have specified a JSON extension for the compiled file, but I think Google is moving to YAML. The compiled file is then picked up by aiplatform and submitted to the Vertex AI platform for execution.
The only other thing I found puzzling while starting out with Kubeflow pipelines was the set-up of parameters and artifacts, so have a look to get up to speed.
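In short, plain Python types (str, float, int, etc.) are passed by value as parameters, while datasets and models are passed by reference as artifacts through the Input[...] and Output[...] annotations. A minimal sketch of a component using both (the component name and its contents here are hypothetical, not taken from the repo) could look like this:
from kfp.v2 import dsl
from kfp.v2.dsl import Dataset, Input, Model, Output

@dsl.component(base_image="python:3.10")
def train_model(
    train_dataset: Input[Dataset],   # artifact: passed by reference as a path
    model: Output[Model],            # artifact: the component writes to model.path
    learning_rate: float = 0.01,     # parameter: passed by value
):
    # Inside the component, artifacts expose a .path attribute pointing to the
    # storage location that Vertex AI provisions for this step.
    with open(train_dataset.path) as f:
        data = f.read()
    # ... train something with `data` and `learning_rate` (placeholder) ...
    with open(model.path, "w") as f:
        f.write("trained-model-placeholder")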
Configuration
The configuration file in the config directory facilitates the adjustment of parameters and settings across the different stages of the pipeline. Along with the config file, I have also included a dot.env file, which has comments on the specifics of the variables and is meant to serve as a guide for the variables that are loaded into the config file.
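As a rough sketch of what such a config module can look like (illustrative only, assuming python-dotenv is used to load the dot.env file; the environment variable names are my own placeholders), the variables imported by mlpipeline.py could be exposed as simple module-level constants:
# config/config.py (illustrative sketch)
import os
from dotenv import load_dotenv

# load the variables defined in the dot.env file into the process environment
load_dotenv()

project_id = os.getenv("PROJECT_ID")
region = os.getenv("REGION")
gcs_url = os.getenv("GCS_URL")
base_image = os.getenv("BASE_IMAGE")
serving_image = os.getenv("SERVING_IMAGE")
service_account = os.getenv("SERVICE_ACCOUNT")
pipeline_root = os.getenv("PIPELINE_ROOT")
train_ratio = float(os.getenv("TRAIN_RATIO", "0.7"))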
Notebooks
I mostly start my workflow and exploration within notebooks, as they enable easy interaction. As a result, I have included a notebooks directory as a means of experimenting with the different components' logic.
Testing
Testing plays a very important role in ensuring the robustness and reliability of machine learning workflows and pipelines. Comprehensive testing establishes a systematic approach to assessing the functionality of each component and ensures that they behave as intended, reducing the chance of errors and malfunctions during execution. I have included a test_mlpipeline.py script, mostly as a guide for the testing process. It uses pytest to illustrate the testing concepts and provides a framework to build upon.
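To illustrate the idea (a minimal sketch, not the actual contents of test_mlpipeline.py), a test can exercise the core logic of a component, such as the splitting step of preprocess_data, directly with pytest:
import pandas as pd
from sklearn.model_selection import train_test_split


def split_frame(df: pd.DataFrame, train_ratio: float = 0.7):
    # mirrors the splitting logic inside the preprocess_data component
    df = df.dropna()
    return train_test_split(df, train_size=train_ratio, random_state=42)


def test_split_ratio():
    df = pd.DataFrame({"feature": range(100), "label": ["Yes", "No"] * 50})
    train, test = split_frame(df, train_ratio=0.7)
    assert len(train) == 70
    assert len(test) == 30


def test_no_missing_values():
    df = pd.DataFrame({"feature": [1.0, None, 3.0], "label": ["Yes", "No", "Yes"]})
    train, test = split_frame(df, train_ratio=0.5)
    assert not pd.concat([train, test]).isna().any().any()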
Project Dependencies
Managing dependencies can be a nightmare when developing enterprise-scale applications. Given the myriad packages required in an ML workflow, combined with the various software applications needed to operationalise it, managing dependencies in a sane manner can become a Herculean task. One package that is steadily gaining traction is Poetry, a tool for dependency management and packaging in Python. The key files generated by Poetry are pyproject.toml and poetry.lock. The pyproject.toml file is a configuration file for storing project metadata and dependencies, while the poetry.lock file locks the exact versions of dependencies, ensuring consistent and reproducible builds across different environments. Together, these two files enhance dependency resolution. I have demonstrated how the two files replace the use of requirements.txt within a container by using them to generate the training container image for this project.
Makefile
A Makefile is a build automation tool that facilitates the compilation and execution of a project’s tasks through a set of predefined rules. Developers commonly use Makefiles to streamline workflows, automate repetitive tasks, and ensure consistent and reproducible builds. The Makefile within mlops-platform has predefined commands to seamlessly run the entire pipeline and ensure the reliability of the components. For example, the all target, specified as the default, efficiently orchestrates the execution of both the ML pipeline (run_pipeline) and tests (run_tests). Additionally, the Makefile provides a clean target for tidying up temporary files while the help target offers a quick reference to the available commands.
The project is documented in the README.md file, which provides a comprehensive guide to the project. It includes detailed instructions on installation, usage, and setting up Google Cloud Platform services.
Orchestration with CI/CD
GitHub Actions workflow defined in .github/workflows directory is crucial for automating the process of testing, building, and deploying the machine learning pipeline to Vertex AI. This CI/CD approach ensures that changes made to the codebase are consistently validated and deployed, enhancing the project’s reliability and reducing the likelihood of errors. The workflow triggers on each push to the main branch or can be manually executed, providing a seamless and reliable integration process.
Inference Pipeline
There are multiple ways to implement an inference or prediction pipeline. I have gone the good old way here by loading in both the prediction features and the uploaded model, getting predictions from the model and writing the predictions to a BigQuery table. It is worth noting that, for all the talk about prediction containers, they are not really needed if all that is required is batch prediction; we might as well use the training container for batch prediction, as demonstrated in the platform. The prediction container is, however, required for online prediction. I have also included a means of testing the batch prediction pipeline locally, which can be generalised to test any of the other components or any other scripts. Local testing can be done by navigating to the batch_prediction/batch_prediction_test directory, substituting the placeholder variables and running the following commands:
# First build the image using Docker
docker build -f Dockerfile.batch -t batch_predict .
# Then run the batch prediction pipeline locally using the built image from above
docker run -it \
  -v {/local/path/to/service_account-key.json}:/secrets/google/key.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/secrets/google/key.json \
  batch_predict \
  --model_gcs_path={gs://path/to/gcs/bucket/model.joblib} \
  --input_data_gcs_path={gs://path/to/gcs/bucket/prediction_data.csv} \
  --table_ref={project_id.dataset.table_name} \
  --project={project_id}
The service account needs proper access on GCP to execute the task above, it should have permission to read from the GCP bucket and write to the BigQuery table.
Some of the challenges encountered while building this project emanate from the use of container images and the associated package versions within the Google prebuilt containers. I presume that Google's main goal in creating prebuilt containers is to take major engineering tasks off data scientists and enable them to focus mainly on ML logic. However, more work would be required to achieve this aim, as the prebuilt containers have various version mismatches that require significant debugging effort to resolve. I have detailed some of the challenges and some possible fixes below.
1. Platform (architecture) mismatch when building images locally, for example on an Apple silicon (ARM) machine; a multi-architecture build is one way around it:
# start Docker Desktop (can also open manually)
open -a Docker
open -a Docker
# authenticate to GCP if you want to push the image to the GCP artifact repo
gcloud auth login
gcloud auth configure-docker "{region}-docker.pkg.dev" --quiet
# create and use a buildx builder instance (only needed once)
docker buildx create --name mybuilder --use
docker buildx inspect --bootstrap
# build and push a multi-architecture Docker image with buildx
docker buildx build --platform linux/amd64,linux/arm64 -t "{region}-docker.pkg.dev/{project_id}/{artifact_repo}/{image-name}:latest" -f Dockerfile --push .
The name of the container follows Google's specific format for naming containers. Alternatively, rather than building a multi-architecture image, you can set the default Docker build platform to linux/amd64:
# open Zsh config file (I use visual code but it could be other editor like nano)
code ~/.zshrc
# insert at the end of file
export DOCKER_DEFAULT_PLATFORM=linux/amd64
# save and close file then apply changes
source ~/.zshrc
2. Conflicting versions in prebuilt container images: Google maintains a host of prebuilt images for prediction and training tasks. These container images are available for common ML frameworks in different versions. However, I found that the documented versions sometimes don't match the actual version, and this constitutes a major point of failure when using these container images. Given what the community has gone through in standardising versions and dependencies, and the fact that container technology was developed mainly to address the reliable execution of applications, I think Google should strive to address the conflicting versions in the prebuilt container images. Make no mistake, battling with version mismatches can be frustrating, which is why I encourage 'jailbreaking' the prebuilt images prior to using them. When developing this tutorial, I decided to use europe-docker.pkg.dev/vertex-ai/training/sklearn-gpu.1-0:latest and europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest. From the naming conventions, both are supposed to be compatible and should have sklearn==1.0. In fact, this is confirmed on the site as shown in the screenshot below and also on the container image artifact registry.
However, the reality is different. I ran into version mismatch errors when deploying the built model to an endpoint. A section of the error message is shown below.
Trying to unpickle estimator OneHotEncoder from version 1.0.2 when using version 1.0
Surprise! Surprise! Surprise! Basically, what the log says is that the model was pickled with version 1.0.2 but is being unpickled with version 1.0. To make progress, I decided to do some 'jailbreaking' and look under the hood of the prebuilt container images. It is a very basic procedure, but it opened many cans of worms.
2. Pull the prebuilt training image
docker pull europe-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest
3. Run the image, override its entrypoint command and drop into its bash shell terminal
docker run -it --entrypoint /bin/bash europe-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest
4. Check the sklearn version
python -c "import sklearn; print(sklearn.__version__)"
The output, as of the time of writing this post, was sklearn version 1.0.2 rather than the 1.0 implied by the image name.
Conducting a similar exercise for europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest, the sklearn version is 1.3.2 (and 1.2.2 for the 1-2 version). What is even more baffling is that pandas is missing from both versions 1-2 and 1-3, which begs the question of whether the prebuilt containers are being actively maintained. Of course, the issue is not the minor version update itself but the fact that the corresponding prediction image did not receive a similar update, which results in the mismatch error shown above.
When I contacted Google support to report the mismatch, the Vertex AI engineering team mentioned alternatives such as Custom Prediction Routines (CPR) and SklearnPredictor, and pointed me to newer image versions with similar issues and missing pandas!
Moving on, if you are feeling like a Braveheart and want to explore further, you can access all the other files that Google runs when launching the prebuilt containers by running the ls command from within the container and looking through the files and folders.
So having discovered the issue, what can be done in order to still take advantage of prebuilt containers? What I did was to extract all the relevant packages from the container.
pip freeze > requirements.txt
cat requirements.txt
The commands above will extract all the installed packages and print them to the container terminal. The packages can then be copied and used in creating a custom container image, ensuring that the ML framework version in both the training and prediction container matches. If you prefer to copy the file content to your local directory then use the following command:
# If on local terminal, copy requirements.txt into current directory
docker cp {running-container}:/requirements.txt .
Some of the packages in the prebuilt containers will not be needed for an individual project, so it is better to select the ones that match your workflow. The most important one to lock down is the ML framework version, whether it is sklearn or xgboost, making sure that the training and prediction versions match.
I have basically locked the sklearn version to match the version of the prebuilt prediction image. In this case, it is version 1.0 and I have left all the other packages as they are.
Then to build the custom training image, use the following commands:
# commands to build the docker
#first authenticate to gcloud
# gcloud auth login
gcloud auth configure-docker
# Build the image using Docker
docker build -f docker/Dockerfile.poetry -t {region}-docker.pkg.dev/{gcp-project-id}/{gcp-artifact-repo}/{image-name}:latest .
In essence, the command above builds a custom training image from docker/Dockerfile.poetry and tags it using the Google-specific artifact registry naming format ({region}-docker.pkg.dev/{gcp-project-id}/{gcp-artifact-repo}/{image-name}:latest).
Then the built image can be pushed to the artifact registry as follows:
# Push to artifact registry
docker push {region}-docker.pkg.dev/{gcp-project-id}/{gcp-artifact-repo}/{image-name}:latest
There are numerous extensions to be added to this project, and I invite willing contributors to pick up any of them. Some of my thoughts are detailed below, but feel free to suggest any other improvements. Contributions are welcome via PR. I hope the repo can be actively developed by those who want to learn end-to-end MLOps, as well as serve as a base on which small teams can build.
The MLOps platform provides a modular and scalable pipeline architecture for implementing the different stages of the ML lifecycle. It includes the various operations that enable such a platform to work seamlessly. Most importantly, it provides a learning opportunity for would-be contributors and should serve as a good base on which teams can build for their machine learning tasks.
Well, that is it, people! Congratulations and well done if you made it this far. I hope you have benefited from this post. Your comments and feedback are most welcome, and please let's connect on LinkedIn. If you found this valuable, then don't forget to like the post and give the MLOps platform repository a star.
References
MLOps repo: https://github.com/kbakande/mlops-platform
https://medium.com/google-cloud/machine-learning-pipeline-development-on-google-cloud-5cba36819058
https://datatonic.com/insights/vertex-ai-improving-debugging-batch-prediction/
https://econ-project-templates.readthedocs.io/en/v0.5.2/pre-commit.html
Extensible and Customisable Vertex AI MLOps Platform was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Extensible and Customisable Vertex AI MLOps Platform
Learn how to build neural networks for direct causal inference
Originally appeared here:
TARNet and Dragonnet: Causal Inference Between S- And T-Learners
Learn to use SQLAlchemy asynchronously in different scenarios
Originally appeared here:
How to Use SQLAlchemy to Make Database Requests Asynchronously
Learn what an RFM model is, how to create one, and how to segment on the results
Originally appeared here:
How to Create an RFM Model in BigQuery
How changes in the distribution arise, and the impact of verification delay.
Originally appeared here:
Understanding Concept Drift: A Simple Guide
This article was co-authored with Christy Lee, a Ph.D. student in Statistics at UCLA.
t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are non-linear dimensionality reduction techniques for visualizing high-dimensional data, particularly in the context of single-cell analysis for visualizing cell clusters. However, it is important to note that t-SNE and UMAP may not always produce trustworthy representations of the relative distances between cell clusters.
In our Nature Communications paper [1], we provide a framework for (1) identifying data distortions in projection from a high-dimensional to two-dimensional (2D) space and (2) optimizing hyperparameter settings in a 2D dimension-reduction method.
Consider a 3-dimensional (3D) globe vs a 2-dimensional (2D) map. It is impossible to represent an entire globe accurately in only 2D; distance may not be accurate, and the size of some countries may be distorted. Typically, land masses at the edge of the map, like Antarctica, are the most changed. Despite these distortions, 2D maps are useful for everyday use; students or the common traveler to the main continents will not be affected by the distortion in Antarctica, but an intrepid traveler to the poles will certainly require a different map.
Similarly, the representation of single-cell genomics data often requires moving from a high-dimensional to 2D space, so-called 2D embedding. As with the conversion of the globe, this can induce distortions. The 2D post-embedding space may not accurately represent the pre-embedding space. Adding to the problem, popular 2D embedding methods, like t-SNE and UMAP, are sensitive to hyperparameter selection. While general guidelines exist to tailor hyperparameters like perplexity and n.neighbors to the size of the dataset, these guidelines do not help answer the underlying question– what parts of the visualization are misleading?
Similar to cartographers selecting which landmasses to recreate faithfully and which to distort, researchers must prioritize which aspects of the pre-embedding space are most important to preserve post-embedding. Common uses of 2D visualization include annotation and analysis of cell trajectories and clusters. Although cell trajectories and clusters are generally calculated in the high-dimensional space, their results are often visualized through 2D embeddings, in which cells with similar gene expression are expected to be close to each other. Therefore, we concluded that the most important aspect of preservation is the position of cells relative to each other.
These ideas formed the motivation for scDEED, a single-cell dubious embedding detector (Fig. 1). The key idea is that a cell’s pre- and post-embedding neighbors should be similar. It is worth noting that the pre-embedding space is typically 20- to 50-dimensional in single-cell data analysis, usually the principal component space. For each cell, we calculate a reliability score that reflects the visual agreement between the neighbors found in the 2D-embedding space and the pre-embedding space. Cells whose 2D embedding neighbors have been drastically changed through the embedding process are called ‘dubious’; the cell’s relative location is misleading and does not reflect where the cell should be based on the pre-embedding space. Identification of these cells provides a mechanism to optimize hyperparameters by selecting the settings that result in the least amount of dubious cell embeddings.
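To make the intuition concrete, below is a rough illustrative sketch (not the exact reliability score defined in scDEED) of how the agreement between a cell's pre-embedding and 2D-embedding neighbours could be quantified:
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbour_agreement(pre_embedding: np.ndarray, post_embedding: np.ndarray, k: int = 30) -> np.ndarray:
    # For each cell, the fraction of its k nearest neighbours in the 2D embedding
    # that are also among its k nearest neighbours in the pre-embedding (e.g. PCA) space.
    idx_pre = NearestNeighbors(n_neighbors=k + 1).fit(pre_embedding).kneighbors(pre_embedding, return_distance=False)[:, 1:]
    idx_post = NearestNeighbors(n_neighbors=k + 1).fit(post_embedding).kneighbors(post_embedding, return_distance=False)[:, 1:]
    return np.array([len(set(a) & set(b)) / k for a, b in zip(idx_pre, idx_post)])

# Cells with very low agreement would be candidates for 'dubious' embeddings, and a
# hyperparameter setting can be preferred if it minimises the number of such cells.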
In our paper [1], we use a variety of datasets to show how the identification of dubious cells and optimization of hyperparameters can aid analysis. For example, in the original visualization of the single-cell RNA-seq Hydra dataset [2], the neuron ectodermal 1 (neuron ec1) cells are split into two clusters, one that scDEED marked as dubious and the other trustworthy (Fig. 2a). As confirmed by the similarity in gene expression (Fig. 2c) and the singular cell type assigned by the authors, these two clusters are not biologically distinct, making their separation in the t-SNE misleading. Further, if we compare the neuron ec1 cells to its neighboring clusters, like the highlighted ectodermal epithelial cells (ecEP_sc), the gene expression is very different, which is counterintuitive given their proximity in the visualization. However, under the optimized perplexity found by scDEED (Fig. 2b), the neuron ec1 cell type is now unified, further supporting that the original split of the cell type into clusters was a result of hyperparameter settings. Additionally, the neuron ec1 and ecEP_sc cells are now far apart, which is more appropriate given their differences in gene expression. This highlights two uses for scDEED: identification of dubious cells can help discern cells whose embedding positions are misleading, and optimization of hyperparameters can result in a more trustworthy visualization.
An interesting application is RNA velocity [3], a downstream analysis task that relies on visualization. RNA velocity uses the amount of unspliced and spliced mRNA to estimate gene velocity– the change in gene expression. The estimated gene velocity can be used to calculate predicted gene expression for a future time point, which can be visualized with an arrow from the cell to the cell’s predicted state. For large datasets, it is not reasonable to plot each cell’s velocity vector; rather, cells are grouped based on their 2D embeddings, and their velocity vectors are aggregated. Changes to the 2D embedding will not affect the estimated gene velocities or predicted expression for the individual cells, but it will change the cell grouping for vector field calculations, and therefore affect the visualized RNA velocities and analysis. Using scDEED to optimize the hyperparameter perplexity of t-SNE (Fig. 3a) greatly enhanced the agreement among neighboring cells, and provided clearer RNA velocity results than using the default hyperparameter value (Fig. 3b). Additionally, the vectors are not exaggerated for the mature granules, an expected result because the cells are fully differentiated. Optimization of the hyperparameter enhanced only existing cell trajectories.
Recent work [5,6] has highlighted geometric qualities, like geodesics, manifolds, and distance, that cannot be fully recreated because the pre- and post-embedding spaces are not homeomorphic. scDEED can help reduce the inconsistencies by finding hyperparameter settings that accurately capture mid-range cell-cell relationships for the most number of possible cells and identifying cells whose mid-range neighbors have drastically changed. We hope that scDEED can be used as an add-on to existing analysis pipelines to provide a more trustworthy 2D visualization. It is worth pointing out that scDEED does not measure the preservation of all aspects of data; as cartographers deemed it most important to preserve the 5 main continents, we chose to prioritize the relative location of cells. With some adjustments to the definition of the reliability score (one per cell embedding), researchers interested in preserving other qualities of the pre-embedding space may still find the framework of scDEED useful.
References
1. Xia L, Lee C, Li JJ. Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat Commun. 2024;15: 1753. doi:10.1038/s41467-024-45891-y
2. Siebert S, Farrell JA, Cazet JF, Abeykoon Y, Primack AS, Schnitzler CE, et al. Stem cell differentiation trajectories in Hydra resolved at single-cell resolution. Science. 2019;365. doi:10.1126/science.aav9314
3. La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, et al. RNA velocity of single cells. Nature. 2018;560: 494–498.
4. Hochgerner H, Zeisel A, Lönnerberg P, Linnarsson S. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat Neurosci. 2018;21: 290–299.
5. Wang S, Sontag ED, Lauffenburger DA. What cannot be seen correctly in 2D visualizations of single-cell ’omics data? Cell Syst. 2023;14: 723–731.
6. Chari T, Pachter L. The specious art of single-cell genomics. PLoS Comput Biol. 2023;19: e1011288.
Statistical Method scDEED Detects Dubious t-SNE and UMAP Embeddings and Optimizes Hyperparameters was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Statistical Method scDEED Detects Dubious t-SNE and UMAP Embeddings and Optimizes Hyperparameters