
Originally appeared here:
Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker
Go Here to Read this Fast! Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker


Originally appeared here:
Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker
Go Here to Read this Fast! Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker



As tools for Python type annotations (or hints) have evolved, more complex data structures can be typed, improving maintainability and static analysis. Arrays and DataFrames, as complex containers, have only recently supported complete type annotations in Python. NumPy 1.22 introduced generic specification of arrays and dtypes. Building on NumPy’s foundation, StaticFrame 2.0 introduced complete type specification of DataFrames, employing NumPy primitives and variadic generics. This article demonstrates practical approaches to fully type-hinting arrays and DataFrames, and shows how the same annotations can improve code quality with both static analysis and runtime validation.
StaticFrame is an open-source DataFrame library of which I am an author.
Type hints (see PEP 484) improve code quality in a number of ways. Instead of using variable names or comments to communicate types, Python-object-based type annotations provide maintainable and expressive tools for type specification. These type annotations can be tested with type checkers such as mypy or pyright, quickly discovering potential bugs without executing code.
The same annotations can be used for runtime validation. While reliance on duck-typing over runtime validation is common in Python, runtime validation is more often needed with complex data structures such as arrays and DataFrames. For example, an interface expecting a DataFrame argument, if given a Series, might not need explicit validation as usage of the wrong type will likely raise. However, an interface expecting a 2D array of floats, if given an array of Booleans, might benefit from validation as usage of the wrong type may not raise.
Many important typing utilities are only available with the most-recent versions of Python. Fortunately, the typing-extensions package back-ports standard library utilities for older versions of Python. A related challenge is that type checkers can take time to implement full support for new features: many of the examples shown here require at least mypy 1.9.0.
Without type annotations, a Python function signature gives no indication of the expected types. For example, the function below might take and return any types:
def process0(v, q): ... # no type information
By adding type annotations, the signature informs readers of the expected types. With modern Python, user-defined and built-in classes can be used to specify types, with additional resources (such as Any, Iterator, cast(), and Annotated) found in the standard library typing module. For example, the interface below improves the one above by making expected types explicit:
def process0(v: int, q: bool) -> list[float]: ...
When used with a type checker like mypy, code that violates the specifications of the type annotations will raise an error during static analysis (shown as comments, below). For example, providing an integer when a Boolean is required is an error:
x = process0(v=5, q=20)
# tp.py: error: Argument "q" to "process0"
# has incompatible type "int"; expected "bool" [arg-type]
Static analysis can only validate statically defined types. The full range of runtime inputs and outputs is often more diverse, suggesting some form of runtime validation. The best of both worlds is possible by reusing type annotations for runtime validation. While there are libraries that do this (e.g., typeguard and beartype), StaticFrame offers CallGuard, a tool specialized for comprehensive array and DataFrame type-annotation validation.
A Python decorator is ideal for leveraging annotations for runtime validation. CallGuard offers two decorators: @CallGuard.check, which raises an informative Exception on error, or @CallGuard.warn, which issues a warning.
Further extending the process0 function above with @CallGuard.check, the same type annotations can be used to raise an Exception (shown again as comments) when runtime objects violate the requirements of the type annotations:
import static_frame as sf
@sf.CallGuard.check
def process0(v: int, q: bool) -> list[float]:
return [x * (0.5 if q else 0.25) for x in range(v)]
z = process0(v=5, q=20)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: int, q: bool) -> list[float]
# └── Expected bool, provided int invalid
While type annotations must be valid Python, they are irrelevant at runtime and can be wrong: it is possible to have correctly verified types that do not reflect runtime reality. As shown above, reusing type annotations for runtime checks ensures annotations are valid.
Python classes that permit component type specification are “generic”. Component types are specified with positional “type variables”. A list of integers, for example, is annotated with list[int]; a dictionary of floats keyed by tuples of integers and strings is annotated dict[tuple[int, str], float].
With NumPy 1.20, ndarray and dtype become generic. The generic ndarray requires two arguments, a shape and a dtype. As the usage of the first argument is still under development, Any is commonly used. The second argument, dtype, is itself a generic that requires a type variable for a NumPy type such as np.int64. NumPy also offers more general generic types such as np.integer[Any].
For example, an array of Booleans is annotated np.ndarray[Any, np.dtype[np.bool_]]; an array of any type of integer is annotated np.ndarray[Any, np.dtype[np.integer[Any]]].
As generic annotations with component type specifications can become verbose, it is practical to store them as type aliases (here prefixed with “T”). The following function specifies such aliases and then uses them in a function.
from typing import Any
import numpy as np
TNDArrayInt8 = np.ndarray[Any, np.dtype[np.int8]]
TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]
TNDArrayFloat64 = np.ndarray[Any, np.dtype[np.float64]]
def process1(
v: TNDArrayInt8,
q: TNDArrayBool,
) -> TNDArrayFloat64:
s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
return v * s
As before, when used with mypy, code that violates the type annotations will raise an error during static analysis. For example, providing an integer when a Boolean is required is an error:
v1: TNDArrayInt8 = np.arange(20, dtype=np.int8)
x = process1(v1, v1)
# tp.py: error: Argument 2 to "process1" has incompatible type
# "ndarray[Any, dtype[floating[_64Bit]]]"; expected "ndarray[Any, dtype[bool_]]" [arg-type]
The interface requires 8-bit signed integers (np.int8); attempting to use a different sized integer is also an error:
TNDArrayInt64 = np.ndarray[Any, np.dtype[np.int64]]
v2: TNDArrayInt64 = np.arange(20, dtype=np.int64)
q: TNDArrayBool = np.arange(20) % 3 == 0
x = process1(v2, q)
# tp.py: error: Argument 1 to "process1" has incompatible type
# "ndarray[Any, dtype[signedinteger[_64Bit]]]"; expected "ndarray[Any, dtype[signedinteger[_8Bit]]]" [arg-type]
While some interfaces might benefit from such narrow numeric type specifications, broader specification is possible with NumPy’s generic types such as np.integer[Any], np.signedinteger[Any], np.float[Any], etc. For example, we can define a new function that accepts any size signed integer. Static analysis now passes with both TNDArrayInt8 and TNDArrayInt64 arrays.
TNDArrayIntAny = np.ndarray[Any, np.dtype[np.signedinteger[Any]]]
def process2(
v: TNDArrayIntAny, # a more flexible interface
q: TNDArrayBool,
) -> TNDArrayFloat64:
s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
return v * s
x = process2(v1, q) # no mypy error
x = process2(v2, q) # no mypy error
Just as shown above with elements, generically specified NumPy arrays can be validated at runtime if decorated with CallGuard.check:
@sf.CallGuard.check
def process3(v: TNDArrayIntAny, q: TNDArrayBool) -> TNDArrayFloat64:
s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
return v * s
x = process3(v1, q) # no error, same as mypy
x = process3(v2, q) # no error, same as mypy
v3: TNDArrayFloat64 = np.arange(20, dtype=np.float64) * 0.5
x = process3(v3, q) # error
# static_frame.core.type_clinic.ClinicError:
# In args of (v: ndarray[Any, dtype[signedinteger[Any]]],
# q: ndarray[Any, dtype[bool_]]) -> ndarray[Any, dtype[float64]]
# └── ndarray[Any, dtype[signedinteger[Any]]]
# └── dtype[signedinteger[Any]]
# └── Expected signedinteger, provided float64 invalid
StaticFrame provides utilities to extend runtime validation beyond type checking. Using the typing module’s Annotated class (see PEP 593), we can extend the type specification with one or more StaticFrame Require objects. For example, to validate that an array has a 1D shape of (24,), we can replace TNDArrayIntAny with Annotated[TNDArrayIntAny, sf.Require.Shape(24)]. To validate that a float array has no NaNs, we can replace TNDArrayFloat64 with Annotated[TNDArrayFloat64, sf.Require.Apply(lambda a: ~a.insna().any())].
Implementing a new function, we can require that all input and output arrays have the shape (24,). Calling this function with the previously created arrays raises an error:
from typing import Annotated
@sf.CallGuard.check
def process4(
v: Annotated[TNDArrayIntAny, sf.Require.Shape(24)],
q: Annotated[TNDArrayBool, sf.Require.Shape(24)],
) -> Annotated[TNDArrayFloat64, sf.Require.Shape(24)]:
s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
return v * s
x = process4(v1, q) # types pass, but Require.Shape fails
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Annotated[ndarray[Any, dtype[int8]], Shape((24,))], q: Annotated[ndarray[Any, dtype[bool_]], Shape((24,))]) -> Annotated[ndarray[Any, dtype[float64]], Shape((24,))]
# └── Annotated[ndarray[Any, dtype[int8]], Shape((24,))]
# └── Shape((24,))
# └── Expected shape ((24,)), provided shape (20,)
Just like a dictionary, a DataFrame is a complex data structure composed of many component types: the index labels, column labels, and the column values are all distinct types.
A challenge of generically specifying a DataFrame is that a DataFrame has a variable number of columns, where each column might be a different type. The Python TypeVarTuple variadic generic specifier (see PEP 646), first released in Python 3.11, permits defining a variable number of column type variables.
With StaticFrame 2.0, Frame, Series, Index and related containers become generic. Support for variable column type definitions is provided by TypeVarTuple, back-ported with the implementation in typing-extensions for compatibility down to Python 3.9.
A generic Frame requires two or more type variables: the type of the index, the type of the columns, and zero or more specifications of columnar value types specified with NumPy types. A generic Series requires two type variables: the type of the index and a NumPy type for the values. The Index is itself generic, also requiring a NumPy type as a type variable.
With generic specification, a Series of floats, indexed by dates, can be annotated with sf.Series[sf.IndexDate, np.float64]. A Frame with dates as index labels, strings as column labels, and column values of integers and floats can be annotated with sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64].
Given a complex Frame, deriving the annotation might be difficult. StaticFrame offers the via_type_clinic interface to provide a complete generic specification for any component at runtime:
>>> v4 = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
>>> v4
<Frame>
<Index> a b <<U1>
<IndexDate>
2021-12-30 0 1.5
2021-12-31 1 2.0
2022-01-01 2 2.5
2022-01-02 3 3.0
2022-01-03 4 3.5
<datetime64[D]> <int64> <float64>
# get a string representation of the annotation
>>> v4.via_type_clinic
Frame[IndexDate, Index[str_], int64, float64]
As shown with arrays, storing annotations as type aliases permits reuse and more concise function signatures. Below, a new function is defined with generic Frame and Series arguments fully annotated. A cast is required as not all operations can statically resolve their return type.
TFrameDateInts = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64]
TSeriesYMBool = sf.Series[sf.IndexYearMonth, np.bool_]
TSeriesDFloat = sf.Series[sf.IndexDate, np.float64]
def process5(v: TFrameDateInts, q: TSeriesYMBool) -> TSeriesDFloat:
t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')]) # type: ignore
s = np.where(t, 0.5, 0.25)
return cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))
These more complex annotated interfaces can also be validated with mypy. Below, a Frame without the expected column value types is passed, causing mypy to error (shown as comments, below).
TFrameDateIntFloat = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64]
v5: TFrameDateIntFloat = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
q: TSeriesYMBool = sf.Series([True, False],
index=sf.IndexYearMonth.from_date_range('2021-12', '2022-01'))
x = process5(v5, q)
# tp.py: error: Argument 1 to "process5" has incompatible type
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], floating[_64Bit]]"; expected
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]" [arg-type]
To use the same type hints for runtime validation, the sf.CallGuard.check decorator can be applied. Below, a Frame of three integer columns is provided where a Frame of two columns is expected.
# a Frame of three columns of integers
TFrameDateIntIntInt = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64, np.int64]
v6: TFrameDateIntIntInt = sf.Frame.from_fields([range(5), range(3, 8), range(1, 6)],
columns=('a', 'b', 'c'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
x = process5(v6, q)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]],
# q: Series[IndexYearMonth, bool_]) -> Series[IndexDate, float64]
# └── Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]
# └── Expected Frame has 2 dtype, provided Frame has 3 dtype
It might not be practical to annotate every column of every Frame: it is common for interfaces to work with Frame of variable column sizes. TypeVarTuple supports this through the usage of *tuple[] expressions (introduced in Python 3.11, back-ported with the Unpack annotation). For example, the function above could be defined to take any number of integer columns with that annotation Frame[IndexDate, Index[np.str_], *tuple[np.int64, …]], where *tuple[np.int64, …]] means zero or more integer columns.
The same implementation can be annotated with a far more general specification of columnar types. Below, the column values are annotated with np.number[Any] (permitting any type of numeric NumPy type) and a *tuple[] expression (permitting any number of columns): *tuple[np.number[Any], …]. Now neither mypy nor CallGuard errors with either previously created Frame.
TFrameDateNums = sf.Frame[sf.IndexDate, sf.Index[np.str_], *tuple[np.number[Any], ...]]
@sf.CallGuard.check
def process6(v: TFrameDateNums, q: TSeriesYMBool) -> TSeriesDFloat:
t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')]) # type: ignore
s = np.where(t, 0.5, 0.25)
return tp.cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))
x = process6(v5, q) # a Frame with integer, float columns passes
x = process6(v6, q) # a Frame with three integer columns passes
As with NumPy arrays, Frame annotations can wrap Require specifications in Annotated generics, permitting the definition of additional run-time validations.
While StaticFrame might be the first DataFrame library to offer complete generic specification and a unified solution for both static type analysis and run-time type validation, other array and DataFrame libraries offer related utilities.
Neither the Tensor class in PyTorch (2.4.0), nor the Tensor class in TensorFlow (2.17.0) support generic type or shape specification. While both libraries offer a TensorSpec object that can be used to perform run-time type and shape validation, static type checking with tools like mypy is not supported.
As of Pandas 2.2.2, neither the Pandas Series nor DataFrame support generic type specifications. A number of third-party packages have offered partial solutions. The pandas-stubs library, for example, provides type annotations for the Pandas API, but does not make the Series or DataFrame classes generic. The Pandera library permits defining DataFrameSchema classes that can be used for run-time validation of Pandas DataFrames. For static-analysis with mypy, Pandera offers alternative DataFrame and Series subclasses that permit generic specification with the same DataFrameSchema classes. This approach does not permit the expressive opportunities of using generic NumPy types or the unpack operator for supplying variadic generic expressions.
Python type annotations can make static analysis of types a valuable check of code quality, discovering errors before code is even executed. Up until recently, an interface might take an array or a DataFrame, but no specification of the types contained in those containers was possible. Now, complete specification of component types is possible in NumPy and StaticFrame, permitting more powerful static analysis of types.
Providing correct type annotations is an investment. Reusing those annotations for runtime checks provides the best of both worlds. StaticFrame’s CallGuard runtime type checker is specialized to correctly evaluate fully specified generic NumPy types, as well as all generic StaticFrame containers.
Improving Code Quality with Array and DataFrame Type Hints was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Improving Code Quality with Array and DataFrame Type Hints
Go Here to Read this Fast! Improving Code Quality with Array and DataFrame Type Hints


In this article I describe a distance metric called Shared Nearest Neighbors (SNN) and describe its application to outlier detection. I’ll also cover quickly its application to prediction and clustering, but will focus on outlier detection, and specifically on SNN’s application to the k Nearest Neighbors outlier detection algorithm (though I will also cover SNN’s application to outlier detection more generally).
This article continues a series on outlier detection, including articles on Frequent Patterns Outlier Factor, Counts Outlier Detector, Doping, and Distance Metric Learning. It also includes another excerpt from my book Outlier Detection in Python.
In data science, when working with tabular data, it’s a very common task to measure the distances between rows. This is done, for example, in some predictive models such as KNN: when predicting the target value of an instance using KNN, we first identify the most similar records from the training data (which requires having a way to measure the similarity between rows). We then look at the target values of these similar rows, with the idea that the test record is most likely to have the same target value as the majority of the most similar records (for classification), or the average target value of the most similar records (for regression).
A few other predictive models use distance metrics as well, for example Radius-based methods such as RadiusNeighborsClassifier. But, where distance metrics are used by far the most often is with clustering. In fact, distance calculations are virtually universal in clustering: to my knowledge, all clustering algorithms rely in some way on calculating the distances between pairs of records.
And distance calculations are used by many outlier detection algorithms, including many of the most popular (such as kth Nearest Neighbors, Local Outlier Factor (LOF), Radius, Local Outlier Probabilities (LoOP), and numerous others). This is not true of all outlier detection algorithms: many identify outliers in quite different ways (for example Isolation Forest, Frequent Patterns Outlier Factor, Counts Outlier Detector, ECOD, HBOS), but many detectors do utilize distance calculations between rows in one way or another.
Clustering and outlier detection algorithms (that work with distances) typically start with calculating the pairwise distances, the distances between every pair of rows in the data. At least this is true in principle: to execute more efficiently, distance calculations between some pairs of rows may be skipped or approximated, but theoretically, we very often start by calculating an n x n matrix of distances between rows, where n is the number of rows in the data.
This, then, requires having a way to measure the distances between any two records. But, as covered in a related article on Distance Metric Learning (DML), it can be difficult to determine a good means to identify how similar, or dissimilar, two rows are.
The most common method, at least with numeric data, is the Euclidean distance. This can work quite well, and has strong intuitive appeal, particularly when viewing the data geometrically: that is, as points in space, as may be seen in a scatter plot such as is shown below. In two dimensional plots, where each record in the data is represented as a dot, it’s natural to view the similarity of records in terms of their Euclidean distances.

However, real world tabular data often has very many features and one of the key difficulties when dealing with this is the curse of dimensionality. This manifests in a number of ways, but one of the most problematic is that, with enough dimensions, the distances between records start to become meaningless.
In the plots shown here, we have a point (shown in red) that is unusual in dimension 0 (shown on the x-axis of the left pane), but normal in dimensions 1, 2, and 3. Assuming this dataset has only these four dimensions, calculating the Euclidean distances between each pair of records, we’d see the red point as having an unusually large distance from all other points. And so, it could reliably be flagged as an outlier.
However, if there were hundreds of dimensions, and the red point is fairly typical in all dimensions besides dimension 0, it could not reliably be flagged as an outlier: the large distance to the other points in dimension 0 would be averaged in with the distances in all other dimensions and would eventually become irrelevant.
This is a huge issue for predictive, clustering, and outlier detection methods that rely on distance metrics.
SNN is used at times to mitigate this effect. However, I’ll show in experiments below, where SNN is most effective (at least with the kth Nearest Neighbors outlier detector I use below) is not necessarily where there are many dimensions (though this is quite relevant too), but where the density of the data varies from one region to another. I’ll explain below what this means and how it affects some outlier detectors.
SNN is used to define a distance between any two records, the same as Euclidean, Manhattan, Canberra, cosine, and any number of other distance metrics. As the name implies, the specific distances calculated have to do with the number of shared neighbors any two records have.
In this way, SNN is quite different from other distance metrics, though it is still more similar to Euclidean and other standard metrics than is Distance Metric Learning. DML seeks to find logical distances between records, unrelated to the specific magnitudes of the values in the rows.
SNN, on the other hand, actually starts by calculating the raw distances between rows using a standard distance metric. If Euclidean distances are used for this first step, the SNN distances are related to the Euclidean distances; if cosine distances are used to calculate the raw distance, the SNN distances are related to cosine distances; and so on.
However, before we get into the details, or show how this may be applied to outlier detection, we’ll take a quick look at SNN for clustering, as it’s actually with clustering research that SNN was first developed. The general process described there is what is used to calculate SNN distances in other contexts as well, including outlier detection.
The terminology can be slightly confusing, but there’s also a clustering method often referred to as SNN, which uses SNN distances and works very similarly to DBSCAN clustering. In fact, it can be considered an enhancement to DBSCAN.
The main paper describing this can be viewed here: https://www-users.cse.umn.edu/~kumar001/papers/siam_hd_snn_cluster.pdf. Though, the idea of enhancing DBSCAN to use SNN goes back to a paper written by Jarvis-Patrick in 1973. The paper linked here uses a similar, but improved approach.
DBSCAN is a strong clustering algorithm, still widely used. It’s able to handle well clusters of different sizes and shapes (even quite arbitrary shapes). It can, though, struggle where clusters have different densities (it effectively assumes all clusters have similar densities). Most clustering algorithms have some such limitations. K-means clustering, for example, effectively assumes all clusters are similar sizes, and Gaussian Mixture Models clustering, that all clusters have roughly Gaussian shapes.
I won’t describe the full DBSCAN algorithm here, but as a very quick sketch: it works by identifying what it calls core points, which are points in dense regions, that can safely be considered inliers. It then identifies the points that are close to these, creating clusters around each of the core points. It runs over a series of steps, each time expanding and merging the clusters discovered so far (merging clusters where they overlap). Points that are close to existing clusters (even if they are not close to the original core points, just to points that have been added to a cluster) are added to that cluster. Eventually every point is either in a single cluster, or is left unassigned to any cluster.
As with outlier detection, clustering can also struggle with high dimensional datasets, again, due to the curse of dimensionality, and particularly the break-down in standard distance metrics. At each step, DBSCAN works based on the distances between the points that are not yet in clusters and those in clusters, and where these distance calculations are unreliable, the clustering is, in turn, unreliable. With high dimensions, core points can be indistinguishable from any other points, even the noise points that really aren’t part of any cluster.
As indicated, DBSCAN also struggles where different regions of the data have different densities. The issue is that DBSCAN uses a global sense of what points are close to each other, but different regions can quite reasonably have different densities.
Take, for example, where the data represents financial transactions. This may include sales, expense, payroll, and other types of transactions, each with different densities. The transactions may be created at different rates in time, may have different dollar values, different counts, and different ranges of numeric values. For example, it may be that there are many more sales transactions than expense transactions. And the ranges in dollar values may be quite different: perhaps the largest sales are only about 10x the size of the smallest sales, but the largest expenses 1000x as large as the smallest. So, there can be quite different densities in the sales transactions compared to expenses.
Assuming different types of transactions are located in different regions of the space (if, again, viewing the data as points in high-dimensional space, with each dimension representing a feature from the data table, and each record as a point), we may have a plot such as is shown below, with sales transactions in the lower-left and expenses in the upper-right.
Many clustering algorithms (and many predictive and outlier detection algorithms) could fail to handle this data well given these differences in density. DBSCAN may leave all points in the upper-right unclustered if it goes by the overall average of distances between points (which may be dominated by the distances between sales transactions if there are many more sales transactions in the data).

The goal of SNN is to create a more reliable distance metric, given high dimensionality and varying density.
The central idea of SNN is: if point p1 is close to p2 using a standard distance metric, we can say that likely they’re actually close, but this can be unreliable. However, if p1 and p2 also have many of the same nearest neighbors, we can be significantly more confident they are truly close. Their shared neighbors can be said to confirm the similarity.
Using shared neighbors, in the graph above, points in the upper-right would be correctly recognized as being in a cluster, as they typically share many of the same nearest neighbors with each other.
Jarvis-Patrick explained this in terms of a graph, which is a useful way to look at the data. We can view each record as a point in space (as in the scatter plot above), with an edge between each pair indicating how similar they are. For this, we can simply calculate the Euclidean distances (or another such metric) between each pair of records.
As graphs are often represented as adjacency matrices (n x n matrices, where n is the number of rows, giving the distances between each pair of rows), we can view the process in terms of an adjacency matrix as well.
Considering the scatter plot above, we may have an n x n matrix such as:
Point 1 Point 2 Point 3 ... Point n
Point 1 0.0 3.3 2.9 ... 1.9
Point 2 3.3 0.0 1.8 ... 4.0
Point 3 2.9 1.8 0.0 ... 2.7
... ... ... ... ... ...
Point n 1.9 4.0 2.7 ... 0.0
The matrix is symmetric across the main diagonal (the distance from Point 1 to Point 2 is the same as from Point 2 to Point 1) and the distances of points to themselves is 0.0 (so the main diagonal is entirely zeros).
The SNN algorithm is a two-step process, and starts by calculating these raw pair-wise distances (generally using Euclidean distances). It then creates a second matrix, with the shared nearest neighbors distances.
To calculate this, it first uses a process called sparcification. For this, each pair of records, p and q, get a link (will have a non-zero distance) only if p and q are each in each other’s k nearest neighbors lists. This is straightforward to determine: for p, we have the distances to all other points. For some k (specified as a parameter, but lets assume a value of 10), we find the 10 points that are closest to p. This may or may not include q. Similarly for q: we find it’s k nearest neighbors and see if p is one of them.
We now have a matrix like above, but with many cells now containing zeros.
We then consider the shared nearest neighbors. For the specified k, p has a set of k nearest neighbors (we’ll call this set S1), and q also has a set of k nearest neighbors (we’ll call this set S2). We can then determine how similar p and q are (in the SNN sense) based on the size of the overlap in S1 and S2.
In a more complicated form, we can also consider the order of the neighbors in S1 and S2. If p and q not only have roughly the same set of nearest neighbors (for example, they are both close to p243, p873, p3321, and p773), we can be confident that p and q are close. But if, further, they are both closest to p243, then to p873, then to p3321, and then to p773 (or at least have a reasonably similar order of closeness), we can be even more confident p and q are similar. For this article, though, we will simply count the number of shared nearest neighbors p and q have (within the set of k nearest neighbors that each has).
The idea is: we do require a standard distance metric to start, but once this is created, we use the rank order of the distances between points, not the actual magnitudes, and this tends to be more stable.
For SNN clustering, we first calculate the SNN distances in this way, then proceed with the standard DBSCAN algorithm, identifying the core points, finding other points close enough to be in the same cluster, and growing and merging the clusters.
There are at least two implementations of SNN clustering available on github: https://github.com/albert-espin/snn-clustering and https://github.com/felipeangelimvieira/SharedNearestNeighbors.
Despite its origins with clustering (and its continued importance with clustering), SNN as a distance metric is, as indicated above, relevant to other areas of machine learning, including outlier detection, which we’ll return to now.
Before describing the Python implementation of the SNN distance metric, I’ll quickly present a simple implementation of a KNN outlier detector:
import pandas as pd
from sklearn.neighbors import BallTree
import statistics
class KNN:
def __init__(self, metric='euclidian'):
self.metric = metric
def fit_predict(self, data, k):
data = pd.DataFrame(data)
balltree = BallTree(data, metric=self.metric)
# Get the distances to the k nearest neighbors for each record
knn = balltree.query(data, k=k)[0]
# Get the mean distance to the k nearest neighbors for each record
scores = [statistics.mean(x[:k]) for x in knn]
return scores
Given a 2d table of data and a specified k, the fit_predict() method will provide an outlier score for each record. This score is the average distance to the k nearest neighbors. A variation on this, where the maximum distance (as opposed to the mean distance) to the k nearest neighbors is used, is sometimes called kth Nearest Neighbors, while this variation is often called k Nearest Neighbors, though the terminology varies.
The bulk of the work here is actually done by scikit-learn’s BallTree class, which calculates and stores the pairwise distances for the passed dataframe. Its query() method returns, for each element passed in the data parameter, two things:
For this detector, we need only the distances, so take element [0] of the returned structure.
fit_predict() then returns the average distance to the k closest neighbors for each record in the data frame, which is an estimation of their outlierness: the more distant a record is from its closes neighbors, the more of an outlier it can be assumed to be (though, as indicated, this works poorly where different regions have different densities, which is to say, different average distances to their neighbors).
This would not be a production-ready implementation, but does provide the basic idea. A full implementation of KNN outlier detection is provided in PyOD.
Using SNN distance metrics, an implementation of a simple outlier detector is:
class SNN:
def __init__(self, metric='euclidian'):
self.metric = metric
def get_pairwise_distances(self, data, k):
data = pd.DataFrame(data)
balltree = BallTree(data, metric=self.metric)
knn = balltree.query(data, k=k+1)[1]
pairwise_distances = np.zeros((len(data), len(data)))
for i in range(len(data)):
for j in range(i+1, len(data)):
if (j in knn[i]) and (i in knn[j]):
weight = len(set(knn[i]).intersection(set(knn[j])))
pairwise_distances[i][j] = weight
pairwise_distances[j][i] = weight
return pairwise_distances
def fit_predict(self, data, k):
data = pd.DataFrame(data)
pairwise_distances = self.get_pairwise_distances(data, k)
scores = [statistics.mean(sorted(x, reverse=True)[:k]) for x in pairwise_distances]
min_score = min(scores)
max_score = max(scores)
scores = [min_score + (max_score - x) for x in scores]
return scores
The SNN detector here can actually also be considered a KNN outlier detector, simply using SNN distances. But, for simplicity, we’ll refer to the two outliers as KNN and SNN, and assume the KNN detector uses a standard distance metric such as Manhattan or Euclidean, while the SNN detector uses an SNN distance metric.
As with the KNN detector, the SNN detector returns a score for each record passed to fit_predict(), here the average SNN distance to the k nearest neighbors, as opposed to the average distance using a standard distance metric.
This class also provides the get_pairwise_distances() method, which is used by fit_predict(), but can be called directly where calculating the pairwise SNN distances is useful (we see an example of this later, using DBSCAN for outlier detection).
In get_pairwise_distances(), we take element [1] of the results returned by BallTree’s query() method, as it’s the nearest neighbors we’re interested in, not their specific distances.
As indicated, we set all distances to zero unless the two records are within the closest k of each other. We then calculate the specific SNN distances as the number of shared neighbors within the sets of k nearest neighbors for each pair of points.
It would be possible to use a measure such as Jaccard or Dice to quantify the overlap in the nearest neighbors of each pair of points, but given that both are of the same size, k, we can simply count the size of the overlap for each pair.
In the other provided method, fit_predict(), we first get the pairwise distances. These are actually a measure of normality, not outlierness, so these are reversed before returning the scores.
The final score is then the average overlap with the k nearest points for each record.
So, k is actually being used for two different purposes here: it’s used to identify the k nearest neighbors in the first step (where we calculate the KNN distances, using Euclidean or other such metric) and again in the second step (where we calculate the SNN distances, using the average overlap). It’s possible to use two different parameters for these, and some implementations do, sometimes referring to the second as eps (this comes from the history with DBSCAN where eps is used to define the maximum distance between two points for one to be considered in the same neighborhood as the other).
Again, this is not necessarily production-ready, and is far from optimized. There are techniques to improve the speed, and this is an active area of research, particularly for the first step, calculating the raw pairwise distances. Where you have very large volumes of data, it may be necessary to look at alternatives to BallTree, such as faiss, or otherwise speed up the processing. But, for moderately sized datasets, code such as here will generally be sufficient.
I’ve tested the above KNN and SNN outlier detectors in a number of ways, both with synthetic and real data. I’ve also used SNN distances in a number of outlier detection projects over the years.
On the whole, I’ve actually not found SNN to necessarily work preferably to KNN with respect to high dimensions, though SNN is preferable at times.
Where I have, however, seen SNN to provide a clear benefit over standard KNN is where the data has varying densities.
To be more precise, it’s the combination of high dimensionality and varying densities where SNN tends to most strongly outperform other distance metrics with KNN-type detectors, more so than if there are just high dimensions, or just varying densities.
This can be seen with the following test code. This uses (fairly) straightforward synthetic data to present this more clearly.
def test_variable_blobs(nrows=1000, ncols=500, nclusters=60, outlier_multiplier=2.0, k=30, metric='manhattan'):
np.random.seed(1)
# ########################################################
# Create the test data
# Determine the size of each cluster
n_samples_arr = []
remaining_count = nrows
for i in range(nclusters-1):
cluster_size = np.random.randint(1, remaining_count // (nclusters - i))
n_samples_arr.append(cluster_size)
remaining_count -= cluster_size
n_samples_arr.append(remaining_count)
# Determine the density of each cluster
cluster_std_arr = []
for i in range(nclusters):
cluster_std_arr.append(np.random.uniform(low=0.1, high=2.0))
# Determine the center location of each cluster
cluster_centers_arr = []
for i in range(nclusters):
cluster_centers_arr.append(np.random.uniform(low=0.0, high=10.0, size=ncols))
# Create the sample data using the specified cluster sizes, densities, and locations
x, y = make_blobs(n_samples=n_samples_arr,
cluster_std=cluster_std_arr,
centers=cluster_centers_arr,
n_features=ncols,
random_state=0)
df = pd.DataFrame(x)
# Add a single known outlier to the data
avg_row = [x[:, i].mean() for i in range(ncols)]
outlier_row = avg_row.copy()
outlier_row[0] = x[:, 0].max() * outlier_multiplier
df = pd.concat([df, pd.DataFrame([outlier_row])])
df = df.reset_index()
# ########################################################
# Compare standard distance metrics to SNN
# Calculate the outlier scores using standard KNN
scored_df = df.copy()
knn = KNN(metric=metric)
scored_df['knn_scores'] = knn.fit_predict(df, k=k)
# Calculate the outlier scores using SNN
snn = SNN(metric=metric)
scored_df['snn_scores'] = snn.fit_predict(df, k=k)
# Plot the distribution of scores for both detectors and show
# the score for the known outlier (in context of the range of
# scores assigned to the full dataset)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
sns.histplot(scored_df['knn_scores'], ax=ax[0])
ax[0].axvline(scored_df.loc[nrows, 'knn_scores'], color='red')
sns.histplot(scored_df['snn_scores'], ax=ax[1])
ax[1].axvline(scored_df.loc[nrows, 'snn_scores'], color='red')
plt.suptitle(f"Number of columns: {ncols}")
plt.tight_layout()
plt.show()
In this method, we generate test data, add a single, known outlier to the dataset, get the KNN outlier scores, get the SNN outlier scores, and plot the results.
The test data is generated using scikit-learn’s make_blobs(), which creates a set of high-dimensional clusters. The one outlier generated will be outside of these clusters (and will also have, by default, one extreme value in column 0).
Much of the complication in the code is in generating the test data. Here, instead of simply calling make_blobs() with default parameters, we specify the sizes and densities of each cluster, to ensure they are all different. The densities are specified using an array of standard deviations (which describes how spread out each cluster is).
This produces data such as:

This shows only four dimensions, but typically we would call this method to create data with many dimensions. The known outlier point is shown in red. In dimension 0 it has an extreme value, and in most other dimensions it tends to fall outside the clusters, so is a strong outlier.
Testing can be done, with:
test_variable_blobs(nrows=1000, ncols=20, nclusters=1, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=100, nclusters=5, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=250, nclusters=10, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=400, nclusters=15, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=450, nclusters=20, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=500, nclusters=20, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=750, nclusters=20, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=1000, nclusters=20, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=2000, nclusters=20, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=3000, nclusters=20, k=30, metric='euclidean')
test_variable_blobs(nrows=1000, ncols=20, nclusters=1, k=30)
test_variable_blobs(nrows=1000, ncols=100, nclusters=5, k=30)
test_variable_blobs(nrows=1000, ncols=250, nclusters=10, k=30)
test_variable_blobs(nrows=1000, ncols=400, nclusters=15, k=30)
test_variable_blobs(nrows=1000, ncols=450, nclusters=20, k=30)
test_variable_blobs(nrows=1000, ncols=500, nclusters=20, k=30)
test_variable_blobs(nrows=1000, ncols=750, nclusters=20, k=30)
test_variable_blobs(nrows=1000, ncols=1000, nclusters=20, k=30)
test_variable_blobs(nrows=1000, ncols=2000, nclusters=20, k=30)
test_variable_blobs(nrows=1000, ncols=3000, nclusters=20, k=30)
This first executes a series of tests using Euclidean distances (used by both the KNN detector, and for the first step of the SNN detector), and then executes a series of tests using Manhattan distances (the default for the test_variable_blobs() method) —using Manhattan for both for the KNN detector and for the first step with the SNN detector.
For each, we test with increasing numbers of columns (ranging from 20 to 3000).
Starting with Euclidian distances, using only 20 features, both KNN and SNN work well, in that they both assign a high outlier score to the known outlier. Here we see the distribution of outlier scores produced by each detector (the KNN detector is shown in the left pane and the SNN detector in the right pane) and a red vertical line indicating the outlier score given to the known outlier by each detector. In both cases, the known outlier received a significantly higher score than the other records: both detectors do well.

But, using Euclidean distances tends to degrade quickly as features are added, and works quite poorly even with only 100 features. This is true with both the KNN and SNN detectors. In both cases, the known outlier received a fairly normal score, not indicating any outlierness, as seen here:

Repeating using Manhattan distances, we see that KNN works well with smaller numbers of features, but breaks down as the numbers of features increases. KNN does, however, do much better with Manhattan distances that Euclidean once we get much beyond about 50 or so features (with small numbers of features, almost any distance metric will work reasonably well).
In all cases below (using Manhattan & SNN distances), we show the distribution of KNN outlier scores (and the outlier score assigned to the known outlier by the KNN detector) in the left pane, and the distribution of SNN scores (and the outlier score given to the known outlier by the SNN detector) in the right pane.
With 20 features, both work well:

With 100 features, KNN is still giving the known outlier a high score, but not very high. SNN is still doing very well (and does in all cases below as well):

With 250 features the score given to the known outlier by KNN is fairly poor and the distribution of scores is odd:

With 500 features:

With 1000 features:

With 2000 features:

With 3000 features:

With the KNN detector, even using Manhattan distances, we can see that the distribution of scores is quite odd by 100 features and, more relevantly, that by 100 features the KNN score given to the known outlier is poor: much too low and not reflecting its outlierness.
The distribution of SNN scores, on the other hand, remains reasonable even up to 3000 features, and the SNN score given to the known outlier remains very high up until almost 2000 features (for 2000 and 3000 features, it’s score is high, but not quite the highest-scored record).
The SNN detector (essentially the KNN outlier detection algorithm with SNN distances) worked much more reliably than KNN with Manhattan distances.
One key point here (outside of considering SNN distances) is that Manhattan distances can be much more reliable for outlier detection than Euclidean where we have large numbers of features. The curse of dimensionality still takes affect (all distance metrics eventually break down), but much less severely where there are dozens or hundreds of features than with Euclidean.
In fact, while very suitable in lower dimensions, Euclidean distances can break down even with moderate numbers of features (sometimes with as few as 30 or 40). Manhattan distances can be a fairer comparison in these cases, which is what is done here.
In general, we should be mindful of evaluations of distance metrics that compare themselves to Euclidean distances, as these can be misleading. It’s standard to assume Euclidean distances when working with distance calculations, but this is something we should question.
In the case identified here (where data is simply clustered, but in clusters with varying sizes and densities), SNN did strongly outperform KNN (and, impressively, remained reliable even to close to 2000 features). This is a more meaningful finding given that we compared to KNN based on Manhattan distances, not Euclidean.
However, in many other scenarios, particularly where the data is in a single cluster, or where the clusters have similar densities to each other, KNN can work as well as, or even preferably to, SNN.
It’s not the case that SNN should always be favoured to other distance metrics, only that there are scenarios where it can do significantly better.
In other cases, other distance metrics may work preferably as well, including cosine distances, Canberra, Mahalanobis, Chebyshev, and so on. It is very often worth experimenting with these when performing outlier detection.
Where KNN breaks down here is, much like the case when using DBSCAN for clustering, where different regions (in this case, different clusters) have different densities.
KNN is an example of a type of detector known as a global outlier detector. If you’re familiar with the idea of local and global outliers, the idea is related, but different. In this case, the ‘global’ in global outlier detector means that there is a global sense of normal. This is the same limitation described above with DBSCAN clustering (where there is a global sense of normal distances between records). Every record in the data is compared to this assessment of normal. In the case of KNN outlier detectors, there is a global sense of the normal average distance to the k nearest neighbors.
But, this global norm is not meaningful where the data has different densities in different regions. In the plot below (repeated from above), there are two clusters, with the one in the lower-left being much more dense that the one in the upper-right.
What’s relevant, in terms of identifying outliers, is how close a point is to its neighbors relative to what’s normal for that region, not relative to what’s normal in the other clusters (or in the dataset as a whole).

This is the problem another important outlier detector, Local Outlier Factor (LOF) was created to solve (the original LOF paper actually describes a situation very much like this). Contrary to global outlier detectors, LOF is an example of a local outlier detector: a detector that compares points to other points in the local area, not to the full dataset, so compares each point to a local sense of what’s normal. In the case of LOF, it compares each point to a local sense of the average distance to the nearby points.
Local outlier detectors also provide a valuable approach to identifying outliers where the densities vary throughout the data space, which I cover in Outlier Detection in Python, and I’ll try to cover in future articles.
SNN also provides an important solution to this problem of varying densities. With SNN distances, the changes in density aren’t relevant. Each record here is compared against a global standard of the average number of shared neighbors a record has with its closest neighbors. This is a quite robust calculation, and able to work well where the data is clustered, or just populated more densely in some areas than others.
In this article, we’ve looked primarily at the KNN algorithm for outlier detection, but SNN can be used with any outlier detector that is based on the distances between rows. This includes Radius, Local Outlier Factor (LOF), and numerous others. It also includes any outlier detection algorithm based on clustering.
There are a number of ways to identify outliers using clustering (for example, identifying the points in very small clusters, points that are far from their cluster centers, and so on). Here, though, we’ll look at a very simple approach to outlier detection: clustering the data and then identifying the points not placed in any cluster.
DBSCAN is one of the clustering algorithms most commonly used for this type of outlier detection, as it has the convenient property (not shared by all clustering algorithms) of allowing points to not be placed in any cluster.
DBSCAN (at least scikit-learn’s implementation) also allows us to easily work with SNN distances.
So, as well as being a useful clustering algorithm, DBSCAN is widely used for outlier detection, and we’ll use it here as another example of outlier detection with SNN distances.
Before looking at using SNN distances, though, we’ll show an example using DBSCAN as it’s more often used to identify outliers in data (here using the default Euclidean distances). This uses the same dataset created above, where the last row is the single known outlier.
clustering = DBSCAN(eps=20, min_samples=2).fit(df.values)
print(clustering.labels_)
print(pd.Series(clustering.labels_).value_counts())
The parameters for DBSCAN can take some experimentation to set well. In this case, I adjusted them until the algorithm identified a single outlier, which I confirmed is the last row by printing the labels_ attribute. The labels are:
[ 0 1 1 ... 1 0 -1]
-1 indicates records not assigned to any cluster. As well, value_counts() indicated there’s only one record assigned to cluster -1. So, DBSCAN works well in this example. Which means we can’t improve on it by using SNN, but this does provide a clear example of using DBSCAN for outlier detection, and ensures the dataset is solvable using clustering-based outlier detection.
To work with SNN distances, it’s necessary to first calculate the pairwise SNN distances (DBSCAN cannot calculate these on its own). Once these are created, they can be passed to DBSCAN in the form of an n x n matrix.
Here we calculate the SNN pairwise distances:
snn = SNN(metric='manhattan')
pairwise_dists = snn.get_pairwise_distances(df, k=100)
print(pairwise_dists)
The pairwise distances look like:
array([[ 0., 0., 0., ..., 0., 57., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[57., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
As a quick and simple way to reverse these distances (to be better suited for DBSCAN), we call:
d = pd.DataFrame(pairwise_dists).apply(lambda x: 1000-x)
Here 1000 is simply a value larger than any in the actual data. Then we call DBSCAN, using ‘precomputed’ as the metric and passing the pairwise distances to fit().
clustering = DBSCAN(eps=975, min_samples=2, metric='precomputed').fit(d.values)
print(clustering.labels_)
display(pd.Series(clustering.labels_).value_counts())
Again, this identifies only the single outlier (only one record is given the cluster id -1, and this is the last row). In general, DBSCAN, and other tools that accept ‘precomputed’ as the metric can work with SNN distances, and potentially produce more robust results.
In the case of DBSCAN, using SNN distances can work well, as outliers (referred to as noise points in DBSCAN) and inliers tend to have almost all of their links broken, and so outliers end up in no clusters. Some outliers (though outliers that are less extreme) will have some links to other records, but will tend to have zero, or very few, shared neighbors with these, so will get high outlier scores (though not as high as those with no links, as is appropriate).
This can take some experimenting, and in some cases the value of k, as well as the DBSCAN parameters, will need to be adjusted, though not to an extent unusual in outlier detection — it’s common for some tuning to be necessary.
SNN is not as widely used in outlier detection as it ideally would be, but there is one well-known detector that uses it: SOD, which is provided in the PyOD library.
SOD is an outlier detector that focusses on finding useful subspaces (subsets of the features available) for outlier detection, but does use SNN as part of the process, which, it argues in the paper introducing SOD, provides more reliable distance calculations.
SOD works (similar to KNN and LOF), by identifying a neighborhood of k neighbors for each point, known with SOD as the reference set. The reference set is found using SNN. So, neighborhoods are identified, not by using the points with the smallest Euclidean distances, but by the points with the most shared neighbors.
The authors found this tends to be robust not only in high dimensions, but also where there are many irrelevant features: the rank order of neighbors tends to remain meaningful, and so the set of nearest neighbors can be reliably found even where specific distances are not reliable.
Once we have the reference set for a point, SOD uses this to determine the subspace, which is the set of features that explain the greatest amount of variance for the reference set. And, once SOD identifies these subspaces, it examines the distances of each point to the data center, which then provides an outlier score.
An obvious application of SNN is to embeddings (for example, vector representations of images, video, audio, text, network, or data of other modalities), which tend to have very high dimensionality. We look at this in more depth in Outlier Detection in Python, but will indicate here quickly: standard outlier detection methods intended for numeric tabular data (Isolation Forest, Local Outlier Factor, kth Nearest Neighbors, and so on), actually tend to perform poorly on embeddings. The main reason appear to be the high numbers of dimensions, along with the presence of many dimensions in the embeddings that are irrelevant for outlier detection.
There are other, well-established techniques for outlier detection with embeddings, for example methods based on auto-encoders, variational auto-encoders, generative adversarial networks, and a number of other techniques. As well, it’s possible to apply dimensionality reduction to embeddings for more effective outlier detection. These are also covered in the book and, I hope, a future Medium article. As well, I’m now investigating the use of distance metrics other than Euclidean, cosine, and other standard metrics, including SNN. If these can be useful is currently under investigation.
Similar to Distance Metric Learning, Shared Nearest Neighbors will be more expensive to calculate than standard distance metrics such as Manhattan and Euclidean distances, but can be more robust with large numbers of features, varying densities, and (as the SOD authors found), irrelevant features.
So, in some situations, SNN can be a preferable distance metric to more standard distance metrics and may be a more appropriate distance metric for use with outlier detection. We’ve seen here where it can be used as the distance metric for kth Nearest Neighbors outlier detection and for DBSCAN outlier detection (as well as when simply using DBSCAN for clustering).
In fact, SNN can, be used with any outlier detection method based on distances between records. That is, it can be used with any distance-based, density-based, or clustering-based outlier detector.
We’ve also indicated that SNN will not always work favorably compared to other distance metrics. The issue is more complicated when considering categorical, date, and text columns (as well as potentially other types of features we may see in tabular data). But even considering strictly numeric data, it’s quite possible to have datasets, even with large numbers of features, where plain Manhattan distances work preferably to SNN, and other cases where SNN is preferable. The number of rows, number of features, relevance of the features, distributions of the features, associations between features, clustering of the data, and so on are all relevant, and it usually can’t be predicted ahead of time what will work best.
SNN is only one solution to problems such as high dimensionality, varying density, and irrelevant features, but is is a useful tool, easy enough to implement, and quite often worth experimenting with.
This article was just an introduction to SNN and future articles may explore SNN further, but in general, when determining the distance metric used (and other such modeling decisions) with outlier detection, the best approach is to use a technique called doping (described in this article), where we create data similar to the real data, but modified so to contain strong, but realistic, anomalies. Doing this, we can try to estimate what appears to be most effective at detecting the sorts of outliers you may have.
Here we used an example with synthetic data, which can help describe where one outlier detection approach works better than another, and can be very valuable (for example, here we found that when varying the densities and increasing the number of features, SNN outperformed Manhattan distances, but with consistent densities and low numbers of features, both did well). But, using synthetic, as important as it is, is only one step to understanding where different approaches will work better for data similar to the data you have. Doping will tend to work better for this purpose, or at least as part of the process.
As well, it’s generally accepted in outlier detection that no single detector will reliably identify all the outliers you’re interested in detecting. Each detector will detect a fairly specific type of outlier, and very often we’re interested in detecting a wide range of outliers (in fact, quite often we’re interested simply in identifying anything that is statistically substantially different from normal — especially when first examining a dataset).
Given that, it’s common to use multiple detectors for outlier detection, combining their results into an ensemble. One useful technique to increase diversity within an ensemble is to use a variety of distance metrics. For example, if Manhattan, Euclidean, SNN, and possibly even others (perhaps Canberra, cosine, or other metrics) all work well (all producing different, but sensible results), it may be worthwhile to use all of these. Often though, we will find that only one or two distance metrics produce meaningful results given the dataset we have and the types of outliers we are interested in. Although not the only one, SNN is a useful distance metric to try, especially where the detectors are struggling when working with other distance metrics.
All images by author.
Shared Nearest Neighbors: A More Robust Distance Metric was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Shared Nearest Neighbors: A More Robust Distance Metric
Go Here to Read this Fast! Shared Nearest Neighbors: A More Robust Distance Metric


A better and faster option than the ADAM optimizer, from Apple Research
Originally appeared here:
AdEMAMix: A Deep Dive into a New Optimizer for Your Deep Neural Network
Go Here to Read this Fast! AdEMAMix: A Deep Dive into a New Optimizer for Your Deep Neural Network



We’ve witnessed remarkable strides in AI image generation. But what happens when we add the dimension of time? Videos are moving images, after all.
Text-to-video generation is a complex task that requires AI to understand not just what things look like, but how they move and interact over time. It is an order of magnitude more complex than text-to-image.
To produce a coherent video, a neural network must:
1. Comprehend the input prompt
2. Understand how the world works
3. Know how objects move and how physics applies
4. Generate a sequence of frames that make sense spatially, temporally, and logically
Despite these challenges, today’s diffusion neural networks are making impressive progress in this field. In this article, we will cover the main ideas behind video diffusion models — main challenges, approaches, and the seminal papers in the field.
To understand text-to-video generation, we need to start with its predecessor: text-to-image diffusion models. These models have a singular goal — to transform random noise and a text prompt into a coherent image. In general, all generative image models do this — Variational Autoencoders (VAE), Generative Adversarial Neural Nets (GANs), and yes, Diffusion too.

Diffusion, in particular, relies on a gradual denoising process to generate images.
1. Start with a randomly generated noisy image
2. Use a neural network to progressively remove noise
3. Condition the denoising process on text input
4. Repeat until a clear image emerges

During training, we start with real images and progressively add noise to it in small steps — this is called forward diffusion. This generates a lot of samples of clear image and their slightly noisier versions. The neural network is then trained to reverse this process by inputting the noisy image and predicting how much noise to remove to retrieve the clearer version. In text-conditional models, we train attention layers to attend to the inputted prompt for guided denoising.

This iterative approach allows for the generation of highly detailed and diverse images. You can watch the following YouTube video where I explain text to image in much more detail — concepts like Forward and Reverse Diffusion, U-Net, CLIP models, and how I implemented them in Python and Pytorch from scratch.
If you are comfortable with the core concepts of Text-to-Image Conditional Diffusion, let’s move to videos next.
In theory, we could follow the same conditioned noise-removal idea to do text-to-video diffusion. However, adding time into the equation introduces several new challenges:
1. Temporal Consistency: Ensuring objects, backgrounds, and motions remain coherent across frames.
2. Computational Demands: Generating multiple frames per second instead of a single image.
3. Data Scarcity: While large image-text datasets are readily available, high-quality video-text datasets are scarce.

Because of the lack of high quality datasets, text-to-video cannot rely just on supervised training. And that is why people usually also combine two more data sources to train video diffusion models — one — paired image-text data, which is much more readily available, and two — unlabelled video data, which are super-abundant and contains lots of information about how the world works. Several groundbreaking models have emerged to tackle these challenges. Let’s discuss some of the important milestone papers one by one.
We are about to get into the technical nitty gritty! If you find the material ahead difficult, feel free to watch this companion video as a visual side-by-side guide while reading the next section.
VDM Uses a 3D U-Net architecture with factorized spatio-temporal convolution layers. Each term is explained in the picture below.

VDM is jointly trained on both image and video data. VDM replaces the 2D UNets from Image Diffusion models with 3D UNet models. The video is input into the model as a time sequence of 2D frames. The term Factorized basically means that the spatial and temporal layers are decoupled and processed separately from each other. This makes the computations much more efficient.
What is a 3D-UNet?
3D U-Net is a unique computer vision neural network that first downsamples the video through a series of these factorized spatio-temporal convolutional layers, basically extracting video features at different resolutions. Then, there is an upsampling path that expands the low-dimensional features back to the shape of the original video. While upsampling, skip connections are used to reuse the generated features during the downsampling path.

Remember in any convolutional neural network, the earlier layers always capture detailed information about local sections of the image, while latter layers pick up global level pattern by accessing larger sections — so by using skip connections, U-Net combines local details with global features to be a super-awesome network for feature learning and denoising.
VDM is jointly trained on paired image-text and video-text datasets. While it’s a great proof of concept, VDM generates quite low-resolution videos for today’s standards.
You can read more about VDM here.
Make-A-Video by Meta AI takes the bold approach of claiming that we don’t necessarily need labeled-video data to train video diffusion models. WHHAAA?! Yes, you read that right.
Make A Video first trains a regular text-to-image diffusion model, just like Dall-E or Stable Diffusion with paired image-text data. Next, unsupervised learning is done on unlabelled video data to teach the model temporal relationships. The additional layers of the network are trained using a technique called masked spatio-temporal decoding, where the network learns to generate missing frames by processing the visible frames. Note that no labelled video data is needed in this pipeline (although further video-text fine-tuning is possible as an additional third step), because the model learns spatio-temporal relationships with paired text-image and raw unlabelled video data.

The video outputted by the above model is 64×64 with 16 frames. This video is then upsampled along the time and pixel axis using separate neural networks called Temporal Super Resolution or TSR (insert new frames between existing frames to increase frames-per-second (fps)), and Spatial Super Resolution or SSR (super-scale the individual frames of the video to be higher resolution). After these steps, Make-A-Video outputs 256×256 videos with 76 frames.
You can learn more about Make-A-Video right here.
Imagen video employs a cascade of seven models for video generation and enhancement. The process starts with a base video generation model that creates low-resolution video clips. This is followed by a series of super-resolution models — three SSR (Spatial Super Resolution) models for spatial upscaling and three TSR (Temporal Super Resolution) models for temporal upscaling. This cascaded approach allows Imagen Video to generate high-quality, high-resolution videos with impressive temporal consistency. Generates high-quality, high-resolution videos with impressive temporal consistency

Models like Nvidia’s VideoLDM tries to address the temporal consistency issue by using latent diffusion modelling. First they train a latent diffusion image generator. The basic idea is to train a Variational Autoencoder or VAE. The VAE consists of an encoder network that can compress input frames into a low dimensional latent space and another decoder network that can reconstruct it back to the original images. The diffusion process is done entirely on this low dimensional space instead of the full pixel-space, making it much more computationally efficient and semantically powerful.

The diffusion model is trained entirely in the low dimensional latent space, i.e. the diffusion model learns to denoise the low dimensional latent space images instead of the full resolution frames. This is why we call it Latent Diffusion Models. The resulting latent space outputs is then pass through the VAE decoder to convert it back to pixel-space.
The decoder of the VAE is enhanced by adding new temporal layers in between it’s spatial layers. These temporal layers are fine-tuned on video data, making the VAE produce temporally consistent and flicker-free videos from the latents generated by the image diffusion model. This is done by freezing the spatial layers of the decoder and adding new trainable temporal layers that are conditioned on previously generated frames.

You can learn more about Video LDMs here.
While Video LDM compresses individual frames of the video to train an LDM, SORA compresses video both spatially and temporally. Recent papers like CogVideoX have demonstrated that 3D Causal VAEs are great at compressing videos making diffusion training computationally efficient, and able to generate flicker-free consistent videos.

A transformer model is used as the diffusion network instead of the more traditional UNEt model. Of course, transformers need the input data to be presented as a sequence of tokens. That’s why the compressed video encodings are flattened into a sequence of patches. Observe that each patch and its location in the sequence represents a spatio-temporal feature of the original video.

It is speculated that OpenAI has collected a rather large annotation dataset of video-text data which they are using to train conditional video generation models.
Combining all the strengths listed below, plus more tricks that the ironically-named OpenAI may never disclose, SORA promises to be a giant leap in video generation AI models.
The future of AI is easy to predict. In 2024, Data + Compute = Intelligence. Large corporations will invest computing resources to train large diffusion transformers. They will hire annotators to label high-quality video-text data. Large-scale text-video datasets probably already exist in the closed-source domain (looking at you OpenAI), and they may become open-source within the next 2–3 years, especially with recent advancements in AI video understanding. It remains to be seen if the upcoming huge computing and financial investments could on their own solve video generation. Or will further architectural and algorithmic advancements be needed from the research community?
Author’s Youtube Channel: https://www.youtube.com/@avb_fj
Video on this topic: https://youtu.be/KRTEOkYftUY
15-step Zero-to-Hero on Conditional Image Diffusion: https://youtu.be/w8YQc
Video Diffusion Models: https://arxiv.org/abs/2204.03458
Imagen: https://imagen.research.google/video/
Make A Video: https://makeavideo.studio/
Video LDM: https://research.nvidia.com/labs/toronto-ai/VideoLDM/index.html
CogVideoX: https://arxiv.org/abs/2408.06072
OpenAI SORA article: https://openai.com/index/sora/
Diffusion Transformers: https://arxiv.org/abs/2212.09748
Useful article: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
The Evolution of Text to Video Models was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
The Evolution of Text to Video Models
Go Here to Read this Fast! The Evolution of Text to Video Models



On the 12th of September at 10:00 a.m., I was in the class “Frontier Topics in Generative AI,” a graduate-level course at Arizona State University. A day before this, on the 11th of September, I submitted a team assignment that involved trying to identify flaws and erroneous outputs generated by GPT-4 (essentially trying to prompt GPT-4 to see if it makes mistakes on trivial questions or high-school-level reasoning questions) as part of another graduate-level class “Topics in Natural Language Processing.” We identified several trivial mistakes that GPT-4 made, one of them being unable to count the number of r’s in the word strawberry. Before submitting this assignment, I researched several peer-reviewed papers on the internet that identified where and why GPT -4 made mistakes and how you could rectify them. Most of the documents I came across identified two main domains where GPT-4 erred, and they dealt with planning and reasoning.
This paper¹ (although almost a year old) goes in depth through several cases where GPT-4 fails to answer trivial questions that involve simple counting, simple arithmetic, elementary logic, and even common sense. The paper¹ reasons that these questions require some level of reasoning and that because GPT-4 is utterly incapable of reasoning, it almost always gets these questions wrong. The author also states that reasoning is a (very) computationally hard problem. Although GPT-4 is very compute-intensive, its compute-intensive nature is not geared towards involving reasoning in solving the questions that it’s prompted with. Several other papers echo this notion of GPT-4 being unable to reason or plan²³.
Well, let’s get back to the 12th of September. My class ends at around 10:15 a.m., and I come back straight home from class and open up YouTube on my phone as I dig into my morning brunch. The first recommendation on my YouTube homepage was a video from OpenAI announcing the release of GPT-o1 named “Building OpenAI o1”. They announced that this model is a straight-up a reasoning model and that it would take more time to reason and answer your questions providing more accurate answers. They state that they have put more compute time into RL (Reinforcement Learning) than previous models to generate coherent chains-of-thoughts⁴. Essentially, they have trained the chain of thought generation process using Reinforcement learning (to generate and hone its own generated chain of thought process). In the o1 models, the engineers were able to ask the model questions as to why it was wrong (whenever it was wrong) in its chain-of-thought process and it could identify the mistakes and correct itself from them. The model could question itself and have to reflect (see “Reflection in LLMs”) on its outputs and correct itself.
In another video “Reasoning with OpenAI o1”, Jerry Tworek demonstrates how previous OpenAI and most other LLMs in the market tend to fail on the following prompt:
“Assume the laws of physics on earth. A small strawberry is put into a normal cup and the cup is placed upside down on a table. Someone then takes the cup and puts it inside the microwave. Where is the strawberry now? Explain your reasoning step by step.”
Legacy GPT-4 answers as follows:

The relatively newer GPT-4o also gets it wrong:

GPT o1 gets the answer right:

If you click on the dropdown at the beginning of the model’s response (see Figure 4), you know that it elicits its thought process (chain-of-thought), and the researchers at OpenAI claim that the o1 model has been trained with reinforcement learning to get this chain of thought better. Also, it’s interesting to note that Jason Wei (you can see him sitting third from the right on the bottom row in the video “Building OpenAI o1”), the author of the chain-of-thought paper⁴ that he published back at Google, is now an OpenAI employee that is working on the o1 model to integrate the chain-of-thought (that he discovered at Google) process into this model.

Now, let’s get back to the counting question that my team found out about as part of my assignment.
How many r’s are in the word strawberry?
Let’s run this question on GPT-4o:

A very simple counting problem that it get’s wrong.
Let’s run this on the new GPT o1:

GPT o1 gets the answer right by thinking for a couple of seconds. The researchers at OpenAI say that it goes through its response repeatedly and thinks its way to the right answer. There does seem to be really significant improvements in terms of the model’s ability to solve a lot of academic exam questions.
Anyways after I opened up X.com (Formerly Twitter), I came across several people showcasing their attempts at trying to make the o1 model fail. This is a fascinating one I came across (this tweet by @creeor) where the model fails to answer a trivial question in which the answer lies in the question itself. So I tried the exact same prompt on my account and it gave me the wrong answer (see Figure 7).

When I ask it what this classic riddle is that it is talking about it tells me about a riddle that it memorized from the internet. It’s interesting to see how these models can sometimes fall back on memorized content rather than truly reasoning through a problem. Despite the significant advancements and benchmark improvements, there are still areas where AI models struggle, especially with tasks that require deeper reasoning or understanding context in a nuanced way. While benchmarks can show progress, real-world applications often reveal the limitations. It’s through continuous testing, feedback, and real-world use cases that these models can be further refined.

There was a compilation of ChatGPT failures done about a year and a half back. Compilations of model errors are invaluable for understanding and improving AI systems. I’m sure people will come up with another compilation of errors for the o1 model soon.
Although I agree entirely that the chain-of-thought process benefits both AI and human learning, real learning indeed comes from experience and making mistakes.
I will continue to keep posting my findings on the o1 model on my Medium page. Follow my account to keep posted. And thank you for taking the time to read my Medium post.
[1] Arkoudas, Konstantine. “GPT-4 can’t reason.” arXiv preprint arXiv:2308.03762 (2023).
[2] Aghzal, Mohamed, Erion Plaku, and Ziyu Yao. “Look Further Ahead: Testing the Limits of GPT-4 in Path Planning.” arXiv preprint arXiv:2406.12000 (2024).
[3] Kambhampati, Subbarao, et al. “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” arXiv preprint arXiv:2402.01817 (2024).
[4] Wei, Jason, et al. “Chain-of-thought prompting elicits reasoning in large language models.” Advances in neural information processing systems 35 (2022): 24824–24837.
OpenAI o1: Is This the Enigmatic Force That Will Reshape Every Knowledge Sector We Know? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
OpenAI o1: Is This the Enigmatic Force That Will Reshape Every Knowledge Sector We Know?
Using Polars with NVIDIA GPU can speed up your data pipelines
Originally appeared here:
Polars + NVIDIA GPU Tutorial


Originally appeared here:
Integrate Amazon Bedrock Knowledge Bases with Microsoft SharePoint as a data source
Feeling inspired to write your first TDS post? We’re always open to contributions from new authors.
The process and requirements for landing your first data science or machine learning job have shifted considerably in recent years. So has the definition of excelling at your existing role. We can attribute this to any number of factors: the rise of LLMs and AI-powered tools, unfavorable economic conditions (and the layoffs and hiring freezes that came in their wake), and the shifting terrain around remote work all come to mind.
Periods of transition and uncertainty can be difficult to navigate—especially if you entered the field expecting a smooth ride in a booming, lucrative industry. But there’s no reason to despair: individual data professionals might not be able to turn the tide on their own, but they can take action to become more professionally resilient and shock-proof their career trajectory.
The articles we selected for you this week focus on the core skills you should develop to become more immune to unpredictable trends—and lay out concrete steps you can take to cultivate them. From advice for recent grads on snagging your first internship to insights on effective management of data teams, they cater to practitioners across a wide range of roles and seniority levels. Let’s dive in.
Ready to branch out into some other topics? Here are a few more standout articles from the past couple of weeks:
Thank you for supporting the work of our authors! As we mentioned above, we love publishing articles from new authors, so if you’ve recently written an interesting project walkthrough, tutorial, or theoretical reflection on any of our core topics, don’t hesitate to share it with us.
Until the next Variable,
TDS Team
How to Build Your Own Roadmap for a Successful Data Science Career was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
How to Build Your Own Roadmap for a Successful Data Science Career
Go Here to Read this Fast! How to Build Your Own Roadmap for a Successful Data Science Career


Learn to build a neutral network that can drive using PyTorch in Python
Originally appeared here:
The Basics Behind AI Models for Self-Driving Cars
Go Here to Read this Fast! The Basics Behind AI Models for Self-Driving Cars