Tag: AI

The Math Behind “The Curse of Dimensionality”
Maxime Wolf
Dive into the “Curse of Dimensionality” concept and understand the math behind all the surprising phenomena that arise in high dimensions.

Image from Dall-E

In the realm of machine learning, handling high-dimensional vectors is not just common; it’s essential. This is illustrated by the architecture of popular models like Transformers. For instance, BERT uses 768-dimensional vectors to encode the tokens of the input sequences it processes and to better capture complex patterns in the data. Given that our brain struggles to visualize anything beyond 3 dimensions, the use of 768-dimensional vectors is quite mind-blowing!

While some Machine and Deep Learning models excel in these high-dimensional scenarios, they also present many challenges. In this article, we will explore the concept of the “curse of dimensionality”, explain some interesting phenomena associated with it, delve into the mathematics behind these phenomena, and discuss their general implications for your Machine Learning models.

Note that detailed mathematical proofs related to this article are available on my website as a supplementary extension to this article.

What is the curse of dimensionality?

People often assume that geometric concepts familiar in three dimensions behave similarly in higher-dimensional spaces. This is not the case. As dimension increases, many interesting and counterintuitive phenomena arise. The “Curse of Dimensionality” is a term invented by Richard Bellman (famous mathematician) that refers to all these surprising effects.

What is so special about high-dimension is how the “volume” of the space (we’ll explore that in more detail soon) is growing exponentially. Take a graduated line (in one dimension) from 1 to 10. There are 10 integers on this line. Extend that in 2 dimensions: it is now a square with 10 × 10 = 100 points with integer coordinates. Now consider “only” 80 dimensions: you would already have 10⁸⁰ points which is the number of atoms in the universe.

In other words, as the dimension increases, the volume of the space grows exponentially, resulting in data becoming increasingly sparse.

High-dimensional spaces are “empty”

Consider this other example. We want to calculate the farthest distance between two points in a unit hypercube (where each side has a length of 1):
- In 1 dimension (the hypercube is a line segment from 0 to 1), the maximum distance is simply 1.
- In 2 dimensions (the hypercube forms a square), the maximum distance is the distance between the opposite corners [0,0] and [1,1], which is √2, calculated using the Pythagorean theorem.
- Extending this concept to n dimensions, the distance between the points at [0,0,…,0] and [1,1,…,1] is √n. This formula arises because each additional dimension adds a square of 1 to the sum under the square root (again by the Pythagorean theorem).
Interestingly, as the number of dimensions n increases, the largest distance within the hypercube grows at an O(√n) rate. This phenomenon illustrates a diminishing returns effect, where increases in dimensional space lead to proportionally smaller gains in spatial distance. More details on this effect and its implications will be explored in the following sections of this article.

The notion of distance in high dimensions

Let’s dive deeper into the notion of distances we started exploring in the previous section.

We had our first glimpse of how high-dimensional spaces render the notion of distance almost meaningless. But what does this really mean, and can we mathematically visualize this phenomenon?

Let’s consider an experiment, using the same n-dimensional unit hypercube we defined before. First, we generate a dataset by randomly sampling many points in this cube: we effectively simulate a multivariate uniform distribution. Then, we sample another point (a “query” point) from that distribution and observe the distance from its nearest and farthest neighbor in our dataset.

Here is the corresponding Python code.
```
def generate_data(dimension, num_points):
 ''' Generate random data points within [0, 1] for each coordinate in the given dimension '''
 data = np.random.rand(num_points, dimension)
 return data


def neighbors(data, query_point):
 ''' Returns the nearest and farthest point in data from query_point '''
 nearest_distance = float('inf')
 farthest_distance = 0
 for point in data:
 distance = np.linalg.norm(point - query_point)
 if distance < nearest_distance:
 nearest_distance = distance
 if distance > farthest_distance:
 farthest_distance = distance
 return nearest_distance, farthest_distance
```
We can also plot these distances:

Distances to nearest and farthest points as n increases (Image by the author)

Using a log scale, we observe that the relative difference between the distance to the nearest and farthest neighbor tends to decrease as the dimension increases.

This is a very unintuitive behavior: as explained in the previous section, points are very sparse from each other because of the exponentially increasing volume of the space, but at the same time, the relative distances between points become smaller.

The notion of nearest neighbors vanishes

This means that the very concept of distance becomes less relevant and less discriminative as the dimension of the space increases. As you can imagine, it poses problems for Machine Learning algorithms that solely rely on distances such as kNN.

The maths: the n-ball

We will now talk about some other interesting phenomena. For this, we’ll need the n-ball. A n-ball is the generalization of a ball in n dimensions. The n-ball of radius R is the collection of points at a distance at most R from the center of the space 0.

Let’s consider a radius of 1. The 1-ball is the segment [-1, 1]. The 2-ball is the disk delimited by the unit circle, whose equation is x² + y² ≤ 1. The 3-ball (what we normally call a “ball”) has the equation x² + y² + z² ≤ 1. As you understand, we can extend that definition to any dimension:

The question now is: what is the volume of this ball? This is not an easy question and requires quite a lot of maths, which I won’t detail here. However, you can find all the details on my website, in my post about the volume of the n-ball.

After a lot of fun (integral calculus), you can prove that the volume of the n-ball can be expressed as follows, where Γ denotes the Gamma function.

For example, with R = 1 and n = 2, the volume is πR², because Γ(2) = 1. This is indeed the “volume” of the 2-ball (also called the “area” of a circle in this case).

However, beyond being an interesting mathematical challenge, the volume of the n-ball also has some very surprising properties.

As the dimension n increases, the volume of the n-ball converges to 0.

This is true for every radius, but let’s visualize this phenomenon with a few values of R.

Volume of the n-ball for different radii as the dimension increases (Image by the author)

As you can see, it only converges to 0, but it starts by increasing and then decreases to 0. For R = 1, the ball with the largest volume is the 5-ball, and the value of n that reaches the maximum shifts to the right as R increases.

Here are the first values of the volume of the unit n-ball, up to n = 10.

Volume of the unit n-ball for different values of n (Image by the author)

The volume of a high-dimensional unit ball is concentrated near its surface.

For small dimensions, the volume of a ball looks quite “homogeneous”: this is not the case in high dimensions.

A spherical shell

Let’s consider an n-ball with radius R and another with radius R-dR where dR is very small. The portion of the n-ball between these 2 balls is called a “shell” and corresponds to the portion of the ball near its surface(see the visualization above in 3D). We can compute the ratio of the volume of the entire ball and the volume of this thin shell only.

Ratio (total volume / thin shell volume) as n increases (Image by the author)

As we can see, it converges very quickly to 0: almost all the volume is near the surface in high dimensional spaces. For instance, for R = 1, dR = 0.05, and n = 50, about 92.3% of the volume is concentrated in the thin shell. This shows that in higher dimensions, the volume is in “corners”. This is again related to the distortion of the concept of distance we have seen earlier.

Note that the volume of the unit hypercube is 2ⁿ. The unit sphere is basically “empty” in very high dimensions, while the unit hypercube, in contrast, gets exponentially more points. Again, this shows how the idea of a “nearest neighbor” of a point loses its effectiveness because there is almost no point within a distance R of a query point q when n is large.

Curse of dimensionality, overfitting, and Occam’s Razor

The curse of dimensionality is closely related to the overfitting principle. Because of the exponential growth of the volume of the space with the dimension, we need very large datasets to adequately capture and model high-dimensional patterns. Even worse: we need a number of samples that grows exponentially with the dimension to overcome this limitation. This scenario, characterized by many features yet relatively few data points, is particularly prone to overfitting.

Occam’s Razor suggests that simpler models are generally better than complex ones because they are less likely to overfit. This principle is particularly relevant in high-dimensional contexts (where the curse of dimensionality plays a role) because it encourages the reduction of model complexity.

Applying Occam’s Razor principle in high-dimensional scenarios can mean reducing the dimensionality of the problem itself (via methods like PCA, feature selection, etc.), thereby mitigating some effects of the curse of dimensionality. Simplifying the model’s structure or the feature space helps in managing the sparse data distribution and in making distance metrics more meaningful again. For instance, dimensionality reduction is a very common preliminary step before applying the kNN algorithm. More recent methods, such as ANNs (Approximate Nearest Neighbors) also emerge as a way to deal with high-dimensional scenarios.

Blessing of dimensionality?

Image by Dall-E

While we’ve outlined the challenges of high-dimensional settings in machine learning, there are also some advantages!
- High dimensions can enhance linear separability, making techniques like kernel methods more effective.
- Additionally, deep learning architectures are particularly adept at navigating and extracting complex patterns from high-dimensional spaces.
As always with Machine Learning, this is a trade-off: leveraging these advantages involves balancing the increased computational demands with potential gains in model performance.

Conclusion

Hopefully, this gives you an idea of how “weird” geometry can be in high-dimension and the many challenges it poses for machine learning model development. We saw how, in high-dimensional spaces, data is very sparse but also tends to be concentrated in the corners, and distances lose their usefulness. For a deeper dive into the n-ball and mathematical proofs, I encourage you to visit the extended of this article on my website.

While the “curse of dimensionality” outlines significant limitations in high-dimensional spaces, it’s exciting to see how modern deep learning models are increasingly adept at navigating these complexities. Consider the embedding models or the latest LLMs, for example, which utilize very high-dimensional vectors to more effectively discern and model textual patterns.

Want to learn more about Transformers and how they transform your data under the hood? Check out my previous article:

Transformers: How Do They Transform Your Data?

References:
- [1] “Volume of an n-ball.” Wikipedia, https://en.wikipedia.org/wiki/Volume_of_an_n-ball
- [2] “Curse of Dimensionality.” Wikipedia, https://en.wikipedia.org/wiki/Curse_of_dimensionality#Blessing_of_dimensionality
- [3] Peterson, Ivars. “An Adventure in the Nth Dimension.” American Scientist, https://www.americanscientist.org/article/an-adventure-in-the-nth-dimension
- [4] Zhang, Yuxi. “Curse of Dimensionality.” Geek Culture on Medium, July 2021, https://medium.com/geekculture/curse-of-dimensionality-e97ba916cb8f
- [5] “Approximate Nearest Neighbors (ANN).” Activeloop, https://www.activeloop.ai/resources/glossary/approximate-nearest-neighbors-ann/
The Math Behind “The Curse of Dimensionality” was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
The Math Behind “The Curse of Dimensionality”

Go Here to Read this Fast! The Math Behind “The Curse of Dimensionality”
April 20, 2024
Tiny Llama — a Performance Review and Discussion

Eivind Kjosbakken

Learn how you can utilize a tiny large language model, fine-tune it, and achieve high performance

Continue reading on Towards Data Science »

Originally appeared here:
Tiny Llama — a Performance Review and Discussion

Go Here to Read this Fast! Tiny Llama — a Performance Review and Discussion

April 20, 2024
Soft Skills You Need For A Career in Data in 2024

Rashi Desai

Now that you know how to code & visualize data, what’s next?

Continue reading on Towards Data Science »

Originally appeared here:
Soft Skills You Need For A Career in Data in 2024

Go Here to Read this Fast! Soft Skills You Need For A Career in Data in 2024

April 20, 2024
Mistral vs Mixtral: Comparing the 7B, 8x7B, and 8x22B Large Language Models

Dmitrii Eliuseev

Running the 7B and 22B Models in Google Colab

Continue reading on Towards Data Science »

Originally appeared here:
Mistral vs Mixtral: Comparing the 7B, 8x7B, and 8x22B Large Language Models

Go Here to Read this Fast! Mistral vs Mixtral: Comparing the 7B, 8x7B, and 8x22B Large Language Models

April 20, 2024
Meta Llama 3 Optimized CPU Inference with Hugging Face and PyTorch
Eduardo Alvarez
Created with Nightcafe — Image property of Author

Learn how to reduce model latency when deploying Meta* Llama 3 on CPUs

The much-anticipated release of Meta’s third-generation batch of Llama is here, and I want to ensure you know how to deploy this state-of-the-art (SoTA) LLM optimally. In this tutorial, we will focus on performing weight-only-quantization (WOQ) to compress the 8B parameter model and improve inference latency, but first, let’s discuss Meta Llama 3.

Llama 3

To date, the Llama 3 family includes models ranging from 8B to 70B parameters, with more versions coming in the future. The models come with a permissive Meta Llama 3 license, you are encouraged to review before accepting the terms required to use them. This marks an exciting chapter for the Llama model family and open-source AI.

Architecture

The Llama 3 is an auto-regressive LLM based on a decoder-only transformer. Compared to Llama 2, the Meta team has made the following notable improvements:
- Adoption of grouped query attention (GQA), which improves inference efficiency.
- Optimized tokenizer with a vocabulary of 128K tokens designed to encode language more efficiently.
- Trained on a 15 trillion token dataset, this is 7x larger than Llama 2’s training dataset and includes 4x more code.
The figure below (Figure 1) is the result of print(model) where model is meta-llama/Meta-Llama-3–8B-Instruct. In this figure, we can see that the model comprises 32 LlamaDecoderLayers composed of Llama Attention self-attention components. Additionally, it has LlamaMLP, LlamaRMSNorm, and a Linear head. We hope to learn more once the Llama 3 research paper is released.

Figure 1. Output of `print(model)` showcasing the distribution of layers across llama-3–8B-instruct’s architecture — Image by Author

Language Modeling Performance

The model was evaluated on various industry-standard language modeling benchmarks, such as MMLU, GPQA, HumanEval, GSM-8K, MATH, and more. For the purpose of this tutorial, we will review the performance of the “Instruction Tuned Models” (Figure 2). The most remarkable aspect of these figures is that the Llama 3 8B parameter model outperforms Llama 2 70B by 62% to 143% across the reported benchmarks while being an 88% smaller model!

Figure 2 . Summary of Llama 3 instruction model performance metrics across the MMLU, GPQA, HumanEval, GSM-8K, and MATH LLM benchmarks. — Image by Author (source)

The increased language modeling performance, permissive licensing, and architectural efficiencies included with this latest Llama generation mark the beginning of a very exciting chapter in the generative AI space. Let’s explore how we can optimize inference on CPUs for scalable, low-latency deployments of Llama 3.

Optimizing Llama 3 Inference with PyTorch

In a previous article, I covered the importance of model compression and overall inference optimization in developing LLM-based applications. In this tutorial, we will focus on applying weight-only quantization (WOQ) to meta-llama/Meta-Llama-3–8B-Instruct. WOQ offers a balance between performance, latency, and accuracy, with options to quantize to int4 or int8. A key component of WOQ is the dequantization step, which converts int4/in8 weights back to bf16 before computation.

Fig 3. Simple illustration of weight-only quantization, with pre-quantized weights in orange and the quantized weights in green. Note that this depicts the initial quantization to int4/int8 and dequantization to fp16/bf16 for the computation step. — Image by Author (source)

Environment Setup

You will need approximately 60GB of RAM to perform WOQ on Llama-3-8B-Instruct. This includes ~30GB to load the full model and ~30GB for peak memory during quantization. The WOQ Llama 3 will only consume ~10GB of RAM, meaning we can free ~50GB of RAM by releasing the full model from memory.

You can run this tutorial on the Intel® Tiber® Developer Cloud free JupyterLab* environment. This environment offers a 4th Generation Intel® Xeon® CPU with 224 threads and 504 GB of memory, more than enough to run this code.

If running this in your own IDE, you may need to address additional dependencies like installing Jupyter and/or configuring a conda/python environment. Before getting started, ensure that you have the following dependencies installed.
```
intel-extension-for-pytorch==2.2
transformers==4.35.2
torch==2.2.0
huggingface_hub
```
Accessing and Configuring Llama 3

You will need a Hugging Face* account to access Llama 3’s model and tokenizer.

To do so, select “Access Tokens” from your settings menu (Figure 4) and create a token.

Figure 4. Snapshot of the Hugging Face token configuration console — Image by Author

Copy your access token and paste it into the “Token” field generated inside your Jupyter cell after running the following code.
```
from huggingface_hub import notebook_login, Repository

# Login to Hugging Face
notebook_login()
```
Go to meta-llama/Meta-Llama-3–8B-Instruct and carefully evaluate the terms and license before providing your information and submitting the Llama 3 access request. Accepting the model’s terms and providing your information is yours and yours alone.

Quantizing Llama-3–8B-Instruct with WOQ

We will leverage the Intel® Extension for PyTorch* to apply WOQ to Llama 3. This extension contains the latest PyTorch optimizations for Intel hardware. Follow these steps to quantize and perform inference with an optimized Llama 3 model:
1. Llama 3 Model and Tokenizer: Import the required packages and use the AutoModelForCausalLM.from_pretrained() and AutoTokenizer.from_pretrained() methods to load the Llama-3–8B-Instruct specific weights and tokenizer.
```
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

Model = 'meta-llama/Meta-Llama-3-8B-Instruct'

model = AutoModelForCausalLM.from_pretrained(Model)
tokenizer = AutoTokenizer.from_pretrained(Model)
```
2. Quantization Recipe Config: Configure the WOQ quantization recipe. We can set the weight_dtype variable to the desired in-memory datatypes, choosing from torch.quint4x2 or torch.qint8 for int4 and in8, respectively. Additionally we can use lowp_model to define the dequantization precision. For now, we will keep this as ipex.quantization.WoqLowpMode.None to keep the default bf16 computation precision.
```
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
 weight_dtype=torch.quint4x2, # or torch.qint8
 lowp_mode=ipex.quantization.WoqLowpMode.NONE, # or FP16, BF16, INT8
)
checkpoint = None # optionally load int4 or int8 checkpoint

# PART 3: Model optimization and quantization
model_ipex = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)

del model 
```
We use ipex.llm.optimize() to apply WOQ and then del model to delete the full model from memory and free ~30GB of RAM.

3. Prompting Llama 3: Llama 3, like LLama 2, has a pre-defined prompting template for its instruction-tuned models. Using this template, developers can define specific model behavior instructions and provide user prompts and conversation history.
```
system= """nn You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. If you don't know the answer to a question, please don't share false information."""
user= "nn You are an expert in astronomy. Can you tell me 5 fun facts about the universe?"
model_answer_1 = 'None'

llama_prompt_tempate = f"""
<|begin_of_text|>n<|start_header_id|>system<|end_header_id|>{system}
<|eot_id|>n<|start_header_id|>user<|end_header_id|>{user}
<|eot_id|>n<|start_header_id|>assistant<|end_header_id|>{model_answer_1}<|eot_id|>
"""

inputs = tokenizer(llama_prompt_tempate, return_tensors="pt").input_ids
```
We provide the required fields and then use the tokenizer to convert the entire template into tokens for the model.

4. Llama 3 Inference: For text generation, we leverage TextStreamer to generate a real-time inference stream instead of printing the entire output at once. This results in a more natural text generation experience for readers. We provide the configured streamer to model_ipex.generate() and other text-generation parameters.
```
with torch.inference_mode():
 tokens = model_ipex.generate(
 inputs,
 streamer=streamer,
 pad_token_id=128001,
 eos_token_id=128001,
 max_new_tokens=300,
 repetition_penalty=1.5,
)
```
Upon running this code, the model will start generating outputs. Keep in mind that these are unfiltered and non-guarded outputs. For real-world use cases, you will need to make additional post-processing considerations.

Figure 5. Streamed inference of Llama-3–8B-Instruct with WOQ mode compression at int4 running on the Intel Tiber Developer Cloud’s JupyterLab environment — Gif by Author

That’s it. With less than 20 lines of code, you now have a low-latency CPU optimized version of the latest SoTA LLM in the ecosystem.

Considerations for Deployment

Depending on your inference service deployment strategy, there are a few things that you will want to consider:
- If deploying instances of Llama 3 in containers, WOQ will offer a smaller memory footprint and allow you to serve multiple inference services of the model on a single hardware node.
- When deploying multiple inference services, you should optimize the threads and memory reserved for each service instance. Leave enough additional memory (~4 GB) and threads (~4 threads) to handle background processes.
- Consider saving the WOQ version of the model and storing it in a model registry to eliminate the need to re-quantize the model per instance deployment.
Conclusion and Discussion

Meta’s Llama 3 LLM family delivers remarkable improvements over previous generations with a diverse range of configurations (8B to 70B). In this tutorial, we explored enhancing CPU inference with weight-only quantization (WOQ), a technique that minimizes latency while preserving accuracy.

By integrating the new generation of performance-oriented Llama 3 LLMs with optimization techniques like WOQ, developers can unlock new possibilities for GenAI applications. This combination simplifies the hardware requirements to achieve high-fidelity, low-latency results from LLMs integrated into new and existing systems.

A few exciting things to try next would be:
1. Experiment with Quantization Levels: You should test int4 and int8 quantization to identify the best compromise between performance and accuracy for your specific applications.
2. Performance Monitoring: It is crucial to continuously assess the performance and accuracy of the Llama 3 model across different real-world scenarios to ensure that quantization maintains the desired effectiveness.
3. Test more Llamas: Explore the entire Llama 3 family and evaluate the impact of WOQ and other PyTorch quantization recipes.
Thank you for reading! Don’t forget to follow my profile for more articles like this!

*Other names and brands may be claimed as the property of others.

Meta Llama 3 Optimized CPU Inference with Hugging Face and PyTorch was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Meta Llama 3 Optimized CPU Inference with Hugging Face and PyTorch

Go Here to Read this Fast! Meta Llama 3 Optimized CPU Inference with Hugging Face and PyTorch
April 20, 2024
N-of-1 Trials and Analyzing Your Own Fitness Data
Merete Lutz
Do I really sleep worse after drinking alcohol?

Photo by Luke Chesser on Unsplash

I first heard of N-of-1 trials in 2018 as a master’s student studying epidemiology. I was in my Intermediate Epidemiologic and Clinical Research Methods class, and we had a guest lecture from Dr. Eric Daza on N-of-1 study design. The N-of-1 study can be thought of as a clinical trial investigating the efficacy of an intervention on an individual patient. At the time, this methodology was an emerging practice, with promising implications for personalized medicine and optimizing healthcare for the individual.

As an aside at the end of the lecture, he mentioned that N-of-1 experiments were a hobby for some people who were obsessed with their Fitbit and familiar with statistics. These data scientists who had access to their own biometric data would conduct experiments and analyses to optimize their sleep, workouts, and diet. I was fascinated.

This lecture couldn’t have come at a more perfect time. I had spent the last 5 years as a student-athlete on the women’s volleyball team and was about to embark on what would become a 5 yearlong professional volleyball career. I had just started using a Whoop strap to track my workouts, sleep, and recovery, and had all this new data at my fingertips. I was learning statistics and study design, and now I knew how to put them to work and maybe get a leg-up in my new career.

I religiously tracked everything for the next year and a half via my Whoop, but eventually stopped wearing the device, as at the time the company didn’t allow you to download your own data. Fast forward a couple years, and now I’m a Master of Data Science Student at the University of British Columbia. I’m armed with even more analytical methods, and Whoop finally lets you access your old data! Now more than ever, I can finally conduct my own N-of-1 study and answer some of the questions I had back then.

Before getting to the analysis, we must first define the N-of-1 study framework, and examine the historical method of conducting causal medical research.

Randomized Clinical Trials

The gold standard of modern day medical research is the randomized clinical trial or RCT. Say we want to find out if a new drug lowers the risk of heart attacks. In an RCT, a group of patients are randomly assigned to a treatment (the drug) or control (a placebo pill). The researchers have cleverly designed this experiment so that the individuals making up the two groups have similar characteristics, such that the major difference between them is whether or not they are taking the drug. We follow these individuals for some time and take note of the heart attacks that occur. At the end of the experiment, we count up these numbers in each group, and do some statistics so that we can compare heart attack incidence between each group and whether there was a difference between the groups that was statistically significant.

RCTs are incredibly powerful tools of causal inference, and allow us to discover whether a certain intervention leads to a desired response. They are the historical backbone of applied medical research, but are somewhat limited by the importance they place on the generalizability of their results. When an RCT is conducted, the end goal is to set a new standard of practice for a wider population beyond the study participants. We make an inference on the population based on the sample, and in doing so we average out individual response. The act of doing so is almost contrary to the goal of medicine; that of caring for individual patients.

N-of-1 Trials

N-of-1 trials address this limitation by taking RCT study design and applying it on an individual scale. They allow us to explore the variability in patient response to a given treatment, and can lead to better patient outcomes by limiting time spent on a suboptimal treatment. While the idea of an N-of-1 study has been around for some time, such studies are more accessible now because of the advancement in technology allowing for easier collection of data.

N-of-1 trials aren’t always the answer to personalized medicine. In the case of fast moving maladies like infectious diseases, you likely won’t have time to conduct such an individualized trial, and are better off going with a more generalized approach. For the treatment of chronic conditions however, N-of-1 trials provide an incredibly promising avenue towards the improvement of health outcomes. These conditions may not be directly life threatening, and are observable over long periods of time. This allows for multiple different interventions to be attempted, in hopes of finding an optimal treatment.

Outside of medicine, you can also apply the N-of-1 trial to your every day life. How many of us have tried a new medicine, diet, supplement, workout or sleep routine and struggled to say whether it worked? It can be hard to conclusively state whether the intervention had any effect, as most of the available evidence is anecdotal or hard to quantify. By using a N-of-1 study framework in combination with your own biometric data taken from wearable health trackers, you can get conclusive evidence that allows you to make lifestyle changes you know will make a difference.

N-of-1 Trials in Practice

To show you an example of this methodology in practice, I will conduct my own analysis on a selection of data collected from my Whoop strap from April 27th, 2018 to October 5th, 2019. Our research question for this N-of-1 study is:

Does drinking alcohol lead to poor sleep?

As an athlete and epidemiologist, I am very aware of how detrimental alcohol can be on your sleep, athletic performance and general wellbeing. I’ve constantly been told how athletes should not drink, however its one thing to be told, but another to see the evidence for yourself. Once I started wearing my Whoop I noticed how my sleep score (a metric calculated by the Whoop app) would suffer after drinking alcohol. Sometimes even a day later, I thought I could still see the effect. These observations made me want to do my own analysis, which I can finally complete now.

Notes on the Data

The two variables of interest in our analysis is sleep performance score and alcohol consumption. Sleep performance score ranges from 0 to 100 and is a metric calculated by the Whoop app from biometric data like respiratory rate, light sleep duration, slow wave sleep duration, and REM sleep duration.

The alcohol consumption variable is the response to the question “Did you have any alcoholic drinks yesterday?” that is responded to by Whoop users on a daily basis upon waking up. I always answered these questions truthfully and consistently, although we are limited in our data in that the app does not ask questions about how much alcohol was consumed. This means that all levels of alcohol consumption are treated equally, which eliminates the opportunity to analyze the relationship on a deeper level. There was some missing data in our alcohol feature, but this missing information was imputed with ‘No’s as I know from personal experience that if I had drunk the night before I was sure to mark it in the app.

Exploratory Data Analysis

The first step in any analysis is to do some exploratory data analysis (EDA). This is just to get a general idea of what our data looks like, and to create a visual that will help direct our investigation.

Fig 1. Exploratory plot of the distribution of sleep performance score by level of alcohol consumed.

From the above box-plots, we see that average sleep score appears to be higher when no alcohol was consumed, and to have a narrower distribution. Curiously, there seems to be more outliers in sleep performance score when alcohol is not consumed. Perhaps travel days and jet-lag can account for these outliers, as I traveled overseas 5 times during this sample period.

Now that we have gotten a good first look at the data of interest, its time to dig into the statistical analysis.

Hypothesis Testing

To answer our research question, I will be conducting hypothesis testing. Hypothesis testing is a statistical technique that allows us to make inferences about a population based on some sample data. In this case, we are attempting to infer if me drinking alcohol is associated with having poor sleep that night. We don’t have data on alcohol consumption and sleep for every night I’ve been alive, so we study our sample data as a proxy.

The first step in hypothesis testing is to formulate my hypotheses. A ‘null hypothesis’ is the assumption that nothing interesting is happening or that there is no relationship or effect. In our case the null hypothesis is: There is no difference in mean sleep performance between nights in which alcohol was consumed and was not consumed.

An ‘alternative hypothesis’ is the hypothesis that contradicts the null, and claims that in fact there is something interesting happening. In our example the alternative hypothesis is: There is a difference in mean sleep performance between nights in which alcohol was consumed and was not consumed.

Choosing a Statistical Test

To assess which of these hypotheses is true, we have to chose a statistical test. We are curious if the average sleep score for nights in which I drank alcohol is different from the average sleep score for nights in which I did not drink alcohol, and so will be using a difference in means to test this. Specifically, our test statistic is: Mean sleep performance with no alcohol — Mean sleep performance with alcohol

Now that we have defined our framework, we can use R to calculate our test statistic and evaluate our hypotheses.

Conducting our Analysis in R

From our sample data we can calculate our observed test statistic. The code in R is included below.
```
test_stat <- data |> 
 specify(formula = sleep_performance ~ alcohol) |> 
 calculate(
 stat = "diff in means",
 order = c("No", "Yes")
 )
```
Our test statistic is 8.01. This number means that the average sleep score for nights in which I consumed no alcohol is 8.01 points higher than nights in which I did consume alcohol.

The next step in the analysis is to generate a null distribution from our sample data. A null distribution represents all the different values of test statistic we would observe if samples were drawn repeatedly from the population. The distribution is meant to reflect the variation in the test statistic purely due to random sampling. The null distribution is created in R below:
```
set.seed(42) #Setting seed for reproducibility

null_distribution <- data |> 
 specify(formula = sleep_performance ~ alcohol) |> 
 hypothesize(null = "independence") |> 
 generate(reps = 1000, type = "permute") |> 
 calculate(
 stat = "diff in means",
 order = c("No", "Yes")
 )
```
What we are doing above is taking samples with replacement from our data, and calculating the difference in means from those samples. We do this 1000 times to generate a large enough distribution so that we can determine if our observed test statistic is significant.

After we have our null distribution and test statistic, we can calculate a two-sided p-value for an alpha of 0.05. The p-value can be thought of as the probability of getting a test statistic that is as extreme or more than our observed test statistic if the null hypothesis is true. Put into plain words; it represents how likely it would be to see this result if there was no true association. We calculate a two-sided p-value in R below, as we are interested in the possibility of the test statistic being greater or lesser than expected.
```
p_value <- null_distribution|> 
 get_p_value(test_stat, direction = "both")
```
Our p-value is 0.017 which means that our finding is significant at the alpha=0.05 level, which is a commonly accepted level of significance in statistics. It means that the difference in sleep score we found was significant! We have the evidence to reject the null hypothesis and accept the alternative; there is a difference in mean sleep performance between nights in which alcohol was consumed and was not consumed.

I’ve included a helpful visualization of the null distribution, test statistic, and 95% quantile range below. The grey bars are the many possible test statistics calculated from our 1000 samples, and the orange line represents the density of these values. The blue dashed lines represent the 97.5th and 2.5th quantiles of this distribution, beyond which our test statistic (in red) is shown to be significant.

Figure 2. The distribution of test statistics under the null hypothesis (no difference in mean sleep score with alcohol consumption)

Final Conclusions

Well, it turns out my coaches were right all along! Our analysis found that my average sleep score when I did not consume alcohol was 8.01 points higher than my average sleep score when I did consume alcohol. This difference was found to be statistically significant, with a p-value of 0.017, meaning that we reject the null hypothesis in favor of the alternative. This statistical result backs up my personal experience, giving me a quantitative result that I can have confidence in.

Going Further

Now that I have this initial analysis under my belt, I can explore more associations in my data, and even use more complicated methods like forecasting and machine learning models.

This analysis is a very basic example of an N-of-1 study, and is not without limitations. My study was observational rather than experimental, and we cannot declare causality, as there are many other confounding variables not measured by my Whoop. If I wanted to find a causal relationship, I would have to carefully design a study, record data on all possible confounders, and find a way to blind myself to the treatment. N-of-1 studies are hard to do outside of a clinical setting, however we can still find meaningful associations and relationships by asking simple questions of our data.

I hope that after this tutorial you take the initiative to download your own data from whatever fitness tracker you can get your hands on, and play around with it. I know everyone can come up with a hypothesis about how some variable affects their health, but what most people don’t realize, is that you’re closer to getting a quantifiable answer to that question than you think.

References and Further Reading

[1] Davidson, K., Cheung, K., Friel, C., & Suls, J. (2022). Introducing Data Sciences to N-of-1 Designs, Statistics, Use-Cases, the Future, and the Moniker ‘N-of-1’ Trial. Harvard Data Science Review, (Special Issue 3). https://doi.org/10.1162/99608f92.116c43fe

[2] Lillie EO, Patay B, Diamant J, Issell B, Topol EJ, Schork NJ. The n-of-1 clinical trial: the ultimate strategy for individualizing medicine? Per Med. 2011 Mar;8(2):161–173. doi: 10.2217/pme.11.7. PMID: 21695041; PMCID: PMC3118090.

[3] Daza EJ. Causal Analysis of Self-tracked Time Series Data Using a Counterfactual Framework for N-of-1 Trials. Methods Inf Med. 2018 Feb;57(1):e10-e21. doi: 10.3414/ME16–02–0044. Epub 2018 Apr 5. PMID: 29621835; PMCID: PMC6087468.

[4] Schork, N. Personalized medicine: Time for one-person trials. Nature 520, 609–611 (2015). https://doi.org/10.1038/520609a

N-of-1 Trials and Analyzing Your Own Fitness Data was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
N-of-1 Trials and Analyzing Your Own Fitness Data

Go Here to Read this Fast! N-of-1 Trials and Analyzing Your Own Fitness Data
April 20, 2024
Practical Computer Simulations for Product Analysts
Mariya Mansurova
Part 1: Task-specific approaches for scenario forecasting

Image by DALL-E

In product analytics, we quite often get “what-if” questions. Our teams are constantly inventing different ways to improve the product and want to understand how it can affect our KPI or other metrics.

Let’s look at some examples:
- Imagine we’re in the fintech industry and facing new regulations requiring us to check more documents from customers making the first donation or sending more than $100K to a particular country. We want to understand the effect of this change on our Ops demand and whether we need to hire more agents.
- Let’s switch to another industry. We might want to incentivise our taxi drivers to work late or take long-distance rides by introducing a new reward scheme. Before launching this change, it would be crucial for us to estimate the expected size of rewards and conduct a cost vs. benefit analysis.
- As the last example, let’s look at the main Customer Support KPIs. Usually, companies track the average waiting time. There are many possible ways how to improve this metric. We can add night shifts, hire more agents or leverage LLMs to answer questions quickly. To prioritise these ideas, we will need to estimate their impact on our KPI.
When you see such questions for the first time, they look pretty intimidating.

If someone asks you to calculate monthly active users or 7-day retention, it’s straightforward. You just need to go to your database, write SQL and use the data you have.

Things become way more challenging (and exciting) when you need to calculate something that doesn’t exist. Computer simulations will usually be the best solution for such tasks. According to Wikipedia, simulation is an imitative representation of a process or system that could exist in the real world. So, we will try to imitate different situations and use them in our decision-making.

Simulation is a powerful tool that can help you in various situations. So, I would like to share with you the practical examples of computer simulations in the series of articles:
- In this article, we will discuss how to use simulations to estimate different scenarios. You will learn the basic idea of simulations and see how they can solve complex tasks.
- In the second part, we will diverge from scenario analysis and will focus on the classic of computer simulations — bootstrap. Bootstrap can help you get confidence intervals for your metrics and analyse A/B tests.
- I would like to devote the third part to agent-based models. We will model the CS agent behaviour to understand how our process changes can affect CS KPIs such as queue size or average waiting time.
So, it’s time to start and discuss the task we will solve in this article.

Our project: Launching tests for English courses

Suppose we are working on an edtech product that helps people learn the English language. We’ve been working on a test that could assess the student’s knowledge from different angles (reading, listening, writing and speaking). The test will give us and our students a clear understanding of their current level.

We agreed to launch it for all new students so that we can assess their initial level. Also, we will suggest existing students pass this test when they return to the service next time.

Our goal is to build a forecast on the number of submitted tests over time. Since some parts of these tests (writing and speaking) will require manual review from our teachers, we would like to ensure that we will have enough capacity to check these tests on time.

Let’s try to structure our problem. We have two groups of students:
- The first group is existing students. It’s a good practice to be precise in analytics, so we will define them as students who started using our service before this launch. We will need to check them once at their next transaction, so we will have a substantial spike while processing them all. Later, the demand from this segment will be negligible (only rare reactivations).
- New students will hopefully continue joining our courses. So, we should expect consistent demand from this group.
Now, it’s time to think about how we can estimate the demand for these two groups of customers.

The situation is pretty straightforward for new students — we need to predict the number of new customers weekly and use it to estimate demand. So, it’s a classic task of time series forecasting.

The task of predicting demand from existing customers might be more challenging. The direct approach would be to build a model to predict the week when students will return to the service next time and use it for estimations. It’s a possible solution, but it sounds a bit overcomplicated to me.

I would prefer the other approach. I would simulate the situation when we launched this test some time ago and use the previous data. In that case, we will have all the data after “this simulated launch” and will be able to calculate all the metrics. So, it’s actually a basic idea of scenario simulations.

Cool, we have a plan. Let’s move on to execution.

Modelling demand from new customers

Before jumping to analysis, let’s examine the data we have. We keep a record of the lessons’ completion events. We know each event’s user identifier, date, module, and lesson number. We will use weekly data to avoid seasonality and capture meaningful trends.

Let me share some context about the educational process. Students primarily come to our service to learn English from scratch and pass six modules (from pre-A1 to C1). Each module consists of 100 lessons.

The data was generated explicitly for this use case, so we are working with a synthetic data set.

First, we need to calculate the metric we want to predict. We will offer students the opportunity to pass the initial evaluation test after completing the first demo lesson. So, we can easily calculate the number of customers who passed the first lesson or aggregate users by their first date.
```
new_users_df = df.groupby('user_id', as_index = False).date.min()
 .rename(columns = {'date': 'cohort'})

new_users_stats_df = new_users_df.groupby('cohort')[['user_id']].count()
 .rename(columns = {'user_id': 'new_users'})
```
We can look at the data and see an overall growing trend with some seasonal effects (i.e. fewer customers joining during the summer or Christmas time).

For forecasting, we will use Prophet — an open-source library from Meta. It works pretty well with business data since it can predict non-linear trends and automatically take into account seasonal effects. You can easily install it from PyPI.
```
pip install prophet
```
Prophet library expects a data frame with two columns: ds with timestamp and y with a metric we want to predict. Also, ds must be a datetime column. So, we need to transform our data to the expected format.
```
pred_new_users_df = new_users_df.copy()
pred_new_users_df = pred_new_users_df.rename(
 columns = {'new_users': 'y', 'cohort': 'ds'})
pred_new_users_df.ds = pd.to_datetime(pred_new_users_df.ds)
```
Now, we are ready to make predictions. As usual in ML, we need to initialise and fit a model.
```
from prophet import Prophet

m = Prophet()
m.fit(pred_new_users_df)
```
The next step is prediction. First, we need to create a future data frame specifying the number of periods and their frequency (in our case, weekly). Then, we need to call the predict function.
```
future = m.make_future_dataframe(periods= 52, freq = 'W')
forecast_df = m.predict(future)
forecast_df.tail()[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
```
As a result, we get the forecast (yhat) and confidence interval (yhat_lower and yhat_upper).

It’s difficult to understand the result without charts. Let’s use Prophet functions to visualise the output better.
```
m.plot(forecast_df) # forecast
m.plot_components(forecast_df) # components
```
The forecast chart shows you the forecast with a confidence interval.

The components view lets you understand the split between trend and seasonal effects. For example, the second chart displays a seasonal drop-off during summer and an increase at the beginning of September (when people might be more motivated to start learning something new).

We can put all this forecasting logic into one function. It will be helpful for us later.
```
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'simple_white'

def make_prediction(tmp_df, param, param_name = '', periods = 52):
 # pre-processing
 df = tmp_df.copy()
 date_param = df.index.name
 df.index = pd.to_datetime(df.index)
 
 train_df = df.reset_index().rename(columns = {date_param: 'ds', param: 'y'})
 
 # model
 m = Prophet()
 m.fit(train_df)

 future = m.make_future_dataframe(periods=periods, freq = 'W')
 forecast = m.predict(future)
 forecast = forecast[['ds', 'yhat']].rename(columns = {'ds': date_param, 'yhat': param + '_model'})
 
 # join to actual data
 forecast = forecast.set_index(date_param).join(df, how = 'outer')
 
 # visualisation
 fig = px.line(forecast, 
 title = 'Forecast: ' + (param if param_name == '' else param_name),
 labels = {'value': param if param_name == '' else param_name},
 color_discrete_map = {param: 'navy', param + '_model': 'gray'}
 )
 fig.update_traces(mode='lines', line=dict(dash='dot'), 
 selector=dict(name=param + '_model'))
 fig.update_layout(showlegend = False)
 fig.show()

 return forecast

new_forecast_df = make_prediction(new_users_stats_df, 
 'new_users', 'new users', periods = 75)
```
I prefer to share with my stakeholders a more styled version of visualisation (especially for public presentations), so I’ve added it to the function as well.

In this example, we’ve used the default Prophet model and got quite a plausible forecast. However, in some cases, you might want to tweak parameters, so I advise you to read the Prophet docs to learn more about the possible levers.

For example, in our case, we believe that our audience will continue growing at the same rate. However, this might not be the case, and you might expect it to have a cap of around 100 users. Let’s update our prediction for saturating growth.
```
# adding cap to the initial data
# it's not required to be constant
pred_new_users_df['cap'] = 100

#specifying logistic growth
m = Prophet(growth='logistic')
m.fit(pred_new_users_df)

# adding cap for the future
future = m.make_future_dataframe(periods= 52, freq = 'W')
future['cap'] = 100
forecast_df = m.predict(future)
```
We can see that the forecast has changed significantly, and the growth stops at ~100 new clients per week.

It’s also interesting to look at the components’ chart in this case. We can see that the seasonal effects stayed the same, while the trend has changed to logistic (as we specified).

We’ve learned a bit about the ability to tweak forecasts. However, for future calculations, we will use a basic model. Our business is still relatively small, and most likely, we haven’t reached saturation yet.

We’ve got all the needed estimations for new customers and are ready to move on to the existing ones.

Modelling demand from existing customers

The first version

The key point in our approach is to simulate the situation when we launched this test some time ago and calculate the demand using this data. Our solution is based on the idea that we can use the past data instead of predicting the future.

Since there’s significant yearly seasonality, I will use data for -1 year to take into account these effects automatically. We want to launch this project at the beginning of April. So, I will use past data from the week of 2nd April 2023.

First, we need to filter the data related to existing customers at the beginning of April 2023. We’ve already forecasted demand from new users, so we don’t need to consider them in this estimation.
```
model_existing_users = df[df.date < '2023-04-02'].user_id.unique()
raw_existing_df = df[df.user_id.isin(model_existing_users)]
```
Then, we need to model the demand from these users. We will offer our existing students the chance to pass the test the next time they use our product. So, we need to define when each customer returned to our service after the launch and aggregate the number of customers by week. There’s no rocket science at all.
```
existing_model_df = raw_existing_df[raw_existing_df.date >= '2023-04-02']
 .groupby('user_id', as_index = False).date.min()
 .groupby('date', as_index = False).user_id.count()
 .rename(columns = {'user_id': 'existing_users'})
```
We got the first estimations. If we had launched this test in April 2023, we would have gotten around 1.3K tests in the first week, 0.3K for the second week, 80 cases in the third week, and even less afterwards.

We assumed that 100% of existing customers would finish the test, and we would need to check it. In real-life tasks, it’s worth taking conversion into account and adjusting the numbers. Here, we will continue using 100% conversion for simplicity.

So, we’ve done our first modelling. It wasn’t challenging at all. But is this estimation good enough?

Taking into account long-term trends

We are using data from the previous year. However, everything changes. Let’s look at the number of active customers over time.
```
active_users_df = df.groupby('date')[['user_id']].nunique()
 .rename(columns = {'user_id': 'active_users'})
```
We can see that it’s growing steadily. I would expect it to continue growing. So, it’s worth adjusting our forecast due to this YoY (Year-over-Year) growth. We can re-use our prediction function and calculate YoY using forecasted values to make it more accurate.
```
active_forecast_df = make_prediction(active_users_df, 
 'active_users', 'active users')
```
Let’s calculate YoY growth based on our forecast and adjust the model’s predictions.
```
# calculating YoYs
active_forecast_df['active_user_prev_year'] = active_forecast_df.active_users.shift(52)
active_forecast_df['yoy'] = active_forecast_df.active_users_model/
 active_forecast_df.active_user_prev_year

existing_model_df = existing_model_df.rename(
 columns = {'date': 'model_date', 'existing_users': 'model_existing_users'})

# adjusting dates from 2023 to 2024
existing_model_df['date'] = existing_model_df.model_date.map(
 lambda x: datetime.datetime.strptime(x, '%Y-%m-%d') + datetime.timedelta(364)
)

existing_model_df = existing_model_df.set_index('date')
 .join(active_forecast_df[['yoy']])

# updating estimations
existing_model_df['existing_users'] = list(map(
 lambda x, y: int(round(x*y)),
 existing_model_df.model_existing_users,
 existing_model_df.yoy
))
```
We’ve finished the estimations for the existing students as well. So, we are ready to merge both parts and get the result.

Putting everything together

First results

Now, we can combine all our previous estimations and see the final chart. For that, we need to convert data to the common format and add segments so that we can distinguish demand between new and existing students.
```
# existing segment
existing_model_df = existing_model_df.reset_index()[['date', 'existing_users']]
 .rename(columns = {'existing_users': 'users'})
existing_model_df['segment'] = 'existing'

# new segment
new_model_df = new_forecast_df.reset_index()[['cohort', 'new_users_model']]
 .rename(columns = {'cohort': 'date', 'new_users_model': 'users'})
new_model_df = new_model_df[(new_model_df.date >= '2024-03-31') 
 & (new_model_df.date < '2025-04-07')]
new_model_df['users'] = new_model_df.users.map(lambda x: int(round(x)))
new_model_df['segment'] = 'new'

# combining everything
demand_model_df = pd.concat([existing_model_df, new_model_df])

# visualisation
px.area(demand_model_df.pivot(index = 'date', 
 columns = 'segment', values = 'users').head(15)[['new', 'existing']], 
 title = 'Demand: modelling number of tests after launch',
 labels = {'value': 'number of test'})
```
We should expect around 2.5K tests for the first week after launch, mostly from existing customers. Then, within four weeks, we will review tests from existing users and will have only ~100–130 cases per week from new joiners.

That’s wonderful. Now, we can share our estimations with colleagues so they can also plan their work.

What if we have demand constraints?

In real life, you will often face the problem of capacity constraints when it’s impossible to launch a new feature to 100% of customers. So, it’s time to learn how to deal with such situations.

Suppose we’ve found out that our teachers can check only 1K tests each week. Then, we need to stagger our demand to avoid bad customer experience (when students need to wait for weeks to get their results).

Luckily, we can do it easily by rolling out tests to our existing customers in batches (or cohorts). We can switch the functionality on for all new joiners and X% of existing customers in the first week. Then, we can add another Y% of existing customers in the second week, etc. Eventually, we will evaluate all existing students and have ongoing demand only from new users.

Let’s come up with a rollout plan without exceeding the 1K capacity threshold.

Since we definitely want to launch it for all new students, let’s start with them and add them to our plan. We will store all demand estimations by segments in the raw_demand_est_model_df data frame and initialise them with our new_model_df estimations that we got before.
```
raw_demand_est_model_df = new_model_df.copy()
```
Now, we can aggregate this data and calculate the remaining capacity.
```
capacity = 1000

demand_est_model_df = raw_demand_est_model_df.pivot(index = 'date', 
 columns = 'segment', values = 'users')

demand_est_model_df['total_demand'] = demand_est_model_df.sum(axis = 1)
demand_est_model_df['capacity'] = capacity
demand_est_model_df['remaining_capacity'] = demand_est_model_df.capacity 
 - demand_est_model_df.total_demand

demand_est_model_df.head()
```
Let’s put this logic into a separate function since we will need it to evaluate our estimations after each iteration.
```
import plotly.graph_objects as go

def get_total_demand_model(raw_demand_est_model_df, capacity = 1000):
 demand_est_model_df = raw_demand_est_model_df.pivot(index = 'date', 
 columns = 'segment', values = 'users')
 demand_est_model_df['total_demand'] = demand_est_model_df.sum(axis = 1)
 demand_est_model_df['capacity'] = capacity
 demand_est_model_df['remaining_capacity'] = demand_est_model_df.capacity 
 - demand_est_model_df.total_demand

 tmp_df = demand_est_model_df.drop(['total_demand', 'capacity', 
 'remaining_capacity'], axis = 1)
 fig = px.area(tmp_df,
 title = 'Demand vs Capacity',
 category_orders={'segment': ['new'] + list(sorted(filter(lambda x: x != 'new', tmp_df.columns)))},
 labels = {'value': 'tests'})
 fig.add_trace(go.Scatter(
 x=demand_est_model_df.index, y=demand_est_model_df.capacity, 
 name='capacity', line=dict(color='black', dash='dash'))
 )
 
 fig.show()
 return demand_est_model_df

demand_plan_df = get_total_demand_model(raw_demand_est_model_df)
demand_plan_df.head()
```
I’ve also added a chart to the output of this function that will help us to assess our results effortlessly.

Now, we can start planning the rollout for existing customers week by week.

First, let’s transform our current demand model for existing students. I would like it to be indexed by the sequence number of weeks and show the 100% demand estimation. Then, I can smoothly get estimations for each batch by multiplying demand by weight and calculating the dates based on the launch date and week number.
```
existing_model_df['num_week'] = list(range(existing_model_df.shape[0]))
existing_model_df = existing_model_df.set_index('num_week')
 .drop(['date', 'segment'], axis = 1)
existing_model_df.head()
```
So, for example, if we launch our evaluation test for 10% of random customers, then we expect to get 244 tests on the first week, 52 tests on the second week, 14 on the third, etc.

I will be using the same estimations for all batches. I assume that all batches of the same size will produce the exact number of tests over the following weeks. So, I don’t take into account any seasonal effects related to the launch date for each batch.

This assumption simplifies your process quite a bit. And it’s pretty reasonable in our case because we will do a rollout only within 4–5 weeks, and there are no significant seasonal effects during this period. However, if you want to be more accurate (or have considerable seasonality), you can build demand estimations for each batch by repeating our previous process.

Let’s start with the week of 31st March 2024. As we saw before, we have a spare capacity for 888 tests. If we launch our test to 100% of existing customers, we will get ~2.4K tests to check in the first week. So, we are ready to roll out only to a portion of all customers. Let’s calculate it.
```
cohort = '2024-03-31'
demand_plan_df.loc[cohort].remaining_capacity/existing_model_df.iloc[0].users
# 0.3638
```
It’s easier to operate with more round numbers, so let’s round the number to a fraction of 5%. I’ve rounded the number down to have some buffer.
```
full_demand_1st_week = existing_model_df.iloc[0].users
next_group_share = demand_plan_df.loc[cohort].remaining_capacity/full_demand_1st_week
next_group_share = math.floor(20*next_group_share)/20
# 0.35
```
Since we will make several iterations, we need to track the percentage of existing customers for whom we’ve enabled the new feature. Also, it’s worth checking whether we’ve already processed all the customers to avoid double-counting.
```
enabled_user_share = 0

# if we can process more customers than are left, update the number
if next_group_share > 1 - enabled_user_share:
 print('exceeded')
 next_group_share = round(1 - enabled_user_share, 2)

enabled_user_share += next_group_share
# 0.35
```
Also, saving our rollout plan in a separate variable will be helpful.
```
rollout_plan = []
rollout_plan.append(
 {'launch_date': cohort, 'rollout_percent': next_group_share}
)
```
Now, we need to estimate the expected demand from this batch. Launching tests for 35% of customers on 31st March will lead to some demand not only in the first week but also in the subsequent weeks. So, we need to calculate the total demand from this batch and add it to our plans.
```
# copy the model
next_group_demand_df = existing_model_df.copy().reset_index()

# calculate the dates from cohort + week number
next_group_demand_df['date'] = next_group_demand_df.num_week.map(
 lambda x: (datetime.datetime.strptime(cohort, '%Y-%m-%d') 
 + datetime.timedelta(7*x))
)

# adjusting demand by weight
next_group_demand_df['users'] = (next_group_demand_df.users * next_group_share).map(lambda x: int(round(x)))

# labelling the segment
next_group_demand_df['segment'] = 'existing, cohort = %s' % cohort

# updating the plan
raw_demand_est_model_df = pd.concat([raw_demand_est_model_df, 
 next_group_demand_df.drop('num_week', axis = 1)])
```
Now, we can re-use the function get_total_demand_mode, which helps us analyse the current demand vs capacity balance.
```
demand_plan_df = get_total_demand_model(raw_demand_est_model_df)
demand_plan_df.head()
```
We are utilising most of our capacity for the first week. We still have some free resources, but it was our conscious decision to keep some buffer for sustainability. We can see that there’s almost no demand from this batch after 3 weeks.

With that, we’ve finished the first iteration and can move on to the following week — 4th April 2024. We can check an additional 706 cases during this week.

We can repeat the whole process for this week and move to the next one. We can iterate to the point when we launch our project to 100% of existing customers (enabled_user_share equals to 1).

We can roll out our tests to all customers without breaching the 1K tests per week capacity constraint within just four weeks. In the end, we will have the following weekly forecast.

We can also look at the rollout plan we’ve logged throughout our simulations. So, we need to launch the test for randomly selected 35% of customers on the week of 31st March, then for the next 20% of customers next week, followed by 25% and 20% of existing users for the remaining two weeks. After that, we will roll out our project to all existing students.
```
rollout_plan
# [{'launch_date': '2024-03-31', 'rollout_percent': 0.35},
# {'launch_date': '2024-04-07', 'rollout_percent': 0.2},
# {'launch_date': '2024-04-14', 'rollout_percent': 0.25},
# {'launch_date': '2024-04-21', 'rollout_percent': 0.2}]
```
So, congratulations. We now have a plan for how to roll out our feature sustainably.

Tracking students’ performance over time

We’ve already done a lot to estimate demand. We’ve leveraged the idea of simulation by imitating the launch of our project a year ago, scaling it and assessing the consequences. So, it’s definitely a simulation example.

However, we mostly used the basic tools you use daily — some Pandas data wrangling and arithmetic operations. In the last part of the article, I would like to show you a bit more complex case where we will need to simulate the process for each customer independently.

Product requirements often change over time, and it happened with our project. You, with a team, decided that it would be even better if you could allow your students to track progress over time (not only once at the very beginning). So, we would like to offer students to go through a performance test after each module (if more than one month has passed since the previous test) or if the student returned to the service after three months of absence.

Now, the criteria for test assignments are pretty tricky. However, we can still use the same approach by looking at the data for the previous year. However, this time, we will need to look at each customer’s behaviour and define at what point they would get a test.

We will take into account both new and existing customers since we want to estimate the effects of follow-up tests on all of them. We don’t need any data before the launch because the first test will be assigned at the next active transaction, and all the history won’t matter. So we can filter it out.
```
sim_df = df[df.date >= '2023-03-31']
```
Let’s also define a function that calculates the number of days between two date strings. It will be helpful for us in the implementation.
```
def days_diff(date1, date2):
 return (datetime.datetime.strptime(date2, '%Y-%m-%d')
 - datetime.datetime.strptime(date1, '%Y-%m-%d')).days
```
Let’s start with one user and discuss the logic with all the details. First, we will filter events related to this user and convert them into the list of dictionaries. It will be way easier for us to work with such data.
```
user_id = 4861
user_events = sim_df[sim_df.user_id == user_id]
 .sort_values('date')
 .to_dict('records')

# [{'user_id': 4861, 'date': '2023-04-09', 'module': 'pre-A1', 'lesson_num': 8},
# {'user_id': 4861, 'date': '2023-04-16', 'module': 'pre-A1', 'lesson_num': 9},
# {'user_id': 4861, 'date': '2023-04-23', 'module': 'pre-A1', 'lesson_num': 10},
# {'user_id': 4861, 'date': '2023-04-23', 'module': 'pre-A1', 'lesson_num': 11},
# {'user_id': 4861, 'date': '2023-04-30', 'module': 'pre-A1', 'lesson_num': 12},
# {'user_id': 4861, 'date': '2023-05-07', 'module': 'pre-A1', 'lesson_num': 13}]
```
To simulate our product logic, we will be processing user events one by one and, at each point, checking whether the customer is eligible for the evaluation.

Let’s discuss what variables we need to maintain to be able to tell whether the customer is eligible for the test or not. For that, let’s recap all the possible cases when a customer might get a test:
- If there were no previous tests -> we need to know whether they passed a test before.
- If the customer finished the module and more than one month has passed since the previous test -> we need to know the last test date.
- If the customer returns after three months -> we need to store the date of the last lesson.
To be able to check all these criteria, we can use only two variables: the last test date (None if there was no test before) and the previous lesson date. Also, we will need to store all the generated tests to calculate them later. Let’s initialise all the variables.
```
tmp_gen_tests = []
last_test_date = None
last_lesson_date = None
```
Now, we need to iterate by event and check the criteria.
```
for rec in user_events:
 pass
```
Let’s go through all our criteria, starting from the initial test. In this case, last_test_date will be equal to None. It’s important for us to update the last_test_date variable after “assigning” the test.
```
if last_test_date is None: # initial test
 last_test_date = rec['date']
 # TBD saving the test info
```
In the case of the finished module, we need to check that it’s the last lesson in the module and that more than 30 days have passed.
```
if (rec['lesson_num'] == 100) and (days_diff(last_test_date, rec['date']) >= 30): 
 last_test_date = rec['date']
 # TBD saving the test info
```
The last case is that the customer hasn’t used our service for three months.
```
if (days_diff(last_lesson_date, rec['date']) >= 30): 
 last_test_date = rec['date']
 # TBD saving the test info
```
Besides, we need to update the last_lesson_date at each iteration to keep it accurate.

We’ve discussed all the building blocks and are ready to combine them and do simulations for all our customers.
```
import tqdm
tmp_gen_tests = []

for user_id in tqdm.tqdm(sim_raw_df.user_id.unique()):
 # initialising variables
 last_test_date = None
 last_lesson_date = None

 for rec in sim_raw_df[sim_raw_df.user_id == user_id].to_dict('records'):
 # initial test
 if last_test_date is None: 
 last_test_date = rec['date']
 tmp_gen_tests.append(
 {
 'user_id': rec['user_id'],
 'date': rec['date'],
 'trigger': 'initial test'
 }
 )
 # finish module
 elif (rec['lesson_num'] == 100) and (days_diff(last_test_date, rec['date']) >= 30): 
 last_test_date = rec['date']
 tmp_gen_tests.append(
 {
 'user_id': rec['user_id'],
 'date': rec['date'],
 'trigger': 'finished module'
 })
 # reactivation
 elif (days_diff(last_lesson_date, rec['date']) >= 92):
 last_test_date = rec['date']
 tmp_gen_tests.append(
 {
 'user_id': rec['user_id'],
 'date': rec['date'],
 'trigger': 'reactivation'
 })
 last_lesson_date = rec['date']
```
Now, we can aggregate this data. Since we are again using the previous year’s data, I will adjust the number by ~80% YoY, as we’ve estimated before.
```
exist_model_upd_stats_df = exist_model_upd.pivot_table(
 index = 'date', columns = 'trigger', values = 'user_id', 
 aggfunc = 'nunique'
).fillna(0)

exist_model_upd_stats_df = exist_model_upd_stats_df
 .map(lambda x: int(round(x * 1.8)))
```
We got quite a similar estimation for the initial test. In this case, the “initial test” segment equals the sum of new and existing demand in our previous estimations.

So, looking at other segments is way more interesting since they will be incremental to our previous calculations. We can see around 30–60 cases per week from customers who finished modules starting in May.

There will be almost no cases of reactivation. In our simulation, we got 4 cases per year in total.

Congratulations! Now the case is solved, and we’ve found a nice approach that allows us to make precise estimations without advanced math and with only simulation. You can use similar

You can find the full code for this example on GitHub.

Summary

Let me quickly recap what we’ve discussed today:
- The main idea of computer simulation is imitation based on your data.
- In many cases, you can reframe the problem from predicting the future to using the data you already have and simulating the process you’re interested in. So, this approach is quite powerful.
- In this article, we went through an end-to-end example of scenario estimations. We’ve seen how to structure complex problems and split them into a bunch of more defined ones. We’ve also learned to deal with constraints and plan a gradual rollout.
Thank you a lot for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

All the images are produced by the author unless otherwise stated.

Practical Computer Simulations for Product Analysts was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Practical Computer Simulations for Product Analysts

Go Here to Read this Fast! Practical Computer Simulations for Product Analysts
April 19, 2024
How to Implement Knowledge Graphs and Large Language Models (LLMs) together at the Enterprise Level
Steve Hedden
Source: OpenArt SDXL

How to Implement Knowledge Graphs and Large Language Models (LLMs) Together at the Enterprise Level

A survey of the current methods of integration

Large Language Models (LLMs) and Knowledge Graphs (KGs) are different ways of providing more people access to data. KGs use semantics to connect datasets via their meaning i.e. the entities they are representing. LLMs use vectors and deep neural networks to predict natural language. They are often both aimed at ‘unlocking’ data. For enterprises implementing KGs, the end goal is usually something like a data marketplace, a semantic layer, to FAIR-ify their data or to make their enterprise more data-centric. These are all different solutions with the same end goal: making more data available to the right people faster. For enterprises implementing an LLM or some other similar GenAI solution, the goal is often similar: to provide employees or customers with a ‘digital assistant’ that can get the right information to the right people faster. The potential symbiosis is clear: some of the main weaknesses of LLMs, that they are black-box models and struggle with factual knowledge, are some of KGs’ greatest strengths. KGs are, essentially, collections of facts, and they are fully interpretable. But exactly how can and should KGs and LLMs be implemented together at an enterprise?

When I was searching for a job last year, I had to write a lot of cover letters. I used ChatGPT to help — I’d copy my existing cover letter into the prompt window, along with my resume and the job description of the job I was applying for, and ask ChatGPT to do the rest. ChatGPT helped me gain momentum with some pretty solid first drafts, but unchecked, it also gave me years of experience I didn’t have and claimed I went to schools I never attended.

I bring up my cover letter because 1) I think it is a great example of the strengths and weaknesses of LLMs, and why KGs are an important part of their implementation and 2) this use case is not that different from what many large enterprises are using LLMs for currently: automated report generation. ChatGPT does a pretty good job of recreating a cover letter by changing the content to be more focused on a specific job description, as long as you explicitly include the existing cover letter and job description in the prompt. Ensuring the LLM has the right content is where a KG comes in. If you simply write, ‘write me a cover letter for a job I want,’ the results are going to be laughable. Additionally, the cover letter example is a great application of an LLM because it is about summarizing and restructuring language. Remember what the second L in LLM stands for? LLMs have, historically, focused on unstructured data (text) and that is where they excel, whereas KGs excel at integrating structured and unstructured data. You can use the LLM to write the cover letter but you should use a KG to make sure it has the right resume.

Note: I am not an AI expert but I also don’t really trust anyone who pretends to be. This space is changing so fast that it is impossible to keep up, let alone predict what the future of AI implementation at the enterprise level will look like. I describe some of the ways KGs and LLMs are being integrated currently, as I see it. This is not a comprehensive list and I am open to additions and suggestions.

The two ways KGs and LLMs are related

There are two ways KGs and LLMs are interacting right now: LLMs as tools to build KGs and KGs as inputs into LLM or GenAI applications. Those of us working in the knowledge graph space are in the weird position of building things that are expected to improve AI applications, while AI simultaneously changes the way we build those things. We are expected to optimize AI as a tool in our day to day while changing our output to facilitate AI optimization. These two trends are related and often overlap, but I’ll discuss them one at a time below.

Using LLMs to assist in the KG creation and curation process

LLMs are valuable tools for building KGs. One way to leverage LLM technology in the KG curation process is by vectorization (or embedding) your KG in a vector database. A vector database (or a vector store) is a database built to store vectors or lists of numbers. Vectorization is one of, if not the, core technological component driving language models. These models, through incredible amounts of training data, learn to associate words with vectors. The vectors capture semantic and syntactic information about the word based on its context in the training data. By using an embedding service trained using these incredible amounts of data, we can leverage that semantic and syntactic information in our KG.

Note: vectorizing your KG is by no means the only way to use LLM-tech in KG curation and construction. Also, none of these applications of LLMs are new to KG creation. NLP has been used for decades for entity extraction for example, LLM is just a new capability to assist the ontologist/taxonomist.

Some of the ways LLMs can help in the KG creation process are:
- Entity resolution: Entity resolution is the process of aligning records that refer to the same real-world entity. For example, acetaminophen, a common pain reliever used in the US and sold under the brand name Tylenol, is called paracetamol in the UK and sold under the brand name Panadol. These four names are nothing alike, but If you were to embed your KG into a vector database, the vectors would have the semantic understanding to know that these entities are closely related.
- Tagging of unstructured data: Suppose you want to incorporate some unstructured data into your KG. You have a bunch of PDFs with vague file names but you know there is important information in those documents. You need to tag these documents with file type and topic. If your topical taxonomy and document type taxonomy have been embedded, all you need to do is vectorize the documents and the vector database will identify the most relevant entities from each taxonomy.
- Entity and class extraction: Create or enhance a controlled vocabulary like an ontology or a taxonomy based on a corpus of unstructured data. Entity extraction is similar to tagging but the goal here is about enhancing the ontology rather than incorporating unstructured data into KG. Suppose you have a geographic ontology and you want to populate it with instances of towns, cities, states, etc. You can use an LLM to extract entities from a corpus of text to populate the ontology. Likewise, you can use the LLM to extract classes and relationships between classes from the corpus. Suppose you forgot to include ‘capital’ in your ontology. The LLM might be able to extract this as a new class or a property of a city.
Using KGs to power and govern GenAI pipelines

There are several reasons to use a KG to power and govern your GenAI pipelines and applications. According to Gartner, “Through 2025, at least 30% of GenAI projects will be abandoned after proof of concept (POC) due to poor data quality, inadequate risk controls, escalating costs or unclear business value.” KGs can help improve data quality, mitigate risks, and reduce costs.

Data governance, access control, and regulatory compliance

Only authorized people and applications should have access to certain data and for certain purposes. Usually, enterprises want certain types of people (or apps) to chat with certain types of data, in a well-governed way. How do you know which data should go into which GenAI pipeline? How can you ensure PII does not make its way into the digital assistant you want all of your employees to chat with? The answer is data governance. Some additional points:
- Policies and regulations can change, especially when it comes to AI. Even if your AI apps are compliant now, they might not be in the future. A good data governance foundation allows an enterprise to adapt to these changing regulations.
- Sometimes, the correct answer to a question is ‘I don’t know,’ or ‘you don’t have access to the information required to answer that question,’ or ‘it is illegal or unethical for me to answer that question.’ The quality of responses is more than just a matter of truth or accuracy but also of regulatory compliance.
- Notable players implementing or enabling this solution (alphabetically): Semantic KG companies like Cambridge Semantics, data.world, PoolParty, metaphacts, and TopQuadrant but also data catalogs like Alation, Collibra, and Informatica (and many many more).
Accuracy and contextual understanding

KGs can also help improve overall data quality — if your documents are filled with contradictory and/or false statements, do not be surprised when your ChatBot tells you inconsistent and false things. If your data is poorly structured, storing it in one place isn’t going to help. That is how the promise of data lakes became the scourge of data swamps. Likewise, if your data is poorly structured, vectorizing it isn’t going to solve your problems, it’s just going to create a new headache: a vectorized data swamp. If your data is well structured, however, KGs can provide LLMs with additional relevant resources to generate more personalized and accurate recommendations in several ways. There are different ways of using KGs to improve the accuracy of an LLM, but they generally fall under the category of natural language querying (NLQ)— using natural language to interact with databases. The current ways NLQ is being implemented, as far as I know, are through RAG, prompt-to-query, and fine-tuning.

Retrieval-Augmented Generation (RAG): RAG means supplementing a prompt with additional relevant information outside of the training data to generate a more accurate response. While LLMs have been trained on vast amounts of data, they have not been trained on your data. Think of the cover letter example above. I could ask an LLM to ‘write a cover letter for Steve Hedden for a job in product management at TopQuadrant’ and it would return an answer but it would contain hallucinations. A smarter way of doing that would be for the model to take this prompt, retrieve the LinkedIn profile for Steve Hedden, retrieve the job description for the open position at TopQuadrant, and then write the cover letter. There are currently two prominent ways of doing this retrieval: by vectorizing the graph or by turning the prompt into a graph query (prompt-to-query).
- Vector-based retrieval: This method of retrieval requires that you vectorize your KG and store it in a vector store. If you then vectorize your natural language prompt, you can find vectors in the vector store that are most similar to your prompt. Since these vectors correspond to entities in your graph, you can return the most ‘relevant’ entities in the graph given a natural language prompt. This is the exact same process described above under the tagging capability — we are essentially ‘tagging’ a prompt with relevant tags from our KG.
- Prompt-to-query retrieval: Alternatively, you could use an LLM to generate a SPARQL or Cypher query and use that query to get the most relevant data from the graph. Note: you can use the prompt-to-query method to query the database directly, without using the results to supplement a prompt to an LLM. This would not be an application of RAG, since you are not ‘augmenting’ anything. This method is explained in more detail below.
Some additional pros, cons, and notes on RAG and the two retrieval methods:
- RAG requires, by definition, a knowledge base. A knowledge graph is a knowledge base, and so proponents of KGs are going to be proponents of RAG powered by graphs (sometimes called GraphRAG). But RAG can be implemented without a knowledge graph.
- RAG can supplement a prompt based on the most relevant data from your KG based on the content of the prompt, but also the metadata from the prompt. For example, we can customize the response based on who asked the question, what they have access to, and additional demographic information about them.
- As described above, one benefit of using the vector-based retrieval method is that if you have embedded your KG into a vector database for tagging and entity resolution, the hard part is already done. Finding the most relevant entities related to a prompt is no different than tagging a chunk of unstructured text with entities from a KG.
- RAG provides some level of explainability in the response. The user can now see the supplemental data that went into their prompt, along with, potentially, where the answer to their question lives in that data.
- I mentioned above that AI is affecting the way we build KGs while we are expected to build KGs that facilitate AI. The prompt-to-query approach is a perfect example of this. The schema of the KG will affect how well an LLM can query it. If the purpose of the KG is to feed an AI application, then the ‘best’ ontology is no longer a reflection of reality but a reflection of the way AI sees reality.
- In theory, more relevant information should reduce hallucinations, but that does not mean RAG eliminates hallucinations. We are still using a language model to generate a response, so there is still plenty of room for uncertainty and hallucinations. Even with my resume and job description, an LLM might still exaggerate my experience. For the text to query approach, we are using the LLM to generate the KG query and the response, so there are actually two places for potential hallucinations.
- Likewise, RAG offers some level of explainability, but not entirely. For example, if we used vector-based retrieval, the model can tell us which entities it included because they were the most relevant, but it can’t explain why those were the most relevant. If using an auto-generated KG query, the auto-generated query ‘explains’ why certain data was returned by the graph, but the user will need to understand SPARQL or Cypher to fully understand why those data were returned.
- These two approaches are not mutually exclusive and many companies are pursuing both. For example, Neo4j has tutorials on implementing RAG with vector-based retrieval, and on prompt-to-query generation. Anecdotally, I am writing this just after attending a conference with a heavy focus on KG and LLM implementation in life sciences, and many of the life sciences companies I saw give presentations are doing some combination of vector-based and prompt-to-query RAG.
- Notable players implementing or enabling this solution (alphabetically): data.world, Microsoft, Neo4j, Ontotext, PoolParty, SciBite, Stardog, TopQuadrant (and many many more)
Prompt-to-query alone: Use an LLM to translate a natural language query into a formal query (like in SPARQL or Cypher) for your KG. This is the same as the prompt-to-query retrieval approach to RAG described above, except that we don’t send the data to an LLM after it is retrieved. The idea here is that by using the LLM to generate the query and not interpret the data, you are reducing hallucinations. Though, as mentioned above, it doesn’t matter what the LLM generates, it can contain hallucinations. The argument for this approach is that it is easier for the user to detect hallucinations in the auto-generated query than in an auto-generated response. I am somewhat skeptical about that since, presumably, many users who use an LLM to generate a SPARQL query will not know SPARQL well enough to detect issues with the auto-generated query.
- Anyone implementing a RAG solution using prompt-to-query retrieval can also implement prompt-to-query alone. These include: Neo4j, Ontotext, and Stardog.
KGs for fine-tuning LLMs: Use your KG to provide additional training to an off-the-shelf LLM. Rather than provide the KG data as part of the prompt at query time (RAG), you can use your KG to train the LLM itself. The benefit here is that you can keep all of your data local — you don’t need to send your prompts to OpenAI or anyone else. The downside is that the first L in LLM stands for large and so downloading and fine-tuning one of them is resource intensive. Additionally, while a model fine-tuned on your enterprise or industry-specific data is going to be more accurate, it will not eliminate hallucinations altogether. Some additional thoughts on this:
- Once you use the graph to fine-tune the model, you also lose the ability to use the graph for access control.
- There are LLMs that have already been fine-tuned for different industries like MedLM for healthcare and SecLM for cybersecurity.
- Depending on the use case, a fine-tuned LLM might not be necessary. For example, if you are largely using the LLM to summarize news articles, the LLM might not need special training.
- Rather than fine-tuning the LLM with industry specific information, some are using LLMs fine-tuned to generate code (like Code Llama) as part of their prompt-to-query solution.
- Notable players implementing or enabling this solution (alphabetically): As far as I know, Stardog’s Voicebox is the only solution that uses a KG to fine-tune an LLM for the customer.
A note on the different ways of integrating KGs and LLMs I have listed here: These categories (RAG, prompt-to-query, and fine-tuning) are neither comprehensive nor mutually exclusive. There are other ways of implementing KGs and LLMs and there will be more in the future. Also, there is considerable overlap between these solutions and you can combine solutions. You can run a vector-based and prompt-to-query RAG hybrid solution on a fine-tuned model, for example.

Efficiency and scalability

Building many separate apps that do not connect is inefficient and what Dave McComb refers to as a software wasteland. It doesn’t matter that the apps are ‘powered by AI’. Siloed apps result in duplicative data and code and overall redundancies. KGs provide a foundation for eliminating these redundancies through the smooth flow of data throughout the enterprise.

Gartner’s claim above is that many GenAI projects will be abandoned due to escalating costs, but I don’t know whether a KG will significantly reduce those costs. I don’t know of any studies or cost-benefit analyses done to support that claim. Developing an LLM-powered ChatBot for an enterprise is expensive, but so is developing a KG.

Conclusion

I won’t pretend to know the ‘optimal’ solution and, like I said above, I think anyone who pretends to know the future of AI is full of it. I do believe that both KGs and LLMs are useful tools for anyone trying to make more data available to the right people faster, and that they each have their strengths and weaknesses. Use the LLM to write the cover letter (or regulatory report), but use the KG to make sure you give it the right resume (or studies or journal articles or whatever).

Generally speaking, I believe in using AI as much as possible to build, maintain, and extend knowledge graphs, and also that KGs are necessary for enterprises looking to adopt GenAI technologies. This is for several reasons: data governance, access control, and regulatory compliance; accuracy and contextual understanding; and efficiency and scalability.

How to Implement Knowledge Graphs and Large Language Models (LLMs) together at the Enterprise Level was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
How to Implement Knowledge Graphs and Large Language Models (LLMs) together at the Enterprise Level

Go Here to Read this Fast! How to Implement Knowledge Graphs and Large Language Models (LLMs) together at the Enterprise Level
April 19, 2024
Introducing automatic training for solutions in Amazon Personalize

Ba’Carri Johnson

Amazon Personalize is excited to announce automatic training for solutions. Solution training is fundamental to maintain the effectiveness of a model and make sure recommendations align with users’ evolving behaviors and preferences. As data patterns and trends change over time, retraining the solution with the latest relevant data enables the model to learn and adapt, […]

Originally appeared here:
Introducing automatic training for solutions in Amazon Personalize

Go Here to Read this Fast! Introducing automatic training for solutions in Amazon Personalize

April 19, 2024
The Business Guide to Tailoring Language AI Part 2

Georg Ruile, Ph.D.

Prompting ChatGPT and other chat-based language AI — and why you should (not) care about it

Foreword

This article sheds some light on the question of how to “talk” to Large Language Models (LLM) that are designed to interact in conversational ways, like ChatGPT, Claude and others, so that the answers you get from them are as useful as possible for the task at hand. This particular communication from human to the language chatbot is what’s typically referred to as prompting. With this article, I mean to give people with no computer science background a compact overview of the topic so that everyone can understand. It can also help businesses to contextualize of what (not) to expect from their LLM adaption endeavors.

Prompting is the first of four steps you can take when climbing the ladder of adapting language models for your businesses’ custom use. I have introduced the overall 4 Step Framework of unlocking custom LLM in my previous post. If you haven’t already it might be helpful to read it first so that you can fit the ideas presented here into a larger context.

The Business Guide to Tailoring Language AI

Introduction

Shortly after the mass market introduction of ChatGPT, a new, hot profession has entered the AI scene: Prompt engineering. These AI “whisperers”, i.e. people that have certain skills to “prompt”, that is, talk to language AI so that it responds in useful ways, have become highly sought-after (and generously paid) new roles. Considering that a main building block of proper prompting is simply (or not so simply) giving precise instructions (see below), I must confess that I was surprised by this development (regardless of the fact that prompt engineering certainly involves more than just “whispering”): Isn’t communicating in a precise and concise manner a basic professional skill that we all should possess? But then again, I was reflecting on how important it is to have well-crafted requirements in software development, and “requirement engineering” roles have been an important ingredient of successful software development projects for a while now.

I observe a level of uncertainty and “best guessing” and even contradictions in the topic of LLM and prompting that I have not yet experienced in any IT-related subject. This has to do with the type and size of AI models and their stochastic characteristics, which is beyond the scope of this article. Considering the 1.76 trillion parameters of models like GPT-4, the number of possible combinations and paths from input (your “prompt”) to output (the model response) is virtually indefinite and non-deterministic. Hence, applications treat these models mainly as black boxes, and related research focuses on empirical approaches such as benchmarking their performance.

The sad news is that I cannot present you a perfect one-size-fits-all prompting solution that will forever solve your LLM requirements. Add to this that different models behave differently, and you may understand my dilemma. There’s some good news, though: On the one hand, you can, and should, always consider some basic principles and concepts that will help you optimize your interactions with the machines. Well-crafted prompts gets you farther than poor ones, and this is why it is well worthwhile to dig a bit deeper into the topic. On the other hand, it may not even be necessary to worry too much about prompting at all, which saves you valuable computing time (literally, CPU/GPU and figuratively, in your own brain).

Start with Why

Here I am not referring to Simon Sinek’s classic TEDx business advice. Instead, I encourage you to curiously wonder why technology does what it does. I strongly believe in the notion that if you understand at least a bit of the inner workings of software, it will tremendously help you in its application.

So how, in principle, is the input (the prompt) related to the output (the response), and why is it that proper prompts result in better suited responses? To figure this out, we need to have at least a superficial look at the model architecture and its training and fine-tuning, without needing to understand the details of the impressive underlying concepts like the infamous Transformer Architecture and Attention Mechanisms which ultimately caused the breakthrough of ChatGPT-like Generative AI as we know it today.

For our purposes, we can look at it from two angles:

How does the model retrieve knowledge and generate its response?
and closely related
How has the model been trained and fine-tuned?

It is important to understand that an LLM is in essence a Deep Neural Network and as such, it works based on statistics and probabilities. Put very simplistically, the model generates output which reflects the closest match to the context, based on its knowledge it has learned from vast amounts of training data. One of the building blocks here are so-called Embeddings, where similar word meanings are (mathematically) close to each other, even though the model does not actually “understand” those meanings. If this sounds fancy, it kinda is, but at same time, it is “only” mathematics, so don’t be afraid.

A simple illustration of word vector embeddings — similar word “meanings” are close to each other

When looking at training, it helps considering the training data and process a language model has gone through. Not only has the model seen vast amounts of text data. It has also learned what makes out a high rated response to a specific question, for instance on sites like StackOverflow, and on high-quality Q&A assistant documents written for model training and tuning. In addition, in its fine-tuning stage, it learned and iteratively adapted its optimal responses based on human feedback. Without all this intense training and tuning efforts, the model might answer a question like “what is your first name” simply with “what is your last name”, because it has seen this frequently on internet forms [1].

Where I am trying to get at is this: When interacting with natural language AI, always keep in mind what and how the model has learned and how it gets to its output, given your input. Even though no one really knows this exactly, it is useful to consider probable correlations: Where and in what context could the model have seen input similar to yours before? What data has been available during the pre-training stage, and in which quality and quantity? For instance: Ever wondered why LLMs can solve mathematical equations (not reliably, however, sometimes still surprisingly), without inherent calculation capabilities? LLMs don’t calculate, they match patterns!

Prompting 101

There is a plethora of Prompting techniques, and plenty of scientific literature that benchmarks their effectiveness. Here, I just want to introduce a few well-known concepts. I believe that once you get the general idea, you will be able to expand your prompting repertoire and even develop and test new techniques yourself.

Ask and it will be given to you

Before going into specific prompting concepts, I would like to stress a general idea that, in my opinion, cannot be stressed enough:

The quality of your prompt highly determines the response of the model.

And by quality I don’t necessarily mean a sophisticated prompt construction. I mean the basic idea of asking a precise question or giving well-structured instructions and providing necessary context. I have touched on this already when we met Sam, the piano player, in my previous article. If you ask a bar piano player to play some random Jazz tune, chances are that he might not play what you had in mind. Instead, if you ask exactly what it is you want to hear, your satisfaction with the result is likely to increase.

Similarly, if you ever had the chance of, say, hire someone to do something around your house and your contract specification only says, say, “bathroom renovation”, you might be surprised that in the end your bathroom does not look like what you had in mind. The contractor, just like the model, will only refer to what he has learned about renovations and bathroom tastes and will take the learned route to deliver.

So here are some general guidelines for prompting:

· Be clear and specific.

· Be complete.

· Provide context.

· Specify the desired output style, length, etc.

This way, the model has sufficient and matching reference data in your prompt that it can relate to when generating its response.

Roleplay prompting — simple, but overrated

In the early days of ChatGPT, the idea of roleplay prompting was all around: Instead of asking the assistant to give you an immediate answer (i.e. a simple query), you first assign it a specific role, such as “teacher” or “consultant” etc. Such a prompt could look like [2]:

From now on, you are an excellent math teacher and always teach your students math problems correctly. And I am one of your students.

It has been shown that this concept yields superior results. One paper reports that by this role play, the model implicitly triggers a step by step reasoning process, which is what you want it to do when applying the CoT- technique, see below. However, this approach has also been shown to sometimes perform sub-optimal and needs to be well designed.

In my experience, simply assigning a role doesn’t do the trick. I have experimented with the example task from the paper referred to above. Unlike in this research, GPT3.5 (which is as of today the free version of OpenAI’s ChatGPT, so you can try it yourself) has given the correct result, using a simple query:

An example using a simple query instead of the roleplay prompt suggested by [2], still yielding the correct response

I have also experimented with different logical challenges with both simple queries and roleplay, using a similar prompt like the one above. In my experiments two things happen:

either simple queries provides the correct answer on the first attempt, or

both simple queries and roleplay come up with false, however different answers

Roleplay did not outperform the queries in any of my simple (not scientifically sound) experiments. Hence, I conclude that the models must have improved recently and the impact of roleplay prompting is diminishing.

Looking at different research, and without extensive further own experimenting, I believe that in order to outperform simple queries, roleplay prompts need to be embedded into a sound and thoughtful design to outperform the most basic approaches — or are not valuable at all.

I am happy to read your experiences on this in the comments below.

Few-Shot aka in-context learning

Another intuitive and relatively simple concept is what’s referred to as Few-Shot prompting, also referred to as in-context learning. Unlike in a Zero-Shot Prompt, we not only ask the model to perform a task and expect it to deliver, we additionally provide (“few”) examples of the solutions. Even though you may find this obvious that providing examples leads to better performance, this is quite a remarkable ability: These LLMs are able to in-context learn, i.e. perform new tasks via inference alone by conditioning on a few input-label pairs and making predictions for new inputs [3].

Setting up a few-shot prompt involves

(1) collecting examples of the desired responses, and
(2) writing your prompt with instructions on what to do with these examples.

Let’s look at a typical classification example. Here the model is given several examples of statements that are either positive, neutral or negative judgements. The model’s task is to rate the final statement:

A typical classification example of a Few-Shot prompt. The model is required to classify statements into the given categories (positive / negative)

Again, even though this is a simple and intuitive approach, I am sceptical about its value in state-of-the-art language models. In my (again, not scientifically sound) experiments, Few-Shot prompts have not outperformed Zero-Shot in any case. (The model knew already that a drummer who doesn’t keep the time, is a negative experience, without me teaching it…). My finding seems to be consistent with recent research, where even the opposite effect (Zero-Shot outperforming Few-Shot) has been shown [4].

In my opinion and on this empirical background it is worth considering if the cost of design as well as computational, API and latency cost of this approach are a worthwhile investment.

CoT-Prompting or “Let’s think step-by-step’’

Chain of Thought (CoT) Prompting aims to make our models better at solving complex, multi-step reasoning problems. It can be as simple as adding the CoT instruction “Let’s think step-by-step’’ to the input query, to improve accuracy significantly [5][6].

Instead of just providing the final query or add one or few examples within your prompt like in the Few-Shot approach, you prompt the model to break down its reasoning process into a series of intermediate steps. This is akin to how a human would (ideally) approach a challenging problem.

Remember your math exams in school? Often, at more advanced classes, you were asked to not only solve a mathematical equation, but also write down the logical steps how you arrived at the final solution. And even if it was incorrect, you might have gotten some credits for mathematically sound solution steps. Just like your teacher in school, you expect the model to break the task down into sub-tasks, perform intermediate reasoning, and arrive at the final answer.

Again, I have experimented with CoT myself quite a bit. And again, most of the time, simply adding “Let’s think step-by-step” didn’t improve the quality of the response. In fact, it seems that the CoT approach has become an implicit standard of the recent fine-tuned chat-based LLM like ChatGPT, and the response is frequently broken down into chunks of reasoning without the explicit command to do so.

However, I came across one instance where the explicit CoT command did in fact improve the answer significantly. I used a CoT example from this article, however, altered it into a trick question. Here you can see how ChatGPT fell into my trap, when not explicitly asked for a CoT approach (even though the response shows a step wise reasoning):

A trick question with a simple query instead of a CoT prompt. Even though the response is broken down “step by step”, it is not quite correct.

When I added “Let’s think step-by-step” to the same prompt, it solved the trick question correctly (well, it is unsolvable, which ChatGPT rightfully pointed out):

The same trick question with an explicit CoT prompt, delivering a correct response

To summarize, Chain of Thought prompting aims at building up reasoning skills that are otherwise difficult for language models to acquire implicitly. It encourages models to articulate and refine their reasoning process rather than attempting to jump directly from question to answer.

Again, my experiments have revealed only limited benefits of the simple CoT approach (adding “Let’s think step-by-step“). CoT did outperform a simple query on one occasion, and at the same time the extra effort of adding the CoT command is minimal. This cost-benefit ratio is one of the reasons why this approach is one of my favorites. Another reason why I personally like this approach is, it not only helps the model, but also can help us humans to reflect and maybe even iteratively consider necessary reasoning steps while crafting the prompt.

As before, we will likely see diminishing benefits of this simple CoT approach when models become more and more fine-tuned and accustomed to this reasoning process.

Conclusion

In this article, we have taken a journey into the world of prompting chat-based Large Language Models. Rather than just giving you the most popular prompting techniques, I have encouraged you to begin the journey with the question of Why prompting matters at all. During this journey we have discovered that the importance of prompting is diminishing thanks to the evolution of the models. Instead of requiring users to invest in continuously improving their prompting skills, currently evolving model architectures will likely further reduce their relevance. An agent-based framework, where different “routes” are taken while processing specific queries and tasks, is one of those.

This does not mean, however, that being clear and specific and providing the necessary context within your prompts isn’t worth the effort. On the contrary, I am a strong advocate of this, since it not only helps the model but also yourself to figure out what exactly it is you’re trying to achieve.

Just like in human communication, multiple factors determine the appropriate approach for reaching a desired outcome. Often, it is a mix and iteration of different approaches that yield optimal results for the given context. Try, test, iterate!

And finally, unlike in human interactions, you can test virtually limitlessly into your personal trial-and-error prompting journey. Enjoy the ride!

References

[1]: How Large Language Models work: From zero to ChatGPT

https://medium.com/data-science-at-microsoft/how-large-language-models-work-91c362f5b78f

[2]: Better Zero-Shot Reasoning with Role-Play Prompting

https://arxiv.org/abs/2308.07702

[3]: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
https://arxiv.org/abs/2202.12837

[4]: Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

https://dl.acm.org/doi/abs/10.1145/3411763.3451760.

[5]: When do you need Chain-of-Thought Prompting for ChatGPT?

https://arxiv.org/abs/2304.03262

[6]: Large Language Models are Zero-Shot Reasoners
https://arxiv.org/abs/2205.11916

The Business Guide to Tailoring Language AI Part 2 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Originally appeared here:
The Business Guide to Tailoring Language AI Part 2

Go Here to Read this Fast! The Business Guide to Tailoring Language AI Part 2

April 19, 2024

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Tag: AI

Dive into the “Curse of Dimensionality” concept and understand the math behind all the surprising phenomena that arise in high dimensions.

What is the curse of dimensionality?

The notion of distance in high dimensions

The maths: the n-ball

Curse of dimensionality, overfitting, and Occam’s Razor

Blessing of dimensionality?

Conclusion

Learn how to reduce model latency when deploying Meta* Llama 3 on CPUs

Llama 3

Architecture

Language Modeling Performance

Optimizing Llama 3 Inference with PyTorch

Environment Setup

Accessing and Configuring Llama 3

Quantizing Llama-3–8B-Instruct with WOQ

Considerations for Deployment

Conclusion and Discussion

Do I really sleep worse after drinking alcohol?

Randomized Clinical Trials

N-of-1 Trials

N-of-1 Trials in Practice

Notes on the Data

Exploratory Data Analysis

Hypothesis Testing

Choosing a Statistical Test

Conducting our Analysis in R

Final Conclusions

Going Further

References and Further Reading

Part 1: Task-specific approaches for scenario forecasting

Our project: Launching tests for English courses

Modelling demand from new customers

Modelling demand from existing customers

The first version

Taking into account long-term trends

Putting everything together

First results

What if we have demand constraints?

Tracking students’ performance over time

Summary

Reference

How to Implement Knowledge Graphs and Large Language Models (LLMs) Together at the Enterprise Level

A survey of the current methods of integration

The two ways KGs and LLMs are related

Using LLMs to assist in the KG creation and curation process

Using KGs to power and govern GenAI pipelines

Data governance, access control, and regulatory compliance

Accuracy and contextual understanding

Efficiency and scalability

Conclusion

Prompting ChatGPT and other chat-based language AI — and why you should (not) care about it

Foreword

Introduction

Start with Why

Prompting 101

Ask and it will be given to you

Roleplay prompting — simple, but overrated

Few-Shot aka in-context learning

CoT-Prompting or “Let’s think step-by-step’’

Conclusion