In this article, I attempt to clarify the use of essential tools in the applied econometrician’s toolkit: Difference-in-Differences (DiD) and Event Study Designs. Inspired mostly by my students, this article breaks down the basic concepts and addresses common misconceptions that often confuse practitioners.
If you wonder why the title focuses on Event Studies while I am also talking about DiD, it is because, when it comes to causal inference, Event Studies are a generalization of Difference-in-Differences.
But before diving in, let me reassure you that if you are confused, there may be good reasons for it. The DiD literature has been booming with new methodologies in recent years, making it challenging to keep up. The origins of Event Study designs don’t help either…
Origins of Event Studies
Finance Beginnings
Event studies originated in Finance, developed to assess the impact of specific events, such as earnings announcements or mergers, on stock prices. The approach was pioneered by Ball and Brown (1968), whose work laid the groundwork for the methodology.
Event Studies in Finance
Methodology
In Finance, the event study methodology involves identifying an event window for measuring ‘abnormal returns’, namely the difference between actual and expected returns.
Finance Application
In the context of finance, the methodology typically involves the following steps:
Identifying a specific event of interest, such as a company’s earnings announcement or a merger.
Determining an “event window,” or the time period surrounding the event during which the stock price might be affected.
Calculating the “abnormal return” of the stock by comparing its actual performance during the event window to the performance of a benchmark, such as a market index or industry average.
Assessing the statistical significance of the abnormal return to determine whether the event had an impact on the stock price.
This methodological approach has since evolved and expanded into other fields, most notably economics, where it has been adapted to suit a broader range of research questions and contexts.
Evolution into Economics — Causal Inference
Adaptation in Economics
Economists use Event Studies to causally evaluate the impact of economic shocks and other significant policy changes.
Before explaining how Event Studies are used for causal inference, we need to touch upon Difference-in-Differences.
Difference-in-Differences (DiD) Approach
The DiD approach typically involves i) a policy adoption or an economic shock, ii) two time periods, iii) two groups, and iv) a parallel trends assumption.
Let me clarify each of these below:
i) A policy adoption may be: the use of AI in the classroom in some schools; expansion of public kindergartens in some municipalities; internet availability in some areas; cash transfers to households, etc.
ii) We denote as “pre-treatment” or “pre-period” the period before the policy is implemented, and as “post-treatment” the period after the policy implementation.
iii) We call the units affected by the policy the “treatment group,” and the units that are not the “control group.” Both groups are composed of several units: individuals, firms, schools, municipalities, etc.
iv) The parallel trends assumption is fundamental for the DiD approach. It assumes that in the absence of treatment, treatment and control groups follow similar trends over time.
A common misconception about the DiD approach is that we need random assignment.
In practice, we don’t. Although random assignment is ideal, the parallel trends assumption is sufficient for causally estimating the effect of the treatment on the outcome of interest.
Randomization, however, ensures that pre-intervention differences between the groups are zero in expectation and statistically insignificant. (Although, by chance, they may differ.)
Scenario: Adoption of ChatGPT in Schools and Emotional Intelligence Performance
Background
Imagine a scenario in which AI becomes available in the year 2023 and some schools immediately adopt AI as a tool in their teaching and learning processes, while other schools do not. The aim is to understand the impact of AI adoption on student emotional intelligence (EI) scores.
Data
Treatment Group: Schools that adopted AI in 2023.
Control Group: Schools that did not adopt AI in 2023.
Pre-Treatment: Academic year before 2023.
Post-Treatment: Academic year 2023–2024.
Methodology
Pre-Treatment Comparison: Measure student scores for both treatment and control schools before AI adoption.
Post-Treatment Comparison: Measure student scores for both treatment and control schools after AI adoption.
Calculate Differences:
Difference in EI scores for treatment schools between pre-treatment and post-treatment.
Difference in EI scores for control schools between pre-treatment and post-treatment.
The DiD estimate is the difference between the two differences calculated above. It estimates the causal impact of AI adoption on EI scores.
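To make the two-by-two calculation concrete, here is a minimal sketch in Python (the data frame and its values are made up for illustration; the coefficient on the treated-times-post interaction is the DiD estimate):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical school-year panel: one row per school and academic year
df = pd.DataFrame({
    "ei_score": [60, 62, 70, 71, 55, 64, 66, 68],
    "treated":  [0,  0,  0,  0,  1,  1,  1,  1],   # 1 = school adopted AI in 2023
    "post":     [0,  1,  0,  1,  0,  1,  0,  1],   # 1 = academic year 2023-2024
})

# The coefficient on treated:post is the difference-in-differences estimate
did = smf.ols("ei_score ~ treated * post", data=df).fit()
print(did.params["treated:post"])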
A Graphical Example
The figure below plots emotional intelligence scores on the vertical axis, while the horizontal axis measures time. Time is linear and composed of the pre- and post-treatment periods.
The Counterfactual Group 2 measures what would have happened had Group 2 not received treatment. Ideally, we would like to measure the Counterfactual Group 2, namely the scores Group 2 would have had in the absence of treatment, and compare it with the observed scores for Group 2, i.e., those observed once the group receives treatment. (This is the main issue in causal inference: we can’t observe the same group both with and without treatment.)
If we were tempted to make the naive comparison between the outcomes of Group 1 and Group 2 post-treatment, we would get a biased estimate, namely delta OLS in the figure.
The difference-in-differences estimator allows us to estimate the causal effect of AI adoption, shown geometrically in the figure as delta ATT.
The plot indicates that AI was adopted by schools where students initially had lower emotional intelligence scores. Post-treatment, the scores of the treatment group almost caught up with the control group, whose average EI score was higher in the pre-period. The plot suggests that, in the absence of treatment, scores would have increased for both groups along parallel trends. With treatment, however, the gap in scores between Group 2 and Group 1 closes.
Note that units in the control group can be treated at another point in time, namely in 2024, 2025, 2026, and so on. What’s important is that in the pre-treatment period, none of the units, in treatment or control groups, is affected by the policy (or receives treatment).
If we only have two time periods, the control group in our study will be, by design, a “never treated” group.
DiD for Multiple Periods
When different units receive treatment at different points in time, we can establish a more generalized framework for estimating causal effects. In this case, each group will have a pre-treatment period and a post-treatment period, delineated by the introduction/adoption of the policy (or the assignment to treatment) for each unit.
Back to our Event Study Design
As mentioned, Event Study Designs are a flavor of generalized Difference-in-Differences. Such designs allow researchers to flexibly estimate the effects of the policy or allocation to treatment.
We estimate Event Studies with a linear estimating equation in which we include leads and lags of the treatment variable, denoted with D, and additional covariates x.
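One common way to write such an estimating equation (a generic sketch; exact specifications vary across papers) is:

y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k D_{it}^{k} + x_{it}'\gamma + \varepsilon_{it}

where D_{it}^{k} equals one when unit i is k periods away from its treatment date (negative k are leads, positive k are lags), \alpha_i and \lambda_t are unit and time fixed effects, and the period just before treatment (k = -1) is omitted as the reference.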
A nice feature of these designs is that the coefficient estimates on the leads and lags give us a sense of the relevant event window, helping us determine whether the impact is temporary or persistent over time.
Moreover, and perhaps most importantly, the coefficient estimates on the lead indicators inform us whether the parallel trends assumption holds.
In a simulated example, we have data on schools since 2021 (pre-treatment). From 2023 onward, randomly chosen schools adopt AI: some in 2023, some in 2024, and so on until 2027.
Let’s assume that the treatment causes a 5-unit increase in the year of adoption (for the treatment groups only), and the effect lasts only for one period.
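Below is a hedged sketch of such a simulation and a simple two-way fixed effects event study with lead and lag dummies (variable names are made up; in applied work you may prefer one of the newer staggered-adoption estimators):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# 200 hypothetical schools observed 2021-2028, with staggered AI adoption 2023-2027
schools = pd.DataFrame({"school": range(200),
                        "adopt_year": rng.choice([2023, 2024, 2025, 2026, 2027], size=200)})
panel = schools.merge(pd.DataFrame({"year": range(2021, 2029)}), how="cross")
panel["event_time"] = panel["year"] - panel["adopt_year"]

# Treatment effect: +5 EI units in the adoption year only, plus noise
panel["ei_score"] = 50 + 5 * (panel["event_time"] == 0) + rng.normal(0, 1, len(panel))

# Event-time dummies, binned at +/-3, with k = -1 omitted as the reference period
panel["k"] = panel["event_time"].clip(-3, 3)
def ev(k):
    return f"ev_m{-k}" if k < 0 else f"ev_p{k}"
kept = [k for k in range(-3, 4) if k != -1]
for k in kept:
    panel[ev(k)] = (panel["k"] == k).astype(int)

formula = "ei_score ~ " + " + ".join(ev(k) for k in kept) + " + C(school) + C(year)"
es = smf.ols(formula, data=panel).fit()
print(es.params[[ev(k) for k in kept]])  # leads near 0, ev_p0 near 5, later lags near 0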
The event study plot will display the coefficient estimates on leads and lags. Our hypothetical scenario will produce a plot similar to the one below.
The coefficient estimates on the lead indicators are close to zero and statistically insignificant, suggesting that the parallel trends assumption holds: treatment and control groups are comparable.
The intervention causes a temporary increase in the outcome of interest by 5 units, after which the treatment effect vanishes.
(Feel free to explore using the code on GitHub at ariedamuco/event-study and compare the two-period DiD design with the event study design, with and without a pure control or never-treated group.)
Misconception 1: Event Studies Are Only for Stock Prices
While commonly associated with stock prices, event studies have broader applications, including analyzing the impact of policy changes, economic events, and more.
Misconception 2: You Always Need a Never-Treated Group
Event Studies allow us to estimate the effects of the policy without having a “never treated” group.
The staggered treatment in the event study design allows us to create a control group for each treated unit at time t, with the control group being treated at a later time t+n.
In this case, we are closer to comparing apples with apples as we are comparing units that receive treatment with those that will receive treatment in the future.
Misconception 3: You Need an Event Window
The power of event studies lies in their flexibility. Because we can estimate the effect of the treatment over time, the event study allows us to pinpoint the relevant event window.
For example, if there is an anticipatory effect, the event study will show it in the leads, namely the indicator variables leading up to the event will be statistically significant. (Note that this will invalidate the parallel trends assumption.)
Similarly, if the effect is long-lasting, the event study will capture this feature: the lags included in the estimating equation will turn out to be significant, helping us understand how long the effect persists.
Misconception 4: You Don’t Need Parallel Trend Assumptions
If the parallel trends assumption does not hold, you are not dealing with a DiD (or an Event Study). However, you can rely on a conditional parallel trends assumption. This means that parallel trends hold once we control for covariates. (The covariates should not themselves be affected by the treatment, and ideally they should be measured pre-treatment.)
Conclusion
Event study designs are a generalization of the Difference-in-Differences approach. Understanding the methodology and addressing common misconceptions ensures that researchers can effectively apply this tool in their empirical research.
Notes
Unless otherwise noted, all images are by the author.
Thank you for taking the time to read about my thoughts. If you enjoyed the article feel free to reach out at [email protected] or on Twitter, Linkedin, or Instagram. Feel free to also share it with others.
This blog post will go line-by-line through the code in Section 1 of Andrej Karpathy’s “Let’s reproduce GPT-2 (124M)”
Image by Author — SDXL
Andrej Karpathy is one of the foremost Artificial Intelligence (AI) researchers out there. He is a founding member of OpenAI, previously led AI at Tesla, and continues to be at the forefront of the AI community. He recently released an incredible 4-hour video walking through how to build a high-quality LLM from scratch.
In that video, he goes through all of the major parts of training an LLM, from coding the architecture to speeding up its training time to adjusting the hyperparameters for better results. There’s an incredible amount of knowledge there, so I wanted to expand upon it by going line-by-line through the code Karpathy creates and explaining how it works. This blog post is part of a series covering each section of Karpathy’s video.
In section one, we focus on implementing the architecture of GPT-2. While GPT-2 was open-sourced by OpenAI in 2019, it was written in TensorFlow, which is a harder framework to debug than PyTorch. Consequently, we are going to recreate GPT-2 using more commonly used tools. Using only the code we are going to create today, you can create an LLM of your own!
Let’s dive in!
High Level Vocabulary
Before we begin, let’s get on the same page about some terminology. While there may be some naming collisions with other sources, I’ll try to be consistent within these blog posts.
Vocabulary Size — tells us how many unique tokens the model will be able to understand and use. In general, researchers have found that larger vocabulary sizes allow models to be more precise with their language and to capture more nuances in their responses.
Layer — part of the hidden layers of our neural network. Specifically here we refer to how many times we repeat the calculations shown in the grey box below:
Embedding — a vector representation of data we pass to the model.
Multi-Head Attention — rather than running attention once, we run it n-times and then concatenate all of the results together to get the final result.
Let’s go into the code!
GPT Class & Its Parameters
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
To begin, we are setting 5 hyperparameters in the GPTConfig class. block_size appears to be somewhat arbitrary, along with n_layer and n_head. Put differently, these values were chosen empirically based on what the researchers saw had the best performance. Moreover, we choose 768 for n_embd as this is the value chosen for the GPT-2 paper, which we’ve decided to emulate.
However, vocab_size is set based off the tiktoken gpt2 tokenizer that we will use. The GPT-2 tokenizer was created using the Byte-Pair Encoding algorithm (read more here). It starts off with an initial vocabulary (in our case 256 byte tokens) and then goes through the training data, creating new vocabulary entries based on how frequently token pairs appear in the training set. It keeps doing this until it hits a limit (in our case 50,000). Finally, we have vocabulary set aside for internal use (in our case the end-of-text token). Adding these up, we get 50,257.
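As a quick sanity check (assuming you have tiktoken installed), you can confirm this number directly:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                      # 50257
print(enc.encode("Hello, world!"))      # token IDs under the GPT-2 BPE vocabulary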
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # ...
With our configs set, we create a GPT class which is an instance of the torch nn.Module class. This is the base class for all PyTorch neural networks, and so by using this we get access to all of the optimizations that PyTorch has for these types of models. Each nn.Module will have a forward function that defines what happens during a forward pass of the model (more on these in a moment).
We begin by running the super constructor of the base class and then create a transformer object as an nn.ModuleDict. We use a ModuleDict because it allows us to index into transformer by name, which will come in handy both when we want to load in weights from Hugging Face and when we want to debug and quickly step through our model.
class GPT(nn.Module):
    def __init__(self, config):
        # ...
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
Our transformer here has 4 major pieces we are going to load in: the weights of the token embeddings (wte), the weights of the positional encodings (wpe), the hidden layers (h), and the layer normalization (ln_f). This setup mostly follows the decoder part of the Transformer architecture from “Attention is All You Need” (output embeddings ~ wte, positional encoding ~ wpe, hidden layers ~ h). One key difference is that we have an additional normalization layer, ln_f, applied after all of the hidden layers have finished in our architecture.
The wte and the wpe are both embeddings, so naturally we use the nn.Embedding class to represent them. Our hidden layers are where we will have most of the logic for the Transformer, so I will go into this more later. For now, just note that we are creating a loop of the object Block so that we have n_layer of them. Finally, we use the built-in nn.LayerNorm for ln_f, which will normalize our output based on the equation below (where x and y are input and output, E[x] is the mean value, and γ and β are learnable weights).
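For reference, the normalization that nn.LayerNorm applies can be written as:

y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta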
Next, we setup the final linear layer of our network which will generate the logits of the model. Here we are projecting from the embedding dimension of our model (768) to the vocabulary size of our model (50,257). The idea here is that we have taken the hidden state and expanded it to map onto our vocabulary so that our decoder head can use the values on each vocab to figure out what the next token should be.
Finally in our constructor, we have an interesting optimization where we tie the token embedding weights to the linear layer weights. This is done because we want the linear layer and the token embedding to have the same understanding of the tokens (if two tokens are similar when being input into the model, the same two tokens should be similar when being output by the model). Finally, we initialize the weights for the model so we can start training.
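The corresponding constructor lines (a sketch reconstructed from the description above, in the style of Karpathy's code) look roughly like this:

class GPT(nn.Module):
    def __init__(self, config):
        # ...
        # final linear layer projecting from n_embd to vocab_size to produce logits
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # weight tying: the token embedding and the output projection share one matrix
        self.transformer.wte.weight = self.lm_head.weight
        # initialize all weights (see the _init_weights section later)
        self.apply(self._init_weights)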
class GPT(nn.Module):
    # ...
    def forward(self, idx, targets=None):
        B, T = idx.size()
        assert T <= self.config.block_size
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        pos_emb = self.transformer.wpe(pos)
        tok_emb = self.transformer.wte(idx)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
Our forward function is where we lay out exactly how our model will behave during a forward pass. We start off by verifying that our sequence length is not greater than our configured max value (block_size). Once that’s true, we create a tensor with values of 0 to T-1 (for example if T = 4, we’d have tensor([0, 1, 2, 3]) and run them through our positional embedding weights. Once that’s complete, we run the input tensor through the token embedding weights.
We combine both the token and the positional embeddings into x, which requires a broadcast. tok_emb has shape (B, T, C) while pos_emb has shape (T, C), so after broadcasting, x has the dimensions of tok_emb. x is now our hidden state, which we will pass through the hidden layers via the for loop. We are careful to update x after each time through a Block.
Next, we normalize x via our layer normalization ln_f and then do our linear projection to get the logits necessary to predict the next token. If we are training the model (which we signal via the targets parameter), we compute the cross entropy between the logits we have just produced and the ground truth values held in our targets variable. We accomplish this via the cross_entropy loss function. To do this right, we need to convert our logits and targets to the right shape via .view(); we ask PyTorch to infer the correct size when we pass in -1.
There’s one more function in this class, the initialization function, but we’ll get to the initialization logic a little later. For now, let’s dive into the Block logic that will help us implement our multi-head attention and MLPs.
Block is defined as an nn.Module subclass, so we also call the super constructor at the beginning for its optimizations. Next, we set up the same components as in the “Attention is All You Need” paper: 2 layer normalizations, an attention calculation, and a feed-forward layer via an MLP.
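A sketch of that constructor, following the description above (the exact code in the video may differ slightly):

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)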
class Block(nn.Module):
    # ...
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
We then define our forward function, which PyTorch will call for every forward pass of the model. Note that this is where we do something different than Attention is All You Need: we apply the layer normalizations before the attention and the feed-forward, respectively. This is one of the insights from the GPT-2 paper, and you can see how making little changes like this can make a big difference. Note also that x is added back to the output of each sub-layer, so the original tensor stays in the sum (the residual connections). These 2 additions will be important when we set up our weight initialization function.
This class is a nice abstraction, as it lets us swap out implementations of attention or choose another type of feed forward function other than MLP without having to majorly refactor the code.
Attention is an important part of our model, so naturally there are a number of configurations here. We have the assert statement as a debugging tool to make sure that the configuration dimensions we pass through are compatible. Then we create some helper layers that will assist us when we do our self-attention. First, we have c_attn and c_proj, which are linear projections that convert our hidden state into the new dimensions needed for the attention calculation. The c_proj.NANOGPT_SCALE_INIT is a flag we set here and in the MLP that will help us with the weight initialization later (in truth this could be named anything).
Finally, we tell torch to create a buffer, called bias, that will not be updated during training. bias will be a lower triangular matrix of dimensions block_size x block_size, which we then turn into a 4D tensor with dimensions 1 x 1 x block_size x block_size. The 1 x 1 is done so that the mask broadcasts across the batch and head dimensions. This buffer will be used to apply a causal mask in our multi-headed attention.
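Putting the pieces together, the constructor looks roughly like this (a sketch based on the description above; the flag name matches the one used later in the weight-initialization code):

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # combined query/key/value projection and output projection
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # causal mask as a (1, 1, block_size, block_size) lower-triangular buffer
        self.register_buffer("bias",
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size))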
class CausalSelfAttention(nn.Module):
    # ...
    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, channels
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # transpose is done for efficiency optimization
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        y = self.c_proj(y)
        return y
Now comes the implementation of attention, with a focus on making this performant in torch. Going line by line, we begin by finding the batch size, sequence length, and channels in our input tensor x. We then will call our c_attn from before to project our hidden state into the dimensions we’ll need. We then split that result into 3 tensors of (B, T, C) shape (specifically one for query, one for key, and one for value).
We then adjust the dimensions of q, k, and v so that we can do multi-head attention on these performantly. By changing the dimensions from (B, T, C) to (B, T, self.n_head, C // self.n_head), we are dividing up the data so that each head gets its own unique data to operate on. We transpose our view so that we can make T the third dimension and self.n_head the second dimension, allowing us to more easily concatenate the heads.
Now that we have our values, we can start to calculate. We perform a matrix multiplication between query and key (making sure to transpose key so the dimensions line up), then divide by the square root of the size of k. After this calculation, we apply the bias from our registered buffer so that attention from tokens in the future cannot influence tokens in the present (hence we fill every position above the diagonal with -inf, using only the first T rows and columns of the mask). Once that is complete, we apply the softmax to turn the scores into attention weights.
Once the softmax is applied, we multiply the attention weights by v, and then transpose our values back to the (B, T, self.n_head, C // self.n_head) setup. We call .contiguous() to ensure that all of the data is laid out next to each other in memory, and finally convert our tensor back to the (B, T, C) dimensions it came in with (thus concatenating our attention heads in this step).
Finally, we use our linear projection c_proj to convert back to the original dimensions of the hidden state.
Like all the classes before, MLP inherits from nn.Module. We begin by setting up some helper layers — specifically the c_fc and c_proj linear projections, expanding from our embedding dimension to 4 times its size and then back again, respectively. Next, we have GELU. Karpathy makes a point to say that the approximate parameter here is only set so that we can closely match the GPT-2 paper. While at the time the approximation of GELU was necessary, nowadays we no longer need to approximate — we can calculate it precisely.
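A sketch of the MLP constructor as described above (the tanh approximation is what matches the original GPT-2):

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # expand 4x
        self.gelu = nn.GELU(approximate='tanh')                    # GPT-2 uses the tanh approximation
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back down
        self.c_proj.NANOGPT_SCALE_INIT = 1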
class MLP(nn.Module):
    # ...
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
Our forward pass is then relatively straightforward. We call each function on our input tensor and return the final result.
Hugging Face Connection Code
Because GPT-2 is open-source, it is available on Hugging Face. While our goal here is to train our own model, it is nice to be able to compare what our results will be with the ones OpenAI found in their training. To allow us to do so, we have the below function that pulls in the weights and populates them into our GPT class.
This also allows us to reuse the code to pull in foundation models from Hugging Face and fine-tune them (with some modifications, as right now it’s written only for the GPT-2 variants).
class GPT(nn.Module):
    # ...
    @classmethod
    def from_pretrained(cls, model_type):
        """Loads pretrained GPT-2 model weights from huggingface"""
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
            'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
            'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
            'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257  # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024   # always 1024 for GPT model checkpoints
        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard this mask / buffer, not a param
        # ...
Starting from the top, we bring in HuggingFace’s transformers library and setup the hyperparameters that vary between different variants of the GPT-2 model. As the vocab_size and block_size don’t change, you can see we hard-code them in. We then pass these variables into the GPTConfig class from before, and then instantiate the model object (GPT). Finally, we remove all keys from the model that end with .attn.bias , as these are not weights, but rather the register we setup to help with our attention function before.
# load the Hugging Face model and grab its state dict
model_hf = GPT2LMHeadModel.from_pretrained(model_type)
sd_hf = model_hf.state_dict()

# copy while ensuring all of the parameters are aligned and match in names and shapes
sd_keys_hf = sd_hf.keys()
sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]  # ignore these, just a buffer
sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]  # same, just the mask (buffer)
transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
# basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
# this means that we have to transpose these weights when we import them
assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
Next, we load in the model from the HuggingFace class GPT2LMHeadModel. We take the keys out from this model and likewise ignore the attn.masked_bias and attn.bias keys. We then have an assert to make sure that we have the same number of keys in the hugging face model as we do in our model.
class GPT(nn.Module):
    # ...
    @classmethod
    def from_pretrained(cls, model_type):
        # ...
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model
To round out the function, we loop through every key in the Hugging Face model and copy its weights into the corresponding key in our model. Certain keys need to be manipulated so that they fit the data structure we’re using; for those, we call .t() to transpose the Hugging Face matrix into the dimensions we need. For the rest, we copy them over directly. You’ll notice we are using torch.no_grad(). This tells torch that it doesn’t need to track these operations for backpropagation, another optimization to make this run faster.
Generating Our First Predictions (Sampling Loop)
With the classes we have now, we can run the model and have it give us output tokens (just make sure if you’re following this sequentially that you comment out the _init_weights call in the GPT constructor). The below code shows how we would do that.
device = "cpu" if torch.cuda.is_available(): device = "cuda" elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): device = "mps" print(f"device {device}")
torch.manual_seed(1337)
model = GPT(GPTConfig())
model.eval()
model.to(device)
We start off by determining what devices we have access to. CUDA is NVIDIA’s platform for running extremely fast GPU calculations, so if we have access to a chip that supports CUDA, we will use it. If we don’t, but we’re on Apple Silicon, then we will use MPS. Finally, if we have neither, we fall back to the CPU (this will be the slowest, but every computer has one, so we know we can still run on it).
Then, we instantiate our model using the default configurations and put the model into ‘eval’ mode (this does a number of things, like disabling dropout, but from a high level it makes sure that our model is more consistent during inferencing). Once set, we move the model onto our device. Note that if we wanted to use the Hugging Face weights instead of our training weights, we would modify the line creating the model to read: model = GPT.from_pretrained('gpt2')
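One housekeeping detail the sampling code below relies on: num_return_sequences and max_length must be defined first. A minimal setup (the same values appear again in the training section later):

num_return_sequences = 5   # how many completions to generate
max_length = 30            # maximum number of tokens per completion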
import tiktoken

enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model,")
tokens = torch.tensor(tokens, dtype=torch.long)
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)
x = tokens.to(device)
We now bring in tiktoken using the gpt2 encodings and have it tokenize our prompt. We take these tokens and put them into a tensor, which we then convert to batches in the below line. unsqueeze() will add a new first dimension of size 1 to the tensor, and repeat will repeat the entire tensor num_return_sequences times within the first dimension and once within the second dimension. What we’ve done here is formatted our data to fit the batched schema our model is expecting. Specifically we now match the (B, T) format: num_return_sequences x encoded length of prompt. Once we pass through the input tensor into the beginning of the model, our wte and wpe will create the C dimension.
while x.size(1) < max_length:
    with torch.no_grad():
        logits, _ = model(x)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1)
        xcol = torch.gather(topk_indices, -1, ix)
        x = torch.cat((x, xcol), dim=1)
Now that they’re ready, we send them to the device and begin our sampling loop. The loop will be exclusively a forward pass, so we wrap it in the torch.no_grad to stop it from caching for any backward propagation. Our logits come out with shape (batch_size, seq_len, vocab_size) — (B,T,C) with C coming after a forward pass of the model.
We only need the last position in the sequence to predict the next token, so we pull out [:, -1, :]. We then take those logits and run them through a softmax to get the token probabilities. Taking the top 50, we sample one of them (weighted by their probabilities) and pick that as our predicted token. We then look up its token index and append it to our tensor x. By concatenating xcol to x, we set ourselves up to predict the next token given what we just predicted. This is how we code up autoregression.
for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()
    decoded = enc.decode(tokens)
    print(f">> {decoded}")
After the sampling loop is done, we can go through each of the generated sequences and decode them, showing the response to the user. We grab the i-th sequence in our batch and decode it to get the generated text.
If you run the sampling loop on our initial model, you will notice that the output leaves a lot to be desired. This is because we haven’t trained any of the weights. The next few classes show how we can begin a naive training of the model.
DataLoaderLite
All training requires high-quality data. In his videos, Karpathy likes to use the public-domain Shakespeare text (find it here).
class DataLoaderLite:
    def __init__(self, B, T):
        self.B = B
        self.T = T

        with open('shakespeare.txt', "r") as f:
            text = f.read()
        # tokenize the full text and keep it in memory as one long tensor
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(text)
        self.tokens = torch.tensor(tokens)
        # pointer into the token tensor marking where the next batch starts
        self.current_position = 0
We begin by simply opening the file and reading in the text. This data source is ASCII only, so we don’t need to worry about any unexpected binary characters. We use tiktoken to get the encodings for the body, and then convert these tokens into a tensor. We then create a variable called current_position, which will let us know where in the token tensor we are currently training from (naturally, this is initialized to the beginning). Note, this class is not inheriting from nn.Module, mainly because we have no need for the forward function here. Just as with the prompt part of the sampling loop, our DataLoaderLite class only needs to generate tensors of shape (B, T).
class DataLoaderLite:
    # ...
    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position : self.current_position + (B*T + 1)]
        x = (buf[:-1]).view(B, T)
        y = (buf[1:]).view(B, T)

        self.current_position += B * T
        if self.current_position + (B*T + 1) > len(self.tokens):
            self.current_position = 0
        return x, y
In the above we define the function next_batch to help with training. To make programs run faster, we like to run the calculations in batches. We use the B and T fields to determine the batch size (B) and sequence length (T) we’ll be training on. Using these variables, we create a buffer that holds the tokens we are going to train with, setting the dimensions to be of rows B and columns T. Note that we read from current_position to current_position + (B*T + 1) , where the +1 is to make sure we have all of the ground truth values for our B*T batch.
We then setup our model input (x) and our expected output (y) along the same lines. x is the entire buffer except for the last character, and y is the entire buffer except for the first. The basic idea is that given the first value in token buffer, we expect to get back the second token in the token buffer from our model.
Finally, we update the current_position and return x and y.
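To make the one-token offset concrete, here is a tiny illustration with made-up token IDs and B=2, T=4:

import torch

B, T = 2, 4
buf = torch.arange(10, 19)     # pretend these 9 token IDs came from the tokenizer
x = buf[:-1].view(B, T)        # tensor([[10, 11, 12, 13], [14, 15, 16, 17]])
y = buf[1:].view(B, T)         # tensor([[11, 12, 13, 14], [15, 16, 17, 18]])
# y[i, j] is the token the model should predict after seeing x[i, :j+1]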
Weight Initialization
As we are dealing with probabilities, we’d like to pick initial values for our weights that are likely to require fewer epochs to get right. Our _init_weights function helps us do so, by initializing the weights with either zeroes or with a normal distribution.
class GPT(nn.Module):
    # ...
    def _init_weights(self, module):
        # layer norm is by default set to what we want, no need to adjust it
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, "NANOGPT_SCALE_INIT"):
                std *= (2 * self.config.n_layer) ** -0.5  # 2 * for 2 additions (attention & mlp)
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)  # reasonable values are set based off a certain equation
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
If you remember from before, we’re passing every submodule of the GPT class into _init_weights, so we’re processing nn.Modules. We are following a Xavier-style initialization, which sets the standard deviation of the sampling distribution roughly equal to 1 / sqrt(number of incoming features). You will notice that in the code we often use the hardcoded 0.02 as the standard deviation. While this might seem arbitrary, from the table below you can see that 1 / sqrt(d_model) for the hidden dimensions GPT-2 uses is roughly 0.02 in each case, so this is a fine approximation.
Going through the code, we start off by checking which subtype of nn.Module the module we’re operating on is.
If the module is Linear, we check whether it is one of the projections from the MLP or CausalSelfAttention classes (by checking if it has the NANOGPT_SCALE_INIT flag set). If it is, then our 0.02 approximation won’t work on its own, because the residual stream accumulates a contribution from each of these projections (this is a consequence of the additions in the Block class), so its variance grows with depth. Consequently, the GPT-2 paper uses a scaling factor to account for this: 1/sqrt(2 * self.config.n_layer). The 2 * is because our Block has 2 places where we add to the residual stream.
If we have a bias in the Linear module, we will start by initializing these all to zero.
If we have an Embedding Module (like the Token or Positional Encoding pieces), we will initialize this with the same normal distribution with standard deviation of 0.02.
If you remember, we have another subtype of module that is in our model: nn.LayerNorm . This class already is initialized with a normal distribution and so we decide that this is good enough to not need any changes.
Training Loop
Now that we have the training fundamentals setup, let’s put together a quick training loop to train our model.
device = "cpu" if torch.cuda.is_available(): device = "cuda" elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): device = "mps" print(f"device {device}")
num_return_sequences = 5
max_length = 30
torch.manual_seed(1337)
train_loader = DataLoaderLite(B=4, T=32)
model = GPT(GPTConfig())
model.to(device)
You can see that we repeat our device calculations to get optimal performance. We then set our data loader to use batch sizes of 4 and sequence lengths of 32 (set arbitrarily, although powers of 2 are best for memory efficiency).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()  # have to start with a zero gradient
    logits, loss = model(x, y)
    loss.backward()  # adds to the gradient (+=, which is why they must start as 0)
    optimizer.step()
    print(f"loss {loss.item()}, step {i}")
Now we have the optimizer, which will help us train our model. The optimizer is a PyTorch class that takes in the parameters it should be training (in our case the ones given from the GPT class) and the learning rate, a hyperparameter that determines how quickly we adjust the parameters: a higher learning rate means more drastic changes to the weights after each step. We chose our value based on Karpathy’s recommendation.
We then run 50 training steps. We start by getting the training batch and moving it onto our device. We set the optimizer’s gradients to zero (gradients in PyTorch accumulate, so if we don’t zero them out, we carry information over from the last batch). We calculate the logits and loss from our model and then run backpropagation to figure out how each weight should change. Finally, we run optimizer.step() to update all of our model parameters.
Sanity Check
To see how all of the above code runs, you can check out my Google Colab where I combine all of it and run it on the NVIDIA T4 GPU. Running our training loop, we see that the loss starts off at ~11. To sanity-check this, we expect that at the beginning the probability of predicting the right token is roughly 1/vocab_size. Taking this through the cross-entropy loss, -ln(1/50257) ≈ 10.8, which is just about where we begin!
Image by Author
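If you want to check that back-of-the-envelope number yourself:

import math
print(-math.log(1 / 50257))   # ~10.82, roughly the initial loss we observe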
Closing
Thanks for reading through to the end!
I tried to include as much detail as I could in this blog post, but naturally there were some things I had to leave out. If you enjoyed the blog post or see anything you think should be modified or expanded upon, please let me know!
The splash that ChatGPT was making last year brought with it the realization — surprise for many — that a putative AI could sometimes offer very wrong answers with utter conviction. The term for this is usually “hallucination” and the main remedy that’s developed over the last 18 months is to bring facts into the matter, usually through retrieval augmented generation (RAG), also sometimes called relevant answer generation, which basically reorients the GPT (generative pretrained transformer language model) to draw from contexts where known-to-be-relevant facts are found.
Yet hallucinations are not the only way a GPT can misstep. In some respects, other flavors of misstep are deeper and more interesting to consider — especially now that prospects of Artificial General Intelligence (AGI) are often discussed. Specifically, I’m thinking of what are known as counterfactuals (counterfactual reasoning) and the crucial role counterfactuality can play in decision making, particularly in regard to causal inference. Factuality, therefore, isn’t the only touchstone for effective LLM operation.
In this article I’ll reflect on how counterfactuals might help us think differently about the pitfalls and potentials of Generative AI. And I’ll demonstrate with some concrete examples using open source LMs (specifically Microsoft’s Phi). I’ll show how to set up Ollama locally (it can also be done in Databricks), without too much fuss (both with and without a Docker container), so you can try it out for yourself. I’ll also compare OpenAI’s LLM response to the same prompts.
I suggest that if we want to even begin to think about the prospect of “intelligence” within, or exuded by, an artificial technology, we might need to think beyond the established ML paradigm, which assumes some pre-existing factual correctness to measure against. An intelligent behavior might instead be speculative, as yet lacking sufficient past evidence to obviously prove its value. Isaac Newton, Charles Darwin, or your pet cat, could reason practically about the physical world they themselves inhabit, which is something that LLMs — because they are disembodied — don’t do. In a world where machines can write fluently, talk is cheaper than speculative practical reasoning.
What is a counterfactual and why should we care? There’s certainly idle speculation, sometimes with a rhetorical twist: At an annual meeting, a shareholder asked what the
“…returns since 1888 would have been without the ‘arrogance and greed’ of directors and their bonuses.” ¹
…to which retired banker Charles Munn² replied:
“That’s one question I won’t be answering. I’m a historian, I don’t deal in counterfactuals.” ¹
It’s one way to evade a question. Politicians have used it, no doubt. In legal matters, though, despite the emphasis on precedent, counterfactuality can be a legitimate consideration. As Robert N. Strassfeld puts it:
“As in the rest of life, we indulge, indeed, require, many speculations on what might have been. Although such counterfactual thinking often remains disguised or implicit, we encounter it whenever we identify a cause, and quite often when we attempt to fashion a remedy… Yet, … troublesome might-have-beens abound in legal disputes. We find ourselves stumbling over them in a variety of situations and responding to them in inconsistent ways ranging from brazen self-confidence to paralysis in the face of the task. When we recognize the exercise for what it is, however, our self confidence tends to erode, and we become discomforted, perplexed, and skeptical about the whole endeavor.”³
He goes further, positing that
“…legal decision makers cannot avoid counterfactual questions. Because such questions are necessary, we should think carefully about when and how to pose them, and how to distinguish good answers from poor ones.”
Counterfactuals are not an “anything goes” affair — far from it.
If you’re a Data Scientist, the prospect of “explanations” (with respect to a model) probably brings to mind SHAP (or LIME). Roughly speaking, a SHapley Additive exPlanation (SHAP) is derived by taking each predictive feature (each column of data) one at a time and perturbing the observations (rows) of that feature to assess which feature’s perturbation changes the prediction the most. For the rejected job candidate it might say, for instance: the primary reason the algorithm rejected you is “years of experience,” because when we randomly substitute (permute) other candidates’ “years of experience” it affects the algorithm’s rating of you more than when we do that substitution (permutation) with your other features (like gender, education, etc.). It’s making a quantitative comparison to a “what if.” So what is a Shapley value other than a counterfactual? Counterfactuality is at the heart of the explanation, because it gives a glimpse into causality; and explanation is relied on for making AI responsible.
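Here is a minimal sketch of that “what if” logic (a permutation-style comparison in the spirit of the description above, not the exact SHAP algorithm; the fitted model and the NumPy feature matrix X are assumed, not shown):

import numpy as np

def what_if_effect(model, X, row, feature_idx, n_repeats=200, seed=0):
    """How much does candidate `row`'s score move when we substitute other
    candidates' values for one feature? A counterfactual-style comparison."""
    rng = np.random.default_rng(seed)
    baseline = model.predict(X[row:row + 1])[0]
    shifts = []
    for _ in range(n_repeats):
        x_cf = X[row].copy()                                # the "what if" candidate
        x_cf[feature_idx] = rng.choice(X[:, feature_idx])   # borrow another candidate's value
        shifts.append(abs(model.predict(x_cf.reshape(1, -1))[0] - baseline))
    return float(np.mean(shifts))

# Features whose substitution moves the prediction most form the "explanation".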
Leaving responsibility and ethics to the side for the moment, causal explanation still has other uses in business. At least in some companies, Data Science and AI are expected to guide decision making, which in causal inference terms means making an intervention: adjusting price, targeting this versus that customer segment, and so forth. An intervention is an alteration to the status quo. The fundamental problem of causal inference is that we aren’t able to observe what has never happened. So we can’t observe the result of an intervention until we make that intervention. Where there’s risk involved, we don’t want to intervene without sufficiently anticipating the result. Thus we want to infer in advance that the result we desire can be caused by our intervention. That entails making inferences about the causal effects of events that aren’t yet fact. Instead, such events are counterfactual, that is, contrary to fact. This is why counterfactuals have been couched, by Judea Pearl and others, as the
fundamental problem of causal inference ⁸
So the idea of a “thought experiment”, which has been important especially in philosophy — and ever more so since Ludwig Wittgenstein popularized it as a way to probe how we use language to construct our understanding of the world — isn’t just a sentimental wish-upon-a-star.⁹ Quite to the contrary: counterfactuals are the crux of hard-headed decision making.
In this respect, what Eric Siegel suggests in his recent AI Playbook follows as corollary: Siegel suggests that change management be repositioned from afterthought to prerequisite of any Machine Learning project.¹⁰ If the conception of making a business change isn’t built into the ML project from the get go, its deployment is likely doomed to remain fiction forever (eternally counterfactual). The antidote is to imagine the intervention in advance, and systematically work out its causal effects, so that you can almost taste them. If its potential benefits are anticipated — and maybe even consistently visualized — by all parties who stand to benefit, then the ML project stands a better chance of transitioning from fiction to fact.
As Aleksander Molak explains it in his recent Causal Inference and Discovery in Python (2023)
“Counterfactuals can be thought of as hypothetical or simulated interventions that assume a particular state of the world.” ¹¹
The capacity for rational imagination is implicated in many philosophical definitions of rational agency.¹² ¹³
“[P]sychological research shows that rational human agents do learn from the past and plan for the future engaging in counterfactual thinking. Many researchers in artificial intelligence have voiced similar ideas (Ginsberg 1985; Pearl 1995; Costello & McCarthy 1999)” ¹³ ¹⁴ ¹⁵ ¹⁶
As Molak demonstrates “we can compute counterfactuals when we meet certain assumptions” (33) ¹¹. That means there are circumstances when we can judge reasoning on counterfactuals as either right or wrong, correct or incorrect. In this respect even what’s fiction (counter to fact) can, in a sense, be true.
Verbal fluency seems to be the new bright shiny thing. But is it thought? If the prowess of IBM’s Deep Blue and DeepMind’s AlphaGo could be relegated to mere cold calculation, the flesh-and-blood aroma of ChatGPT’s verbal fluency since late 2022 seriously elevated — or at least reframed — the old question of whether an AI can really “think.” Or is the LLM inside ChatGPT merely a “stochastic parrot,” stitching together highly probable strings of words in infinitely new combinations? There are times, though, when it seems that putative human minds — some running for the highest office in the land — are doing no more than that. Will the real intelligence in the room please stand up?
In her brief article about Davos 2024, “Raising Baby AI in 2024”, Fiona McEvoy reported that Yann LeCun emphasized the need for AIs to learn from not just text but also video footage.¹⁷ Yet that’s still passive; it’s still an attempt to learn from video-documented “fact” (video footage that already exists); McEvoy reports that
“[Daphne] Koller contends that to go beyond mere associations and get to something that feels like the causal reasoning humans use, systems will need to interact with the real world in an embodied way — for example, gathering input from technologies that are ‘out in the wild’, like augmented reality and autonomous vehicles. She added that such systems would also need to be given the space to experiment with the world to learn, grow, and go beyond what a human can teach them.” ¹⁷
Another way to say it is this: AIs will have to interact with the world in an embodied way, at least somewhat, in order to hone their ability to engage in counterfactual reasoning. We’ve all seen the videos — or watched up close — of a cat pushing an object off a counter, with apparently no purpose except to annoy us. Human babies and toddlers do it too. Despite appearances, however, this isn’t just acting out. Rather, in a somewhat naïve incarnation, these are acts of hypothesis testing. Such acts are prompted by a curiosity: what would happen if I shoved this vase?
Generated with DALL-E 3 and edited by the author
Please watch this 3-second animated gif displayed by North Toronto Cat Rescue.¹⁸ In this brief cat video, there’s an additional detail which sheds more light: The cat is about to jump; but before jumping she realizes there’s an object immediately at hand which can be used to test the distance or surface in advance of jumping. Her jump was counterfactual (she hadn’t jumped yet). The fact that she had already almost jumped indicates that she hypothesized the jump was feasible; the cat had quickly simulated the jump in her mind; suddenly realizing that the bottle on the counter afforded the opportunity to make an intervention, to test out her hypothesis; this act was habitual.
I have no doubt that her ability to assess the feasibility of such jumps arose from having physically acted out similar situations many times before. Would an AI, who doesn’t have physical skin in the game, have done the same? And obviously humans do this on a scale far beyond what cats do. It’s how scientific discovery and technological invention happen; but on a more mundane level this part of intelligence is how living organisms routinely operate, whether it’s a cat jumping to the floor or a human making a business decision.
Abduction
Testing out counterfactuals by making interventions seems to hone our ability to do what Charles Sanders Peirce dubbed abductive reasoning.¹⁹ ²⁰ As distinct from induction (inferring a pattern from repeated cases) and deduction (deriving logical implications), abduction is the assertion of a hypothesis. Although Data Scientists often explore hypothetical scenarios in terms of feature engineering and hyperparameter tuning, abductive reasoning isn’t really a direct part of applied Machine Learning, because Machine Learning usually optimizes over a pre-established space of possibilities based on fact, whereas abductive reasoning expands the space of possibilities beyond what is already fact. So perhaps Artificial General Intelligence (AGI) has a lot to catch up on.
Here’s a hypothesis:
Entities (biological or artificial) that lack the ability (or opportunity) to make interventions don’t cultivate much counterfactual reasoning capability.
Counterfactual reasoning, or abduction, is mainly worthwhile to the extent that one can subsequently try out the hypotheses through interventions. That’s why it’s relevant to an animal (or human). Absent eventual opportunities to intervene, causal reasoning (abduction, hypothesizing) is futile, and therefore not worth cultivating.
The capacity for abductive reasoning would not have evolved in humans (or cats), if it didn’t provide some advantage. Such advantage can only pertain to making interventions since abduction (counterfactual reasoning) by definition does not articulate facts about the current state of the world. These observations are what prompt the hypothesis above about biological and artificial entities.
As I mentioned above, RAG (retrieval augmented generation, also known as relevant answer generation) has become the de facto approach for guiding LLM-driven GenAI systems (chatbots) toward appropriate or even optimal responses. The premise of RAG is that if snippets of relevant truthful text are directly supplied to the generative LLM along with your question, then it’s less likely to hallucinate, and in that sense provides better responses. “Hallucinating” is AI industry jargon for: fabricating erroneous responses.
As is well known, hallucinations arise because LLMs, though trained thoroughly and carefully on massive amounts of human-written text from the internet, are still not omniscient, yet tend to issue responses in a rather uniformly confident tone. Surprising? Actually, it shouldn’t be. It makes sense, as the famous critique goes: LLMs are essentially parroting the text they’ve been trained on. And LLMs are trained not on people’s sometimes tentative or evolving inner thoughts, but rather on the verbalizations of those thoughts that reached an assuredness threshold sufficient for a person to post them for all to read on the biggest ever public forum that is the internet. So perhaps it’s understandable that LLMs skew toward overconfidence — they are what they eat.
In fact, I think it’s fair to say that, unlike many honest humans, LLMs don’t verbally signal their assuredness level at all; they don’t modulate their tone to reflect their level of assuredness. Therefore the strategy for avoiding or reducing hallucinations is to set up the LLM for success by pushing the facts it needs right under its nose, so that it can’t ignore them. This is feasible for situations where chatbots are usually deployed, which typically have a limited scope. Documents generally relevant to the scope are assembled in advance (in a vector store/database) so that particularly relevant snippets of text can be searched for on demand and supplied to the LLM along with the question being asked, so that the LLM is nudged to somehow exploit the snippets upon generating its response.
From RAGs to richer
Still, there are various ways things can go awry. An entire ecosystem of configurable toolkits for addressing these has arisen. NVIDIA’s open source NeMo Guardrails can filter out unsafe and inappropriate responses as well as help check for factuality. John Snow Labs’ LangTest boasts “60+ Test Types for Comparing LLM & NLP Models on Accuracy, Bias, Fairness, Robustness & More.” Two toolkits that focus most intensely on the veracity of responses are Ragas and TruEra’s TruLens.
At the heart of TruLens (and similarly Ragas) sits an elegant premise: there are three interconnected units of text involved in each call to a RAG pipeline: the query, the retrieved context, and the response; and the pipeline fails to the extent that there’s a semantic gap between any of these. TruEra calls this the “RAG triad.” In other words, for a RAG pipeline to work properly, three things have to happen successfully: (1) the context retrieved must be sufficiently relevant to the query; (2) the generated response must be sufficiently grounded in the retrieved context; and (3) the generated response must also be sufficiently relevant to the original query. A weak link anywhere in this loop equates to weakness in that call to the RAG pipeline. For instance:
Query: “Which country landed on the moon first?”
Retrieved context: “Neil Armstrong stepped foot on the moon in July 1969. Buzz Aldrin was the pilot.”
Generated response: “Neil Armstrong and Buzz Aldrin landed on the moon in 1969.”
The generated response isn’t sufficiently relevant to the original query — the third link is broken.
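To see how the three links might be scored, here is an illustrative sketch only: it uses plain cosine similarity of sentence embeddings (via the sentence-transformers package) as a crude stand-in for the more sophisticated, often LLM-based, relevance and groundedness metrics that TruLens and Ragas actually implement.

# Crude stand-in for the RAG triad checks: cosine similarity of embeddings
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Which country landed on the moon first?"
context = "Neil Armstrong stepped foot on the moon in July 1969. Buzz Aldrin was the pilot."
response = "Neil Armstrong and Buzz Aldrin landed on the moon in 1969."

q, c, r = (model.encode(t, convert_to_tensor=True) for t in (query, context, response))

print("context relevance :", float(util.cos_sim(q, c)))   # link 1: query vs. retrieved context
print("groundedness      :", float(util.cos_sim(c, r)))   # link 2: context vs. response
print("answer relevance  :", float(util.cos_sim(q, r)))   # link 3: query vs. response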
Insofar as veracity is concerned, the RAG strategy is to avoid hallucination by deriving responses as much as possible from relevant, trustworthy, human-written preexisting text.
How RAG lags
Yet how does the RAG strategy square with the popular critique of language AI, namely that it is merely parroting the text it was trained on? Does it go beyond parroting to handle counterfactuals? The RAG strategy basically tries to avoid hallucination by supplementing the training text with additional text curated by humans in the loop, who can attend to the scope of the chatbot's particular use case. Humans-in-the-loop thus supplement the generative LLM's training by supplying a corpus of relevant factual texts to be drawn from.
Works of fiction are typically not included in the corpus that populates a RAG system's vector store. And even preexisting fictional prose doesn't exhaust the theoretically infinite number of counterfactual propositions that might be deemed true, or correct, in some sense.
But intelligence includes the ability to assess such counterfactual propositions:
“My foot up to my ankle will get soaking wet if I step in that huge puddle.”
In this case, a GenAI system that merely synthesizes verbalizations previously issued by humans (whether from the LLM's training set or from retrieved context supplied downstream) isn't very impressive. Rather than original reasoning, it's just parroting what someone has already said, and parroting what's already been said doesn't serve the purpose at hand when counterfactuals are involved.
Here’s a pair of proposals:
To the extent a GenAI is just parroting, it is bringing forth or synthesizing verbalizations it was trained on.
To the extent a GenAI can surmount mere parroting and reason accurately, it can successfully handle counterfactuals.
The crucial thing about counterfactuals, as Molak explains, is that they "can be thought of as hypothetical or simulated interventions that assume a particular state of the world," or, as Pearl, Glymour, and Jewell describe them, a minimal modification to a system (Molak, 28).¹¹ ²² The point is that answering counterfactuals correctly, or even plausibly, requires more-than-anecdotal knowledge of the world. For LLMs, corpus-based pretraining and prompting infused with retrieved factual documents pin their success to the power of anecdote. A human intelligence, by contrast, often doesn't need, and cannot rely on, anecdote to engage in counterfactual reasoning plausibly. That's why counterfactual reasoning is in some ways a better measure of LLMs' capabilities than fidelity to factuality is.
To explore these issues of counterfactuality with respect to Large Language Models a bit more concretely, let's run a generative model. To minimize impediments, I'll demonstrate by downloading a model to run on your own machine, so you don't need an API key. We'll do this using Ollama. (If you don't want to try this yourself, you can skip the rest of this section.)
Ollama is a free tool that facilitates running open source LLMs on your local computer. It's also possible to run Ollama on Databricks and possibly other cloud platforms, but for simplicity's sake, let's do it locally. (For the local setup I'm indebted to Iago Modesto Brandão's handy Building Open Source LLM based Chatbots using Llama Index,²³ from which the following is adapted.)
The easiest way: download and install Docker, then, within Terminal, run a couple of commands to pull and run Ollama as a server, which can then be accessed from within a Jupyter notebook (after installing two packages).
1. Download and install Docker (the Docker Desktop app).
2. Launch Terminal and run these commands one after another:
docker pull ollama/ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
pip install llama-index==0.8.59
pip install openai==0.28.1
3. Launch jupyter:
jupyter notebook
4. Within the Jupyter notebook, import the Ollama wrapper and create an LLM object. For the sake of speed, we'll use a relatively small model: Microsoft's Phi.
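In code, that looks something like the following. This is a sketch assuming the Ollama wrapper that ships with llama-index 0.8.x (the exact import path can vary across versions), and it assumes you have pulled the phi model into the running container first.

# Pull the model once, from Terminal: docker exec -it ollama ollama pull phi
from llama_index.llms import Ollama

# The Docker container from step 2 serves Ollama on localhost:11434
llm = Ollama(model="phi")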
Now we're ready to use Phi via Ollama to generate text in response to our prompt. For this we use the llm object's complete() method. It generates a response (which might take a minute or so), which we'll print out.
Let's try out a counterfactual: "Would a cement balloon float?"
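In code, the call looks like this; the same pattern is used, as isolated calls, for each of the rephrased questions later in this section.

# Ask the counterfactual question and print Phi's generated answer
response = llm.complete("Would a cement balloon float?")
print(response)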
Next let’s examine the “reasoning” displayed in Phi’s output.
Interpreting LLM (GenAI) answers to counterfactual questions
Leaving aside, for the moment, that Microsoft’s Phi is not a very powerful GPT (generative pretrained transformer), consider its response:
I don’t have personal beliefs or opinions, but based on my knowledge of physics, i can tell you that a cement balloon would not be able to float in air due to its low density. air is lighter than most materials used for balloons, and the inflated rubber material of a typical balloon provides enough buoyancy to lift it into the sky. however, if you were to fill a cement balloon with helium gas or another lighter-than-air substance, it would likely float as long as the weight of the balloon itself is less than the upward force from the lifting gas.
This seems reasonable. If a regular balloon (one I’ve blown up directly with my mouth) wouldn’t float in the air, then neither would one whose exterior is made from cement. If instead, however, the gas pumped inside is lighter than air, such as helium is, then a regular balloon would float up into the sky.
The original question asked about floating but didn't specify the medium; apparently Phi assumed it to be air.
So now let’s specify that in the question, but choose water instead, which was actually my original intention anyway.
No, a cement balloon would not float in water. Cement is a very dense material and does not have much air trapped inside it, so it cannot create buoyancy like a regular balloon filled with helium or air. Additionally, the weight of the cement balloon combined with its density would make it sink to the bottom of the water rather than floating on the surface.
Interesting. Not only does it conclude that the balloon would sink; it now treats water as the medium, as asked. It reasons about the density of cement but doesn't take into account the air inside the balloon. Or possibly it's reasoning about a solid cement ball; but, hmmm, that's not a balloon. Unless, by "cement balloon," it's imagining something analogous to a water balloon (a balloon filled with water), in other words a balloon filled with cement.
Let's change the question again, this time specifying that the balloon is filled with helium (keeping water as the medium).
No, a cement balloon filled with helium would not float on water. This is because helium is lighter than air and therefore lifts the balloon upwards when it is released into the air. However, cement is much denser than water and thus will sink to the bottom of the water. When you combine these two facts, the result is that the concrete balloon filled with helium would not float on water but instead sink to the bottom due to its weight.
In summary, while the helium in the balloon provides buoyancy to keep it afloat in air, the density of the cement itself makes it too heavy for the helium to lift and keep it afloat in water. Thus, a cement balloon filled with helium would not float on water but instead sink to the bottom due to its weight.
Hmm, that’s odd. According to this latest response, the same object that would float up into the air, would sink in water, even though water is denser than air. Sure, this is a counterfactual situation — who’s ever heard of a cement balloon? But let’s try to visualize it: You are standing by a pool of water, or perhaps you are in a boat. You are holding this spherical object, say at waist level. You let go of it, and it starts to float up higher into the air, but you catch it before it’s out of reach. Now you lower this same spherical object down to the surface of the water. And as it touches the water, you slowly and gently let go again. What do you think will happen?
Will this helium-filled cement balloon, which a moment earlier had started to float up into the sky, now suddenly be swallowed up by the water, pushed down to sink to the bottom? Does that ring true?
Keep in mind our own embodied knowledge, which is relevant here: our flesh-and-blood bodies, which aren't even light enough to spontaneously float up into the air, don't sink when placed on water; they float. That embodied knowledge suggests that an object light enough to float up into the air, and thus lighter than ourselves, would also be light enough to float on water, given that we (being heavier than that object) float on water. (Consult Lakoff & Johnson for an overview of embodied cognition.²⁴)
Granted, a GPT language model such as Phi is decidedly disembodied; it lacks embodied knowledge. Perhaps it is exactly that embodied knowledge, which Phi lacks, that this kind of counterfactual reasoning depends on.
Ok, but perhaps context and continuity are the issue here. What I've shown above is not a true conversation, as would happen with a chatbot; I haven't passed the previous context back to Phi, but rather have asked isolated questions. So perhaps Phi is imagining different proportions of cement to helium in its responses to different questions. After all, if the cement were thick enough, or if the chamber of helium inside it were small enough, then the relative lightness of helium compared to air (or water) would be insufficient to counteract the heaviness of the cement. So let's rephrase the question to clarify that detail.
It is unlikely that a cement balloon would be able to float in water since cement is denser than water. The reason why the cement balloon might float in air is due to the helium gas inside, which has a lower density than both air and cement. In water, the buoyant force exerted by the water on the cement balloon will be much less than the weight of the balloon itself. Hence, the cement balloon would sink in water rather than float as it does in the air.
Ok, still something’s wrong. Within the same response it’s asserting that the cement balloon is light enough to float up into the air, but is too heavy to float in water. Doesn’t this imply that water is lighter than air? Hmm, something about that smells fishy — and it isn’t the water.
I'm not the first to suggest that GPT language models like Phi (models that gain their ability through training on massive corpora of text) sometimes lack basic reasoning ability. In this case I'm suggesting how some of that reasoning ability emerges from implicit embodied knowledge, knowledge we gain through the physical presence of our bodies within the material world (our physical bodies, which don't float up into the air but do float on water). I'm also suggesting how counterfactual questions can help reveal this.
GPT models, such as Phi, primarily act associatively. It is the associative, or connotative, facets of language usage that GPTs, and LLMs generally, have mastered so well. In terms of connotation, arising from the most common usage, the word "cement" is associated with heaviness. It has this in common with the heavy metal lead: "Led Zeppelin" was originally a sarcastic barb dismissing the New Yardbirds' prospects to soar. Although a cement shell enclosing enough air could in principle be light, that isn't an association that readily comes to mind; it's a counterfactual, a thought experiment. Even though my previous phrasing of the question refers to "a very thin layer of cement,… filled with helium", the material itself, "cement," associates too strongly with heaviness for Phi to imagine it floating in water: cement boats exist, but they're hardly the first thing anyone pictures.
So let’s switch out cement for a different substance with less leaden connotations. How about glass?
As an AI language model, I do not have personal opinions or beliefs, but based on scientific principles, a glass balloon filled with helium would indeed float in water. This is because the density of helium gas is much lower than that of water, causing it to rise and displace its weight in water. Additionally, glass is less dense than water, which means that it would also be buoyant enough to keep the balloon afloat on top of the water’s surface. However, if the balloon were to be filled with a denser gas or a heavier object, such as sand, then it may sink or become too heavy to float.
Aha! Now, with associative (connotative) obstacles brushed aside, Phi arrives at reasoning comparable to human reasoning, replete with implicit embodied knowledge even though the latter isn't the source in this case. Phi has uncovered the key point, that "the density of helium gas is much lower than that of water," even though it gets mixed up in its additional point, where it says that glass's density is less than water's, which is a hallucination; it's not true.
What about OpenAI’s ChatGPT?
The elephant in the room, though, is that Microsoft's Phi is very far from being a state-of-the-art LLM; it's small enough to run locally on a laptop. So what about OpenAI's GPT models?
To run this code below, you’ll need to sign up for free at https://openai.com to get an OpenAI API access token.
What does OpenAI’s GPT say about the helium-filled glass balloon on water?
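Here's a sketch of the call, using the legacy interface of the openai==0.28.1 package installed earlier; the exact phrasing of the question is my own reconstruction from the discussion above.

import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # obtained for free at https://openai.com

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Would a very thin glass balloon filled with helium float on water?"}],
)
print(completion["choices"][0]["message"]["content"])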
No, a glass balloon filled with helium cannot float in water, regardless of its size. The buoyancy of an object in a fluid is determined by the density of the object compared to the density of the fluid. Density is defined as mass divided by volume. In simple terms, if the object is less dense than the fluid, it will float; if it is more dense, it will sink. In this case, the glass balloon is filled with helium, which is a gas that is less dense than water. Helium-filled balloons float in the air because the density of helium is lower than that of the surrounding air. However, when we place the helium-filled glass balloon in water, the density of the water is significantly higher than that of helium. Even though the glass balloon is very thin and lightweight, the overall density of the balloon, including the glass and helium, will still be higher than that of water. Therefore, the balloon will not displace enough water to generate enough buoyant force to counteract the weight of the balloon itself. As a result, the balloon will sink in water rather than float.
Oh no! Even OpenAI's GPT-3.5 Turbo stumbles, and this time without cement's heavy connotations in play. It gets that "helium-filled balloons float in the air." And it gets that helium is less dense than water. But then it fumbles into thinking "the overall density of the balloon, including the glass and helium, will still be higher than that of water." As Phi did above, OpenAI's GPT-3.5 Turbo has implied that the balloon is heavier than water but lighter than air, which implies that water is lighter than air.
We know it's wrong; but it's not wrong because it lacks facts, or because it has directly contradicted a fact: the whole balloon scenario is far from being fact; it's counterfactual.
Post hoc, we are able to apply reductio ad absurdum to deduce that Phi's and GPT-3.5 Turbo's negative conclusions do in fact contradict another fact, namely that water is heavier than air. But this is a respect in which counterfactual reasoning is genuinely reasoning, not just dreaming: it can be shown to be definitively true or definitively false. Despite deviating from what's factual, it is just as much a form of reasoning as reasoning based on fact.
Fact, Fiction, and Hallucination: What counterfactuals show us
Since ChatGPT overwhelmed public consciousness in late 2022, the dominant concern it immediately and persistently stirred up has been hallucination. Oh, the horror that an AI system could assert something not based in fact! But instead of focusing on factuality alone as the primary standard for AI systems, as has happened in many business use cases, it now seems clear that fact vs. fiction isn't the only axis along which an AI system should be expected, or hoped, to succeed. Even when an AI system's response is based in fact, it can still be irrelevant, a non sequitur, which is why evaluation approaches such as Ragas and TruLens specifically examine the relevance of the response.
When a response fails on the relevance criterion, it's not even the fact-vs.-fiction axis that's at play: an irrelevant response can be just as factual as a relevant one. And by definition, counterfactual reasoning, whether correct or not, is not factual in a literal sense, certainly not in the sense operationalized by RAG systems. That is, counterfactual reasoning is not achieved by retrieving documents that are topically relevant to the question posed. What makes counterfactual reasoning powerful is how it can apply analogies to bring to bear systems of facts that might seem completely out of scope for the question being posed. It might be diagrammed something like this:
[Diagram: image by author]
One might also represent some of these facets this way:
[Diagram: image by author]
What do linear estimators have to do with this? Since counterfactual reasoning is based not in seemingly relevant facts but in systematized facts, sometimes from other domains or topically remote ones, it's not something that obviously benefits from document-store retrieval. There's an analogy here between types of estimators: a gradient-boosted tree model essentially cannot make accurate predictions on data whose features substantially exceed the numeric ranges of the training data, because its decision cut-points can only be placed at values seen at training time. By contrast, linear regression models (which can have closed-form solutions) can make accurate predictions on feature values that exceed the numerical ranges of the training data, to the extent the linear relationship continues to hold.
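Here's a toy illustration of that contrast, assuming scikit-learn and numpy are available: both models fit y = 3x well inside the training range of 0 to 10, but only the linear model extrapolates sensibly to x = 20.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3 * X_train.ravel() + rng.normal(0, 0.5, size=200)

gbt = GradientBoostingRegressor().fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_new = np.array([[20.0]])  # far outside the training range
print("gradient-boosted trees:", gbt.predict(X_new))  # stuck near the largest values seen in training (~30)
print("linear regression     :", lin.predict(X_new))  # about 60, extrapolating the fitted line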
In practical terms, this is why linear models can be helpful in business applications. To the extent your linear model is accurate, it might help you predict the outcome of raising or lowering your product's sales price beyond any price you've ever offered before: a counterfactual price. A gradient-boosted tree model that performs equally well in validation does not help you reason through such counterfactuals, which, ironically, might have been the motivation for developing the model in the first place. In this sense, the explainability of linear models is of a completely different sort from what SHAP values offer, since the latter shed little light on what would happen with data outside the distribution of the model's training data.
The prowess of LLMs has certainly shown that the limits of "intelligence" synthesized ingeniously from crowdsourced human-written text are much greater than expected. That success has eclipsed the former tendency to locate "intelligence" in conceptual understanding, which reveals itself especially in the ability to reason accurately beyond the facts at hand. So I find it interesting to challenge LLMs against this standard, which goes against their grain.
Far from being frivolous diversions, counterfactuals play a role in the progress of science, exemplifying what Charles Sanders Peirce called abduction, as distinct from induction (inductive reasoning) and deduction (deductive reasoning). Abduction basically means the formulation of hypotheses. We might rightly ask: should we expect an LLM to exhibit such a capability? What's the advantage? I don't have a definitive answer, only a speculative one. It's well known within the GenAI community that prompting an LLM to "reason step-by-step" often leads to more satisfactory responses, even though the reasoning steps themselves are not the desired output. In other words, for reasons not yet completely understood, asking the LLM to loosely simulate the most reliable processes of human reasoning (thinking step by step) leads to better end results. Perhaps, somewhat counterintuitively, even though LLMs are not trained to reason as humans do, the lineages of human reasoning still contribute to better AI end results. If so, given the important role that abduction plays in the evolution of science, AI end results might improve to the extent that LLMs become capable of reasoning counterfactually.