A measure of correlation between discrete (categorical) variables
Introduction
Theil’s U, also known as the uncertainty coefficient or entropy coefficient, quantifies the strength of association between two nominal variables. It assesses how much knowing the value of one variable reduces uncertainty about the other, providing a measure of association that ranges from 0 (no information) to 1 (complete information). A higher value indicates a stronger relationship, making Theil’s U particularly useful in fields such as statistics and data science for exploring relationships within categorical data.
Theory
Theil’s U is a measure of nominal association based on the concept of information entropy. Suppose we have samples from two discrete random variables, X and Y.

Then the entropy of X is defined as:

$$H(X) = -\sum_{x} p(x) \log p(x)$$

And the conditional entropy of X given Y is defined as:

$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y)$$

We can then use the joint distribution (numerator) in combination with the marginal probabilities of X or Y (denominator) to calculate the conditional distributions of X given Y or Y given X, respectively, as follows:

$$p(x \mid y) = \frac{p(x, y)}{p(y)}, \qquad p(y \mid x) = \frac{p(x, y)}{p(x)}$$

The result captures how the probability of one variable changes given the value of the other. We can calculate the probability of X given Y by using the joint probability of X and Y (that is, the probability of different combinations of X and Y) as well as the marginal probability of Y. We insert the result of their division into our formula for H(X|Y) to obtain:

$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(y)}$$
So much for the theory; here’s how we can calculate the conditional entropy of X given Y in Python.
from typing import List, Union
from collections import Counter
import math

def conditional_entropy(
    x: List[Union[int, float]],
    y: List[Union[int, float]]
) -> float:
    """ Calculates conditional entropy H(X|Y) in bits """
    # Count unique values
    y_counter = Counter(y)  # Counts of unique values in y
    xy_counter = Counter(zip(x, y))  # Counts of unique pairs from (x, y)
    # Total number of observations
    total_occurrences = sum(y_counter.values())
    # (Re-)set entropy to 0
    entropy = 0
    # For every unique value pair of x and y
    for xy in xy_counter.keys():
        # Joint probability of x AND y
        p_xy = xy_counter[xy] / total_occurrences
        # Marginal probability of y
        p_y = y_counter[xy[1]] / total_occurrences
        # Conditional probability of x given y
        p_x_given_y = p_xy / p_y
        # Accumulate the terms of the conditional entropy H(X|Y)
        entropy += p_xy * math.log(p_x_given_y, 2)  # Use base 2 instead of natural (base e)
    return -entropy
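To make the formula concrete, here is a minimal, self-contained sketch that checks two hand-computable cases. The helper `cond_entropy_bits` is a hypothetical compact variant written for this illustration (it is not the function above), and the toy lists are made up.

```python
from collections import Counter
import math

def cond_entropy_bits(x, y):
    """H(X|Y) in bits, estimated from empirical frequencies (illustrative helper)."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) * log2(p(y) / p(x, y)) is the same as -p(x, y) * log2(p(x | y))
    return sum(
        (count / n) * math.log2(y_counts[y_val] / count)
        for (_, y_val), count in xy_counts.items()
    )

# Case 1: y fully determines x, so no uncertainty about x remains
determined = cond_entropy_bits(["a", "a", "b", "b"], [0, 0, 1, 1])
print(determined)  # 0.0

# Case 2: within each y group, x is a fair coin flip, so one full bit remains
independent = cond_entropy_bits(["a", "b", "a", "b"], [0, 0, 1, 1])
print(independent)  # 1.0
```

In the second case every pair (x, y) has probability 1/4 and p(x|y) = 1/2, so the sum is 4 × 1/4 × log2(2) = 1 bit.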
Once we have calculated the conditional entropy of X given Y, we can calculate Theil’s U. One last step is to calculate the entropy of X, which we defined at the beginning of this article. The uncertainty coefficient, or proficiency, is then calculated as follows:

$$U(X \mid Y) = \frac{H(X) - H(X \mid Y)}{H(X)}$$
Switching from theory to practice, this can be accomplished in Python using the following code:
import scipy.stats as ss

def theils_u(
    x: List[Union[int, float]],
    y: List[Union[int, float]]
) -> float:
    """ Calculate Theil's U of x given y """
    # Calculate conditional entropy of x given y (in bits)
    H_xy = conditional_entropy(x, y)
    # Count unique values
    x_counter = Counter(x)
    # Total number of observations
    total_occurrences = sum(x_counter.values())
    # Convert all absolute counts of x values in x_counter to probabilities
    p_x = [count / total_occurrences for count in x_counter.values()]
    # Calculate entropy of single distribution x, using base 2 so that it
    # matches the base used in conditional_entropy (scipy defaults to base e)
    H_x = ss.entropy(p_x, base=2)
    return (H_x - H_xy) / H_x if H_x != 0 else 0
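Because U is a ratio of two entropies, the choice of logarithm base does not matter as long as both entropies use the same one. The following is a small self-contained sketch of that point; the `*_sketch` helpers and the toy data are hypothetical and exist only for this illustration.

```python
from collections import Counter
import math

def entropy_sketch(values, base):
    """Empirical entropy H(X) in the given log base (illustrative helper)."""
    n = len(values)
    return sum((c / n) * math.log(n / c, base) for c in Counter(values).values())

def conditional_entropy_sketch(x, y, base):
    """Empirical conditional entropy H(X|Y) in the given log base."""
    n = len(x)
    y_counts = Counter(y)
    return sum(
        (c / n) * math.log(y_counts[y_val] / c, base)
        for (_, y_val), c in Counter(zip(x, y)).items()
    )

def u_sketch(x, y, base):
    """Uncertainty coefficient U(X|Y); returns 0.0 when x is constant."""
    h_x = entropy_sketch(x, base)
    if h_x == 0:
        return 0.0
    return (h_x - conditional_entropy_sketch(x, y, base)) / h_x

x = ["a", "a", "b", "b", "c", "c"]
y = [0, 0, 0, 1, 1, 1]

u_bits = u_sketch(x, y, base=2)
u_nats = u_sketch(x, y, base=math.e)
print(math.isclose(u_bits, u_nats))  # True

# Mixing bases (entropy in nats, conditional entropy in bits) distorts the ratio
h_x_nats = entropy_sketch(x, math.e)
u_mixed = (h_x_nats - conditional_entropy_sketch(x, y, 2)) / h_x_nats
print(math.isclose(u_mixed, u_bits))  # False
```

This is why the entropy and the conditional entropy should be computed with matching bases before they are combined.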
Lastly, we can define a function that calculates Theil’s U for every feature combination within a given dataset. We can do this in Python with the following code:
import itertools
import pandas as pd

def get_theils_u_for_df(df: pd.DataFrame) -> pd.DataFrame:
    """ Compute Theil's U for every feature combination in the input df """
    # Create an empty dataframe to fill
    theilu = pd.DataFrame(index=df.columns, columns=df.columns)
    # Insert Theil's U values into the empty dataframe
    for var1, var2 in itertools.combinations(df.columns, 2):
        u = theils_u(df[var1], df[var2])
        theilu.loc[var2, var1] = round(u, 2)  # fill lower triangle: U(var1 | var2)
        u = theils_u(df[var2], df[var1])
        theilu.loc[var1, var2] = round(u, 2)  # fill upper triangle: U(var2 | var1)
    # Set the main diagonal to 1 (row label == column label)
    for col in theilu.columns:
        theilu.loc[col, col] = 1
    # Convert all values in the DataFrame to float
    return theilu.astype(float)
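As a cross-check of the matrix layout, the sketch below recomputes U from a normalized contingency table on a made-up two-column DataFrame. The helper `u_from_crosstab`, the `toy` data, and its column names are assumptions introduced for illustration. The matrix is arranged the same way as above: the column is the variable whose uncertainty is reduced and the row is the variable that is known.

```python
import numpy as np
import pandas as pd

def u_from_crosstab(x, y) -> float:
    """U(X|Y) from a normalized contingency table, natural log (illustrative helper)."""
    # Rows are values of x, columns are values of y, cells are p(x, y)
    joint = pd.crosstab(np.asarray(x), np.asarray(y), normalize=True).to_numpy()
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    h_x = -np.sum(p_x * np.log(p_x))
    if h_x == 0:
        return 0.0
    cond = joint / p_y            # p(x | y), broadcast over the columns
    nonzero = joint > 0           # skip empty cells, where 0 * log(0) is taken as 0
    h_x_given_y = -np.sum(joint[nonzero] * np.log(cond[nonzero]))
    return float((h_x - h_x_given_y) / h_x)

# Hypothetical toy data: "coarse" is a function of "fine", but not the reverse
toy = pd.DataFrame({
    "fine": [1, 2, 3, 4],
    "coarse": ["low", "low", "high", "high"],
})

# Outer keys become columns (target), inner keys become rows (given)
matrix = pd.DataFrame({
    target: {given: u_from_crosstab(toy[target], toy[given]) for given in toy.columns}
    for target in toy.columns
})
print(matrix.round(2))
```

Here the cell in row `coarse`, column `fine` is U(fine | coarse), which is about 0.5, while the cell in row `fine`, column `coarse` is U(coarse | fine), which is 1.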
Code Example
We will demonstrate the functionality of the code using the well-known Iris dataset. In addition to its numeric variables, the dataset contains a categorical variable, “species.” Traditional correlation measures, such as Pearson’s correlation, are limited in capturing relationships between categorical and numerical features. However, Theil’s U can effectively measure the association between “species” and the other numerical features.
import pandas as pd
import seaborn as sns
import itertools
import matplotlib.pyplot as plt
# Load the Iris dataset from seaborn
df = sns.load_dataset('iris')
# Compute Theil's U for every feature combination in the input df
theilu = get_theils_u_for_df(df)
# Create a heatmap of the Theil's U values
plt.figure(figsize=(10, 4))
sns.heatmap(theilu, annot=True, cmap='Reds', fmt='.2f')
plt.title("Heatmap of Theil's U for all variable pairs")
plt.show()
The result is a heatmap of Theil’s U for all variable pairs. Note that this measure has the advantage of being asymmetric, meaning the relationship between two variables can differ depending on the direction of analysis. For example, Theil’s U can quantify how much information X provides about Y, which may not be the same as how much information Y provides about X.
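The asymmetry is easiest to see on a tiny example. The sketch below is self-contained and uses hypothetical helper names and made-up data: `size` has four distinct values, and `group` collapses them into two.

```python
from collections import Counter
import math

def entropy_bits(values):
    """Empirical entropy H(X) in bits (illustrative helper)."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def conditional_entropy_bits(x, y):
    """Empirical conditional entropy H(X|Y) in bits."""
    n = len(x)
    y_counts = Counter(y)
    return sum(
        (c / n) * math.log2(y_counts[y_val] / c)
        for (_, y_val), c in Counter(zip(x, y)).items()
    )

def u_bits(x, y):
    """Uncertainty coefficient U(X|Y); returns 0.0 when x is constant."""
    h_x = entropy_bits(x)
    return (h_x - conditional_entropy_bits(x, y)) / h_x if h_x else 0.0

size = [1, 2, 3, 4]
group = ["low", "low", "high", "high"]

# Knowing the group halves the uncertainty about size (2 bits -> 1 bit)
print(u_bits(size, group))  # 0.5

# Knowing the size removes all uncertainty about the group
print(u_bits(group, size))  # 1.0
```

A symmetric measure would have to report a single number for this pair, whereas U keeps the two directions apart.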
The interpretation of the results is relatively straightforward: Petal Length and Petal Width have the strongest associations with the categorical variable “species,” both with a value of 0.91. This indicates that knowing the petal dimensions provides a high degree of information about the flower species. Sepal Length also has a moderate relationship with species at 0.55, meaning it offers some information about the species, though less than the petal measurements. Sepal Width has the weakest association with species at 0.33, indicating it provides relatively little information about the flower type. The relatively lower values between the sepal measurements and species highlight that petal dimensions are more informative for predicting species, which is consistent with the known characteristics of the Iris dataset.
Conclusion
In this article, we demonstrated how to calculate Theil’s U to assess associations between categorical and numerical variables. By applying this measure to the Iris dataset, we showed that petal dimensions provide significant insights into predicting flower species, highlighting the effectiveness of Theil’s U compared to traditional correlation methods.
Note: Unless otherwise noted, all images are by the author.
Calculating the Uncertainty Coefficient (Theil’s U) in Python was originally published in Towards Data Science on Medium.