From the Archives: Generative AI in the ‘00s
I want to take you back to the previous generative AI episode, in the early ’00s. During this time, Geoff Hinton, one of the founding fathers of deep learning, published an influential paper detailing the contrastive divergence algorithm [1]. This discovery allowed Smolensky’s harmonium [2] — which Hinton called the restricted Boltzmann machine — to be trained efficiently. It was soon realised that this model could be used for all sorts of purposes: initialising a feed-forward neural net [3], forming the building blocks of a deep belief net [4], and so on. For at least a decade, the harmonium remained one of the pillars of AI, until we discovered better optimisers for training feed-forward networks directly. Although the harmonium has now gone out of fashion, the model remains useful for modelling discrete data.
In this first article of a two-part series, we’ll focus on the essentials: what harmoniums are, when they are useful, and how to get started with scikit-learn. In a follow-up, we’ll take a closer look at the technicalities.
What are Harmoniums?
The vanilla harmonium — or, restricted Boltzmann machine — is a neural network operating on binary data [2]. These networks are composed of two types of variables: the input, x, and the hidden states, h (Fig. 1). The input consists of zeroes and ones, xᵢ ∈ {0, 1}; together, these observed values x are called the visible states (or units) of the network. Conversely, the hidden units h are latent, not directly observed; they are internal to the network. Like the visible units, the hidden units are either zero or one, hᵢ ∈ {0, 1}.
Standard feed-forward neural networks process data sequentially, directing each layer’s output to the input of the next layer. Harmoniums are different: the model is an undirected network. The network structure dictates how the probability distribution factorises over the graph. In turn, the network topology follows from the energy function E(x, h), which quantifies the preference for specific configurations of the visible units x and the hidden units h. Because the harmonium is defined in terms of an energy function, we call it an energy-based model.
The Energy Function
The simplest network directly connects the observations, x, with the hidden states, h, through E(x, h) = xᵀWh, where W is a matrix of receptive fields (weights). Favourable configurations of x and h have a low energy E(x, h), while unlikely combinations have a high energy. In turn, the energy function controls the joint probability distribution over the visible and hidden units:
p(x,h) = exp[-E(x, h)] / Z,
where the factor Z is a constant called the partition function. The partition function ensures that p(x,h) is normalised (sums to one). Usually, we include additional bias terms for the visible states, a, and the hidden states, b, in the energy function:
E(x, h) = xᵀa + xᵀWh + bᵀh.
Structurally, E(x, h) forms a bipartition in x and h (Fig. 1): visible units only connect to hidden units, and vice versa. As a result, we can easily transform observations x into hidden states h by sampling from the distribution
p(hᵢ=1|x) = σ[-(Wᵀx+b)ᵢ],
where σ(z) = 1/[1 + exp(-z)] is the sigmoid activation function. As you can see, the distribution of h given x is structurally akin to a one-layer feed-forward neural network. A similar relation holds for the visible states given the latent state: p(xᵢ=1|h) = σ[-(Wh+a)ᵢ].
This identity can be used to impute (generate new) input variables based on the latent state h. The trick is to Gibbs sample by alternating between p(x|h) and p(h|x). More on that in the second part of this series.
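To make these formulas concrete, here is a minimal NumPy sketch of the two conditionals and a single Gibbs step, using the sign convention above. The shapes (64 visible and 32 hidden units) and the random initialisation are placeholders, not values from a trained model.
import numpy as np

rng = np.random.default_rng(seed=0)
W = rng.normal(scale=0.01, size=(64, 32))  # Receptive fields W (placeholder values).
a = np.zeros(64)  # Visible bias a.
b = np.zeros(32)  # Hidden bias b.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x):
    return sigmoid(-(W.T @ x + b))  # p(h_i = 1 | x).

def p_x_given_h(h):
    return sigmoid(-(W @ h + a))  # p(x_i = 1 | h).

# One Gibbs step: sample h from p(h | x), then a new x from p(x | h).
x = rng.integers(0, 2, size=64)  # An arbitrary binary input.
h = rng.binomial(n=1, p=p_h_given_x(x))
x_new = rng.binomial(n=1, p=p_x_given_h(h))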
When to use harmoniums
In practice, consider using harmoniums when:
1. Your data is discrete (binary-valued).
Harmoniums have a strong theoretical foundation: it turns out that the model is powerful enough to describe any discrete distribution. That is, harmoniums are universal approximators [5]. So, in theory, harmoniums are a one-size-fits-all model when your dataset is discrete. In practice, harmoniums also work well on data that naturally lies in the unit interval [0, 1].
2. For representation learning.
The hidden states, h, that are internal to the network are useful in their own right. For example, the mapping from x to h can be used as a dimensionality-reduction technique that learns a compressed representation of x. Think of it as principal component analysis, but for discrete data. Another application of the latent representation h is as the feature set for a downstream classifier.
3. To elicit latent structure in your variables.
Harmoniums are neural networks with receptive fields that describe how an example, x, relates to its latent state h: neurons that fire together, wire together. We can use the receptive fields as a read-out to identify input variables that naturally go together (cluster). In other words, the model describes different modules of associations (or correlations) between the visible units.
4. To impute your data.
Since harmoniums are generative models, they can be used to complete missing data (i.e., imputation) or generate completely new (synthetic) examples. Traditionally, they have been used for in-painting: completing part of an image that is masked out. Another example is recommender systems: harmoniums featured in the Netflix competition to improve movie recommendations for users.
Getting started with scikit-learn
Now that you know the essentials, let’s see how to train a model.
As our running example, we’ll use the UCI ML handwritten digits dataset (CC BY 4.0) that ships with scikit-learn. While, technically, the harmonium requires binary input, feeding it probabilities in the unit interval (instead of binary samples thereof) works fine in practice. We therefore normalise the pixel values to [0, 1] prior to training.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler
# Load dataset of 8x8 pixel handwritten digits numbered zero to nine.
digits = load_digits()
X = MaxAbsScaler().fit_transform(digits.data) # Scale to interval [0, 1].
X_train, X_test = train_test_split(X)
Conveniently, scikit-learn comes with an off-the-shelf implementation: BernoulliRBM.
from sklearn.neural_network import BernoulliRBM
harmonium = BernoulliRBM(n_components=32, learning_rate=0.05)
harmonium.fit(X_train)
receptive_fields = -harmonium.components_ # Energy sign convention.
Under the hood, scikit-learn relies on the persistent contrastive divergence algorithm to fit the model’s parameters [6]. (To learn more about the algorithmic details, stay tuned for part two.)
To interpret the associations in the data — which input pixels fire together — you can inspect the receptive fields W. In scikit-learn, W is available as a NumPy array through the components_ attribute after fitting the model (Fig. 2). [Beware: scikit-learn uses the opposite sign convention in the energy function, E(x,h) → -E(x,h), hence the minus sign above.]
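For instance, a minimal matplotlib sketch along the following lines renders the 32 receptive fields as 8×8 images, similar to Fig. 2 (the grid layout is my own choice, and matplotlib is assumed to be available):
import matplotlib.pyplot as plt

# Plot each row of W (one receptive field per hidden unit) as an 8x8 image.
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for field, ax in zip(receptive_fields, axes.ravel()):
    ax.imshow(field.reshape(8, 8), cmap="gray")
    ax.set_axis_off()
plt.show()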
For representation learning, it is customary to use the deterministic value p(hᵢ=1|x) as the representation, instead of a stochastic sample hᵢ ~ p(hᵢ|x). Since p(hᵢ=1|x) equals the expected hidden state <hᵢ> given x, it is a convenient quantity to use during inference, where we prefer determinism over randomness. In scikit-learn, the latent representation p(hᵢ=1|x) can be obtained directly through
H_test = harmonium.transform(X_test)
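As a quick sanity check of the latent representation, you can feed it to a downstream classifier (application 2 above). The sketch below is illustrative only, not part of the original pipeline: it redoes the train/test split so that the digit labels, digits.target, stay paired with the pixels, and it reuses the imports and data from the earlier snippets.
from sklearn.linear_model import LogisticRegression

# Re-split with labels, refit the harmonium, and classify on the latent features.
X_tr, X_te, y_tr, y_te = train_test_split(X, digits.target, random_state=0)
rbm = BernoulliRBM(n_components=32, learning_rate=0.05).fit(X_tr)
classifier = LogisticRegression(max_iter=1_000).fit(rbm.transform(X_tr), y_tr)
print(f"Test accuracy on latent features: {classifier.score(rbm.transform(X_te), y_te):.2f}")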
Finally, to demonstrate imputation or in-painting, let’s take an image containing the digit six and erase 25% of the pixel values.
import numpy as np
mask = np.ones(shape=[8,8]) # Mask: erase pixel values where zero.
mask[-4:, :4] = 0 # Zero out 25% pixels: lower left corner.
mask = mask.ravel()
x_six_missing = X_test[0] * mask # Digit six, partly erased.
We will now use the harmonium to impute the erased variables. The trick is to do Markov chain Monte Carlo (MCMC): simulate the missing pixel values using the pixel values that we do observe. It turns out that Gibbs sampling — a specific MCMC approach — is particularly easy in harmoniums.
Here is how you do it: first, initialise multiple Markov chains (e.g., 100) using the sample you want to impute. Then, Gibbs sample each chain for several iterations (e.g., 1,000) while clamping the observed values. Finally, aggregate the samples from the chains to obtain a distribution over the missing values. In code, this looks as follows:
# Impute the data by running 100 parallel Gibbs chains for 1,000 steps:
X_reconstr = np.tile(x_six_missing, reps=(100, 1))  # Initialise 100 chains.
for _ in range(1_000):
    # Advance the Markov chains by one Gibbs step.
    X_reconstr = harmonium.gibbs(X_reconstr)
    # Clamp the observed pixels; only the erased pixels keep their sampled values.
    X_reconstr = X_reconstr * (1 - mask) + x_six_missing * mask
# Final result: average over the samples from the 100 Markov chains.
x_imputed = X_reconstr.mean(axis=0)
The result is shown in Fig. 3. As you can see, the harmonium does a pretty decent job reconstructing the original image.
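If you want to reproduce a figure like Fig. 3 yourself, a minimal matplotlib sketch (panel layout assumed, matplotlib assumed to be available) is:
import matplotlib.pyplot as plt

# Show the original digit, the partly erased input, and the imputed reconstruction.
fig, axes = plt.subplots(1, 3, figsize=(6, 2))
images = [X_test[0], x_six_missing, x_imputed]
titles = ["original", "masked", "imputed"]
for ax, image, title in zip(axes, images, titles):
    ax.imshow(image.reshape(8, 8), cmap="gray")
    ax.set_title(title)
    ax.set_axis_off()
plt.show()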
Conclusion
Generative AI is not new; it goes back a long way. We’ve looked at harmoniums, an energy-based unsupervised neural network model that was popular two decades ago. While no longer at the centre of attention, harmoniums remain useful today for a specific niche: learning from discrete data. Because they are generative models, harmoniums can be used to impute (i.e., complete) missing values or to generate completely new examples.
In this first article of a two-part harmonium series, we’ve looked at the essentials. Just enough to get you started. Stay tuned for part two, where we’ll take a closer look at the technicalities behind training these models.
Acknowledgements
I would like to thank Rik Huijzer and Dina Boer for proofreading.
References
[1] Hinton, “Training products of experts by minimizing contrastive divergence.” Neural Computation 14.8, 1771–1800 (2002).
[2] Smolensky, “Information processing in dynamical systems: Foundations of harmony theory.” In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, 194–281 (1986).
[3] Hinton and Salakhutdinov, “Reducing the dimensionality of data with neural networks.” Science 313.5786, 504–507 (2006).
[4] Hinton, Osindero, and Teh, “A fast learning algorithm for deep belief nets.” Neural Computation 18.7, 1527–1554 (2006).
[5] Le Roux and Bengio, “Representational power of restricted Boltzmann machines and deep belief networks.” Neural Computation 20.6, 1631–1649 (2008).
[6] Tieleman, “Training restricted Boltzmann machines using approximations to the likelihood gradient.” Proceedings of the 25th International Conference on Machine Learning (2008).