An exploration of the Weak Law of Large Numbers and the Central Limit Theorem through the long lens of history
In my previous article, I introduced you to the Central Limit Theorem. We disassembled its definition, looked at its applications, and watched it do its magic in a simulation.
I ended that article with a philosophical question asked by a famous 17th century mathematician about how nature behaves when confronted with a large collection of anything. A question that was to lead to the discovery of the Central Limit Theorem more than a century later.
In this article, I’ll dig into this question, into the life of the mathematician who pondered over it, and into the big discovery that unfolded from it.
The discovery of the Weak Law of Large Numbers
It all started with Jacob Bernoulli. Sometime around 1687, the 32-year-old first-born son of the large Bernoulli family of Basel in present-day Switzerland started working on the 4th and final part of his magnum opus titled Ars Conjectandi (The Art of Conjecturing). In the 4th part, Bernoulli focused on Probability and its use in “Civilibus, Moralibus & Oeconomicis” (Civil, Moral and Economic) affairs.
In Part 4 of Ars Conjectandi, Bernoulli posed the following question: How do you determine the true probability of an event in situations where the sample space isn’t fully accessible? He illustrated his question with a thought experiment which when stated in modern terms goes like this:
Imagine an urn filled with r black tickets and s white tickets. You don’t know r and s. Thus, you don’t know the ‘true’ probability p=r/(r+s) of drawing a black ticket in a single random trial.
Suppose you draw a random sample of n tickets (with replacement) from the urn and you get X_bar_n black tickets and (n − X_bar_n) white tickets in your sample. X_bar_n is clearly binomially distributed. We write this as:
X_bar_n ~ Binomial(n,p).
In what’s to follow, just keep in mind that even though I’ve placed a bar over the X, X_bar_n is the sum, not the mean, of n i.i.d. random variables. Thus:
- X_bar_n/n is the proportion of black tickets that you have observed, and
- |X_bar_n/n − p| is the unknown error in your estimation of the real, unknown ratio p.
What Bernoulli theorized was that as the sample size n becomes very large, the odds of the error |X_bar_n/n − p| being smaller than any arbitrarily small positive number ϵ of your choice become incredibly and unfathomably large. Shaped into an equation, his thesis can be expressed as follows:
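In the notation we’ve been using, one common rendering of his claim is:

P(|X_bar_n/n − p| ≤ ϵ) = c · P(|X_bar_n/n − p| > ϵ)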
The probabilities P(|X_bar_n/n − p| ≤ ϵ) and P(|X_bar_n/n − p| > ϵ) are respectively the probability of the estimation error being at most ϵ, and greater than ϵ. The constant ‘c’ is some seriously large positive number. Some texts replace the equals sign with a ‘≥’ or a simple ‘>’.
A little bit of algebraic manipulation yields the following three alternate forms of Bernoulli’s equation:
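Writing α for 1/(c + 1), one way the algebra shakes out is:

1. P(|X_bar_n/n − p| ≤ ϵ) = c/(c + 1) = 1 − α
2. P(|X_bar_n/n − p| > ϵ) = 1/(c + 1) = α
3. P(p − ϵ ≤ X_bar_n/n ≤ p + ϵ) = 1 − α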
Did you notice how similar the third form looks to the modern-day definition of a confidence interval? Well, don’t let yourself be deceived by the similarity. It is in fact the (1 − α) confidence interval of the known sample mean (or sum) X_bar_n, not of the unknown population mean (or sum). In the late 1600s, Bernoulli was incredibly far away from giving us the formula for the confidence interval of the unknown population mean (or sum).
What Bernoulli did show came to be known as the Weak Law of Large Numbers.
Bernoulli was well aware that what he was stating was, in a colloquial sense, already woven into the common sense of his times. He said as much quite vividly in Ars Conjectandi:
“…even the most stupid person, all by himself and without any preliminary instruction, being guided by some natural instinct (which is extremely miraculous) feels sure that the more such observations are taken into account, the less is the danger of straying from the goal.”
The ‘goal’ Bernoulli refers to is that of being “morally certain” that the observed ratio approaches the true ratio. In Ars Conjectandi, Bernoulli defines “moral certainty” as “that whose probability is almost equal to complete certainty so that the difference is insensible”. It’s possible to be somewhat precise about its definition if you state it as follows:
For any error threshold ϵ > 0, there exists some really large sample size (n_0) such that as long as your sample’s size (n) exceeds n_0:
P(|(X_bar_n/n) − p| ≤ ϵ) = 1.0 for all practical purposes.
Bernoulli’s singular breakthrough on the Law of Large Numbers was to take the common sense intuition about how nature works and mold it into the exactness of a mathematical statement. In that respect Bernoulli’s thoughts on probability were deeply philosophical for his era. He wasn’t simply seeking a solution for a practical problem. Bernoulli was, to borrow a phrase from Richard Feynman, probing the very “character of physical law”.
Over the next two and a half centuries, a long parade of mathematicians chiseled away at Bernoulli’s 1689 theorem to shape it into the modern form we recognize so well. Many improvements were made to it. The theorem was freed from the somewhat suffocating straitjacket of Bernoulli’s binomial thought experiment. The constraints of identical distribution, and even of independence, of the random variables that make up the random sample were eventually relaxed. The proof was greatly simplified using Markov’s and Chebyshev’s inequalities. Today, the WLLN says simply the following:
Let X_1, X_2, …, X_n be i.i.d. random variables forming a sample of size n with mean X_bar_n, drawn randomly with replacement from a population with an unknown mean μ. The probability of the error |X_bar_n − μ| being less than any positive number ε approaches absolute certainty as you progressively dial up the sample size. And this holds true no matter how tiny your choice of the threshold ε is.
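In symbols, for every ε > 0:

lim (n → ∞) P(|X_bar_n − μ| ≤ ε) = 1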
The WLLN uses the concept of convergence in probability. To get your head around it, you must picture a situation where you are seized with a need to collect several random samples, each of some size. For example, as your town’s health inspector, you went to the local beach and took water quality measurements from 100 random points along the water’s edge. This set of 100 measurements formed a single random sample of size 100. If you repeated this exercise, you got a second random sample of 100 measurements. Maybe you had nothing better to do that day. So you repeated this exercise 50 times and ended up with 50 random samples, each containing 100 water quality measurements. In the above equation, this size (100) is the n, and the mean water quality of any of these 50 random samples is X_bar_n. Effectively, you ended up with 50 random values of X_bar_n. Clearly, X_bar_n is a random variable. Importantly, any one of these X_bar_n values is your estimate of the unknown — and never to be known — true average water quality of your local beach, i.e., the population mean μ. Now consider the following.
When you gather a random sample of size n, no matter how big n is, there is no guarantee that its mean X_bar_n will lie within your specified degree of tolerance ϵ of the population mean μ. You could just have been crushingly unlucky to be saddled with a sample with a big error in its mean. But if you gathered a group of very large random samples and another group of small random samples, and you compared the fraction of the large samples in which the mean did not overshoot the tolerance with the corresponding fraction in the group of small samples, you’d find that the former fraction was larger than the latter. This fraction is the probability mentioned in the above equation. And if you examined this probability in groups of random samples of larger and larger size, you’d find that it progressively increases until it converges to 1.0 — a perfect certainty — as n tends to infinity. This manner of convergence of a quantity to a certain value in purely probabilistic terms is called convergence in probability.
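If you’d like to watch this happen, here is a minimal simulation sketch of the beach scenario. The exponential distribution, the ‘true’ mean of 3.0, the tolerance of 0.2, and the 50-samples-per-size setup are all hypothetical stand-ins I’ve picked for illustration; any population with a finite mean would behave the same way.

```python
import numpy as np

rng = np.random.default_rng(42)

mu = 3.0          # hypothetical 'true' mean water quality (unknown in practice)
epsilon = 0.2     # tolerance around mu
num_samples = 50  # how many random samples we gather at each sample size

for n in [10, 100, 1_000, 10_000]:
    # Each row is one random sample of n measurements. An exponential
    # population with mean mu is just a stand-in; the WLLN doesn't care.
    samples = rng.exponential(scale=mu, size=(num_samples, n))
    sample_means = samples.mean(axis=1)
    # Fraction of samples whose mean landed within epsilon of mu.
    # This fraction estimates P(|X_bar_n - mu| <= epsilon).
    fraction_within = np.mean(np.abs(sample_means - mu) <= epsilon)
    print(f"n = {n:>6}: P(|X_bar_n - mu| <= {epsilon}) ≈ {fraction_within:.2f}")
```

As n grows, the printed fraction creeps toward 1.0, which is exactly the convergence in probability described above.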
In terms of convergence in probability, what the above equation is saying is that the sample mean (or sum) converges in probability to the real population mean (or sum). That is, X_bar_n converges in probability to μ and it can be stated as follows (Note the little p on the arrow):
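X_bar_n →ᵖ μ as n → ∞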
WLLN’s connection to the Central Limit Theorem
I ended my previous article on the CLT by saying how the WLLN forms the keystone for the CLT. I’ll explain why that is.
Let’s recall what the CLT says: the standardized sum or mean of a sample of i.i.d. random variables converges in distribution to N(0,1). Let’s dig into that a bit.
Assume X_1, X_2, …,X_n represent a random sample of size n drawn from a population with mean μ and a finite, positive variance σ². Let X_bar_n be the sample mean or sample sum. Let Z_n be the standardized X_bar_n:
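Using the standard standardization for the sample mean:

Z_n = (X_bar_n − μ) / (σ/√n)

(If X_bar_n denotes the sample sum instead, the analogous formula is Z_n = (X_bar_n − nμ) / (σ√n).)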
Thus, Z_n is the standardized mean obtained using the above transformation. Stated another way, Z_n takes the estimation error (X_bar_n − μ) and re-expresses it in units of the mean’s own standard deviation σ/√n, so that Z_n always has a mean of 0 and a variance of 1 no matter what the sample size is.
The CLT says that as the sample size n tends to infinity, the Cumulative Distribution Function (CDF) of Z_n converges to that of the standard normal random variable N(0,1). That is, Z_n converges in distribution to N(0,1). (Note the little ‘d’ on the arrow used to denote convergence in distribution.)
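Z_n →ᵈ N(0, 1) as n → ∞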
Now, as per the WLLN, the quantity being standardized, namely the estimation error (X_bar_n − μ) sitting in Z_n’s numerator, converges in probability to zero, which is exactly the mean of N(0, 1):
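(X_bar_n − μ) →ᵖ 0 as n → ∞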
Notice how the WLLN says that this error converges, not to a point to the left of or to the right of 0, but exactly to zero. The WLLN guarantees, with perfect precision, that the sample mean is probabilistically centered on μ.
If you remove the WLLN from the picture, you also withdraw this guarantee of centering. Now recall that the standard normal random variable N(0, 1) is symmetrically distributed around a mean of 0. So you must also withdraw the guarantee that the probability distribution of the standardized mean, i.e. Z_n, will converge to that of N(0,1). Effectively, if you take the WLLN out of the picture, you have pulled the rug out from under the CLT.
Two big problems with the WLLN and a path to the CLT
In spite of the WLLN’s importance to the CLT, the path from the WLLN to the CLT is overgrown with tough, thorny brambles that took Bernoulli’s successors several decades to hack through. Look once again at the equation at the heart of Bernoulli’s theorem:
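P(|X_bar_n/n − p| ≤ ϵ) = c · P(|X_bar_n/n − p| > ϵ)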
Bernoulli chose to frame his investigation within a Binomial setting. The ticket-filled urn is the sample space for what is clearly a binomial experiment, and the count X_bar_n of black tickets in the sample is Binomial(n, p). If the real fraction p of black tickets in the urn is known, then E(X_bar_n) is the expected value of a Binomial(n, p) random variable, which is np. With E(X_bar_n) known, the probability distribution P(X_bar_n|p,n) is fully specified. Then it’s theoretically possible to crank out probabilities such as P(np − δ ≤ X_bar_n ≤ np + δ) as follows:
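In the Binomial setting, this is just a sum of Binomial probabilities:

P(np − δ ≤ X_bar_n ≤ np + δ) = Σ C(n, k) · p^k · (1 − p)^(n − k), where the sum runs over all integers k from np − δ to np + δ, and C(n, k) = n!/(k!(n − k)!).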
I suppose P(np − δ ≤ X_bar_n ≤ np + δ) is a useful probability to calculate. But you can only calculate it if you know the true ratio p. And who will ever know the true p? Bernoulli, with his Calvinist leanings, and Abraham De Moivre, whom we’ll meet in my next article and who was to continue Bernoulli’s research, seemed to believe that a divine being might know the true ratio. In their writings, both made clear references to Fatalism and ORIGINAL DESIGN. Bernoulli brought up Fatalism in the final paragraph of Ars Conjectandi. De Moivre mentioned ORIGINAL DESIGN (in capitals!) in his book on probability, The Doctrine of Chances. Neither man made a secret of his suspicion that a Creator’s intention was the reason we have a law such as the Law of Large Numbers.
But none of this theology helps you or me. Almost never will you know the true value of pretty much any property of any non-trivial system in any part of the universe. And if by an unusually freaky stroke of good fortune you were to stumble upon the true value of some parameter then case closed, right? Why waste your time drawing random samples to estimate what you already know when you have God’s eye view of the data? To paraphrase another famous scientist, God has no use for statistical inference.
On the other hand, down here on Earth, all you have is a random sample, its mean or sum X_bar_n, and its sample variance S². Using them, you’ll want to draw inferences about the population. For example, you’ll want to build a (1 − α)100% confidence interval around the unknown population mean μ. Thus, it turns out you don’t have as much use for the probability:
P(np − δ ≤ X_bar_n ≤ np + δ)
as you do for the confidence interval for the unknown mean, namely:
P(X_bar_n − δ ≤ np ≤ X_bar_n + δ).
Notice how subtle but crucial is the difference between the two probabilities.
The probability P(X_bar_n − δ ≤ np ≤ X_bar_n + δ) can be expressed as a difference of two cumulative probabilities:
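P(X_bar_n − δ ≤ np ≤ X_bar_n + δ) = P(np ≤ X_bar_n + δ | X_bar_n, n) − P(np < X_bar_n − δ | X_bar_n, n)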
To estimate the two cumulative probabilities, you’ll need a way to estimate the probability P(p|X_bar_n,n) which is the exact inverse of the binomial probability P(X_bar_n|n,p) that Bernoulli worked with. And by the way, since the ratio p is a real number, P(p|X_bar_n,n) is the Probability Density Function (PDF) of p conditioned upon the observed sample mean X_bar_n. Here you are asking the question:
Given the observed ratio X_bar_n/n, what is the probability density function of the unknown true ratio p?
P(p|n,X_bar_n) is called the inverse probability (density). Incidentally, the path to the Central Limit Theorem’s discovery runs straight through a mechanism to compute this inverse probability — a mechanism that an English Presbyterian minister named Thomas Bayes (of Bayes’ Theorem fame) and the ‘Isaac Newton of France’, Pierre-Simon Laplace, were to independently discover in the late 1700s to early 1800s using two strikingly different approaches.
Returning to Jacob Bernoulli’s thought experiment, the way to understand inverse probability is to look at the true fraction of black tickets p as the cause that is ‘causing’ the effect of observing X_bar_n/n fraction of black tickets in a random sample of size n. For each observed value of X_bar_n, there are an infinite number of possible values for p. With each value of p is associated a probability density that can be read off from the inverse probability distribution function P(p|X_bar_n,n). If you know this inverse PDF, you can calculate the probability that p will lie within some specified interval [p_low, p_high], i.e. P(p_low ≤ p ≤ p_high) given the observed X_bar_n.
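That is, you integrate the inverse density over the interval of interest:

P(p_low ≤ p ≤ p_high | X_bar_n, n) = ∫ from p_low to p_high of P(p | X_bar_n, n) dp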
Unfortunately, Jacob Bernoulli’s theorem isn’t expressed in terms of the inverse PDF P(p|n,X_bar_n). Instead, it’s expressed in terms of the ‘forward’ probability P(X_bar_n|n,p), which requires you to know the true ratio p.
Having come as far as stating and proving the WLLN in terms of the ‘forward’ probability P(X_bar_n|n,p), you’d think Jacob Bernoulli would take the natural next step to invert the statement of his theorem and show how to calculate the inverse PDF P(p|n,X_bar_n).
But Bernoulli did no such thing, choosing instead to mysteriously bring the whole of Ars Conjectandi to a sudden, unexpected close with a rueful sounding paragraph on Fatalism.
“…if eventually the observations of all should be continued through all eternity (from probability turning to perfect certainty), everything in the world would be determined to happen by certain reasons and by the law of changes. And so even in the most casual and fortuitous things we are obliged to acknowledge a certain necessity, and if I may say so, fate,…”
PARS QUARTA of Ars Conjectandi was to disappoint (but also inspire) future generations of scientists in yet another way.
Look at the summations on the R.H.S. of the following equation:
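P(np − δ ≤ X_bar_n ≤ np + δ)
= Σ (over all integers k ≤ np + δ) n!/(k!(n − k)!) · p^k · (1 − p)^(n − k)
− Σ (over all integers k < np − δ) n!/(k!(n − k)!) · p^k · (1 − p)^(n − k)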
They contain big, bulky factorials that are all but impossible to crank out for large n. Unfortunately, everything about Bernoulli’s theorem is about large n. And the calculation becomes especially tedious if you are doing it in the year 1689 under the unsteady, dancing glow of grease lamps, using nothing more than paper and pen. In Part 4, Bernoulli did a few of these calculations, particularly to work out the minimum sample sizes required to achieve different degrees of accuracy. But he left the matter there.
Neither did Bernoulli show how to approximate the factorial (a technique that was to be discovered four decades later by Abraham De Moivre and James Stirling, in that order), nor did he make the crucial conceptual leap of showing how to attack the problem of inverse probability.
Jacob Bernoulli’s program of inquiry into Probability’s role in different aspects of social, moral and economic affairs was, to put it mildly, ambitious even for the current era. To illustrate, at one point in Part 4 of Ars Conjectandi, Bernoulli ventures so far as to confidently define human happiness in terms of probabilities of events.
During the final two years of his life, Bernoulli carried on a correspondence with Gottfried Leibniz (the co-inventor — or the primary inventor, depending on which side of the Newton–Leibniz controversy your sympathies lie — of differential and integral Calculus) in which Bernoulli complained about his struggles in completing his book and lamented how his laziness and failing health were coming in the way.
Sixteen years after starting work on Part 4, in the summer of 1705, a relatively young and possibly dispirited Jacob Bernoulli succumbed to tuberculosis, leaving both Part 4 and Ars Conjectandi unfinished.
Since Jacob’s children weren’t mathematically inclined, the task of publishing his unfinished work ought to have fallen into the capable hands of his younger brother Johann, also a prominent mathematician. Unfortunately, for a good fraction of their professional lives, the two Bernoulli brothers bickered and quarreled, often bitterly and publicly, and often in the way only first-rate scholars can: through the pages of eminent mathematics journals. At any rate, by Jacob’s death in 1705 they were barely on speaking terms. The publication of Ars Conjectandi eventually landed upon the reluctant shoulders of Nicolaus Bernoulli (1687–1759), who was both Jacob and Johann’s nephew and also an accomplished mathematician. At one point Nicolaus asked Abraham De Moivre in England if he was interested in completing Jacob’s program on probability. De Moivre politely refused and curiously chose to go on record with his refusal.
Finally in 1713, eight years after his uncle Jacob’s death, and more than two decades after his uncle’s pen rested for the final time on the pages of Ars Conjectandi, Nicolaus published Jacob’s work in pretty much the same state that Jacob left it.
Just in case you are wondering, Jacob Bernoulli’s family tree was packed to bursting with math and physics geniuses. One would be hard pressed to find a family tree as densely adorned with scientific talent as the Bernoullis’. Perhaps the closest contenders are the Curies (of Marie and Pierre Curie fame). But get this: Pierre Curie was a great-great-great-great-great grandson of Jacob Bernoulli’s younger brother Johann.
Ars Conjectandi had fallen short of addressing the basic needs of statistical inference even for the limited case of binomial processes. But Jacob Bernoulli had sown the right kinds of seeds in the minds of his fellow academics. His contemporaries who continued his work on probability — particularly his nephew Nicolaus Bernoulli (1687–1759) and the two French-born mathematicians Pierre Remond de Montmort (1678–1719) and our friend Abraham De Moivre (1667–1754) — knew the general direction in which to take Bernoulli’s work to make it useful. In the decades following Bernoulli’s death, all three mathematicians made progress. And in 1733, De Moivre finally broke through with one of the finest discoveries in mathematics.
Join me next week when I’ll cover De Moivre’s Theorem and the birth of the Normal curve and how it was to inspire the solution for Inverse Probability and lead to the discovery of the Central Limit Theorem. Stay tuned.
References and Copyrights
Books and Papers
Bernoulli, Jakob (2005) [1713]. On the Law of Large Numbers, Part Four of Ars Conjectandi (English translation by Oscar Sheynin). Berlin: NG Verlag. ISBN 978-3-938417-14-0.
Seneta, E. (2013). A Tricentenary history of the Law of Large Numbers. Bernoulli, 19 (4), 1088–1121. https://doi.org/10.3150/12-BEJSP12
Fischer, H. (2010). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer.
Shafer, G. (1996). The significance of Jacob Bernoulli’s Ars Conjectandi for the philosophy of probability today. Journal of Econometrics, Volume 75, Issue 1, pages 15–32. ISSN 0304-4076. https://doi.org/10.1016/0304-4076(95)01766-6
Polasek, W. (2000). The Bernoullis and the origin of probability theory: Looking back after 300 years. Resonance, Volume 5, pages 26–42. https://doi.org/10.1007/BF02837935
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press.
Todhunter, I. (1865). A History of the Mathematical Theory of Probability: From the Time of Pascal to That of Laplace. Macmillan and Co.
Hald, A. (2007). A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935. Springer.
Images and Videos
All images and videos in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image or video.
Thanks for reading! If you liked this article, please follow me for more content on statistics and statistical modeling.