Chapter 4: The Statistical Roots

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Andrei Markov	1856–1922	Petersburg mathematician; took over Chebyshev’s probability course on his retirement in 1883; published the 1906 chain-dependence proof in the Kazan Bulletin and the 1913 Eugene Onegin analysis at the Imperial Academy of Sciences. The chapter’s protagonist.
Pavel Nekrasov	1853–1924	Moscow mathematician with a theological-seminary background; the 1902 paper Markov read as arguing pairwise independence is necessary for the Weak Law of Large Numbers — the “abuse of mathematics” Markov set out to refute.
Pafnuty Chebyshev	1821–1894	St. Petersburg mathematician; published a broader proof of the Weak Law of Large Numbers than Bernoulli’s; mentor to Markov; passed his probability course to Markov on his 1883 retirement.
Aleksandr Chuprov	1874–1926	Petersburg statistician; long-running correspondent of Markov; a key source for Markov’s private framing of the Nekrasov dispute (preserved in Ondar 1981).
Maurice Fréchet	1878–1973	French mathematician; his 1938 Méthode des fonctions arbitraires became an important bibliographic bridge between Markov and Shannon.
Claude Shannon	1916–2001	Bell Labs mathematician; “A Mathematical Theory of Communication” (BSTJ, 1948) §3 “The Series of Approximations to English” runs n-gram approximations on English text using the Markoff-process formalism. Reaches Markov via the genus name, not direct citation.

Timeline (1902–2006)

timeline
    title From Nekrasov's free will to Shannon's English approximations
    1902 : Nekrasov publishes the free-will-by-statistics argument (Moscow)
    1906 : Markov publishes the 2-state chain proof in the Kazan Bulletin (counterexample to Nekrasov)
    1907 : Markov's six-year, eight-paper purely theoretical chain-dependence series begins
    1912 : Heinrich Liebmann's German translation of Markov's Calculus of Probabilities published
    1913 : Jan 23 (o.s.) — Markov delivers the Onegin lecture at the Imperial Academy of Sciences (Bernoulli Ars Conjectandi bicentenary)
    1922 : Markov dies in Petrograd
    1924 : Posthumous 4th edition with Aksakov 100,000-letter analysis (appendix); Nekrasov dies
    1933 : Kolmogorov's Grundbegriffe recasts probability theory in measure-theoretic language
    1938 : Fréchet publishes Méthode des fonctions arbitraires (Paris)
    1948 : Shannon's A Mathematical Theory of Communication adopts the Markoff process via Fréchet
    1955 : Morris Halle (MIT) circulates an unpublished English translation of the Onegin paper
    2006 : David Link et al. publish the first widely available English translation in Science in Context

Plain-words glossary

Weak Law of Large Numbers (WLLN) — The theorem that, for an arithmetic average over many trials, the average converges in probability to the expected value. Bernoulli proved a special case in 1713; Chebyshev gave a broader version. Markov’s 1906 paper showed it still holds when successive trials depend on each other in a chain.
Independence (of random variables) — Two random variables are independent if knowing one tells you nothing about the other. Coin flips are the textbook example. Nekrasov, as Markov read him, argued that the WLLN held only for independent variables; Markov’s 1906 paper supplied a counterexample.
Markov chain — A sequence of random states where the probability of the next state depends only on the current one (first-order chain) — or, in higher orders, only on the last $n$ states. Markov’s 1906 proof was for a two-state first-order chain with all four transition probabilities strictly between 0 and 1.
Transition probability — In a Markov chain, the probability of moving from one state to a specified next state. Markov’s Onegin analysis estimated, for example, that the probability of a vowel after a consonant in Pushkin’s Cyrillic was $p_0 \approx 0.663$ ; the probability of a vowel after a vowel was $p_1 \approx 0.128$ .
Dispersion coefficient — A goodness-of-fit statistic: the ratio of empirical variance to the variance the chain model predicts. A coefficient near 1 means the model fits; departure from 1 measures how much the data’s variability deviates from what the model would produce. For the Onegin counts the empirical value was 0.208; the second-order chain predicted 0.195 (good fit), the first-order chain predicted 0.300 (worse fit).
N-gram — A contiguous sequence of $n$ tokens (here, letters or words) drawn from a text. Shannon’s 1948 §3 ran first-, second-, third-order, and word-level n-gram approximations to English. N-grams are the simplest predictive language model and the direct descendant of Markov’s count-and-condition method on Pushkin.

The math, on demand

Markov’s 1906 proof and 1913 demonstration both run on a short list of definitions and counts.

Two-state first-order chain. If the current state is A, the next state is A with probability $p_1$ and B with probability $1 - p_1$ . From B, A with probability $p_0$ and B with probability $1 - p_0$ . Markov 1906 proved the WLLN holds for any chain with $0 < p_0, p_1 < 1$ .
Independence test. Under independence the two transition probabilities coincide: $p_1 = p_0$ . Markov’s diagnostic was the gap $\delta = p_1 - p_0$ ; the larger $|\delta|$ , the deeper the dependence between successive trials.
Onegin block variance. Average vowels per 100-letter block: $43.19$ , so the unconditional vowel probability $p \approx 0.432$ . Sum of squared deviations $1022.8$ over $200$ blocks gives variance $5.114$ and standard deviation $\sigma \approx 2.26$ .
Onegin first-order counts. Out of $8{,}638$ vowels in $20{,}000$ letters, $p_1 = 1{,}104 / 8{,}638 \approx 0.128$ (a vowel follows a vowel). Out of the consonant-following positions, $p_0 \approx 0.663$ (a vowel follows a consonant). Therefore $\delta = p_1 - p_0 = -0.535$ — a substantial separation from independence.
Second-order chain. Conditioning on the previous two letters yields four transition probabilities $p_{1,1}, p_{1,0}, p_{0,1}, p_{0,0}$ . Markov found $p_{1,1} \approx 0.104$ — after two vowels in a row, the chance of a third vowel is lower than after just one ( $0.104 < p_1 = 0.128$ ), and far below the unconditional $p \approx 0.432$ .
Dispersion coefficient as model fit. The empirical coefficient was $0.208$ ; the second-order chain predicted $0.195$ , the first-order chain $0.300$ . The closer the model’s prediction to the empirical value, the better the fit — so the chain of dependence in Pushkin’s prose extends at least one letter back.

Russia at the turn of the twentieth century held two empires of probability theory, separated by the Pulkovo meridian and a long-running dispute over what mathematics was permitted to claim about human action. On one side stood St. Petersburg University, the secular, republican heir to a mathematical lineage stretching from Jakob Bernoulli’s 1713 Ars Conjectandi through Pafnuty Chebyshev, who had published a broader proof of the Weak Law of Large Numbers than Bernoulli’s own and whose probability-theory course passed to a single successor on his retirement in 1883. On the other stood Moscow University, an ecclesiastical stronghold whose faculty was closely aligned with the Russian Orthodox Church.

The spark for what would become the mathematical foundation of language modeling was struck in Moscow. Pavel Alekseevich Nekrasov, a mathematician on the Moscow faculty who had begun his education at a theological seminary, published a paper in 1902. Nekrasov’s claim, according to Andrei Andreevich Markov’s interpretation, was that the pairwise independence of summands was both necessary and sufficient for the Weak Law of Large Numbers to hold. The implication of this argument was sweeping. As paraphrased by the science writer Brian Hayes, the popular form of the argument ran like this: voluntary acts are independent, like coin flips; the law of large numbers applies only to such independent events; data gathered by social scientists, such as crime statistics, conform to the law of large numbers; therefore, the underlying acts of individuals must be independent and voluntary. Free will, in Nekrasov’s formulation, was mathematically demonstrable.

From St. Petersburg, Markov read the argument as a fundamental misuse of probability theory. Markov, born in Ryazan in 1856 to a forestry-service official who later managed an aristocrat’s estate, was by most accounts a nettlesome character, abrasive even with friends and fiercely combative with rivals. In private correspondence to the statistician Aleksandr Aleksandrovich Chuprov, he called Nekrasov’s work “an abuse of mathematics.” He was not a man given to measured opposition. When Tsar Nicholas II vetoed the writer Maxim Gorky’s election to the Imperial Academy of Sciences in 1902, Markov announced he would refuse all future tsarist honors — without resigning his own membership. When the Russian Orthodox Church excommunicated Leo Tolstoy in 1908, Markov asked the Church to excommunicate him as well, a request that was granted. The historian Eugene Seneta, working from the Markov-Chuprov correspondence preserved in Kh. O. Ondar’s 1981 edition, records that Markov’s antipathy did not cool with time: in late 1910, after Chuprov mentioned Nekrasov in a positive light, Markov responded with what Seneta calls “fiery post-cards.” Eight years after the original 1902 paper, the dispute was still active.

A note on the historical record is in order. Seneta is more cautious than Hayes about the substance of the 1902 argument. Markov interpreted Nekrasov as claiming that pairwise independence was necessary for the Weak Law to hold; whether Nekrasov literally argued that, in those terms, is a question this chapter cannot settle, because Nekrasov’s 1902 paper has not been read directly. The popular free-will gloss in Hayes is what Markov/Hayes took Nekrasov to imply; the more carefully phrased mathematical claim in Seneta is what Markov decided he had to refute. The two readings are compatible, but the chapter is recording Markov’s reading of Nekrasov, not Nekrasov’s verbatim self-description.

Markov’s own career had unfolded along the Petersburg lineage with unusual exactness. He had taken over Chebyshev’s probability-theory course at St. Petersburg University on Chebyshev’s retirement in 1883, the year he also married Maria Valvatieva, daughter of the owner of the estate his father had once managed. He was elected Adjunct of the Imperial Academy of Sciences in 1886 at Chebyshev’s proposal and Extraordinary Academician on January 30, 1890 (Old Style). He was promoted to full Professor at the university in 1893; in 1896, two years after Chebyshev’s death, he was elected Ordinary Academician at the Academy. The first edition of his textbook Ischislenie Veroiatnostei (Calculus of Probabilities) appeared in 1900; the second would follow in 1908. By the time Nekrasov’s 1902 argument reached St. Petersburg, Markov was a tenured pillar of the Petersburg school with two decades of teaching tied to the Bernoulli-Chebyshev tradition.

The 1906 Counterexample

Markov retired from St. Petersburg University as Emeritus Professor in 1905, though he remained an Academician and continued to teach the probability course he had inherited. It was from this position that he formulated his response. In a 1906 paper for the Kazan Izvestiia, Markov constructed a deliberate counterexample to Nekrasov’s claim. He proved that the Weak Law of Large Numbers held for a two-state chain in which all four transition probabilities lay strictly between zero and one. If the current state was A, the next state would be A with probability p1 and B with probability 1 − p1, and the law of large numbers would still apply despite the dependence between successive trials.

The mathematical force of the refutation was austere. Nekrasov had — as Markov read him — required pairwise independence for the Weak Law to hold. Markov supplied an explicit counterexample in which successive trials were demonstrably not independent, and in which the law of large numbers nonetheless converged. The two-state chain was the simplest non-trivial dependent process available, and the convergence held without leaning on any property an independent trial would have provided. After 1906, the inference from “social statistics conform to the WLLN” to “the underlying acts must be independent” was no longer mathematically licensed. Whether human acts were free or determined remained, as it had been before Nekrasov tried to rescue free will with a theorem, a metaphysical question.

The proof was symbolic, written on paper, and ran entirely without empirical illustration. It was the beginning of a six-year, eight-paper purely theoretical series extending chain dependence theory to multiple states, complex chains, and events left without observation. The follow-ups had a clear bibliographic shape. The 1907 paper “Recherches sur un cas remarquable d’épreuves dépendantes” appeared in the Bulletin de l’Académie Impériale des Sciences de St.-Pétersbourg; a slightly different French version reached Acta Mathematica in 1910, giving the work its first extended Western European readership. A 1908 paper in the Academy’s Mémoires extended the limit theorems to chain sums; a 1910 paper generalized the chain proof beyond two states; further papers in 1911 and 1912 covered “valeurs liées qui ne forment pas une chaîne véritable,” chains of multiple steps, and chains in which some events were left without observation. Across all of these papers, Markov did not look for empirical applications. In the words of David Link, during those six years Markov “could not think of a practical illustration and empirical verification of his deductions, or was not interested in finding one.” Markov himself was explicit. In a letter to Chuprov he wrote that he was concerned only with questions of pure analysis, adding that he referred to the question of the applicability of probability theory with indifference.

The Bernoulli Bicentenary Lecture

That indifference broke on January 23, 1913, by the Old Style Russian calendar — February 5 in the Gregorian. While Russia at large prepared to mark three hundred years of Romanov rule, the bicentenary of an entirely different anniversary fell that winter: the 1713 posthumous publication, by Jakob Bernoulli’s nephew Niklaus, of Ars Conjectandi, the book that had first formulated the law of large numbers. Markov organized a lecture for the physical-mathematical faculty of the Imperial Academy of Sciences in St. Petersburg to commemorate it. Link, writing as the cautious historian of the lecture, observes only that the choice “gave the physical-mathematical faculty’s event a political dimension”; Hayes, writing as Markov’s biographer, treats the choice as a deliberate counterpoint to the Romanov tercentenary. Whichever reading is correct, the same mathematical world that had heard six years of Markov’s purely theoretical chain proofs would now hear an empirical demonstration of one of them — and the empirical case would be a literary text.

His test set was literary, but the choice was not. Pushkin’s Eugene Onegin — begun during Pushkin’s southern exile in Odessa, after his banishment from St. Petersburg for political poetry, and continued at his mother’s country seat at Mikhaylovskoe — offered Markov a long, structurally homogeneous corpus of standardized Cyrillic. He took the first 20,000 letters: the entire first chapter and sixteen stanzas of the second. He stripped out the punctuation, the spaces, the hard signs (ъ), and the soft signs (ь), noting that the latter were not pronounced independently but merely modified the preceding letter.

At the start of the lecture Markov defined seven probabilities. He called the unconditional probability that a letter is a vowel p; the first-order conditional probabilities — the probability of a vowel given the previous letter was a vowel, or a consonant — p1 and p0; and the four second-order conditional probabilities, conditioning on the previous two letters, p1,1, p1,0, p0,1, and p0,0. The first three would carry the first-order argument; the four others were reserved for a sharper, second-order test of the chain.

He organized the unbroken stream of Cyrillic characters into two hundred small tables, each containing ten rows by ten columns. He counted the vowels in each column, paired columns one and six, two and seven, three and eight, four and nine, and five and ten, and entered the sums in bold. This layout permitted simultaneous horizontal and vertical aggregation across each table; forty such tables of paired sums fill an entire page of the published lecture. Link, examining the apparatus a century later, characterized the transformation Markov had performed on Pushkin’s text as “a paper-automaton” and a “writing game with discrete characters” — the alphabet recast as a finite state space, the text recast as a sequence of transitions between those states.

The arithmetic mean number of vowels per hundred-letter block was 43.19, giving an unconditional vowel probability p ≈ 0.432. The variance, computed via the method of least squares — the technique developed by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809 — came to 1022.8 / 200 = 5.114, with a corresponding standard deviation of about 2.26. Plotted, the per-block vowel counts fell along a normal curve almost as cleanly as if the letters had been drawn independently. Independence, in other words, was not contradicted at the block level — a long sample averages out, regardless of the dependence within. The dependence would have to be looked for at the level of pairs.

Returning to the unbroken stream of letters, Markov counted the pair patterns. Out of 8,638 total vowels, he found 1,104 vowel-vowel pairs, giving p1 — the probability of a vowel after a vowel — as 1104 / 8638 = 0.128. Excluding the first letter, which has no predecessor, he found 7,534 consonants following a vowel and 3,827 consonant-consonant pairs out of 11,361 available consonant-following positions, giving p0 — the probability of a vowel after a consonant — as 0.663. The probability of a consonant following a vowel, q1, was therefore 0.872. The diagnostic signature of the chain was the difference δ = p1 − p0 = 0.128 − 0.663 = −0.535. Under independence, p1 and p0 would coincide; the larger the absolute gap, the deeper the dependence between successive letters. Half a unit was a substantial separation.

Markov did not stop at first-order dependence. He counted the triple patterns to examine the second-order chain: 115 vowel-vowel-vowel sequences and 505 consonant-consonant-consonant sequences in the 20,000-letter sample. After two vowels in a row, the probability of a third vowel collapsed to 0.104 — lower than the unconditional rate of 0.432 by a factor of more than four, and lower even than p1, the probability of a vowel after a single vowel. After two consonants, the probability of another consonant was 1 − 0.132 = 0.868. When these figures were plugged into a two-step chain, the theoretical dispersion coefficient came to 0.195. This was a much tighter fit to the empirical dispersion coefficient of 0.208 than the 0.300 yielded by the one-step chain. The chain of dependence was real, and it demonstrably extended at least one letter back.

The labor required for this pencil-and-paper tabulation was immense. A century later, attempting a similar count on an English translation, Hayes circled vowel-vowel pairs on a printout with a pen and missed 62 of 248 of them in the first ten stanzas; he concluded that Markov was “probably faster and more accurate than I am” and “must have spent several days on these labors.” The result was a fully empirical demonstration that the Weak Law of Large Numbers held on a chain whose links were measurably real — and a reminder, for any reader inclined to assume that the late-twentieth-century statistical turn in language modeling was the first time anyone had counted letters by hand, that the counting had been done in 1913, on Pushkin, against a free-will theorem.

The Forty-Year Silence

Markov’s Onegin analysis was published in the Bulletin of the Imperial Academy of Sciences of St. Petersburg in Russian only. A few years later he repeated the exercise on a different corpus: the first 100,000 letters of Sergei Timofeevich Aksakov’s autobiographical Childhood Years of Bagrov’s Grandson, a memoir Aksakov had dictated to his daughter because he had been blind for ten years. This second analysis was included in the appendix to the posthumous 1924 fourth edition of Markov’s Calculus of Probability. Markov had survived the October Revolution by five years; in 1921 he had complained to the Academy that he could not attend meetings for lack of suitable footwear, and a committee chaired by Maxim Gorky — the same Gorky whose 1902 election the Tsar had vetoed — eventually located him a pair, which Markov dismissed as “stupidly stitched.” Markov died in Petrograd on July 20, 1922; Nekrasov followed in 1924, his work falling into post-revolutionary obscurity.

Markov’s chains traveled west all the same — not through the Onegin paper, but through the term-of-art “Markoff process.” Heinrich Liebmann’s 1912 German translation of Markov’s Calculus of Probabilities seeded the chain machinery in the European mathematical community. Andrei Kolmogorov’s 1933 Grundbegriffe der Wahrscheinlichkeitsrechnung recast probability theory — Markov chains included — in measure-theoretic language. Maurice Fréchet’s 1938 Méthode des fonctions arbitraires. Théorie des événements en chaîne dans le cas d’un nombre fini d’états possibles, published in Paris by Gauthier-Villars, generalized the Markoff-process formalism in French. By 1948 the Markoff process was a settled mathematical term in three languages with an established literature, even as the Onegin paper itself remained unread outside Russia.

It was in 1948 that Claude Shannon published “A Mathematical Theory of Communication” in the Bell System Technical Journal. In the third section, “The Series of Approximations to English,” Shannon adopted discrete Markoff processes as the model of an information source. He presented worked first-, second-, third-order, and word-level n-gram approximations to English text. The second-order output read “OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL”; the third-order produced “IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.”

Shannon generalized the construction in §4, “Graphical Representation of a Markoff Process.” Section 5, “Ergodic and Mixed Sources,” supplied the conditions under which the statistics of such a source are well-defined; section 6, “Choice, Uncertainty and Entropy,” attached the entropy measure that would carry the rest of the paper. Crucially, the only footnote routing readers to a detailed treatment of the Markoff-process formalism pointed not to Markov but to Fréchet’s Méthode des fonctions arbitraires. The transmission was conceptual, moving via the genus name of the process rather than direct citation of the 1913 or 1906 papers. Shannon’s text approximations were themselves a hand-counting exercise. As Shannon detailed in section three, his second-order sample was constructed using a book of random numbers in conjunction with a table of letter frequencies. The higher-order examples were generated by opening a book at random, finding the previous letters, and copying the next one — the same kind of pencil-and-paper procedure Markov had used. The labor involved, Shannon noted, became enormous at the next stage. The same constraint Markov had hit on Pushkin in 1913 was, in 1948, still the active constraint at Bell Labs.

Markov’s Onegin paper remained inaccessible to the English-speaking world even as Shannon was building on the Markoff-process formalism. In 1955, the MIT linguist Morris Halle produced a mimeographed English translation at the request of colleagues exploring statistical approaches to language. That translation was never published; it survived only in mimeograph form in a few libraries. Fifty-one more years passed before David Link and his colleagues published a widely available English version in Science in Context in 2006 — ninety-three years after the lecture, and fifty-eight after Shannon’s paper had already absorbed the formalism through Fréchet.

The Honest Close

What lay between Markov’s pencil-and-paper Pushkin tables and any practical use of n-gram models was four decades and an institutional infrastructure that did not yet exist. There was no published English translation of the Onegin paper; there was no theory of communication that needed a stochastic source as input; and there was no digital storage capable of carrying transition matrices for any vocabulary larger than two states. The practical questions modern Markov-chain research cares about — convergence speed, error bounds for prematurely terminated simulations, methods for sampling from intractable distributions — are questions only a computer can answer, and Markov never had one.

Markov did not foresee text generation. He proved a theorem against a theological misuse of mathematics, gave it a literary demonstration, and walked away. The chain of transmission to twenty-first-century language modeling is real, but it is deeply indirect — running through Liebmann, Kolmogorov, and Fréchet for the formalism, through Shannon’s 1948 §3 for the n-gram demonstration on English, and through a fifty-year delay before published English access to the Onegin paper itself. It would take the Cold War theory of communication explored in Chapter 9, and the eventual emergence of the statistical underground covered in Chapter 30, to turn a mathematical counterexample into the foundation of a new artificial intelligence.

One last clarification belongs at the close. The autoregressive language models of the early twenty-first century are statistical predictors of the next token in a sequence, and to that extent they share a family with Markov’s count-and-condition method on Pushkin. They are not, however, Markov chains in any technically meaningful sense. The Markov property — that the next state depends only on the current one, or in higher orders only on the last n — is precisely what a transformer’s full-context attention mechanism violates. The 1906 theorem says something specific about a class of stochastic process; modern language models are not in that class. The lineage from Markov to large language models is genuine, but it runs through the idea of statistical conditioning on prior text, not through the machinery of finite-state chains. That indirection is itself the chapter’s point.