Chapter 34: The Accidental Corpus

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Tim Berners-Lee	—	Software engineer at CERN’s Data Documentation (DD) division; authored “Information Management: A Proposal” in March 1989; named the system “World Wide Web” while writing the code in 1990.
Henry Kučera	—	Brown University co-compiler (with W. Nelson Francis) of the Brown Corpus, the one-million-word benchmark that defined “large” in NLP for three decades.
W. Nelson Francis	—	Brown University co-compiler of the Brown Corpus; co-author of the 1967 analytic volume and the 1979 Manual revision.
Philip Resnik	—	University of Maryland linguist; built STRAND (1999), an early system for mining parallel bilingual text from the open Web.
Michele Banko / Eric Brill	—	Microsoft Research co-authors of the 2001 ACL paper that tested how more data changed disambiguation performance.
Kenneth Church	—	Johns Hopkins computational linguist; key figure in the 1990s empirical revival; authored “A Pendulum Swung Too Far” (2011) warning that data-driven success had marginalised rationalist methods.

Timeline (1961–2011)

timeline
    title The Accidental Corpus — from Brown to the Web
    1961 : Source publication year for all 500 Brown Corpus samples
    1967 : Kučera & Francis publish Computational Analysis of Present-Day American English
    1979 : Brown Corpus Manual revised and amplified by Francis & Kučera
    1980 : Berners-Lee writes Enquire at CERN — personal hypertext precursor
    1989 : March — Berners-Lee drafts Information Management: A Proposal at CERN : System internally called "Mesh"; AI never mentioned
    1990 : Berners-Lee implements the system; renames "Mesh" to "World Wide Web"
    1993 : SIGDAT founded; first Workshop on Very Large Corpora held
    1999 : Resnik publishes STRAND for mining bilingual text from the Web
    2001 : Banko & Brill ACL paper tests data scale in disambiguation
    2006 : Google releases Web 1T 5-Gram corpus via LDC — 1.024 trillion tokens
    2007 : Brants et al. describe distributed LM infrastructure at 2 trillion tokens; introduce Stupid Backoff
    2009 : Halevy, Norvig, and Pereira publish The Unreasonable Effectiveness of Data
    2011 : Church publishes A Pendulum Swung Too Far

Plain-words glossary

Brown Corpus — A 1,014,312-word collection of edited American English assembled at Brown University from texts published in 1961. Divided into 500 samples across fifteen genre categories, it was the standard NLP benchmark for three decades — “large” by the standards of the era, but the size of a few thick novels.
Empirical revival — The shift in computational linguistics, already underway by the late 1980s and institutionalised with SIGDAT in 1993, toward statistical and data-driven methods after a period dominated by symbolic AI and generative grammar. The Web amplified the swing; it did not start it.
STRAND — Philip Resnik’s 1999 pipeline for finding parallel bilingual text on the Web without human curators.
Confusion-set disambiguation — The task of choosing the correct word from a set of look-alikes, such as then vs. than.
Log-linear learning curve — A pattern where each ten-fold increase in training data produces a roughly equal gain in accuracy.
Stupid Backoff — A simplified n-gram smoothing technique designed to work cheaply at very large scale.
Web 1T 5-Gram corpus — A Google-released frequency table of English word sequences derived from public Web text.

The decisive shift in natural language processing during the 1990s and early 2000s was not, at its core, a revolution of algorithms. The statistical techniques that would eventually dominate the field—probabilistic models, n-grams, and empirical inference—were already well-understood in principle. Many had been part of the disciplinary toolkit since the 1950s, only to fall out of fashion during the ascendancy of symbolic AI and generative grammar. By the late 1980s, an empirical revival was already underway, but it faced a binding physical constraint: a scarcity of training data so severe that researchers measured progress in thousands of words rather than billions.

What changed was the emergence of a digitized, public, planet-scale corpus that no linguist had budgeted, indexed, or curated. In March 1989, Tim Berners-Lee wrote an internal proposal at CERN for a distributed hypertext system to manage information about particle accelerators and experiments. His document did not mention artificial intelligence, training data, or machine learning. Yet, over the subsequent fifteen years, the World Wide Web became an accidental laboratory. Researchers discovered that this unintended collection of human language reset the constant on the data side of the machine-learning trade-off so dramatically that simple models trained on web-scale data routinely outperformed sophisticated models trained on curated, hand-labeled datasets. This realization reorganised the field’s research priorities, shifting the centre of gravity from the refinement of learners to the acquisition of data.

The Data Famine

To understand the scale of the change, one must first measure the famine that preceded it. In the early 1990s, the “large corpus” against which empirical methods were benchmarked was the Brown Corpus. Compiled by Henry Kučera and W. Nelson Francis at Brown University in the early 1960s, it comprised exactly 1,014,312 words of edited American English. The compilers had meticulously selected 500 samples of approximately 2,000 words each, representing fifteen categories of prose published in the calendar year 1961. These categories were a snapshot of a vanished linguistic world: press reportage, editorials, and reviews; religion, skills and hobbies, and popular lore; and a broad range of fiction including detective stories, westerns, and romance.

For three decades, this million-word collection was the gold standard. It was a “Standard Corpus of Present-Day Edited American English,” and its manual, revised and amplified by Francis and Kučera in 1979, provided the metadata for thousands of linguistic experiments. As Kenneth Church and Robert Mercer noted in 1993, the Brown Corpus was “considered large” only ten years earlier. By the time of the empirical revival in the late 1980s, researchers were beginning to assemble larger collections, such as the Birmingham/COBUILD Corpus, but the ceiling remained low. Statistical natural language processing was viable in principle, but in practice, it was starved. Models of language were essentially being built on a foundation the size of a few thick novels.

The Brown Corpus was not small because its compilers lacked ambition. It was small because a million words was already an infrastructural achievement. The manual described a sample assembled from 1961 American publications, balanced across categories chosen at Brown after a 1963 conference and circulated first through the machinery of magnetic tape. Even its taxonomy carried the habits of print culture: reportage and editorials near the front; learned prose and belles-lettres in the middle; mysteries, westerns, romances, and humor at the end. It was a carefully engineered slice of edited language, not a stream. That care made it valuable, but it also made clear why statistical language work remained constrained. A corpus that could be inspected, documented, and distributed was still a thing one counted in reels, samples, and printed manuals.

The revival of empiricism during this period was not a choice made from a position of abundance; it was a pragmatic turn emerging from the shadow of the second AI Winter. Between 1987 and 1993, funding for grand, symbolic AI projects had contracted after a cycle of disappointed hype. In this environment, researchers turned toward linguistic data and corpus-based approaches precisely because they offered incremental, deliverable progress on specific tasks like speech recognition and machine translation. The founding of SIGDAT—the Special Interest Group on Linguistic Data—and the first Workshop on Very Large Corpora in 1993 marked the institutionalisation of this shift. Yet, the “very large” corpora of the early 1990s were still measured in the millions of words. This scale forced researchers to focus on the efficiency and sophistication of their algorithms; they had to squeeze every possible insight from the limited data available, as the data ceiling was the binding constraint on what their models could learn.

There is an important wrinkle in this scarcity story. Church and Mercer were already aware, in 1993, that much larger corpora — Birmingham/COBUILD’s Bank of English approached 200 million words around then — existed at some institutions. The problem was not that nobody anywhere had ever stored more than a million words. It was that the reusable, documented, task-ready material available to a typical experiment was still small enough that a million-word corpus could remain the reference point. Data existed unevenly, behind institutional walls, in formats not built for general use, and without the shared evaluation culture that would make it a common substrate. The famine was therefore practical as much as absolute: researchers needed corpora they could train on, compare against, and cite.

Church’s later description of the period matters because it keeps the chronology straight. The empirical turn did not wait for the Web to become enormous. Statistical methods had been present in computational linguistics since the 1950s, had lost prestige during the rationalist period associated with Chomsky and Minsky, and were being revived before web-scale crawls were available to researchers. Karen Spärck Jones made the same disciplinary memory explicit in 2005, describing statistical methods as introduced with computers, not always respected, and then revived in the 1990s. The Web, in other words, did not cause empiricism to reappear. It arrived while the pendulum was already moving, and it changed the force of the swing.

The CERN Memo

While linguists were scouring tape drives for million-word collections, the seed of a much larger corpus was being planted at CERN for entirely unrelated reasons. In March 1989, Tim Berners-Lee, then at CERN, drafted a proposal titled “Information Management: A Proposal.” The document was not a manifesto for a new era of machine intelligence; it was a pragmatic attempt to solve a specific institutional headache.

CERN was a place of high turnover, a hub for thousands of visiting physicists who arrived for short-term experiments and then moved on. “When two years is a typical length of stay,” Berners-Lee wrote, “information is constantly being lost.” The proposal aimed to address the loss of institutional knowledge about the laboratory’s accelerators and experiments. It envisioned a system where linked information could be stored and retrieved across a heterogeneous network of computers, including the VM/CMS mainframes, Macintoshes, and VAX/VMS systems then in use.

The document’s “CERN Requirements” list focused on remote access, non-centralization, and the ability to link to existing data. It was a world of constrained interfaces; Berners-Lee noted that ASCII text on standard 24x80 screens was “in the short term sufficient, and essential.” He even allowed for “Bells and Whistles” like graphics, but only as a secondary concern. Crucially, the proposal was silent on the topics of training data, corpora, or machine learning. Even the name was not yet fixed; Berners-Lee’s prefatory note in the subsequent HTMLized version explains that “the only name I had for it at this time was ‘Mesh’—I decided on ‘World Wide Web’ when writing the code in 1990.”

The technical problem he described was also local in a precise way. CERN’s information was scattered across VM/CMS mainframes, Macintoshes, VAX/VMS machines, and Unix systems. The laboratory needed a way to cross those boundaries without forcing everyone into a single database or a single machine. Berners-Lee’s earlier Enquire system, written in 1980 for his own work in the Proton Synchrotron control system, supplied a personal precedent: it had tracked people, groups, experiments, software modules, and hardware devices as linked objects. The 1989 proposal generalized that experience into a networked information layer. It asked for heterogeneity, remote access, and non-centralisation because CERN’s working life already had all three.

This matters for the chapter’s accident. The proposal’s modesty was real. It did not describe a crawler, an index, a ranking system, a training pipeline, or a store of language examples. It treated plain text on terminal screens as the essential first step and graphics as an optional extra with less immediate reach. Its imagined users were physicists and engineers trying not to lose track of the laboratory’s own knowledge. If the later Web became the largest language collection available to machine-learning researchers, that was not because the 1989 memo contained a hidden AI program. It was because a tool designed to make linking and publishing cheap made something else cheap as a side-effect: leaving human language in machine-readable form.

This was the birth of the accidental corpus. The Web was designed as a medium for human-to-human information management, a way to keep track of software modules and detector configurations, not as a fuel for AI. Its value as a dataset was an emergent property of its success as a publishing platform. By providing a low-friction way for millions of people to publish text, the Web unintentionally digitized a significant portion of human knowledge in a format that was—compared to the paper archives of the past—extraordinarily easy for machines to ingest. The CERN memo solved the problem of losing information in particle physics, but it also accidentally solved the problem of the data famine in linguistics.

Mining the Wild

By the late 1990s, the Web had grown large enough that researchers began to treat it as a research instrument rather than just a publishing medium. One of the first formal demonstrations that the “wild” Web could be mined for high-quality linguistic data came from Philip Resnik at the University of Maryland.

In 1999, Resnik published his work on STRAND (Structural Translation Recognition for Acquiring Natural Data). His motivation was the “Hansard problem”—the fact that almost all pre-existing parallel-text resources (text translated into two or more languages) were restricted to very specific domains, such as the proceedings of the Canadian Parliament. These resources were often subject to “licensing restrictions, usage fees, restricted domains or genres,” and were frequently filled with “dated text (such as 1980’s Canadian politics).”

The byline also matters. Resnik was not writing from a search company with proprietary logs. He was at the University of Maryland’s Department of Linguistics and Institute for Advanced Computer Studies, and the work acknowledged support from the Department of Defense, DARPA/ITO, and Sun Microsystems Laboratories. The Web was public enough that a university lab could turn a commercial search engine into an experimental front end. That openness was part of the accident: the same pages that people published for readers could be recruited into a research pipeline by someone outside the organizations that hosted them.

The opportunity was not infinite, but it was newly visible. Resnik cited the 1997 Babel survey’s estimate of roughly 63,000 primarily non-English Web servers across fourteen languages. That figure is tiny by later standards, but it was already enough to change the search problem. A bilingual corpus no longer had to begin with a parliament, a treaty organization, or a publisher willing to license text. It could begin with a public page that linked to its own translation, a government office that duplicated its forms in two languages, or an organization that maintained mirrored HTML pages for different audiences.

Resnik’s insight was that the Web contained large numbers of “parallel page pairs”—documents that were translations of each other, often linked by structural similarities in their HTML. He built a pipeline that used the AltaVista search engine to find candidate pages using date-range queries. This was a piece of late-1990s infrastructure, not a neutral abstraction: STRAND depended on the existence of a commercial search index and on AltaVista’s advanced-search behavior. To get around a limit of 1,000 hits per query in his scaled-up run, Resnik issued mutually exclusive date-range searches, breaking the hunt into smaller slices.

The core of STRAND was a structural-diff alignment. Resnik reduced both pages to linear sequences of HTML markup tokens, including markers for titles and chunks, and represented chunks partly by their length. The method deliberately ignored the meaning of the words at first. If two pages in different languages shared a nearly identical sequence of titles, paragraphs, lists, and chunk lengths, there was a high probability they were translations of each other. This was a clever inversion of the usual NLP burden. STRAND did not need to understand French or Spanish to suspect a translation pair; it used the page designer’s layout choices as evidence.

Language identification then made the filter stricter. Resnik trained character 5-gram language models on 100,000 characters per language from the European Corpus Initiative, a corpus available through the Linguistic Data Consortium. Candidate pages that looked structurally aligned also had to look linguistically plausible. The machinery was simple enough to describe, but the result was a new kind of acquisition pipeline: search engine first, HTML structure second, character statistics third, human evaluation last.

To verify the quality of this “wild” corpus, Resnik employed a rigorous evaluation methodology. He used two independent native-speaker judges to evaluate the results, calculating Cohen’s kappa for inter-judge agreement. The formal evaluation showed that the method was remarkably precise. Under a stricter language-identification criterion based on character n-grams, STRAND automatically extracted 2,491 English-French document pairs—roughly 1.5 million words per language—with estimated 100% precision and 64.1% recall. Resnik’s work proved that the Web could be turned into a parallel-translation corpus without the need for human librarians or expensive hand-curation. As he concluded, the work “places acquisition of parallel text from the Web on solid empirical footing.” The Web was no longer just a place to read; it was a place to extract.

The intermediate results keep the claim from becoming too smooth. In an English-Spanish experiment, STRAND reached 92.1% precision and 47.3% recall. In an initial English-French evaluation, the numbers were 79.5% precision and 70.3% recall. Tightening the language-identification criterion traded some recall for reliability, yielding the 100% precision and 64.1% recall result on the final English-French set. The lesson was not that the Web automatically produced clean data. The lesson was that noisy public pages could be filtered, measured, and turned into a usable parallel corpus when the structure of the medium itself supplied clues.

The Log-Linear Surprise

If Resnik proved that the data could be found, Michele Banko and Eric Brill at Microsoft Research proved that the scale of that data changed the fundamental rules of the field. In 2001, they challenged the prevailing research culture: instead of building more sophisticated learners for small datasets, what happens if you train existing learners on three orders of magnitude more data?

They chose a test problem called confusion-set disambiguation—correcting errors like {principle, principal}, {then, than}, or {to, two, too}. This was a “surface-apparent” problem where the correct answer was usually contained within the text itself, making it possible to generate labeled training data automatically. Banko and Brill assembled a training corpus of one billion words from news articles, scientific abstracts, government transcripts, literature, and “other varied forms of prose”—a collection roughly a thousand times larger than the largest training corpus previously used for the problem.

The choice of task was narrower than the later mythology of the paper sometimes suggests. Confusion sets such as {weather, whether} are errors a grammar checker might flag because the alternatives are already known. If a corpus contains the sentence with the correct spelling, that occurrence can become a training example without a human annotator deciding the label from scratch. That is why Banko and Brill could scale the experiment so aggressively. They were not solving every semantic ambiguity in English. They were selecting a class of disambiguation problem where ordinary text already carried the answer.

The experimental design was deliberately plain. They held out a one-million-word Wall Street Journal test set, excluded that source from the training corpus, and sampled sentences from the remaining sources probabilistically, weighted by source size, to avoid the artificial bias of simply concatenating one collection after another. The cutoff points were powers of ten: one million, ten million, one hundred million, and one billion words, with intermediate plotted points in the learning curves. The point was not to introduce a new learner. It was to ask, given a fixed amount of engineering time, whether performance came faster from algorithmic refinement, feature invention, or more examples.

They ran four standard learners—winnow, perceptron, naive Bayes, and a memory-based learner—on logarithmically spaced slices of this data. Winnow and perceptron came from Dan Roth’s implementations; naive Bayes supplied the probabilistic baseline; the memory-based learner was particularly simple, using only the word before and the word after as features. The broader feature set for the other learners included nearby words and collocations with words and parts of speech. The results, published as Figure 1 in their ACL paper, were what they described as a “log-linear” surprise.

The learning curves did not flatten out. Even at one billion words, “none of the learners tested is close to asymptoting in performance,” they wrote. Each tenfold increase in data produced a consistent, significant improvement in accuracy. The most striking observation was visible in the comparison: the “worst” learner trained on a billion words—reaching test accuracies of roughly 96%—outperformed the “best” learner trained on only a million words, which hovered around 85-88%. The gap in performance between algorithms narrowed as data grew, while the overall floor of performance rose.

Banko and Brill also noted a physical cost to this performance. Their second figure showed that the size of the learned representations grew roughly linearly with the training corpus size. At a billion words, the winnow and memory-based learners were holding up to a million representation entries. “For some applications, this is not necessarily a concern,” they noted, but for others where space was at a premium, the gains of a billion words might require new compression techniques.

They also tested an intuitively attractive alternative: combining learners by voting. On small training sets, voting can help because different learners make different errors. But above roughly a million words, Banko and Brill found little gain from voting, and on the largest training sets it actually hurt accuracy. Scale changed the usefulness of the ensemble trick. The simple act of adding more labeled examples did more work than aggregating the opinions of learners trained on smaller views of the problem.

Their conclusion was a call for the research community to reconsider its priorities. “A logical next step,” they wrote, “would be to direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora.” For this class of natural language problem, the bottleneck was not the cleverness of the algorithm, but the volume of the evidence.

However, they included a crucial warning: when they tested unsupervised methods on the same corpus, the accuracy eventually plateaued and then declined as more data was added. The contrast was sharp in their later tables. On {then, than}, the supervised billion-word result reached 0.9878 accuracy; on {among, between}, it reached 0.9021. The unsupervised variants did not simply inherit the supervised learning curves by being fed the same pile of text. More data was most effective when the task supplied reliable labels. It was not a magic substitute for annotation, and Banko and Brill’s paper was more careful than the slogans later attached to it.

That caveat does not weaken the paper’s importance; it explains it. The result was powerful because it identified a condition under which scale beat sophistication. When the labels were reliable, when the features were simple but relevant, and when the task was one where local textual evidence carried much of the answer, three extra orders of magnitude in training data changed the ranking of methods. Outside those conditions, the paper offered warning signs rather than a universal law.

The Doctrine Articulated

The realization that “more data” changed the game reached its institutional peak at Google in the mid-2000s. In September 2006, Google contributed the Web 1T 5-Gram corpus to the Linguistic Data Consortium (LDC). This was an institutional milestone: a trillion-token dataset, including 95 billion sentences, 13.6 million unigrams, and 1.18 billion 5-grams, made available to academic researchers. It was exactly one million times larger than the Brown Corpus that had defined the field forty years earlier.

The LDC release was not the whole Web and not a full-text archive. It was a frequency table: English word n-grams and observed counts from publicly accessible web pages, normalized into UTF-8, tokenized in a style similar to the Penn Treebank’s Wall Street Journal conventions, and distributed as compressed text. Its exact catalog figures were almost absurd beside the Brown Corpus: 1,024,908,267,229 tokens; 95,119,665,584 sentences; 314,843,401 bigrams; 977,069,902 trigrams; 1,313,818,354 fourgrams; and 1,176,470,663 fivegrams. The field had moved from a million-word corpus that could be described in a manual to a dataset whose published form was a 24-gigabyte compressed count table.

The release date, September 19, 2006, is a useful marker because it shows the new arrangement in institutional form. Google had the crawl and the processing capacity. The Linguistic Data Consortium supplied a channel through which researchers could obtain a standardized artifact. What moved across that channel was not the messy source Web but a derived representation of it: counts of word sequences up to length five. The public academic object was therefore already shaped by what was feasible to distribute, license, and compute with. It gave researchers access to scale, while also reminding them that scale increasingly arrived as a processed product of someone else’s infrastructure.

By 2007, Thorsten Brants and his colleagues at Google were documenting the infrastructure required to actually work at this scale. They described a distributed language-modeling system that handled up to two trillion tokens and 300 billion n-grams. To manage this volume, they introduced “Stupid Backoff,” a simplified smoothing technique that avoided the complex discounted-probability calculations of Kneser-Ney smoothing. In ordinary n-gram language modeling, a system estimates the probability of a word from the preceding words; when the exact long context is rare or absent, smoothing determines how to fall back to shorter contexts. Stupid Backoff made that fallback cheap. Instead of computing a fully normalized probability distribution, it backed off with a fixed penalty, a choice that fit distributed lookup better than elegant probability theory.

Crucially, Brants and his colleagues showed that as the data grew, the performance of this “stupid” technique approached the quality of the most sophisticated smoothing methods. The paper’s infrastructure comparison is as revealing as the name of the method. Their system served smoothed probabilities from distributed workers, avoiding approaches that required contacting every worker for every n-gram request. At hundreds of billions of n-grams, a model was no longer only a statistical object. It was also a storage layout, a lookup protocol, and a latency budget. The infrastructure was becoming as important as the model.

In 2009, Alon Halevy, Peter Norvig, and Fernando Pereira distilled these lessons into an article for IEEE Intelligent Systems titled “The Unreasonable Effectiveness of Data.” It was here that the new consensus was articulated as a doctrine: “simple models and a lot of data trump more elaborate models based on less data.” They argued that the first lesson of Web-scale learning was to “use available large-scale data rather than hoping for annotated data that isn’t available.”

Halevy, Norvig, and Pereira were careful to frame this as a pragmatic observation. They pointed to the Hays-Efros experiment in scene completion, which showed that while an algorithm might fail when trained on a corpus of thousands of photos, it could perform remarkably well when given millions. They also emphasized that rare events—the “long tail” of language—were collectively frequent on the Web. “Throwing away rare events is almost always a bad idea,” they wrote, because the Web consists of “individually rare but collectively frequent events.”

Yet, they also acknowledged the limits of the approach. Semantic interpretation, they noted, remained hard; data alone did not solve the problem of meaning. Their article contrasted the appeal of adding structure to the Web with the harder task of getting machines to interpret what the structure meant. They advised the field to “follow the data” and choose representations that could leverage the plentiful unlabeled text of the Web, but they warned that the “unreasonable effectiveness” had its boundaries. Not every problem could be solved by simply adding another zero to the token count.

The institutional setting sharpened the point. Resnik’s STRAND work had come from a university laboratory with DoD, DARPA, and Sun support; Banko and Brill worked at Microsoft Research; the trillion-token release and distributed language-model paper came from Google. This pattern does not require a grand theory to notice. Web-scale empiricism rewarded organizations that already had access to large crawls, storage systems, and distributed computation. The LDC release partly narrowed that gap by giving academic researchers a public route to Google’s n-gram counts, but it also made visible the new asymmetry. The question was no longer only who had a better learner. It was who could gather, clean, store, and query the evidence at all.

The Pendulum Caveat

By 2011, the empirical turn was so complete that some of the very researchers who had helped start it began to voice concerns about what had been lost. Kenneth Church, writing in Linguistic Issues in Language Technology, described the field as a “pendulum swung too far.”

Church, who had been a key figure in the 1990s revival, argued that the success of empiricism had marginalized the “rationalist” tradition of linguistics—the deep, theory-driven work of thinkers like Noam Chomsky, Marvin Minsky, and John Pierce. He proposed that the field had forgotten the “PCM” coalition’s arguments. These arguments had been controversial because they had partly triggered the funding winters of the 1970s and 80s, but they contained fundamental insights about the limits of statistical inference.

This was not a complaint from outside the empirical movement. Church placed himself inside the history he was criticizing. His pendulum sketch ran from 1950s empiricism, through 1970s rationalism, through the 1990s return of empiricism at places such as the IBM Speech Group and AT&T Bell Labs, and then toward a possible 2010s correction. The worry was not that data had failed. It was that data had succeeded so thoroughly that the field was tempted to forget why rationalist objections had mattered in the first place.

The accidental corpus had reset the constant on the data side of the equation, but it had not resolved the underlying philosophical tensions of the field. It had made simple models extraordinarily powerful, but it had also moved the centre of gravity toward industry players who possessed the massive compute and infrastructure—like Google’s distributed-LM clusters—required to process trillions of tokens.

The era of the accidental corpus ended not with a final victory for empiricism, but with a recognition of its costs. The field had learned to follow the data, and in doing so, it had discovered a world where “more” was indeed “different.” The bottleneck was no longer the million words of the Brown Corpus; it was now the infrastructure required to harness the trillions of words that humanity had, quite by accident, left behind on the Web. The next challenge would not be finding the data, but building the machinery to make sense of it—a challenge that would soon lead to the rise of MapReduce and the massive data-centers of the late 2000s.