Skip to content

Chapter 35: Indexing the Mind

Cast of characters
NameLifespanRole
Lawrence (“Larry”) Pageb. 1973Stanford CS PhD candidate; first author of the PageRank paper (1999) and co-author of the Anatomy paper (1998); acknowledged for hardware-architecture contributions in Barroso 2003.
Sergey Brinb. 1973Stanford CS PhD candidate, NSF Graduate Fellow; first author of the Anatomy paper (1998) and co-author of the PageRank paper; brought data-mining and large-scale text-collection research background to the project.
Luiz André BarrosoGoogle Systems Lab engineer; first author of Web Search for a Planet (IEEE Micro, March-April 2003); the paper’s primary voice for the commodity-cluster architecture.
Jeffrey DeanGoogle distinguished engineer; second author of Barroso-Dean-Hölzle 2003; worked on crawling, indexing, and query-serving systems at scale.
Urs HölzleGoogle Fellow; third author of Barroso-Dean-Hölzle 2003; vice president of engineering who managed Google’s search-engine operation during its first years.
Sanjay GhemawatGoogle engineer; first author of The Google File System (SOSP 2003), which independently confirmed the design-for-failure thesis.
Timeline (1994–2003)
timeline
title From the Web Worm to the Warehouse Cluster
1994 : World Wide Web Worm indexes 110,000 pages : averaging ~1,500 queries/day
1995 : Brin receives MS from Stanford : Page receives BSE from University of Michigan
1997 : AltaVista claims ~20 million queries/day : Anatomy paper notes "only one of the top four search engines finds itself"
1998 : PageRank paper dated January 29 on cover : Anatomy paper presented at WWW7, Brisbane, April : Google prototype indexes 24 million pages on Solaris/Linux at google.stanford.edu
1999 : PageRank formally published as Stanford InfoLab Tech Report 1999-66
2002 : RackSaver.com rack of 88 dual-CPU Xeon servers priced around 278,000 dollars vs. 758,000 dollars for comparable single high-end server
2003 : Barroso, Dean, Hölzle publish Web Search for a Planet in IEEE Micro vol. 23 no. 2 : GFS paper corroborates design-for-failure thesis at SOSP
Plain-words glossary
  • PageRank — An algorithm that scores a web page’s importance by examining which other pages link to it, and weighting each link by the importance of the linking page. Importance is computed globally across all pages simultaneously, not page by page.
  • Eigenvector (principal) — In a matrix equation, the vector that points in the same direction after the matrix is applied.
  • Random surfer model — An intuitive interpretation of PageRank based on a user moving through links and sometimes jumping elsewhere.
  • Damping factor — The parameter that balances following links against jumping to another page in the random-surfer model.
  • Commodity cluster — A computing infrastructure built from many inexpensive, off-the-shelf machines.
  • Warehouse-scale computing — The engineering discipline that treats a datacenter as one coordinated computing system.

By the middle of the 1990s, the World Wide Web had transitioned from a curiosity of the academic world into a sprawling, uncurated frontier of human expression. This transition created an immediate and existential crisis for the technologies designed to navigate it. In 1994, the World Wide Web Worm (WWWW)—one of the first major attempts to index the nascent medium—cataloged a total of 110,000 web pages and web-accessible documents. At the time, this seemed like a vast dataset, receiving an average of about 1,500 queries per day. But it was merely the baseline for an explosion of content and traffic that would soon outpace every existing search technology.

By November 1997, the top search engines of the era claimed indices that ranged from two million documents for WebCrawler to a hundred million documents for the most aggressive crawlers, according to industry reports from Search Engine Watch cited by Brin and Page. The volume of query traffic had risen just as dramatically; AltaVista, then a leading commercial search engine, reported handling roughly 20 million queries every day. However, this quantitative growth was not met with a qualitative improvement in relevance. As the web grew larger, it became paradoxically more difficult to find anything of value. The PageRank paper would describe the web of that moment as already more than 150 million pages, with a doubling time of less than a year, and as a link graph of roughly 150 million nodes connected by about 1.7 billion edges. Search was no longer a problem one could solve by making a catalog a little larger.

The primary failure of the 1990s web was structural. Early search engines operated on a model derived from traditional information retrieval: keyword matching. They crawled the web, recorded the words on each page, and used the frequency and location of those words as major signals. That could be useful inside a smaller, curated, or at least less adversarial corpus. The web was different. A page could mention a term often without being a good answer to a user’s question; a page could be popular in one neighborhood of the web and invisible in another; and the act of publishing a page was open to anyone, including anyone trying to game the retrieval system. Brin and Page’s 1998 Anatomy paper put the problem plainly: automated search engines that relied on keyword matching usually returned too many low-quality matches.

The crisis was so acute that it threatened the utility of the web itself. The web had become a massive dataset that could not be trusted. The “junk results” and “low-quality matches” were not merely a technical annoyance; they were an intellectual limit. According to the Anatomy paper published by Larry Page and Sergey Brin in 1998, the failure was diagnostic: as of November 1997, “only one of the top four commercial search engines finds itself” when queried with its own name. To find a relevant needle in a hundred-million-document haystack, search had to be redefined. It was no longer a text-retrieval problem; it was a problem of authority and trust.

At Stanford University, two PhD candidates in the Computer Science Department approached this problem not as a commercial opportunity, but as a research challenge. Lawrence Page and Sergey Brin brought distinct but complementary backgrounds to the project. Brin, who had received his BS from the University of Maryland in 1993 and his MS from Stanford in 1995, was an NSF Graduate Fellow whose research interests included search engines, information extraction from unstructured sources, and data mining of large text collections. Page, a graduate of the University of Michigan with a 1995 BSE in Computer Engineering, was interested in the structural properties of the web’s link graph and the human-computer interaction challenges of massive-scale search.

Later accounts of the Stanford years describe borrowed equipment and lab space in the Gates Computer Science Building; the primary published trace is narrower but still revealing. Brin and Page began a research project remembered as BackRub, a name that fit its function: it was designed to crawl the web and track “backlinks”—the incoming links from one page to another. The existence of this predecessor is confirmed in the published record by the URL google.stanford.edu/~backrub/, which appeared in their early research references.

The core insight that Page and Brin brought to BackRub was elegant and rooted in the very environment where they worked: the academic citation. In the world of scientific research, the importance of a paper is not measured by how many times a word like “physics” appears in its abstract, but by how many other researchers cite it. Page and Brin realized that a hyperlink on the World Wide Web was effectively a digital citation. Every link from one page to another was a “vote” of confidence, a signal of human interest and attention.

In the PageRank paper, dated January 29, 1998 on its cover and formally published in 1999 as Stanford InfoLab Tech Report 1999-66, Page, Brin, Rajeev Motwani, and Terry Winograd explicitly invoked this analogy. They wrote that standard citation-analysis techniques naturally suggested themselves for the web’s hypertextual citation structure and that each link could be thought of as an academic citation. The contribution was not the invention of citation analysis. The paper begins by acknowledging a large literature on the subject, including works by Eugene Garfield and William Goffman, as well as the contemporaneous “Hubs and Authorities” research of Jon Kleinberg.

However, the PageRank authors also identified why the web’s link graph was a far more challenging environment than a database of peer-reviewed journals. In academic publishing, there is a baseline of quality control and a roughly similar assumption of the “paper” as a unit of thought. On the web, there was no such control. Pages varied wildly in size, durability, purpose, and authorial intent. Some were scholarly documents; some were home pages; some were lists of links; some were commercial pages; many were fragments. The paper’s modesty here matters: it does not present the web as a cleaner citation database, but as a harder and less regulated one.

Their contribution was to generalize citation analysis for this massive, uncurated graph. They recognized that simple backlink counting—simply totaling the number of links to a page—was inadequate. The PageRank paper says this directly: simple backlink counts have problems on the web that ordinary academic citation databases do not. In the academic world, a citation from a landmark paper in Nature or Science carries more weight than a citation in an obscure regional newsletter. On the web, the same logic had to apply: a link from a high-importance, highly cited page should weigh more than a link from a page that no one else points to.

This realization made the definition of importance recursive: a page is important if important pages point to it. To solve this, the authors moved away from simple counting and toward a mathematical model that could calculate the “objective” importance of every node on the web simultaneously. The Anatomy paper called PageRank “an approximation of citation importance on the web”; the PageRank paper framed it as a method for mechanically measuring the human interest and attention directed toward pages. That phrasing is carefully chosen. PageRank did not understand meaning the way a human reader understands meaning. It converted a distributed pattern of human linking behavior into a number that could be computed over the whole graph.

To move from an analogy of academic citations to a working algorithm, Page and Brin formalized the web’s link structure as a massive probability distribution. They introduced the “Random Surfer Model” as an intuitive way to understand how PageRank actually works.

The model imagines a user who navigates the web by clicking links at random. They start on a page, pick one of the available outlinks, and move to the next page. In a perfectly connected web, the surfer would eventually reach every page. The PageRank of a specific page is, mathematically, the probability that this random surfer will be on that page at any given moment. Pages with many incoming links from other highly ranked pages will naturally have a higher standing probability of being visited.

However, the real web is not a perfectly connected graph. A surfer might get trapped in a small loop of pages, clicking forever inside a closed neighborhood rather than sampling the whole web. To account for this, the PageRank paper added a personalization vector, or random-jump source. The paper’s own intuition was deliberately ordinary: a real web surfer eventually gets bored and, instead of clicking another link, jumps to some other page.

In their experiments, they used a jump probability of 0.15. This meant that at each step, the surfer had an 85% chance of following a link and a 15% chance of starting over on a random page. This 0.85 “damping factor” ensured the algorithm converged to a stable solution on the corpora the authors tested.

The mathematical formulation of this idea is both dense and powerful. If A is the matrix representing the link-transition probabilities of the entire web, and E is a vector representing the random-jump destinations, then the vector of PageRanks, R', is an eigenvector of the matrix (A + E·1), up to a normalizing constant. In historical terms, this is the chapter’s crucial turn: a question that had looked like editorial judgment became a linear-algebra calculation on a graph. Because the rank of a page depends on the ranks of the pages pointing to it, the solution cannot be calculated in a single pass. It requires a fixed-point computation. The PageRank paper gives the iterative algorithm as R_{i+1} ← AR_i + dE, repeated until the difference between successive rank vectors falls below a small threshold. The algorithm’s power lay partly in this operational simplicity. It did not require a human editor to inspect the web; it repeatedly redistributed rank through the link graph until the distribution stopped changing meaningfully.

That recursive structure also explains why PageRank felt different from a keyword score. A keyword score could be computed locally by looking at a page and counting features. PageRank had to be computed globally, because the value of a page depended on values elsewhere in the graph, and those values depended on still other pages. The algorithm did not make every query answer correct, and the Anatomy paper still combined PageRank with text matching and other signals. But it gave the search engine a durable ordering principle: before reading the words as an answer to a query, ask how the web itself had routed attention.

The scale of their early experiments was unusually large for a university research project. Working with the Stanford WebBase crawl, they tested PageRank on a repository of approximately 24 million web pages. The paper estimated the broader crawlable web as roughly 150 million nodes and 1.7 billion edges, with about 11 links on each page. The same experimental world appears from another angle in the Anatomy paper: a crawling and indexing system processing about 50 pages per second, a 24-million-page corpus, and a search engine already large enough that the bottlenecks were not abstractions but disks, files, and machines.

One of the more telling experiments involved personalizing the jump vector. If the vector was uniform, the bored surfer could restart anywhere, producing a general-purpose ranking. But the vector did not have to be uniform. By setting the random-jump probability so that the surfer returned to a specific page or set of pages, the authors could create a version of the web as seen from a particular viewpoint. In modern language, it was a way to bias the global graph toward a user’s interests without throwing away the global graph.

In one such experiment, they used the personal home page of John McCarthy as a personalization seed. McCarthy, a professor emeritus at Stanford, had named the field of “Artificial Intelligence” at the 1956 Dartmouth workshop. In the 1998 experiment, setting the personalization vector to his home page resulted in a PageRank distribution where his site sat at the 100th percentile from his own perspective. It was a quiet intersection of two eras of computer science: the man who had defined the field of AI in the 1950s was now a data point in the algorithm that would build the infrastructure for its next century.

The system that Page and Brin described in their April 1998 paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine, was a sophisticated academic prototype designed to prove that PageRank could scale. Hosted at google.stanford.edu, the system was a proof-of-concept for an architecture that would eventually move beyond the university.

The 1998 architecture was a pipeline of specialized components implemented primarily in C and C++ for efficiency, running on a mix of Solaris and Linux machines. The process began with a “URLserver” that handed out lists of web addresses to distributed “crawlers.” These crawlers, which could process about 50 pages per second, downloaded the content and sent it to a “storeserver.” The storeserver compressed the pages and wrote them into a repository. At the other end of the pipeline, the searcher was still a research system: fast enough to demonstrate the architecture, but not yet the always-on global utility that Google would later become.

From there, an “indexer” parsed the compressed pages, extracted every word and its location (called a “hit”), and built a “lexicon”—a list of every word found on the web, fixed at 14 million words for the initial experiments. A “sorter” then took the unsorted hits from the indexer and reorganized them by word ID to create an “inverted index.” In the reported run, four sorter machines took about 24 hours to do their part, while the indexer processed roughly 54 pages per second. This inverted index was the heart of the search engine, allowing it to provide a near-instant list of pages for any given query.

The storage requirements for this prototype were immense for the era. Their table of system statistics records 24 million pages fetched, 76.5 million URLs seen, and 1.6 million 404 errors. The fetched pages totaled 147.8 gigabytes of raw data, compressed down to 53.5 gigabytes. Including the lexicon, the inverted index, and the other system data outside the repository, the total footprint reached 108.7 gigabytes. Operating at this scale forced the students to invent new software abstractions to overcome the limitations of late-1990s operating systems. They built a “BigFiles” system: virtual files spanning multiple file systems, addressable by 64-bit integers, with file-descriptor allocation handled inside Google’s code because the operating systems did not provide enough descriptors for the indexing workload.

They also had to optimize every operation to minimize disk seeks. In 1998, a disk seek took approximately 10 milliseconds—an eternity for a system trying to process millions of documents. By carefully organizing the index and keeping the lexicon in memory, they were able to serve most queries with a latency of between one and ten seconds. However, this prototype was still fundamentally a research tool. It ran on a relatively small number of machines and was frequently bottlenecked by the network file system (NFS) they used for storage. The crawler downloaded the final 11 million pages in 63 hours, averaging about 48.5 pages per second; the full crawl, including errors, took roughly nine days.

The transition from this Stanford prototype to a global utility would require more than just refined math. It would require a fundamental rethink of the relationship between software and hardware.

By early 2003, the Google architecture had undergone a transformation that was orders of magnitude beyond the 1998 prototype. As documented in the paper Web Search for a Planet: The Google Cluster Architecture, the system had evolved into a production service running on more than 15,000 commodity-class PCs across geographically distributed clusters.

This shift was driven by a brutal economic and physical reality. As the web grew into the billions of pages, the cost of scaling up traditional enterprise-grade hardware became unsustainable. In the late 1990s and early 2000s, high-end computing usually meant buying expensive multiprocessor servers designed around hardware reliability. These machines were built with features such as redundant power supplies, RAID storage, high-quality components, and SCSI disks. Google moved in the opposite direction: inexpensive x86 machines, IDE disks, and software written on the assumption that some part of the system would always be broken.

Google’s engineers, including Urs Hölzle, Luiz André Barroso, Jeffrey Dean, Sanjay Ghemawat, and many others, treated web search as a throughput problem spread across many independent machines. Barroso was working in Google’s Systems Lab on web-search efficiency and hardware architecture; Dean’s author bio described work on crawling, indexing, query-serving systems, scalability, and relevance; Hölzle’s bio placed him at the center of search-engine development and operation during Google’s first years. The 2003 paper itself did not allocate credit to a single heroic engineer. Its acknowledgments named Gerald Aigner, Ross Biro, Bogdan Cocosel, and Larry Page as contributors to Google’s hardware architecture. The system was a collective engineering object, not merely a founder anecdote rendered in metal.

They codified this approach into explicit design principles. First, they prioritized software-level reliability over hardware fault tolerance. Instead of buying a server that would never crash, they built a software stack that expected machines to fail. Second, they used replication for both throughput and fault tolerance. By keeping “dozens of copies of the Web” across their clusters, they could serve more queries and survive the loss of hardware without treating every individual machine as precious. Third, they prioritized price/performance over peak performance. They would rather have many cheap, slightly slower CPUs than one expensive, top-of-the-line processor.

The cost comparison provided in the 2003 paper was a definitive argument for this new model. In late 2002, a rack from RackSaver.com containing 88 dual-CPU 2GHz Intel Xeon servers, each with 2GB of RAM and an 80GB disk, cost about 278,000 dollars. A comparable high-end x86 server with eight CPUs, 64 gigabytes of RAM, and 8 terabytes of disk cost approximately 758,000 dollars. This meant the high-end alternative cost nearly three times as much while offering roughly one twenty-second the CPU count. For an application that could be divided across shards and replicas, the arithmetic was devastating.

This choice forced the creation of a new kind of engineering discipline: warehouse-scale computing. Because the clusters were built from cheap Celeron and Pentium III PCs, hardware failure was no longer an unlikely accident; it was a statistical certainty. The contemporaneous Google File System paper stated the premise bluntly: component failures were normal rather than exceptional in a file system built from inexpensive commodity hardware. Its largest deployed clusters already held hundreds of terabytes across thousands of disks on more than a thousand machines. The search cluster described by Barroso, Dean, and Hölzle pushed the same design assumption into query serving at still larger scale.

The hardware was heterogeneous and short-lived. CPU generations from single-processor 533 MHz Intel Celeron through dual 1.4 GHz Pentium III were in active service simultaneously. Individual servers carried one or more 80GB IDE drives. Inside a rack, machines were connected by 100 Mbps Ethernet; between racks, one or two gigabit uplinks connected to a core gigabit switch. A rack might hold 40 to 80 x86 servers. Servers were replaced every two or three years because the performance disparity between old and new machines made load balancing increasingly difficult. This was not the clean symmetry of a supercomputer brochure. It was a constantly aging population of cheap machines organized by software.

The storage choice was part of the same discipline. SCSI disks were faster and more conventional in high-end servers, but the 2003 paper noted that they cost two or three times as much as equal-capacity IDE drives. RAID could mask disk failures, but RAID also added hardware cost to every box. Google instead accepted the cheap disk and moved the protection upward into replication. The same duplicate copies that increased throughput also made failure less catastrophic. In the paper’s phrasing, fault tolerance almost came for free because the service already needed many copies to handle query volume.

To manage this chaos, Google’s systems engineers developed a software stack that could route queries around dead machines in real time. The system was geographically distributed, and DNS-based load balancing directed users toward one of several clusters. Inside a cluster, a hardware load balancer fronted services including the Google Web Server, index servers, document servers, a spell checker, and an ad server. The index was partitioned into “index shards”—random subsets of the entire document collection—and each shard was served by a pool of machines. If one machine failed, a load balancer shifted traffic to another member of the pool. The user at the other end of the search query did not need to know which disk, rack, or cluster had answered.

The 2003 production architecture also treated energy efficiency as a first-class design constraint. A dual-1.4GHz Pentium III server drew roughly 90 watts of DC power under load: about 55 watts for the CPUs, about 10 watts for the disk, and about 25 watts for memory and motherboard components. With power-supply losses included, this translated to about 120 watts of AC power per server. A rack containing up to 80 servers ran at approximately 10 kilowatts, with a power density of 400 to 700 watts per square foot. This density was already pushing the upper limits of what commercial datacenter cooling systems could handle in 2003. Watts per query, not just CPU performance, became the metric that mattered.

The commodity-cluster pattern that Google crystallized between 1998 and 2003 was not the first commodity cluster. Beowulf clusters, Inktomi, and academic Linux clusters had already shown that cheap machines could be aggregated for serious computing. Google’s 2003 paper mattered because it documented a particular combination at production scale: commodity PCs, software-level fault tolerance, aggressive replication, a naturally parallel application, and an operational discipline built around cost and energy. That combination was eventually generalized into a new academic and industrial vocabulary. In 2009, Luiz André Barroso and Urs Hölzle published The Datacenter as a Computer, formalizing the idea that the entire warehouse, not the individual server, was the basic unit of computing. By 2011, John Hennessy and David Patterson—the authors of the canonical computer architecture textbook—had added a chapter on “Warehouse-Scale Computers,” codifying the pattern as part of modern computer architecture.

The acknowledgments in the 2003 paper also offer a necessary corrective to the common founder myth. While Larry Page and Sergey Brin are often credited with the mathematical breakthrough of PageRank, the 2003 paper explicitly thanks Larry Page for contributions to Google’s hardware architecture, alongside Aigner, Biro, and Cocosel. It was a reminder that the algorithm and the infrastructure were never truly separate. The math required the cluster, and the cluster was built to serve the math.

This transition—from PageRank as an academic citation experiment to the warehouse-scale computer as a documented engineering discipline—represented more than just a better way to find websites. It showed how a planetary-scale service could treat failure as a constant and scale as a requirement. A decade later, this same architecture—fault-tolerant, massively parallel, and composed of thousands of interconnected processing units—would become part of the substrate on which massive neural networks began to train. The story of those networks belongs to the chapters ahead, but their foundation was laid here, in the shift from the text-retrieval box to the planetary index.