Chapter 40: Data Becomes Infrastructure

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Fei-Fei Li (Li Fei-Fei)	—	Co-author of the 2009 ImageNet paper; per ACM’s 2017 TechTalk page, inventor of ImageNet and the ImageNet Challenge; associate professor at Stanford and director of SAIL at the time of the talk.
Jia Deng	—	First author of the 2009 ImageNet CVPR paper; named ImageNet and ILSVRC contributor in Li’s 2017 slides.
Wei Dong, Richard Socher, Li-Jia Li, Kai Li	—	Co-authors of the 2009 ImageNet paper; Kai Li appears in Li’s 2017 slides as a Princeton collaborator.
George A. Miller	—	Led the Princeton group that developed WordNet from 1985; co-author of the WordNet introduction used as ImageNet’s semantic backbone.
Christiane Fellbaum	—	WordNet co-author; identified in Li’s 2017 slides as a Princeton senior research scholar and Global WordNet Consortium president.
Olga Russakovsky	—	Lead author of the 2015 ILSVRC paper; key contributor to the annual benchmark that turned ImageNet into a competition fixture from 2010 onward.

Timeline (1985–2014)

timeline
    title ImageNet: From WordNet to Benchmark Infrastructure
    1985 : Princeton group begins developing WordNet as an online lexical database
    1990 : WordNet paper collection reports synsets and semantic relations
    2005 : Amazon announces Mechanical Turk (November 2) as a web-services API for distributable microtasks
    2007 : ImageNet crowdsourced-verification phase begins — 49k workers across 167 countries
    2009 : Deng, Dong, Socher, Li-Jia Li, Kai Li, and Li Fei-Fei publish ImageNet at CVPR : 5,247 synsets and 3.2 million verified images reported
    2010 : ILSVRC begins as an annual large-scale visual recognition benchmark built from ImageNet
    2014 : ImageNet reaches 21,841 synsets and 14,197,122 annotated images (August)

Plain-words glossary

WordNet — An online lexical database developed at Princeton from 1985. Instead of listing words alphabetically, it groups English nouns, verbs, and adjectives into synonym sets linked by semantic relations such as hypernym (broader category) and hyponym (more specific instance).
Synset — A synonym set in WordNet: a cluster of words that all name the same underlying concept. ImageNet’s unit of organization was the synset, not a bare word or arbitrary tag.
Hypernym / hyponym — WordNet terms for conceptual hierarchy: hypernyms are broader categories, while hyponyms are more specific instances.
Amazon Mechanical Turk (AMT) — A crowdsourcing marketplace for routing small tasks to distributed workers. ImageNet used it to verify whether each candidate image contained the target synset’s object, not to generate labels from scratch.
ILSVRC (ImageNet Large Scale Visual Recognition Challenge) — An annual benchmark competition that began in 2010, built on the ImageNet database. It provided a common training set, held-out evaluation images, and a workshop; it scaled from PASCAL VOC’s 20 classes and ~20,000 images to 1,000 classes and over 1.4 million images.
Query expansion — A search-strategy technique that widens or clarifies a search by adding related terms instead of relying on one literal keyword.
Precision (ImageNet sense) — The fraction of collected images correctly showing the target concept after verification; the ImageNet paper reported about 99.7 percent after verification versus about 10 percent average raw search-result accuracy.

The era of computer vision preceding ImageNet was characterized by a distinct and disciplined approach to progress. Researchers operated within the formal confines of established datasets, such as the PASCAL Visual Object Classes (VOC) challenge, where the community tested and compared their algorithms against a carefully curated, but ultimately limited, number of images and categories. The field had reached a scale boundary. Progress was frequently measured in painstaking, incremental improvements to hand-engineered feature stacks and complex mathematical models. The prevailing wisdom held that the key to unlocking robust visual recognition lay primarily in refining these algorithms—in tuning and tweaking the models to extract ever more subtle patterns from the available, modestly sized datasets.

That earlier benchmark culture was not a failure. It had taught computer vision to value common tasks, shared test sets, and comparable numbers rather than isolated demonstrations. The limit that ImageNet exposed was a limit of scale and semantic breadth. A benchmark with a few dozen categories could discipline algorithm design, but it could not force a recognizer to confront the long tail of ordinary visual life: dog breeds, plants, tools, vehicles, instruments, foods, and household objects whose names were neither glamorous nor rare in the world. The question became whether object recognition could be organized around a much larger inventory of things without dissolving into an unmanageable pile of labels.

According to later public accounts, Fei-Fei Li began to formulate a different perspective on the field’s stagnation. She hypothesized that the primary bottleneck stifling computer vision was no longer just the elegance or complexity of the algorithms, but rather the sheer volume and diversity of the data used to train and evaluate them. In a 2017 retrospective TechTalk delivered at the Association for Computing Machinery (ACM), Li characterized this transition as a fundamental pivot in the philosophy of visual recognition. To break through the performance ceiling, she argued, the field needed to shift its emphasis away from an exclusive focus on modeling and direct its attention squarely toward data—and specifically, toward what she described as “lots of data.”

This conceptual pivot demanded an unprecedented scale of data collection, a task that quickly proved incompatible with traditional laboratory methods. The ambition of capturing the vast complexity of the visual world could not be achieved through ad hoc scraping or small-team annotation. In her 2017 slides, Li detailed the daunting arithmetic that scuttled early, more conventional attempts to launch such a massive dataset. The initial ideas relied on standard academic labor models. One early strategy involved a psychophysics approach, relying on the labor of undergraduate students to manually sort and verify images. The retrospective calculation presented in her talk was stark and revealing: attempting to gather and verify images for 40,000 distinct visual categories, targeting 10,000 candidate images per category, and requiring two to five human verifiers to confirm each image, would take approximately 19 years to complete.

The point of that calculation was not merely that the proposed dataset was large. It showed that ordinary academic labor units broke under the multiplication. Forty thousand concepts multiplied by 10,000 candidate images already produced 400 million image-category candidates before quality control. Requiring even a few human checks per image turned that into more than a billion individual judgments. The problem was no longer only “Where do we get pictures?” It was “How do we route hundreds of millions of small visual judgments through people, keep the task simple enough to distribute, and still recover a dataset that scientists could trust?”

This impossible timeline demonstrated a profound reality about the next era of artificial intelligence: before data could change modeling, data collection itself had to be solved as a massive logistical and engineering challenge. The scale required demanded a transition from laboratory experiments to a true engineering pipeline. Data was no longer just the byproduct of research; it had to be built as foundational infrastructure.

To build a dataset of the magnitude Li envisioned, the project first needed a rigorous, underlying organizational structure. A simple alphabetical list of tags or an ad hoc hierarchy of desktop folders would rapidly collapse under the semantic weight of millions of images. The researchers required a system capable of handling the profound ambiguity of human language—a structure that could systematically distinguish between a “bank” as a financial institution and a “bank” as the side of a river. The solution did not originate from within the computer vision community, but from linguistics and cognitive psychology, specifically a pioneering project born at Princeton University decades earlier.

Beginning in 1985, a dedicated group of psychologists and linguists at Princeton, led by George A. Miller, embarked on the development of WordNet. As detailed in a comprehensive 1990 collection of papers reporting on the project’s state, WordNet was conceived as a vast, online lexical reference system. Its fundamental design was deeply inspired by prevailing psycholinguistic theories concerning how human lexical memory is organized and accessed. Unlike a standard dictionary that simply lists words alphabetically, WordNet organized English nouns, verbs, and adjectives into distinct synonym sets, which the creators termed “synsets.” Each synset represented a single, underlying lexical concept, effectively disambiguating words with multiple meanings by grouping them with their conceptual equivalents.

This was a subtle but crucial distinction. A string of letters could not be the basic unit for ImageNet, because a string of letters often names several different things. The word “bank,” for example, can mean a financial institution or the side of a river; WordNet’s synset design treated those as separate concepts rather than as a single word with a muddled list of meanings. That gave ImageNet a way to ask for images of a concept, not simply images retrieved by a spelling. In visual recognition, this difference mattered. The label attached to an image had to mean one thing clearly enough that a human verifier and an algorithm could be judged against the same target.

Crucially for its future application in computer vision, WordNet did not isolate these synsets; it linked them together through a dense, logical network of semantic relations. These relations functioned as explicit, machine-readable pointers between concepts. For nouns, WordNet defined a meticulously detailed hierarchy of hypernyms (broader, overarching categories) and hyponyms (more specific instances). For example, a specific breed of dog would be linked as a hyponym to the broader synset for “dog,” which in turn pointed upward through broader synsets such as “canine” and “animal.”

The hierarchy also gave ImageNet more than a flat classification list. A flat list can say that “whippet,” “dog,” and “animal” are three labels; a hierarchy can say how those labels imply one another. That implication was part of the scientific promise. If a system recognized a whippet, the ontology made it possible to treat that success as related to recognition of dogs and animals, rather than as an isolated class with no semantic neighbors. In practice, the early ImageNet paper still had to build concrete image collections one synset at a time. But the design meant that those collections sat inside a wider map of concepts instead of in a spreadsheet whose rows happened to have names.

This strict, relational architecture meant that WordNet was fundamentally an ontology—a formal map of semantic relationships. The ImageNet project adopted this structure as the central nervous system of its visual database. The defining 2009 CVPR paper, co-authored by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, explicitly introduced ImageNet not merely as a collection of pictures, but as a large-scale ontology of images constructed directly upon the backbone of the WordNet structure.

The ambition laid out by the team was staggering in its scope. They aimed to populate the vast majority of WordNet’s roughly 80,000 noun synsets. For each of these distinct lexical concepts, their goal was to provide an average of 500 to 1,000 clean, full-resolution images. In essence, the project sought to take WordNet’s abstract semantic skeleton and flesh it out with visual reality, replacing textual definitions with millions of pixels while preserving the rigid, logical hierarchy that gave the dataset its structure. ImageNet would become a navigable, computable map of the visual world, allowing algorithms to understand that an image of a whippet was also, implicitly, an image of a dog and an animal.

That ambition also marked a boundary between ImageNet and ordinary dataset collection. The team did not begin by asking which labels were fashionable in computer vision or easy for a small group to annotate. They began with an existing lexical inventory and asked how much of it could be made visible. Many synsets were not equally useful for image recognition, and the 2009 system reported progress through selected subtrees rather than completion of the whole noun hierarchy. Still, the unit of work was inherited from WordNet: one concept, many candidate images, repeated across thousands of concepts. The infrastructure was semantic before it was photographic.

With the WordNet ontology providing the structural blueprint, the ImageNet team confronted the daunting physical task of sourcing millions of images to populate it. The internet represented an unimaginably vast reservoir of visual data, but it was fundamentally unstructured, unlabeled, and notoriously noisy. Gathering this material required an automated pipeline that was massively scalable and highly resilient to the chaotic nature of search engine algorithms.

The 2009 paper outlined a long-term goal of collecting around 50 million images, a scale that necessitated programmatic, continuous harvesting. To build their initial candidate pools, the researchers turned to the web’s existing image search engines. The procedural pipeline was initiated by automatically querying these search engines using the specific synonyms associated with each WordNet synset.

The method was simple enough to repeat and complicated enough to matter. For every synset, the system could draw on the words WordNet associated with that concept, issue those words to image search engines, and collect the returned pictures as candidates. At this stage, the pipeline did not need the search engine to be right most of the time. It needed the search engine to return enough plausible material that later filtering could extract a clean core. The design therefore separated recall from precision. First, gather aggressively. Then, clean rigorously.

However, relying on simple keyword searches immediately exposed the limitations of search engine retrieval and the inherent ambiguity of language. To increase the volume, accuracy, and diversity of the retrieved images, the team engineered a sophisticated strategy of dynamic query expansion. They leveraged the rich, interconnected data already present within the WordNet hierarchy. When a synset’s gloss—its textual definition—contained words that could help contextualize and disambiguate the core term, the system automatically expanded the search queries by appending terms from the synset’s parent category (its hypernym). For instance, if the target synset was the “whippet” breed of dog, the automated query might be expanded from a simple “whippet” to a more precise phrase such as “whippet dog” by appending the parent-category term. This programmatic expansion helped prevent search engines from returning images of entirely unrelated entities that happened to share the same name, significantly improving the initial relevance of the search returns.

The example captures how the WordNet hierarchy became active machinery rather than background decoration. Parent concepts were not only useful for organizing finished labels; they could improve the search queries that produced raw candidates. A synset’s definition and its position in the hierarchy supplied context that a bare search term lacked. The pipeline therefore reused the same semantic structure twice: first to decide which visual categories existed, and again to make web search less ambiguous when harvesting candidate images for those categories.

Even with query expansion, English-language searches alone proved insufficient to reach the project’s ambitious goals for scale and global diversity. To widen the net and capture a more robust slice of the visual world, the team deliberately enlarged the candidate pool by translating their queries into multiple other languages, specifically Chinese, Spanish, Dutch, and Italian. Crucially, they did not rely on brittle, word-to-word dictionary translations that frequently lose semantic nuance. Instead, they utilized established WordNets in those respective languages, ensuring that the translated queries maintained their precise ontological meaning across linguistic boundaries.

This automated, multilingual pipeline succeeded in rapidly yielding a massive reservoir of images, but it simultaneously created a severe quality-control crisis. The internet was abundant, but it was dirty. The 2009 paper reported a sobering metric: the average search-result accuracy across these queried engines hovered around a mere 10 percent. The system was designed to deliberately overcollect a noisy mass of candidate images, operating under the assumption that the vast majority would eventually have to be discarded. Gathering the raw, unverified material from the web was only the first half of the infrastructural challenge; the far more complex task was building a machine capable of systematically cleaning it at scale.

This inversion is easy to miss. A 10 percent average accuracy rate would be disastrous if the search engine were being treated as the dataset creator. In ImageNet, it was treated as a candidate generator. The low precision was acceptable only because the next stage could reject bad images in bulk. The web supplied breadth; the ontology supplied target meanings; human verification would supply the discipline that search engines could not. The architecture depended on all three layers working together.

To refine this chaotic harvest of millions of web images into a disciplined, reliable dataset, the ImageNet team required human judgment at an unprecedented, industrial scale. They found the solution to their labor bottleneck in Amazon Mechanical Turk (AMT). Publicly announced by Amazon on November 2, 2005, as a novel web-services API designed to provide “Artificial Artificial Intelligence,” MTurk established a global, decentralized marketplace where distributed human workers could complete microtasks that algorithms still struggled to perform.

For ImageNet, MTurk mattered less as a novelty than as a routing layer. A requester could break a large job into small Human Intelligence Tasks, specify what counted as acceptable work, and have remote workers complete those tasks through Amazon’s platform. That model fit ImageNet’s bottleneck with unusual precision. The project did not require a small number of expert annotators to write long descriptions. It required a very large number of short, repetitive judgments, each narrow enough to be explained on a task page and repeated across image grids.

In Li’s 2017 retrospective slides, the integration of crowdsourced labor via AMT is depicted as the pivotal operational breakthrough that finally made ImageNet viable, succeeding where the earlier psychophysics and undergraduate-based attempts had stalled. The scale of the workforce mobilized for this effort was large enough to change the nature of the project: Li’s slides report that ImageNet used 49,000 workers distributed across 167 countries during the core construction phase from 2007 to 2010.

The crucial engineering choice in the ImageNet pipeline was the precise nature of the task presented to these distributed workers. The humans were not asked to look at a random image and generate a label from scratch—a process that would be slow, subjective, and practically impossible to standardize across tens of thousands of categories. Instead, the workflow was strictly constrained to verification. The custom task interface presented a worker with a grid of candidate images alongside the explicit definition of the target WordNet synset. The worker’s job was simply a binary decision: to verify whether or not each candidate image actually contained the specific object described by the synset.

That binary design carried much of the system’s load. It reduced the cognitive burden on each worker, limited the range of possible answers, and tied the worker’s decision to a definition rather than to a private guess about what the category might mean. It also preserved the WordNet unit all the way through the pipeline. A worker was not deciding whether an image was generally “good,” “relevant,” or “about dogs.” The worker was deciding whether it contained the target synset. The instruction was narrow because the desired output had to be narrow.

While this verification model streamlined the task, human error, misunderstanding, malicious clicking, and the inherent subtlety of certain visual categories meant that a single worker’s vote could never be implicitly trusted as absolute ground truth. The ImageNet team had to engineer a rigorous, statistical quality-control system capable of extracting reliable data from fallible human labor.

As detailed in the 2009 CVPR paper, the system’s core defense against noise was calculated redundancy. Every single candidate image was subjected to multiple, independent votes from different workers. To establish a baseline of reliability, the researchers deployed initial subsets where images received at least 10 independent votes each. By rigorously analyzing how different users agreed or disagreed on these dense initial subsets, the team constructed sophisticated confidence score tables. These statistical tables allowed them to establish dynamic, threshold-based stopping rules for the remaining candidate images. The system automatically determined exactly how many independent votes were required to confidently accept or reject an image for any given synset, optimizing the balance between accuracy and the cost of human labor.

The redundancy system treated disagreement as information. Some synsets were visually straightforward; others were subtle, confusable, or dependent on fine distinctions. Some categories were straightforward, while others depended on fine distinctions that produced disagreement between workers when the visual evidence did not show the relevant features clearly. The pipeline therefore could not use one universal number of votes as if every category had the same difficulty. Dense voting on initial subsets let the team estimate how much agreement was needed before the system should stop asking for more judgments. In that sense, the quality-control system learned something about the task environment before scaling across the remaining images.

This pipeline was not merely a mechanism for gathering raw clicks; it was an intricate, statistically governed machine designed to distill messy human subjectivity into a clean, objective ground truth. The results of this rigorous filtering were striking. The 2009 paper reported that, when tested on sampled synsets, the ImageNet verification process achieved a precision rate of approximately 99.7 percent. This extraordinary level of operational precision was the true hallmark of ImageNet’s status not just as a collection of pictures, but as a robust, dependable piece of scientific infrastructure.

The precision figure should be read as a reported validation result, not as a claim that every image in every category became perfect. Its importance lay in what it demonstrated about the construction method. Web search had begun with noisy results; distributed labor had introduced human variability; the final dataset still reported a level of cleanliness high enough for a scientific benchmark. The achievement was not that people on MTurk were infallible. It was that the system could combine many small, fallible judgments into a dataset reliable enough to support shared evaluation.

The immediate culmination of this immense organizational, computational, and labor effort was the publication of “ImageNet: A Large-Scale Hierarchical Image Database” at the Computer Vision and Pattern Recognition (CVPR) conference in June 2009. The paper introduced the dataset to the broader research community not as a finished, static artifact, but as a vast, expanding infrastructural resource.

In its 2009 iteration, the researchers reported that ImageNet already contained 12 subtrees, encompassing 5,247 synsets and an unprecedented 3.2 million verified images. It stood as a towering achievement in data curation, dwarfing contemporary image datasets in both physical scale and semantic granularity. Yet, building the infrastructure and presenting it at a conference was only the first step; the critical next phase was successfully integrating this massive resource into the daily workflows and evaluation rhythms of computer vision research.

The 2009 paper also had to make a methodological argument. ImageNet was large, but scale alone would not have made it useful. It was organized by a semantic hierarchy, populated with full-resolution images, and built through a repeatable collection-and-verification process. Those qualities made it different from a web crawl left in its raw form. They also made it different from a narrow benchmark whose categories had been selected mainly because they were convenient for a particular competition. The paper’s underlying proposition was that a dataset could be both very large and carefully structured.

The primary mechanism for this integration became the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). As detailed in a later retrospective paper authored by Olga Russakovsky alongside Hao Su, Jonathan Krause, and other contributors, ILSVRC began in 2010 as an annual fixture in the computer vision community. It formalized the dataset’s role by combining a public data release with an annual competition and an associated workshop, creating a standardized arena where researchers could pit their algorithms against a common, universally recognized standard.

The inception of ILSVRC forced the entire field to confront a massive scale jump. The contrast with the preceding era of evaluation was stark. Scaling from the widely used PASCAL VOC 2010 benchmark to ILSVRC 2010 meant transitioning from a challenge involving 20 classes and 19,737 images to an expansive arena encompassing 1,000 classes and 1,461,406 images. This expansion of scope made traditional, small-group annotation difficult to sustain and made the crowdsourced, MTurk-driven verification pipeline central to the benchmark’s feasibility.

That comparison should not be read as a dismissal of PASCAL VOC. It shows the change in regime. A smaller benchmark can be curated with intense attention by a compact organizing group; a million-image, thousand-class benchmark requires industrialized data operations. ILSVRC converted ImageNet from a database into an annual rhythm: public training data, common tasks, held-out evaluation, competition results, and a workshop where the community could compare methods under shared conditions. The data infrastructure became social infrastructure as well, because it synchronized what researchers tried, measured, and reported.

The challenge format also imposed a useful separation between building and scoring. A public dataset allowed laboratories to train and tune systems against a shared visual vocabulary. Held-out annotations and common evaluation procedures made the comparison less dependent on each group’s private test set. The result was not merely more data in circulation, but a more exact way to argue about progress. Researchers could still disagree about features, classifiers, and modeling assumptions, but the benchmark gave those disagreements a common measuring instrument.

Furthermore, the dataset continued to grow relentlessly beneath the superstructure of the annual benchmark. By August 2014, Russakovsky and her co-authors reported that ImageNet had expanded dramatically to encompass 14,197,122 annotated images, meticulously mapped across 21,841 WordNet synsets.

The difference between the 2009 and 2014 numbers matters. The 2009 CVPR paper reported a working version: 12 subtrees, 5,247 synsets, and 3.2 million images. Later accounts could describe a much larger ImageNet, but those later totals should not be folded backward into the first publication as if the dataset had always existed at that size. ImageNet was an expanding system. Its early importance came from the pipeline and the benchmark form as much as from any single final image count.

ImageNet had succeeded in its primary infrastructural goal. It had systematically transformed millions of noisy web images and distributed human judgments into a stable, widely adopted testing ground. It provided a reliable, hierarchical semantic skeleton upon which computer vision researchers could measure progress at a new scale. The later story of GPUs, convolutional neural networks, and the 2012 recognition breakthrough belongs elsewhere. Here, the turning point is more basic: before a new model could make spectacular use of visual data, visual data itself had to be collected, cleaned, organized, and made common. ImageNet made that labor visible as infrastructure.