Chapter 38: The Human API

Cast of characters
| Name | Lifespan | Role |
| --- | --- | --- |
| Venky Harinarayan | | Amazon inventor; IEEE Spectrum identifies him as a manager behind the internal system that became Mechanical Turk. Co-inventor on US7197459B1. |
| Anand Rajaraman | | Co-inventor on US7197459B1, listed with Amazon Technologies Inc. as original assignee. |
| Anand Ranganathan | | Co-inventor on US7197459B1, listed with Harinarayan and Rajaraman. |
| Jeff Bezos | 1964– | Amazon founder/CEO; Computerworld reports his September 27, 2006 MIT keynote grouped MTurk with S3 and EC2 as developer-facing cloud services. |
| Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng | | Co-authors of the 2008 EMNLP paper “Cheap and Fast - But is it Good?” on MTurk annotation. |
| Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei | | Authors of the 2009 ImageNet CVPR paper; used AMT to verify candidate images at large scale. |
Timeline (2001–2018)
timeline
title Amazon Mechanical Turk and the Human Annotation Infrastructure
2001 : Amazon patent work begins on routing difficult subtasks to human workers
: October 12 — filing date; inventors Harinarayan, Rajaraman, Ranganathan; assignee Amazon Technologies
2004 : Von Ahn and Dabbish publish the ESP Game paper at CHI — pre-MTurk human-computation boundary
2005 : AWS publicly launches Amazon Mechanical Turk as a web-services API
2006 : September 27 — Bezos MIT keynote groups MTurk with S3 and EC2 as "hidden Amazon" developer services
2007 : March 27 — US7197459B1 granted and published
2008 : October — Snow et al. publish "Cheap and Fast - But is it Good?" at EMNLP
2009 : June — Deng et al. publish ImageNet at CVPR with an AMT cleaning pipeline
2018 : Hara et al. CHI wage analysis documents low effective worker pay
Plain-words glossary
  • Human Intelligence Task (HIT) — The unit of work on Mechanical Turk. A requester posts a HIT with a title, instructions, a price, and a rule for when it counts as complete. A worker picks it up, does the task, and submits an answer. The HIT structure is what lets software treat human judgment as an addressable request.
  • Requester — MTurk vocabulary for the party that posts tasks. A requester could be an individual researcher, a company, or a software system issuing API calls programmatically. Requesters set pay rates, qualifications, and approval or rejection rules.
  • Turker (worker) — Colloquial name for people who accept and complete HITs on Amazon Mechanical Turk. Workers browse available tasks, choose which to accept, and receive payment only for work the requester approves.
  • Synset — A set of words in WordNet that share a single meaning, used by ImageNet as a concept scaffold.
  • Redundancy (annotation) — The practice of collecting multiple independent labels for the same item instead of trusting a single worker’s answer.
  • Artificial Artificial Intelligence — Amazon’s deliberate phrase for human judgment packaged behind an AI-like interface.

Amazon’s storefront in the early 2000s had grown far beyond a tidy digital bookshelf. As the company expanded into a vast, universal retailer, its product catalog swelled relentlessly, fed by many disparate and overlapping product sources. This rapid growth brought a persistent and frustrating operational headache: duplicate product listings began to fracture search results, clutter the storefront, and degrade the shopping experience. The traditional software engineering instinct, when faced with such an issue, was to write a better automated detector—an algorithmic filter designed to scan the database, find identical items, and merge them. However, duplicate detection sat in an awkward and stubborn zone of computational capability. It required comparing images and text, making similarity judgments that were relatively easy for a person quickly glancing at two product pages, but which were described as insurmountable for the computer systems of the period. According to later reports, ordinary software simply could not reliably resolve the problem. The automated approaches failed to capture the nuances that a human eye could parse in an instant.

The important point is not that Amazon discovered a new philosophical category of work. People had been labeling, classifying, transcribing, and correcting information for computers long before Mechanical Turk appeared. The sharper claim is infrastructural. Amazon had a class of small, repetitive judgments that were too ambiguous for the automated systems available to it and too numerous to treat as one-off clerical exceptions. A catalog problem became interesting when it was translated into a systems problem: how could software ask for a human judgment without pausing the whole machine around that person?

The solution to this intractable catalog problem emerged not as a better image-matching algorithm, but as a new architectural abstraction. The blueprint for this approach was formalized in a patent application with a priority date of March 19, 2001, filed on October 12 of that year by inventors Venky Harinarayan, Anand Rajaraman, and Anand Ranganathan, with Amazon Technologies Inc. listed as the original assignee. The patent, granted and published on March 27, 2007 as US7197459B1, was titled “Hybrid machine/human computing arrangement.” It outlined a method for software to systematically route around its own limitations. Rather than forcing a machine to struggle with a complex subjective task, the patent described an architecture built around a coordinating server that could decompose difficult tasks, such as image or speech comparison, into discrete subtasks suitable for human performance. According to secondary accounts, this technology, which distributed subtasks to networked human workers, was later turned into a marketplace.

This was not merely a description of outsourcing work to a human contractor; it was a formal engineering interface. The patent detailed how a program could request these human performances programmatically. The system utilized a task server, sometimes coordinating with a “Junta Computer,” that received the decomposed subtasks and dispatched them to human-operated nodes. Crucially, this hybrid arrangement was exposed through an application programming interface. The API allowed the requesting software to define the exact parameters of the judgment it needed: the nature of the task, the specific input data to be analyzed, the expected accuracy of the result, the necessary security level, the maximum amount of time to be spent on a subtask, and the cost to be incurred for a task.

Those parameters are what made the design more than a queue of odd jobs. A normal human work request is thick with negotiation: who is qualified, how long the job should take, what the worker is allowed to see, what level of error is acceptable, and what the work is worth. The patent compressed those questions into fields a machine could pass to another machine. Time became a limit, cost became a bound, accuracy became a stated expectation, and security became part of the request rather than an afterthought. The request did not need to describe a whole employment relationship. It needed to describe a subtask tightly enough that it could be handed to a person and returned as data.
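The fields described above can be pictured as a small request record. The following is a minimal sketch, not the patent's actual schema: the class name `SubtaskRequest`, the field names, the units, the example values, and the `validate` helper are all hypothetical, chosen only to mirror the parameters the text names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtaskRequest:
    """Hypothetical record mirroring the request parameters named in the text."""

    task_type: str            # nature of the task, e.g. "image_comparison"
    input_data: tuple         # the items a person is asked to judge
    expected_accuracy: float  # stated accuracy expectation, in (0, 1]
    security_level: str       # sensitivity of the input (labels are assumed)
    max_seconds: int          # upper limit on time spent on the subtask
    max_cost_usd: float       # upper bound on what the caller will pay

    def validate(self) -> None:
        """Reject requests whose bounds make no sense."""
        if not 0.0 < self.expected_accuracy <= 1.0:
            raise ValueError("expected_accuracy must be in (0, 1]")
        if self.max_seconds <= 0:
            raise ValueError("max_seconds must be positive")
        if self.max_cost_usd < 0:
            raise ValueError("max_cost_usd must not be negative")


# Illustrative values only; none of them come from the patent.
request = SubtaskRequest(
    task_type="image_comparison",
    input_data=("listing_a.jpg", "listing_b.jpg"),
    expected_accuracy=0.95,
    security_level="public",
    max_seconds=60,
    max_cost_usd=0.05,
)
request.validate()
```

Nothing in the record names a worker, a contract, or a workplace, which is the compression the paragraph describes: the request carries only what is needed to hand a subtask to a person and get data back.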

By defining human labor in terms of input data, latency limits, and unit costs, the patent treated human cognition as an addressable component within a larger software system. It was a formal mechanism for incorporating human judgment directly into a computational loop. This represented a profound shift in infrastructure design. It meant that a software developer confronting a difficult categorization or comparison problem no longer needed to build a perfect algorithm or hire a dedicated staff of analysts. Instead, they could write a program that issued an API call, bounded by a specified budget and time limit, and wait for a human-operated node to return the result. The architecture turned human judgment into a service that could be queried, setting the stage for a new kind of computational labor.

On November 2, 2005, Amazon Web Services took this internal infrastructure and launched it publicly as Amazon Mechanical Turk. The launch announcement read like a miniature manifesto about the limits of computation and the enduring value of human perception. It explicitly noted that humans still outperformed powerful computers at many tasks, offering the specific and prescient example of identifying objects in photographs. Its rhetorical turn inverted the established relationship between people and software by asking what would happen if the normal request flow were reversed. Instead of a human user typing a query and waiting for a computer to perform a task and return a result, what if a computer program could ask a human being to perform a task and return the results?

Amazon framed this service as providing a web-services API for computers to integrate “Artificial Artificial Intelligence” directly into their processing. The phrase was a deliberate and self-aware irony, pointing to the persistent gap between the ambition of artificial intelligence and the reality of what software could currently achieve on its own. The product’s name itself, Mechanical Turk, pointed back to an eighteenth-century illusion. As historical accounts explain, the original Mechanical Turk was a famous chess-playing automaton that appeared to be an autonomous, thinking machine, but actually hid a person inside its cabinet who operated it. Amazon’s modern namesake followed the same logic as an Internet service: software systems could present an automated, intelligent facade to the world, while relying on a distributed network of human workers to handle the cognitive steps the code could not execute.

The launch announcement’s reversal of request flow was more than a clever turn of phrase. It changed where a human could sit in a software architecture. In the ordinary consumer model, the person was outside the system, using a machine as a tool. In the Mechanical Turk model, the person could be inside the system, receiving a bounded task from code and sending a result back into the calling process. The worker did not have to know the larger application, and the application did not have to expose itself as a workplace. It only had to turn the unsolved part of the computation into a human-readable question.

The importance of the 2005 launch lay squarely in its interface design. Amazon Mechanical Turk was not presented as a traditional staffing agency brochure, a crowdsourcing portal, or a volunteer network. It was an API. A software developer, now termed a “requester,” did not need to negotiate a contract, interview candidates, or manage a physical room of annotators. Their software could programmatically create tasks, set piece-rate rewards, collect the answers, and fold those human outputs back into a larger computation seamlessly. It made paid, distributed human cognition look callable, meterable, and scalable. According to some accounts, Jeff Bezos personally originated the website concept, though the precise division of labor between the chief executive and the patent’s inventors remains a complex piece of the company’s internal history. Regardless of its exact genesis, the public launch transformed an internal tool—originally conceived for catalog maintenance—into a generalized utility for anyone who needed programmatic access to human judgment.

That attribution problem is itself revealing. The patent record names Harinarayan, Rajaraman, and Ranganathan on the hybrid computing design; later explainers place Bezos closer to the public website concept; secondary accounts connect the system to Amazon’s internal catalog work. The cleanest history is therefore not a lone-inventor story. Mechanical Turk emerged from an organization that had both a practical retail headache and an unusual habit of turning internal machinery into general developer services. The product did not need a single origin myth to matter. Its historical force came from the way several strands—catalog maintenance, web-service engineering, executive platform strategy, and the old dream of machine intelligence—were folded into one interface.

The context of Amazon Mechanical Turk’s introduction is vital for understanding its historical impact: it arrived exactly as Amazon was fundamentally redefining itself as an infrastructure provider. On September 27, 2006, Jeff Bezos delivered a keynote presentation at the Emerging Technologies Conference on the Massachusetts Institute of Technology campus. Contemporary reports of the event described Bezos offering a look at the “hidden Amazon”—the vast operational machinery and internal software tools that the retailer had built to run its own massive e-commerce business, which were now being turned outward and exposed to external developers as a suite of services.

During this presentation, Bezos grouped Mechanical Turk together with the Simple Storage Service (S3) and the Elastic Compute Cloud (EC2). This juxtaposition was the load-bearing move. Storage, compute power, and human judgment were all introduced to the developer community in the same breath, using the same web-services vocabulary. They were presented not as separate business ventures, but as modular, metered utilities belonging to a unified platform. By the time of the MIT keynote, according to contemporary reports, approximately 200,000 developers had registered to use Amazon’s ten web services. Bezos framed the appeal of these offerings around low barriers to entry and frictionless experimentation. He described them as “pay-by-drink” services, emphasizing that developers could scale their usage up or down without heavy upfront investments or long-term commitments.

The phrase mattered because it described an economic interface as much as a technical one. A developer no longer had to buy a server fleet before trying a storage-intensive idea, and no longer had to build a staff before trying a judgment-intensive one. S3 made storage granular. EC2 made computation granular. Mechanical Turk made certain human judgments granular. The analogy was imperfect, because a worker is not a disk block or a virtual machine, but the pricing and request model encouraged developers to reason across all three resources in similar terms: call the service, meter the use, pay for the amount consumed, and discard the result if it failed the caller’s standard.

That was the deeper significance of placing MTurk beside the other early AWS services. The service inherited credibility from the platform around it. A strange labor marketplace looked less strange when it arrived in the same vocabulary as storage buckets, server instances, web-service calls, and small experimental bills. For researchers and developers, this mattered because the cost of trying a data-collection idea could become small enough to justify experiments that would previously have required a hiring plan. The service did not need to promise that every task would work. It made failure cheap enough that task design itself could become iterative.

This framing cemented Mechanical Turk’s status as infrastructure. Bezos explained that the web services were essentially functions that Amazon already had to perform internally, which they were now making available to others. By packaging human labor in the same delivery vehicle as server instances and database storage, Amazon normalized the idea that human cognition could be purchased as an elastic resource. It was a radical abstraction that allowed software engineers to treat human workers as another node in a distributed computing architecture, a node that could be reached via an API call whenever an algorithm encountered an edge case it could not resolve. It was no longer just a clever website; it was part of the early cloud-services vocabulary.

This new, callable infrastructure arrived as machine learning researchers faced a growing need for labels. The bottleneck in natural language processing and computer vision was not only algorithmic design; it was also the acquisition of annotated data to train supervised learning models. Human linguistic annotation was crucial for many NLP tasks, but relying on graduate students or hired linguistic experts was expensive and time-consuming. In October 2008, a paper presented at the Empirical Methods in Natural Language Processing conference provided an early and rigorous academic validation of Mechanical Turk as a research method. Authored by Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng, the study asked a straightforward question: “Cheap and Fast - But is it Good?”

Snow and his colleagues evaluated Mechanical Turk across five specific natural language tasks: affect recognition, word similarity, textual entailment, temporal event ordering, and word sense disambiguation. The researchers recognized that Mechanical Turk operated as an online labor market, complete with its own specific operational mechanics. Requesters posted Human Intelligence Tasks, or HITs, offering a specific monetary reward for completion. Workers, colloquially known as Turkers, chose which tasks to accept based on the description and the payment offered. Requesters could require certain qualifications before a worker could accept a task, and they retained the power to approve or reject the submitted work, with the financial transactions—including base pay and bonuses—entirely mediated by Amazon.

In that workflow, the unit of annotation was not a classroom exercise or a laboratory session. It was a posted task with a title, a description, a price, and a completion rule. A requester could group many similar judgments together, ask for a specified number of annotations, and decide afterward which submissions to accept. Workers could browse the available tasks and choose among them. This market design mattered for machine learning because it separated annotation from the institutional setting that had traditionally supplied it. A data set no longer had to wait for a small pool of experts or students to move through a queue by hand; it could be broken into many tiny questions and offered to a distributed workforce.
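The posted-task workflow described above can be sketched as a small data model. This is an illustration under stated assumptions, not the MTurk API: the class `Hit`, its fields, its methods, and the example values are all hypothetical, and the only rules it encodes are the ones the text gives (a requested number of annotations, and payment only for approved work).

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical model of a posted task; not the real MTurk interface."""

    title: str
    description: str
    reward_usd: float
    max_assignments: int  # completion rule: how many answers are wanted
    submissions: dict = field(default_factory=dict)  # worker_id -> answer
    approved: set = field(default_factory=set)

    def submit(self, worker_id: str, answer: str) -> None:
        """Record a worker's answer while the HIT is still open."""
        if len(self.submissions) >= self.max_assignments:
            raise RuntimeError("HIT already has all requested answers")
        self.submissions[worker_id] = answer

    def approve(self, worker_id: str) -> None:
        """Mark one submitted answer as accepted by the requester."""
        if worker_id not in self.submissions:
            raise KeyError(worker_id)
        self.approved.add(worker_id)

    def total_payout(self) -> float:
        """Only approved submissions are paid."""
        return self.reward_usd * len(self.approved)


# Illustrative values only.
hit = Hit("Judge headline emotion", "Pick the dominant emotion.", 0.01, 2)
hit.submit("worker_1", "joy")
hit.submit("worker_2", "surprise")
hit.approve("worker_1")
print(hit.total_payout())
```

In this toy model the second worker did the task but is not paid unless the requester approves it, which is the asymmetry the later wage analysis in this chapter examines.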

The choice of tasks was important because the paper was not testing one toy labeling problem. Affect recognition asked workers to judge emotional content. Word similarity asked for semantic closeness. Textual entailment required deciding whether one sentence followed from another. Temporal event ordering asked for judgments about sequence in language. Word sense disambiguation required choosing which meaning of a word was intended in context. Together, those tasks sampled the kinds of small interpretive decisions that made supervised natural language processing expensive. They were easy to describe as individual questions, but hard to automate reliably without examples, and the examples themselves required human judgment.

The paper did not argue that a single anonymous click from the internet was equivalent to the nuanced judgment of a trained linguist. Instead, the researchers relied on careful task design and deliberate redundancy. In their affect recognition experiment, for example, they collected ten independent annotations per item. By using this built-in redundancy, they could study how aggregating multiple non-expert opinions improved the reliability of the final label. The economic results of this approach were striking. The team reported paying 2 dollars to collect 7,000 non-expert annotations for the affect task. They interpreted this rate as yielding 3,500 non-expert labels per US dollar. Even after accounting for the need for redundancy and bias correction, they calculated that this provided at least 875 expert-equivalent labels per dollar spent.
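The arithmetic behind those rates can be checked directly. The three input figures below are the ones reported above; the final ratio is derived here from the two reported rates rather than quoted from the paper, and the variable names are illustrative.

```python
total_cost_usd = 2.00             # reported payment for the affect task
total_annotations = 7_000         # reported non-expert annotations collected
expert_equivalent_per_usd = 875   # reported lower bound after redundancy

# Raw rate: annotations bought per dollar.
labels_per_usd = total_annotations / total_cost_usd

# Derived: how many non-expert labels the two rates imply per
# expert-equivalent label.
implied_labels_per_expert_label = labels_per_usd / expert_equivalent_per_usd

print(labels_per_usd)                   # 3500.0
print(implied_labels_per_expert_label)  # 4.0
```

Read this way, the two rates imply a discount of about four non-expert labels for each expert-equivalent one.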

Repeated labeling changed the meaning of the individual worker response. A single answer could be noisy, hurried, or mistaken, but a set of answers could be treated statistically. The requester could compare workers to one another, measure agreement, infer which items were difficult, and use aggregation to move from raw clicks toward a usable label. The human API therefore did not eliminate expertise by magic. It replaced one expensive, high-trust judgment with several cheaper, lower-trust judgments plus a method for combining them. That bargain was only attractive because the platform made it cheap enough to ask the same question multiple times.
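One simple way to combine several answers is a majority vote that also reports how strongly the workers agreed. The sketch below uses plain majority voting as a stand-in; it is not the aggregation or bias-correction method of the Snow et al. paper, and the function name `aggregate`, the item keys, and the example labels are hypothetical.

```python
from collections import Counter


def aggregate(labels):
    """Return (majority_label, agreement) for one item's labels.

    Agreement is the share of workers who chose the winning label.
    Ties go to whichever tied label appeared first, an arbitrary choice.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


# Illustrative votes: one easy item and one contested item.
votes_by_item = {
    "headline_1": ["joy", "joy", "joy", "surprise", "joy"],
    "headline_2": ["fear", "sadness", "fear", "sadness", "anger"],
}

for item, labels in votes_by_item.items():
    label, agreement = aggregate(labels)
    print(item, label, agreement)
```

The agreement score is what lets a requester tell a confident label from a contested one, so low-agreement items can be flagged as difficult instead of being silently accepted.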

The study also made visible the operational discipline required to use MTurk well. Qualifications could filter who saw a task, approval and rejection gave requesters a blunt quality lever, and bonuses could shape incentives. But these mechanisms did not guarantee quality on their own. The paper’s result depended on matching the task to the crowd, collecting enough judgments, and correcting for systematic worker bias. That is why its title mattered. “Cheap and fast” was not the same as “automatically good.” The point was that, for many of the tasks the authors tested, a small number of non-expert annotations per item could reach expert-level performance when the annotation process was designed as a measurement system rather than a pile of isolated answers.

The conclusion of the Snow et al. study sent a powerful signal to the machine learning community. They found that for many tasks, only a small number of non-expert annotations per item were needed to equal the performance of an expert annotator. The paper demonstrated that aggregated non-expert labels, when managed with proper redundancy and bias control, could approach gold-standard quality at a fraction of the traditional cost and time. Mechanical Turk was no longer just an Amazon web service; early examples like this paper showed it could serve as a viable, scalable research annotation infrastructure, offering a pragmatic way around one of supervised learning’s recurring data bottlenecks.

While natural language processing researchers were proving the viability of the human API for linguistic tasks, computer vision researchers were preparing to test its limits at an unprecedented scale. In 2009, Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei published a paper at the Conference on Computer Vision and Pattern Recognition detailing the construction of ImageNet, a large-scale hierarchical image database. The ambition of the project was massive: they aimed to populate most of the approximately 80,000 noun synonym sets, or synsets, defined by the WordNet lexical database with 500 to 1,000 clean, full-resolution images each. By the time of the 2009 publication, they had already organized 12 subtrees encompassing 5,247 synsets and 3.2 million images.

ImageNet’s dependence on WordNet gave the dataset a particular shape. It was not just a folder of many images scraped from the web. It was a hierarchy of concepts, where each synset represented a noun meaning and where the database’s ambition was to attach many clean visual examples to those meanings. That structure made the labeling problem more precise and more difficult at the same time. A candidate picture was not merely “interesting” or “photographic.” It either did or did not contain an object corresponding to the target synset, and the worker needed enough definition to make that judgment consistently.

The target numbers explain why the project needed a different method. Populating even a small fraction of WordNet with hundreds of images per synset quickly turns into millions of visual decisions. A web search engine could supply candidates, but it could not certify that the candidate matched the intended noun sense. A photograph returned for an animal name might show a toy, a logo, a person wearing a costume, a drawing, or a scene in which the target object was absent. The data set’s scientific value depended on removing those mismatches, and the removal step was exactly the kind of repetitive perceptual judgment that MTurk had been designed to expose.

The authors explicitly stated that building a dataset of this magnitude meant they could no longer rely on traditional data-collection methods. Gathering millions of candidate images from the web was only the first step; the critical, labor-intensive challenge was verifying that the downloaded images actually contained the objects they were supposed to represent. The missing machine for this verification step was a crowd of human workers, and the ImageNet team used Amazon Mechanical Turk to construct their cleaning pipeline. Workers were presented with sets of candidate images along with the WordNet definition of the target synset and asked to verify the presence of the specified object.

The pipeline began before the worker ever saw a HIT. Candidate images had to be found, and the paper described a collection scheme built around WordNet synsets, image search, query expansion, and translations into multiple languages. That produced volume, but volume was not the same as data quality. Web search returned irrelevant images, ambiguous images, drawings, logos, and cases where a word’s visual referent shifted with context. Mechanical Turk entered at the cleaning stage because the project needed to turn a large candidate pool into a database of images that could support computer-vision research.

The quality control process for this vast undertaking was rigorous. The researchers did not blindly trust raw votes from individual workers. They relied on multiple independent users evaluating the images. Initial subsets required at least ten votes per image to establish baselines of difficulty and reliability, and the system used confidence score tables and thresholds to determine whether to continue or stop labeling remaining candidate images based on user agreement. This distributed, carefully managed human effort yielded an impressive reported average precision of approximately 99.7 percent on the sampled synsets. ImageNet showed that crowd labor could become part of the operating system for assembling massive visual datasets.

The thresholds mattered because not all categories were equally easy. Some synsets corresponded to objects that workers could recognize quickly. Others depended on fine distinctions, cluttered scenes, unusual viewpoints, or ambiguous candidate images returned by search. A fixed number of votes for every image would waste effort on obvious cases and under-sample difficult ones. By using agreement patterns and confidence thresholds, the ImageNet team treated human labor as an adaptive verification process. More judgment could be spent where the category or candidate pool demanded it; less could be spent where the evidence was already strong enough.
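The idea of spending more votes only where agreement is weak can be sketched as a stopping rule. This is a deliberately simplified stand-in, not the ImageNet team's confidence score tables: the function `verify_image`, its parameters, the default threshold values, and the scripted vote sequences are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def verify_image(
    get_vote: Callable[[], bool],
    min_votes: int = 3,
    max_votes: int = 10,
    threshold: float = 0.8,
) -> Tuple[Optional[bool], int]:
    """Collect yes/no votes until one side reaches `threshold` agreement.

    Returns (decision, votes_used). The decision is None when max_votes
    is reached without enough agreement either way.
    """
    yes = 0
    for used in range(1, max_votes + 1):
        if get_vote():
            yes += 1
        if used >= min_votes:
            if yes / used >= threshold:
                return True, used
            if (used - yes) / used >= threshold:
                return False, used
    return None, max_votes


def scripted(votes):
    """Replay a fixed list of votes, standing in for real workers."""
    remaining = iter(votes)
    return lambda: next(remaining)


easy = verify_image(scripted([True, True, True]))
contested = verify_image(scripted([True, False] * 5))
print(easy, contested)
```

In this sketch an obvious image is settled after the minimum number of votes, while a contested one consumes the full budget and still comes back undecided, leaving the caller to drop it or escalate it.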

The reported 99.7 percent average precision on sampled synsets should be read in that context. The number was not evidence that crowd work required no care; it was evidence that crowd work, wrapped in a carefully designed verification process, could support a dataset whose scale would have been prohibitive by older collection methods. The result depended on structure at every level: WordNet supplied the concept hierarchy, image search supplied candidate pools, MTurk supplied distributed human decisions, and the research team’s thresholds turned those decisions into an acceptance process.

This is the point at which Mechanical Turk becomes difficult to separate from the later history of artificial intelligence, even though the direct causal chain must be kept narrow. ImageNet’s 2009 paper was not yet the story of the 2012 deep-learning breakthrough. It was the story of how a large, structured visual dataset could be built at all. The contribution of MTurk in that story was not conceptual glamour. It was the unglamorous ability to ask many people, many times, whether a candidate image matched a defined category, then turn those answers into a cleaner corpus than automated web search alone could supply.

However, the success of ImageNet and the broader utility of Mechanical Turk relied on an abstraction that profoundly obscured the reality of the labor involved. The API made human judgment look frictionless, callable, and meterable to the requester, but the human-operated nodes executing the subtasks were actual people operating within a precarious labor market. The costs of this abstraction were later detailed in an empirical analysis by Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey P. Bigham. By recording 2,676 workers performing 3.8 million tasks, they found that the median hourly wage was roughly 2 dollars, with only 4 percent of workers earning more than the federal minimum wage of 7.25 dollars per hour. Their analysis highlighted the unpaid work components hidden behind the API call: the uncompensated time workers spent searching for tasks, the labor lost when work was rejected by requesters, and the effort expended on tasks that were ultimately not submitted. To the software developer, the human API was a cheap and fast function call; to the workers fulfilling the requests, it was an environment characterized by low median wages, constant search time, and unpaid friction. This hidden labor force provided the crucial verification layer for the datasets that would soon reshape the trajectory of artificial intelligence research.

That closing tension is the durable meaning of Mechanical Turk in AI history. The same design that made human judgment usable by software also made the worker easy to overlook. In the requester view, the worker appeared as latency, price, approval rate, and output. In the research view, the worker often appeared as a row in an annotation table or a vote in an aggregation rule. Yet the labels that made supervised systems learn did not fall out of the web by themselves. They were produced through a market whose interface hid search time, rejection risk, and low effective pay behind the smooth language of services.

Hara and colleagues’ wage analysis was later than the launch and later than ImageNet’s first paper, so it should not be treated as a direct measurement of every earlier annotation campaign. Its value here is different. It shows the kind of cost that the interface was structurally good at hiding. The requester saw the price of a completed task. The worker experienced the time spent finding acceptable work, the possibility of rejection, and the effort lost when a task was abandoned or not submitted. The same system that made annotation cheap for a researcher could make the labor market opaque for the person doing the annotating.
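The gap between the price a requester sees and the rate a worker experiences can be made concrete with a small calculation. The function name and every figure in the example are hypothetical and are not taken from Hara and colleagues' data; the sketch only shows how unpaid time lowers the effective rate.

```python
def effective_hourly_wage(paid_usd, paid_minutes, unpaid_minutes):
    """Earnings divided by all time spent, paid or not.

    unpaid_minutes covers searching for tasks and work that was
    rejected, abandoned, or never submitted.
    """
    total_hours = (paid_minutes + unpaid_minutes) / 60
    if total_hours <= 0:
        raise ValueError("total time must be positive")
    return paid_usd / total_hours


# Hypothetical session: $6.00 earned on 60 minutes of approved tasks.
task_only_rate = effective_hourly_wage(6.00, 60, 0)

# Same session with 30 minutes of searching and 30 minutes of
# rejected or abandoned work added to the denominator.
full_session_rate = effective_hourly_wage(6.00, 60, 60)

print(task_only_rate, full_session_rate)
```

The requester's view corresponds to the first number and the worker's to the second: the payment is identical, but the denominator includes time that the interface never reports.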

Mechanical Turk did not invent crowdsourcing, and it did not by itself cause the next wave of machine learning. Its more specific achievement was to make paid human judgment fit the operational grammar of the cloud. Storage could be rented. Compute could be rented. For certain stubborn perceptual and linguistic judgments, people could be queried through the same developer-facing style of interface. That abstraction helped researchers build data sets that earlier methods could not have assembled so easily, and it left a moral remainder that the abstraction itself could not erase: the intelligence in “Artificial Artificial Intelligence” was never artificial all the way down.