Chapter 32: The DARPA SUR Program

Cast of characters

Name	Lifespan	Role
Allen Newell	1927–1992	CMU University Professor; chair of the ARPA-IPTO study group and principal author of the May 1971 charter.
John R. Pierce	1910–2002	Bell Telephone Laboratories engineer; chaired the 1966 ALPAC report; author of the October 1969 JASA letter “Whither Speech Recognition?”
Raj Reddy	b. 1937	CMU faculty; principal investigator of CMU’s SUR contracts; member of the 1971 study group; co-author of the 1979 HARPY chapter.
Bruce T. Lowerre	—	CMU PhD student under Reddy; principal architect of HARPY; April 1976 dissertation; co-author of the 1979 Trends in Speech Recognition chapter.
Frederick Jelinek	1932–2010	Manager of IBM Yorktown Heights’ Continuous Speech Recognition Group from 1972; co-author of the 1983 PAMI paper; first uttered “Whenever I fire a linguist…” in December 1988.
Charles L. Wayne	—	DARPA program manager who restarted US speech-and-language funding around shared evaluation in the mid-1980s.

Timeline (1966–1988)

timeline
    title The DARPA SUR Program and Its Methodological Heir
    1966 : ALPAC report (Pierce, chair) ends US machine-translation funding
    1969 : Pierce publishes "Whither Speech Recognition?" in JASA
    1971 : Newell et al. study-group report (May, CMU technical report) : ARPA-IPTO awards SUR contracts to CMU, BBN, SDC
    1971-1976 : SUR program runs as five-year multi-contractor effort
    1976 : HARPY becomes operational before the September four-system demonstration and meets Final Specs
    1977 : Klatt publishes "Review of the ARPA Speech Understanding Project" in JASA
    1983 : Bahl, Jelinek, Mercer "Maximum Likelihood Approach" in PAMI
    1985 : Mercer at Arden House — "There is no data like more data"
    1988 : Jelinek's December Wayne PA talk — "Whenever I fire a linguist..."

Plain-words glossary

Speech-understanding system — A research goal beyond pure transcription: a machine that recognises spoken words and extracts enough meaning to carry out a defined task (data retrieval, document lookup) within a constrained vocabulary and grammar.
Final Specifications (Newell 1971) — The reduced, measurable target adopted by the May 1971 study group: ~1,000 words, an artificial syntax, a quiet room, cooperative speakers of General American dialect, less than 10% semantic error, demonstrable by 1976.
Semantic error — A task-level scoring rule that judges whether the system’s interpretation would cause the wrong action.
Common task method — A shared-evaluation framework in which multiple research groups test systems on the same defined task.
Maximum-likelihood decoding — IBM’s framing of speech recognition as recovering the most probable word sequence given an acoustic signal. The probability factors into a language model (how likely is this text?) and an acoustic model (how likely are these sounds given that text?).
Perplexity — An information-theoretic measure of task difficulty introduced for ASR by Bahl, Jelinek, and Mercer in 1983. Higher perplexity means the language model leaves more plausible next words at each step. Vocabulary size alone, the IBM group argued, was “practically useless” by comparison.
Forward-Backward Algorithm — The iterative procedure the IBM group used to estimate the parameters of a hidden-Markov speech model from training data. The same machinery underlies modern HMM training pipelines.

In 1966, John Pierce—then a senior figure at Bell Telephone Laboratories—chaired the Automatic Language Processing Advisory Committee (ALPAC) for the National Academy of Sciences. The committee’s resulting report effectively ended United States government funding for machine-translation research for over a decade. The significance for speech recognition was not that translation and speech were the same technical problem. It was that Pierce had shown how an influential committee could turn a field’s promises into a funding question: What exactly had been demonstrated, what remained speculative, and why should the government keep paying?

Three years later, in October 1969, Pierce published a sharply skeptical Letter to the Editor in The Journal of the Acoustical Society of America titled “Whither Speech Recognition?” In it, he characterized automatic-speech-recognition research as being “attractive to money,” writing that “the attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, or going to the moon.” The sentence had a Bell Labs engineer’s dry cruelty. It did not say that speech recognition was uninteresting. It said that the subject had become good at attracting support before it had become good at explaining its own progress.

As Jelinek quotes the letter in his later retrospective, Pierce argued that “most recognizers behave not like scientists, but like mad inventors or untrustworthy engineers.” He further claimed that performance “will continue to be very limited unless the recognizing device understands what is being said with something of the facility of a native speaker (that is, better than a foreigner fluent in the language).” Pierce’s target was the gap between a laboratory recognizer and the ordinary linguistic competence that humans use without noticing: accents, syntax, context, false starts, the knowledge that lets a listener infer what a speaker must have meant. Pierce closed the letter with a line that applies broadly to artificial intelligence as a whole: “Any application of the foregoing discussion to work in the general area of pattern recognition is left as an exercise for the reader.”

Pierce’s critique is easiest to caricature as hostility to speech recognition, but the narrower reading is more useful. He was attacking the habit of treating limited recognition stunts as evidence that general recognition was close. His native-speaker comparison made the missing middle explicit: a recognizer that could not use knowledge of language and situation would remain trapped in constrained demonstrations. That was precisely the sort of criticism a government sponsor had to answer before it could justify a new program.

The 1966 ALPAC report and Pierce’s 1969 letter together created a funding climate in which any new commitment to speech-understanding research by the Advanced Research Projects Agency’s Information Processing Techniques Office (ARPA-IPTO) had to be rigorously and defensively argued. The question facing ARPA-IPTO was not simply whether a speech recognizer could be built. The narrower question was whether a research program could be specified tightly enough that success and failure would be visible to outsiders. When ARPA-IPTO commissioned the study group whose report appeared in May 1971, the charge itself reflected this post-Pierce skepticism.

That distinction mattered. A field could survive a failed prototype if it had also built a better way to measure prototypes. It could not survive indefinitely on demonstrations that only their builders understood. The SUR program began, therefore, with an unusual institutional premise: before ARPA paid for five years of speech-understanding work, a study group had to say what the work would count as achieving.

The Newell Charter

In May 1971, the study group—chaired by Carnegie-Mellon University’s Allen Newell and including J. Barnett, J. Forgie, C. Green, D. Klatt, J. C. R. Licklider, J. Munson, R. Reddy, and W. Woods—published Speech-Understanding Systems: Final Report of a Study Group. The cover made the institutional setting explicit: the work was sponsored and supported by ARPA-IPTO. The roster also mattered. Newell brought the authority of Carnegie-Mellon’s artificial-intelligence tradition; Licklider connected the exercise to the older IPTO culture of ambitious computing research; Reddy connected the report to the CMU speech group that would become one of the program’s central contractors; and Klatt would later write the canonical review of the finished program. The report acknowledged the defensive nature of its task, noting: “We were charged with determining the feasibility of demonstrating a speech recognition system with useful capabilities and greater power than current isolated word recognition programs (e.g., Vicens, Gold).”

The Newell group concluded that a short-term effort would fail. Conclusion 1 of the report stated flatly: “Three years is not enough time to achieve a system with the initial specifications.” However, Conclusion 2 offered a path forward: “Five years provides a reasonable chance of success for the system with the final specifications. The system would be a research prototype, though it would be capable of extensive operation for exploration and testing.”

Those two conclusions did more than choose a schedule. They separated aspiration from obligation. A three-year crash project would invite another Pierce-style reckoning: a large promise, a public demonstration, and a likely gap between the two. A five-year program, with reduced specifications and explicit measurement, gave the field a way to ask a harder but answerable question. Could several research groups build connected-speech systems that handled a defined task, a defined vocabulary, defined speakers, and a defined error rate by 1976?

Section 1.2 of the report laid out a side-by-side comparison of the ambitious Initial Specifications against the Final Specifications that the proposed five-year program would actually commit to. The Initial Specifications had called for a 10,000-word vocabulary over telephone-quality audio, demonstrable by 1973. The Final Specifications relaxed the acoustic assumptions and lengthened the timeline, committing contractors to accept continuous speech from many cooperative speakers of the General American dialect, in a quiet room, over a good-quality microphone. The system would allow slight tuning per speaker but require only natural adaptation by the user. It would permit a slightly selected vocabulary of 1,000 words with a highly artificial syntax and a task like data management or computer status (but not the computer consultant task). Finally, the system would use a simple psychological model of the user, provide graceful interaction, tolerate less than 10% semantic error, run in a few times real time, and be demonstrable in 1976 with a moderate chance of success.

Read as a technical document, the table was a set of compromises. Read as a funding document, it was a discipline machine. Telephone audio disappeared; a good microphone took its place. A 10,000-word vocabulary became roughly 1,000 words. A broad conversational assistant became a constrained data-management or computer-status task. A real-world user population became cooperative speakers of General American dialect. The syntax was not natural language in the broad sense; it was explicitly artificial. The result was not consumer speech recognition, and the report did not pretend otherwise. It was a research prototype designed so that its performance could be tested without allowing the task to dissolve into anecdote.

The error measure was equally revealing. The report did not ask for perfect transcription. It asked for less than 10% semantic error. That choice moved the goal away from phonetic purity and toward task performance: could the system understand enough of the utterance to carry out the intended operation? Even this was bounded by the rest of the table. A system could satisfy the target only inside the chosen vocabulary, syntax, acoustic setting, and speaker assumptions. The table’s caution was not timidity; it was what made the target falsifiable.
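The difference between transcription accuracy and semantic error can be made concrete with a small sketch. The utterances, field names, and action labels below are invented for illustration and are not taken from the 1971 report or from any SUR scoring tool. The sketch only assumes that each test utterance records what was spoken, what the system heard, which action the speaker intended, and which action the system took.

```python
def error_rates(results):
    """Return (sentence_error_rate, semantic_error_rate) for scored utterances."""
    n = len(results)
    # Sentence error: the transcript differs from what was spoken in any way.
    sentence_errors = sum(1 for r in results if r["heard"] != r["spoken"])
    # Semantic error: the system would carry out the wrong action.
    semantic_errors = sum(
        1 for r in results if r["action_taken"] != r["action_intended"]
    )
    return sentence_errors / n, semantic_errors / n


# Hypothetical scored utterances, invented for illustration only.
results = [
    {"spoken": "list the abstracts", "heard": "list the abstracts",
     "action_intended": "LIST_ABSTRACTS", "action_taken": "LIST_ABSTRACTS"},
    # A dropped word that does not change the action: a transcription
    # error, but not a semantic error.
    {"spoken": "please list the abstracts", "heard": "list the abstracts",
     "action_intended": "LIST_ABSTRACTS", "action_taken": "LIST_ABSTRACTS"},
    {"spoken": "how many articles mention grammar",
     "heard": "how many articles mention grammar",
     "action_intended": "COUNT_ARTICLES", "action_taken": "COUNT_ARTICLES"},
    # A substituted word that changes the action: both kinds of error.
    {"spoken": "print the next article", "heard": "print the last article",
     "action_intended": "PRINT_NEXT", "action_taken": "PRINT_LAST"},
]

sentence_rate, semantic_rate = error_rates(results)
print(f"sentence error: {sentence_rate:.0%}, semantic error: {semantic_rate:.0%}")
```

In this toy set, half the transcripts contain a mistake but only one utterance in four triggers the wrong action, which is the gap the semantic-error criterion was designed to expose.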

More enduring than the specifications themselves was the institutional choreography the report proposed. Conclusion 5 noted that “success requires widespread involvement by several technical communities,” demanding that attention “be focussed on the ultimate problem of a speech-understanding system through some form of cooperative and evaluative endeavor.” The wording joined two ideas that often pull apart in research programs. The groups would compete, because each system would be measured. They would also cooperate, because the measurements would be meaningful only if tasks, data, and definitions were shared.

This cooperative endeavor was formalized in Section 8.6, titled “Public Data and Public Analysis.” In this section, the Newell study group articulated the methodological stance that would outlive the program itself: “A major instrument for progress on speech-understanding systems will be good data of suitable variety, prepared so that it is possible to relate how different systems and algorithms process it. Claims will be made about a wide variety of systems and subsystems over a wide variety of communication situations. If the claims are not made against a background of publicly available high quality data of known structure, it will never be possible to understand the claims or their basis. The issue is not one primarily of assigning credit, but of making progress by understanding success and failure.”

The passage is easy to miss because it is administrative rather than theatrical. It does not announce a new algorithm. It does not predict a machine that can converse. Instead it names the infrastructure a young field lacked: data of known structure, task descriptions clear enough to be reused, and measurements detailed enough that two systems could fail differently. This was a direct answer to Pierce’s complaint. If recognizer builders were to stop looking like solitary inventors, they needed a public way to show what their inventions did.

The study group extended this methodology in Section 8.7 by calling for adequate task descriptions, the instrumenting of both hardware and software systems to take appropriate measurements, and the operation of total speech-understanding systems as a baseline for measuring progress. Instrumentation here meant more than logging final scores. Hardware and software had to be observable. Researchers had to know where an error arose: in acoustic processing, lexical search, syntactic constraint, semantic interpretation, or dialogue management. The “total system” mattered because a component that looked good in isolation could still fail when attached to the rest of a speech-understanding pipeline.

The resulting Speech Understanding Research (SUR) program ran from 1971 to 1976 as a multi-contractor effort across CMU, BBN, SDC, and SRI, all racing toward the Final Specifications with cross-site comparison acting as the ultimate criterion of success. That design left room for very different architectures. It did not require the contractors to agree about how speech should be recognized. It required them to submit their disagreements to the same task.

The September 1976 Demonstration

Six weeks before the September 1976 ARPA demonstration, on August 13, 1976, CMU’s HARPY system became operational, configured for a 1,011-word document-retrieval task. The date matters because it turns the 1971 table into a deadline rather than a wish. Five years earlier, Newell’s study group had asked whether a connected-speech research prototype could be demonstrated in 1976 with a moderate chance of success. By mid-August, CMU had a system ready for the task that would answer that question.

At the demonstration, HARPY was measured against the Newell 1971 Final Specifications and met them. As CMU’s own 1977 retrospective carefully phrased it, the system “not only satisfies the original goals, but exceeds some of the stated objectives.”

Specifically, HARPY recognized speech from the 1,011-word vocabulary at a 5% semantic error rate, easily clearing the target of “less than 10% semantic error.” The environment was still narrow: a document-retrieval task, cooperative speakers, a close-talking microphone, and a controlled room rather than noisy public speech. But that was exactly the point of the Final Specifications. HARPY was not being asked to solve everyday conversation. It was being asked to solve the task the field had agreed would be meaningful.

CMU’s target-versus-performance table makes the narrowness concrete. The speaker set was five people, three men and two women. The quiet-room assumption became a computer terminal room. The good microphone became a close-talking microphone. The 1,000-word vocabulary became 1,011 words, reported without post-selection. The machine used 256K of 36-bit words, and CMU estimated the processing cost at about five dollars per sentence. These are not incidental bookkeeping details. They show what a successful 1976 speech-understanding demonstration physically looked like: expensive, constrained, and still impressive because it satisfied a public specification.

HARPY met the error target while running at 80 times real time on a 0.35 MIPS PDP-KA10. Lowerre and Reddy later translated that into about 28 MIPSS (millions of instructions per second of speech), far below the 200 to 500 MIPSS implied by the “few times real time” target on an assumed 100 MIPS processor. Because the Newell study group’s speed target had been benchmarked against much more powerful hardware than CMU actually used, HARPY’s speed was primarily an algorithmic achievement. The system required 20 to 30 sentences per talker for tuning, which satisfied the specification, although retrospective reports varied on whether this constituted “slight” or “substantial” tuning.
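The speed comparison above is simple arithmetic, and a short sketch makes it checkable. It assumes that MIPSS means millions of instructions spent per second of speech, that is, the real-time factor multiplied by the machine's MIPS rating. It also assumes that "a few times real time" means a factor of 2 to 5, which is inferred from the 200 to 500 MIPSS range rather than quoted from the report. The function name is illustrative.

```python
def mipss(times_real_time, machine_mips):
    """Millions of instructions spent per second of speech.

    times_real_time: seconds of computation per second of speech.
    machine_mips:    processor speed in millions of instructions per second.
    """
    return times_real_time * machine_mips


# HARPY as reported: 80 times real time on a 0.35 MIPS machine.
harpy_cost = mipss(80, 0.35)

# Budget implied by the target: "a few" (assumed 2 to 5) times real time
# on the assumed 100 MIPS processor.
target_low = mipss(2, 100)
target_high = mipss(5, 100)

print(f"HARPY: about {harpy_cost:.0f} MIPSS")
print(f"Target budget: {target_low:.0f} to {target_high:.0f} MIPSS")
print(f"HARPY used roughly 1/{target_low / harpy_cost:.0f} "
      f"to 1/{target_high / harpy_cost:.0f} of that budget")
```

On this reading, HARPY spent roughly a seventh to an eighteenth of the instruction budget the specification had allowed, even though it ran far slower than real time on the hardware CMU actually had.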

The system was also not free to use unlimited grammar. CMU’s retrospectives describe the recognition task as constrained by syntactic and semantic knowledge, with average branching-factor figures reported differently across the 1977 and 1979 accounts. The discrepancy is less important than the shared point: HARPY’s success depended on narrowing the search space. It won the 1976 target by compiling a great deal of knowledge into a form that could be searched efficiently on weak hardware.

HARPY’s design lineage flowed from CMU’s earlier HEARSAY-I system, passing through a system called Dragon before becoming HARPY. Built by James K. Baker, Dragon used a Markov-network knowledge representation; HARPY adopted Dragon’s integrated network representation but dropped its a priori transition probabilities. Lowerre and Reddy described Dragon’s representation and delayed-decision techniques as the most important intellectual legacy behind HARPY. This lineage complicates the clean story in which SUR was simply symbolic and IBM was simply statistical. Inside CMU’s own contractor work, Baker had already introduced a statistical exception. HARPY’s final form, however, used that inheritance for a different purpose: a tightly compiled network search that could satisfy the ARPA specification.

While HARPY met the targets, its CMU sibling system, HEARSAY-II, did not. HEARSAY-II is historically important for reasons the demonstration table alone does not capture: its blackboard architecture became a major idea in AI system design. But the September 1976 evaluation was not rewarding architectural influence. It was measuring the Final Specifications. Tested on the same 1,011-word vocabulary task, HEARSAY-II achieved a semantic error rate of approximately 16% on the simple AIX05 grammar and 26% on the more complex AIX15 grammar, running between two and twenty times slower than HARPY.

The contractor balance remains uneven in the surviving record. CMU’s systems are documented in detail through CMU’s own 1977 report and the later Lowerre-Reddy HARPY chapter. BBN’s HWIM (Hear What I Mean) and SDC’s system are treated more tersely in the accessible summaries. Those summaries report that HWIM and SDC also failed to meet the Newell 1971 Final Specifications, but their detailed demonstration numbers are not needed for the central point. Of the four contractor systems, only HARPY successfully achieved the demonstration targets.

Yet the September 1976 demonstration was an institutional victory. The methodological choreography proposed in Newell 1971 Section 8.6 had worked. ARPA was able to compare HARPY’s 5% semantic error against HEARSAY-II’s 16% error rate on the same AIX05-style task family, while placing BBN and SDC in the same program-level comparison. That did not make every result flattering. It made the results legible. A sponsor could see not merely that one demonstration impressed an audience, but that one architecture satisfied a published target while others did not.

This is why the 1976 demonstration resists a single verdict. By the program’s own technical criterion, SUR succeeded: HARPY met the Final Specifications. By the broader institutional criterion of whether ARPA immediately continued the same line of speech-understanding funding, the result looked much less triumphant. Those two judgments are not contradictions. They are judgments at different scales.

The IBM Parallel Track

While the SUR contractors raced toward the 1976 demonstration, a separate effort was underway at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. Throughout the 1970s, IBM’s Continuous Speech Recognition Group—led by Frederick Jelinek and including Lalit Bahl, Robert Mercer, and later Jim Baker—pursued speech recognition on a track parallel to the DARPA SUR program, operating without ARPA funding. This separation is essential. IBM was not a fifth SUR contractor. Its work developed alongside the ARPA program, watched its tasks, and eventually used one of those tasks for comparison, but it belonged to a different institutional world.

The IBM group’s approach was fundamentally different from the predominantly rule-based systems of the SUR contractors. As formalized in their 1983 paper “A Maximum Likelihood Approach to Continuous Speech Recognition,” published in IEEE Transactions on Pattern Analysis and Machine Intelligence, the IBM team framed speech recognition not as a problem of artificial syntax but as a problem of maximum-likelihood decoding over a noisy communication channel. In this view, a text generator produces words, an acoustic channel converts those words into a noisy signal, and a linguistic decoder must recover the most likely word string from that signal.

The mathematical form of the idea can be stated without deriving the later hidden-Markov machinery. Given an acoustic observation, the recognizer seeks the word sequence with the highest probability. That probability can be decomposed into two parts: how likely the words are as language, and how likely the observed sounds are if those words were spoken. This was the speech-recognition version of a communication-theory argument. It did not deny that syntax, pronunciation, and linguistic knowledge mattered. It insisted that they enter the machine as model structure and estimated probabilities rather than as hand-authored rules whose weights were chosen by intuition.
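The decomposition described above can be written compactly. The symbols are chosen here for illustration: W stands for a candidate word sequence and A for the observed acoustic evidence.

```latex
\hat{W}
  = \arg\max_{W} P(W \mid A)
  = \arg\max_{W} \frac{P(W)\, P(A \mid W)}{P(A)}
  = \arg\max_{W} \underbrace{P(W)}_{\text{language model}}
                 \,\underbrace{P(A \mid W)}_{\text{acoustic model}}
```

The middle step is Bayes' rule. The denominator P(A) can be dropped because it does not depend on W and so cannot change which word sequence wins the maximization.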

Within this framework, statistical Markov-source models replaced hand-coded linguistic rules. The system modeled sequences of word-emitting state transitions, with the model’s parameters estimated automatically from large amounts of training data via the iterative Forward-Backward Algorithm. The practical force of this move was not that the machine suddenly understood speech like a person. It was that errors became opportunities for reestimation. If the model’s parameters were wrong, the remedy was not only to argue over a better linguistic rule. It was to expose the model to more data and let the estimation procedure adjust the numbers.
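The reestimation idea can be sketched with the textbook form of the Forward-Backward (Baum-Welch) procedure. This is a minimal illustration, not the IBM group's implementation. It assumes a discrete hidden-Markov model whose states emit symbols, whereas the IBM Markov sources are described above in terms of transitions. It omits the numerical scaling a real system would need for long utterances, and the toy parameters and observation sequence are invented.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state at time t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state at time t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def reestimate(obs, pi, A, B):
    """One Forward-Backward pass; returns updated (pi, A, B) and the
    likelihood of obs under the parameters that were passed in."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in range(n)] for i in range(n)]
          for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


# Toy model (invented values): 2 hidden states, 2 output symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 0, 1, 1, 0, 0]

history = []
for _ in range(5):
    pi, A, B, lik = reestimate(obs, pi, A, B)
    history.append(lik)
print(history)  # likelihood of obs before each update
```

Each pass is guaranteed not to lower the likelihood of the training data, which is the sense in which a wrong parameter becomes something to reestimate rather than something to argue about.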

To measure task difficulty, the 1983 paper introduced an information-theoretic metric called perplexity. The authors noted that “although vocabulary size is almost always mentioned in the description of an artificial task, by itself it is practically useless as a measure of difficulty. In this section we describe perplexity, a measure of difficulty based on well established information theoretic principles.” For the Raleigh Language task used by the IBM group, the perplexity was 7.27 against a vocabulary of several hundred words.

Perplexity solved a problem that the Newell table had exposed but not fully resolved. A 1,000-word vocabulary could be easy if the grammar allowed only a few plausible next words at each step, and difficult if the grammar allowed many. Vocabulary size was a public number, but it could mislead. Perplexity made the constraint itself measurable. It asked, in effect, how many alternatives the recognizer faced at each choice point after the language model had done its work. That was the IBM track’s technical counterpart to SUR’s institutional insistence on comparable tasks.
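A small sketch shows why vocabulary size and perplexity can diverge. It uses the standard definition of perplexity as 2 raised to the average negative log2 probability the language model assigns to each word. The two toy grammars are invented: both have a 1,000-word vocabulary, but one allows only 4 equally likely next words at each step and the other allows all 1,000.

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence, given the probability the language
    model assigned to each word in its context.

    Computed as 2 ** (average negative log2 probability per word).
    """
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** avg_neg_log2


SENTENCE_LENGTH = 10  # arbitrary length for the toy sentences

# Tight grammar: 4 equally likely continuations at every step.
tight = perplexity([1 / 4] * SENTENCE_LENGTH)

# No grammar: any of the 1,000 vocabulary words may follow, all equally likely.
loose = perplexity([1 / 1000] * SENTENCE_LENGTH)

print(f"tight grammar: perplexity about {tight:.1f}")
print(f"no grammar:    perplexity about {loose:.1f}")
```

With uniform choices the perplexity equals the number of allowed continuations, so the same 1,000-word vocabulary yields a task that is either about as hard as choosing among 4 words or as hard as choosing among 1,000.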

Crucially, the IBM group actively sought to measure their statistical approach against the SUR baseline. The 1983 paper explicitly notes that they ran experiments on the CMU-AIX05 task because it was “the task used by Carnegie-Mellon University in their Speech Understanding System to meet the ARPA specifications.” This sentence is the bridge between the parallel tracks. IBM did not need to be a SUR contractor for SUR’s measurement culture to matter. The CMU task had become a benchmark against which an outside statistical group could test its own assumptions.

The IBM team’s philosophical stance was that linguistic intuition could inform the structure of a model, but the parameter values had to be derived from data. As Jelinek later framed it: “Linguistic intuition combined with ability to extract information will determine the structure of models and their parameterization. Parameter values will be estimated from (annotated) data. We will rely on advice of linguists to create resources.” The formulation is more subtle than the slogan for which Jelinek later became famous. It does not ban linguists from the room. It gives them a different job: create resources, help choose representations, and then let observed data estimate the values.

According to a later retrospective slide by Jelinek, this data-driven approach yielded striking results: while phonetic baseforms with expert-estimated statistics achieved 35% accuracy, switching to automatically estimated statistics raised accuracy to 75%. Using orthographic baseforms with automatically estimated statistics achieved 43%. The exact experimental setting is not developed in that retrospective, so the numbers should be read as Jelinek’s later illustration of a methodological turn rather than as a full benchmark report. Their point is still sharp. The argument was not that linguistic categories were useless. It was that manually assigned statistics could be worse than statistics learned from data.

As Jelinek observed years later, the 1971-1976 ARPA SUR project was “dominated by AI,” with Jim Baker’s Dragon system at CMU serving as the principal statistical exception. But as the 1970s ended, the methodological center of gravity was beginning to shift. It did not shift because HARPY had failed. HARPY had succeeded. It shifted because the evaluation culture SUR helped create made room for another kind of system to show, task by task, that estimating from data could beat carefully engineered intuition.

The Methodology That Outlived SUR

Although HARPY met the 1976 Final Specifications, DARPA did not extend SUR funding into a follow-on program. United States government research funding for both machine translation and speech recognition largely collapsed, entering what has been characterized as a funding winter from 1975 to 1986. The available public retrospectives do not reduce that pause to a single internal ARPA memo or a single disappointed program manager. It is safer, and more accurate, to hold two facts together: the demonstration produced one clear technical winner, and the government did not immediately turn that result into another large speech-understanding program.

The IBM statistical-speech group, having never relied on DARPA funding, continued its work through the winter. The landmark Bahl, Jelinek, and Mercer paper appeared in March 1983, more than six years after the September 1976 demonstration and deep into this funding drought. Even then, the statistical approach was not magic. The 1983 paper itself acknowledged that recognition often required many seconds of CPU time for each second of speech. The method was data-hungry and compute-hungry; its later dominance required more than a clever formulation.

By 1985, at an Arden House workshop, Robert Mercer articulated the group’s emerging methodological maxim: “There is no data like more data.” The line is often quoted because it sounds like provocation. In this history it is also infrastructure. More data mattered only if the field had a way to distribute tasks, score systems, and compare results without reinventing the evaluation every time.

In the mid-1980s, DARPA program manager Charles Wayne restarted US speech-and-language research funding by introducing what became known as the “common task method.” Wayne’s framework relied on well-defined objective evaluation, a neutral evaluator (the National Institute of Standards and Technology, or NIST), and shared datasets. Structurally, this was a return to the “Public Data and Public Analysis” proposal of Newell 1971 Section 8.6, strengthened by a more explicit evaluation institution. Newell’s group had asked for public data and public analysis; Wayne’s revival supplied the neutral scoring machinery that would make that practice routine.

The continuity should not be overstated as a documented handoff from Newell to Wayne. What survives in the record is a structural resemblance, not a memo of inheritance. But the resemblance is strong enough to matter. Both programs treated evaluation as a research instrument rather than as an after-the-fact contest. Both forced systems into common tasks. Both made it possible for a sponsor to reward methods that worked rather than methods that sounded theoretically attractive.

Under the Wayne-era common-task framework, statistical methods began to demonstrate decisive advantages. In December 1988, at the Workshop on Evaluation of NLP Systems in Wayne, Pennsylvania, Frederick Jelinek delivered a talk titled “Applying Information Theoretic Methods: Evaluation of Grammar Quality.” It was at this venue that Jelinek famously quipped: “Whenever I fire a linguist our system performance improves.”

The date is important. The quip did not belong to the SUR years. It came twelve years after the September 1976 demonstration, after the funding winter, after IBM’s statistical formulation had matured, and during the evaluation culture Wayne was rebuilding. Put back in its own chronology, the joke is not the origin of the statistical turn. It is a late, abrasive summary of a shift already visible in models, data, and benchmarks.

The success of statistical methods on Wayne’s shared benchmarks drove a major DARPA revival in the late 1980s and 1990s, yielding systems like Sphinx from CMU, BYBLOS from BBN, and DECIPHER from SRI. CMU’s Sphinx system explicitly integrated the statistical method of hidden Markov models with the network search strength of the earlier HARPY system. That combination is a useful historical correction. The later statistical era did not simply erase the SUR systems. It reused part of their engineering inheritance while changing the way acoustic and language uncertainty were modeled.

The scope of this methodological and statistical inheritance is visible in the publication record. At the Association for Computational Linguistics (ACL) conference in 1990, of the 39 papers presented, only one utilized statistical methods. By ACL 2003, 48 of the 62 papers presented were statistical. Those numbers belong to computational linguistics broadly, not only to speech recognition, but they capture the scale of the turn. Once shared data and shared evaluation became normal, methods that improved measured performance could spread with unusual speed.

The DARPA SUR contractor systems—HARPY, HEARSAY-II, HWIM, and SDC—are now largely retired as historical research artifacts. The program’s most durable contribution was not the software that ran on CMU’s PDP-10 in September 1976, nor even the fact that one system met a 1,000-word target under constrained conditions. The enduring legacy of SUR was the measurement methodology that Allen Newell’s study group named in 1971, that ARPA proved workable in 1976, and that Charles Wayne re-institutionalized a decade later to underwrite the statistical revolution.

That legacy is quieter than the machines. It is not a demo photograph or a famous slogan. It is the expectation that a speech recognizer should be tested on a known task, against known data, with known metrics, in a way that lets outsiders understand why it succeeded or failed. Pierce had warned that speech recognition attracted money before it deserved trust. Newell’s group answered by designing a program in which trust would have to be earned by measurement. HARPY earned it once. The statistical systems that followed learned to earn it repeatedly.