Chapter 15: The Gradient Descent Concept

Cast of characters

Name	Lifespan	Role
Augustin-Louis Cauchy	1789–1857	French mathematician. Presented the three-page note “Méthode générale pour la résolution des systèmes d’équations simultanées” to the Académie des Sciences on 18 October 1847. Promised a follow-up Mémoire with convergence proofs that never appeared.
Herbert Robbins	1915–2001	American mathematical statistician at the University of North Carolina. Co-author of the September 1951 Annals of Mathematical Statistics paper “A Stochastic Approximation Method.”
Sutton Monro	1919–1995	Robbins’s co-author. The 1951 paper proposed a recursive update procedure that drives a parameter toward the root of an unknown monotone function from noisy observations.
Bernard Widrow	1929–	Stanford EE professor; built the lunch-pail Adaline. Co-author of the August 1960 IRE WESCON paper “Adaptive Switching Circuits.”
Marcian E. “Ted” Hoff	1937–	Widrow’s doctoral student in 1960. Co-author on Adaline; later (out of scope here) co-designer of the Intel 4004.
Seppo Linnainmaa	1945–	Finnish computer scientist. His 1970 University of Helsinki master’s thesis on cumulative rounding errors gave the first explicit, efficient reverse-mode chain-rule machinery for arbitrary computational graphs — without ever mentioning neural networks.

Timeline (1847–1970)

timeline
    title The four nodes of gradient descent, 1847–1970
    18 October 1847 : Cauchy presents "Méthode générale pour la résolution des systèmes d'équations simultanées" to the Académie des Sciences in Paris : Three-page note in Comptes Rendus; promised follow-up Mémoire never appeared
    September 1951 : Robbins and Monro publish "A Stochastic Approximation Method" in Annals of Mathematical Statistics 22(3), pp. 400–407 : Root-finding under noise; vanishing-step-size convergence in probability
    August 1960 : Widrow and Hoff present "Adaptive Switching Circuits" at the IRE WESCON convention : Adaline; explicit "method of steepest descent" on a parabolic mean-square-error surface; ONR-funded
    1970 : Linnainmaa submits master's thesis at the University of Helsinki on the cumulative rounding error of an algorithm as a Taylor expansion : First explicit reverse-mode automatic differentiation on arbitrary graphs, with FORTRAN code; never mentions neural networks

Plain-words glossary

Gradient descent — An iterative procedure for minimising a function: at each step, move the current point in the direction opposite to the gradient (the vector of partial derivatives), which is the locally steepest descent direction.
Steepest descent (line-search variant) — The variant of gradient descent that, at each step, searches along the descent direction for the locally best step length. Cauchy 1847 distinguishes this from a fixed-step variant on the same page.
Stochastic approximation — Robbins and Monro’s 1951 framework: drive a parameter toward the root of an unknown monotone function using only noisy observations of the function at chosen levels, with a vanishing step size that dampens noise without freezing the procedure.
LMS rule (Least Mean Square) — Widrow and Hoff’s 1960 learning rule: adjust the Adaline’s gains in proportion to the gradient of the mean-square error before the final threshold. The first published learning rule for an analog hardware classifier framed explicitly as gradient descent.
Adaline (adaptive linear neuron) — Widrow and Hoff’s lunch-pail-sized analog device. Binary ±1 inputs, linear combination through adjustable gains stored in MAD magnetic-core memory, hard threshold output. Sibling architecture to Rosenblatt’s perceptron (Ch14), distinguished by the continuous-error LMS rule.
Reverse-mode automatic differentiation — Linnainmaa’s 1970 contribution: a procedure that computes the partial derivatives of a final output with respect to every intermediate variable in a computational graph at essentially the same cost as the forward computation. The machinery later instantiated by neural-network backpropagation (Ch24).
Computational graph — The dependency structure of an algorithm, drawn as nodes (intermediate variables) and edges (the elementary operations producing them). Linnainmaa’s reverse-mode procedure walks this graph backward to accumulate derivatives.

The math, on demand

The four nodes can be stated compactly. None of these formulae appears in only one of the chapter’s sources; together they show what each scene contributed.

Cauchy’s 1847 step. For $u = f(x, y, z, \dots)$ with partial derivatives $X, Y, Z, \dots$ at the current point and a small positive $\theta$ , Cauchy proposed increments $\alpha = -\theta X,\ \beta = -\theta Y,\ \gamma = -\theta Z,\ \dots$ — the first published descent rule.
Cauchy’s least-squares extension. A system of equations $u = 0,\ v = 0,\ w = 0,\ \dots$ becomes the single non-negative target $u^2 + v^2 + w^2 + \dots$ , attacked by the same iteration.
Robbins–Monro step-size conditions. Later expositions write the convergence-in-probability requirement as $\sum c_n = \infty$ and $\sum c_n^2 < \infty$ — large enough that the procedure does not freeze, small enough that noise washes out.
Widrow–Hoff LMS rule (1960). With $a_1, \dots, a_n$ the adjustable gains, the rule changes each gain in proportion to the gradient of the mean square error with respect to that gain — the parabolic surface in gain-space the chapter pivots on.

The Note Cauchy Never Followed Up

The idea now known as gradient descent did not arrive in the study of learning machines fully formed. It was not the sudden invention of a single mind working on artificial intelligence, nor was it the product of a unified research program aimed at optimizing complex models. Instead, the concept accumulated across a hundred and twenty-three years from four loosely connected, historically distinct pieces of work. It began as a three-page sketch by a French mathematician proposing an iterative technique for calculating the orbits of heavenly bodies. A century later, it was adapted by mathematical statisticians to drive a parameter toward an unknown root under conditions of noise. Before the 1960s had begun, the technique was wired into an analog hardware classifier that explicitly named the method as it searched a parabolic error surface. Finally, at the dawn of the 1970s, the chain-rule machinery required to compute the gradient through arbitrary computational graphs was codified in a master’s thesis about rounding errors. None of these researchers believed they were inventing neural network training. The engine of modern machine learning is an accumulation of historical accidents, bridging deterministic astronomy, applied statistics, electrical engineering, and numerical analysis.

On October 18, 1847, Augustin-Louis Cauchy presented a brief note to the Académie des Sciences in Paris. Published in the Comptes Rendus de l’Académie des Sciences under the title “Méthode générale pour la résolution des systèmes d’équations simultanées,” the three-page communication was motivated not by abstract optimization, but by the practical demands of contemporary astronomy. According to the historical documentation of optimization researcher Claude Lemaréchal, Cauchy noted that astronomical calculations of the period were normally very voluminous. Cauchy sought a more direct way to compute the orbit of a heavenly body. Rather than reducing the complex differential equations that governed celestial motion, he proposed treating the six elements of the orbit themselves as the unknown variables to be solved for simultaneously.

That setting matters because the first appearance of the method was not a machine procedure. There was no electronic computer, no numerical library, and no automatic differentiator waiting behind the notation. The pressure was human time spent on arithmetic. Cauchy’s note promised a way to replace a bulky calculation with a repeatable paper procedure: choose current values for the orbital elements, compute how the expression changes with respect to each one, and move the current guess in the direction that makes the expression smaller. In a century that still made astronomy through tables, notebooks, and patient calculation, the method’s appeal was not speed in the modern sense. It was order.

To approach this system, Cauchy outlined a mathematical procedure that would systematically adjust an initial guess. For a non-negative continuous function defined as $u = f(x, y, z, \dots)$ with partial derivatives $X, Y, Z, \dots$ evaluated at the current point, Cauchy proposed calculating the next step by defining increments $\alpha = -\theta X, \beta = -\theta Y, \gamma = -\theta Z, \dots$ for some small positive value of $\theta$ . Using a first-order argument, Cauchy demonstrated that stepping in the direction opposite to the partial derivatives would cause the value of the function to decrease.

The notation is compact, but the idea is already recognizably iterative. Cauchy did not ask for the exact solution in one algebraic blow. He asked for a rule that could take an approximate solution and improve it. The derivatives $X, Y, Z$ encode the local tilt of the function at the present point; the negative signs turn that tilt into a descent direction; the scalar $\theta$ controls how far the calculation advances before the derivatives must be recomputed. What later textbooks often present as an algorithmic primitive appears here as a small local argument about decrease. The method works one step at a time, not by solving the original system directly.

It is a common modern simplification to state that Cauchy straightforwardly invented the method of steepest descent in this paper. The text of the 1847 note, however, is more ambivalent. Cauchy actually proposed two distinct variants of the iteration on the same page. The first was an approach where the step size $\theta$ is determined by evaluating the function along the search direction, seeking a decrease in the expression $\Theta = f(x - \theta X, y - \theta Y, z - \theta Z, \dots)$ , a technique that presages what numerical analysts would later call an Armijo-type line search. The second was the steepest-descent variant proper, defined by the univariate stationarity condition that the derivative of $\Theta$ with respect to $\theta$ must equal zero ( $\Theta'_{\theta} = 0$ ).

The distinction is small enough to disappear in a survey sentence and large enough to matter historically. In one variant, the method asks only for a step that improves the value. In the other, it searches along the descent ray for the locally best step length. That is why Cauchy’s note is better read as the published source of a family of gradient methods than as the clean birth certificate of one later textbook algorithm. He had identified the directional logic, but he had not yet separated all of the convergence questions that would occupy optimization theory afterward.

Cauchy also proposed a method for extending this approach to a system of simultaneous equations $u = 0, v = 0, w = 0, \dots$ by minimizing the sum of their squares, $u^2 + v^2 + w^2 + \dots = 0$ . This least-squares extension touched upon older mathematical territory, echoing the earlier dispute between Adrien-Marie Legendre and Carl Friedrich Gauss over the method of least squares, but Cauchy applied his new iterative descent idea to it directly. A system of equations became a single non-negative expression; making that expression smaller became a way to drive all the residuals toward zero together. The move looks natural now because optimization has absorbed so much of applied mathematics. In 1847, it was an economical way to turn a messy simultaneous problem into a repeated local calculation.

Crucially, Cauchy did not provide a rigorous proof that his iterative method would universally converge to a solution. In the text of the note, he explicitly admitted that he was only outlining the principles of the method. He wrote that he intended to return to the subject with more detail in a follow-up Mémoire. According to Lemaréchal’s historical tracing, that promised follow-up paper never appeared. Cauchy seems to have underestimated the mathematical difficulty of proving general convergence for the method, and then moved on to other problems.

Within the context of Cauchy’s vast mathematical output, the 1847 note was a marginal contribution. Cauchy was producing a paper nearly every week, with his central and enduring work focused on analysis, complex functions, mechanics, and other parts of nineteenth-century mathematics. The gradient note was not a masterpiece of his career; it was a pragmatic sketch that he abandoned. Yet, it stands as the published-on-paper origin point of the mathematical technique that would eventually reshape computational learning.

Robbins, Monro, and the Stochastic Bridge

The deterministic descent method Cauchy sketched assumed that the function and its exact partial derivatives could be evaluated perfectly at any point. When learning from empirical data, however, exact gradients are rarely available; the system must instead rely on noisy, imperfect observations. The conceptual bridge between Cauchy’s pure mathematics and the reality of learning from data was built in September 1951, when Herbert Robbins and Sutton Monro published “A Stochastic Approximation Method” in The Annals of Mathematical Statistics.

Robbins and Monro were operating in the environment of mid-century applied mathematical statistics, where experiments often involved physical sampling, clinical trials, quality-control measurements, or response-curve estimation. The problem they framed was not one of multidimensional surface optimization, but of root-finding under noise. They posited a monotone, unknown regression function $M(x)$ . The experimenter’s goal was to find the specific level $x^*$ where the function equals a given constant $\alpha$ . The severe constraint was that the experimenter could not evaluate $M(x)$ directly; they could only take successive noisy observations of the function at chosen levels $x_1, x_2, \dots$ , one at a time.

This is a different picture of computation from Cauchy’s. The central object is no longer a formula whose derivative can be written down at a desk. It is an experiment, and each observation arrives with error. The procedure has to decide not only which direction to move, but how much trust to place in an observation that may be misleading. The recursive form of the answer is the historical bridge: update the current guess, observe again, reduce the step size over time, and let the accumulated evidence overpower the noise.

To solve this, Robbins and Monro proposed a recursive update procedure that would drive the sequence of guesses $x_n$ toward the true root $x^*$ in probability. The critical mechanism that allowed their procedure to converge despite the constant injection of experimental noise was a vanishing step-size sequence. Under the restrictions of their stochastic-approximation setup, the step sizes had to remain large enough for the procedure not to freeze too early, but small enough for the effects of noise to wash out. In later expositions, this balance is usually written as the dual requirement that the sum of the step sizes diverges ( $\sum c_n = \infty$ ) while the sum of their squares converges ( $\sum c_n^2 < \infty$ ).

The 1951 paper does not use the word “gradient,” nor does it mention neural networks, which did not yet exist as a formalized computational paradigm. It is strictly a root-finding paper based on noisy observations. However, by demonstrating that an iterative update rule could provably converge on an unknown target even when every individual step was corrupted by noise, Robbins and Monro provided the exact mathematical framework that the gradient-descent concept later inherited. Modern hindsight routinely applies the label “stochastic gradient descent” to their discovery, but historically, they built the stochastic bridge without knowing what algorithmic traffic would eventually cross it.

Widrow, Hoff, and the Lunch-Pail Adaline

The translation of these abstract mathematical concepts into explicitly named learning algorithms on physical hardware occurred in August 1960. At the IRE WESCON convention, Stanford University electrical engineering professor Bernard Widrow and his doctoral student Marcian E. Hoff presented a paper titled “Adaptive Switching Circuits.” Supported by a contract with the Office of Naval Research at the Stanford Department of Electrical Engineering, their paper introduced a built physical device that merged learning theory with contemporary analog electronics.

Widrow and Hoff’s paper did not appear as a detached mathematical note. It came out of an electrical-engineering program already thinking about adaptive sampled-data systems, filters, and switching circuits. The language of the paper is consequently closer to feedback control than to later machine-learning textbooks. The independent variables are gain settings. The observations arrive as voltages. The desired classification is set by a reference switch. The error is something a person can read from a meter and a circuit can use to change stored analog levels. This is why the paper matters so much for the chapter’s story: it did not merely reinterpret a learning rule after the fact. It built a learning rule into a device and described the adjustment process in the vocabulary of gradients.

They called their adaptive pattern classification machine the “Adaline,” short for adaptive linear neuron. Widrow and Hoff’s paper described the built Adaline device as being about lunch-pail sized. Its architecture was straightforward and openly acknowledged its lineage, citing a theoretical neuron model previously introduced by John von Neumann. The Adaline accepted binary inputs valued at $+1$ or $-1$ . These inputs were fed into a linear combination through a set of adjustable gains, labeled $a_1$ through $a_n$ , alongside a level-control gain $a_0$ whose input was permanently fixed at $+1$ . The neuron output a final decision by passing the weighted sum of these inputs through a hard thresholding step: if the sum was positive, the machine output $+1$ ; otherwise, it output $-1$ .

That last threshold is the reason the Adaline can look, from a distance, like a perceptron. Both systems take signed inputs, combine them with adjustable weights, and produce a binary decision. The difference that matters here is what Widrow and Hoff chose to optimize. The hard output may be only $+1$ or $-1$ , but the internal linear response and the error voltage are continuous quantities. Their learning rule operates on that continuous error before the threshold has reduced the machine’s behavior to a binary answer. The result is a squared-error surface in the adjustable gains, not merely a rule for correcting a mistaken classification.

The training procedure for the lunch-pail Adaline was an operator-in-the-loop analog exercise. An operator would set a crude geometric pattern into the machine using a four-by-four array of toggle switches. They would then flip a reference switch to tell the machine whether the desired output for that specific pattern was $+1$ or $-1$ . A physical meter on the device displayed the neuron’s output, and when the reference switch was engaged, the meter displayed the error voltage. The operator would then wait while gain adjustments were applied to bring the error toward zero. Widrow and Hoff emphasized that this routine was mechanical rather than thoughtful; the next step they wanted was full electronic automation.

The scene is easy to romanticize, but the important detail is more prosaic. Adaline learning was not a symbolic program making inferences. It was a sequence of physical adjustments to gain settings. A pattern appeared on switches, a desired answer was set, an error voltage appeared, and the machine altered stored analog levels by a controlled amount. That loop gave the abstract gradient idea a material form. The descent was not only a line in an equation; it was a meter reading and a change in magnetic storage.

What makes the 1960 WESCON paper a decisive conceptual node in the history of machine learning is the vocabulary Widrow and Hoff used to describe what their hardware was doing. Unlike other cybernetic pioneers of the era, they explicitly mapped their learning rule to numerical optimization. In describing their Least Mean Square (LMS) rule, they identified the feedback error signal as the “gradient of mean square error” with respect to the adjustments.

They did not invent the gradient method, but they were the first to explicitly frame a learning rule for an analog hardware classifier as a gradient method on a continuous loss surface. The paper described common gradient methods as procedures that search a surface for stationary points by changing independent variables in proportion to measured partial derivatives. It also connected those procedures to geometric decay on second-degree surfaces. Widrow and Hoff recognized that the mean-square error surface for their adaptive neuron was precisely such a parabolic, quadratic function of the gain adjustments. Within that analysis, the descent rule had a calculable terminus rather than a merely heuristic appeal. Their chosen search method was the “method of steepest descent,” with vector changes made in the “direction of the gradient.”

The WESCON paper quantified the dynamics of this descent. The physical Adaline update rule reduced the error by a fixed fraction on each step. In their four-by-four input case, the per-pattern step shrank the error magnitude by a factor of exactly 1/17. Widrow and Hoff mathematically derived the time constant for this learning process; with a specific proportionality constant, the time constant $\tau$ equaled $n+1$ patterns. For the Adaline’s sixteen input lines plus its level control, the hardware exhibited a measurable time constant of roughly 17 patterns, giving the physical machine predictable, exponential convergence dynamics.

That number is historically useful because it keeps the Adaline from dissolving into abstraction. The learning rule was not only a theorem about least mean squares. It had a pace. A single pattern would not instantaneously set the correct gains, but repeated presentations would drive the corresponding error downward at a predictable rate. For one pattern at a time, Widrow and Hoff could speak in the engineering language of time constants, the same language used to describe filters and feedback systems. The method of steepest descent had become part of an adaptive circuit’s temporal behavior.

The physical storage of these adjustable gains relied on the analog memory technology available to the Stanford laboratory at the time. The Adaline gains were stored in multi-aperture-device (MAD) magnetic-core elements, cited in the paper through H. D. Crane’s work on magnetic logic. These cores mattered because the stored levels were not just binary bits. They could hold multiple levels, be read continuously without destruction, and be nudged upward or downward by small controlled amounts. That combination made them suitable for a machine whose “memory” had to behave like a set of analog coefficients. Widrow and Hoff’s paper envisioned a future of fully automatic microelectronic neurons built with thin ferromagnetic films.

This detail also prevents a common backward projection. The 1960 WESCON paper does not make the famous later memory device the storage element of the original Adaline. In this window, the documented storage technology is MAD magnetic core, with thin-film successors imagined as the next step. The Adaline of this chapter is therefore a magnetic-core analog learning machine, not the later mythology compressed into a single remembered word.

The Stanford Adaline program of 1960 represents a coherent, parallel sibling to Frank Rosenblatt’s Cornell Aeronautical Laboratory perceptron program. Both were electromechanical learning devices funded by the Office of Naval Research during the late 1950s and early 1960s, though through separate institutional contracts. Both treated learning as something that could be embodied in hardware rather than only simulated on a general-purpose computer. But where Rosenblatt approached learning through the lens of neurodynamics and statistical mechanics, Widrow and Hoff framed their machine through electrical engineering and the method of steepest descent.

That difference should not be exaggerated into a rivalry or a clean victory. The perceptron and Adaline were neighboring answers to the same period problem: how to build a machine that changes its own parameters from examples. Rosenblatt’s line produced a convergence theorem under separability assumptions and a public model of trainable pattern recognition. Widrow and Hoff’s line produced an explicit continuous error surface and a hardware rule for descending it. The modern unification arrives only later. In 1960, they were still two local engineering cultures solving overlapping problems in different vocabularies.

Yet, the Adaline was fundamentally a single-layer device. It computed a direct linear combination of its inputs. To apply the method of steepest descent to deeper, multi-stage networks where the inputs are themselves the outputs of previous layers, the algorithm requires a way to efficiently calculate the chain rule of calculus through an arbitrary graph of operations.

Linnainmaa and the Chain Rule on Arbitrary Graphs

The machinery to accomplish this was formalized in 1970 by Seppo Linnainmaa at the University of Helsinki. Linnainmaa’s master’s thesis, written in Finnish under the title “The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors,” had nothing to do with artificial intelligence or learning hardware. It was an investigation into how floating-point rounding errors propagate through complex numerical algorithms.

That origin is important. Linnainmaa was not trying to solve the credit-assignment problem in a multilayer neural network. He was asking how errors introduced locally by finite-precision arithmetic accumulate through an algorithm. To answer that question, the algorithm itself had to be treated as a structured object: a sequence of elementary operations, with intermediate quantities depending on earlier intermediate quantities. Once an algorithm is represented that way, the chain rule becomes a procedure that can move through the dependency structure. The later neural-network interpretation depends on this representational step, but it was not the setting in which the step first appeared.

According to the historical documentation of computer scientist Jürgen Schmidhuber, Linnainmaa’s 1970 thesis provided the first explicit, efficient reverse-mode chain-rule machinery for arbitrary, discrete, possibly sparsely connected computational graphs. Linnainmaa’s work included early FORTRAN code, which matters more than the particular machine on which it ran. The chain rule had been made executable as a program procedure. It could track the derivatives of a final output with respect to every intermediate variable in a complex sequence of operations.

The conceptually decisive property of Linnainmaa’s algorithmic machinery, restated in English in a 1976 paper in the journal BIT Numerical Mathematics, is what is now called the reverse mode of automatic differentiation. In reverse-mode differentiation, the computational costs of the forward pass, which calculates the results of the algorithm, essentially equal the costs of the backward pass, which calculates all the partial derivatives. This is the fact that later made gradients through large layered systems practical. If every parameter required a separate forward calculation, the cost would explode. Reverse mode instead lets one backward sweep carry derivative information through the graph at roughly the same order of work as the original computation.

Schmidhuber’s history places Linnainmaa’s abstraction downstream of a cluster of optimal-control research from the early 1960s. Researchers working within the framework of Euler-Lagrange equations and the Calculus of Variations, including Henry J. Kelley in 1960, Arthur E. Bryson in 1961, and Stuart E. Dreyfus in 1962, had developed methods for optimizing multi-stage dynamic systems. Dreyfus’s 1962 paper, in particular, is presented in that history as a simplified derivation using the chain rule only. The point is not that these optimal-control researchers were secretly writing neural-network algorithms. It is that multistage systems had already forced mathematicians and engineers to think about how derivative information travels backward through a sequence of dependent operations.

Linnainmaa generalized that reverse movement of derivative information into an abstract procedure applicable to arbitrary computational graphs. The shift is subtle but decisive. A flight path, a control policy, and a numerical algorithm can all be drawn as chains of dependent operations. Once the graph itself is the object, the same derivative machinery can operate without caring whether the graph represents a trajectory, a rounding-error calculation, or, later, a layered neural network. Crucially, Linnainmaa did this entirely without reference to neural networks.

The Modern Reinterpretation, and Where Ch15 Stops

The gradient descent concept, as it is understood today, is a synthesis of these isolated discoveries. In modern reading, the perceptron learning rule and the Adaline LMS rule can both be understood as instances of stochastic gradient descent operating on different loss surfaces. The perceptron rule can be read as descent on a piecewise-linear loss function, while the Adaline rule descends a squared-error loss function.

That modern sentence is powerful precisely because it would have sounded unnatural to the actors in the earlier scenes. Cauchy was not writing about data. Robbins and Monro were not writing about neural hardware. Rosenblatt was not describing his perceptron as an optimizer. Widrow and Hoff were not proposing a general theory of multilayer representation learning. Linnainmaa was not solving artificial intelligence. The unity belongs to the later mathematical reconstruction, not to a single historical project unfolding with one destination in mind.

Historically, however, only Widrow and Hoff explicitly recognized and documented their algorithm as a gradient method on a continuous loss surface in 1960. Rosenblatt did not describe the perceptron in the vocabulary of gradient descent or piecewise-linear loss; the unification of these two parallel cybernetic hardware programs into a single mathematical framework is a retrospective, modern reinterpretation.

This distinction changes how the perceptron and Adaline should be compared. The perceptron rule responds to classification error: if the machine misclassifies a pattern, the weights move in a direction that helps that pattern. Under separability conditions, the convergence theorem from Rosenblatt’s line says the procedure will find a separating solution. The LMS rule instead minimizes a continuously measured squared error before the final threshold. Its surface can be analyzed as a parabola in the gains, and Widrow and Hoff could therefore describe the adjustment as steepest descent. Both rules update from examples. Both can be folded into the later family of stochastic gradient methods. But only one of the two was printed in 1960 as a gradient method on a mean-square-error surface.

The history of gradient descent is therefore an accumulation of partial tools. Cauchy gave the descent iteration as a mathematical sketch on paper, tied to the labor of astronomical calculation and left without the promised convergence treatment. Robbins and Monro provided the statistical framework for driving a sequence toward an unknown target when each observation is noisy and the step size must gradually fade. Widrow and Hoff built a physical hardware classifier whose learning rule was explicitly framed as the method of steepest descent on a squared-error surface. Linnainmaa supplied the general reverse-mode automatic differentiation machinery required to pass gradients through arbitrary computational graphs with linear cost.

Each step removed a different obstacle. Cauchy answered the local-direction question: which way should an approximate solution move if the function is differentiable? Robbins and Monro answered the noise question: how can a recursive procedure keep learning when each observation is unreliable? Widrow and Hoff answered the hardware-learning question: how can an adaptive classifier turn error into physical coefficient changes? Linnainmaa answered the graph question: how can all the necessary derivatives be computed efficiently when the computation has many dependent stages? Modern machine learning needs all four answers, but history delivered them in separate disciplines.

None of these four milestones were intended as a comprehensive algorithm for training artificial neural networks. The realization that Linnainmaa’s reverse-mode chain rule could be applied specifically to neural networks to train multiple layers of representations was first published by Paul Werbos in 1981, several years before the technique was widely popularized. That later chain, running from reverse-mode automatic differentiation into neural-network backpropagation, belongs to Chapter 24. Here, the story stops at the hinge: by 1970, the pieces existed, but they had not yet been assembled into the familiar training algorithm for multilayer networks.