Chapter 29: Support Vector Machines

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Vladimir Vapnik	1936–	Central figure in VC theory and support-vector networks; coauthor of Boser/Guyon/Vapnik 1992 and Cortes/Vapnik 1995.
Alexey Chervonenkis	1938–2014	Vapnik’s collaborator on uniform-convergence theory and the VC-dimension framework.
Bernhard Boser	—	Coauthor of the 1992 optimal-margin training algorithm that brought kernels into the maximum-margin setting.
Isabelle Guyon	—	Coauthor of the 1992 algorithm; part of the Bell Labs OCR and SVM ecosystem.
Corinna Cortes	—	Coauthor of the 1995 support-vector networks paper; central to soft-margin SVMs and OCR benchmark experiments.
Christopher Burges	—	Author of the 1998 tutorial that transmitted SVMs to the broader pattern-recognition community.

Timeline (1964–1998)

timeline
    title Support Vector Machines, 1964–1998
    1964 : Aizerman, Braverman, and Rozonoer publish potential-function work — later cited as kernel lineage in Cortes and Vapnik
    1971 : Vapnik and Chervonenkis publish uniform-convergence theory : foundation for capacity and generalization vocabulary
    1982 : Vapnik's Estimation of Dependences consolidates optimal-hyperplane foundations
    1992 : Boser, Guyon, and Vapnik publish optimal-margin classifier training at COLT : kernels and maximum margins combined
    1995 : Cortes and Vapnik publish support-vector networks : soft margins for non-separable data; USPS and NIST OCR benchmarks
    1998 : Burges publishes widely used SVM tutorial in Data Mining and Knowledge Discovery

Plain-words glossary

Support vector — A training example that lies on or nearest to the decision boundary; the final classifier depends only on these examples, not on the full training set.
Hyperplane — A flat decision surface used to separate classes in a feature space.
Margin — The width of the empty strip between the two classes on either side of a separating hyperplane.
Kernel function — A function that computes an inner product after an implicit feature transformation.
Convex optimization — An optimization problem in which any local solution is also a global solution.

The math, on demand

Maximum-margin objective: Given labeled training points $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ , find a weight vector $w$ and bias $b$ such that $y_i(w \cdot x_i + b) \geq 1$ for all $i$ , while minimizing $\|w\|^2$ . Minimizing $\|w\|^2$ is equivalent to maximizing the margin $2/\|w\|$ between the two class boundaries.
Support vectors: The training points for which $y_i(w \cdot x_i + b) = 1$ exactly — they sit on the margin boundary and are the only examples that determine $w$ and $b$ . All other training points are irrelevant to the final classifier.
Dual formulation: Introducing Lagrange multipliers $\alpha_i \geq 0$ for each constraint transforms the problem into maximizing $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ . The solution satisfies $\alpha_i = 0$ for non-support vectors (Karush-Kuhn-Tucker complementary slackness). Only the inner products $x_i \cdot x_j$ appear.
Kernel trick: Because the dual depends only on inner products, replace $x_i \cdot x_j$ with a kernel $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ for some feature map $\phi$ . The machine behaves as if it operates in the high-dimensional space defined by $\phi$ without ever constructing $\phi(x)$ explicitly. Mercer’s theorem gives the condition under which $K$ is a valid inner product in some feature space.
Soft-margin slack: Introduce slack variables $\xi_i \geq 0$ so that the constraint becomes $y_i(w \cdot x_i + b) \geq 1 - \xi_i$ . The objective becomes $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ , where $C$ trades off margin width against training violations. The problem remains convex quadratic, preserving the global-solution guarantee.
Structural risk minimisation: VC theory bounds test error by training error plus a capacity term depending on the VC dimension $h$ and sample size $n$ : roughly $\sqrt{(h(\log(2n/h)+1) - \log(\delta/4))/n}$ with probability $1 - \delta$ . SVMs control $h$ implicitly by maximising the margin, linking geometry to generalization theory.

Chapter 29: Support Vector Machines

After the second AI winter, the field did not need another promise that a machine could be intelligent someday. It needed methods that could be trusted on data the machine had not already seen.

That shift changed the emotional center of machine learning. The old expert systems had been impressive when their rule bases captured a narrow domain, but their maintenance costs made hand-coded knowledge look fragile. Neural networks had returned with backpropagation and convolutional architectures, but they still carried questions about local minima, architecture choices, training dynamics, and generalization. The next credible method would have to sound different. It would have to say not only “this system learned the training examples,” but “here is why the boundary should generalize.”

Support Vector Machines fit that moment. They did not look like another symbolic knowledge base. They did not ask the reader to trust a hidden layer because it had found useful internal features. They offered a stricter bargain: choose the separating surface with the largest margin, control the capacity of the classifier, use kernels to work with high-dimensional feature spaces without explicitly constructing them, and solve a convex optimization problem instead of wandering through a landscape of local training choices.

That bargain made SVMs one of the defining machine-learning technologies of the 1990s. Their importance was not that they defeated neural networks. LeNet remained real. Backpropagation remained real. Neural networks continued in research and industrial pockets. The point is subtler: SVMs made learning look mathematically governed at a time when funders, reviewers, and engineers were newly sensitive to overclaiming. They turned generalization into an engineering language.

[!note] Pedagogical Insight: Trust Moved to the Boundary SVMs did not promise intelligence by capturing expert rules. They promised a disciplined decision boundary: wide margin, controlled capacity, kernelized representation, and a convex training problem.

The Generalization Problem

The winter had made one lesson hard to ignore: a system that performs on selected examples is not automatically a system that survives contact with the world. Expert systems exposed this through maintenance. Their rules could work inside a bounded configuration problem, then become expensive when products, exceptions, and operating conditions changed. Learning systems exposed the same danger differently. A classifier could fit the data in front of it and still fail on the next batch.

That distinction between fitting and generalizing was not new in the 1990s. Vladimir Vapnik and Alexey Chervonenkis had framed a deep version of the problem in their 1971 work on uniform convergence. Their question was not whether the frequency of one event would converge to its probability. The harder question was whether one sample could support reliable statements over an entire class of events. They introduced machinery for talking about the complexity of such classes, including induced subsamples, the index of a system, and the growth function. The technical details belong in a probability text. The historical point is that statistical learning theory gave machine learning a language for asking when experience with a finite sample could support trust beyond that sample.

That language mattered after the winter because it changed what a learning paper could sound like. The authors of an SVM paper did not have to say, “Our classifier fits the training set; please admire the result.” They could connect the shape of the classifier to a theory of capacity and generalization. Cortes and Vapnik’s 1995 paper explicitly framed support-vector networks around high-dimensional mappings, linear decision surfaces in those spaces, and generalization. Burges’s later tutorial made the same promise teachable for a broader pattern-recognition audience: SVMs were attractive because they combined a geometric classifier with a theory that could discuss capacity, margin, and generalization bounds.

This does not mean SVMs solved overfitting once and for all. That would be too strong. They supplied a set of controls and arguments that made overfitting less mysterious. The machine was not merely drawing any boundary that separated the examples. It was looking for a boundary with a special geometric property. It was also making a bet about the class of functions the method allowed.

The credibility came from that combination. One part was philosophical: learning should be judged by its behavior on unseen examples. One part was mathematical: capacity and margin could be discussed in a disciplined way. One part was practical: the method could be trained and compared on real pattern recognition tasks. SVMs became persuasive because all three parts reinforced one another.

This also explains why SVMs could sound conservative and ambitious at the same time. The conservative move was to narrow the claim. An SVM did not claim to contain common sense, model a mind, or discover a symbolic ontology. It claimed to build a classifier from labeled examples under a specific capacity-control principle. The ambitious move was to make that narrow claim mathematically serious. Once the problem was framed as learning a function from empirical data, the old question “Does the machine understand?” could be replaced by questions that were easier to test: how well does the classifier perform on held-out examples, how complex is the class of decision surfaces, and how is the boundary chosen among many possible separators?

That change in vocabulary was itself a historical event. AI had often gained attention by making broad claims about reasoning, expertise, or intelligence. Statistical learning methods gained trust by making narrower claims and then measuring them. SVMs were exemplary because their narrowness was not a weakness. It was what made the method credible after a period when too many systems had promised more than their infrastructure could support.

That is why the story belongs immediately after the second winter. The winter had punished systems whose claims outran their operational proof. SVMs answered with a narrower but stronger promise. They did not say, “The machine knows.” They said, “The machine has chosen a boundary whose geometry gives us a reason to trust it.”

The Margin as a Contract

The central picture is simple enough to draw on a board. Imagine two kinds of points scattered on a plane. Many lines might separate them. A nervous method might pick any line that puts one class on one side and the other class on the other. A maximum-margin method asks for more. It wants the line that leaves the widest empty strip between the two classes.

That empty strip is the margin. The points that touch or determine the strip are the support vectors. They matter because the final classifier is not shaped equally by every training example. Once the boundary is chosen, many points sit comfortably away from it. The critical examples are the ones near the edge, the ones that determine how wide the margin can be. The classifier depends on those supporting patterns.

Boser, Guyon, and Vapnik’s 1992 paper made this idea operational for optimal margin classifiers. The goal was not just to separate examples, but to choose a classifier whose capacity was tuned by the data and whose representation depended on supporting patterns. Cortes and Vapnik later formalized support-vector networks in a way that made this geometric story central. Burges then turned the idea into a tutorial-level explanation for the pattern recognition community: the support vectors are the training points that define the margin, and the margin is the reason the method has its distinctive generalization story.

The phrase “contract” is useful because the margin imposes a discipline on the learner. The classifier cannot merely snake around the examples to satisfy the training set. In the simplest separable case, it must choose the separating hyperplane that maximizes the distance to the nearest examples. That is a constraint on the kind of solution the algorithm may accept.

The constraint also changes how the method feels compared with some neural network training of the same era. A neural network trained by gradient descent could be powerful, but the training story depended on iterative adjustment and architecture choices. The SVM story was more austere. Define the optimization problem. Find the maximum-margin separator. Let the support vectors determine the decision surface. That austerity was part of the appeal.

It would be wrong, however, to present the margin as magic. A large margin does not guarantee that every future example will be classified correctly. It is a principle for choosing among possible classifiers, not a prophecy. It works inside assumptions about the data, the feature representation, and the allowed function class. But as an engineering rule, it is powerful. When many boundaries fit the examples, prefer the one with more room.

That principle made SVMs teachable. A reader did not have to begin with a full derivation of the dual problem or the Karush-Kuhn-Tucker conditions to grasp the historical point. The method turned a vague desire for generalization into a visible geometry. The support vectors gave the classifier a sparse signature. The margin gave the classifier a reason.

This is one of the reasons SVMs became so attractive in the 1990s. They offered an answer to a practical anxiety: if a learner can fit the training data in many ways, which fit should we believe? The SVM answer was not the only answer, but it was unusually crisp. Believe the boundary that separates with room to spare, and understand its complexity through the examples that press against that room.

The Kernel Move

The margin story becomes more powerful when the data are not cleanly separated by a straight line in the original input space. Real data rarely arrives in a shape that makes the simplest geometry enough. A digit image, for example, is not just a dot on a two-dimensional plane. It is a pattern of pixel values, and the useful distinction between one digit and another may depend on combinations of those values.

One response is to transform the data into a richer feature space. In that space, a linear separator might become useful even if no simple separator worked in the original coordinates. The problem is that explicitly building a very high-dimensional feature space can be expensive or impossible to manage. The kernel move solves this in a beautifully machine-shaped way. Instead of constructing the high-dimensional vectors directly, the algorithm computes inner products through a kernel function.

Boser, Guyon, and Vapnik’s 1992 paper brought this idea into the optimal-margin classifier setting. Their algorithm could work with nonlinear decision surfaces through kernel evaluations while still preserving the maximum-margin training logic. Cortes and Vapnik later connected this to a longer lineage of potential functions and Mercer’s theorem, explicitly citing earlier work by Aizerman, Braverman, and Rozonoer. That lineage matters. Kernels were not invented from nothing in 1992; SVMs made the kernelized maximum-margin classifier central and practical.

This was a major infrastructure idea. A kernel lets the machine behave as if it has entered a high-dimensional feature space without paying the full cost of building that space explicitly. The classifier can draw a complicated boundary in the original input space while remaining, in the transformed space, a linear maximum-margin separator. That is why the method could be both mathematically disciplined and practically flexible.

The digit-recognition setting makes the appeal concrete. Handwritten digits vary in thickness, slant, position, and style. A purely linear boundary in raw pixel space is rarely enough to capture the structure. A polynomial or radial basis function kernel can allow richer surfaces while keeping the training problem inside the SVM framework. Boser and colleagues reported handwritten digit experiments with high-order polynomial and radial basis classifiers. Cortes and Vapnik later reported larger OCR experiments using polynomial support-vector networks.

The kernel move also helps explain why SVMs felt different from both symbolic AI and hand-engineered vision pipelines. The method did not require a human to write a rule for every visual distinction. It also did not necessarily require a deep learned hierarchy of features. It offered a middle path: choose a kernel that defines a space of possible distinctions, then let the optimization find the margin.

That middle path had limits. Kernel choice mattered. Scaling to very large data sets could become difficult. A kernel method could be elegant and still depend on a representation decision made by the researcher. But in the 1990s, before the deep-learning scale of data and compute arrived, this was a strong compromise. SVMs could be flexible without looking uncontrolled.

The most important historical point is the way the kernel trick aligned with the post-winter mood. It did not ask the reader to trust a black box in the same way. It said: here is the space we are implicitly working in; here is the optimization criterion; here are the support vectors; here is why the boundary has the shape it has. That made complexity feel governed rather than merely large.

Soft Margins and Real Data

The cleanest maximum-margin story assumes the data can be separated without errors. Real data is rarely that polite. Labels can be noisy. Classes can overlap. Measurement can be imperfect. A handwritten digit can be ambiguous even to a human. A classifier that insists on perfect separation may end up choosing a brittle boundary just to satisfy a few difficult examples.

Cortes and Vapnik’s 1995 support-vector network paper made the real-data issue central by extending the method to non-separable training data. The soft-margin idea allowed some training errors or margin violations while still penalizing them. The method no longer had to pretend the world was perfectly separable. It could trade off margin size against violations.

That extension is historically important because it made the method less like a mathematical curiosity and more like an engineering tool. A hard-margin classifier is elegant when the assumptions hold. A soft-margin classifier is useful when the data pushes back. The support-vector network could still preserve the discipline of margin maximization, but it could now operate in messier domains.

The optimization story mattered too. SVM training was formulated as a convex quadratic programming problem. In a convex problem, local solutions are global solutions. That does not mean training was free or trivial. Large problems could still be computationally demanding, and Burges’s tutorial discusses implementation and scaling issues. But the qualitative contrast was powerful. The method came with an optimization objective whose shape could be trusted in a way that many nonconvex training stories could not.

This is where SVMs acquired much of their 1990s prestige. They connected several forms of discipline at once. The geometry disciplined the boundary. The kernel disciplined the representation space. The soft-margin formulation disciplined the response to noisy data. Convex optimization disciplined the training process. Statistical learning theory disciplined the claim that success on training examples might generalize.

None of those disciplines made SVMs universal. A method can be mathematically clean and still be the wrong tool for a particular scale, data type, or deployment constraint. Later deep learning would gain from representation learning, larger data sets, GPUs, and software ecosystems that made massive neural models practical. SVMs did not make that future unnecessary.

But the SVM package was exceptionally coherent. It told engineers what problem they were solving and why the solution had a principled form. In a period when AI had recently been punished for brittle promises, that coherence mattered. It made machine learning look less like a bag of tricks and more like a discipline.

The support-vector metaphor also helped. The classifier’s decision surface was not a mysterious average over every observation. It was anchored by the cases near the boundary. Those cases carried the pressure of the problem. They were the examples that could not be ignored. This gave the method a narrative clarity that reviewers and practitioners could understand even when the full optimization derivation was left aside.

That clarity explains why SVMs could travel. They were not merely a paper result. They became a teachable method, a benchmark competitor, and a standard tool in pattern recognition and machine learning. Burges’s 1998 tutorial is evidence of that transition. A method that needs a careful tutorial has already become important enough for a wider community to learn it.

OCR as Proof Terrain

Handwritten digit recognition gave SVMs a concrete stage. It was bounded enough to benchmark, difficult enough to matter, and familiar enough that different methods could be compared. The task also connected directly to the larger story of 1990s machine learning: vision systems needed to move beyond fragile hand engineering, but deep learning had not yet reached its later scale.

Cortes and Vapnik’s 1995 experiments used standard digit-recognition data. For USPS, they report a split of 7,300 training examples and 2,000 test examples, with 16-by-16 images. For NIST, they report 60,000 training examples and 10,000 test examples, with 28-by-28 images. Those numbers matter because they show the benchmark culture SVMs entered. This was not a story about a classifier looking good on a hand-picked toy example. It was a story about measurable performance on shared recognition tasks.

The results made SVMs hard to ignore. Cortes and Vapnik report strong polynomial support-vector network performance and a 1.1 percent test error on the NIST task for a fourth-degree polynomial multiclass setup built from one-vs-rest binary classifiers. They also compare their results with LeNet and other classifiers. The LeNet comparison should be read carefully: in this case, those comparison figures are Cortes and Vapnik’s report, not an independent reanalysis. SVMs entered the same OCR terrain as neural networks and performed credibly enough to become part of the central 1990s machine-learning conversation.

This is where the false “SVMs defeated neural networks” story becomes tempting. The benchmarks were real. The SVM results were impressive. The method’s theory was attractive. It is easy, with hindsight, to turn that into a simple replacement story: neural networks returned in the 1980s, SVMs outclassed them in the 1990s, and deep learning returned later to take revenge. That story is too neat.

LeNet remained an important architecture. Backpropagation remained an important training method. Neural networks continued in document processing and other niches. SVMs became highly credible because they solved a different trust problem: they offered strong benchmark performance while making the training and generalization story feel more controlled.

The OCR setting also shows why SVMs belonged to the infrastructure history of AI. The support-vector papers did not merely propose a classifier. They joined a benchmarking practice: common data sets, reported test errors, comparison tables, and reproducible problem definitions. That practice had become more important after the winter. A system could no longer rely on a beautiful internal story. It had to show performance in a shared arena.

The data sets also made the method’s abstractions less abstract. A margin is a geometric object, but in OCR it had to answer a mundane question: when the next messy handwritten image arrives, will the classifier call it the right digit? A kernel is an implicit feature-space operation, but in OCR it had to justify itself on pixel arrays. A support vector is a mathematical object, but in OCR it became one of the hard cases that shaped the decision surface. The benchmark turned theory into a public comparison.

That public comparison was valuable precisely because the task was limited. Digit recognition was not general intelligence. It was not language understanding, planning, or robotics. But it was hard enough to expose weak classifiers and standardized enough to compare strong ones. That combination made it a useful proving ground for the kind of machine learning that would dominate the post-winter period: bounded tasks, shared data, explicit metrics, and methods that could explain why performance on the test set was not merely luck.

SVMs were well suited to that arena. They could handle high-dimensional image vectors through kernels. They could use soft margins for imperfect data. They could produce sparse support-vector representations. They could be explained as convex optimization rather than as a collection of heuristic adjustments. For OCR, that package was compelling.

The support-vector count itself was part of the message. Cortes and Vapnik emphasized that complexity depended on the number of support vectors rather than directly on the dimensionality of the feature space. This is a subtle but important point. The classifier could behave as if it lived in a rich feature space while its final representation was tied to a subset of training examples. That made high-dimensional classification feel less reckless.

By the late 1990s, SVMs had therefore become more than a clever algorithm. They were a standard answer to a question the field was newly asking: how do we build learning systems whose success can be measured, compared, and explained without pretending that all knowledge has been written down by hand?

What SVMs Changed

SVMs did not end the 1990s with a permanent victory. No method did. Conversely, neural networks did not simply lose the decade. Both lineages continued, and both would shape later AI. The difference is that SVMs gave the 1990s a particularly persuasive form of statistical discipline.

They changed the terms of trust. Expert systems had asked organizations to trust encoded expertise. Neural networks asked researchers to trust trained representations and iterative weight adjustment. SVMs asked the community to trust a boundary chosen by margin, capacity control, kernelized representation, and convex optimization. That was a powerful rhetorical and engineering package.

The method also helped make machine learning look like an autonomous field rather than a rescue project for older AI ambitions. The center of gravity moved toward data, benchmarks, optimization, and generalization. The important question was no longer whether a machine had captured an expert’s rules. It was whether a learner could use examples to make reliable distinctions on new cases.

That shift connects Ch29 to the surrounding chapters. Chapter 28 showed the maintenance and economic pressure that made hand-coded knowledge less glamorous. SVMs answered with a statistical method whose knowledge came from examples and whose credibility came from margins and benchmarks. The next chapters will show other answers to the same post-winter demand: empirical speech recognition, reinforcement learning, web-scale data, multicore and distributed compute, human annotation markets, and eventually the ImageNet-era return of large neural models.

SVMs sit in the middle of that transition. They are not a detour from deep learning history. They are part of the reason later AI had to speak the language of evaluation, generalization, and data. When deep learning returned at scale, it did not return to the older culture of demonstration alone. It had to win benchmarks, exploit infrastructure, and show empirical gains. SVMs had helped make that the expected language.

The historical lesson is therefore not that the maximum-margin classifier was the final architecture of intelligence. It was not. The lesson is that after a period of overbuilt promises, a method can become powerful by narrowing its claim. SVMs did not claim to contain human expertise. They did not claim to model the brain. They claimed to choose a disciplined boundary from data.

For the 1990s, that was enough to matter deeply. The field was learning how to trust learning itself.

Sources

Primary and Near-Primary

Vladimir N. Vapnik and Alexey Ya. Chervonenkis, “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities”, Theory of Probability and Its Applications 16(2), 264-280 (1971): conceptual anchor for uniform convergence over a class of events, growth functions, and sample-size framing. Used for conceptual historical prose, not theorem derivation.
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers”, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144-152 (1992): optimal-margin classifier training, kernels, support patterns, quadratic optimization, and handwritten digit experiments.
Corinna Cortes and Vladimir Vapnik, “Support-Vector Networks”, Machine Learning 20, 273-297 (1995): support-vector networks, soft margins, kernels, USPS/NIST OCR benchmarks, LeNet comparison as reported by the authors, and the support-vector complexity claim.
Christopher J. C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery 2, 121-167 (1998): tutorial anchor for margins, support vectors, convex quadratic programming, kernels, Mercer conditions, implementation concerns, VC dimension, and generalization.