Chapter 47: The Depths of Vision

Cast of characters

Name	Lifespan	Role
Kaiming He	—	First author of the ResNet paper at Microsoft Research
Xiangyu Zhang	—	Co-author of the ResNet paper at Microsoft Research
Shaoqing Ren	—	Co-author of the ResNet paper at Microsoft Research
Jian Sun	—	Co-author of the ResNet paper at Microsoft Research
Alex Krizhevsky	—	First author of the 2012 AlexNet paper; anchors the chapter’s starting point
Karen Simonyan	—	Co-author of the VGG paper that pushed ImageNet CNNs to 16–19 weight layers

Timeline (2012–2016)

timeline
    title The Depths of Vision
    2012 : Krizhevsky, Sutskever, and Hinton publish AlexNet — five convolutional and three fully connected layers, two GTX 580 GPUs, five to six days of training
    September 2014 : Simonyan and Zisserman release the VGG paper — ImageNet depth pushed to 16–19 weight layers with 3x3 filters
    2014 : GoogLeNet and VGG make "very deep" ImageNet models the state of the art, with leading architectures at roughly 16 to 30 layers
    February–July 2015 : Batch Normalization appears at ICML 2015, addressing gradient instability and enabling networks with tens of layers to begin converging
    December 10 2015 : He, Zhang, Ren, and Sun post ResNet to arXiv — residual learning makes very deep ImageNet models trainable
    ILSVRC / COCO 2015 : ResNet team wins first place in ImageNet classification, detection, localization, COCO detection, and COCO segmentation
    June 2016 : ResNet paper published at CVPR 2016, pp. 770–778

Plain-words glossary

Optimization degradation — The phenomenon where adding more layers to a neural network makes training harder or causes training error to get worse.

Residual mapping — A formulation where a stack of layers learns a correction to its input rather than the whole desired transformation from scratch.

Shortcut connection — A path in the neural network’s computational graph that carries activations around learned layers so they can be combined later.

Bottleneck block — A residual block design that reduces arithmetic cost while preserving enough depth for very large networks.

FLOP (floating-point operation) — A single arithmetic step a processor performs. FLOP counts estimate how much computation a model requires per forward pass.

Projection shortcut — A shortcut connection that applies a learned linear transformation to the input before addition, used when the input and output of a residual block have different dimensions. The ResNet paper showed identity shortcuts were sufficient to resolve degradation; projection provides only marginal gains.

ILSVRC (ImageNet Large Scale Visual Recognition Challenge) — The annual competition on the ImageNet dataset that benchmarked visual recognition systems from 2010 through 2017. Winning results from this challenge drove major architectural advances in convolutional neural networks.

In 2012, visual recognition was transformed by the realization that scale, both in model capacity and computational power, could unlock performance previously thought unattainable. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton introduced an architecture for the ImageNet Large Scale Visual Recognition Challenge that firmly established deep convolutional neural networks as the dominant approach. Their model, often referred to as AlexNet, was large for its era. It consisted of five convolutional layers followed by three fully connected layers, containing roughly sixty million parameters and 650,000 neurons. Training such a network required specialized infrastructure. The team utilized two NVIDIA GTX 580 GPUs, each with three gigabytes of memory, and ran the training process for five to six days. The resulting performance on ImageNet demonstrated that a deep architecture, given enough data and parallel compute, could achieve breakthrough accuracy.

The lesson was not simply that neural networks had become fashionable again. AlexNet made the machinery of scale visible. A contest built around more than a million natural images had become a proving ground for learned feature hierarchies, and the winning system depended on both algorithmic decisions and the practical ability to keep a large convolutional model moving through data. Its split across two GPUs was not an incidental footnote; it showed that the architecture was already pressing against memory and throughput limits. The five convolutional layers extracted increasingly abstract visual patterns, while the fully connected layers turned those patterns into ImageNet class decisions. The system’s success made a once-risky proposition feel empirical: if the data and compute were available, deeper learned vision systems could outperform hand-engineered recognition pipelines.

Following this success, the field naturally asked how far the principle of scale could be pushed. If an eight-layer network produced a massive leap in capability, perhaps a deeper network could do even better. This intuition drove subsequent research, most notably by Karen Simonyan and Andrew Zisserman at the University of Oxford. Their work on Very Deep Convolutional Networks, known as VGG, pushed the depth of ImageNet models to between sixteen and nineteen weight layers. To manage the complexity of deeper networks, the VGG architecture standardized the use of small, three-by-three convolutional filters, demonstrating that stacking more layers of simple filters improved ImageNet accuracy compared to fewer layers of larger filters.

VGG’s contribution was especially important because it made depth itself look like the controlled variable. Rather than mixing many architectural changes at once, the Oxford models used a repeated local pattern: small three-by-three filters stacked in sequence. A pair of three-by-three convolutions could cover the effective receptive field of a larger filter while inserting additional nonlinearities between the input and output of the block. Three such convolutions extended that receptive field further while preserving the simple design rule. The result was not a mysterious architecture but a disciplined escalation. The network grew deeper by repeating a compact operation, and the reported ImageNet results improved as the configurations moved from shallower variants toward sixteen and nineteen learned layers.
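The receptive-field arithmetic behind that design rule can be checked in a few lines. The sketch below assumes stride-1 convolutions and a hypothetical width of 64 channels in and out of every layer; both are illustrative choices, not figures from the text above. The weight comparison is an extra observation under the same assumptions: the stacked small filters also use fewer weights than the single large filter they replace.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field (one side, in pixels) of stride-1 convolutions stacked in sequence."""
    # The first layer sees kernel_size pixels; each further layer adds kernel_size - 1.
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weight count, ignoring biases, when every layer maps `channels` to `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


CHANNELS = 64  # hypothetical width, chosen only for illustration

for depth, big_kernel in [(2, 5), (3, 7)]:
    field = stacked_receptive_field(3, depth)
    stacked = conv_weights(3, CHANNELS, depth)
    single = conv_weights(big_kernel, CHANNELS)
    print(f"{depth} stacked 3x3 layers: field {field}x{field}, "
          f"{stacked} weights vs {single} for one {big_kernel}x{big_kernel} layer")
```

Two stacked three-by-three layers match a five-by-five field and three match a seven-by-seven field, while each stack also passes through one extra nonlinearity per added layer.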

However, the cost of this depth was steep. The infrastructure required to train VGG-level models reflected a significant escalation in computational demands. The Oxford team trained their networks using a multi-GPU implementation derived from the publicly available Caffe toolbox, which they heavily modified. Operating on a system equipped with four NVIDIA Titan Black GPUs, training a single network took two to three weeks depending on the specific architectural configuration.

That training bill mattered historically. VGG did not say that depth was cheap. It said that, when the cost could be paid, a plain and carefully regularized stack of small convolutions could reach state-of-the-art accuracy. The Caffe-derived implementation, the four Titan Black GPUs, and the two-to-three-week training runs made the bargain concrete. Depth had become a route to better recognition, but it was a route paved with long experiments, scarce hardware, and architectural choices that tried to keep the computation tractable.

By late 2014, the consensus was clear: depth was the road to higher accuracy. Leading ImageNet results from models like VGG and GoogLeNet relied on “very deep” architectures ranging from roughly sixteen to thirty layers. The evidence made depth look highly productive, cementing the assumption that increasing the number of layers was a reliable mechanism for improving visual recognition. The narrative of progress became intertwined with the sheer depth of the networks. If adding layers consistently yielded better representations and lower error rates, the obvious path forward was simply to keep building deeper. This raised an immediate, practical question: what happens when a network scales not just to twenty layers, but to fifty, or even over a hundred? What limits the field from indefinitely stacking more layers to improve recognition?

The question was practical before it was philosophical. ImageNet had rewarded systems that combined representation learning with enough compute to train them, and the leading architectures had made depth a visible marker of ambition. But a training run that takes weeks already narrows the range of experiments a group can afford. A model that is deeper only on paper, but too expensive or too fragile to train reliably, does not advance the leaderboard. The next step in vision therefore required more than confidence in scale. It required a way to make additional layers behave productively inside the optimizer.

The pursuit of extreme depth quickly encountered a wall, but it was not the wall the field had historically expected. For years, the primary obstacle to training deep neural networks had been the problem of vanishing or exploding gradients. As errors were propagated backward through many layers during training, the gradient signals tended to either shrink to zero, halting learning, or grow exponentially, causing instability. By 2015, however, this specific pathology had been largely addressed. The introduction of normalized initialization techniques and intermediate normalization layers, most notably batch normalization, ensured that forward signals remained nonzero and backward gradients maintained healthy norms. These techniques enabled networks with tens of layers to start converging.

The distinction is crucial. If a very deep network simply failed to begin learning, the diagnosis would have been familiar: unstable signal propagation through a long chain of transformations. The ResNet experiments were more unsettling because the networks did begin learning. The authors explicitly separated this case from the older vanishing-gradient story. They were not looking at a dead system whose gradients had disappeared before useful training could start. They were looking at a system that entered the training process, moved downhill, and still ended up worse as layers were added.

Yet when Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun at Microsoft Research began pushing ImageNet architectures to unprecedented depths, they observed a strange and counter-intuitive phenomenon. The models could begin converging, showing that the gradient signals were not simply disappearing, but as depth increased, accuracy saturated and then rapidly degraded. Crucially, this degradation was not caused by overfitting. In a typical overfitting scenario, a highly complex model achieves very low training error but performs poorly on unseen test data. The Microsoft Research team found the opposite: adding more layers to a suitably deep model led to higher training error.

Their experiments laid the problem bare. When evaluating plain convolutional networks on ImageNet, a thirty-four-layer network exhibited higher training error than an eighteen-layer network of the same design. It also suffered from higher validation error. The comparison mattered because the two models belonged to the same architectural family. The deeper network was not being penalized for an exotic new component or a radically different training setup; it was a straightforward extension of the shallower plain model. The deeper model was fundamentally worse at fitting the training data, despite having far more parameters and capacity. The authors argued that this difficulty was unlikely to be caused by vanishing gradients: the plain networks were trained with batch normalization, which kept the forward signals at nonzero variance, and the authors verified that the backward gradients maintained healthy norms. The networks were optimizing, but they were optimizing poorly.

The CIFAR example in the paper made the same phenomenon even easier to see. In a smaller-image setting, the authors compared plain networks at different depths and found the same pattern: increasing depth could make the training error worse, not merely the test error. Figure 1 in the paper placed the contradiction in a simple curve. The deeper plain network did not just generalize less well; it fitted the training set less well. That observation removed the most convenient explanation. A model with more layers might overfit because it has too much capacity for the data, but overfitting would normally drive training error down. Here, capacity was not being converted into a better fit.

This optimization degradation presented a profound theoretical contradiction. The authors framed the problem by construction: consider a shallower architecture and its deeper counterpart, where the deeper model is created by taking the shallower model and simply adding more layers. In principle, there exists a constructed solution where the deeper model perfectly copies the shallower model. To achieve this, the added layers would merely need to act as identity mappings, passing their inputs directly to their outputs without modification. If the solver could find this solution, the deeper model would have exactly the same training error as the shallower one. It should never perform worse.

The argument did not require the added layers to discover any new visual representation. It asked only that they do nothing harmful. In a linear algebra sense, the identity function is the plainest possible transformation: output the input unchanged. But a stack of convolutions, nonlinearities, and normalization operations does not become an identity map merely because such a map would be useful. The weights still have to land in a configuration that makes the composed function behave that way across the relevant data distribution. The degradation result showed that ordinary stochastic gradient descent, applied to the plain architecture, could fail even at this minimal assignment.

The fact that the thirty-four-layer network had higher training error than the eighteen-layer network meant that the stochastic gradient descent solver was failing to find this identity-mapping solution, let alone a better one. The optimization landscape for the deeper plain network was somehow so difficult that the solver could not even match the performance of the shallower baseline. The wall inside the training curves was not a lack of capacity, nor was it a gradient pathology; it was an optimization degradation that prevented deep networks from utilizing their theoretical power. Solving this degradation would require rethinking how layers within a deep network were expected to behave.

That framing also changed the emotional shape of the problem. The disappointment was not that an ambitious model had failed because the task was too hard. The disappointment was that a larger model could not reliably reproduce the behavior of a smaller one. The deeper network had an obvious escape route in principle: preserve the useful transformations already available to the shallower network and let the extra layers collapse to identities. Its failure made the identity function, usually treated as trivial, into the center of the story. The next architecture would succeed by making that trivial path explicit.

The solution proposed by He, Zhang, Ren, and Sun was a mathematical detour that fundamentally changed the task assigned to the stacked layers. Instead of hoping the solver would force stacked nonlinear layers to learn an identity mapping when necessary, they altered the network’s topology so that identity became the default state.

In a standard plain network, a sequence of stacked layers is expected to learn some desired underlying mapping, which can be denoted as H(x), where x is the input to those layers. If the optimal mapping is an identity function, the layers must adjust their weights to explicitly construct that identity, a task that the degradation problem showed was empirically difficult for deep networks.

The Microsoft Research team reformulated the problem through residual learning. They proposed that instead of asking the stacked layers to directly fit the target mapping H(x), they should let the layers fit a residual mapping, defined as F(x) := H(x) - x. The original desired mapping is then recast into F(x) + x. This operation is implemented in a neural network by creating a shortcut connection that skips one or more layers. The input x is carried around a stack of two or three learned layers and added to their output.

This small architectural rerouting changed the optimization dynamics. If the optimal mapping is closer to an identity function than to a zero mapping, it should be easier for the solver to drive the residual function F(x) toward zero than to fit an identity mapping from a stack of nonlinear transformations. In the residual block, “doing nothing” no longer has to be synthesized by the learned layers. The shortcut already carries x forward. The learned branch can then specialize in the difference between the input and the desired output of the block. By providing an identity shortcut, the network is free to use the stacked layers merely to learn a small correction, the residual, around that identity.

The residual block was formalized elegantly as y = F(x, {Wi}) + x, where F(x, {Wi}) represents the residual mapping to be learned by the weight layers. This design, relying on identity shortcuts, added neither extra parameters nor computational complexity to the model. The entire network could still be trained end-to-end by stochastic gradient descent with backpropagation. Furthermore, because the operation was simple addition, it could be implemented using common, existing libraries like the Caffe toolbox without requiring any modifications to the core solver.
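A toy version of y = F(x, {Wi}) + x shows why the identity becomes the default state. The sketch below is a minimal illustration, not the paper's implementation: small matrix-vector products stand in for convolutions, batch normalization and the activation applied after the addition are left out, and all names are hypothetical. With every weight in the branch set to zero, the residual block returns its input unchanged, whereas the same layers without the shortcut return zeros.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; a stand-in for a convolution in this toy example."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_branch(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """F(x, {W1, W2}) = W2 * relu(W1 * x): the learned two-layer branch."""
    return linear(w2, relu(linear(w1, x)))


def plain_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """The same two layers with no shortcut: the output is F(x) alone."""
    return residual_branch(x, w1, w2)


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = F(x, {Wi}) + x, using a parameter-free identity shortcut."""
    fx = residual_branch(x, w1, w2)
    return [f + a for f, a in zip(fx, x)]


x = [0.5, -1.25, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_block(x, zeros, zeros))  # the input passes through unchanged
print(plain_block(x, zeros, zeros))     # the signal is lost
```

The shortcut contributes no weights of its own: the only trainable quantities are the two matrices in the branch, in both the plain and the residual version.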

The simplicity was part of the force of the idea. The shortcut did not introduce a gate, a learned controller, or a second optimization problem. It was a branch in the computational graph, followed by an elementwise addition. If the dimensions matched, the branch could remain the identity function all the way through. The residual branch contained the trainable convolutions; the shortcut carried the reference signal. The network therefore kept the familiar machinery of convolution, normalization, rectified nonlinearities, backpropagation, and SGD, while changing the target that each local stack was asked to approximate.

In cases where the input and output dimensions of the residual block did not match, for instance when the spatial dimensions of the feature maps were reduced, the authors introduced projection shortcuts. Formalized as y = F(x, {Wi}) + Ws x, these shortcuts applied a linear projection to align the dimensions before the addition. However, the paper’s comparison between identity and projection shortcuts revealed that projection was not essential for addressing the degradation problem. While projection shortcuts provided slight improvements, identity shortcuts were sufficient to resolve the training difficulties, leading the authors to favor identity shortcuts as the default, economical choice.

The experiments on shortcut types kept the result precise. Projection shortcuts could be used for all shortcuts, or only where dimensions changed, or the model could rely mainly on identity shortcuts and zero-padding when dimensions increased. The paper reported small gains from projection in some configurations, but the essential reversal of the degradation problem did not depend on turning every shortcut into a learned projection. That mattered for both interpretation and engineering. The core mechanism was not a large new parameterized subnetwork hidden in the skip path. It was the availability of a direct identity route through the block.
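The two ways of handling a dimension change can be compared on a toy example. The sketch below is an illustration under stated assumptions: the vectors are tiny, the branch output is a made-up stand-in for F(x), and the projection matrix `w_s` is hypothetical. It shows that a projection shortcut, y = F(x) + Ws x, needs learned weights, while zero-padding the identity keeps the shortcut parameter-free.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, v: Vector) -> Vector:
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def add(a: Vector, b: Vector) -> Vector:
    if len(a) != len(b):
        raise ValueError("shortcut and branch outputs must have the same length")
    return [p + q for p, q in zip(a, b)]


def projection_shortcut(x: Vector, w_s: Matrix) -> Vector:
    """Ws x: a learned linear map that changes the dimension of x."""
    return matvec(w_s, x)


def zero_pad_shortcut(x: Vector, out_dim: int) -> Vector:
    """Identity on the existing entries, zeros in the added ones; no weights."""
    return list(x) + [0.0] * (out_dim - len(x))


x = [1.0, 2.0]                              # 2-dimensional block input
branch_output = [0.5, 0.25, 0.125]          # hypothetical F(x), 3-dimensional
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3x2 projection

y_projection = add(branch_output, projection_shortcut(x, w_s))
y_zero_pad = add(branch_output, zero_pad_shortcut(x, len(branch_output)))
projection_params = len(w_s) * len(w_s[0])

print(y_projection, y_zero_pad, projection_params)
```

Adding `x` to `branch_output` directly would raise an error here, which is the mismatch that either shortcut variant exists to resolve.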

The impact of this reformulation was immediate and stark. When identity shortcuts were added to the previously failing thirty-four-layer network, the situation reversed entirely. The thirty-four-layer residual network outperformed the eighteen-layer residual network, demonstrating substantially lower training error. The degradation problem had been bypassed. The identity detour proved that depth was not inherently flawed; it required a path that allowed the optimizer to succeed. Similar shortcut ideas, such as Highway Networks, were being explored concurrently in the field, but they are best treated here as related work rather than as a documented source for ResNet. The ResNet paper’s parameter-free identity shortcut provided the mechanism that scaled ImageNet convolutional neural networks to unprecedented depth.

This is why the residual block became more than a convenient diagram. It converted a negative experimental result into a reusable architectural rule. If a deep stack has trouble learning a useful transformation directly, give the input an unimpeded route around that stack and let the stack learn the residual correction. The rule is local, but the effect accumulates across the entire network. Every few layers, the model receives another chance to preserve what has already been computed while adding a learned adjustment. In a very deep architecture, that repeated opportunity changes the character of optimization.

Proving that a thirty-four-layer network could optimize properly was a significant conceptual breakthrough, but scaling the architecture to dominate ImageNet required meticulous engineering under strict compute constraints. The infrastructure story of ResNet is not just about drawing a new diagram; it is about managing floating-point operations and dataset scale to build models of unprecedented depth.

The ImageNet experiments were conducted at a massive scale, using 1.28 million training images. The models were evaluated on a validation set of 50,000 images, and the final results were reported from the test server on 100,000 test images. To train effectively at this scale, the team employed a rigorous optimization recipe. The implementation used scale augmentation and 224x224 crops. Batch normalization was applied immediately after each convolution and before the activation function. The network weights were initialized using the method that He and his colleagues had earlier introduced for rectifier networks, and training proceeded with stochastic gradient descent using a mini-batch size of 256, a weight decay of 0.0001, and a momentum of 0.9. The architecture intentionally omitted dropout layers, leaving weight decay and batch normalization to do the regularizing work.
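The optimizer settings in this recipe can be written down as a small update rule. The sketch below is one common convention for SGD with momentum and weight decay, not a transcription of the authors' solver: the batch size, momentum, and weight decay come from the text above, while the learning rate is an assumption added so the example can run, and the names `SGDConfig` and `sgd_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SGDConfig:
    batch_size: int = 256        # mini-batch size from the recipe above
    momentum: float = 0.9        # from the recipe above
    weight_decay: float = 1e-4   # from the recipe above
    learning_rate: float = 0.1   # assumption: not stated in the text above


def sgd_step(weights: List[float], grads: List[float], velocity: List[float],
             cfg: SGDConfig) -> Tuple[List[float], List[float]]:
    """One SGD update; `grads` is assumed to be averaged over the mini-batch."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + cfg.weight_decay * w       # weight decay as an L2 gradient term
        v_next = cfg.momentum * v + g_total      # momentum accumulates past gradients
        new_velocity.append(v_next)
        new_weights.append(w - cfg.learning_rate * v_next)
    return new_weights, new_velocity


cfg = SGDConfig()
weights, velocity = sgd_step([1.0], [0.5], [0.0], cfg)
print(weights, velocity)
```

Other libraries fold the learning rate into the velocity instead; the two conventions behave the same way while the learning rate is held constant.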

This recipe also explains why the degradation result was so compelling. The networks were not being trained with an obviously obsolete setup. They used the initialization and normalization practices that the paper itself credited with making networks with tens of layers trainable. Batch normalization after each convolution stabilized intermediate activations; the initialization scheme matched rectified networks; mini-batch SGD, momentum, and weight decay were conventional enough to make comparisons meaningful. The residual architecture was therefore tested inside a serious ImageNet training pipeline rather than as an isolated diagram.

The paper is specific about these algorithmic and computational choices, but it does not give the same kind of hardware vignette that AlexNet and VGG provide. For ResNet’s ImageNet runs, the paper specifies mini-batch size, dataset scale, augmentation, FLOP counts, and model configurations, not an exact GPU model, GPU count, or wall-clock training duration. That absence is important. The infrastructure story here is not a scene of a named machine running for a named number of days. It is a story told through the quantities the paper does report: image counts, crop sizes, normalization placement, optimization hyperparameters, and the arithmetic cost of different architectures.

As the team pushed toward fifty, one hundred, and eventually one hundred and fifty-two layers, computational efficiency became the binding constraint. The thirty-four-layer plain baseline required 3.6 billion floating-point operations (FLOPs). While substantial, this was only about 18 percent of the 19.6 billion FLOPs required by VGG-19. That comparison showed that a deeper-looking network could still be cheaper than an earlier very deep model if its layers were designed economically. However, simply extending the basic residual blocks to 152 layers would have caused the training time to become prohibitive.

To afford these extreme depths, the authors redesigned the residual block for networks of fifty layers and deeper, introducing a bottleneck architecture. Instead of a stack of two three-by-three convolutions, the bottleneck block utilized three layers: a one-by-one, a three-by-three, and another one-by-one convolution. The first one-by-one layer was responsible for reducing the dimensions, effectively creating a “bottleneck” before the three-by-three convolution, while the final one-by-one layer restored the dimensions.

This bottleneck design drastically reduced the computational cost, allowing the network to scale in depth without a corresponding explosion in FLOPs. It also reinforced the reliance on parameter-free identity shortcuts. The authors noted that if the identity shortcuts in the bottleneck architectures were replaced with projection shortcuts, the time complexity and model size would double, as the shortcut would connect high-dimensional spaces. By strictly maintaining identity shortcuts, the bottleneck design remained economical.
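The economy of the bottleneck, and the cost of a projection shortcut across it, can be estimated by counting convolution weights. The sketch below assumes illustrative channel widths, 64 for the narrow layers and 256 for the wide ends, which are not given in the text above. It ignores biases and batch normalization parameters, and it uses weight counts as a proxy for arithmetic cost at a fixed feature-map size.

```python
def conv_weights(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in one convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


NARROW, WIDE = 64, 256  # assumed channel widths, for illustration only

# Basic residual block: two 3x3 convolutions at the narrow width.
basic_block = 2 * conv_weights(3, NARROW, NARROW)

# Bottleneck block: 1x1 reduce, 3x3 at the narrow width, 1x1 restore.
bottleneck_block = (
    conv_weights(1, WIDE, NARROW)
    + conv_weights(3, NARROW, NARROW)
    + conv_weights(1, NARROW, WIDE)
)

# A projection shortcut would be a 1x1 convolution joining the two wide ends.
projection_shortcut = conv_weights(1, WIDE, WIDE)
growth = (bottleneck_block + projection_shortcut) / bottleneck_block

print(f"basic block:      {basic_block} weights")
print(f"bottleneck block: {bottleneck_block} weights")
print(f"with projection:  {bottleneck_block + projection_shortcut} weights "
      f"({growth:.2f}x the identity version)")
```

Under these assumed widths the three-layer bottleneck costs about as much as the two-layer basic block, and a projection across the wide ends adds nearly as many weights as the whole bottleneck, which is consistent with the doubling the authors describe.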

The bottleneck also made the shortcut decision more consequential. In the shallower residual blocks, a projection shortcut was an additional cost; in the deeper bottleneck blocks, replacing identity shortcuts with projections across all blocks would have doubled both time complexity and model size. The architecture therefore paired two forms of discipline. The residual formulation made very deep optimization tractable, and the bottleneck kept the arithmetic budget within reach. Without both pieces, the 50-, 101-, and 152-layer designs would have been much less practical.

The sequence of models also mattered. The paper did not leap from a failing thirty-four-layer plain network straight to the 152-layer winner as if depth alone had been vindicated. It first used comparable eighteen- and thirty-four-layer plain and residual networks to expose the degradation problem and then to show its reversal. Only after that did the design move into the bottleneck family for 50, 101, and 152 layers. The argument therefore had a layered structure: diagnose the failure in plain networks, show that residual shortcuts repair it at comparable depth, then use the repaired block and bottleneck economy to scale.

The resulting 152-layer ResNet was an engineering triumph. Despite its massive depth, it required 11.3 billion FLOPs, a total complexity that was still lower than both VGG-16, which required 15.3 billion FLOPs, and VGG-19, which required 19.6 billion. The fifty-, one-hundred-and-one-, and one-hundred-and-fifty-two-layer residual networks trained smoothly, showing no signs of the degradation problem that had plagued the plain networks. The architecture proved that with the right structural path and careful management of computational complexity, depth could be escalated far beyond previous limits.
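The FLOP figures quoted in this section can be set side by side with a few divisions. The sketch below uses only the numbers given in the text; the dictionary keys and the helper `cost_ratio` are hypothetical names chosen for the example.

```python
# Billions of FLOPs per forward pass, as quoted in the text above.
BILLION_FLOPS = {
    "plain-34": 3.6,
    "ResNet-152": 11.3,
    "VGG-16": 15.3,
    "VGG-19": 19.6,
}


def cost_ratio(model: str, reference: str) -> float:
    """Arithmetic cost of `model` as a fraction of the cost of `reference`."""
    return BILLION_FLOPS[model] / BILLION_FLOPS[reference]


comparisons = [
    ("plain-34", "VGG-19"),
    ("ResNet-152", "VGG-16"),
    ("ResNet-152", "VGG-19"),
]

for model, reference in comparisons:
    print(f"{model} needs {cost_ratio(model, reference):.0%} "
          f"of the FLOPs of {reference}")
```

The first ratio reproduces the roughly 18 percent figure for the thirty-four-layer baseline, and the other two confirm that the 152-layer model stays below both VGG variants despite having several times as many layers.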

The comparison with VGG should not be flattened into a simple morality tale about old and new models. VGG had shown how far a disciplined plain stack could go, and its cost was part of the evidence that depth was becoming a systems problem. ResNet answered a different question. It showed that a much deeper network could be made both optimizable and arithmetically competitive. The crucial move was not merely adding 133 more layers beyond VGG-19. It was changing the local task of those layers so that a hundred-layer-scale model did not have to behave like a fragile tower of transformations.

The engineering culminated in a decisive victory that redefined the state of the art in computer vision. The 152-layer ResNet, evaluated as a single model, achieved a top-5 validation error of just 4.49 percent. When combined into an ensemble of six models of varying depths, the system reached a 3.57 percent top-5 error on the test set. This performance secured first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification task.

The numbers need to be read carefully. The 4.49 percent figure belonged to a single 152-layer model on the validation set. The 3.57 percent figure came from an ensemble on the test set. That distinction matters because ImageNet competition systems often combined multiple models and evaluation choices to squeeze out additional accuracy. Even with that caveat, the result marked a clear transition. The winning classification entry was no longer a sixteen- or twenty-two-layer system. It was built around residual networks whose depths had recently looked implausible for plain convolutional training.

The success of the residual representations extended well beyond pure classification. The official ILSVRC 2015 results page listed the MSRA team at the top of the object detection standings by mean average precision, as well as at the top of the classification and localization task with a classification error of 0.03567. The ResNet paper reported sweeping first-place finishes across ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The representations learned by the extremely deep networks proved highly transferable; replacing the VGG-16 backbone with a ResNet-101 architecture in the Faster R-CNN framework improved the standard COCO detection metric by 6.0 points, representing a 28 percent relative improvement.

That transfer result is historically important because it shows what the paper meant by representations. A classifier trained on ImageNet was not valuable only because it chose among ImageNet labels. Its convolutional layers could become the visual backbone of a detector, where the task was to locate and classify objects in an image rather than assign a single image-level category. In the Faster R-CNN comparison, the detection framework remained the frame of reference while the backbone changed from VGG-16 to ResNet-101. The gain therefore pointed to the learned visual features themselves. Residual depth produced internal representations useful beyond the classification leaderboard.

However, the authors were careful to identify the limits of their approach, ensuring that depth was not presented as limitless magic. To test the extreme boundaries of residual learning, they trained a 1202-layer ResNet on the CIFAR-10 dataset. The model optimized successfully, achieving a training error below 0.1 percent without encountering the degradation problem that had afflicted plain networks. The optimizer was capable of handling over a thousand layers.

Yet, this optimization success did not translate into a generalization victory. The 1202-layer model produced a test error of 7.93 percent, which was worse than the performance of a shallower 110-layer residual model. The authors straightforwardly attributed this drop in test performance to overfitting, noting that the 1202-layer network was likely unnecessarily large for the CIFAR-10 dataset. This was the clean inverse of the earlier degradation problem. In the plain networks, more layers could not even reduce training error properly. In the 1202-layer residual network, training error nearly vanished, but test performance worsened. Residual learning had solved one failure mode while exposing another.

This boundary defined the true legacy of the paper. Residual connections did not make arbitrary depth automatically useful for generalization, nor did they guarantee that more layers would always improve performance on unseen data. Instead, they separated the problem of optimization from the problem of capacity. Identity shortcuts made very deep networks trainable by changing what the stacked layers had to learn. Once that obstacle moved, the remaining questions became the familiar ones of data, regularization, computational cost, and whether additional capacity actually served the task. ResNet transformed extreme depth from a brittle aspiration into an engineering choice, turning shortcut connections from a local design option into a default grammar for deep learning.