History of AI & Machine Learning

AI/ML Engineering Track | Complexity: [MEDIUM] | Time: 4-5

The Night Everything Changed

It was December 2012, a chilly night at Stanford University at precisely 11:47 PM, when Geoffrey Hinton realized everything he’d fought for over forty years was finally vindicated. The researcher hadn’t slept in three days. His team—two graduate students working from a tiny Toronto apartment—had just crushed the ImageNet competition. Their neural network, AlexNet, didn’t just win. It demolished the competition by 10 percentage points—the largest margin in the competition’s history.

The room was silent as the results came in. Then: pandemonium.

“They called us crazy for years,” Hinton said quietly, his British accent thickened by exhaustion. “Said neural nets were a dead end. Said we were wasting our careers.”

He’d been working on neural networks since 1972—four decades of ridicule, funding denials, and watching colleagues abandon the field. He’d survived two AI Winters, academic exile, and the rise of competing approaches that seemed to make his life’s work obsolete.

Now, in a single night, everything had changed. Within weeks, Google would acquire Hinton’s company for $44 million. Within years, neural networks would be everywhere—recognizing faces, translating languages, driving cars, generating art.

But none of it would have happened without the pioneers who came before: Turing, who asked if machines could think. McCulloch and Pitts, who drew the first artificial neuron. Rosenblatt, who built the first one that learned. And countless others who kept the flame alive through the winters.

This is their story.

“We were in the wilderness for decades. Then, almost overnight, we weren’t.” — Geoffrey Hinton, interview, 2013

What You’ll Be Able to Do

By the end of this module, you will:

Understand the complete evolution of AI from 1943 to present
Know the key personalities who shaped the field
Learn the lessons from AI Winters and why they matter today
Understand the Bitter Lesson and its implications
Appreciate why deep learning succeeded where other approaches failed

Introduction: Why History Matters

Before you build the future, you must understand the past.

The history of AI is not a linear march of progress. It’s a story of bold visions, crushing disappointments, unexpected breakthroughs, and hard-won lessons. Understanding this history helps you:

Avoid repeating mistakes - The same overpromising that caused AI Winters echoes in today’s hype cycles
Appreciate why things work - Modern techniques aren’t arbitrary; they emerged from decades of trial and error
Make better predictions - Patterns from the past illuminate likely futures
Stand on the shoulders of giants - Every tool you use today was built by brilliant researchers who deserve recognition

As George Santayana wrote: “Those who cannot remember the past are condemned to repeat it.”

The journey you’re about to take spans more than eight decades—from handwritten calculations in the 1940s to chatbots that can write code and generate art. Along the way, you’ll meet brilliant minds who dared to dream of thinking machines, stubborn visionaries who kept working when everyone said neural networks were dead, and entrepreneurs who turned academic research into products used by billions. This isn’t just history—it’s the foundation for understanding where AI is going next.

Part 1: The Pre-Dawn (1943-1955)

The First Artificial Neuron (1943)

The story begins not with computers, but with a paper and pencil model of how neurons might compute.

In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts published “A Logical Calculus of the Ideas Immanent in Nervous Activity.” They proposed a simple model:

McCulloch-Pitts Neuron (1943)
=============================

Inputs: x₁, x₂, x₃ (binary: 0 or 1)
Weights: w₁, w₂, w₃ (can be +1 or -1)
Threshold: θ

Output = 1 if (w₁x₁ + w₂x₂ + w₃x₃) ≥ θ
         0 otherwise

This is essentially: weighted sum → threshold → binary output

This looks trivially simple today, but it was revolutionary. McCulloch and Pitts showed that networks of these simple units could compute any logical function. The brain, they suggested, might be a biological computer.

Imagine the McCulloch-Pitts neuron like a single voter in an election—it receives inputs (arguments for and against), weighs them, and then casts a binary vote (yes or no). Just as a single voter can’t make complex decisions alone, but millions of voters together can elect governments, individual artificial neurons seem limited but networks of them can solve surprisingly complex problems.

Did You Know? Walter Pitts was entirely self-taught. He ran away from home at 15, lived homeless in Chicago, and taught himself logic by reading Bertrand Russell’s Principia Mathematica in the library. When he found an error in Russell’s work, he wrote to Russell, who was so impressed he invited Pitts to study at Cambridge. Pitts declined—he was only 15.

Alan Turing’s Foundation (1950)

In 1950, Alan Turing published “Computing Machinery and Intelligence,” one of the most influential papers in the history of AI. In it, he proposed what we now call the Turing Test:

“I propose to consider the question, ‘Can machines think?’”

Rather than defining “thinking,” Turing proposed a practical test: If a machine can fool a human judge into believing it’s human through conversation, it should be considered intelligent.

The Turing Test (1950)
======================

    Human Judge
         |
    [Conversation]
       /    \
   Human    Machine

If the judge can't reliably distinguish
the machine from the human, the machine
"passes" the test.

Turing also predicted that by the year 2000, machines would be able to fool 30% of human judges in 5-minute conversations. (He was roughly correct—early chatbots occasionally achieved this, though they exploited loopholes rather than demonstrating true intelligence.)

Did You Know? The same paper where Turing proposed his famous test also addressed objections to machine intelligence, including the “Lady Lovelace Objection” (machines can only do what they’re programmed to do) and the “Argument from Consciousness.” Turing’s responses remain relevant to AI debates today.

The Shannon-Turing Chess Connection (1950s)

Both Claude Shannon (father of information theory) and Alan Turing independently worked on chess-playing algorithms in the late 1940s and early 1950s.

Shannon’s 1950 paper “Programming a Computer for Playing Chess” laid out the minimax algorithm that would dominate game AI for decades:

Minimax Algorithm
=================

        MAX (your move)
       /    |    \
     MIN   MIN   MIN (opponent's moves)
    / \   / \   / \
  MAX MAX MAX MAX   (your responses)

Assume opponent plays optimally.
Choose the move that maximizes your
minimum guaranteed outcome.

Turing went further and actually wrote a chess program by hand—before computers existed that could run it. He called it “Turochamp” and executed it by hand, taking about 30 minutes per move. The program played weak but legal chess.

Did You Know? Turing’s hand-executed chess program played a game against one of his colleagues in 1952. The program lost, but it’s the first recorded game ever played by a chess “program.”

Part 2: The Birth of AI (1956)

The Dartmouth Conference

In the summer of 1956, a small group of researchers gathered at Dartmouth College for a two-month workshop. The proposal, written by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon, coined the term “Artificial Intelligence”:

“We propose that a 2-month, 10-man study of artificial intelligence be carried out… The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

The attendees included:

John McCarthy (coined “AI,” invented Lisp)
Marvin Minsky (neural nets, frames, co-founded MIT AI Lab)
Claude Shannon (information theory)
Herbert Simon (bounded rationality, Nobel laureate)
Allen Newell (cognitive architecture, GPS)
Arthur Samuel (machine learning pioneer, checkers)

Dartmouth Conference (1956)
===========================

Location: Dartmouth College, Hanover, NH
Duration: Summer 1956 (planned 2 months)
Attendees: ~10 researchers
Funding: $13,500 from Rockefeller Foundation

Output: The field of "Artificial Intelligence" was born

Did You Know? The Dartmouth proposal optimistically stated: “We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.” They believed human-level AI might be achievable within a generation. Nearly 70 years later, we’re still working on it.

The Name “Artificial Intelligence”

John McCarthy chose the name “Artificial Intelligence” deliberately, though he later had second thoughts. Alternative names considered included:

Automata Studies (too narrow)
Complex Information Processing (too vague)
Machine Intelligence (used in Britain)
Computational Rationality (McCarthy’s later preference)

McCarthy chose “Artificial Intelligence” partly to distinguish the field from cybernetics, the existing field studying feedback and control systems led by Norbert Wiener. McCarthy wanted a fresh start.

Did You Know? McCarthy later regretted the name. He said “Artificial Intelligence” sounded too much like “artificial diamonds”—a mere imitation of the real thing. He came to prefer “computational rationality,” which emphasized that machines should make good decisions rather than imitate human thought.

Part 3: The Golden Age (1956-1969)

The Perceptron (1957)

In 1957, psychologist Frank Rosenblatt at Cornell invented the Perceptron, the first machine that could learn from data:

Rosenblatt's Perceptron (1957)
==============================

Inputs: x₁, x₂, ..., xₙ (real numbers)
Weights: w₁, w₂, ..., wₙ (learned!)
Bias: b

Output = 1 if (w₁x₁ + w₂x₂ + ... + wₙxₙ + b) > 0
         0 otherwise

LEARNING RULE:
If output is wrong:
  - Should be 1 but was 0: add inputs to weights
  - Should be 0 but was 1: subtract inputs from weights

The key innovation was that weights were learned from examples, not hand-coded. Picture the perceptron’s learning process like a musician tuning an instrument—each time the output is wrong, you adjust the weights slightly, just as a guitarist turns the tuning pegs until the note sounds right. Over many examples, the weights converge to values that produce correct outputs.

Rosenblatt built the Mark I Perceptron, a hardware implementation using 400 photocells and potentiometers:

Mark I Perceptron (1958)
========================

400 photocells (20x20 image input)
     ↓
512 "association units" (random connections)
     ↓
8 output units (learned weights)
     ↓
Classification decision

Size: Filled a room
Power: Considerable
Speed: Slow by modern standards
But: It LEARNED from data!

The New York Times headline read: “NEW NAVY DEVICE LEARNS BY DOING” and predicted the machine would eventually “walk, talk, see, write, reproduce itself and be conscious of its existence.”

Did You Know? Rosenblatt made bold predictions: “The embryo perceptron… is the first machine which is capable of having an original idea.” Such overenthusiastic claims would come back to haunt the field.

Symbolic AI: Logic and Search

While Rosenblatt pursued neural approaches, most AI researchers focused on symbolic AI—using logic, rules, and search algorithms.

Logic Theorist (1956): Created by Newell and Simon, this program proved 38 of the 52 theorems in Chapter 2 of Principia Mathematica. For one theorem, it found a proof more elegant than Russell’s original.

General Problem Solver (GPS, 1959): Newell and Simon’s attempt to create a general-purpose reasoning program using means-ends analysis:

General Problem Solver (1959)
=============================

Goal: Transform current state into goal state

Method (Means-Ends Analysis):
1. Find difference between current and goal
2. Find operator that reduces this difference
3. If operator's preconditions aren't met,
   recursively solve subproblem
4. Apply operator
5. Repeat until goal reached

Example: Missionaries and Cannibals puzzle
- Current: 3M, 3C on left bank
- Goal: All on right bank
- Constraint: Cannibals can't outnumber missionaries

ELIZA (1966): Joseph Weizenbaum created ELIZA, a program that simulated a Rogerian psychotherapist. It worked by pattern matching and simple transformations:

ELIZA Pattern Matching
======================

User: "I am sad"
Pattern: "I am <X>"
Response: "How long have you been <X>?"
Output: "How long have you been sad?"

User: "My mother hates me"
Pattern: "My <relation> <X>"
Response: "Tell me more about your <relation>"
Output: "Tell me more about your mother"

Weizenbaum was horrified when people, including his secretary, formed emotional attachments to ELIZA and wanted to confide in it privately. He spent the rest of his career warning about the dangers of computers in human decision-making.

Did You Know? ELIZA spawned the first “chatbot” competitions. The annual Loebner Prize (1991-2019) awarded prizes to programs that best fooled judges in Turing Test conversations. Winners typically used ELIZA-style tricks rather than true understanding.

The Optimism of the Era

The 1960s were a time of boundless optimism in AI. Researchers made predictions that seem laughable today:

Researcher	Prediction	Year
Herbert Simon	”Within 20 years machines will be capable of doing any work a man can do”	1965
Marvin Minsky	”Within a generation the problem of creating ‘artificial intelligence’ will be substantially solved”	1967
Marvin Minsky	”In from three to eight years we will have a machine with the general intelligence of an average human being”	1970

Why were they so wrong? They underestimated:

The complexity of common sense - “Easy” things like vision and language were actually hardest
The limits of symbolic AI - Logic couldn’t capture fuzzy, uncertain real-world knowledge
The computational requirements - Moore’s Law would need decades more progress

Part 4: The First AI Winter (1969-1980)

Think of AI Winters like ice ages for technology—long periods where progress slows to a crawl, funding evaporates, and researchers either abandon the field or rebrand their work to survive. Just as ice ages were caused by specific triggers (orbital changes, volcanic eruptions), AI Winters had specific causes: overpromising, underfunding, and crushing critiques that made the entire field seem hopeless. Understanding these winters is crucial because the conditions that caused them—hype cycles followed by disappointment—can happen again.

The Perceptrons Bombshell (1969)

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a mathematical analysis of what Rosenblatt’s perceptrons could and couldn’t do.

The book proved that single-layer perceptrons cannot learn XOR:

The XOR Problem
===============

Input A  Input B  Output
   0        0        0
   0        1        1
   1        0        1
   1        1        0

No single line can separate
the 1s from the 0s:

  B
  1 |  1    0
    |
  0 |  0    1
    +--------
       0    1   A

The 1s (at [0,1] and [1,0]) aren't
linearly separable from the 0s.

This seems trivial, but XOR represents any problem where the output depends on a complex combination of inputs rather than simple thresholds. Real-world patterns often have this structure.

Minsky and Papert suggested (though didn’t prove) that multi-layer perceptrons would be “sterile”—no one knew how to train them. This perception—partly from the book itself, partly from how it was interpreted—devastated neural network research funding.

Did You Know? Minsky and Papert were aware that multi-layer networks could solve XOR. The book’s epilogue mentioned this possibility. But the damage was done—funding dried up, and neural net research entered a decade-long “winter” in the US.

The Lighthill Report (1973)

In 1973, the British government commissioned mathematician James Lighthill to assess AI research. His report was devastating:

“In no part of the field have discoveries made so far produced the major impact that was then promised.”

Lighthill criticized AI’s failure to handle the combinatorial explosion—as problems got bigger, the time to solve them grew exponentially. He recommended cutting AI funding drastically.

The Combinatorial Explosion
===========================

Chess positions:     ~10^120 possible games
Go positions:        ~10^170 possible games
English sentences:   Essentially infinite

Brute-force search fails!
1960s AI had no good answer.

The Lighthill Report led to the collapse of AI research in Britain for nearly a decade. Combined with the Perceptrons backlash in the US, the first AI Winter had arrived.

What Survived

Not all AI research stopped. Expert systems began emerging—programs that captured human expertise in narrow domains using if-then rules:

DENDRAL (1965-1983)
===================

Domain: Chemical structure identification
Input: Mass spectrometry data
Output: Possible molecular structures

Method: Encoded rules from expert chemists
"If peak at mass X with intensity Y,
 then likely contains functional group Z"

Success: Performed as well as human experts
         in its narrow domain

MYCIN (1972) diagnosed bacterial infections and recommended antibiotics. It outperformed many human doctors on its specific task—but couldn’t do anything outside that narrow domain.

Part 5: The Expert Systems Boom (1980-1987)

Expert Systems Go Commercial

In the early 1980s, AI came roaring back—but in a different form. Expert systems promised to capture human expertise and package it in software:

Expert System Architecture
==========================

KNOWLEDGE BASE          INFERENCE ENGINE
(domain facts &    +    (applies rules
 if-then rules)         to reach conclusions)
      ↓                        ↓
              USER INTERFACE
              (explains reasoning)

Example Rule (MYCIN):
IF: The infection is primary-bacteremia
AND: The site of culture is sterile
AND: The suspected portal is GI tract
THEN: There is evidence (0.7) that
      the organism is Bacteroides

Companies like Symbolics, Lisp Machines Inc., and Teknowledge sold specialized AI hardware and software. Japan’s Fifth Generation Computer Project (1982) promised to build “intelligent computers” in 10 years, sparking fear in the US and Europe of being left behind.

R1/XCON at Digital Equipment Corporation showed the commercial potential:

R1/XCON (1980-1989)
===================

Task: Configure VAX computer systems
Rules: Started with 750, grew to 10,000+
Savings: $40 million/year for DEC
Impact: Validated expert systems commercially

Investment in AI exploded. By 1985, the AI industry was worth over $1 billion annually.

Did You Know? The Fifth Generation project ultimately failed to meet its goals. Japan invested roughly $400 million over 10 years but couldn’t achieve the promised breakthroughs in logic programming and parallel processing. The project is now considered a cautionary tale about government-directed AI research.

The Lisp Machine Era

For a brief period, specialized Lisp machines were the hardware of choice for AI:

Symbolics 3600 (1983)
=====================

Purpose: Run Lisp (the AI language) fast
CPU: Custom tagged architecture
Memory: 256KB - 8MB
Display: High-resolution graphics
Price: $100,000+
Buyers: AI labs, Wall Street, defense

Peak: ~7,000 Lisp machines sold

These machines had specialized hardware for running Lisp efficiently, with features like:

Tagged memory (every word knew its type)
Hardware garbage collection (automatic memory management)
Integrated development environments (Symbolics Genera was legendary)

Did You Know? The first internet domain ever registered was symbolics.com on March 15, 1985. Symbolics eventually went bankrupt, and the domain passed through several owners. It’s now a historical curiosity.

Part 6: The Second AI Winter (1987-1993)

The Collapse

The expert systems boom collapsed even faster than it had grown:

Timeline of Collapse
====================

1987: Lisp machine market crashes
      - General-purpose workstations got fast enough
      - Apple, Sun, HP offered better price/performance

1988: Expert system limitations become clear
      - Couldn't learn from data
      - Knowledge acquisition "bottleneck"
      - Brittle: failed on edge cases

1989: Fifth Generation Project scaling back
      - Goals unmet
      - Parallel logic programming proved impractical

1990-1993: "AI Winter" in full effect
      - Funding cuts across government and industry
      - Researchers avoid using "AI" in proposals

The fundamental problems:

Knowledge Acquisition Bottleneck: Extracting expert knowledge was slow and expensive
Brittleness: Systems failed ungracefully on inputs outside their training
No Learning: Expert systems couldn’t improve from experience
Maintenance Nightmare: Thousands of rules became impossible to maintain

Did You Know? During the AI Winter, many researchers rebranded their work. “Machine learning” sounded less grandiose than “AI.” “Intelligent systems” replaced “expert systems.” The substance was often similar, but the labels changed to avoid the AI stigma.

What Kept Going

Despite the winter, important research continued:

Backpropagation Revival (1986): Rumelhart, Hinton, and Williams published “Learning representations by back-propagating errors,” showing how to train multi-layer neural networks:

Backpropagation (1986)
======================

Forward Pass:
Input → Hidden Layer → Output → Loss

Backward Pass:
Loss → ∂L/∂output → ∂L/∂hidden → ∂L/∂input

Key Insight: Chain rule lets us compute
gradients for hidden layers!

Update: weights -= learning_rate × gradient

This solved the problem Minsky and Papert had identified—but it would take decades and much more compute before neural networks dominated.

Statistical Methods Rise: While symbolic AI struggled, statistical approaches gained ground:

Hidden Markov Models (HMMs) for speech recognition
Support Vector Machines (SVMs) for classification
Probabilistic graphical models for uncertainty

Did You Know? Geoffrey Hinton kept working on neural networks throughout the AI Winter, even when it was unfashionable. He called this period “the wilderness years.” His persistence would pay off spectacularly in 2012.

Part 7: The Machine Learning Renaissance (1990s-2000s)

The Quiet Revolution

While “AI” remained a dirty word, machine learning researchers made steady progress:

1997: Deep Blue Beats Kasparov

IBM’s Deep Blue defeated world chess champion Garry Kasparov in a six-game match:

Deep Blue (1997)
================

Hardware: 30 IBM RS/6000 processors
          480 custom chess chips
Speed: 200 million positions/second
Method: Alpha-beta search + evaluation function
Depth: 6-12 moves (up to 40 in some lines)

Game 6: Kasparov resigned after 19 moves
(He missed a drawing line in a complex position)

This was a symbolic AI triumph—Deep Blue used brute-force search, hand-crafted evaluation functions, and specialized hardware. It couldn’t learn or adapt; it was the pinnacle of the “Good Old-Fashioned AI” approach.

Did You Know? Kasparov accused IBM of cheating, claiming human intervention in Game 2. IBM denied this and refused a rematch, then dismantled Deep Blue. The machine’s logs were never fully released, fueling conspiracy theories for years.

1997: LSTM Invented

While Deep Blue made headlines, Sepp Hochreiter and Jürgen Schmidhuber published “Long Short-Term Memory,” solving the vanishing gradient problem for recurrent neural networks:

LSTM Cell (1997)
================

The Problem:
Standard RNNs can't learn long-range dependencies
Gradients vanish or explode over many timesteps

The Solution: Gated memory cells

Cell State: ──────────────────────────→
                ↑          ↑
            Forget Gate   Input Gate
            (what to      (what to
             forget)       remember)

Output: controlled by Output Gate

Key: Gradients can flow unchanged through cell state!

LSTMs would become crucial for speech recognition, machine translation, and text generation—but that was 15+ years away.

The Data Revolution

The late 1990s and 2000s brought something AI had always lacked: massive amounts of data.

The Data Explosion
==================

Year   Data Available
1990   ~3 exabytes total worldwide
2000   ~12 exabytes
2010   ~2 zettabytes (2000 exabytes!)
2020   ~64 zettabytes

Key Enablers:
- Internet growth
- Digital cameras
- Social media
- Smartphones
- Web crawling

ImageNet (2009): Fei-Fei Li’s team at Stanford created ImageNet, a database of 14 million labeled images across 21,841 categories:

ImageNet
========

Images: 14+ million
Categories: 21,841 (WordNet synsets)
Labeled By: Amazon Mechanical Turk
Cost: Years of effort
Purpose: Enable visual recognition research

ImageNet Challenge (ILSVRC):
- Started 2010
- 1000 categories
- 1.2 million training images
- Became THE benchmark for computer vision

Did You Know? Fei-Fei Li had to fight to get ImageNet taken seriously. Other researchers thought the dataset was too big and the labeling too noisy. She proved them wrong—ImageNet became the catalyst for the deep learning revolution.

Support Vector Machines Dominate

Before deep learning’s resurgence, SVMs were the go-to algorithm for classification:

SVM Key Idea
============

Find the hyperplane that maximizes
the margin between classes:

Class A:  o  o     |     x  x  :Class B
          o  o     |     x  x
               o   |   x
          ←margin→ |

The "support vectors" are the points
closest to the decision boundary.

Kernel Trick: Map data to higher dimensions
where it becomes linearly separable!

SVMs had solid theoretical foundations (from Vapnik’s statistical learning theory) and worked well on small-to-medium datasets. They dominated competitions in the 2000s.

Part 8: The Deep Learning Revolution (2006-2012)

Think of the deep learning revolution like the Wright Brothers’ first flight—a moment when decades of failed attempts suddenly gave way to success, and everything that seemed impossible became merely difficult. The neural networks that had been written off as “dead ends” in the 1990s turned out to be just waiting for enough data and compute to reach their potential. Once those conditions were met, progress accelerated at a pace that shocked even the true believers.

Hinton’s Breakthrough (2006)

In 2006, Geoffrey Hinton and his collaborators published “A Fast Learning Algorithm for Deep Belief Nets.” The key insight: pre-train layers one at a time, then fine-tune the whole network:

Deep Belief Net Training (2006)
===============================

Step 1: Train first layer unsupervised
Input → Layer 1 (RBM)

Step 2: Train second layer on Layer 1's output
Layer 1 output → Layer 2 (RBM)

Step 3: Continue stacking...

Step 4: Fine-tune entire network with backprop

This solved the vanishing gradient problem by giving each layer a good starting point. Deep networks could finally be trained!

Did You Know? Hinton had been working on neural networks since the 1970s. He co-invented backpropagation, Boltzmann machines, and dozens of other techniques. For decades, most of AI ignored his work. In 2006, he was 59 years old—and his biggest contributions were still ahead.

The AlexNet Earthquake (2012)

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet competition with a deep convolutional neural network:

AlexNet Results (2012)
======================

ILSVRC 2012 Top-5 Error Rate:

AlexNet:        15.3%  ← WINNER
Second place:   26.2%
                ↑
          11 percentage points!

This wasn't a small improvement.
This was a paradigm shift.

AlexNet Architecture:
- 8 layers (5 conv, 3 fully-connected)
- 60 million parameters
- ReLU activations (not sigmoid!)
- Dropout regularization
- Trained on 2 GPUs for 6 days

The AI community was stunned. Within two years, nearly everyone in computer vision had switched to deep learning.

Why Did It Work Now?

Three factors converged:

Data: ImageNet provided millions of labeled images
Compute: GPUs (designed for gaming) happened to be perfect for neural networks
Algorithms: ReLU, dropout, and better initialization helped training

The Deep Learning Trinity
=========================

        DATA
       /    \
      /      \
COMPUTE ─── ALGORITHMS

All three had to reach critical mass together.
2012 was the year they did.

Did You Know? NVIDIA’s GPUs weren’t designed for AI—they were designed for video games. But the matrix operations for rendering graphics turned out to be exactly what neural networks needed. NVIDIA’s Jensen Huang recognized this early and pivoted the company toward AI, making NVIDIA the most valuable chipmaker in the world.

Part 9: The Modern Era (2012-Present)

The Deep Learning Tsunami (2012-2017)

After AlexNet, progress accelerated exponentially:

ImageNet Progress
=================

Year    Network      Top-5 Error    Layers
2012    AlexNet      15.3%          8
2014    VGGNet       7.3%           19
2014    GoogLeNet    6.7%           22
2015    ResNet       3.6%           152
2017    SENet        2.3%           Lots

Human Performance: ~5.1%

By 2015, machines surpassed humans
on ImageNet classification!

Key Innovations:

VGGNet (2014): Showed that depth matters—just stack 3×3 convolutions
GoogLeNet (2014): Inception modules—parallel paths of different sizes
ResNet (2015): Skip connections—train networks with 1000+ layers
Batch Normalization (2015): Normalize activations, train faster

Did You Know? ResNet’s skip connections were inspired by a simple insight: at worst, a deep network should be able to learn the identity function (just pass inputs through unchanged). Skip connections made this easy, allowing gradients to flow through very deep networks.

The Transformer Revolution (2017)

In June 2017, a team at Google published “Attention Is All You Need,” introducing the Transformer architecture:

Transformer Key Ideas
=====================

1. SELF-ATTENTION
   Every token attends to every other token
   Query, Key, Value projections
   Attention(Q,K,V) = softmax(QK^T/√d)V

2. NO RECURRENCE
   Process entire sequence in parallel
   Position encodings tell model about order

3. ENCODER-DECODER (for translation)
   Encoder: process source language
   Decoder: generate target language
   Cross-attention connects them

Result:
- Much faster training (parallelizable)
- Better long-range dependencies
- New SOTA on machine translation

The Transformer would prove to be one of the most important architectural innovations in AI history.

Did You Know? The paper’s title “Attention Is All You Need” was a bold claim—and it was right. Transformers replaced recurrence, convolution, and most other architectural elements with pure attention. The simplicity was part of the power.

BERT and GPT: Language Transformers (2018)

BERT (October 2018): Google’s “Bidirectional Encoder Representations from Transformers” showed that pre-training on massive text corpora created powerful language representations:

BERT Training
=============

Pre-training Tasks:
1. Masked Language Model (MLM)
   "The [MASK] sat on the mat" → "cat"

2. Next Sentence Prediction
   Do these sentences follow each other?

Pre-training Data: Wikipedia + Books
Parameters: 110M (base), 340M (large)

Fine-tuning: Add task-specific head
- Classification: [CLS] token → label
- QA: Predict start/end span
- NER: Token → entity type

GPT (June 2018): OpenAI’s “Generative Pre-trained Transformer” took a different approach—autoregressive language modeling. Compare it to how a novelist writes: rather than understanding a whole sentence bidirectionally like BERT, GPT predicts one word at a time, building the story word by word:

GPT Training
============

Task: Predict next token
"The cat sat on the" → "mat"

Just predict next word, over and over.
Train on internet text (WebText).

Generation: Sample from predictions
Input: "The cat"
P(next) = {sat: 0.3, ran: 0.2, meowed: 0.15, ...}
Sample → "sat"
Continue...

The GPT approach would prove to be the path to powerful language AI.

The Scaling Era (2019-2022)

GPT-2 (2019): OpenAI showed that simply scaling up GPT produced remarkable results:

GPT Scaling
===========

Model      Parameters    Training Data
GPT-1      117M         ~5GB (BooksCorpus)
GPT-2      1.5B         40GB (WebText)
GPT-3      175B         570GB (Common Crawl+)

GPT-2 could:
- Write coherent paragraphs
- Answer questions (sometimes)
- Do simple math
- Generate code

OpenAI initially didn't release it,
citing concerns about misuse.

GPT-3 (2020): At 175 billion parameters, GPT-3 demonstrated emergent abilities—capabilities that seemed to appear suddenly with scale:

GPT-3 Emergent Abilities
========================

Few-Shot Learning:
Give model a few examples, it generalizes!

Input:
"Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese => "

Output: "fromage"

No fine-tuning needed!
Just demonstrate the task.

Did You Know? The term “emergent abilities” sparked debate. Are these abilities really emergent (appearing suddenly at scale), or do they appear gradually but we only notice them past a threshold? The answer affects how we predict future AI capabilities.

ChatGPT: AI Goes Mainstream (November 2022)

On November 30, 2022, OpenAI released ChatGPT, a chatbot based on GPT-3.5 with RLHF (Reinforcement Learning from Human Feedback):

ChatGPT's Secret Sauce
======================

1. Large Language Model (GPT-3.5)
   Pre-trained on internet text

2. Instruction Fine-Tuning
   Train to follow instructions
   "Summarize this" → summary

3. RLHF (Reinforcement Learning from Human Feedback)
   Humans rank responses
   Train reward model on rankings
   Fine-tune LLM to maximize reward

Result: Helpful, harmless, honest
        (mostly)

ChatGPT reached 100 million users in 2 months—the fastest-growing consumer application in history at that time.

Did You Know? ChatGPT was originally intended as a “research preview.” OpenAI was surprised by its viral growth. The success accelerated the entire industry—Google, Anthropic, Meta, and others rushed to release their own chatbots.

The Current Moment (2023-Present)

We are now in an era of rapid capability growth:

Current State of AI (2024)
==========================

Frontier Models:
- gpt-5 (OpenAI) - Multimodal, 1T+ params (rumored)
- Claude 3 (Anthropic) - Strong reasoning
- Gemini (Google) - Multimodal native
- Llama 4 (Meta) - Open weights

Capabilities:
- Pass bar exam (90th percentile)
- Write working code
- Analyze images and video
- Multi-step reasoning
- Tool use and agents

Limitations:
- Hallucinations (confident errors)
- Limited context/memory
- No true world model
- Alignment challenges

Part 10: The Key Personalities

The Founding Fathers

Name	Contribution	Era
Alan Turing	Turing Test, computation theory	1936-1954
John McCarthy	Coined “AI,” invented Lisp, Stanford AI Lab	1956-2011
Marvin Minsky	Neural nets, frames, MIT AI Lab	1956-2016
Claude Shannon	Information theory, communication	1940s-2001
Herbert Simon	Bounded rationality, GPS, Nobel laureate	1956-2001
Allen Newell	Cognitive architecture, GPS, Soar	1956-1992

The Deep Learning Pioneers

Name	Contribution	Era
Geoffrey Hinton	Backprop, Boltzmann machines, deep learning	1970s-present
Yann LeCun	CNNs, LeNet, self-supervised learning	1980s-present
Yoshua Bengio	RNNs, attention, deep learning theory	1990s-present
Jürgen Schmidhuber	LSTM, meta-learning, curiosity	1990s-present
Sepp Hochreiter	LSTM, vanishing gradients	1990s-present

Hinton, LeCun, and Bengio won the 2018 Turing Award (the “Nobel Prize of Computing”) for their work on deep learning.

The Modern Era Leaders

Name	Contribution	Era
Fei-Fei Li	ImageNet, AI4ALL, human-centered AI	2000s-present
Andrew Ng	Coursera, Google Brain, Landing AI	2000s-present
Demis Hassabis	DeepMind, AlphaGo, AlphaFold	2010s-present
Sam Altman	OpenAI CEO, ChatGPT	2010s-present
Dario Amodei	Anthropic CEO, AI safety	2010s-present
Ilya Sutskever	AlexNet, GPT, OpenAI Chief Scientist	2010s-present
Andrej Karpathy	Neural net education, Tesla AI	2010s-present

Did You Know? Many of the key figures know each other well. Ilya Sutskever was Hinton’s PhD student. Dario Amodei worked at OpenAI before founding Anthropic. The deep learning community is surprisingly small and interconnected.

Part 11: The AI Winters—Lessons Learned

What Caused the Winters?

AI Winter Causes
================

Winter 1 (1969-1980):
- Perceptrons book highlighted limitations
- Overpromising ("intelligent in a generation")
- Funding bodies lost patience
- Lighthill Report (UK, 1973)

Winter 2 (1987-1993):
- Expert systems didn't scale
- Knowledge acquisition bottleneck
- Lisp machine market collapse
- Fifth Generation Project failure
- Overpromising (again)

Common Pattern:
HYPE → OVERPROMISING → UNDERDELIVERING → BACKLASH → WINTER

Lessons for Today

Capabilities ≠ General Intelligence
- Deep Blue beat Kasparov but couldn’t play checkers
- gpt-5 passes the bar exam but can’t reliably count words
- Don’t confuse impressive demos with AGI
- It’s like watching a savant pianist who can play Chopin but can’t tie their shoes—impressive in one domain doesn’t mean general competence
Extrapolation Is Dangerous
- “In 5 years we’ll have X” is almost always wrong
- Progress is uneven—breakthroughs in one area don’t guarantee others
The Pendulum Swings Both Ways
- Hype leads to winter; winter leads to hype
- Current AI enthusiasm may face correction
- Picture the history of AI like a stock market chart—periods of irrational exuberance followed by crashes, then slow recovery and eventual new highs
Honest Assessment Prevents Backlash
- Admitting limitations builds trust
- Overpromising destroys credibility

Did You Know? We may be in an “AI Spring” that could turn to winter. Some researchers worry that current LLM limitations (hallucinations, lack of reasoning, data requirements) will lead to disappointment. Others believe we’re on the cusp of AGI. History suggests caution.

Part 12: The Bitter Lesson

Richard Sutton’s Insight (2019)

In March 2019, reinforcement learning pioneer Richard Sutton published a short essay titled “The Bitter Lesson.” His core argument:

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”

The Bitter Lesson
=================

What Researchers Wanted:
- Encode human knowledge
- Build expert systems
- Hand-craft features
- Design clever algorithms

What Actually Worked:
- General-purpose learning
- Massive scale
- Compute, compute, compute
- Let the model figure it out

Examples:
- Chess: Search + hardware beat knowledge
- Go: Learning beat hand-crafted evaluation
- Speech: Statistical models beat phonetics
- Vision: Neural nets beat feature engineering
- Language: Scaling beat linguistics

The "bitter" part: Human knowledge
is less valuable than compute.

Evidence for the Bitter Lesson

Chess: Deep Blue (1997) used brute-force search, not chess knowledge. AlphaZero (2017) learned from scratch, beating Stockfish with no human chess knowledge at all.

Go: AlphaGo (2016) used deep learning + Monte Carlo tree search. AlphaGo Zero (2017) learned from self-play alone, with no human game data, and became even stronger.

Computer Vision: Hand-crafted features (SIFT, HOG) dominated for years. Deep learning replaced them entirely.

NLP: Years of linguistic rules and hand-crafted features were swept away by transformers trained on raw text.

Did You Know? The Bitter Lesson was controversial. Critics argued that human knowledge still guides architecture design, training procedures, and data selection. Others noted that compute-heavy approaches have environmental and accessibility costs. The debate continues.

Part 13: Timeline Summary

Complete AI Timeline
====================

1943: McCulloch-Pitts neuron
1950: Turing Test proposed
1956: Dartmouth Conference - "AI" coined
1957: Perceptron invented (Rosenblatt)
1958: Mark I Perceptron hardware
1965: DENDRAL expert system
1966: ELIZA chatbot
1969: "Perceptrons" book → First AI Winter begins
1973: Lighthill Report (UK cuts funding)

1980: Expert systems boom begins
1982: Japan Fifth Generation Project announced
1986: Backpropagation paper (Rumelhart, Hinton, Williams)
1987: Lisp machine market crashes → Second AI Winter begins
1989: LeCun's CNNs applied to zip codes
1990: "AI Winter" in full effect

1997: Deep Blue beats Kasparov
1997: LSTM invented (Hochreiter & Schmidhuber)
1998: LeNet-5 for digit recognition
2006: Deep Belief Nets (Hinton)
2009: ImageNet created (Fei-Fei Li)
2011: Watson wins Jeopardy!
2012: AlexNet wins ImageNet → Deep Learning revolution

2014: GANs invented (Goodfellow)
2015: ResNet (152 layers!)
2016: AlphaGo beats Lee Sedol
2017: "Attention Is All You Need" - Transformer
2018: BERT (Google), GPT (OpenAI)
2019: GPT-2 (1.5B parameters)
2020: GPT-3 (175B parameters)
2021: DALL-E, Copilot
2022: ChatGPT (Nov 30) → AI goes mainstream
2023: gpt-5, Claude 2, Llama 4
2024: Claude 3, Gemini, Llama 4, open-source explosion

Hands-On Exercises

Exercise 1: Build a Timeline Visualization

Create a visual timeline of AI milestones using matplotlib or a tool like TimelineJS:

"""
AI History Timeline Visualization

Create an interactive timeline showing key AI milestones,
with annotations for AI Winters and breakthrough periods.
"""
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from datetime import datetime

# Key milestones
milestones = [
    (1943, "McCulloch-Pitts Neuron", "theory"),
    (1950, "Turing Test Proposed", "theory"),
    (1956, "Dartmouth Conference", "breakthrough"),
    (1957, "Perceptron Invented", "breakthrough"),
    (1969, "Perceptrons Book", "setback"),
    (1986, "Backpropagation Paper", "breakthrough"),
    (1997, "Deep Blue beats Kasparov", "milestone"),
    (2012, "AlexNet wins ImageNet", "breakthrough"),
    (2017, "Transformer Architecture", "breakthrough"),
    (2022, "ChatGPT Released", "breakthrough"),
]

# AI Winters
winters = [
    (1969, 1980, "First AI Winter"),
    (1987, 1993, "Second AI Winter"),
]

fig, ax = plt.subplots(figsize=(15, 8))

# Plot winters as shaded regions
for start, end, label in winters:
    ax.axvspan(start, end, alpha=0.3, color='blue', label=label)

# Plot milestones
colors = {'theory': 'purple', 'breakthrough': 'green',
          'setback': 'red', 'milestone': 'orange'}

for year, event, category in milestones:
    ax.scatter(year, 0.5, c=colors[category], s=100, zorder=5)
    ax.annotate(event, (year, 0.5), xytext=(0, 10),
                textcoords='offset points', ha='center',
                fontsize=8, rotation=45)

ax.set_xlim(1940, 2025)
ax.set_xlabel('Year')
ax.set_title('History of AI: Milestones and Winters')
plt.tight_layout()
plt.savefig('ai_timeline.png', dpi=150)
print("Timeline saved to ai_timeline.png")

Challenge: Extend this to include 50+ events, add tooltips with detailed descriptions, and deploy as an interactive web page.

Exercise 2: Implement a Historical Model

Recreate one of the early AI systems to understand how they worked:

"""
Perceptron Implementation (1957 Algorithm)

Build Rosenblatt's perceptron from scratch and train it
on simple binary classification problems.
"""
import numpy as np

class Perceptron:
    """
    The Perceptron as Rosenblatt described it.

    This is historically accurate to the 1957 algorithm:
    - Binary threshold activation
    - Simple additive weight update rule
    - Learning rate (eta)
    """

    def __init__(self, n_inputs: int, learning_rate: float = 0.1):
        # Initialize weights to small random values
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.lr = learning_rate

    def predict(self, x: np.ndarray) -> int:
        """Binary threshold activation - just like 1957"""
        activation = np.dot(x, self.weights) + self.bias
        return 1 if activation > 0 else 0

    def train(self, X: np.ndarray, y: np.ndarray, epochs: int = 100):
        """
        The perceptron learning rule:
        - If correct: do nothing
        - If should be 1 but was 0: add input to weights
        - If should be 0 but was 1: subtract input from weights
        """
        history = []
        for epoch in range(epochs):
            errors = 0
            for xi, target in zip(X, y):
                prediction = self.predict(xi)
                error = target - prediction

                if error != 0:
                    # Rosenblatt's update rule
                    self.weights += self.lr * error * xi
                    self.bias += self.lr * error
                    errors += 1

            history.append(errors)
            if errors == 0:
                print(f"Converged at epoch {epoch + 1}")
                break

        return history

# Test on AND gate (learnable by perceptron)
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(2)
history = perceptron.train(X_and, y_and)
print(f"AND gate predictions: {[perceptron.predict(x) for x in X_and]}")

# Test on XOR (NOT learnable by single perceptron!)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(2)
history_xor = perceptron_xor.train(X_xor, y_xor, epochs=1000)
print(f"XOR gate predictions: {[perceptron_xor.predict(x) for x in X_xor]}")
print("XOR fails! This is exactly what Minsky & Papert proved in 1969.")

Challenge: Implement a multi-layer perceptron to solve XOR and demonstrate why the 1986 backpropagation paper was so important.

Exercise 3: Analyze the Bitter Lesson

Read Sutton’s “The Bitter Lesson” essay and analyze whether current AI development follows its predictions:

"""
Bitter Lesson Analysis Framework

Evaluate whether recent AI developments follow or contradict
the Bitter Lesson's predictions about compute vs. knowledge.
"""

bitter_lesson_claims = {
    "claim_1": "General methods that leverage computation are most effective",
    "claim_2": "Human knowledge is less valuable than scale",
    "claim_3": "Search and learning beat domain knowledge",
    "claim_4": "Short-term benefits of human knowledge are outweighed by long-term compute gains",
}

# Your analysis framework
def analyze_development(development_name: str, description: str,
                        supports_bitter_lesson: bool, reasoning: str):
    """Document whether a development supports or contradicts the Bitter Lesson"""
    return {
        "development": development_name,
        "description": description,
        "supports_bitter_lesson": supports_bitter_lesson,
        "reasoning": reasoning
    }

# Example analyses
analyses = [
    analyze_development(
        "gpt-5",
        "Massive transformer trained on internet text",
        True,
        "Pure scale and compute, minimal hand-crafted linguistic knowledge"
    ),
    analyze_development(
        "AlphaFold 2",
        "Protein structure prediction using deep learning",
        True,
        "Replaced decades of physics-based approaches with learned patterns"
    ),
    analyze_development(
        "RLHF",
        "Reinforcement Learning from Human Feedback",
        False,  # Arguably contradicts
        "Human preferences guide model behavior - injecting human knowledge"
    ),
    # Add your own analyses...
]

# Tally results
support_count = sum(1 for a in analyses if a["supports_bitter_lesson"])
print(f"\nBitter Lesson Analysis Summary:")
print(f"Developments supporting: {support_count}/{len(analyses)}")
print(f"Developments contradicting: {len(analyses) - support_count}/{len(analyses)}")

# Your conclusion
print("\nYour Analysis:")
print("Does modern AI development follow the Bitter Lesson?")
print("[Write your 2-3 paragraph analysis here]")

Challenge: Write a 500-word essay arguing for or against the Bitter Lesson based on developments since 2019.

Exercise 4: Pioneer Research Deep Dive

Select one AI pioneer and trace their intellectual lineage:

"""
AI Pioneer Research Project

Pick a pioneer, read their seminal papers, and trace
how their ideas evolved and influenced others.
"""

pioneer_template = {
    "name": "Geoffrey Hinton",
    "birth_year": 1947,
    "key_contributions": [
        "Backpropagation (1986)",
        "Boltzmann Machines (1985)",
        "Deep Belief Networks (2006)",
        "Dropout (2012)",
        "AlexNet co-author (2012)",
    ],
    "seminal_papers": [
        "Learning representations by back-propagating errors (1986)",
        "ImageNet Classification with Deep CNNs (2012)",
        "Dropout: A Simple Way to Prevent Overfitting (2014)",
    ],
    "advisors_mentors": ["Christopher Longuet-Higgins"],
    "notable_students": [
        "Yann LeCun", "Ilya Sutskever", "Alex Krizhevsky"
    ],
    "awards": ["Turing Award 2018", "Nobel Prize Physics 2024"],
    "intellectual_lineage": """
    Hinton's work connects:
    - McCulloch-Pitts (1943) -> early neural models
    - Rosenblatt (1957) -> perceptrons
    - Rumelhart (1986) -> backpropagation collaboration
    - Modern deep learning -> through students like Sutskever
    """,
}

# Your assignment: Create profiles for 3 pioneers of your choice
# Suggested: Turing, Minsky, LeCun, Bengio, Schmidhuber, Fei-Fei Li

# Research questions to answer:
questions = [
    "What problem were they trying to solve?",
    "What was the prevailing wisdom they challenged?",
    "How did their background influence their approach?",
    "Who did they collaborate with or learn from?",
    "What is their lasting impact on the field?",
]

print("AI Pioneer Research Project")
print("=" * 40)
for q in questions:
    print(f"• {q}")

Challenge: Create a network graph showing how major AI pioneers are connected through mentorship, collaboration, and intellectual influence.

Reflection Questions

Why did neural networks take so long to succeed? What combination of factors (data, compute, algorithms) was missing earlier?
Could there be another AI Winter? What would cause it? What would need to happen for current progress to stall?
The Bitter Lesson suggests compute beats knowledge. Is this always true? When might human knowledge and inductive biases still matter?
What patterns from AI history might predict the next decade? What lessons should guide our expectations?
Who are the “Turings” and “Hintons” of the future? What problems might the next generation of AI pioneers solve?

Module Summary

You’ve now traced the complete arc of AI history—from McCulloch and Pitts’ paper neurons in 1943 to ChatGPT reaching 100 million users in 2023.

Key Takeaways:

Progress is non-linear: Two AI Winters separated periods of euphoria
The Bitter Lesson: General methods + compute tend to win
Data + Compute + Algorithms: All three must align for breakthroughs
Standing on shoulders: Today’s AI builds on 80 years of work
History suggests humility: Confident predictions often fail

As you continue your AI journey, remember: you’re not just learning techniques—you’re joining a conversation that started 80 years ago and will continue for generations.

Every line of code you write, every model you train, every system you deploy is part of this ongoing story. The pioneers who came before—Turing with his theoretical foundations, Rosenblatt with his learning machines, Hinton with his stubborn faith in neural networks—they paved the way for what you’re doing today. And the decisions you make, the problems you solve, the ethics you uphold will shape what AI becomes tomorrow.

The field has weathered two winters and emerged stronger each time. It has surprised the world with capabilities that seemed impossible just years before. And it faces challenges—ethical, technical, and societal—that will require the best minds of the next generation to solve. Perhaps yours will be among them.

MODULE 55 COMPLETE

NEURAL DOJO CURRICULUM COMPLETE

Congratulations! You’ve completed all 55 modules of the Neural Dojo curriculum. You are now an AI Guru.

Last updated: 2025-11-29 Module 55: The Complete History of AI & Machine Learning